NCAS-CMS / cfa-conventions

NetCDF Climate and Forecast Aggregation (CFA) Conventions
https://github.com/NCAS-CMS/cfa-conventions/blob/main/source/cfa.md
1 stars 1 forks source link

Recast as a stand alone document #20

Closed davidhassell closed 3 years ago

davidhassell commented 3 years ago

Previously the document silently assumed that it was part of the CF conventions, which is not to be the case (at this time, at least). This PR aims to make the document independent.

sadielbartholomew commented 3 years ago

@davidhassell I see you are adding new commits, please let us know when this is stable for a review. Thanks.

davidhassell commented 3 years ago

Sorry! All done, now.

davidhassell commented 3 years ago

Hi Sadie,

Perhaps a simple high-level diagram showing the relationships and/or roles of the main concepts defined in the 'Aggregation variables' section (the variables themselves, aggregated data, aggregated dimensions, fragments and their dimensions, etc.)? And/or something that emphasises that CFA sits alongside CF and the relation to netCDF data and metadata in each case?

Yes - I think that would be very useful. It may be that a modified combination of @bnlawrence's UML diagram and @nmassey001's implementation UML could work. Without reference to either (simply because I couldn't find them!) maybe something like, which is probably a bit wrong, being done on the back of an envelope in 2 minutes:

20210624_093518

sadielbartholomew commented 3 years ago

@davidhassell fantastic (and I love an old-school prototype), that general form of UML diagram looks ideal to me (assuming you check and tweak it is 100% right for the final version of course, as you would do I am sure) since it shows the main concepts discussed in the document and how they relate to each other without having too much detail as to potentially overwhelm.

Though as part of their review others might have different thoughts on a diagram, so I am equally happy for you to follow their suggestions with something a bit (or a lot) different, and maybe they can share their own UML that you wanted to refer to in case that allows additions.

Thanks, I will finish my review of the text shortly.

davidhassell commented 3 years ago

Take 2 on the UML:

20210625_113913

davidhassell commented 3 years ago

It occurs to me, learned from previous data modelling exercises, that whatever UML we might think is best/correct in the very short term, probably isn't! I suggest moving this discussion to an issue, and when settled a diagram can always be added to make version 0.6.1.

This would hopefully mean that we can finalise the new text and advertise it sooner rather than later.

sadielbartholomew commented 3 years ago

I suggest moving this discussion to an issue

Sounds very wise, please go ahead.

davidhassell commented 3 years ago

Thanks, Sadie. I've fixed those braces and, more weightily, removed a a CF dependency ( https://github.com/NCAS-CMS/cfa-conventions/pull/20/commits/b5d0eb351f31c4fd2fa46f2ff8ffe84b620c5c45).

davidhassell commented 3 years ago

Thanks, @nmassey001.

Good point about versioning. I wonder if we should await the community feedback (that will hopefully come to a head in September after CF workshop) before moving to v1.0. I don't know.

davidhassell commented 3 years ago

Dear @JonathanGregory

Thank you for reviewing this. Here are some responses to your points, which are largely meant to explain why it has been done like it has, not to necessarily imply that we won't change it! We can take from here with this background in mind.

  • Since this is now a separate convention, will you put its attributes in Appendix A? Perhaps that would be a good idea, like for UGRID.

We're currently assuming that the CF conventions do not recognise CFA. If/when this is the case, we'd either move the text into CF, or link to like we do for UGRID. In either case, appendix would indeed get some entries.

  • I am confused about the location variable. The data is contained in an orthogonal array of fragments. I understand that to mean that the decomposition of any given aggregated dimension into fragments is broadcast along all the other fragment dimensions. That means the decomposition needs to be specified only once for each fragment dimension, and I'd therefore expect the location variable to have dimensions (number of fragment dimensions, largest fragment dimension size, 2).

I see. Example 2 gives us:

    // Existing. Location has shape (2, 1, 1, 1, 4, 2)
    // Location has shape (4, 2) = shape of fragment array + (number of fragment dims, 2)
    location = 0, 5,
               0, 0,
               0, 72,
               0, 143,
               6, 11,
               0, 0,
               0, 72,
               0, 143 ;

I think you are suggesting:

    // Jonathan's proposal.
    // Location has shape (4, 2, 2) = (number of fragment dims, largest fragment dimension size, 2)
    location = 0, 5, 6, 11,
               0, 0, _, _, 
               0, 72, _, _,
               0, 143, _ , _;

I might also suggest

    // Another "dask-like" idea: record the sizes, not the ranges.
    // Location has shape (4, 2) = (number of fragment dims, largest fragment dimension size)
    location = 6, 6,
               1, _,
               73, _
               144, _ ;
  • The file variable could be omitted if all fragments are in the same file as the aggregation variable.

  • The format variable could be omitted if every fragment is either in the same file as the aggregation variable or is a netCDF file i.e. assume by default than an external file is netCDF.

This was touched on in https://github.com/NCAS-CMS/cfa-conventions/issues/7. We felt that a key feature of these conventions was the ability to aggregate arbitrary file formats, so making a special case for netCDF fragments doesn't really fit in with that. Admittedly the conventions currently only specify how to address netCDF files, as that is our primary use case, but we have wondered about including UM/PP at this time (possibly even decided to do so - can't remember!).

The case when all fragments are in the parent file is certainly possible, but would be unusual - so creating a special case for address in this circumstance didn't seem worth it.

  • I am confused about the appearance of the heading "Example 1". This doesn't seem to begin an example.

I have assumed CFA does not need to be used with CF, and this example was meant to to show how they can play nicely together.

I now think that perhaps this is a mistake, and CFA must be used with CF, in which case the italicised text of example 1 would become main body text (like it was in previous versions). What do @bnlawrence @nmassey001 @sadielbartholomew think?

  • Future-proofing would replace Gregorian with standard.

Cutting edge stuff! Done.

JonathanGregory commented 3 years ago

Dear David

    // Jonathan's proposal.
    // Location has shape (4, 2, 2) = (number of fragment dims, largest fragment dimension size, 2)
    location = 0, 5, 6, 11,
               0, 0, _, _,
               0, 72, _, _,
               0, 143, _ , _;

Yes, that's right. Isn't there a lot of redundancy in your earlier version? Your dask-like version is even better, since they must be contiguous.

The format variable could be omitted if every fragment is either in the same file as the aggregation variable or is a netCDF file i.e. assume by default than an external file is netCDF.

We felt that a key feature of these conventions was the ability to aggregate arbitrary file formats, so making a special case for netCDF fragments doesn't really fit in with that.

Alternatively, you could allow the format variable to be a scalar if all the entries in it are the same (all nc or pp, for instance), which is the most likely case, I would say - it feels wasteful to have a variable all full of repetition. Even more compactly, you could specify the unique value in the aggregate_data attribute, rather than supplying a separate variable.

Cheers

Jonathan

davidhassell commented 3 years ago

Dear @JonathanGregory,

I like the idea of using a scalar format value when all fragments are either in external files of common format (as will be the common case), or are all in the parent file (in which case the scalar format value would be missing data).

I'm not so keen on encoding a common format in the aggregation_data attribute, since it would be hard to discern if the format name was a format identifier or a variable name.

@nmassey001 - what do you thank about these format suggestions, and the location suggestions. I personally am quite taken by the dask-like version, now ...

bnlawrence commented 3 years ago

(You may not be surprised to learn that my take on the UML is the text is more ambiguous and if the UML is correct, I personally don't care about the text :-) ... however, I'm in a minority of one on this, but you can bet that I care about 0.6.1!)

Before getting into the detail I think we need to address the first paragraph :-).

The existing readme states:

The CFA (Climate and Forecast Aggregation) conventions are designed for storing a netCDF file which does not contain the data of selected variables ("aggregation variables"), rather those variables contain special attributes that provide instructions on how to create their data as an aggregation of data from other sources, which may be self-describing datasets in their own right.

But I don't think that's quite right because they don't start from the problem. I'd say:

The CFA (Climate and Forecast Aggregation) conventions describe how a netCDF file can be used to describe a dataset distributed across multiple other data files. A CFA compliant aggregation can be described in NetCDF in such way that the describing file does not contain the data of selected variables ("aggregation variables"), rather it contains variables with special attributes that provide instructions on how to create the aggregated variable data as an aggregation of data from other sources, each of which may be self-describing datasets in their own right.

In doing so we allow the possibility of a CFA aggregation sitting alongside a normal variable and avoid using the phrase "master" file (which I know wasn't in your version, but I am conscious we need to avoid it.)

I believe I am otherwise content with things now and agree with using a scalar for format!

davidhassell commented 3 years ago

Thanks, @bnlawrence.

I have updated the Introduction as you suggest - much better.

Do you have an opinion on whether CFA must be used with CF? I'm now of the opinion that there is no use-case for creating a CFA dataset that is not otherwise CF-compliant, and so CFA should be cast as as "add-on" to CF, rather than an add-on to vanilla netCDF.

I'll put together a commit for allowing a scalar format variable. (edited)

davidhassell commented 3 years ago

Scalar formatvariable implementation: https://github.com/NCAS-CMS/cfa-conventions/pull/20/commits/1c50acc79489a2c03793712a6506be657b153e95?short_path=ef074b2#diff-ef074b2373736cf397d3e2e3613197bc8190ed058f661b1d62713b6c4cbfd577

bnlawrence commented 3 years ago

I think there is a use case for using CFA conventions for the aggregation of non-CF compliant netcdf files, but they would need to have enough compliance around coordinates for it to work. Specific examples :

  1. Could we use the CFA conventions to persist the view of an aggregation that could be constructed by other tools on the fly (e.g. xarray) ... i.e. how much coordinate information is the minimal needed to do this aggregation?
  2. We already know how to aggregate PP files, which are by definition not CF compliant.
davidhassell commented 3 years ago

Hi Bryan, I think the CF-compliance of the fragments is not the issue, rather the CF-compliance of the of the CFA file. Does that make sense?

nmassey001 commented 3 years ago

@nmassey001 - what do you thank about these format suggestions, and the location suggestions. I personally am quite taken by the dask-like version, now ...

I don't understand how the indexing works here:

 location = 6, 6,
               1, _,
               72, _
               143, _ ;

Should the first 6 in your version of the location array be a 0?

As an aside, I don't think I've picked up on this before, but can we switch to Python / Numpy / C / every sensible programming language style slice definitions? i.e. there are five elements between 0,6 , not six elements.

davidhassell commented 3 years ago

Hi @nmassey001,

The first 6 maps to [0, 5], and the second 6 maps to [6, 11].

The 72 maps to [0, 71]

For example, the current:

        location = 0, 5,
                   0, 0,
                   0, 35,
                   0, 143,
                   0, 5,
                   0, 0,
                   36, 72,
                   0, 143,
                   6, 11,
                   0, 0,
                   0, 36,
                   0, 143,
                   6, 11,
                   0, 0,
                   36, 72,
                   0, 143 ;

would become:

        location = 6, 6
                   1, _,
                   36, 37,
                   143, _;

I'm not sure the Fortran and Julia folks would agree with you, but moving to sizes would make the slice definition go away, anyway!

nmassey001 commented 3 years ago

@davidhassell Aha, they are sizes. Got it! I'm all for minimal and compact representations, as you probably know, so I'd be happy with this change

bnlawrence commented 3 years ago

Hi Bryan, I think the CF-compliance of the fragments is not the issue, rather the CF-compliance of the of the CFA file. Does that make sense?

Yes, that's fine, other software can ignore the CF'ness and just use aggregation goodies. Sorry for the misunderstanding.

davidhassell commented 3 years ago

Thanks @bnlawrence and @nmassey001 - I'll address the CF compliance and location issues in separate commits ...

davidhassell commented 3 years ago

New CF integration: https://github.com/NCAS-CMS/cfa-conventions/pull/20/commits/6b9296cb821f0c6720bbb80eb17e1ccef95b4ea0?short_path=ef074b2#diff-ef074b2373736cf397d3e2e3613197bc8190ed058f661b1d62713b6c4cbfd577

New location variable: https://github.com/NCAS-CMS/cfa-conventions/pull/20/commits/ac815cb56b07fbdd2c66981ecabe99e083c8d038?short_path=ef074b2#diff-ef074b2373736cf397d3e2e3613197bc8190ed058f661b1d62713b6c4cbfd577

davidhassell commented 3 years ago

Hi @bnlawrence,

(You may not be surprised to learn that my take on the UML is the text is more ambiguous and if the UML is correct, I personally don't care about the text :-) ... however, I'm in a minority of one on this, but you can bet that I care about 0.6.1!)

I look forward to developing this over at https://github.com/NCAS-CMS/cfa-conventions/issues/21 ... :)

davidhassell commented 3 years ago

Following an off-line discussion on 2021-07-27, merging this now and tagging master at v0.6