SciTools / iris

A powerful, format-agnostic, and community-driven Python package for analysing and visualising Earth science data
https://scitools-iris.readthedocs.io/en/stable/
BSD 3-Clause "New" or "Revised" License
629 stars 283 forks source link

NetCDF global attributes vs data variable local attributes #3325

Closed bjlittle closed 10 months ago

bjlittle commented 5 years ago

Update Mon 3rd July: this effort is now handled in it's own project please see there for existing task breakdown + progress


At the moment iris takes a rational but naive approach to dealing with the local attributes of a NetCDF variable (a variable that becomes a cube) and the global attributes of the NetCDF file that the said variable comes from.

That is, the resultant cube.attributes will be a combination of both the local and global attributes, where the local attributes will take precedence, and overwrite, common global attributes.

From the inception of iris, and in the light of no use cases, this seemed like a reasonable thing to do. However, such an approach prevents preservation of the local and global attributes metadata. This is a major issue for many users, who require to preserve all attribute metadata.

We require to resolve this issue now in iris once and for all :smile:

Note that, if a solution to this issue was implemented, then it would most likely be a breaking change - caution is needed here.

This is somewhat tangential related to https://github.com/SciTools/iris/issues/2352

bjlittle commented 5 years ago

Ping @zklaus :+1:

zklaus commented 5 years ago

A non-breaking way to implement that may be to introduce two new class members global_attributes and local_attributes (or var_attributes) responsible for the respective attributes with a property replacing the current attributes that can mimic the current behaviour for reading, overwrite existing entries in the respective dictionary, and otherwise defaulting to the local one.

bjlittle commented 5 years ago

@zklaus I was thinking along the same lines. In summary:

So for the case where there is a common attribute shared between cube.global_attributes and cube.local_attributes, the local value is shown in the cube.attributes. When saving such a cube to NetCDF, then the common local and global attribute is preserved (hoo-rah) i.e. the local on the NetCDF data variable of the cube, and the global in the global scope of the NetCDF file.

To ensure that there is a non-breaking behaviour here, then I think that I'm right in saying that if a user writes to the cube.attributes then this state is captured in the cube.global_attributes. Note that, cube.attributes is stateless, in the sense that it is simply derived on the fly at run-time from both cube.local_attributes and cube.global_attributes, with local having priority over global. This means that if a user wants to associate an attribute to the NetCDF data variable of the cube, then they must explicitly add the attribute to the cube.local_attributes, and not the cube.attributes.

Hmmm.... this make sense to me. Thoughts?

zklaus commented 5 years ago

Exactly what I was thinking!

jonseddon commented 5 years ago

We bumped into this ticket while looking at a related project. Can I check what would happen when you save a cubelist rather than a cube? If cubes in the cubelist had different values of the same global attribute, how would this be saved? If this saved netCDF file was loaded back into Iris would we get the same cubelist and if we didn't, would this matter?

bjlittle commented 5 years ago

@jonseddon Good question...

Clearly, if there is a global attribute conflict across Cubes in the CubeList, then it wouldn't be possible to save any such conflicted attribute to the netCDF file.

However, it begs the question whether a CubeList should also have state for overriding attributes on save. Again, careful consideration is required here to understand the overall behaviour and whether that's appropriate. In particular, consideration is required also for loading from multiple different netCDF files...

To be honest, I'd opt to separate concerns here. I'd see the debate about CubeList having attributes state as an extension to this proposal - but there should certainly be clarity for the behaviour when there is a conflict for global attributes of Cubes in a CubeList as it stands here.

ehogan commented 5 years ago

I had a question about this :) Does it make sense to add global attributes and variable attributes, which I would argue are netCDF-specific concepts, to the cube, which is meant to be format-agnostic?

zklaus commented 5 years ago

Re cubelists: The question is certainly a good one. Note that right now there is no guarantee that loading a single cube from a netcdf file and saving it again will give you the same file. For example,

Seeing as it seems to me that a lot (most?) data has one variable per file (notably of course all cmip and cordex data) I am not sure I would be worried about consistency in cubelist storing so much, at least until we have better consistency in cube storing. Though it is certainly a good idea to keep this in mind so as not to make unnecessary outright contradictory decisions.

Re format agnosticism: That is certainly a nice goal, but maybe it needs better definition? It seems that attributes in and of themselves are unsupportable in, eg grib and derived formats. Surely we don't want to abandon them completely. So, is there a format that has attribute support, is supported by iris, and could not be made to work with this model?

bjlittle commented 5 years ago

@ehogan From a purely idealistic perspective, I'd agree with you. A Cube should be format agnostic. However, in reality, that's not really the case.

For me it's an intention at best, rather than a hard and fast rule. Consider the special way that we handle PP STASH and NetCDF var_name. These fileformat specifics have crept into the way that we deal with cubes and coordinates, along with other CF-isms that may only make sense for NetCDF. The reason this has happened is that there is tangible utility or benefit behind it - so it's a common sense compromise in my opinion. However, we do try hard not to dilute our attempts to be as agnostic as possible.

I don't know if this helps answer your question...

pp-mo commented 5 years ago

Consider the special way that we handle PP STASH and NetCDF var_name

Actually we should have something in GRIB space too, but we don't. Just plugged a suggestion here : https://github.com/SciTools/iris-grib/issues/153

larsbarring commented 3 years ago

This issue will be relevant to #3388 if a data producer adds the same non-standard attribute (such a comment) to multiple coordinate system variables.

wjbenfold commented 2 years ago

In terms of making this happen, Bill's summary seems a good and straightforward way to go.

Outstanding questions of mine:

Edge case

(in user's views, what would be the best / least surprising behaviour? @ehogan @zklaus @jonseddon @larsbarring)

What happens if we save a cubelist that contains cubes with conflicting global attributes? We could demote them to local, but then what do we do when they conflict with local ones?

I'd favour a warning, combined with demoting them if that wouldn't cause an overwrite of a local attribute and dropping them if it would.

Other questions

zklaus commented 2 years ago

Thanks for picking this up again, @wjbenfold!

As I stated earlier, for me the top priority is to get the one cube case working as described in @bjlittle's excellent summary.

As for your questions on the cubelist, I think it does make sense to demote the attributes to local on conflict where possible. But perhaps it's best to make the user do it? I.e. provide a function like consolidate_attributes that finds the conflicting global attributes, tries to demote them to local, and requires manual intervention where that is not possible. I am a bit on the fence about whether that should also be the default behaviour on save. On the one hand, some users may perceive it as convenient to forego the extra step, but on the other hand, this is bound to lead to surprises where global attributes end up unintentionally as local, which might be worse because there would be nothing in the storage process that indicates a problem and this kind of thing may be discovered only much later.

This brings to my mind also the question of how CubeList.merge and CubeList.concatenate should deal with this kind of conflict.

Perhaps it's best to focus purely on the single cube level for now and to have a deeper discussion about cubelist semantics separately.

pp-mo commented 2 years ago

Just musing ...

We could demote them to local, but then what do we do when they conflict with local ones?

I don't favour this. It would make more sense to raise an error, and forcer the use to resolve such a conflict, which I expect would probably be an unexpected occurrence.

I'm also not sure whether we should allow the same attribute to exist in both local + global. But, it does make some logical sense as a possiblitiy.

I think, logically, a local attribute "overrides" a global one, but also a global one is in some sense "inherited" by everything in the file. In netcdf-4 groups (which however CF still rejects), that is exactly how variables reference dimensions within groups (i.e. it always refers to the nearest enclosing group where a dimension of that name is defined)

Just searching for a concrete example... Of the CF controlled attributes in their table (appendix A), there are a few (but only few) which are valid for both global and data-variable (i.e. "local") usage. An example is "institution". So, if you load data from two files with different 'institution', and save both to a single file, you will get a conflict. This can logically be resolved by demoting one of them.
However it also points up a more subtle problem : If one dataset contained a global 'institution' attribute, and the other did not, then the other data is now (incorrectly) tagged as belonging to that institution. Also, the global definition will again, rather theoretically, "apply" to all auxiliary-data, coordinates and aux-coords, even those which actually came from the 'other' file.

I'm not sure that is the limit of 'logical problems' here, but it certainly illustrates one possible. I conclude that we definitely need the user to be responsible for resolving some of these resulting problems.


And maybe, to avoid 'making up' metadata that applies to specific data, we will have to restrict what we allow? Ideally, we want to enable the addition of a global attribute by just adding it to one of the cubes.
But the strictest interpretation of the above seems to indicate that to be "safe", we can only save a global attribute if it is the same for all saved cubes ???

pp-mo commented 2 years ago

This brings to my mind also the question of how CubeList.merge and CubeList.concatenate should deal with this kind of conflict.

The lenient interpretation is that a combined result loses any attributes that differ, and retains those which appear in only one of the sources. https://scitools-iris.readthedocs.io/en/latest/further_topics/lenient_metadata.html#lenient-behaviour

I think that is a reasonably useful mechanism to apply, but it could lead to attributes incorrectly applied to the whole of the output data, which are only really true for part of it -- as I just pointed out.

Perhaps we can implement a strict/lenient strategy here + let the user choose ?? If strict is default, then it is "safe", unless the user says otherwise - i.e. a context manager enables "lenient save"

pp-mo commented 2 years ago

@zklaus As I stated earlier, for me the top priority is to get the one cube case working ... Perhaps it's best to focus purely on the single cube level for now and to have a deeper discussion about cubelist semantics separately.

Unfortunately, I don't see that we have that choice. We have to be able to save multiple cubes, and we have to avoid "inventing" potentially-invalid metadata at all costs. So we at least need a "safe" solution. But we could anyway make the new strategy user-selectable (and warn of possible overinterpretation), which for backwards compatibility we will probably need anyway -- either maybe a Future flag, if the new behaviour is to become the only way, or some other context control if it needs to remain optional.

zklaus commented 2 years ago

The most important case where this actually plays a role that I know of is CMIP. CMIP6 refers to this CMIP5 document (CMIP6 Global Attributes, DRS, Filenames, Directory Structure, and CV’s, p 9, note 6). The CMIP5 document particularly speaks about history and comment:

From page 8:

▪ Optional global attributes → comment = A character string containing additional information about the data or methods used to produce it. The user might, for example, want to provide a description of how the initial conditions for a simulation were specified and how the model was spun-up (including the length of the spin- up period). → history = A character string containing an audit trail for modifications to the original data. Each modification is typically preceded by a "timestamp". The "history" attribute provided here will be a global one that should not depend on which variable is contained in the file. A variable-specific "history" can also be included as an attribute attached to the output variable.

From page 15:

▪ Optional attributes for variables: → comment = a character string providing further information concerning the variable (e.g., if the variable is mrso (soil_moisture_content), the comment might read "includes subsurface water, both frozen and liquid, but not surface water, snow or ice"). → history = a character string containing an audit trail for modifications to the original data (e.g., indicate what calculations were performed in post- processing, such as interpolating to standard pressure levels or changing the units). Each modification is typically preceded by a "timestamp". Note that this history attribute is variable-specific, whereas the global history attribute defined above (see "Optional attributes" under "Requirements for global attributes") provides information concerning the model simulation itself or refers to processing procedures common to all variables.

It is indeed relatively common for modelers to include both and sometimes it is important information.

pp-mo commented 2 years ago

If it's at all common to have data which specifies a significant attribute both globally and locally, then it isn't truly possible to save both of those to the same file. Unless you use netcdf groups, which CF refuses to countenance >sigh< :cry:

zklaus commented 2 years ago

Actually, you may be interested in cf-convention/cf-conventions#144, cf-convention/cf-conventions#145 ;)

wjbenfold commented 2 years ago

Given that

I think I like the suggestion of allowing a variable to be set (either within a context manager or at the top of the file, as suits the user at the time) that requires global attributes to match and causes these operations to fail otherwise. There could also be a keyword argument to save/merge/concatenate that enforced it for a single operation, though giving that many ways to do the same thing is potentially confusing and kind of non-pythonic.

Without that variable set, pre-existing behaviour could be maintained

wjbenfold commented 2 years ago

cube.attributes is a (stateless) combination of the cube.global_attributes and cube.local_attributes, akin to now, where the cube.local_attributes have priority over the cube.global_attributes

To ensure that there is a non-breaking behaviour here, then I think that I'm right in saying that if a user writes to the cube.attributes then this state is captured in the cube.global_attributes. Note that, cube.attributes is stateless, in the sense that it is simply derived on the fly at run-time from both cube.local_attributes and cube.global_attributes, with local having priority over global. This means that if a user wants to associate an attribute to the NetCDF data variable of the cube, then they must explicitly add the attribute to the cube.local_attributes, and not the cube.attributes.

I think that these two statements are contradictory and that the latter should say that adding to cube.attributes should write to cube.local_attributes (otherwise you can write to cube.attributes then read from it and the attribute you just wrote (that was actually written to global) can be hidden by a local one such that you don't see your change happen when you do the read).

larsbarring commented 2 years ago

Sorry for crashing into this conversation without having digested the technical discussion in any detail. But here is my two pennies worth of user perspective: 1p: The most important thing is that both the local and global attributes are preserved and available after reading the data. If there are clashes when merging or concatenating individual file cubes, that can be handed over to the user to decide. I.e. pretty much as now is done if there are metadata problems -- but with good and enough detailed information regarding what is causing the problem so that the user can find and handle the issue. 1p: In CF Conventions 1.9 section 2.6.2. Description of file contents the following sentence appears

When an attribute appears both globally and as a variable attribute, the variable’s version has precedence.

which, if literally interpreted, of course means what is says. But in my experience it is more relevant to think about this in terms of the actual content of the attributes. Consider these two mock-up examples:

         my_var: comment = "this data is bad"
// global attributes:
                 :comment = "this data is good"

and

         my_var: comment = "CMOR changed units from X to Y"
// global attributes:
                 :comment = "CMOR made the coordinate system information conformant with ZMIP requirements"

If the same attribute appears both in the variable and as global it typically is along the lines of the latter example.

wjbenfold commented 2 years ago

It turns out it's more complicated than expected...

I'm pretty sure that implementing an attributes property that combines local_attributes and global_attributes at runtime will fail to behave as expected when users assign into attributes. I've tried thinking about a couple of ways around this with very little progress.

In terms of other approaches, I think we can

I intend to implement the bolded option. Does anyone think this is either broken or can be improved upon?

pp-mo commented 2 years ago

I think we need to think about how to preserve existing behaviour within the "new" structure, so we should first make a clear analysis of what that actually is.

A couple of sticky points seem clear to me (even in advance of a full analysis) : (1) at the very least in "legacy mode", which we want to preserve (at least for interim backwards compatibility), we can't decide absolutely on load whether a given attribute is to be written as local or global, as this can also depend on the content of other cubes (which we don't know about at load-time).

(2) in future, we definitely do want cubes to possess distinct 'local' and 'global' attributes. And the default naturally needs to be to save these as recorded, so we can definitely reproduce input from a single file. But even in an "ideal" scheme, several cubes being saved together may create conflicts. IMHO it seems better to resolve at least some of these possibilities automatically (as per "legacy" approach) than to raise errors (though warnings would be appropriate).

TODO: my attempt at describing what it currently does, maybe coming shortly ? ....

pp-mo commented 2 years ago

my attempt at describing what it currently does

Oh gosh, not simple. I may be some time ...

pp-mo commented 2 years ago

my attempt at describing what it currently does

Sorry for delay, the code is more baroque than I knew ! Here's my best account ( but may still be wrong in places : maybe worth testing some of these ideas, rather than trying to deduce truth from code )

The code divides possible attributes into various different classes of interest

  1. any "Conventions" attribute
  2. those "specially handled", listed in '_CF_ATTRS'
  3. those "variable-only", listed in '_CF_DATA_ATTRS' and '_UKMO_DATA_ATTRS'
  4. those "global-only", listed in '_CF_GLOBAL_ATTRS'
  5. those which are the same in all cubes saved at one time
  6. everything else (attributes not on all cubes, or not the same on all cubes)

(1) Class 1 includes either "conventions" or "Conventions" . These are always ignored; we never write a "local" conventions attribute; we always write a global Conventions = "CF-1.7".

(2) Class 2 have interpreted CF/netcdf meaning handled specially by the loader, e.g. "_FillValue", "valid_min", "standard_name". All of these can never appear as attributes of loaded cubes.
We also prohibit setting them via a Cube.attributes being a LimitedAttributeDict. Unfortunately the 'forbidden' list here is not exactly the same, which in some cases clearly makes sense but I fear this scheme could have some holes in it. So, these will get written as data variable attributes as required by save rules, but not from being in an attributes dict (e.g. "standard_name", "_FillValue")

(3) Class 3 is those which only make sense as local attributes applying to a particular data-variable, e.g. "flag_meanings" It does not include things like "valid_min", "_FillValue" : those are in class 2 (broadly?) So, these are always local

(4) class 4 are handled like any other non-local-only (non-class 3) attribute, except that if they get written as local attributes we also raise a warning about it. So, these are expected to always be global, but will be created as local attributes if required.

(5) attributes which are not in class 2/3, and appear on all saved cubes. --> Saved as file global attributes.

(6) all others --> Saved as local attributes, i.e. as attributes of the (cube) data variable, instead of file global attributes

NOTE if there is only 1 cube, class (6) does not really happen, since (5) dominates, except for the specific cases in classes 1/2/3 : This is the same as all cubes having the same attribute. So the general principle is that everything that can be global will be.

pp-mo commented 2 years ago

IMHO, what comes out of this ... My main take-homes are :

  1. loading will naturally assign attributes into the new global/local categories
  2. when reproducing the existing save behaviour, it certainly isn't possible to follow those initial assignments exactly
  3. .. but ... even in the new world "non-legacy mode", possible collisions between attributes recorded as "global" can still reasonably occur
    • either by combining results loaded from different sources, or from other actions like creating cubes from scratch
    • In such cases, we should revert to storing them as "local", with an appropriate warning
  4. We probably also need to retain the existing notions that some specific attributes are expected to be global-only, and some are local-only.
    • probably still worth warning about unexpected usages
    • but also, there may be scope for some changes here to be treated as "bug fixes"
    • In this context, we should definitely also consider @larsbarring comment above, that having the same attribute recorded as both local and global, can be useful, and is seen in practice -- despite possibly being contra to the intentions of CF. possibly even for recognised attributes like "history". That could also suggest possible changes to the existing behaviours
      1. when assigning any-old generic attributes via the legacy interface (e.g. cube.attributes['odd'] = 1), I guess that 'local' should be the default, that seems sensible. However, this will be at odds with existing behaviour, at least for a single cube. I guess in the new world, we would expect global attributes to be the exception -- you have to ask for that explicitly. Which at least gives less opportunity for conflicts.
larsbarring commented 2 years ago

Following up on what @pp-mo just wrote

... having the same attribute recorded as both local and global, can be useful, and is seen in practice -- despite possibly being contra to the intentions of CF.

Some time ago I had a conversation with some CF persons regarding what was/is actually meant with the sentence I was citing

When an attribute appears both globally and as a variable attribute, the variable’s version has precedence.

Whether it is the actual existence of same attribute [name] at both places, or the content of the of the two attributes being contradictory. While we did not came to a definitive conclusion we agreed that it was not the intention to only consider the attribute names and totally disregard their content.

This would not be the only place where CF is not crystal clear regarding the distinction between "name" and "content" or "spelling" and "meaning". I think that it is useful to think back what was the common situation about 20 years ago (or more), when the first version CF was produced. Then the common practice was that analysts manually looked at files and their content (ncdump style) before doing analyses. Then it was just obvious what to do and how to interpret the data, e.g. that standard and gregorian calendars were the same, and that `proleptic_gregorian`` also is the same for recent centuries, and that an attribute existing both as local and global may either contradict or complement each other (or even both) depending on the content, and how to make use of that information. I believe there are many other places where the same kind of ambiguity may exist in the CF conventions document.

wjbenfold commented 2 years ago

Question to the users in the room: what behaviour do you want from merge / concatenate / other operations that make a new cube at the end? When we would make the choice as to whether (for example) a mismatch of these attributes should block merge, we won't know if they're intended to be used (and if they're not then users shouldn't be blocked on that merge).

We could

There's also the option (which is simpler, and therefore more quickly implemented, but correspondingly less useful) of setting up a way that the user can access the full local/global attribute information at load time, hold it themselves and then specify its application to the resultant cubes at save time. Would that appeal?

larsbarring commented 2 years ago

As one "user in the room" I venture some some thoughts:

  1. It seems useful to keep a "legacy mode" as an alternative to a "new mode" (or maybe even several alternative new modes, like simple/advanced...)
  2. Roundtrip consistency: loading a simple cube and then saving it should give the same file and then the same cube when again loading from the new file (cf. @zklaus's comment). The same for cube lists based on one file.
  3. If I try to come up with a general (and ugly) use case. Assume that we have three files each containing the same two variables, that eventually would be merged into two cubes, or a cubelist of two cubes:

File 1 (covering time period 1):

        my_var1 // has no comment attribute
        my_var1: history = "CMOR changed units from X to Y"
        my_var2: comment = "this data is very good"
        my_var2: history = "CMOR changed units from Z to W"
 // global attributes:
                 :comment = "generally all this data is good"
                 :history = "CMOR made the coordinate system information conformant with ZMIP requirements"

File 2 (covering time period 2):

        my_var1: comment = "this data is good"
        my_var1: history = "CMOR changed units from X to Y"
        my_var2 // has no comment attribute
        my_var2: history = "CMOR changed units from Z to W"
 // global attributes:
                 :comment =  "generally data for this period is not so good"
                 :history = "CMOR made the coordinate system information conformant with ZMIP requirements"

File 3 (covering time period 3):

        my_var1: comment = "this data is good"
        my_var1: history = "CMOR changed units from X to Y"
        my_var2: comment = "this data is good"
        my_var2: history = "CMOR changed units from Z to W"
 // global attributes:
                 :comment = "generally all this data is good"
                 :history = "CMOR made the coordinate system information conformant with ZMIP requirements"

That is, someone first made an overall assessment of data quality and found that "generally all this data is good". Later someone looked in more detail and found that that some periods were less reliable and furthermore that this varied between variables.

While data quality preferably should be recorded as ancillary quality flags, this example illustrates that it is possible (and perhaps reasonable) to have some kind of complementary (or perhaps hierarchical) global and local attributes. Furthermore it illustrates that an [advanced?] user might want to know the (possibly multidimensional) order in which files were concatenated/merged. I touched on this in a previous comment on another (now closed) issue.

Anyway, these are some thoughts that I realise are maybe not so easy to implement .

pp-mo commented 2 years ago

Update : my view, as-at 2022-09-07

I feel we already reached a point here were we mostly agree on principles, but then dropped the ball in that we (Iris devs) should have produced a proposal.
I'm now aiming to do that soon ...
I'll either get stuck on something requiring further discussion, or I'll produce an actual PR (which however could well prove contentious!)

I've been prompted too by discussions elsewhere, on iris/xarray inter-conversion, which again turned up some of the known roundtrip problems that Iris has.

trexfeathers commented 1 year ago

The level of outstanding work means this isn't going to make it into Iris v3.4. We will however keep up the momentum over the next few weeks to see this gets finished without going stale.