cf-convention / discuss

A forum for any discussion about interpretation, clarification, and proposals for changes or extensions to the CF conventions.

New optional variable attribute: "user_unit" #190

Open larsbarring opened 1 year ago

larsbarring commented 1 year ago

Sometimes the formal requirements associated with the units attribute are not fully aligned with what a data producer/user is used to. Examples of this are ppmv and related "by-volume" units that were recently discussed (see here), and unit 1 vs. days discussed in association with the standard name number_of_days_with_X_above|below_threshold (see here). Also, the unit required by units for salinity vs. what is used in practice has been debated on several occasions over the years.

To alleviate this situation I suggest to add a new optional variable attribute, where the attribute value is not managed by CF.

If we think of the long_name as something like a succinct plot title or table header, this new attribute would provide a [kind-of] associated "unit" users would be familiar with in the case the [usual] units attribute is felt to be too much restricted by formal requirements.

This new attribute could be called "user_unit", "alternative_unit" or something similar, or maybe "long_unit" if we want to more closely link it to the long name. It is expected to be used only when there is a widely recognized difference between the units and what is in common use.

DanHollis commented 1 year ago

I'm aware of one example of this. The controlled vocabularies for UKCP (UK Climate Projections) include a label_units attribute, which I believe is intended for the purpose you describe (e.g. labelling a plot). My hunch is that other projects may have solved the same problem using a different attribute name.

I'm not against CF recommending an attribute name for 'user units', but if there are already a variety of alternatives in use then adoption of a new name may be slow.

larsbarring commented 1 year ago

Thanks @DanHollis, from this I take it that the concept as such is useful.

Finding an attribute name that suits everyone might as you say be more difficult. It would be good to know if there are other more or less well established alternatives already in use.

@cameronsmith1: in another issue you hinted at similar needs for labeling plots. And, @Dave-Allured, in the same issue you were not entirely satisfied (here and here) with formal restrictions to how/when "by-volume" and other recognised units can be used. Would adding such an optional attribute address your concerns?

cameronsmith1 commented 1 year ago

Hi @larsbarring. I am sure many people would find this convenient. However, there is a general aversion within CF to allowing equivalent information to appear more than once. The concern is that file generators may alter one of the metadata items and forget about the other, leaving an inconsistency within the file. This issue has been discussed in other contexts before, and I think this concern has always won the argument.

ngalbraith commented 1 year ago

Using a standard name without units isn't allowed in CF, so if the canonical units specified in the standard name table are not ... useful, we'd need a new standard name. I dislike the idea of a second units attribute, because, as others have said, providing multiple fields with the same information can cause problems - when they don't agree, and/or when code doesn't know to check multiple fields.

DocOtak commented 1 year ago

@ngalbraith If the standard name describes a dimensionless quantity, you may omit the unit attribute:

From section 3.1:

Units are not required for dimensionless quantities. A variable with no units attribute is assumed to be dimensionless. However, a units attribute specifying a dimensionless unit may optionally be included.

Thus far, every time I've personally encountered someone who thought they could not use the units they were accustomed to, it was due to misunderstanding between the relationship of canonical units and their actual units.

Dave-Allured commented 1 year ago

@larsbarring, sorry, I missed your ping. Your proposal is constructive and generous. However I think it is moving in the wrong direction. I feel that alternate labeling mechanisms are symptoms of a larger problem, the deficiency of the units attribute as currently defined in CF. I believe that restrictions on units are creating unwanted and ongoing side effects, some of which are highlighted in this new discussion. I do appreciate the external information reported by you and other commenters here.

I would like to remove some of the traditional restrictions on units, in particular the no redundancy rule. This has been tested several times before, and is seemingly unpopular, as recently mentioned.

ngalbraith commented 1 year ago

@ngalbraith If the standard name describes a dimensionless quantity, you may omit the unit attribute: ...

Units are not required for dimensionless quantities. A variable with no units attribute is assumed to be dimensionless. However, a units attribute specifying a dimensionless unit may optionally be included.

It didn't occur to me that the proposal was only for dimensionless variables, but, if so, then I have no objection at all. Thanks!

davidhassell commented 1 year ago

Hello, it is certainly good to try to minimise redundancy, but CF does already embrace attributes that are often overlapping in information content with others. The axis attribute is one (its value can usually be inferred from standard_name, units, positive, etc.); and the standard_name and long_name often contain the same information but with different words. The danger of inconsistency comes, I think, less from data writers, rather from data readers who may make in-memory changes after reading the data. For example, when combining fields they (or more likely their software!) may reasonably remove/change the standard name from the result, but might overlook the long_name. They/it might convert the data to "metres" by multiplying by 1000, but forget to change the units from "kilometres", etc. I am actually quite comfortable with this situation. CF has allowed the creator to describe their data, and the reader has understood it. For me, what the reader (or their software) does next comes under the principle of "buyer beware".

So, that all said, I'm OK in principle with the possibility of redundancy in this case of standardised and non-standardised units. This is independent of whether or not the new attribute is considered a good idea!

So, on the question of whether or not it may be a good idea, I think that it is fine if some groups will find it useful, and I don't think that an extra optional attribute of this nature would be a burden on the conventions.

With regard to the name of the proposed attribute, I'm not keen on it including the word "user" - a common software term but not very descriptive for a metadata standard. I quite like "long_units", for the connection with long_name, and because the attribute value will often have more characters (not a great consideration, I admit!).

The original proposal suggests that the attribute value would be wholly unstandardised, which is fine, but it occurs to me that you could do an inverse standardisation by saying that its value cannot be a valid CF value. This is easily checked, and would prevent the attribute being accidentally used in place of the real units attribute.
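The "inverse standardisation" check could look roughly like the sketch below. It assumes the hypothetical attribute name `long_units` and a caller-supplied predicate that decides whether a string is a valid CF units value. In practice that predicate would wrap UDUNITS; the small set used here is only a stand-in for illustration.

```python
from __future__ import annotations

from typing import Callable, Mapping

# Stand-in for a real UDUNITS-backed validity test (assumption, illustration only).
TOY_VALID_UNITS = {"1", "kg", "g", "K", "m", "s", "days", "kg m-2 s-1"}


def toy_is_valid_cf_units(value: str) -> bool:
    """Return True if the string is in the toy list of recognised units."""
    return value.strip() in TOY_VALID_UNITS


def check_long_units(
    attrs: Mapping[str, str],
    is_valid_cf_units: Callable[[str], bool] = toy_is_valid_cf_units,
) -> list[str]:
    """Return a list of problems found with the hypothetical long_units attribute."""
    problems: list[str] = []
    long_units = attrs.get("long_units")
    if long_units is None:
        return problems  # the attribute is optional, so absence is fine
    if is_valid_cf_units(long_units):
        # Inverse rule: a value that parses as real units belongs in "units".
        problems.append(
            f"long_units={long_units!r} is a valid units string; use 'units' instead"
        )
    return problems


ok = check_long_units({"units": "kg", "long_units": "kg CO2e"})
flagged = check_long_units({"units": "1", "long_units": "days"})
```

Here `ok` is empty because "kg CO2e" is not in the toy list, while `flagged` holds one message because "days" is.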

Thanks, David

DocOtak commented 1 year ago

The axis attribute is not redundant in that any given variable is only allowed to have a single axis of a given type (XYZT). My data often have multiple lon/lat variables from a few sources. We only consider one of these to be the canonical variable and that gets the X/Y axis designation. All the other CF attributes are the same and you would not be able to divine the axis variables you are supposed to use without this extra information.

In some other datasets I've seen, usually the long_name of a variable is the plot axis or data label, not the title for the entire plot. I've also seen the convention "var name [var units]" (e.g. "sea water temperature [deg C]") used in the long_name attribute.

Unless CF wants to control the contents/values of the attribute, CF should not define or recommend anything other than the usual "a file may also contain non-standard attributes" and you can do whatever you want with them. You can add a "user_units" attribute to your data right now and continue to be CF compliant. I guess my feeling on this is, defining an attribute and not controlling it with the conventions is basically the same as not doing anything with the existing conventions, which allows whatever the user wants already.

JonathanGregory commented 1 year ago

If the "other units" are intended as information to be read by humans, rather than processed by programs, they could be recorded in the long_name, couldn't they?

davidhassell commented 1 year ago

Hello Jonathan, we could of course put anything in the long_name, but I feel that that would make the other units concept less tractable, because I think that humans and software may want to access the other units independently of the identity of the quantity. For example, if the reader legitimately changed the units (e.g. kg to g), they may well want to change the non-standardised units too, either by modifying its value or deleting the attribute. The latter action is also something that could be done automatically by helpful software. Thanks, David

JonathanGregory commented 1 year ago

Since the interest is in recording a unit which isn't standardised and may not be udunits- or SI-compliant, I don't think one would expect generic software to touch it. Whenever someone does an operation on a field, they should pay attention to whether the long_name ought to be updated, and that would include any unit string it might contain, I think.

davidhassell commented 1 year ago

Hello Jonathan, what do you think about the case of being able to delete the other units attribute, leaving the long name alone? Software could easily do this, but it could never modify the long_name attribute in this way. Cheers, David

taylor13 commented 1 year ago

I think we're probably talking about non-existing software here that might be able to automatically adjust all the metadata whenever a CF-compliant netCDF variable is modified (in ways that can't all be anticipated) and then rewritten. If such software did exist, it would probably have to routinely eliminate the "long_name" and any "comment" (which might no longer apply after the data have been modified) and eliminate any non-CF attributes recorded in the original file (which also might need to be changed in ways unknown to the software). If the new units attribute were made part of CF (as proposed above), it too would have to be eliminated in the case of a change in actual units. More useful software would probably ask the user whether any of the metadata it couldn't interpret should be modified before rewriting the modified data.

If that's what's envisioned for software of this kind, then adding another attribute (alternative units) would just be added to the list of attributes the software would have to eliminate.

I guess one could therefore justify adding alternative units if enough folks would find it helpful. On the other hand, I think it unlikely this attribute would be defined except in very rare cases. For those cases, I would suggest that the data provider propose the "new units" be added to udunits as a valid unit, rather than modifying the CF conventions. Alternatively, for a sub-community of users who find this attribute helpful, they could mandate it be used (as a non-CF required attribute) for any data they exchange. For CMIP there are more than a dozen non-CF standard attributes that are required to be included in the netCDF files shared under that project (e.g., experiment_id, source_id, among others listed here).

I therefore vote against adding an "alternative units" attribute unless there is more evidence provided showing that it will find broad use across the climate and weather forecast community.

DanHollis commented 1 year ago

This discussion made me wonder why we have long_name in the first place i.e. an attribute whose name is specified but whose content is uncontrolled. As far as I can see it appears in the conventions purely because it is in the NUG. Would CF have invented it had it not already existed?

I guess the benefit of prescribing an attribute name is that it encourages all data writers to use the same attribute for similar information (which helps data readers know where to look for such information). However this only really works if it is introduced on day 1. To introduce something similar now (such as long_units) is going to struggle to get adopted - those that don't need it don't care, and those that have some use for it will have already invented their own attribute name and won't wish to upset their user community by switching.

I mentioned at the start of this thread that UKCP have a label_units attribute. I've just scanned through one of their controlled vocabularies and the only examples I can see where it differs from the units attribute are for temperature variables (where the units are degC and the label uses a degree symbol) and one instance where the label is 'degrees latitude' and the units are simply 'degrees'.

cofinoa commented 1 year ago

IMO the long_name and comment attributes can provide the required free-text information to data users about non-standard names and/or non-standard units potentially being used.

A different issue/question is about replacing/extending UDUNITS as the standard reference (with exceptions) for the units attribute.

davidhassell commented 1 year ago

Hello,

It occurs to me that CF has, in some sense, always had this feature in that it allows level, layer, and sigma_level as non-Udunits values of the units attribute. These are non-standardised units that were useful to one particular community at one time, and it seems to me plausible that an alternative units attribute could have been chosen to represent them way back in CF-1.0.

Some potential uses for a non-standardised attribute will, quite rightly, probably never be acceptable to Udunits, nor CF, such as kg CO2e (kg of carbon dioxide equivalent).

Karl's point about software is a good one. Which properties should be modified or deleted after a field has been modified or combined with another field is subjective, and there are various approaches to how a software library behaves by default. For example, in cf-python, if you divide "air_temperature" by "time" the result will have modified units and be stripped of standard and long names; but (e.g.) comment, history, etc. will remain as those that were present on the left-hand-side operand. I thought that that was probably OK most of the time, but that will not always be the case. Adding the potential removal of another attribute is, for me, just another line or two of library code, so is not really a burden. Similarly, if you were doing the aforementioned operation without the benefit of a CF-aware library, then you will either not worry about the metadata because you just need the numbers, or else you will, in which case one more item on a small list of concerns seems OK to me.
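The kind of default policy described above can be sketched with plain dictionaries. This is not cf-python's implementation, only an illustration of the rule: identity attributes are dropped, free-text attributes are carried over from the left-hand operand. The attribute name `long_units` and the function `propagate_attrs` are hypothetical.

```python
# Attributes that describe the identity of the quantity and so no longer
# apply once the physical quantity has changed. "long_units" is hypothetical.
IDENTITY_ATTRS = {"standard_name", "long_name", "long_units"}


def propagate_attrs(left: dict, new_units: str) -> dict:
    """Build attributes for the result of an operation that changes the quantity.

    Everything from the left-hand operand is kept except identity attributes,
    and the units are replaced by the caller-supplied new units.
    """
    result = {k: v for k, v in left.items() if k not in IDENTITY_ATTRS}
    result["units"] = new_units
    return result


air_temperature = {
    "standard_name": "air_temperature",
    "long_name": "Near-surface air temperature",
    "units": "K",
    "comment": "Measured at 2 m",
}

# Dividing by a time in seconds gives a rate, so the caller passes "K s-1".
rate_attrs = propagate_attrs(air_temperature, "K s-1")
```

With this policy, supporting one more optional attribute amounts to adding its name to `IDENTITY_ATTRS`.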

Perhaps a guidance list (website/appendix ?) of standardised attributes that may need attention after field manipulation would be a useful resource.

Thanks, David

larsbarring commented 1 year ago

Thanks David @davidhassell --- your comment captures much of my thinking behind the initial post. I just the other day learned about the kg vs. kg CO2e debate, else I would have included it as an example in the first post. And this is a clear example where the CF strictness vs. specific communities' requirements/views perhaps could be resolved to the satisfaction of all parties by having a long_unit available. From other issue conversations it may be clear that I fully support that CF takes a rather strict and restrictive stance on units and their relation to standard names representing well defined quantities. But I also recognise that there are justified needs to have something more permissive.

Just to briefly respond to some of the earlier comments:

@danhollis writes:

However this only really works if it is introduced on day 1. To introduce something similar now (such as long_units) is going to struggle to get adopted - those that don't need it don't care, and those that have some use for it will have already invented their own attribute name and won't wish to upset their user community by switching.

Yes, ideally this should have been solved day 1, but new needs arrive from time to time. And the fact that one -- or several -- subgroup(s) have solved it, possibly in inconsistent ways, does not decrease the usability of a common attribute name. New users will be guided by the recommendation, and there is even a chance that groups with their own solution will at some future time find it useful to switch to what is recommended. After all, this is how standards and conventions arise in the first place.

@taylor13 writes:

... non-existing software ...

The particular use case I am involved with is standard name number_of_days_with_X_above|below_threshold with canonical unit 1, where users do expect days. Here we have a large user base using XCLIM (ping @huard) and ICCLIM (ping @pagecp). While we have not discussed the specifics of this attribute and possible implementation in the codes, users frequently ask about this particular unit, and we have recognised this as something needing a solution.

@DocOtak and @JonathanGregory suggest that the alternative unit should be included in the long name instead of having a separate attribute. But the idea behind the proposal is exactly what @davidhassell suggests: it will be easier to manipulate if the need arises. Take, for example, the situation where unit = "kg" and long_unit = "kg CO2e". If the unit is then changed to g it will be easy to delete the long_unit, or more sophisticated software may even make the corresponding translation to the long_unit. But the responsibility for doing this is totally on the user. If the "long unit" is instead part of the long_name, the required parsing is far from trivial (and would probably depend on some rules, which I imagine CF does not want to become involved with).
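The kg to g case can be sketched as follows, assuming the attribute name `long_unit` from the example and a hypothetical helper `rescale`. The default is to delete the alternative unit; a caller who knows the right translation can supply it.

```python
def rescale(values, attrs, factor, new_units, new_long_unit=None):
    """Multiply values by factor and update the unit-related attributes.

    The hypothetical long_unit attribute is deleted unless the caller
    explicitly supplies a translated replacement.
    """
    new_attrs = dict(attrs)  # leave the caller's dictionary untouched
    new_attrs["units"] = new_units
    if new_long_unit is not None:
        new_attrs["long_unit"] = new_long_unit
    else:
        new_attrs.pop("long_unit", None)
    return [v * factor for v in values], new_attrs


emissions = [1.5, 2.0]
attrs = {"long_name": "Annual emissions", "units": "kg", "long_unit": "kg CO2e"}

# Simple software: convert to g and drop the alternative unit.
grams, dropped = rescale(emissions, attrs, 1000, "g")

# More sophisticated software (or the user) supplies the translation.
_, translated = rescale(emissions, attrs, 1000, "g", new_long_unit="g CO2e")
```

In both calls the long_name is left alone, which is the point of keeping the two pieces of information in separate attributes.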

@ngalbraith: No, the proposed enhancement is not limited to dimensionless variables only.

JonathanGregory commented 1 year ago

Dear @larsbarring, @davidhassell et al.

David suggests that when the units are changed, sensible generic software could justifiably delete the standard_name and long_name. I think that's reasonable, because they won't be applicable any more. I'm not convinced it would make things easier to have a separate attribute for the user-preferred units string, because in general they both have to change. The same generic software should also delete this new attribute if the units are changed. That's not hard for clever software, as David says, but it's one more thing for a user to forget who is not using such software, and thus a source of inconsistency, arising from redundancy. I think this goes against our principles 6 and 7.

On the contrary, I think it would be easier to put the user-preferred units in the long_name, so they get dropped as part of it. In Lars's example, if we have a quantity with long_name="UK annual GHG emissions (GtCO2e/yr)" and units="Pg yr-1", and we multiply it by a time in yr, we should change the units to Pg. The long_name is wrong in this case (because it's now cumulative emissions) and should be deleted, along with the user-preferred unit, if any. Generic software could not do any better than that. If you wrote a program to deal with this application specifically, you'd probably know what long_name to expect on input, and hence what to replace it with. If your program looked at a separate long_units attribute to identify the input and update it as required, it might forget to replace the long_name at the same time.

In Lars's own user case of number_of_days_with_X_above_threshold, you could put long_name="days", for example, if the standard name is otherwise self-explanatory i.e. use the long_name for the user-preferred units. Are there operations you might typically perform with these quantities which would change the units?

David also commented

Perhaps a guidance list (website/appendix ?) of standardised attributes that may need attention after field manipulation would be a useful resource.

I think that's a good idea, and it could perhaps conveniently be done with an extra flag column in Appendix A.

Best wishes

Jonathan

davidhassell commented 1 year ago

Hello,

In @JonathanGregory's example of changing the units from "Pg yr-1" to "Pg", I agree that deleting the long_name is sufficient, because the physical nature of the quantity has changed. What if, however, we were to express the quantity as "Pkg yr-1", by dividing by 1000? If the long_name was "UK annual GHG emissions" and long_units were "GtCO2e/yr", then we could delete the long_units and keep the identity. It would be wrong to delete the long_name in this case, I think.

Thanks, David

huard commented 1 year ago

Hi, I'd like to chime in as a software developer of a package (xclim) that relies heavily on CF-Conventions. I'm not enthused by the idea of mixing units in the long_name attribute. The fact that long_name would include units in some cases but not others is going to be a pain to handle gracefully. I'd much prefer an explicit rule.

Could we use cell_methods to embed that type of information, e.g. something like

If there's a regex pattern I can match it to, even better.

taylor13 commented 1 year ago

@huard I think xclim should rely on the regular "units" attribute and ignore any additional (and optional) "long_units" attribute, which will likely be included in less than 0.01% of variables written (that is to say "hardly ever included"). I think adding a new attribute that is unneeded nearly all the time is embellishing CF in a way that makes it less approachable for new users.

huard commented 1 year ago

@taylor13 xclim computes a number of indices in the "count_events_above_threshold" category, so there would be a legitimate case for us to support this type of feature if it went into the convention. Our team has had discussions with @larsbarring and his team about this over the last years and we look forward to a clean mechanism to include such information in the metadata.

taylor13 commented 1 year ago

Thanks for chiming in. For your use case, how are the units actually going to be used? As labels or titles on a plot? Will your software convert the units to some other units? Could you provide a bit more explanation about how software would use the new attribute?

huard commented 1 year ago

Indeed, we do have downstream utilities that use long_name [units] to label plot axes. We also run automatic unit conversions to align multivariate input data and parameters, and to output results to standard units.

I think our main issue with respect to this topic is that we currently define indicators of the "count_events_above_threshold" with units set to "days", even though we know it's not CF-compliant. We'd like our output to be fully CF-compliant, but not at the expense of leaving out information that we feel is essential to interpret the results.
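A minimal sketch of the "long_name [units]" labelling described above, assuming a hypothetical optional `long_units` attribute that takes precedence over the formal units for display only. The function name `axis_label` and the example long_name are illustrative and are not taken from xclim.

```python
def axis_label(attrs: dict) -> str:
    """Build a plot axis label of the form 'long_name [units]'."""
    name = attrs.get("long_name") or attrs.get("standard_name", "")
    # Display preference (assumption): show long_units if present, else units.
    shown_units = attrs.get("long_units", attrs.get("units"))
    return f"{name} [{shown_units}]" if shown_units else name


compliant = {
    "standard_name": "number_of_days_with_air_temperature_above_threshold",
    "long_name": "Number of summer days",
    "units": "1",
}
with_alternative = dict(compliant, long_units="days")

label_plain = axis_label(compliant)         # "Number of summer days [1]"
label_alt = axis_label(with_alternative)    # "Number of summer days [days]"
```

Unit conversion code would keep reading `units` only; the extra attribute affects nothing but the label.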

larsbarring commented 1 year ago

In response to both @JonathanGregory's question

In Lars's own user case of number_of_days_with_X_above_threshold, you could put long_name="days", for example, if the standard name is otherwise self-explanatory i.e. use the long_name for the user-preferred units. Are there operations you might typically perform with these quantities which would change the units?

and to @taylor13's

For your use case, how are the units actually going to be used? As labels or titles on a plot? Will your software convert the units to some other units? Could you provide a bit more explanation about how software would use the new attribute?

there has been ongoing work for at least a year, related to the WMO ET-CID, on expanding the capabilities of the widely used software package CLIMPACT (ping @heroldn) to calculate trends in the supported indices. And if users already find the canonical unit 1 uninformative and confusing, then the canonical unit 1 year-1 is of course even more difficult.

EDIT: Ahh, and the canonical unit for the trend is 1 day-1, which of course is even more problematic for users.

JonathanGregory commented 1 year ago

Dear all

David writes

What if we were to express the quantity as "Pkg yr-1", by dividing by 1000? If the long_name was "UK annual GHG emissions" and long_units were "GtCO2e/yr", then we could delete the long_units and keep the identity. It would be wrong to delete the long_name in this case, I think.

Suppose a quantity in Pg has long_name="UK cumulative GHG emissions (GtCO2e)". An application program which is written specifically for this case could be made clever enough to replace GtCO2e with TtCO2e in the long_name when the numbers are divided by 1000. The units could change from Pg to Eg. This is not a familiar unit to me, but it's an SI unit. The simplest thing for generic software is to delete the long_name attribute. An alternative is that it could append something, like for a history attribute. Appending to a string does not require parsing or interpreting it. In this case, it could make the long_name="UK cumulative GHG emissions (GtCO2e) divided by 1000". I don't think that generic software should be expected to do anything fancy with non-standardised attributes.
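The append approach can be sketched as below, using plain dictionaries and a hypothetical helper `divide_by_constant`. The long_name is never parsed; a note is simply added to the end, as is done for a history attribute.

```python
def divide_by_constant(values, attrs, divisor, new_units):
    """Divide values by a constant and record the operation in the long_name.

    The existing long_name is treated as an opaque string: nothing in it is
    interpreted, a description of the operation is only appended.
    """
    new_attrs = dict(attrs)
    new_attrs["units"] = new_units
    if "long_name" in new_attrs:
        new_attrs["long_name"] = f"{new_attrs['long_name']} divided by {divisor}"
    return [v / divisor for v in values], new_attrs


cumulative = [500.0, 1500.0]
attrs = {"long_name": "UK cumulative GHG emissions (GtCO2e)", "units": "Pg"}

scaled, scaled_attrs = divide_by_constant(cumulative, attrs, 1000, "Eg")
```

The resulting long_name is "UK cumulative GHG emissions (GtCO2e) divided by 1000", matching the example in the comment above.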

I appreciate the wish for a more familiar unit to describe the quantity, but I continue to feel that putting it in the long_name would meet the need. A human looking at the dataset can see and understand the long_name if they don't recognise the units or standard_name. The long_name may be used by software to supply plot labels, for example, as mentioned by @huard. The CF standard says, "The long_name attribute is defined by the NUG to contain a long descriptive name which may, for example, be used for labeling plots." I think it's fine to include familiar units in such "descriptive" information, although there is the danger of creating mistakes because of including redundant information.

Best wishes

Jonathan

JonathanGregory commented 1 year ago

Dear all

@davidhassell and I have just talked about this over a cup of tea. As a result, we feel that a critical question is the one Karl @taylor13 asked: What is the alternative units string going to be used for?

If it's intended for labelling plots, the long_name would be a good place for it, I believe. There must be lots of software which already uses the long_name to construct labels for plots. Is there a use-case in which putting the alternative units in the long_name, or in another attribute whose content is not standardised e.g. comment, could not be used to provide the metadata the data-writer would like to record in a form the data-reader can use?

In the use-case of @larsbarring and @huard, with quantities having standard names like number_of_days_with_X_above|below_threshold, should we define new quantities, of equivalent purpose, whose canonical unit is s i.e. time, so the units could be written as days, as preferred by the users? I expect we did not do that before because we found the existing standard name a short and simple way to express the idea.

Best wishes

Jonathan

larsbarring commented 1 year ago

Dear Jonathan,

Thanks for these comments/questions. I agree that the question @taylor13 asked is at the heart of this issue. And as you note in your comment, there are [at least] two answers:

  1. As I tried to explain in the initial post, the aim with the alternative units is to help users who find the rules governing units values too restrictive. Examples of how/why the current rules can be too restrictive were given, also in the following posts. I agree that, given the current state of affairs (i.e. software tools), this alternative string can indeed be placed in the long name string, as might already be the case in many situations. However, for our own particular use case (1 vs. days) we are in fact contemplating how an extended unit string could be handled by a "thin" software layer sitting on top of UDUNITS. And then it would be substantially easier to have the alternative unit string in a separate attribute rather than first having to extract it from the long name string, which may contain almost anything. Even if we do not actually implement this ourselves, having this separation between long_name and "long_unit" would keep the door open for future work in this direction. Merging both into the long_name would make this much more difficult. Hence, I still think that having such an optional attribute would serve its purpose.
  2. But for our particular use case (1 vs. days --- which indeed is a problem for a large community as @huard writes) I do agree that it would be better to come up with new standard names as you suggest. There are already two open issues in that direction (#107 and cf-convention/vocabularies#19), but as these are already rather complex and far-reaching I am not sure they are the best place to move this forward. Let me think this over for a while.
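The difference between reading a separate attribute and extracting a unit from the long name can be illustrated with a short sketch. The regular expression below is an assumption: it handles only a trailing "[...]" or "(...)" group, which are the two informal conventions mentioned in this thread, and it cannot tell a unit from any other parenthetical.

```python
import re

# Trailing "[...]" or "(...)" group at the end of a long_name (assumed pattern).
TRAILING_GROUP = re.compile(r"\s*[\[(]([^\[\]()]+)[\])]\s*$")


def units_from_long_name(long_name: str):
    """Guess a unit string from the end of a long_name, or return None."""
    match = TRAILING_GROUP.search(long_name)
    return match.group(1) if match else None


def alternative_units(attrs: dict):
    """Prefer the hypothetical long_unit attribute, else guess from long_name."""
    if "long_unit" in attrs:
        return attrs["long_unit"]
    return units_from_long_name(attrs.get("long_name", ""))


from_brackets = units_from_long_name("sea water temperature [deg C]")
from_parens = units_from_long_name("UK annual GHG emissions (GtCO2e/yr)")
false_positive = units_from_long_name("Precipitation (convective)")
from_attribute = alternative_units({"long_name": "Emissions", "long_unit": "kg CO2e"})
```

The third call returns "convective", which is not a unit. A separate attribute avoids this guesswork because no parsing is needed.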

JonathanGregory commented 1 year ago

Dear @larsbarring

By all means, let us think of new standard names if that would solve the immediate problem. I remember your other open issues cf-convention/vocabularies#31 and cf-convention/vocabularies#19, which are both productive discussions that seem near to an outcome. Thanks

Best wishes

Jonathan