Remove restrictions on netCDF object names

Dave-Allured commented 4 years ago

Title: Remove restrictions on netCDF object names

Moderator:

Moderator Status Review: New issue, 2020 January 23

Requirement Summary: None.

Technical Proposal Summary: Remove CF 1.7 section 2.3 restrictions on characters in names of variables, attributes, etc. Resolve ambiguous use of such restrictions.

Benefits

Support international usage.
Allow special characters in names.
Remove ambiguity over requirement versus preference.
Simplify CF rules.
Simplify conformance checking.
Improve compliance for some existing data sets.

Caveats

Breaks compliance with COARDS name rules, but is a superset of them.
Some existing software cannot handle non-traditional characters. It would need upgrading, but only when presented with new files using the expanded character set.

Status Quo: Object names are now restricted to a traditional yet limited character set which does not accommodate many non-western languages, nor other desired naming patterns.

Detailed Proposal: Change the first paragraph of 2.3 Naming Conventions as follows. The remainder of 2.3 is left unchanged.

Current version (1.8 draft):

Variable, dimension, attribute and group names should begin with a letter and be composed of letters, digits, and underscores. Note that this is in conformance with the COARDS conventions, but is more restrictive than the netCDF interface which allows use of the hyphen character. The netCDF interface also allows leading underscores in names, but the NUG states that this is reserved for system use.

Proposed:

Variable, dimension, attribute, and group names are not generally restricted by this convention. Any names that are acceptable to the netCDF library may be used. The most notable rules from netCDF are ASCII or UTF-8 character set, forward slash "/" not allowed, and names should not begin with underscore or certain other special characters. Refer to file format specs in the NUG for more details.

(Edit: Added forward slash "/" after following comments were posted.)

JimBiardCics commented 4 years ago

While I generally approve of relaxing the character set restrictions, I think we may need to consider certain patterns that should either be reserved or restricted. As an example, the use of slashes ('/') in names wreaks havoc with group path formalisms that are already in place outside of CF. In addition to the prohibition on having leading underscores that is mentioned in the proposal, the netCDF-LD project (@marqh) is making use of doubled underscores within a name as a mechanism for marking namespaces. There may be other cases "in the wild" where certain patterns are in use, and I think we should be careful to avoid causing problems by being overly loose here.

I suggest that, at minimum, we should disallow the use of slashes ('/') or backslashes ('\') in names, and should call out two or more sequential underscores ('__') as reserved.

steingod commented 4 years ago

I support the constraint indicated above. Especially allowing slashes and backslashes in names will be confusing.

erget commented 4 years ago

Agreed, I think it would be best if the restrictions were presented in a table for readability.

marqh commented 4 years ago

We may get some benefit from considering other standardisation activity in this domain?

RFC3986 defines the generic syntax for the Uniform Resource Identifier (URI) https://tools.ietf.org/html/rfc3986

As netCDF variables are resources that are being identified within the domain of a netCDF file, could we benefit from just adopting RFC3986?

This has a reserved character section: https://tools.ietf.org/html/rfc3986#section-2.2

Disclaimer: I have not cross referenced this in detail with the NUG to examine consistency or problem areas (potential for contribution if useful). At first glance, these look pretty similar.

If these are consistent, then adopting the NUG definition unchanged looks sensible to me. It already mandates against the use of a '/' character, which is the most problematic one for me, given groups and variable identity within groups.

I'd like to see an explicit reference to the relevant NUG section in the text or linked, as I had to search a bit, and I know what I'm looking for. I think https://www.unidata.ucar.edu/software/netcdf/docs/netcdf_data_set_components.html#Permitted is stable enough for a standards document (@ethanrd do you agree this is a stable URI for the resource please?)

mark

JimBiardCics commented 4 years ago

@marqh I like the overall suggestion of RFC3986. I think we should not adopt the "% encoding" concept of RFC3986. And, again, I think we should reserve leading "_" characters (per NUG) and multiple sequential "_" characters (per netCDF-LD). Are there any other special character sequences in the wild that anyone is aware of, in UGRID or Radial perhaps?

I notice that the NUG section you referenced implies that space characters are allowed as long as they are not at the end of a variable name. Do we want to allow internal spaces?

marqh commented 4 years ago

@marqh I like the overall suggestion of RFC3986. I think we should not adopt the "% encoding" concept of RFC3986. And, again, I think we should reserve leading "_" characters (per NUG) and multiple sequential "_" characters (per netCDF-LD). Are there any other special character sequences in the wild that anyone is aware of, in UGRID or Radial perhaps?

I agree, @JimBiardCics, that adoption of %encoding is not a path I would want to walk. it's perhaps a useful cross reference, but points like this suggest against including some specific use of RFC3986 within CF

I notice that the NUG section you referenced implies that space characters are allowed as long as they are not at the end of a variable name. Do we want to allow internal spaces?

internal spaces!?!? really

if we can stop that, then that is a good thing. Why would the NUG allow variable names with spaces in them??

my reading of

The names of dimensions, variables and attributes (and, in netCDF-4 files, groups, user-defined types, compound member names, and enumeration symbols) consist of arbitrary sequences of alphanumeric characters, underscore '_', period '.', plus '+', hyphen '-', or at sign '@', but beginning with an alphanumeric character or underscore. However names commencing with underscore are reserved for system use.

leads me to view space as not allowed. However the following:

Beginning with versions 3.6.3 and 4.0, names may also include UTF-8 encoded Unicode characters as well as other special characters, except for the character '/', which may not appear in a name. Names that have trailing space characters are also not permitted.

Could someone from a Unidata background confirm or deny that in netCDF4, a space may be used within a variable name?

zklaus commented 4 years ago

I have zero Unidata authority, but I'd like to state the obvious: Unicode is complicated. This may already account for the somewhat vague formulation in the NUG if one takes a look at the list of whitespace characters in unicode. Indeed, whether one wants to go with a blacklist or a whitelist approach, it may be a good idea to think and write in terms of Unicode character categories (cf here or here).
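To make the point about Unicode character categories concrete, here is a minimal Python sketch using the standard `unicodedata` module. The four code points are picked purely for illustration and are not taken from the NUG; the helper name `describe` is hypothetical.

```python
import unicodedata

# Code points chosen for illustration only; each one is treated as
# whitespace by str.isspace(), yet they fall into different categories.
candidates = {
    "SPACE": "\u0020",
    "CHARACTER TABULATION": "\u0009",
    "NO-BREAK SPACE": "\u00a0",
    "IDEOGRAPHIC SPACE": "\u3000",
}


def describe(ch):
    """Return (code point, Unicode general category, str.isspace() result)."""
    return (f"U+{ord(ch):04X}", unicodedata.category(ch), ch.isspace())


report = {label: describe(ch) for label, ch in candidates.items()}
for label, row in report.items():
    print(label, row)
```

A rule written as "no spaces" would need to say whether it means only U+0020, the Zs category, or everything that a given language treats as whitespace.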

ngalbraith commented 4 years ago

I'm afraid I'm the odd man out here - I don't think the list of benefits in the original issue stacks up against the costs; in fact some of them don't seem to BE benefits. Maybe some use cases would be helpful ... Could you elaborate on how this change would support international usage?

Is improved compliance for some existing data sets really a goal? What's in these data sets that needs to be described with a name that begins with a number or contains spaces or special characters?

Maybe this is a selfish concern - we use Matlab's built-in netCDF library, and I'm not sure how that would deal with this change. If it's really needed for some specific reason, we'll deal with it, but absent that explanation, this is just a headache for a lot of CF users.

ethanrd commented 4 years ago

Is there a user asking for this extension, a particular use case that needs addressing? CF has generally tried to avoid extensions that seem like a good idea but don’t have a current use case.

Having said that, if we do move forward, I think we should be very cautious. Not only is Unicode very complicated as @zklaus points out, so are the rules around reserved character sets in URLs (and in which part of the URL) and file systems. Extending the set of characters allowed to include those reserved characters means they will need to be properly encoded when used in URLs (e.g., OPeNDAP and OGC WCS). Which, it turns out, isn’t as easy as it might seem.

Also, this or similar proposals/discussions have come up before, I think several times but so far I've only found these two:

A 2014 discussion on the email list (the initial email is here) focused mainly on expanding the set of characters allowed to include ‘@’, ‘+’, ‘-’, and ‘.’ with some mention of Unicode coming fairly late in the discussion.
Trac Ticket #157 suggested moving from “should” to “must” on the current set of allowed characters.

ethanrd commented 4 years ago

@WardF and @lesserwhirls - Could you address the question of whether whitespace characters are allowed in netCDF variable names?

MTG-Formats commented 4 years ago

Having blank spaces in names would break other CF conventions like use of the ancillary variables attribute.

"The attribute ancillary_variables is used to express these types of relationships. It is a string attribute whose value is a blank separated list of variable names. "

How to parse this?

float q_error_limit(time)
  q_error_limit:standard_name = "specific humidity standard error" ;
  q_error_limit:units = "g/g" ;
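The parsing problem can be sketched in a few lines of Python. The attribute value and the variable name containing spaces are hypothetical, invented only to show what a blank-separated list does to such a name.

```python
# Hypothetical attribute value: the second variable name deliberately
# contains spaces, which the blank-separated list cannot represent.
ancillary_variables = "q_error_limit q detection limit"
intended_names = ["q_error_limit", "q detection limit"]

# A reader following the CF text splits the attribute value on blanks.
parsed = ancillary_variables.split()

print(parsed)
print(parsed == intended_names)
```

The split yields four tokens instead of two names, and nothing in the attribute value allows the reader to recover the intended grouping.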

taylor13 commented 4 years ago

I must be missing something, but if a variable is named, for example, "a-b", and one uses that in a computer code, how is it interpreted? How is that variable distinguished from the operation: subtract variable "b" from variable "a"? Don't "+", "-", "/", "*", " " all have this problem?

JimBiardCics commented 4 years ago

@taylor13 Your code would have to parse the variable name into code. Until you did something like that, it is just a string.

taylor13 commented 4 years ago

As a user of data, I usually like the names of my variables (in my codes) to be the same as their names in the netCDF file. With the current naming convention for CF, this is always possible, I think. If, however, certain restrictions were removed, as suggested above, this would no longer be true.
I would echo others and ask what particular use cases are driving this?

Dave-Allured commented 4 years ago

Well, thank you for all your thoughtful responses. I see that we are rehashing the 2014 discussion, and probably others. Thanks @ethanrd for finding that. There are good arguments pro and con there, and it is worth reading.

The difference is that only 4 extra characters were proposed in 2014. I simply want to legalize all the other 137 thousand!

Is there a user asking for this extension, a particular use case that needs addressing? CF has generally tried to avoid extensions that seem like a good idea but don’t have a current use case.

No, I do not have a current use case. This is a recurring issue, so I thought this comprehensive approach would be beneficial. Past use cases were mentioned or implied in the 2014 discussion, and in trac 157.

NetCDF developers put some care into expanded name capability, 12 years ago. However, CF restrictions are copied virtually unchanged from 25 year old COARDS rules, which were probably based on ASCII only. CF is overdue to allow the full naming range for creative purposes by all scientific users.

Name quoting is generally easy and well supported in most modern programming languages. This takes care of UTF-8, math symbols, and other active characters. IMO, naming freedom should outweigh exactly matching names of program variables.

ngalbraith commented 4 years ago

@taylor13 Your code would have to parse the variable name into code. Until you did something like that, it is just a string.

Not everyone writes their own netCDF translators, and some packages no doubt take the variable and attribute names from the netCDF variable and attribute names. Those who use these packages are least likely to be in a position to accommodate this change.

When I have a minute I'll give it a try with the Matlab netCDF interface. I'd be much happier to spend the time on it if there was more than 'creative purposes' for a reason. The trac ticket has an example of isotopes with names that begin with a number, which has some weight, but the work around for that seems simple compared to what would be needed by someone using code that auto-assigns variable names.

On the other hand, most folks probably work with multiple standards; OceanSITES would no doubt maintain the variable name restriction, if CF doesn't.

zklaus commented 4 years ago

I agree that it would be good to have use cases.

@ngalbraith is also right that not everyone is writing their CF code based on naked netCDF access. Indeed, I consider such an approach foolish, since CF is far too rich by now to stand a serious chance of getting it right.

However, while using the netCDF variable name as a program variable name might be excused in small, not reused code that only ever will deal with, say tas, it is inexcusable in general-purpose library code. How would such a variable enter the namespace without the program knowing its name beforehand? Ultimately, the only way is via the equivalent of eval(var_name). Such code is prone to breakage no matter what restrictions we put on the character set since it would always leave open the possibility of having reserved words of the particular programming language as variable names. Another serious problem is that it opens the possibility to maliciously crafted variable names: How about var_name='system("rm -rf .")'?
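One way to avoid the eval-style pattern is sketched below in Python. It is an illustration only: the list of names is hypothetical, the data values are placeholders, and the helper `usable_as_python_name` is not part of any netCDF or CF library.

```python
import keyword

# Hypothetical names as they might be read from a file; none are trusted,
# and none of them is ever executed.
names_from_file = ["tas", "class", "a-b", 'system("rm -rf .")']

# Keep the data in a dict keyed by the netCDF name. Any string is a valid
# key, so no name has to be turned into program code.
data = {name: [0.0, 1.0] for name in names_from_file}


def usable_as_python_name(name):
    """True only if the name could be a Python variable name."""
    return name.isidentifier() and not keyword.iskeyword(name)


usable = [name for name in names_from_file if usable_as_python_name(name)]
print(usable)
```

Note that "class" satisfies the current CF character rule and still cannot be a Python variable, which is the reserved-word problem described above.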

Hence, I don't think the argument that all netCDF variable names should be permissible program variable names in all programming languages should guide the design of CF.

DocOtak commented 4 years ago

I had the same thoughts as @zklaus when thinking about the security implications of what I could only imagine was an eval(var_name). I've even seen some of the matlab code which does exactly this to load all the variables into a matlab namespace. I'd even go so far as to recommend that the CF document itself warn against doing this...

martinjuckes commented 4 years ago

I agree that some use cases would be helpful. I'm not sure about the specific proposal that initiated the discussion, but I do agree with the thought behind it that we should have a considered and reasoned policy on this, rather than just having a frozen-in rule based on past library constraints.

One reason that we might want to depart from the full freedom allowed in NetCDF is that we have, in CF, a range of different attributes to describe a variable. The long_name is designed to hold human-readable text, while the standard_name and units both have strongly constrained values.

Some application libraries need, in places, identifiers with a restricted character set. For example, I can construct a collections.namedtuple with name tas, but not with name tas.Amon because, in python "Type names and field names can only contain alphanumeric characters and underscores" (cited from an error message generated by collections.namedtuple). Could this be considered as a use case for having place in the convention to specify, for CF objects, an identifier which is composed of "alphanumeric characters and underscores"? The variable name is the de facto place which many people use for this kind of identifier (perhaps because of legacy packages).
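The namedtuple behaviour can be reproduced with the short Python sketch below. The type names `Record` and `Renamed` are arbitrary, and the exact wording of the error message may differ between Python versions.

```python
from collections import namedtuple

# A field name made of letters only is accepted.
Record = namedtuple("Record", ["tas"])

# A field name containing a period is rejected with ValueError.
try:
    namedtuple("Record", ["tas.Amon"])
    rejected = False
except ValueError as err:
    rejected = True
    print("rejected:", err)

# rename=True silently replaces invalid field names with positional ones,
# so the original identifier is lost.
Renamed = namedtuple("Renamed", ["tas", "tas.Amon"], rename=True)
print(Renamed._fields)
```

The `rename=True` fallback keeps the code running but discards the name, which is arguably the kind of silent loss a restricted identifier is meant to prevent.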

Note that the standard_name fits the character restriction, but does not fit the use case because different variables may have the same standard_name.

Another potential use case is for identifiers of concepts described in RDF Turtle which has a character restriction on object names, broader, I think, than "alphanumeric characters and underscores", but definitely narrower than the 137 thousand characters available in UTF-8.

The desire to have a simple identifier is linked, in my mind at least, to the concept of a namespace, which is being discussed in the context of NetCDF (see NetCDF-ld and discussion on namespace delimiters). I don't think this is simply a matter of upgrading software to make it accept generic strings: there is a wide range of applications that exploit identifiers constructed from a limited character set in order to enable the use of identifiers within a text string.

zklaus commented 4 years ago

One potential use-case that always came to my mind without an actual example at hand is the native names of weather stations, say a temperature time-series from the Umeå station, where the variable name contains the station name.

What makes this particularly interesting is that it seems to be permitted already under current CF conventions, since under CF-1.8, Section 2.3 Naming Conventions it says:

Variable, dimension, attribute and group names should begin with a letter and be composed of letters, digits, and underscores. [...] Languages other than English are permitted for variables, dimensions, and non-standardized attributes.

martinjuckes commented 4 years ago

Hi @zklaus : good point about the existing rules.

Regarding your use case; wouldn't that use case be covered by setting the long_name to "Temperature time-series from the Umeå station"? The current convention appears to permit "Umeå_station", but not "Umeå station" (blanks not allowed).

The cfchecker (4.0) takes a narrower view of what is allowed, restricting variable names to strings matching the python regex: '^[a-zA-Z][a-zA-Z0-9_]*$'.

zklaus commented 4 years ago

Yes, that might be a good way to encode the information. What I wanted to say is this: I find it very plausible that in a national weather service a group sits together and decides to code their station data using variable names tas_station-name with a number of non ascii letters in the station names. Furthermore, that would appear to be perfectly valid CF.

So I think being more explicit about what is meant by "letter" would be good, even if that means saying that only ascii letters are allowed.

sval-dev commented 1 year ago

In case it is helpful to have a real use case, the desire to have variables and groups able to describe "PM2.5" is described in the related discussion at https://github.com/cf-convention/discuss/issues/256 and an example prototype product making use of these group and variable names can be found at: https://asdc.larc.nasa.gov/data/MAIA/L4_GFPM_VSIM001/2018/01/MAIA_L4_GFPM_20180101T000000Z_FB_NOM_R01_USA-Boston_F01_VSIM01p01p01p01.nc

ethanrd commented 1 year ago

Hi all - Just caught up on the conversation in discuss issue #256. I wanted to mention that the Zarr specification group had a related discussion last January or so (Issue #56 "Node name character set" and PR #196). It was pretty focused on expanding to include non-ASCII Unicode characters. There was some good discussion on Unicode normalization and how to restrict the set of allowed characters. At this time, a recommendation for the Unicode normalization was added to the Zarr v3 core specification (see "Node names" section) but any recommendation for a restricted set of allowed characters was put off to an extension. (The Zarr v3 specification is being developed with a core and extensions and conventions model.)

Along with the Python Language syntax for identifiers/names (mentioned in discuss issue #256), the Zarr discussion also included the following references that might be useful:

Python PEP 3131 "Supporting Non-ASCII Identifiers"
Unicode General Category property of each character (code point) - (Wikipedia page)
Unicode Normalization FAQ
Unicode Standard Annex #31: Unicode Identifier and Pattern Syntax
Unicode Technical Report #36: Unicode Security Considerations
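Why normalization matters for names can be shown with a small Python sketch. The station name is borrowed from the Umeå example earlier in the thread; NFC is used here only as an assumption, since this thread does not say which normalization form the Zarr specification recommends.

```python
import unicodedata

# Two spellings that render identically on screen.
composed = "Ume\u00e5_station"      # a-with-ring as one code point (U+00E5)
decomposed = "Umea\u030a_station"   # "a" plus COMBINING RING ABOVE (U+030A)

# Without normalization the two names compare as different strings.
same_before = composed == decomposed

# After normalizing both to the same form they compare equal.
same_after = (
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", decomposed)
)

print(len(composed), len(decomposed), same_before, same_after)
```

A lookup by variable name would fail for one of these spellings unless the writer and the reader agree on a normalization form.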

Edited by @JonathanGregory. Ethan had written #256 as text, but GitHub turns this into a link to PR 256 in this repo, whereas I believe Ethan means issue 256 in the discuss repo, so I have made those links explicitly.

ethanrd commented 1 year ago

My main take away from the Zarr discussion is that the Python Language syntax for identifiers/names (mentioned in discuss issue #256) seems like a good starting point (assuming we want to open things up as wide as possible, the other 137K characters as @Dave-Allured said). For maximum interoperability while supporting Unicode, I think further restrictions would be necessary since netCDF variable names often end up in URLs and file system names.

While encoding/escaping of characters/bytes (e.g., url-encoding) is well supported, when to encode, with which encoding, and by what software component can get confusing. So I don't think character encodings fully address interoperability effects of reserved characters in URLs and file system names.
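The encoding question can be illustrated with Python's standard `urllib.parse`. The names are examples drawn from or modelled on this thread; the last one is a shortened, made-up stand-in for the chemical names discussed elsewhere.

```python
from urllib.parse import quote, unquote

# Example names; the last one is a shortened hypothetical chemical-style name.
names = ["tas", "PM2.5", "Ume\u00e5_station", "2,2'-x y"]

# Percent-encode every character outside the unreserved set.
encoded = {name: quote(name, safe="") for name in names}

# If a second software component encodes again, the "%" itself is escaped
# and a single decode no longer returns the original name.
double_encoded = {name: quote(enc, safe="") for name, enc in encoded.items()}

for name in names:
    print(repr(name), encoded[name], double_encoded[name])
```

Names limited to ASCII letters, digits, underscore, period, and hyphen pass through unchanged, so the question of which component encodes never arises for them.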

Edited by @JonathanGregory as for the previous comment.

larsbarring commented 1 year ago

If this issue is now revived, it might be a good time to find a moderator (I am not volunteering because of limited expertise and experience).

sethmcg commented 1 year ago

I think that any form of whitespace should be disallowed in naming.

In addition to the aforementioned problem it causes with ancillary variables, it's not uncommon in my experience for a lot of netcdf processing to happen on the command line (rather than neatly encapsulated within the confines of a general-purpose library) by piping the output of ncdump -h through various commands, in particular cut and grep. Whitespace would play havoc with those kinds of workflows, which in my opinion makes it an absolute showstopper.

This use case also makes me very leery about the prospect of allowing any character that is a special character in the shell or a regular expression. Let's not set up a situation that demands lots of quoting.

Further, this suggests to me that if the list of allowed characters is to be expanded, it should be via a whitelist approach rather than a blacklist approach; i.e., the default should be that characters are disallowed unless they have been carefully vetted.

JonathanGregory commented 1 year ago

I agree with all of @sethmcg's points.

larsbarring commented 1 year ago

This is just to record a use case by @markusfiebig in discuss/#256 (that I am going to close as duplicate):

I would in fact propose to relax the character restrictions for CF names considerably since these restrictions limit the usability of the convention. I will soon have to propose names for the concentrations of PCBs, so we are looking at names of the type

2,2',3,3',4,4',5,5'-octachlorobiphenyl mass concentration

The commas and quotation marks in this name are essential to denote the chemical, so they can't be replaced. PCBs and brominated flame retardants are clearly a relevant area of atmospheric research and need to have a place in the CF naming convention.

To limit the character set of a vocabulary to meet the needs of programming languages is rather outdated. Programming languages should serve the use cases, not limit them.

Dave-Allured commented 1 year ago

Regarding program name spaces, quoting, whitespace, etc; it is the programmer's responsibility to be alert to special cases, avoid namespace conflict and code injection, and use quoting as needed. The role of CF should be to describe metadata, not to guard against names reasonably crafted from the rules of the underlying format.

Let's add something to deal with CF's blank-separated lists.

Variable names included in blank-separated lists such as ancillary_variables or coordinates must not include the ASCII space character.

ethanrd commented 1 year ago

I think CF should make recommendations regarding situations that may hinder interoperability. Allowing Unicode characters in variable names will add many challenges to interoperability.

The current language in section 2.3 says "names should begin with a letter and be composed of letters, digits, and underscores". In ASCII that translates to the '^[a-zA-Z][a-zA-Z0-9_]*$' mentioned by @martinjuckes above. I believe the Python Language syntax for identifiers/names I mentioned above is close to the same for Unicode, the underscore gets expanded to the Unicode Connector Punctuation (Pc) category and some mark categories as well.

I think each of those restrictions would be good to mention in CF as providing two different levels of interoperability.
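The two levels could be sketched in Python as follows. This is an illustration, not a proposed CF rule: the function name and the level labels are hypothetical, and the extra rejection of a leading underscore is an assumption added here to mirror "begin with a letter", since Python identifiers do allow it.

```python
import re

# The ASCII pattern quoted from cfchecker earlier in the thread.
ASCII_NAME = re.compile(r"^[a-zA-Z][a-zA-Z0-9_]*$")


def interoperability_level(name):
    """Classify a name as 'ascii', 'unicode', or 'other' (labels are hypothetical)."""
    if ASCII_NAME.fullmatch(name):
        return "ascii"
    # str.isidentifier() applies Python's Unicode identifier rules.
    # Assumption: leading underscores are excluded, unlike in Python itself.
    if name.isidentifier() and not name.startswith("_"):
        return "unicode"
    return "other"


examples = ["tas", "Ume\u00e5_station", "PM2.5", "tas_station-name", "_FillValue"]
levels = {name: interoperability_level(name) for name in examples}
for name, level in levels.items():
    print(name, level)
```

Under this sketch "PM2.5" and hyphenated names fall outside both levels, so the use cases raised in this thread would need a third, wider category.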

ethanrd commented 1 year ago

The 2023 CF Workshop is next week, 3-5 Oct 2023 (agenda and registration info here). I have been thinking of trying to present an overview of some of the Unicode issues involved in this discussion as a lightning talk. If enough folks attend the workshop interested in further discussion on this topic, we could spin up a hackathon breakout.

Dave-Allured commented 1 year ago

I think CF should make recommendations regarding situations that may hinder interoperability.

Naming interoperability should be governed by the underlying file format, not CF.

larsbarring commented 1 year ago

I must say that the more I think about this issue the more uncertain I get about the intention and scope of the suggested changes. It might very well be that I am confusing myself or misunderstanding the intent, but here is my current thinking summarised into a few points:

The text in section 2.3 specifying the CF naming requirements only applies to what constitutes CF metadata, not everything in a netCDF file.
The netCDF file format accepts almost all unicode characters in names, so in that sense there is no restriction on creativity.
Just as a random example take the CF attribute long_name. I might want to use the Swedish language version långt namn, someone else might want to use the Amharic one ረጅም ጊዜ (from an online translation site), and so forth. There might be perfectly reasonable arguments for doing that at national levels, but does it help interoperability? Is this something that CF should cater for?
From the perspective of CF I have a hard time understanding how naming freedom would go hand in hand with interoperability. In a sense standards and conventions are always restricting "creativity"(*) just to create interoperability. Just think of USB connectors, or wall plugs when travelling. For these two examples I am happy with all the standardisation work that resulted in rather few variants, still I personally think that anything more than just one alternative is too many.
Currently there are use cases for expanding the set of allowed characters in variable names.
We need to clarify what is meant by the word letter in the CF text.

(*) I write "creativity" because the restrictions equally create creativity, which arises from the fact that data analysts can think creatively about analyses rather than wrestling with data harmonisation.

Dave-Allured commented 1 year ago

The text in section 2.3 specifying the CF naming requirements only applies to what constitutes CF metadata, not everything in a netCDF file.

@larsbarring, the opening paragraph of the existing section 2 clearly says "In this section we describe conventions associated with filenames and the basic components of a netCDF file". I think your interpretation "only applies to what constitutes CF metadata" is mistaken.

This proposal applies to the general name space, such as the user's variable and attribute names. It does not propose to change anything about any CF controlled vocabulary.

larsbarring commented 1 year ago

Ah, I now see what you mean --- thanks @Dave-Allured! Then I think that the core of the issue is how the text in section 2.3 interacts with the opening paragraph of section 2. It might be easier to change the opening paragraph to explain what is generally allowed in netCDF files and then state that the remaining text deals with CF specifications. But all this will be discussed at the CF workshop.

ethanrd commented 1 year ago

Hi all - There will be time to discuss this issue tomorrow (4 Oct) during the CF Workshop's hackathon breakout session, currently scheduled for 16:40 UTC (10:40 US Mtn). I gave a very brief summary today during the hackathon introductions session (slides 1-3 here). Hope you can join us for the discussion.

Dave-Allured commented 1 year ago

@ethanrd, I took a quick look at your slides. I hope you are able to show and discuss some of the more interesting requested use cases.

From above: 2,2',3,3',4,4',5,5'-octachlorobiphenyl mass concentration

This name actually includes a count of four unique special characters. Comma and apostrophe were already mentioned. Notice that there are also a minus sign and spaces.
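The count can be checked with a two-line Python sketch, assuming the minus sign in the name is the ASCII hyphen-minus as typed here.

```python
name = "2,2',3,3',4,4',5,5'-octachlorobiphenyl mass concentration"

# Collect every distinct character that is neither a letter nor a digit.
special = sorted({ch for ch in name if not ch.isalnum()})

print(special, len(special))
```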

Dave-Allured commented 1 year ago

I will add title.fr-CA from the localization discussion, which in my opinion is the optimal solution. There have also been other suggestions on that issue, involving other special characters.

ethanrd commented 1 year ago

@Dave-Allured - Thanks. I have added another slide listing both these use cases.

There was a question from @JonathanGregory in issue cf-convention/discuss#256 about whether the intent for the octachlorobiphenyl use case was for variable names or for standard names. Do you know if that got answered? I'm not finding it in there.

Dave-Allured commented 1 year ago

Ethan, I do not know. I refer you to the original author, @markusfiebig.

larsbarring commented 1 year ago

I will be in another breakout group, so here is a thought: Might it be useful to have different rules for attribute names and variable names?

ethanrd commented 1 year ago

@Dave-Allured wrote:

Naming interoperability should be governed by the underlying file format, not CF.

While I agree much of this should probably be more carefully thought out at the netCDF level, there is I think a need to mention all this in CF. Perhaps it is more a backwards compatibility issue than interoperability, both for software and for new data.

turnbullerin commented 1 year ago

I think CF should make recommendations regarding situations that may hinder interoperability. Allowing Unicode characters in variable names will add many challenges to interoperability.

The current language in section 2.3 says "names should begin with a letter and be composed of letters, digits, and underscores". In ASCII that translates to the '^[a-zA-Z][a-zA-Z0-9_]*$' mentioned by @martinjuckes above. I believe the Python Language syntax for identifiers/names I mentioned above is close to the same for Unicode, the underscore gets expanded to the Unicode Connector Punctuation (Pc) category and some mark categories as well.

I think each of those restrictions would be good to mention in CF as providing two different levels of interoperability.

Here's some notes on interoperability with other file formats and some common tools (restrictions are on variable names and, if supported, attribute names - values tend to be more flexible)

DAP2 allows [0-9A-Za-z_!\~*'"-] and other US-ASCII if URL-escaped; Special Characters: =<>!+-/\*~%.[]
DAP4 UTF-8 characters (escaped if not US-ASCII); Special characters: /
HDF5 UTF-8 supported
ASCII, CSV, TSV are character-encoding dependent but all valid characters allowed (with proper escaping)
KML depends on coding, <& and either ' or " must be escaped and non-printable control characters and compatibility characters are discouraged: https://www.w3.org/TR/xml/#NT-Char
ESRI strongly recommends [A-Za-z0-9_-], explicitly not allowed: +*/!^%()[]{},~'":;><&|\=@#$
MATLAB files: must follow MATLAB naming rules A-Za-z0-9_

It seems the direction is towards full UTF-8 compatibility but not all tools/platforms are there yet.

My take on this is, at the very least, we should not allow Unicode control characters (so 0x00 to 0x1F, 0x7F, and 0x80-0x9F, general category Cc) along with the two slashes, and following the W3 guidelines in section 2.2 of the above reference is a good starting place for more disallowed characters.
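A minimal Python sketch of that baseline check follows. The function name is hypothetical, and only the control-character category and the two slashes are tested; the further W3 recommendations are not covered.

```python
import unicodedata


def disallowed_characters(name):
    """Return the characters in name that are controls (Cc) or slashes."""
    return [
        ch for ch in name
        if unicodedata.category(ch) == "Cc" or ch in "/\\"
    ]


checks = {
    "tas": disallowed_characters("tas"),
    "Ume\u00e5_station": disallowed_characters("Ume\u00e5_station"),
    "group/var": disallowed_characters("group/var"),
    "a\\b": disallowed_characters("a\\b"),
    "bad\tname\x7f": disallowed_characters("bad\tname\x7f"),
}
for name, found in checks.items():
    print(repr(name), found)
```

Testing the general category rather than hard-coded ranges keeps the check aligned with the Unicode data shipped with the interpreter.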

MaartenSneepKNMI commented 1 year ago

Keep in mind that the letter category in Python is already very broad. The statement:

from math import pi as π

is legal in current Python. That also means that all the issues seen in URLs, caused by characters that have the same shape but are actually different codepoints, will be included here. Spot the difference: "o" or "ο". So as a warning: be careful what you wish for.
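The look-alike problem can be made visible with the standard `unicodedata` module. This sketch assumes the second character above is the Greek small letter omicron (U+03BF); the helper `describe` is a hypothetical name.

```python
import unicodedata

latin_o = "o"       # U+006F
greek_o = "\u03bf"  # assumed to be the look-alike in the example above


def describe(ch: str) -> str:
    """Format a character as its code point and Unicode name."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch)}"


print(describe(latin_o))
print(describe(greek_o))
print("equal:", latin_o == greek_o)
# Both are valid Python identifiers, and NFKC normalization does not merge them.
print("identifiers:", latin_o.isidentifier(), greek_o.isidentifier())
print("same after NFKC:", unicodedata.normalize("NFKC", greek_o) == latin_o)
```

Two variable names that differ only in such characters would compare as different strings while rendering almost identically.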

larsbarring commented 1 year ago

I think this conversation is spreading out in many directions, and I think that it is useful to disentangle these.

1. To my understanding this issue was initiated because the current CF text is inconsistent in relation to the recent development of the netCDF library. Because NetCDF now accepts almost any Unicode character, the conventions text would become clearer if the opening text of chapter 2 and the text in section 2.3 better distinguished between what are general netCDF requirements and what are CF requirements. (See here and here for background.)
2. We need to clarify what is meant by the word letter in section 2.3. (See here and here for background.)
3. There are requests to expand the set of allowed characters for variable names. (See here and the more extensive discussion in discuss/#256.)
4. To allow multilingual support via standard locale identifiers there is a specific request to expand the set of allowed characters for attributes. (See here and the full discussion in discuss/#244.)

Merging these into a general conversation about how CF should handle Unicode would, I am pretty sure, be a complex and perhaps protracted effort with many aspects and diverse views. We could treat 1 and 2 simply as defects and just update the conventions text. But in the light of 3 and 4 this is not constructive. Moving 3 forward would require a thorough discussion of how CF handles variable names and how they are used in various software. By comparison, moving 4 forward might be simpler as it involves a very specific request for additional characters in attribute names. As there is some urgency to implement multilingual support, I suggest that we deal with this separately from the more general question of opening up for a wider set of characters in attribute names. Hence, I suggest that we proceed as follows, with the aim of getting the outcome into version 1.11:

* First we deal with 4 (I can hardly see a more worthy use case than adding multilingual support to certain attributes). The suggested specific characters to be allowed in attribute names are . (period) and - (hyphen), see here, or alternatives discussed in discuss/#244.
* Then we update the text in the opening paragraph of chapter 2 and in section 2.3.

~~After that, or in parallel, we can continue the more general conversation on widening the character set allowed for variable names and what implications that may have.~~

larsbarring commented 1 year ago

Hm, having thought a bit more about all this, and in particular the implications of the diverse limitations imposed by different software and applications listed by @turnbullerin I retract much of my previous comment. Apologies for the confusion.

turnbullerin commented 1 year ago

@larsbarring honestly, for bullet 4, I would classify it as more of a want than a need. There are solutions that don't require anything but a-zA-Z_ in variable and attribute names.

In general though, I think we are seeing that there is a lot of complexity once you move away from just the NetCDF file itself and into various processing environments and conversions. That said, the domains where NetCDF is being used might be the best ones to manage that complexity rather than us trying to manage it for them. Just because they can do a thing doesn't mean they must do that thing.

Three approaches come to mind:

1. We reject this and limit everything to the current characters. We miss some use cases but we simplify interoperability.
2. We allow all sensible characters and start using them in CF standards (e.g. for multilingual cases). We add some use cases but break backwards compatibility with some tools that are relying on that standard. I would add a warning for that in the documentation that using the expanded character set may cause issues.
3. We allow all sensible characters but restrict CF standard attributes to the current characters. We add some use cases and backwards compatibility is broken but might be more easily fixed since (a) NetCDF allows those characters and (b) they are non-standard so they can be ignored.

2 and 3 also require a decision on "sensible characters".

In effect, this decision is about the proper representation of technical things in variable/attribute names vs. interoperability. For me, interoperability wins over trying to support every conceivable use case for naming. A variable name isn't the only way to document what is in a NetCDF variable (e.g. PM_2_5 can have PM_2_5:long_name = "PM 2.5";). A variable name is a convenient short-hand for accessing the data, not necessarily a fully complete description of its contents (though it is ideal if it is fairly descriptive). But I appreciate others might not agree and that's ok.

My personal thought is that approach 3 is the right one - expand the character set as far as reasonably possible, but with a big RECOMMENDED to use [A-Za-z0-9_] (not starting with underscore) for backwards compatibility and then not using other characters than those in standard attribute names to maximize interoperability with other standards (looking at ESRI and MATLAB as the two clearest examples that support these restrictions).

turnbullerin commented 1 year ago

I also checked Python out and accessing attributes that can't be represented with Python variable names is more complex:

import netCDF4 as nc

ds = nc.Dataset("./example.nc", "w")

ds.title = "Hello World"
print(ds.title)  # attributes that are variable names can be directly accessed and written

# ds.title-fr = "Bonjour le monde"  # this will error, the variable name is bad
setattr(ds, "title-fr", "Bonjour le monde")  # this works
print(getattr(ds, "title-fr"))  # printing also works

larsbarring commented 1 year ago

We now are approaching the deadline for changes to be included in CF-1.11 and it is unlikely that we will be able to resolve all aspects of this issue in time. But there are some ambiguities in the current text that should be clarified irrespective of if and how the set of allowed characters might be updated later on. I previously wrote:

To my understanding this issue was initiated because the current CF text is inconsistent in relation to the recent development of the netCDF library. Because NetCDF now accepts almost any Unicode character the conventions text would become clearer if the opening text of chapter 2 and the text in section 2.3 better distinguish between what is general netCDF requirements and what is CF requirements. (See here and here for background.)

We need to clarify what is meant by the word letter in section 2.3. (See here and here for background.)

I just saw that @Dave-Allured has made PR #465 that I believe essentially takes care of the first point. But I think that it easily could solve the second point as well by referring to the standard ASCII alphabet, or similar.

cf-convention / cf-conventions

Remove restrictions on netCDF object names #237