cf-convention / cf-conventions


Add support for attributes of type string #141

Open JimBiardCics opened 6 years ago

JimBiardCics commented 6 years ago

Attributes with a type of string are now possible with netCDF-4, and many examples of attributes with this type are "in the wild". As an example of how this is happening, IDL creates an attribute with this type if you select its version of string type instead of char type. It seems that people often assume that string is the correct type to use because they wish to store strings, not characters.

I propose to add verbiage to the Conventions to allow attributes that have a type of string. There are two ramifications to allowing attributes of this type, the second of which impacts string variables as well.

  1. A string attribute can contain 1D atomic string arrays. We need to decide whether or not we want to allow these or limit them (at least for now) to atomic string scalars. Attributes with arrays of strings could allow for cleaner delimiting of multiple parts than spaces or commas do now (e.g. flag_values and flag_meanings could both be arrays), but this would be a significant stretch for current software packages.
  2. A string attribute (and a string variable) can contain UTF-8 Unicode strings. UTF-8 uses variable-length characters, with the standard ASCII characters as the 1-byte subset. According to the Unicode standard, a UTF-8 string can be signaled by the presence of a special non-printing three byte sequence known as a Byte Order Mark (BOM) at the front of the string, although this is not required. IDL (again, for example) writes this BOM sequence at the beginning of every attribute or variable element of type string.
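As an aside, the three-byte BOM sequence is easy to see in plain Python (an illustration only, nothing netCDF-specific):

print("\ufeff".encode("utf-8"))        # b'\xef\xbb\xbf' -- the UTF-8 encoding of the BOM (U+FEFF)
print("\ufefftitle".encode("utf-8"))   # b'\xef\xbb\xbftitle' -- a BOM-prefixed attribute value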

Allowing attributes containing arrays of strings may open up useful future directions, but it will be more of a break from the past than attributes that have only single strings. Allowing attributes (and variables) to contain UTF-8 will free people to store non-English content, but it might pose headaches for software written in older languages such as C and FORTRAN.

To finalize the change to support string type attributes, we need to decide:

  1. Do we explicitly forbid string array attributes?
  2. Do we place any restrictions on the content of string attributes and (by extension) variables?

Now that I have the background out of the way, here's my proposal.

Allow string attributes. Specify that the attributes defined by the current CF Conventions must be scalar (contain only one string).

Allow UTF-8 in attribute and variable values. Specify that the current CF Conventions use only ASCII characters (which are a subset of UTF-8) for all terms defined within. That is, the controlled vocabulary of CF (standard names and extensions, cell_methods terms other than free-text elements of comments(?), area type names, time units, etc) is composed entirely of ASCII characters. Free-text elements (comments, long names, flag_meanings, etc) may use any UTF-8 character.
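As a concrete sketch of the proposal (this assumes the netCDF4-python package; the file name and variable name are hypothetical), controlled-vocabulary terms stay ASCII while free-text attributes may carry any UTF-8 character:

from netCDF4 import Dataset

with Dataset("example.nc", "w", format="NETCDF4") as ds:
    ds.createDimension("time", 1)
    tas = ds.createVariable("tas", "f4", ("time",))
    tas.standard_name = "air_temperature"               # controlled vocabulary: ASCII only
    tas.units = "K"                                      # controlled vocabulary: ASCII only
    tas.long_name = "Température de l'air en surface"    # free text: any UTF-8 character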

Trac ticket: #176 but that's just a placeholder - no discussion took place there (@JonathanGregory 26 July 2024)

ChrisBarker-NOAA commented 6 years ago

+1

Honestly, I only recently learned that attributes could have types other than a single piece of text (leaving char vs string out of it for now).

And there are current use cases of using delimited text to capture a similar concept.

So not allowing string arrays for now seems pretty darn straightforward.

CF2.0 can take advantage of more of these nifty features.

DocOtak commented 5 years ago

I've recently discovered that the netcdf4-python library will force attributes to use NC_STRING if passed a unicode array. See https://github.com/Unidata/netcdf4-python/pull/389

My own quick testing suggests this will be the case if (in Python 3) a string is passed with Unicode code points above 127.
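A minimal sketch of that behaviour (assuming the netCDF4-python package and a NETCDF4-format file; the file name and attribute names are arbitrary):

from netCDF4 import Dataset

with Dataset("strings.nc", "w", format="NETCDF4") as ds:
    ds.comment_ascii = "plain ASCII text"   # may still be written as NC_CHAR
    ds.comment_utf8 = "10° grid"            # code points above 127: written as NC_STRING, per the report above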

martinjuckes commented 5 years ago

I've just scanned through this following an enquiry about some data which has been sent to us and which breaks the cf-checker with string-valued attributes.

I agree with @JimBiardCics that string-valued attributes should be allowed, and also with the conclusion that string-valued arrays should not be allowed in general.

On the other hand, I think we should be careful about penalising an approach which may be beyond the control of many data providers: many people will be passing strings to software which puts them in the file. The HDF library likewise assigns a string value to an attribute through a command such as f.attrs['units'] = 'days since 1900' (Unidata/netcdf4-python#448) -- do we really want to be giving advice which conflicts with default behaviour in major libraries?

On the UNIDATA pages I can't find out what a CDL statement of the form Conventions = "CF-1.7"; is meant to mean; I believe the ncgen utility interprets it as a string rather than a char array. In NcML the default type is certainly string rather than char array.

If people really want compatibility with software which has been designed around NetCDF3 they are going to have to use the NetCDF4-classic model. Rather than giving specific instructions, can we advise people to consider the interests of users and their software libraries before moving to full NetCDF4? I feel that we will get tied up in knots if we tie CF to a selection of NetCDF3 features.

@davidhassell : As far as the Convention is concerned, this issue should be labelled a DEFECT. The current text states that "NetCDF does not support a character string type, so these must be represented as character arrays", and this is clearly untrue. The ubyte data type should also be supported.

JonathanGregory commented 5 years ago

Dear Martin

I too still think that (a) for an attribute, a string should be allowed instead of, and regarded as equivalent to, a 1D char array, (b) this equivalence should be stated somewhere near the start of the convention, (c) we should not allow arrays of strings (for now - they might be allowed in future).

When you say, "We should be careful penalising an approach which may be beyond the control of many data providers", do you mean we shouldn't recommend one or the other? Earlier I had suggested making a recommendation, but I agree that it isn't necessary. However it would be useful to note (for the info of data-writers) that some users might not be able to interpret strings because the string data type didn't exist before NetCDF4.

As regards the label for this discussion, I think this should be an enhancement, not a defect. I agree that the existing convention text is wrong (because it's out of date, not because it was originally mistaken). However, allowing strings in CF is an enhancement. Moreover, it's an issue of sufficient seriousness that we are not willing to agree it by default, as the length of this discussion shows. The rule for defects is that they're accepted if no-one objects. Well, I object to this being accepted by default. :-) However, I support it as an enhancement.

I think that character encoding, if we need to do something about it, should be treated as a different issue.

Best wishes

Jonathan

martinjuckes commented 5 years ago

Dear Jonathan,

My comment about penalising one approach was intended to be about cf-checker warnings: I don't think we should be issuing warnings for encoding choices which may be out of the users' control. According to your comment above, this does indeed mean that we should not recommend one approach.

I agree that we should make a statement about potential difficulties caused by string attributes, but I was suggesting that this should be placed in the context of a general statement about using the NetCDF classic model (which may be a way in which users can easily enforce character arrays for attributes) to ensure that data can be read with legacy code. Some of our community are putting a lot of effort into enabling the use of NetCDF4 groups ... we need to be consistent about the advice we give. As far as I can tell, it would not make sense to recommend using character arrays for attributes if the file contains group structures which require NetCDF4 aware software.

I hadn't realised that the defect label could result in curtailed discussion .. to my mind this is a counter-intuitive outcome from our procedural rules, but, as it stands, I agree this should be labelled as an enhancement.

regards, Martin

martinjuckes commented 5 years ago

Dear Jonathan,

with reference to your comment above about recommendations in the Convention being linked to warnings issued by the CF Checker: since my last post I've noticed this does not appear to apply to the phrase "We recommend that whenever possible ..." which is used in connection with the specification of bounds for spatial coordinate variables. Do you think this is intended (i.e. the "recommended whenever possible" is a somewhat weaker statement than "recommended", which would make sense)?

There is also at least one recommendation which clearly cannot be checked by the cf-checker (referring to use of meaningful names). It might be better to phrase such guidance as best practice notes rather than recommendations, and perhaps a similar approach could be adopted for character arrays vs. strings.

regards, Martin

JonathanGregory commented 5 years ago

Dear Martin

The CF-checker issues warnings for those recommendations which are in the conformance document. As you imply, not all recommendations can be checked, and those which can't aren't in the conformance document. I don't think it was intended that "where possible" should indicate a weaker recommendation.

I agree that it would be better not to say "recommend" for something that could be checked, but which isn't in the conformance document because we don't want to be warned about it. In that case we clearly don't feel strongly about it. The choice between strings and char arrays could be in that category, I agree.

Best wishes

Jonathan

martinjuckes commented 5 years ago

I've added a related issue (#174) on string-valued dimensions, which are a new feature in NetCDF4. I think this can be handled separately, but it shares a common starting point in that it arises from the greater flexibility introduced in NetCDF4 which the Convention text does not reflect.

JimBiardCics commented 5 years ago

@martinjuckes That is covered in issue #139. It is specifically for string variables.

marqh commented 5 years ago

given my offer to moderate #139 and the connectedness of these two issues, I'm happy to offer to moderate this issue as well. Is this helpful?

@JimBiardCics Am I correct that this proposal does not yet have an associated Pull Request?

dblodgett-usgs commented 5 years ago

Thanks for the offer, @marqh -- I've assigned you to the issue.

JimBiardCics commented 5 years ago

@marqh That is correct. I have verbiage, but I haven't made a pull request yet. I was figuring to do it after the 'string variable' PR was done, as there is overlap between the two. I wanted to avoid the awkward merge that could result.

kenkehoe commented 5 years ago

Does this mean we will have no more discussion about using string array attributes?

JimBiardCics commented 5 years ago

@kenkehoe I don't think so. It is a purely logistical issue. There is so much overlap between the change sections that it seemed silly to make two independent branches of change from the original that would then require a super-awkward merge.

DocOtak commented 4 years ago

So now that string variables have landed, I want to bring some attention to this issue again. Some updates I've learned about:

Dave-Allured commented 4 years ago

@DocOtak, thanks for restarting this. In light of past difficulties, I move to split the issue.

I think it would be possible to break out the more difficult parts of this topic into new and separate issues. I suggest that this issue #141 be narrowed to only a single essential ingredient: scalar string-type attributes as an alternative for traditional character-type attributes.

Can we agree to move the following to new Github issues, and focus for now only on legalizing scalar string-type attributes?

DocOtak commented 4 years ago

@Dave-Allured That sounds OK to me. Whatever does get adopted should probably be pretty explicit about what is "not allowed".

JimBiardCics commented 4 years ago

@Dave-Allured I approve of your proposal. I think we pretty much have no choice but to allow UTF-8 as a baseline to start with, but there clearly are larger issues to be resolved. (I say "no choice" because, for example, constraining to ASCII in python 3 is a bit complicated.)

cf-metadata-list commented 4 years ago

On Fri, Mar 13, 2020 at 2:14 PM JimBiardCics notifications@github.com wrote:

@Dave-Allured https://github.com/Dave-Allured I approve of your proposal. I think we pretty much have no choice but to allow UTF-8 as a baseline to start with, but there clearly are larger issues to be resolved. (I say "no choice" because, for example, constraining to ASCII in python 3 is a bit complicated.)

Not really. Python does not use utf-8 internally, so you have to encode and decode when reading/writing a file anyway. Setting that encoding to ASCII is not easier or harder than setting it to utf-8.

But while we may have a choice -- it's not a good choice. We've all needed non-ascii characters for a LONG time. And Unicode and utf-8 are well established. And utf-8 is very compatible with old text processing software, so really, we should just do it.

The primary complication is that it's not always obvious how many bytes you need to store a given string, but that's more a problem with writing than reading, so we can hope that the software that writes non-ascii data is smart enough to do it right.

-CHB


Dave-Allured commented 4 years ago

Hmmm. Chris, I think you are implying a problem that does not exist. I do not think CF has ever restricted the use of UTF-8 in free text within attributes. I suspect there are many UTF-8 attribute examples in the wild, though I do not have one up my sleeve right now. Please correct me if I'm wrong.

JimBiardCics commented 4 years ago

Chris,

Python 3 is not the same as python 2. In Python 2 there were two types — str (ASCII) and unicode (by default UTF-8). In Python 3 there is only str, and by default it holds UTF-8 unicode (there's lots of subtlety that I'm glossing over here, but this is what it boils down to). It bit me recently, so I'm sensitive to it.

https://lucumr.pocoo.org/2014/1/5/unicode-in-2-and-3/ https://docs.python.org/3/howto/unicode.html

ChrisBarker-NOAA commented 4 years ago

I'm getting double messages -- I think we may have a feedback loop between GitHub and the list .....

But anyway:

Hmmm. Chris, I think you are implying a problem that does not exist.

I hope that's true. Sorry if I stirred up confusion.

But I was responding to a comment about ASCII vs UTF-8, so ....

I also picked this up in email, so was unsure of the context. I've now gone and re-read the issue, and I'm a bit confused about what's still on the table.

But way back, someone wrote: " two issues: the use of strings, and the encoding. These can be decided separately, can't they?"

and there was another one: arrays of strings vs whitespace separated strings.

(I'm also not completely clear about the difference between a char* and a string anyway. Either way, it's a bunch of bytes that need to be interpreted)

So I'll just talk about encoding here. A few points:

(I know you all know most of this, and most of it has been stated in this thread, but to put it all in one place...)

So that leaves one open question: what encoding(s) are allowed for a CF compliant file?

I'm going to be direct here:

THERE IS NO REASON TO ALLOW MORE THAN ONE ENCODING

It only leads to pain. Period. End of story. If there is one allowed encoding, then all CF compliant software will have to be able to encode/decode that encoding. But ONLY that one! If we allow multiple encodings, then to be fully compliant, all software would have to encode/decode a wide range of encodings, and there would have to be a way to specify the encoding. So all software would have to be more complex, and there would be a lot more room for error.
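A small plain-Python illustration of that risk: the same bytes read back under the wrong encoding silently give different text.

raw = "résumé".encode("utf-8")
print(raw.decode("utf-8"))     # résumé
print(raw.decode("latin-1"))   # rÃ©sumÃ© -- no error is raised, the text is just wrong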

If there is only one encoding allowed, then there are really only two options:

UCS-4: because it handles all of Unicode and is always the same number of bytes per code point. A lot more like the old char* days. However, no one wants to waste all that disk space, so that leaves:

UTF-8: which is ASCII compatible, handles all of Unicode, and has been almost universally adopted in most internet exchange formats (those that are sane enough to specify a single encoding :-) )

It is also friendly to older software that uses null-terminated char* and the like, so even old code will probably not break, even if it does misinterpret the non-ascii bytes. And old software that writes plain ascii will also work fine, as ascii is a subset of utf-8.

All that's a long way of saying:

CF should specify UTF-8 as the only correct encoding for all text: char or string. With possibly some extra restrictions to ASCII in some contexts.

If that had already been decided, then sorry for the noise :-)

ChrisBarker-NOAA commented 4 years ago

@JimBiardCics wrote:

Actually, I know a LOT more about Python than I do about netcdf, HDF, or CF. And I'm afraid you have it a bit confused. This is kind of off-topic, but for clarity's sake:

Python 3 is not the same as python 2.

Very True, and a source of much confusion.

In Python 2 there were two types — str (ASCII) and unicode (by default UTF-8).

Almost right: there were two types:

str: which was a single byte per character of unknown encoding -- essentially a wrapped char* -- usually ascii compatible, often latin-1, but not if you were Japanese, for instance.... It was also used as a holder of arbitrary binary data: see numpy's "fromstring()" methods, or reading a binary file. Much like how char* is used in C.

unicode: which was unicode text -- stored internally in UCS-2 or UCS-4 depending on how Python was compiled (I know, really?!?!) It could be encoded / decoded in various encodings for IO and interaction with other systems.

In Python 3 there is only str, and by default it holds UTF-8 unicode

Almost right: the Py3 str type is indeed Unicode, but it holds a sequence of Unicode code points, which are internally stored in a dynamic encoding depending on the content of the string (really! a very cool optimization, actually, if you have only ascii text, it will use only one byte per char https://rushter.com/blog/python-strings-and-memory/ ). But all that is hidden from the user. To the user, a str is a sequence of characters from the entire Unicode set, very simply.

(Unicode is particularly weird in that one "code point" is not always one character, or "grapheme" to accommodate languages with more complex systems of combining characters, etc, but I digress..)

And there are still two types -- in Python3 there is the "bytes" type, which is actually very similar to the old python2 string type -- but intended to hold arbitrary binary data, rather than text. But text is binary data, so it can still hold that. In fact, if you encode a string, you get a bytes object:

In [13]: s                                                                      
Out[13]: 'some text'

In [14]: b = s.encode("ascii")                                                  

In [15]: b                                                                      
Out[15]: b'some text'

Note the little 'b' before the quote. In that case, they look almost identical, as I encoded in ASCII. But what if I had some non-ASCII text?:

In [18]: s = "temp = 10\u00B0"                                                  

In [19]: s                                                                      
Out[19]: 'temp = 10°'

In [20]: b = s.encode("ascii")                                                  
---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)
<ipython-input-20-3930abba6989> in <module>
----> 1 b = s.encode("ascii")

UnicodeEncodeError: 'ascii' codec can't encode character '\xb0' in position 9: ordinal not in range(128)

oops, can't do that -- the degree symbol is not part of ASCII. But I can do utf-8:

In [21]: b = s.encode("utf-8")                                                  

In [22]: b                                                                      
Out[22]: b'temp = 10\xc2\xb0'

which now displays the byte values, escaping the non-ascii ones. So that bytes object is what would get written to a netcdf file, or any other binary file.

And Python can just as easily encode that text in any supported encoding, of which there are many:

In [28]: s.encode("utf-16")                                                     
Out[28]: b'\xff\xfet\x00e\x00m\x00p\x00 \x00=\x00 \x001\x000\x00\xb0\x00'

But please don't use that one!

So anyway, the relevant point here is that there is NOTHING special about utf-8 as far as Python is concerned. And in fact, Python is well suited to handle pretty much any encoding folks choose to use -- but it doesn't help a bit with the fundamental problem that you need to know what encoding your data is in, in order to use it. And if Python software (like any other) is going to write a netcdf file with non-ascii text in it, it needs to know what encoding to use.

The other complication that has come up here is that, IIUC, the netCDF4 Python library (a wrapper around the C libnetcdf) I think makes no distinction between the netcdf types CHAR and STRING (don't quote me on that), but that's a decision of the library authors, not a limitation of Python.

Actually, it does seem to give the user some control:

https://unidata.github.io/netcdf4-python/netCDF4/index.html#netCDF4.chartostring

Note that utf-8 is the default, but you can do whatever you want.

In any case, the Python libraries can be made to work with anything reasonable CF decides, even if I have to write the PRs myself :-)

Sorry to be so long winded, but this IS confusing stuff!

ChrisBarker-NOAA commented 4 years ago

one small additional note about Python and Unicode:

The post Jim pointed us to:

https://lucumr.pocoo.org/2014/1/5/unicode-in-2-and-3/

is now six years old -- and many of the issues brought up have been addressed.

And the author of that post has another post on the dangers of referring back to such older opinions:

https://lucumr.pocoo.org/2016/11/5/be-careful-about-what-you-dislike/

Another issue with that discussion is that it's written from the perspective of what some folks in the community are calling "byte slingers": those that write libraries and the like that deal with binary data and protocols. And the fact is that Python3's string model is NOT as well suited to those use cases. But it is massively better suited to most more "casual" use cases. In that post, he refers to "beginners", but it's not beginners, it's anyone that does not understand the subtleties of binary data, encodings, and the like. Which is most of us "scientific programmers".

Bringing this back to CF: For CF, ideally we would choose an approach that is well suited to the "Normal scientific programmer", and leave the encoding/decoding to the libraries. And have confidence that the "byte slingers" will correctly write the libraries to match the standard, and make things "just work" for most users.

JimBiardCics commented 4 years ago

@ChrisBarker-NOAA My original observation was that we can absolutely split off some of these issues. I see two issues being peeled off from the base issue.

I think you've made a strong case for starting out by specifying ASCII and Unicode / UTF-8 as the only valid contents for string attributes, with one of the two spinoff issues addressing the question of broadening the options.

ChrisBarker-NOAA commented 4 years ago

My original observation was that we can absolutely split off some of these issues. Agreed.

Have these been started? I can't find them if they have.

There is also the question of what to do with CHAR types -- the same as STRING?

And what about encoding of CHAR and STRING variables? I can't find anything about that in the current CF document, so it doesn't seem to be settled.

Maybe this should go in a new issue, but for now, I had a (not well formed) thought:

CHAR variables and attributes should only be encoded in a 1-byte per character ascii compatible encoding: e.g. ascii, latin-1

STRING variables and attributes should only be encoded in utf-8 (of which ascii is a subset)

My justification is that there will be little software in the wild that supports Unicode, but does not support String. Setting this standard will make it less likely that older software that assumes a 1byte per character text representation will get handed something it can't deal with. And the string type is better suited to Unicode anyway, as the "length" of a string is less well defined.

JimBiardCics commented 4 years ago

@ChrisBarker-NOAA

Assuming we do spin off sub-issues related to encoding and string array attributes, I agree fully that we should, in this specific issue, propose making changes to the CF document to make it clear that CHAR attributes must be ASCII or latin-1 and STRING attributes should be unicode/utf-8.
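A rough sketch of what such a check could look like (plain Python; the helper name and the 'CHAR'/'STRING' labels are purely illustrative, not part of any CF tooling):

def check_attribute_bytes(raw, nc_type):
    # raw: the attribute's bytes as read from the file; nc_type: 'CHAR' or 'STRING'
    if nc_type == "CHAR":
        raw.decode("ascii")    # or "latin-1", per the proposal; raises UnicodeDecodeError if invalid
    else:
        raw.decode("utf-8")    # raises UnicodeDecodeError if not valid UTF-8

check_attribute_bytes(b"degrees_north", "CHAR")
check_attribute_bytes("10° grid".encode("utf-8"), "STRING")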

ChrisBarker-NOAA commented 4 years ago

we should, in this specific issue, propose making changes to the CF document to make it clear that CHAR attributes must be ASCII or latin-1 and STRING attributes should be unicode/utf-8

+1 on that.

DocOtak commented 4 years ago

Re CHAR vs STRING, the netcdf C API method calls one text and the other string. Do we want to use that language at all in whatever text is developed?

Be aware that the netcdf python library will force the use of strings for netcdf4 files if it sees unicode points outside of ASCII.

Also be aware that LATIN-1 is not compatible with UTF-8 for code points above 127. The ISO working group maintaining these "legacy" standards (ISO-8859-n, where n=1 is LATIN-1) doesn't even exist anymore...

ChrisBarker-NOAA commented 4 years ago

Also be aware that LATIN-1 is not compatible with UTF-8 with code points above 127

Indeed. Which is why it should be clear that you should NOT put utf-8 in a CHAR array :-) We could say ASCII only for CHAR, but I'm not sure there is a good reason to be that restrictive.

It may be an implementation detail of the Python encodings, but at least there, latin-1 can decode ANY string of bytes (other than the null byte) without error, and write it out again with no changes. So if consuming code uses the latin-1 encoding for all CHAR arrays, it may get garbage for the non-ascii bytes, but it won't raise an error, or mangle the data if it is written back out.
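A quick plain-Python check of that round-trip behaviour:

data = bytes(range(1, 256))              # every byte value except null
text = data.decode("latin-1")            # never raises
assert text.encode("latin-1") == data    # round-trips byte-for-byte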

the netcdf python library will force the use of strings for netcdf4 files if it sees unicode points outside of ASCII.

which is the right thing to do, and compatible with this proposal, I think. (hmm, unless latin-1 is allowed). But you could probably send a latin-1 encoded bytes object in yes?

Anyway, if we codify this, and the netCDF4 lib (or any other) can't support it, it can be fixed. And yes, I am volunteering to do a PR for a fix to netCDF4-python.

DocOtak commented 4 years ago

Additionally, the netcdf standard itself has support for UTF-8 variable names, requires them to be NFC, and specifically excludes bytes 0x00 to 0x1F and 0x7F to 0xFF (see the "name" part of that document).

I think this matters because at least one of the standard attributes needs to be able to refer to variable names. Basically, allowing anything other than UTF-8, especially things that allow bytes 0x7F to 0xFF (like the ISO-8859 series encodings do), would probably cause actual problems.

ChrisBarker-NOAA commented 4 years ago

Thanks! Yup -- then attributes really do need to be UTF-8 and the STRING type (for text) only.

I suppose they don't ALL HAVE to be the STRING type, but the ones that might contain variable names should be.

after all, any software that doesn't support the STRING type probably doesn't support Unicode variable names, either ...

JimBiardCics commented 4 years ago

@DocOtak I couldn't find the direct restriction on the 0x80 to 0xFF characters. Is this a side effect of utf-8 using the high bit to signal multibyte characters? Or is it a more general prohibition against using the characters in latin-1 that fall in that range?

DocOtak commented 4 years ago

@JimBiardCics It's the "not match" group in that regex that is doing it ([^\x00-\x1F/\x7F-\xFF]|{MUTF8}), at least, I'm pretty sure that is what is going on. I rarely use regex myself, so I could be wrong, but I'm quite sure that the ^ is "not match".

JimBiardCics commented 4 years ago

I missed the regex. Yep, that's what it says. 0x7F is the "del" char, so it's non-printing. I think the characters from 0xC0 - 0xFF are out because they would all be interpreted in UTF-8 as signaling the start of a multi-byte character. 0x80 - 0xBF can all be interpreted as trailing elements of a multibyte character, so I guess it's a bad plan to have one lying around loose. This Wikipedia article was informative.

ChrisBarker-NOAA commented 4 years ago

remember that utf-8 is ascii compatible for the first 128 code points (7 bits). So:

0x00 to 0x1F are the control codes from ASCII

0x7f is the DEL (not sure why that wasn't in the first set..., but there you go).

and 0x80 to 0xFF is the rest of the non-ascii bytes (128-255), which you have to be able to use in order to do utf-8. But frankly, I'm not sure what a regex means with regard to bytes. But if I had to guess, I'd pull it apart this way (which is almost what's in the footnote):

first: MUTF8 means "multibyte UTF-8 encoded, NFC-normalized Unicode character". However, Unicode doesn't quite use "characters", but rather "code points", so that means:

Which means any Unicode code point >= 128 (0x80).

([a-zA-Z0-9_]|{MUTF8})([^\x00-\x1F/\x7F-\xFF]|{MUTF8})*

The first character has to be: ([a-zA-Z0-9_]|{MUTF8}): ASCII letter, number or underscore OR any other code point over 128

All the other characters have to be: Any code point other than: \x00-\x1F and \x7F-\xFF OR any code point above 128.

Which is an odd way to define it, as the codepoints \x7F-\xFF are valid Unicode, so you're kind of excluding them, and then allowing them again .... strange.

I suspect that this started with the original pre-Unicode definition, and they added the UTF8 part, and got an odd mixture. In particular, there is really no reason to treat the single byte or multibyte UTF codepoints separately, that's just odd.

I think I'd write this as:

Names are UTF-8 encoded. The first character can be any of these codepoints:

x30 - x39 (digits: 0-9)
x41 - x5a (upper case letters: A-Z)
x61 - x7a (lower case letters: a-z)
x5f (underscore)
>= x80
The rest can be any code point other than:
\x00-\x1F or \x7F

However, there is a key missing piece: a number of Unicode code points are used for control characters and whitespace, and probably other things unsuitable for names. Which may be why they used the term "character". But it would be better if they had clearly defined what's allowed and what's not. For instance, Python3 uses these categories (https://docs.python.org/3/reference/lexical_analysis.html#identifiers):
Lu - uppercase letters
Ll - lowercase letters
Lt - titlecase letters
Lm - modifier letters
Lo - other letters
Nl - letter numbers

I have no idea if those are defined by the Unicode consortium anywhere. But it would be good for netcdf (and or CF) to define it for themselves.
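A small sketch of checking characters against those categories, using Python's standard unicodedata module (illustration only):

import unicodedata

LETTER_LIKE = {"Lu", "Ll", "Lt", "Lm", "Lo", "Nl"}

def is_letter_like(ch):
    return unicodedata.category(ch) in LETTER_LIKE

print(is_letter_like("π"))    # True  ('Ll': lowercase letter)
print(is_letter_like("💩"))   # False ('So': other symbol)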

I will say that it's kind of nifty to be able to do (in Python):

In [17]: π = math.pi                                                            
In [18]: area = π * r**2

But I'm not sure I need to be able to assign a variable to 💩 -- which Python will not allow, but does the netcdf spec allow it?

zklaus commented 4 years ago

I think there is some confusion here.

First, this whole regex stuff is only about the physical byte layout of the netcdf classic file format. I would in principle suggest focusing completely on netcdf4 files instead.

Second, I think CF should not concern itself with encodings and byte order stuff at all. Leave that to netcdf4/hdf5 and just work at the character level. And yes, unicode has code points, but also a concept of characters (see here).

Third, looking at the regex in question

([a-zA-Z0-9_]|{MUTF8})([^\x00-\x1F/\x7F-\xFF]|{MUTF8})*

notice that it is only an explanatory comment, but apart from that the overwhelmingly likely way to parse this, thanks to the "|" alternatives, is as either

([a-zA-Z0-9_])([^\x00-\x1F/\x7F-\xFF])*

ie an ascii string starting with a letter, digit, or underscore, limited to the first 128 bytes without control characters and excluding "/" everywhere, or

({MUTF8})({MUTF8})*

ie any unicode string encoded as normalized UTF-8.

zklaus commented 4 years ago

@ChrisBarker-NOAA wrote:

I have no idea if those are defined by the Unicode consortium anywhere.

They do indeed. See here.

JimBiardCics commented 4 years ago

@DocOtak @zklaus Sorry if I've pulled the discussion off track. The question of exactly why NUG worded things the way they did is intriguing, but I think Klaus is right that we shouldn't get wrapped around that particular axle in this issue — particularly if we are going to split encoding off into a different issue. I think the take-away is that our baseline is "sane utf-8 unicode" for attributes of type NC_STRING and ASCII for attributes of type NC_CHAR (those created with the C function nc_put_att_text.)

zklaus commented 4 years ago

I agree and would go one small step further: UTF-8 is only an encoding, so we should just say "unicode" for strings. If we need to restrict that, say to disallow underscore in the beginning or to save a separation character like space in attributes right now, we should do so at the character level, possibly using categories as introduced by @ChrisBarker-NOAA above.

ChrisBarker-NOAA commented 4 years ago

UTF-8 is only an encoding, so we should just say "unicode" for strings.

We could do that if and only if netcdf itself was clear about how Unicode is encoded in files. Which it is for variable names, though not so sure it is anywhere else.

But even so, once the encoding has been specified, then yes, talking about Unicode makes sense.

Agreed, it's not for this discussion, but:

MUTF8 is not quite (in that doc) "any unicode string encoded as normalized UTF-8", because I think they are specifically trying to exclude the ASCII subset, so they can handle that separately. I.e. characters that are excluded, like "/", are indeed unicode strings.

But it's a pretty contorted way to describe it -- but that's netcdf's problem :-)

zklaus commented 4 years ago

Ah yes, I see what you mean, you are right: Always speaking about UTF-8, multi-byte here isn't referring to the possibility of having several bytes encode one code point, but to actual code points with more than one byte, thus excluding the one-byte code points which are exactly the first 128 ASCII characters. Then they allow back in specific ASCII characters.

JonathanGregory commented 2 months ago

Dear all

The issue was opened in 2018 and has seen a long discussion, but no further contributions since 2020. It has been partly superseded, in that CF now permits string-valued attributes to be either a scalar string or a 1D character array (see Sect 2.2). Apart from that, it seems to me that the discussion was mostly concerned with three subjects:

  1. Should CF allow arrays of strings in attributes? We are currently discussing that question in https://github.com/orgs/cf-convention/discussions/341, which refers back to this issue. Therefore I propose we don't discuss this any further here.

  2. What encoding should be used in string attributes? The consensus was that it should always be Unicode. One reason for this is that netCDF variable names are in Unicode, and many CF attributes contain the names of netCDF variables. CF recommends that only letters, digits and underscores should be used for variable names, but does not prohibit other Unicode characters. Should we insert a statement in the CF convention about strings being Unicode?

  3. Whether to restrict the characters allowed in string-valued attributes. The majority of CF attributes contain the names of netCDF variables and strings which come from a CF controlled vocabulary or a list in an Appendix. The set of characters that can be used in those attributes is thus dictated already by the convention. This question therefore applies only to the attributes that CF defines but whose contents it does not standardise, namely comment, history, institution, references, source, title and long_name. Does anyone wish to pursue this third question? For instance, @ChrisBarker-NOAA, @zklaus and @DocOtak all contributed in 2020.

I propose that this issue should be closed as dormant if no-one resumes discussion on Q2 or Q3 within the next three weeks, before 14th September.

Cheers

Jonathan

ChrisBarker-NOAA commented 2 months ago

Thanks for trying to close this out :-)

Should we insert a statement in the CF convention about strings being Unicode?

I just looked, and all I see is this under naming:

"...is more restrictive than the netCDF interface which allows almost all Unicode characters encoded as multibyte UTF-8 characters """

So yes, I think it's good to be clear there -- maybe it's well defined by netcdf, but it doesn't hurt to be explicit, if repetitive.

Am I correct to say that all strings in netCDF are Unicode, encoded as UTF-8?

Whether that's true or not for netcdf -- I think it should be true for CF, and we should say that explicitly in any case.

... This question therefore applies only to the attributes that CF defines but whose contents it does not standardise,

I would say that we should not restrict these otherwise not-restricted attributes.

I'm not sure if that's pursuing it or not pursuing it -- I presume the default is no restrictions?

ChrisBarker-NOAA commented 2 months ago

Hmm -- not sure where this fits, but it's related:

IIUC, CF now allows either the new vlen strings, or the "traditional" char arrays.

The trick is that UTF-8 is not a one-char-per-codepoint encoding.

Could we say that you can only use Unicode (UTF-8) with vlen strings, and char arrays can only hold ASCII? Or is the cat way too far out of the bag for that?

Probably - could we at least encourage vlen strings for non-ascii text?

JonathanGregory commented 2 months ago

@ChrisBarker-NOAA

Am I correct to say that all strings in netCDF are Unicode, encoded as UTF-8 ?

I don't know either.

Whether that's true or not for netcdf -- I think it should be true for CF, and we should say that explicitly in any case.

I think so as well. That would go sensibly in Sect 2.2 "Data types".

We've already said in 2.2 that scalar vlen strings and 1D char arrays are both allowed and are equivalent in variables. We did not say so for attributes, but I expect everyone would assume that the same applies, in which case we should make it explicit. I don't think there's a problem with storing multi-byte character codes in a char array, is there? It would be clearest if we said that a 1D char array should always be interpreted as a Unicode string. An ASCII string is a special case of that, so it's backwards-compatible.

Cheers

Jonathan

JonathanGregory commented 1 month ago

No-one said they wanted to resume Q1 or Q3 within three weeks, but @ChrisBarker-NOAA and I agreed that it would be useful to clarify that strings stored in variables or attributes should be Unicode characters (Q2). To do that, I propose that we replace the first 1.5 sentences of the second para of sect 2.2 "Data Types", which currently reads

Strings in variables may be represented one of two ways - as atomic strings or as character arrays. An n-dimensional array of strings may be implemented as a variable of type string with n dimensions, or as a variable of type char with n+1 dimensions, where the most rapidly varying dimension ...

with

A text string in a variable or an attribute may be represented either in Unicode characters stored in a string or encoded as UTF-8 and stored in a char array. Since ASCII 7-bit character codes are a subset of UTF-8, a char array of m ASCII characters is equivalent to a string of m ASCII characters. Unicode characters which are not in the ASCII character set require more than one byte each to encode in UTF-8. Hence a string of length m generally requires a UTF-8 char array of size >m to represent it.

An n-dimensional array of strings may be implemented as a variable or attribute of type string with n dimensions (where n<2 for an attribute) or as a variable (but not an attribute) of type char with n+1 dimensions, where the most rapidly varying dimension ...
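To illustrate the ">m" point in plain Python (an aside, not part of the proposed convention text):

s = "Ångström"                 # a string of m = 8 characters
b = s.encode("utf-8")          # the equivalent char-array content
print(len(s), len(b))          # 8 10 -- two of the characters need two bytes each in UTF-8
print(b.decode("utf-8") == s)  # True: decoding recovers the original string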

Also, I suggest inserting the clarification "which has variable length", in this sentence in the first paragraph:

The string type, which has variable length, is only available in files using the netCDF version 4 (netCDF-4) format.

Does that look all right, @ChrisBarker-NOAA, @zklaus, anyone else? I believe this is no change to the convention, just clarification of the text, so I'm going to relabel this issue as a defect. Please speak up if you disagree. Thanks.

ChrisBarker-NOAA commented 1 month ago

As PR #543 attempts to clarify a bit about Unicode, I thought I'd post here. I started commenting on the PR, but realized I had way too much to say for a PR, so I'm putting it here.

NOTE: maybe this should be a different issue -- specifically about Unicode in CF -- but I'm putting it here for now -- we can copy to a new issue if need be.

First, some definitions/descriptions about Unicode and strings.

1) There is no such thing as a Unicode "character". Unicode defines "code points", and each code point is assigned a value. However: "Code points are the numbers assigned by the Unicode Consortium to every character in every writing system." -- so interchanging "code point" and "character" is probably OK and will lead to little confusion. (One difference is how Unicode handles accented characters and the like, so it's not quite one-to-one code point to character.)

2) There is no such thing as a Unicode String (except where defined by a programming language, e.g. Python). When stored in memory or in a file, strings, Unicode or not, are stored as bytes, and the relationship between the bytes and the code points is defined by an encoding. Without an encoding, there is no clear way to define what a bunch of bytes means, or in reverse, how to store a particular set of code points.

UTF-8 is the most common Unicode encoding for storage of text in files, or passing over the internet (via https, or ...). UTF-16 is used internally by Windows and Java (I think).

Anyway -- unless one wants to use UCS-4 (which is what the numpy Unicode type uses, and which most folks don't want to use for file storage -- it's pretty wasteful of space for virtually all text), a variable-length encoding is required. And a char array is not ideal for variable-length encodings -- because a char array requires a fixed size, and you don't know what size is needed until you encode the text. So a variable length string array is the "right" choice for Unicode (non-ASCII) text.

Char arrays and strings in netcdf.

So this brings us to the topic at hand -- in netcdf3 the only way to store text was in arrays of type char. This maps directly to the char* used to store text in C. So a pretty direct mapping to C (and other languages).

With netcdf4, a string type was introduced: Strings are variable length arrays of chars, while char arrays are fixed length.

So: as far as the netcdf spec is concerned, the only difference between a char array and a string is that the length of char array is fixed. Once you read it -- you have a char*.

That's all I could find in the netCDF docs. Nothing about Unicode or encodings, or ... Which means that as far as the netcdf spec is concerned, you can put anything in either data type.

Note that a char* in C, while used for text (hence the name) is really a generic array of bytes -- it can be used to store any old collection of data.

So enter Unicode: as above, to store a "Unicode string", i.e. a collection of code points, requires that the string be encoded, resulting in a set of bytes that can be stored in, you guessed it, a char array. (On Windows, the standard encoding is UTF-16, so a wchar ("wide char") is used; but a wchar can be cast to a char -- it's still an array of bytes (unsigned eight-bit ints).)

So as far as netcdf is concerned, you can stuff Unicode text into either a char array or a string in netcdf.

Note that I did find this discussion: https://github.com/Unidata/netcdf-c/issues/402 from May-June 2017 and not closed yet. From the netCDF docs, I don't think it was ever resolved. But it does contain a proposal for using an _Encoding attribute, and it may be kinda-sorta adopted by the netCDF4 Python lib (it does respect the _Encoding attribute of char arrays), but I can't find documentation for how it handles the netcdf string type. And it looks like utf-8 is the default:

def chartostring(b, encoding='utf-8')

def stringtochar(a, encoding='utf-8')

I also don't know what it does for attributes, because they can't have another attribute to store the _Encoding. So ?? However, it does seem to "just work" -- at least if you write the file with Python -- e.g. you can ncdump it and it will correctly show a non-ascii character (on my terminal, which may be utf-8?).

Anyway -- as this doesn't seem to be defined by any published spec, I hope we can define it for CF. My proposal:

In pretty much any context:

That's it -- pretty simple, really :-)

Points to consider:

  1. Should we restrict char arrays to ascii, or latin-1? (Or allow other 1-byte encodings with an _Encoding attribute?)
  2. Should we allow the _Encoding attribute, or just say "thou shalt use only UTF-8"?

My thought -- as much as I'd love to be fully restrictive to make things simpler for everyone, the cat's probably out of the bag. So we may have to impose as few restrictions as possible (e.g. allow _Encoding), but recommend either ASCII or UTF-8.

So -- enough words for you?

-- back in the day, a char* would be an ASCII or ANSI encoded string (null terminated), and all was good and simple.

JonathanGregory commented 1 month ago

Dear Chris

Thanks for the research and your useful exposition of the complexity of the issue. I was hoping that we could add a couple of sentences on this subject, rather than a new chapter. :-)

NetCDF allows Unicode characters in names (of dimensions, variables and attributes). The relevant text from the NUG v1.1 is as follows. (By the way, this quotation indicates that Unidata also regard it as OK to refer to Unicode "characters" instead of "codepoints", in the interest of easy understanding.)

Beginning with versions 3.6.3 and 4.0, names may also include UTF-8 encoded Unicode characters as well as other special characters, except for the character '/', which may not appear in a name. Names that have trailing space characters are also not permitted.

We've agreed that CF should not prohibit characters permitted by the NUG, although we recommend a more restricted list of characters in sect 2.3:

It is recommended that variable, dimension, attribute and group names begin with a letter and be composed of letters, digits, and underscores. By the word letters we mean the standard ASCII letters uppercase A to Z and lowercase a to z. By the word digits we mean the standard ASCII digits 0 to 9, and similarly underscores means the standard ASCII underscore _.

In the previous discussion on this issue, an important point was made, that many CF attributes identify netCDF variables or attributes by name e.g. coordinates="lat lon". Therefore any valid character in a netCDF name might appear in one of these CF attributes, and hence CF must allow any Unicode character in a string-valued attribute. Since we allow char arrays as equivalent to strings, we can't restrict char arrays to ASCII only (your final point 1).

On your final point 2, in my text above I proposed that we should require UTF-8 encoding for char arrays. We haven't said anything about this before, and we didn't provide a way to record the encoding, so for existing char data the only possibility is to guess what encoding was used, if it's not ASCII. I think we could justifiably do either of the following, but we must do one or the other in order for char data to be properly usable:

(a) Require UTF-8.
(b) Recommend UTF-8, but provide a new attribute to record the encoding.

Which of these should we do?

For string data, I suppose the encoding isn't our concern, is it? I assume that netCDF strings support Unicode. Any interface to netCDF must therefore do likewise, and we can leave it to the netCDF interface of whatever language we use to deal with the encoding of the string data the user provides in that language.

Best wishes

Jonathan

ChrisBarker-NOAA commented 1 month ago

I was hoping that we could add a couple of sentences on this subject, rather than a new chapter. :-)

There's still hope :-)

NetCDF allows Unicode characters in names (of dimensions, variables and attributes). The relevant text from the NUG v1.1

Darn that google! -- I could have saved a lot of writing if I'd found that.

names may also include UTF-8 encoded Unicode characters

OK -- very good -- UTF-8 it is -- whew!

We've agreed that CF should not prohibit characters permitted by the NUG,

That's clear then.

By the word letters we mean the standard ASCII letters uppercase A to Z and lowercase a to z. By the word digits we mean the standard ASCII digits 0 to 9

So CF recommends, but does not require, ASCII-only for names -- OK then, that helps, but doesn't avoid the issue :-).

Any valid character in a netCDF name might appear in one of these CF attributes.

Hence CF must allow any Unicode character in a string-valued attribute.

Darn -- but it is what it is.

Since we allow char arrays as equivalent to strings, we can't restrict char arrays to ASCII only (your final point 1).

Also darn. :-)

On your final point 2, in my text above I proposed that we should require UTF-8 encoding for char arrays.

Makes sense to me. And, in fact, there is a very strong justification for this:

This is critical, as many (most?) programming environments (C, FORTRAN) only work natively with raw binary data (e.g. char*). So it's pretty critical that all char (and string) data are encoded the same way.

the only possibility is to guess what encoding was used, if it's not ASCII.

And guessing is never good :-(

I think we could justifiably do either of the following, but we must do one or the other in order for char data to be properly usable: (a) Require UTF-8. (b) Recommend UTF-8, but provide a new attribute to record the encoding.

Which of these should we do?

Requiring UTF-8 is the best way to go -- see the point above about raw char* data.

However, as I noted, an _Encoding attribute was proposed (but not accepted?) years ago, and it seems the Python library is using that attribute [1] (while defaulting to utf-8). So that cat may be out of the bag. Whether there are files out in the wild with _Encoding set, I don't know -- but if there are we probably don't want to make them invalid.

So, as much as I would like to simply require UTF-8, we probably need to say it's preferred, and the default, but other encodings can by used if defined in the _Encoding attribute.

However, for (global only?) attributes, rather than variable data, there is no way to set an _Encoding attribute. So UTF-8 in that case?

So:

For variables:

UTF-8 is preferred, and the default, but a different encoding can be used if the _Encoding attribute is set

For attributes: UTF-8 is required.

As for the content of an _Encoding attribute, it would be nice to standardize that -- the best I could find for encodings is:

https://www.unicode.org/reports/tr17/#:~:text=The%20Unicode%20Standard%20has%20seven,32BE%2C%20and%20UTF%2D32LE.

Do we want to specify only those encodings? and only those spellings?

What about non-unicode encodings -- e.g. latin-1 ? If we can, it would be nice to keep it simple and only allow Unicode encodings (which gives you ascii, as a subset of utf-8).

Here's a list of what Python supplies out of the box:

https://docs.python.org/3/library/codecs.html#standard-encodings

The ones in there that are "all languages" (Unicode) I think is the same as the official Unicode list :-).

Note that there are big and little endian versions of the multi-byte encodings -- as netcdf "endianness is solved by writing all data in big-endian order" -- I think only the big endian forms should be allowed.

Finally, are the encoding spellings case-sensitive? e.g. the official spelling is "UTF-8" -- but Python, for instance, will accept: "utf-8", "UTF_8", etc.

For string data, I suppose the encoding isn't our concern, is it?

Unfortunately, it is :-(

I assume that netCDF strings support Unicode.

AFAICT, the only difference between a char array and a string is that the length of a char array is fixed -- that is, at the binary level, you get a char* (array of bytes) either way.

Turning that char* into a meaningful string requires that the encoding be known (unless you don't care what it means, and just want to pass it along, which is fine). If you want to compare it with other values you don't need to know the encoding, but you do need to know that the two you are comparing are in the same encoding. Hence why utf-8 everywhere would be easiest.
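For example (plain Python):

a = "naïve".encode("utf-8")
b = "naïve".encode("latin-1")
print(a == b)   # False -- the same text, but different bytes, so a byte-level comparison fails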

we can leave it to the netCDF interface of whatever language we use to deal with the encoding of the string data the user provides in that language.

Unfortunately, no -- there is no language-independent concept of a "Unicode string"; there is only a string of bytes and an encoding. So netcdf strings are no easier (but also no harder) than char arrays in that regard. The encoding must be specified.

The good news is that we can use exactly the same rules for char arrays and strings.

-Chris

[1] -- a note about Python -- internally, Python (v3+) uses a native "Unicode" string data type - a "string" of Unicode code points. The encoding is an internal implementation detail (and quite complex). This makes Unicode very easy to work with in Python, but there is no way to create a Python str from binary data without knowing the encoding. This created a LOT of drama around filenames in Python 3 on *nix. On Unix, a filename is a char*, with very few restrictions -- the encoding may not be known (and may even be inconsistent within a file system!). Folks writing file processing utilities for Unix wanted to be able to work with these filenames without decoding them -- and if all you need to do is pass them around and compare them, then there is no need to know the encoding. It got ugly, and Python 3.4(?)+ finally introduced a workaround.