JimBiardCics opened 6 years ago
+1
Honestly, I only recently learned that attributes could have types other than a single piece of text (leaving char vs. string out of it for now).
And there are current use cases of using delimited text to capture a similar concept.
So not allowing string arrays for now seems pretty darn straightforward.
CF2.0 can take advantage of more of these nifty features.
I've recently discovered that the netcdf4-python library will force attributes to use NC_STRING if passed a unicode array. See https://github.com/Unidata/netcdf4-python/pull/389
My own quick testing suggests this will be the case if (in Python 3) a string is passed containing code points above 127.
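The condition described can be sketched without the netCDF library at all: the decision point is whether every code point in the attribute value fits in one byte of ASCII. A minimal, hypothetical check (the function name is mine, not part of netcdf4-python):

```python
def needs_nc_string(value: str) -> bool:
    """Return True if value cannot be stored losslessly as a
    one-byte-per-character NC_CHAR attribute, so a writer such as
    netcdf4-python would have to fall back to NC_STRING."""
    return any(ord(c) > 127 for c in value)

needs_nc_string("days since 1900")  # False: plain ASCII fits NC_CHAR
needs_nc_string("température")      # True: 'é' is above code point 127
```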
I've just scanned through this following an enquiry about some data which has been sent to us and which breaks the cf-checker with string valued attributes.
I agree with @JimBiardCics that string value attributes should be allowed, and also the conclusion that string valued arrays should not be allowed in general.
On the other hand, I think we should be careful about penalising an approach which may be beyond the control of many data providers: many people will be passing strings to software which puts them in the file. The HDF library also assigns a string value to an attribute through a command such as f.attrs['units'] = 'days since 1900' (see Unidata/netcdf4-python#448) -- do we really want to be giving advice which conflicts with default behaviour in major libraries?
On the UNIDATA pages I can't find out what a CDL statement of the form Conventions = "CF-1.7"; is meant to mean. I believe the ncgen utility is interpreting it as a string rather than a char array. In NcML the default type is certainly string rather than char array.
If people really want compatibility with software which has been designed around NetCDF3 they are going to have to use the NetCDF4-classic model. Rather than making specific instructions, can we advise people to consider the interests of users and their software libraries before moving to full NetCDF4? I feel that we will get tied up in knots if we tie CF to a selection of NetCDF3 features.
@davidhassell: As far as the Convention is concerned, this issue should be labelled a DEFECT. The current text states that "NetCDF does not support a character string type, so these must be represented as character arrays", and this is clearly untrue. The ubyte data type should also be supported.
Dear Martin
I too still think that (a) for an attribute, a string should be allowed instead of, and regarded as equivalent to, a 1D char array, (b) this equivalence should be stated somewhere near the start of the convention, (c) we should not allow arrays of strings (for now - they might be allowed in future).
When you say, "We should be careful penalising an approach which may be beyond the control of many data providers", do you mean we shouldn't recommend one or the other? Earlier I had suggested making a recommendation, but I agree that it isn't necessary. However it would be useful to note (for the info of data-writers) that some users might not be able to interpret strings because the string data type didn't exist before NetCDF4.
As regards the label for this discussion, I think this should be an enhancement, not a defect. I agree that the existing convention text is wrong (because it's out of date, not because it was originally mistaken). However, allowing strings in CF is an enhancement. Moreover, it's an issue of sufficient seriousness that we are not willing to agree it by default, as the length of this discussion shows. The rule for defects is that they're accepted if no-one objects. Well, I object to this being accepted by default. :-) However, I support it as an enhancement.
I think that character encoding, if we need to do something about it, should be treated as a different issue.
Best wishes
Jonathan
Dear Jonathan,
My comment about penalising one approach was intended to be about cf-checker warnings: I don't think we should be issuing warnings for encoding choices which may be out of the users' control. According to your comment above, this does indeed mean that we should not recommend one approach.
I agree that we should make a statement about potential difficulties caused by string attributes, but I was suggesting that this should be placed in the context of a general statement about using the NetCDF classic model (which may be a way in which users can easily enforce character arrays for attributes) to ensure that data can be read with legacy code. Some of our community are putting a lot of effort into enabling the use of NetCDF4 groups ... we need to be consistent about the advice we give. As far as I can tell, it would not make sense to recommend using character arrays for attributes if the file contains group structures which require NetCDF4 aware software.
I hadn't realised that the defect label could result in curtailed discussion ... to my mind this is a counter-intuitive outcome from our procedural rules, but, as it stands, I agree this should be labelled as an enhancement.
regards, Martin
Dear Jonathan,
with reference to your comment above about recommendations in the Convention being linked to warnings issued by the CF Checker: since my last post I've noticed this does not appear to apply to the phrase "We recommend that whenever possible ..." which is used in connection with the specification of bounds for spatial coordinate variables. Do you think this is intended (i.e. the "recommended whenever possible" is a somewhat weaker statement than "recommended", which would make sense)?
There is also at least one recommendation which clearly cannot be checked by the cf-checker (referring to use of meaningful names). It might be better to phrase such guidance as best practice notes rather than recommendations, and perhaps a similar approach could be adopted for character arrays vs. strings.
regards, Martin
Dear Martin
The CF-checker issues warnings for those recommendations which are in the conformance document. As you imply, not all recommendations can be checked, and those which can't aren't in the conformance document. I don't think it was intended that "where possible" should indicate a weaker recommendation.
I agree that it would be better not to say "recommend" for something that could be checked, but which isn't in the conformance document because we don't want to be warned about it. In that case we clearly don't feel strongly about it. The choice between strings and char arrays could be in that category, I agree.
Best wishes
Jonathan
I've added a related issue (#174) on string valued dimensions, which are a new feature in NetCDF4. I think this can be handled separately, but it shares a common starting point in that it arises from the greater flexibility introduced in NetCDF4, which the Convention text does not reflect.
@martinjuckes That is covered in issue #139. It is specifically for string variables.
given my offer to moderate #139 and the connectedness of these two issues, I'm happy to offer to moderate this issue as well. Is this helpful?
@JimBiardCics Am I correct that this proposal does not yet have an associated Pull Request?
Thanks for the offer, @marqh -- I've assigned you to the issue.
@marqh That is correct. I have verbiage, but I haven't made a pull request yet. I was figuring to do it after the 'string variable' PR was done, as there is overlap between the two. I wanted to avoid the awkward merge that could result.
Does this mean we will have no more discussion about using string array attributes?
@kenkehoe I don't think so. It is a purely logistical issue. There is so much overlap between the change sections that it seemed silly to make two independent branches of change from the original that would then require a super-awkward merge.
So now that string variables have landed, I want to bring some attention to this issue again. Some updates I've learned about:
nc_get_att_text() returns...

@DocOtak, thanks for restarting this. In light of past difficulties, I move to split the issue.
I think it would be possible to break out the more difficult parts of this topic into new and separate issues. I suggest that this issue #141 be narrowed to only a single essential ingredient: scalar string-type attributes as an alternative for traditional character-type attributes.
Can we agree to move the following to new Github issues, and focus for now only on legalizing scalar string-type attributes?
@Dave-Allured That sounds OK to me, whatever does get adopted, should probably be pretty explicit about what is "not allowed".
@Dave-Allured I approve of your proposal. I think we pretty much have no choice but to allow UTF-8 as a baseline to start with, but there clearly are larger issues to be resolved. (I say "no choice" because, for example, constraining to ASCII in python 3 is a bit complicated.)
On Fri, Mar 13, 2020 at 2:14 PM JimBiardCics notifications@github.com wrote:
@Dave-Allured https://github.com/Dave-Allured I approve of your proposal. I think we pretty much have no choice but to allow UTF-8 as a baseline to start with, but there clearly are larger issues to be resolved. (I say "no choice" because, for example, constraining to ASCII in python 3 is a bit complicated.)
Not really. Python does not use utf-8 internally, so you have to encode and decode when reading/writing a file anyway. Setting that encoding to ASCII is not easier or harder than setting it to utf-8.
But while we may have a choice -- it's not a good choice. We've all needed non-ascii characters for a LONG time. And Unicode and utf-8 are well established. And utf-8 is very compatible with old text processing software, so really, we should just do it.
The primary complication is that it's not always obvious how many bytes you need to store a given string, but that's more a problem with writing than reading, so we can hope that the software that writes non-ascii data is smart enough to do it right.
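The byte-count complication mentioned above is easy to demonstrate in Python: the length of the text and the length of its UTF-8 encoding diverge as soon as a non-ASCII character appears.

```python
s = "10°C"
b = s.encode("utf-8")

# str length counts code points; bytes length counts encoded bytes.
assert len(s) == 4   # four characters
assert len(b) == 5   # five bytes: the degree sign encodes as 0xC2 0xB0
```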
-CHB
--
Christopher Barker, Ph.D. Oceanographer
Emergency Response Division NOAA/NOS/OR&R (206) 526-6959 voice 7600 Sand Point Way NE (206) 526-6329 fax Seattle, WA 98115 (206) 526-6317 main reception
Chris.Barker@noaa.gov
Hmmm. Chris, I think you are implying a problem that does not exist. I do not think CF has ever restricted the use of UTF-8 in free text within attributes. I suspect there are many UTF-8 attribute examples in the wild, though I do not have one up my sleeve right now. Please correct me if I'm wrong.
Chris,
Python 3 is not the same as python 2. In Python 2 there were two types — str (ASCII) and unicode (by default UTF-8). In Python 3 there is only str, and by default it holds UTF-8 unicode (there's lots of subtly that I'm glossing over here, but this is what it boils down to). It bit me recently, so I'm sensitive to it.
https://lucumr.pocoo.org/2014/1/5/unicode-in-2-and-3/ https://docs.python.org/3/howto/unicode.html
I'm getting double messages -- I think we may have a feedback loop between gitHub and the list .....
But anyway:
Hmmm. Chris, I think you are implying a problem that does not exist.
I hope that's true, Sorry if I stirred up confusion.
But I was responding to a comment about ASCII vs UTF-8, so ....
I also picked this up in email, so was unsure of the context. I've now gone and re-read the issue, and I'm a bit confused about what's still on the table.
But way back, someone wrote: " two issues: the use of strings, and the encoding. These can be decided separately, can't they?"
and there was another one: arrays of strings vs whitespace separated strings.
(I'm also not completely clear about the difference between a char* and a string anyway. Either way, it's a bunch of bytes that need to be interpreted)
So I'll just talk about encoding here. A few points:
(I know you all know most of this, and most of it has been stated in this thread, but to put it all in one place...)
Encodings are a nightmare: any place that a pile of bytes could be in more than one encoding is a pain in the a$$ for any client software -- think about the earlier days of html!
Being able to use non-ASCII characters is important and unavoidable. We can certainly restrict CF names to ASCII, but it's simply not an option for variables or attributes. (I don't think anyone is suggesting that anyway) and Unicode is the obvious way to support that.
So that leaves one open question: what encoding(s) are allowed for a CF compliant file?
I'm going to be direct here:
THERE IS NO REASON TO ALLOW MORE THAN ONE ENCODING
It only leads to pain. Period. End of story. If there is one allowed encoding, then all CF compliant software will have to be able to encode/decode that encoding. But ONLY that one! If we allow multiple encodings, then to be fully compliant, all software would have to encode/decode a wide range of encodings, and there would have to be a way to specify the encoding. So all software would have to be more complex, and there would be a lot more room for error.
If there is only one encoding allowed, then there are really only two options:
UCS-4: because it handles all of Unicode and is always the same number of bytes per code point. A lot more like the old char* days. However, no one wants to waste all that disk space, so that leaves:
UTF-8: which is ASCII compatible, handles all of Unicode, and has been almost universally adopted in most internet exchange formats (those that are sane enough to specify a single encoding :-) )
It is also friendly to older software that uses null-terminated char* and the like, so even old code will probably not break, even if it does misinterpret the non-ascii bytes. And old software that writes plain ascii will also work fine, as ascii IS utf-8.
All that's a long way of saying:
CF should specify UTF-8 as the only correct encoding for all text: char or string. With possibly some extra restrictions to ASCII in some contexts.
If that had already been decided, then sorry for the noise :-)
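The ASCII-compatibility claim above is easy to verify: encoding pure-ASCII text as ASCII and as UTF-8 yields byte-identical results, which is why files written by ASCII-only software are already valid UTF-8.

```python
text = "degrees_north"

# For pure-ASCII text, the ASCII and UTF-8 encodings produce exactly
# the same bytes, so old ASCII-writing software needs no changes.
assert text.encode("ascii") == text.encode("utf-8")
assert text.encode("utf-8").decode("utf-8") == text
```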
@JimBiardCics wrote:
Actually, I know a LOT more about Python than I do about netcdf, HDF, or CF. And I'm afraid you have it a bit confused. This is kind of off-topic, but for clarity's sake:
Python 3 is not the same as python 2.
Very True, and a source of much confusion.
In Python 2 there were two types — str (ASCII) and unicode (by default UTF-8).
Almost right: there were two types:
str: which was a single byte per character of unknown encoding -- essentially a wrapped char -- usually ascii compatible, often latin-1, but not if you were Japanese, for instance.... It was also used as a holder of arbitrary binary data: see numpy's "fromstring()" methods, or reading a binary file. Much like how char is used in C.
unicode: which was unicode text -- stored internally in UCS-2 or UCS-4 depending on how Python was compiled (I know, really?!?!). It could be encoded/decoded in various encodings for IO and interaction with other systems.
In Python 3 there is only str, and by default it holds UTF-8 unicode
Almost right: the Py3 str type is indeed Unicode, but it holds a sequence of Unicode code points, which are internally stored in a dynamic encoding depending on the content of the string (really! a very cool optimization, actually: if you have only ascii text, it will use only one byte per char: https://rushter.com/blog/python-strings-and-memory/ ). But all that is hidden from the user. To the user, a str is a sequence of characters from the entire Unicode set, very simply.
(Unicode is particularly weird in that one "code point" is not always one character, or "grapheme" to accommodate languages with more complex systems of combining characters, etc, but I digress..)
And there are still two types -- in Python3 there is the "bytes" type, which is actually very similar to the old python2 string type -- but intended to hold arbitrary binary data, rather than text. But text is binary data, so it can still hold that. In fact, if you encode a string, you get a bytes object:
In [13]: s
Out[13]: 'some text'
In [14]: b = s.encode("ascii")
In [15]: b
Out[15]: b'some text'
Note the little 'b' before the quote. In that case, they look almost identical, as I encoded in ASCII. But what if I had some non-ASCII text?:
In [18]: s = "temp = 10\u00B0"
In [19]: s
Out[19]: 'temp = 10°'
In [20]: b = s.encode("ascii")
---------------------------------------------------------------------------
UnicodeEncodeError Traceback (most recent call last)
<ipython-input-20-3930abba6989> in <module>
----> 1 b = s.encode("ascii")
UnicodeEncodeError: 'ascii' codec can't encode character '\xb0' in position 9: ordinal not in range(128)
oops, can't do that -- the degree symbol is not part of ASCII. But I can do utf-8:
In [21]: b = s.encode("utf-8")
In [22]: b
Out[22]: b'temp = 10\xc2\xb0'
which now displays the byte values, escaping the non-ascii ones. So that bytes object is what would get written to a netcdf file, or any other binary file.
And Python can just as easily encode that text in any supported encoding, of which there are many:
In [28]: s.encode("utf-16")
Out[28]: b'\xff\xfet\x00e\x00m\x00p\x00 \x00=\x00 \x001\x000\x00\xb0\x00'
But please don't use that one!
So anyway, the relevant point here is that there is NOTHING special about utf-8 as far as Python is concerned. And in fact, Python is well suited to handle pretty much any encoding folks choose to use -- but it doesn't help a bit with the fundamental problem that you need to know what encoding your data is in in order to use it. And if Python software (like any other) is going to write a netcdf file with non-ascii text in it, it needs to know what encoding to use.
The other complication that has come up here is that, IIUC, the netCDF4 Python library (a wrapper around the C libnetcdf) makes no distinction between the netcdf types CHAR and STRING (don't quote me on that), but that's a decision of the library authors, not a limitation of Python.
Actually, it does seem to give the user some control:
https://unidata.github.io/netcdf4-python/netCDF4/index.html#netCDF4.chartostring
Note that utf-8 is the default, but you can do whatever you want.
In any case, the Python libraries can be made to work with anything reasonable CF decides, even if I have to write the PRs myself :-)
Sorry to be so long winded, but this IS confusing stuff!
one small additional note about Python and Unicode:
The post Jim pointed us to, https://lucumr.pocoo.org/2014/1/5/unicode-in-2-and-3/ , is now six years old -- and many of the issues brought up have been addressed.
And the author of that post has another post on the dangers of referring back to such older opinions:
https://lucumr.pocoo.org/2016/11/5/be-careful-about-what-you-dislike/
Another issue with that discussion is that it's written from the perspective of what some folks in the community are calling "byte slingers": those that write libraries and the like that deal with binary data and protocols. And the fact is that Python3's string model is NOT as well suited to those use cases. But it is massively better suited to most more "casual" use cases. In that post, he refers to "beginners", but it's not beginners, it's anyone that does not understand the subtleties of binary data, encodings, and the like. Which is most of us "scientific programmers".
Bringing this back to CF: For CF, ideally we would choose an approach that is well suited to the "Normal scientific programmer", and leave the encoding/decoding to the libraries. And have confidence that the "byte slingers" will correctly write the libraries to match the standard, and make things "just work" for most users.
@ChrisBarker-NOAA My original observation was that we can absolutely split off some of these issues. I see two issues being peeled off from the base issue.
I think you've made a strong case for starting out by specifying ASCII and Unicode / UTF-8 as the only valid contents for string attributes, with one of the two spinoff issues addressing the question of broadening the options.
My original observation was that we can absolutely split off some of these issues. Agreed.
Have these been started? I can't find them if they have.
There is also the question of what to do with CHAR types -- the same as STRING?
And what about encoding of CHAR and STRING variables? I can't find anything about that in the current CF document, so it doesn't seem to be settled.
Maybe this should go in a new issue, but for now, I had a (not well formed) thought:
CHAR variables and attributes should only be encoded in a 1-byte per character ascii compatible encoding: e.g. ascii, latin-1
STRING variables and attributes should only be encoded in utf-8 (of which ascii is a subset)
My justification is that there will be little software in the wild that supports Unicode, but does not support String. Setting this standard will make it less likely that older software that assumes a 1byte per character text representation will get handed something it can't deal with. And the string type is better suited to Unicode anyway, as the "length" of a string is less well defined.
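The proposal above can be sketched as a small decision rule (the function and type-name strings are illustrative, not from any library or from CF itself): given an attribute's text, which netCDF attribute types could hold it under a CHAR-is-one-byte / STRING-is-UTF-8 convention?

```python
def allowed_attribute_types(value: str) -> list:
    # Sketch of the proposal above, not an official CF rule:
    # NC_STRING (UTF-8) can hold any Unicode text, while NC_CHAR is
    # only allowed when every code point fits in one byte
    # (i.e. ascii or latin-1).
    types = ["NC_STRING"]
    if all(ord(c) < 256 for c in value):
        types.append("NC_CHAR")
    return types

allowed_attribute_types("days since 1900")  # ['NC_STRING', 'NC_CHAR']
allowed_attribute_types("10°")              # ['NC_STRING', 'NC_CHAR'] (° is in latin-1)
allowed_attribute_types("π r²")             # ['NC_STRING'] only: π is U+03C0
```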
@ChrisBarker-NOAA
Assuming we do spin off sub-issues related to encoding and string array attributes, I agree fully that we should, in this specific issue, propose making changes to the CF document to make it clear that CHAR attributes must be ASCII or latin-1 and STRING attributes should be unicode/utf-8.
we should, in this specific issue, propose making changes to the CF document to make it clear that CHAR attributes must be ASCII or latin-1 and STRING attributes should be unicode/utf-8
+1 on that.
Re CHAR vs STRING, the netcdf C API calls one "text" and the other "string". Do we want to use that language at all in whatever text is developed?
Be aware that the netcdf python library will force the use of strings for netcdf4 files if it sees unicode points outside of ASCII.
Also be aware that LATIN-1 is not compatible with UTF-8 with code points above 127. The ISO working group maintaining these "legacy" standards (ISO-8859-n, where n=1 is LATIN-1) doesn't even exist anymore...
Also be aware that LATIN-1 is not compatible with UTF-8 with code points above 127
Indeed. Which is why it should be clear that you should NOT put utf-8 in a CHAR array :-) We could say ASCII only for CHAR, but I'm not sure there is a good reason to be that restrictive.
It may be an implementation detail of the Python encodings, but at least there, latin-1 can decode ANY string of bytes (other than the null byte) without error, and write it out again with no changes. So if consuming code uses the latin-1 encoding for all CHAR arrays, it may get garbage for the non-ascii bytes, but it won't raise an error or mangle the data if it is written back out.
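In Python, at least, this round-trip property is easy to confirm (and in fact it covers the null byte as well, since latin-1 maps every byte value straight to the code point of the same value):

```python
raw = bytes(range(256))        # every possible byte value, 0x00-0xFF
text = raw.decode("latin-1")   # never raises: each byte maps directly
                               # to the Unicode code point of its value
assert text.encode("latin-1") == raw   # lossless round trip
```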
the netcdf python library will force the use of strings for netcdf4 files if it sees unicode points outside of ASCII.
which is the right thing to do, and compatible with this proposal, I think (hmm, unless latin-1 is allowed). But you could probably send a latin-1 encoded bytes object in, yes?
Anyway, if we codify this, and the netCDF4 lib (or any other) can't support it, it can be fixed. And yes, I am volunteering to do a PR for a fix to netCDF4-python.
Additionally, the netcdf standard itself has support for UTF-8 variable names, requires them to be NFC, and specifically excludes bytes 0x00 to 0x1F and 0x7F to 0xFF (see the "name" part of that document).
I think this matters because at least one of the standard attributes needs to be able to refer to variable names. Basically, allowing anything other than UTF-8, especially things that allow bytes 0x7F to 0xFF (like the ISO-8859 series encodings do), would probably cause actual problems.
Thanks! Yup -- then attributes really do need to be UTF-8 and the STRING type (for text) only.
I suppose they don't ALL HAVE to be the STRING type, but the ones that might contain variable names should be.
after all, any software that doesn't support the STRING type probably doesn't support Unicode variable names, either ...
@DocOtak I couldn't find the direct restriction on the 0x80 to 0xFF characters. Is this a side effect of utf-8 using the high bit to signal multibyte characters? Or is it a more general prohibition against using the characters in latin-1 that fall in that range?
@JimBiardCics It's the "not match" group in that regex that is doing it: ([^\x00-\x1F/\x7F-\xFF]|{MUTF8}). At least, I'm pretty sure that is what is going on. I rarely use regex myself, so I could be wrong, but I'm quite sure that the ^ is "not match".
I missed the regex. Yep, that's what it says. 0x7F is the "del" char, so it's non-printing. I think the characters from 0xC0 - 0xFF are out because they would all be interpreted in UTF-8 as signaling the start of a multi-byte character. 0x80 - 0xBF can all be interpreted as trailing elements of a multibyte character, so I guess it's a bad plan to have one lying around loose. This Wikipedia article was informative.
remember that utf-8 is ascii compatible for the first 127 (7 bits). So:
0x00 to 0x1F are the control codes from ASCII
0x7f is the DEL (not sure why that wasn't in the first set..., but there you go).
and 0x80 to 0xFF is the rest of the non-ascii bytes (128-255), which you have to be able to use in order to do utf-8. But frankly, I'm not sure what a regex means with regard to bytes. But if I had to guess, I'd pull it apart this way (which is almost what's in the footnote):
first: MUTF8 means "multibyte UTF-8 encoded, NFC-normalized Unicode character". However, Unicode doesn't quite use "characters" but rather "code points", so in practice that means any Unicode code point >= 128 (0x80).
([a-zA-Z0-9_]|{MUTF8})([^\x00-\x1F/\x7F-\xFF]|{MUTF8})*
The first character has to be: ([a-zA-Z0-9_]|{MUTF8}): ASCII letter, number or underscore OR any other code point over 128
All the other characters have to be: Any code point other than: \x00-\x1F and \x7F-\xFF OR any code point above 128.
Which is an odd way to define it, as the codepoints \x7F-\xFF are valid Unicode, so you're kind of excluding them, and then allowing them again .... strange.
I suspect that this started with the original pre-Unicode definition, and they added the UTF8 part, and got an odd mixture. In particular, there is really no reason to treat the single byte or multibyte UTF codepoints separately, that's just odd.
I think I'd write this as:
Names are UTF-8 encoded. The first character can be any of these codepoints:
x30 - x39 (digits: 0-9)
x41 - x5a (upper case letters: A-Z)
x61 - x7a (lower case letters: a-z)
x5f (underscore)
>= x80
The rest can be any code point other than:
\x00-\x1F or \x7F
However, there is a key missing piece: a number of Unicode code points are used for control characters and whitespace, and probably other things unsuitable for names. Which may be why they used the term "character". But it would be better if they had clearly defined what's allowed and what's not. For instance, Python3 uses these categories (https://docs.python.org/3/reference/lexical_analysis.html#identifiers): Lu - uppercase letters Ll - lowercase letters Lt - titlecase letters Lm - modifier letters Lo - other letters Nl - letter numbers
I have no idea if those are defined by the Unicode consortium anywhere. But it would be good for netcdf (and or CF) to define it for themselves.
I will say that it's kind of nifty to be able to do (in Python):
In [17]: π = math.pi
In [18]: area = π * r**2
But I'm not sure I need to be able to assign a variable to 💩 -- which Python will not allow, but does the netcdf spec allow it?
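Python's own identifier rules can be queried directly, which shows how the Unicode categories above are applied in practice:

```python
import unicodedata

# 'π' is a lowercase letter (category Ll), so Python accepts it in
# identifiers; '💩' is a symbol (category So), so it is rejected.
assert unicodedata.category("π") == "Ll"
assert unicodedata.category("💩") == "So"
assert "π".isidentifier()
assert not "💩".isidentifier()
```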
I think there is some confusion here.
First, this whole regex stuff is only about the physical byte layout of the netcdf classic file format. I would in principle suggest focusing completely on netcdf4 files instead.
Second, I think CF should not concern itself with encodings and byte order stuff at all. Leave that to netcdf4/hdf5 and just work at the character level. And yes, unicode has code points, but also a concept of characters (see here).
Third, looking at the regex in question
([a-zA-Z0-9_]|{MUTF8})([^\x00-\x1F/\x7F-\xFF]|{MUTF8})*
notice that it is only an explanatory comment, but apart from that the overwhelmingly likely way to parse this, thanks to the "|" alternatives, is as either
([a-zA-Z0-9_])([^\x00-\x1F/\x7F-\xFF])*
ie an ascii string starting with a character, digit, or underscore, limited to the first 128 bytes without control characters and excluding "/" everywhere or
({MUTF8})({MUTF8})*
ie any unicode string encoded as normalized UTF-8.
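That reading can be checked with Python's re module. Here {MUTF8} is approximated as "any code point at or above 0x80" (an assumption on my part; the real grammar is in the NUG, and this sketch ignores NFC normalization):

```python
import re

# Approximate NUG name pattern: first char is ASCII alphanumeric,
# underscore, or any code point >= 0x80; later chars exclude the ASCII
# control characters, '/', DEL, and the 0x80-0xFF range except via the
# MUTF8 branch, which lets all non-ASCII code points back in.
MUTF8 = r"[\u0080-\U0010FFFF]"
NAME = re.compile(r"^(?:[a-zA-Z0-9_]|%s)(?:[^\x00-\x1F/\x7F-\xFF]|%s)*$"
                  % (MUTF8, MUTF8))

assert NAME.match("temp_2m")        # plain ASCII name
assert NAME.match("température")    # é comes in via the MUTF8 branch
assert NAME.match("π")              # any non-ASCII code point is allowed
assert not NAME.match("a/b")        # '/' is excluded everywhere
assert not NAME.match("bad\x01")    # control characters are excluded
```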
@ChrisBarker-NOAA wrote:
I have no idea if those are defined by the Unicode consortium anywhere.
They do indeed. See here.
@DocOtak @zklaus Sorry if I've pulled the discussion off track. The question of exactly why NUG worded things the way they did is intriguing, but I think Klaus is right that we shouldn't get wrapped around that particular axle in this issue — particularly if we are going to split encoding off into a different issue. I think the take-away is that our baseline is "sane utf-8 unicode" for attributes of type NC_STRING and ASCII for attributes of type NC_CHAR (those created with the C function nc_put_att_text.)
I agree and would go one small step further: UTF-8 is only an encoding, so we should just say "unicode" for strings. If we need to restrict that, say to disallow underscore in the beginning or to save a separation character like space in attributes right now, we should do so at the character level, possibly using categories as introduced by @ChrisBarker-NOAA above.
UTF-8 is only an encoding, so we should just say "unicode" for strings.
We could do that if and only if netcdf itself was clear about how Unicode is encoded in files. Which it is for variable names, though not so sure it is anywhere else.
But even so, once the encoding has been specified, then yes, talking about Unicode makes sense.
Agreed, it's not for this discussion, but:
MUTF8 is not quite (in that doc) "any unicode string encoded as normalized UTF-8", because I think they are specifically trying to exclude the ASCII subset, so they can handle that separately. I.e. characters that are excluded, like "/", are indeed unicode strings.
But it's a pretty contorted way to describe it -- but that's netcdf's problem :-)
Ah yes, I see what you mean, you are right: Always speaking about UTF-8, multi-byte here isn't referring to the possibility of having several bytes encode one code point, but to actual code points with more than one byte, thus excluding the one-byte code points which are exactly the first 128 ASCII characters. Then they allow back in specific ASCII characters.
Dear all
The issue was opened in 2018 and has seen a long discussion, but no further contributions since 2020. It has been partly superseded, in that CF now permits string-valued attributes to be either a scalar string or a 1D character array (see Sect 2.2). Apart from that, it seems to me that the discussion was mostly concerned with three subjects:
Should CF allow arrays of strings in attributes? We are currently discussing that question in https://github.com/orgs/cf-convention/discussions/341, which refers back to this issue. Therefore I propose we don't discuss this any further here.
What encoding should be used in string attributes. The consensus was that it should always be Unicode. One reason for this is that netCDF variable names are in Unicode, and many CF attributes contain the names of netCDF variables. CF recommends that only letters, digits and underscores should be used for variable names, but does not prohibit other Unicode characters. Should we insert a statement in the CF convention about strings being Unicode?
Whether to restrict the characters allowed in string-valued attributes. The majority of CF attributes contain the names of netCDF variables and strings which come from a CF controlled vocabulary or a list in an Appendix. The set of characters that can be used in those attributes is thus dictated already by the convention. This question therefore applies only to the attributes that CF defines but whose contents it does not standardise, namely comment, history, institution, references, source, title, and long_name. Does anyone wish to pursue this third question? For instance, @ChrisBarker-NOAA, @zklaus and @DocOtak all contributed in 2020.
I propose that this issue should be closed as dormant if no-one resumes discussion on Q2 or Q3 within the next three weeks, before 14th September.
Cheers
Jonathan
Thanks for trying to close this out :-)
Should we insert a statement in the CF convention about strings being Unicode?
I just looked, and all I see is this under naming:
"...is more restrictive than the netCDF interface which allows almost all Unicode characters encoded as multibyte UTF-8 characters"
So yes, I think it's good to be clear there -- maybe it's well defined by netCDF, but it doesn't hurt to be explicit, even if repetitive.
Am I correct to say that all strings in netCDF are Unicode, encoded as UTF-8?
Whether that's true or not for netcdf -- I think it should be true for CF, and we should say that explicitly in any case.
... This question therefore applies only to the attributes that CF defines but whose contents it does not standardise,
I would say that we should not restrict these otherwise-unrestricted attributes.
I'm not sure if that's pursuing it or not pursuing it -- I presume the default is no restrictions?
Hmm -- not sure where this fits, but it's related:
IIUC, CF now allows either the new vlen strings or the "traditional" char arrays.
The trick is that UTF-8 is not a one-char-per-codepoint encoding.
Could we say that you can only use Unicode (UTF-8) with vlen strings, and that char arrays can only hold ASCII? Or is the cat too far out of the bag for that?
Probably -- could we at least encourage vlen strings for non-ASCII text?
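One way to see why fixed-size char arrays sit awkwardly with UTF-8 -- a quick Python sketch (illustrative only): the byte count can exceed the code-point count, and you only know the size after encoding.

```python
# UTF-8 is variable-length: the number of bytes a char array must hold
# can exceed the number of code points in the text.
text = "Temp \u00b0C"           # contains the non-ASCII degree sign
encoded = text.encode("utf-8")  # the bytes a char array would store

assert len(text) == 7      # 7 code points
assert len(encoded) == 8   # 8 bytes: the degree sign takes two bytes

# Pure ASCII text is the special case where the two lengths agree:
ascii_text = "Temp C"
assert len(ascii_text) == len(ascii_text.encode("utf-8"))
```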
@ChrisBarker-NOAA
Am I correct to say that all strings in netCDF are Unicode, encoded as UTF-8?
I don't know either.
Whether that's true or not for netcdf -- I think it should be true for CF, and we should say that explicitly in any case.
I think so as well. That would go sensibly in Sect 2.2 "Data types".
We've already said in 2.2 that scalar vlen `string`s and 1D `char` arrays are both allowed and are equivalent in variables. We did not say so for attributes, but I expect everyone would assume that the same applies, in which case we should make it explicit. I don't think there's a problem with storing multi-byte character codes in a `char` array, is there? It would be clearest if we said that a 1D `char` array should always be interpreted as a Unicode string. An ASCII string is a special case of that, so it's backwards-compatible.
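A one-line sanity check of that special case (Python, purely for illustration): bytes that happen to be ASCII decode identically whether treated as ASCII or as UTF-8.

```python
# ASCII is a strict subset of UTF-8, so always decoding a 1D char array
# as UTF-8 is backwards-compatible with existing ASCII data.
raw = b"days since 1900-01-01"        # bytes as stored in a char array
assert raw.decode("ascii") == raw.decode("utf-8")

# Non-ASCII content still works, provided the bytes are valid UTF-8:
raw_utf8 = "m\u00b3".encode("utf-8")  # "m³" as UTF-8 bytes
assert raw_utf8.decode("utf-8") == "m\u00b3"
```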
Cheers
Jonathan
No-one said they wanted to resume Q1 or Q3 within three weeks, but @ChrisBarker-NOAA and I agreed that it would be useful to clarify that strings stored in variables or attributes should be Unicode characters (Q2). To do that, I propose that we replace the first 1.5 sentences of the second para of sect 2.2 "Data Types", which currently reads
> Strings in variables may be represented one of two ways - as atomic strings or as character arrays. An n-dimensional array of strings may be implemented as a variable of type `string` with n dimensions, or as a variable of type `char` with n+1 dimensions, where the most rapidly varying dimension ...

with

> A text string in a variable or an attribute may be represented either in Unicode characters stored in a `string` or encoded as UTF-8 and stored in a `char` array. Since ASCII 7-bit character codes are a subset of UTF-8, a `char` array of m ASCII characters is equivalent to a `string` of m ASCII characters. Unicode characters which are not in the ASCII character set require more than one byte each to encode in UTF-8. Hence a `string` of length m generally requires a UTF-8 `char` array of size >m to represent it. An n-dimensional array of strings may be implemented as a variable or attribute of type `string` with n dimensions (where n<2 for an attribute) or as a variable (but not an attribute) of type `char` with n+1 dimensions, where the most rapidly varying dimension ...
Also, I suggest inserting the clarification "which has variable length", in this sentence in the first paragraph:
> The `string` type, which has variable length, is only available in files using the netCDF version 4 (netCDF-4) format.
Does that look all right, @ChrisBarker-NOAA, @zklaus, anyone else? I believe this is no change to the convention, just clarification of the text, so I'm going to relabel this issue as a `defect`. Please speak up if you disagree. Thanks.
As PR #543 attempts to clarify a bit about Unicode, I thought I'd post here. I started commenting on the PR, but realized I had way too much to say for a PR, so I'm putting it here.
NOTE: maybe this should be a different issue -- specifically about Unicode in CF -- but I'm putting it here for now -- we can copy to a new issue if need be.
1) There is no such thing as a Unicode "character". Unicode defines "code points", and each code point is assigned a value. However: "Code points are the numbers assigned by the Unicode Consortium to every character in every writing system." -- so interchanging "code point" and "character" is probably OK and will lead to little confusion. (One difference is how Unicode handles accented characters and the like, so it's not quite one-to-one code-point-to-character.)
2) There is no such thing as a Unicode string (except where defined by a programming language, e.g. Python). When stored in memory or in a file, strings, Unicode or not, are stored as bytes, and the relationship between the bytes and the code points is defined by an encoding. Without an encoding, there is no clear way to define what a bunch of bytes means, or in reverse, how to store a particular set of code points.
ANSI encodings are one (8-bit) byte per character (so easy!), and ASCII is one 7-bit byte per character (so only 128 different chars). But ANSI encodings can therefore store only 256 different code points.
To store all possible Unicode code-points requires a 32 bit integer (4 bytes) -- that's the UCS-4 (UTF-32) encoding -- one-to-one relationship between integer value and code-point.
Other encodings that can store all of Unicode are "variable length encodings" -- a given code point can be a variable number of bytes. These allow more compact storage, but also more complexity in interpretation. Examples are UTF-8 (each code point is 8 or more bits) and UTF-16 (each code point is 16 or more bits).
UTF-8 is the most common Unicode encoding for storage of text in files, or passing over the internet (via https, or ...). UTF-16 is used internally by Windows and Java (I think).
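For illustration, the storage cost of one short string under these encodings (Python; the `-be` spellings are the big-endian, BOM-free variants):

```python
# Byte counts for the same text under different Unicode encodings.
text = "ab\u00e9"  # two ASCII characters plus "é"

assert len(text.encode("utf-8")) == 4       # 1 + 1 + 2 bytes (variable length)
assert len(text.encode("utf-16-be")) == 6   # 2 bytes per code point here
assert len(text.encode("utf-32-be")) == 12  # always 4 bytes per code point
```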
Anyway -- unless one wants to use UTF-32 (which is what the numpy Unicode type uses, and which most folks don't want for file storage -- it's pretty wasteful of space for virtually all text), a variable-length encoding is required. And a `char` array is not ideal for variable-length encodings, because a char array requires a fixed size, and you don't know what size is needed until you encode the text. So a variable-length `string` is the "right" choice for Unicode (non-ANSI) text.
So this brings us to the topic at hand -- in netCDF-3 the only way to store text was in arrays of type `char`. This maps directly to the `char*` used to store text in C. So a pretty direct mapping to C (and other languages).
With netCDF-4, a `string` type was introduced: strings are variable-length arrays of `char`s, while `char` arrays are fixed length.
So: as far as the netCDF spec is concerned, the only difference between a `char` array and a `string` is that the length of a `char` array is fixed. Once you read it -- you have a `char*`.
That's all I could find in the netCDF docs. Nothing about Unicode or encodings, or ... Which means that as far as the netcdf spec is concerned, you can put anything in either data type.
Note that a `char*` in C, while used for text (hence the name), is really a generic array of bytes -- it can be used to store any old collection of data.
So enter Unicode: as above, storing a "Unicode string", i.e. a collection of code points, requires that the string be encoded, resulting in a set of bytes that can be stored in, you guessed it, a `char*`. (On Windows, the standard encoding is UTF-16, so a `wchar` ("wide char") is used. But a `wchar*` can be cast to a `char*` -- it's still an array of bytes (unsigned eight-bit ints).)
So as far as netCDF is concerned, you can stuff Unicode text into either a `char` array or a `string`.
Note that I did find this discussion:
https://github.com/Unidata/netcdf-c/issues/402
from May-June 2017 and not closed yet. From the netCDF docs, I don't think it was ever resolved. But it does contain a proposal for using an `_Encoding` attribute, and it may be kinda-sorta adopted by the netCDF4 Python lib (it does respect the `_Encoding` attribute of char arrays), but I can't find documentation for how it handles the netCDF `string` type. And it looks like utf-8 is the default:

```python
def chartostring(b, encoding='utf-8')
def stringtochar(a, encoding='utf-8')
```

I also don't know what it does for attributes, because they can't have another attribute to store the `_Encoding`. So ?? However, it does seem to "just work" -- at least if you write the file with Python -- e.g. you can ncdump it and it will correctly show a non-ASCII character (on my terminal, which may be utf-8?).
Anyway -- as this doesn't seem to be defined by any published spec, I hope we can define it for CF. My proposal:

In pretty much any context:

- `char` arrays should only be used to store ANSI-encoded text, i.e. 1 byte per character. Maybe we could restrict that to ASCII or latin-1? (latin-1 is a superset of ASCII).
- For text that cannot be stored in an ANSI encoding (i.e. Unicode text), the `string` type should be used.
- `string` and string array attributes are stored with the utf-8 encoding. (Note that ASCII is a strict subset of utf-8, so ASCII is also legal.)
- `string` and `string` array variables are stored encoded as utf-8 by default, or in the encoding specified by the `_Encoding` attribute.

That's it -- pretty simple, really :-)
Points to consider:
1) Should we restrict `char` arrays to ASCII, or latin-1? (Or allow other 1-byte encodings with an `_Encoding` attribute?)
2) Should we allow the `_Encoding` attribute? Or just say "thou shalt use only UTF-8"?
My thought -- as much as I'd love to be fully restrictive to make things simpler for everyone, the cat's probably out of the bag. So we may have to impose as few restrictions as possible (e.g. allow `_Encoding`), but recommend either ASCII or UTF-8.
So -- enough words for you?
-- back in the day, a `char*` would be an ASCII- or ANSI-encoded string (null terminated), and all was good and simple.
Dear Chris
Thanks for the research and your useful exposition of the complexity of the issue. I was hoping that we could add a couple of sentences on this subject, rather than a new chapter. :-)
NetCDF allows Unicode characters in names (of dimensions, variables and attributes). The relevant text from the NUG v1.1 is as follows. (By the way, this quotation indicates that Unidata also regard it as OK to refer to Unicode "characters" instead of "codepoints", in the interest of easy understanding.)
Beginning with versions 3.6.3 and 4.0, names may also include UTF-8 encoded Unicode characters as well as other special characters, except for the character '/', which may not appear in a name. Names that have trailing space characters are also not permitted.
We've agreed that CF should not prohibit characters permitted by the NUG, although we recommend a more restricted list of characters in sect 2.3:
> It is recommended that variable, dimension, attribute and group names begin with a letter and be composed of letters, digits, and underscores. By the word letters we mean the standard ASCII letters uppercase `A` to `Z` and lowercase `a` to `z`. By the word digits we mean the standard ASCII digits `0` to `9`, and similarly underscores means the standard ASCII underscore `_`.
In the previous discussion on this issue, an important point was made, that many CF attributes identify netCDF variables or attributes by name, e.g. `coordinates="lat lon"`. Therefore:
Any valid character in a netCDF name might appear in one of these CF attributes.
Hence CF must allow any Unicode character in a string-valued attribute.
Since we allow `char` arrays as equivalent to `string`s, we can't restrict `char` arrays to ASCII only (your final point 1). Any Unicode character must be possible in a `char` array as well. Non-ASCII characters may already have been used in existing data, so we shouldn't restrict them now (following our usual generous principle).
On your final point 2, in my text above I proposed that we should require UTF-8 encoding for `char` arrays. We haven't said anything about this before, and we didn't provide a way to record the encoding, so for existing `char` data the only possibility is to guess what encoding was used, if it's not ASCII. I think we could justifiably do either of the following, but we must do one or the other in order for `char` data to be properly usable:

(a) Require UTF-8.
(b) Recommend UTF-8, but provide a new attribute to record the encoding.
Which of these should we do?
For `string` data, I suppose the encoding isn't our concern, is it? I assume that netCDF strings support Unicode. Any interface to netCDF must therefore do likewise, and we can leave it to the netCDF interface of whatever language we use to deal with the encoding of the `string` data the user provides in that language.
Best wishes
Jonathan
I was hoping that we could add a couple of sentences on this subject, rather than a new chapter. :-)
There's still hope :-)
NetCDF allows Unicode characters in names (of dimensions, variables and attributes). The relevant text from the [NUG v1.1]
Darn that google! -- I could have saved a lot of writing if I'd found that.
names may also include UTF-8 encoded Unicode characters
OK -- very good -- UTF-8 it is -- whew!
We've agreed that CF should not prohibit characters permitted by the NUG,
That's clear then.
By the word letters we mean the standard ASCII letters uppercase A to Z and lowercase a to z. By the word digits we mean the standard ASCII digits 0 to 9
So CF recommends, but does not require, ASCII-only for names -- OK then, that helps, but doesn't avoid the issue :-).
Any valid character in a netCDF name might appear in one of these CF attributes.
Hence CF must allow any Unicode character in a string-valued attribute.
Darn -- but it is what it is.
Since we allow char arrays as equivalent to strings, we can't restrict char arrays to ASCII only (your final point 1).
Also darn. :-)
On your final point 2, in my text above I proposed that we should require UTF-8 encoding for char arrays.
Makes sense to me. And, in fact, there is a very strong justification for this:
`char` arrays will only compare equal, at the binary level, if they are encoded the same way. This is critical, as many (most?) programming environments (C, Fortran) only work natively with raw binary data (e.g. `char*`). So it's pretty critical that all char (and string) data are encoded the same way.
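A two-line demonstration of that binary-comparison point (Python, for illustration):

```python
# Byte-level equality only holds when both sides use the same encoding.
s = "d\u00e9j\u00e0"  # "déjà"
assert s.encode("utf-8") != s.encode("latin-1")  # different byte sequences
assert s.encode("utf-8") == s.encode("utf-8")    # same encoding compares equal

# ASCII-only text is the safe case: identical bytes under either encoding.
assert "time".encode("utf-8") == "time".encode("latin-1")
```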
the only possibility is to guess what encoding was used, if it's not ASCII.
And guessing is never good :-(
I think we could justifiably do either of the following, but we must do one or the other in order for char data to be properly usable: -(a) Require UTF-8. -(b) Recommend UTF-8, but provide a new attribute to record the encoding.
Which of these should we do?
Requiring UTF-8 is the best way to go -- see the point above about raw `char*` data.
However, as I noted, an `_Encoding` attribute was proposed (but not accepted?) years ago, and it seems the Python library is using that attribute [1] (while defaulting to utf-8). So that cat may be out of the bag.
Whether there are files out in the wild with `_Encoding` set, I don't know -- but if there are, we probably don't want to make them invalid.
So, as much as I would like to simply require UTF-8, we probably need to say it's preferred, and the default, but other encodings can be used if defined in the `_Encoding` attribute.
However, for (global only?) attributes, rather than variable data, there is no way to set an `_Encoding` attribute. So UTF-8 in that case?
So:
- For variables: UTF-8 is preferred, and the default, but a different encoding can be used if the `_Encoding` attribute is set.
- For attributes: UTF-8 is required.
As for the content of an `_Encoding` attribute, it would be nice to standardize that -- the best I could find for encodings is:
Do we want to specify only those encodings? And only those spellings?
What about non-Unicode encodings, e.g. latin-1? If we can, it would be nice to keep it simple and only allow Unicode encodings (which gives you ASCII, as a subset of utf-8).
Here's a list of what Python supplies out of the box:
https://docs.python.org/3/library/codecs.html#standard-encodings
The ones in there that are "all languages" (Unicode) are, I think, the same as the official Unicode list :-).
Note that there are big- and little-endian versions of the multi-byte encodings -- as netCDF "endianness is solved by writing all data in big-endian order", I think only the big-endian forms should be allowed.
Finally, are the encoding spellings case-sensitive? e.g. the official spelling is "UTF-8" -- but Python, for instance, will accept: "utf-8", "UTF_8", etc.
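At least in Python, the codec registry already normalizes those spellings (this is Python behavior, not anything the netCDF spec promises):

```python
import codecs

# All of these spellings resolve to the same codec, whose canonical
# (lowercase) name is "utf-8".
for spelling in ("UTF-8", "utf-8", "UTF_8", "utf8"):
    assert codecs.lookup(spelling).name == "utf-8"
```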
For string data, I suppose the encoding isn't our concern, is it?
Unfortunately, it is :-(
I assume that netCDF strings support Unicode.
AFAICT, the only difference between a `char` array and a `string` is that the length of a `char` array is fixed -- that is, at the binary level, you get a `char*` (array of bytes) either way.
Turning that `char*` into a meaningful string requires that the encoding be known (unless you don't care what it means and just want to pass it along, which is fine). If you want to compare it with other values you don't need to know the encoding, but you do need to know that the two you are comparing are in the same encoding. Hence why utf-8 everywhere would be easiest.
we can leave it to the netCDF interface of whatever language we use to deal with the encoding of the string data the user provides in that language.
Unfortunately, no -- there is no language-independent concept of a "Unicode string"; there is only a string of bytes and an encoding. So netCDF strings are no easier (but also no harder) than char arrays in that regard. The encoding must be specified.
The good news is that we can use exactly the same rules for `char` arrays and `string`s.
-Chris
[1] -- a note about Python -- internally, Python (v3+) uses a native "Unicode" string data type: a "string" of Unicode code points. The encoding is an internal implementation detail (and quite complex). This makes Unicode very easy to work with in Python, but there is no way to create a Python `str` from binary data without knowing the encoding. This created a LOT of drama around filenames in Python 3 on *nix. On Unix, a filename is a `char*` with very few restrictions -- the encoding may not be known (and may be inconsistent within a file system!). Folks writing file-processing utilities for Unix wanted to be able to work with these filenames without decoding them -- and if all you need to do is pass them around and compare them, then there is no need to know the encoding. It got ugly, and Python finally introduced a workaround (PEP 383's surrogateescape error handler, in 3.1).
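For the curious, that workaround lets undecodable bytes survive a decode/encode round trip instead of raising an error:

```python
# A filename containing the byte 0xFF, which is never valid UTF-8:
raw = b"data_\xff.nc"

# Plain decoding fails...
try:
    raw.decode("utf-8")
except UnicodeDecodeError:
    pass

# ...but surrogateescape smuggles the bad byte through as a lone surrogate,
# so the original bytes can be recovered exactly:
name = raw.decode("utf-8", errors="surrogateescape")
assert name.encode("utf-8", errors="surrogateescape") == raw
```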
Attributes with a type of string are now possible with netCDF-4, and many examples of attributes with this type are "in the wild". As an example of how this is happening, IDL creates an attribute with this type if you select its version of `string` type instead of `char` type. It seems that people often assume that `string` is the correct type to use because they wish to store strings, not characters.

I propose to add verbiage to the Conventions to allow attributes that have a type of `string`. There are two ramifications to allowing attributes of this type, the second of which impacts string variables as well.

A `string` attribute can contain 1D atomic string arrays. We need to decide whether or not we want to allow these or limit them (at least for now) to atomic string scalars. Attributes with arrays of strings could allow for cleaner delimiting of multiple parts than spaces or commas do now (e.g. flag_values and flag_meanings could both be arrays), but this would be a significant stretch for current software packages.

A `string` attribute (and a `string` variable) can contain UTF-8 Unicode strings. UTF-8 uses variable-length characters, with the standard ASCII characters as the 1-byte subset. According to the Unicode standard, a UTF-8 string can be signaled by the presence of a special non-printing three-byte sequence known as a Byte Order Mark (BOM) at the front of the string, although this is not required. IDL (again, for example) writes this BOM sequence at the beginning of every attribute or variable element of type `string`.

Allowing attributes containing arrays of strings may open up useful future directions, but it will be more of a break from the past than attributes that have only single strings. Allowing attributes (and variables) to contain UTF-8 will free people to store non-English content, but it might pose headaches for software written in older languages such as C and FORTRAN.
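The BOM behaviour described above is easy to demonstrate (Python, for illustration; `codecs.BOM_UTF8` is the three-byte sequence in question):

```python
import codecs

# The UTF-8 byte order mark is the three bytes EF BB BF:
assert codecs.BOM_UTF8 == b"\xef\xbb\xbf"

# A string-typed value written with a leading BOM, as IDL reportedly does:
with_bom = codecs.BOM_UTF8 + "surface air temperature".encode("utf-8")

# The "utf-8-sig" codec strips the BOM on decode...
assert with_bom.decode("utf-8-sig") == "surface air temperature"

# ...while a plain "utf-8" decode keeps it as a visible U+FEFF character:
assert with_bom.decode("utf-8")[0] == "\ufeff"
```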
To finalize the change to support `string` type attributes, we need to decide what we want to allow in `string` attributes and (by extension) variables. Now that I have the background out of the way, here's my proposal.

Allow `string` attributes. Specify that the attributes defined by the current CF Conventions must be scalar (contain only one string).

Allow UTF-8 in attribute and variable values. Specify that the current CF Conventions use only ASCII characters (which are a subset of UTF-8) for all terms defined within. That is, the controlled vocabulary of CF (standard names and extensions, cell_methods terms other than free-text elements of comments(?), area type names, time units, etc.) is composed entirely of ASCII characters. Free-text elements (comments, long names, flag_meanings, etc.) may use any UTF-8 character.
Trac ticket: #176 but that's just a placeholder - no discussion took place there (@JonathanGregory 26 July 2024)