Open JimBiardCics opened 6 years ago
I am generally in support of this string attributes proposal, including UTF-8 characters. However, for CF controlled attributes, I recommend an explicit preference for type `char` rather than `string`. This is for compatibility with large amounts of existing user code that accesses critical attributes directly and would need to be reworked for type `string`.
I suggest not including a constraint for scalar strings, simply because it seems redundant. I think that existing CF language strongly implies single strings in most cases of CF defined attributes.
How different is reading values from a `string` attribute compared to a `string` variable? If some software supports `string` variables, shouldn't it support `string` attributes as well? If CF is going to recommend the `char` datatype for string-valued attributes, shouldn't the same be done for string-valued variables?
Prefixing the bytes of a UTF-8 encoded string with the BOM sequence is an odd practice. Although it is permitted, as far as I know it is not recommended.
Since what gets stored are always the bytes of one string in some encoding, assuming UTF-8 always should take care of the ASCII character set, too. This could cause issues if someone used other one-byte encodings (e.g. ISO 8859 family) but I don't see how such cases could be easily resolved.
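The "assume UTF-8, fall back to a single-byte encoding" situation described above can be sketched in plain Python. This is an illustrative heuristic only (the function name is mine, and nothing here is specified by CF or the netCDF libraries); note that the Latin-1 fallback never fails, so it can silently mislabel data written in some other single-byte encoding:

```python
def decode_text_attribute(raw: bytes) -> str:
    """Decode attribute bytes, assuming UTF-8 with a Latin-1 fallback.

    Pure ASCII is a subset of UTF-8, so ASCII data always succeeds on
    the first attempt. ISO 8859-1 (Latin-1) maps every byte value to a
    code point, so the fallback always returns *something*.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")

print(decode_text_attribute(b"air_temperature"))        # plain ASCII
print(decode_text_attribute("Zürich".encode("utf-8")))  # valid UTF-8
print(decode_text_attribute("Zürich".encode("latin-1")))  # falls back
```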
Storing Unicode strings using the `string` datatype makes more sense, since the number of bytes for such strings in UTF-8 encoding is variable.
This issue and issue https://github.com/cf-convention/cf-conventions/issues/139 are intertwined. There may be overlapping discussion in both.
@ajelenak-thg So I did some digging. I wrote a file with IDL and read it with C. There are no BOM prefixes. I guess some languages (such as Python) make assumptions one way or another about string attributes and variables, but it appears that it's all pretty straightforward.
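For anyone doing similar digging on their own files, tolerating an optional UTF-8 BOM is a one-liner. This sketch is generic Python, not tied to any netCDF API:

```python
UTF8_BOM = b"\xef\xbb\xbf"  # the 3-byte UTF-8 encoding of U+FEFF

def strip_utf8_bom(raw: bytes) -> bytes:
    """Remove a leading UTF-8 BOM if present; other bytes pass through."""
    if raw.startswith(UTF8_BOM):
        return raw[len(UTF8_BOM):]
    return raw

# A writer such as IDL may prefix the BOM; a tolerant reader accepts both.
print(strip_utf8_bom(b"\xef\xbb\xbfdegrees_north"))  # b'degrees_north'
print(strip_utf8_bom(b"degrees_north"))              # b'degrees_north'
```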
@ajelenak-thg I agree that we should state that char attributes and variables should contain only ASCII characters.
@Dave-Allured When you say "CF-controlled attributes", are you referring to any values they may have, or to values that are from controlled vocabularies?
It is true that applications written in C or Fortran will require code changes to handle `string`, because the API and what is returned for `string` attributes and variables differ from those for `char` attributes and variables.

Would a warning about avoiding `string` for maximum compatibility be sufficient?
@JimBiardCics, by "CF-controlled attributes", I mean CF-defined attributes within "the controlled vocabulary of CF" as you described above. By implication I am referring to any values they may have, including but not limited to values from controlled vocabularies.
A warning about avoiding data type `string` is notification. An explicit preference is advocacy. I believe the compatibility issue is important enough that CF should adopt the explicit preference for type `char` for key attributes.
The restriction that `char` attributes and variables should contain only ASCII characters is not warranted. The netCDF-C library is agnostic about the character set of data stored within `char` attributes and `char` variables. UTF-8 and other character sets are easily embedded within strings stored as `char` data.

Therefore I suggest no mention of a character set restriction, outside of the CF controlled vocabulary. Alternatively, you could establish the default interpretation of string data (both `char` and `string` data types) as the ASCII/UTF-8 conflation.
Hi all, I wasn't quite able to form this into a coherent paragraph, so here are some things to keep in mind re: UTF-8 vs other encodings:
My personal recommendation is that the only encoding for text in CF netCDF be UTF-8 in NFC with no byte order mark. For attributes where there is a desire to restrict what is allowed (through controlled vocabulary or other limitations), the restriction should be specified using Unicode code points, e.g. "only printing characters between U+0000 and U+007F are allowed in controlled attributes".
Text which is in controlled vocabulary attributes should continue to be char arrays. Freeform attributes (mostly those in 2.6.2. Description of file contents), could probably be either string or char arrays.
@DocOtak, you said "the netCDF classic format required UTF8 to be in Normalization Form Canonical Composition (NFC)". This restriction is only for netCDF named objects, i.e. the names of dimensions, variables, and attributes. There is no such restriction for data stored within variables or attributes.
@Dave-Allured yes, I reread the section; object names do appear to be what it is restricting. Should there be some consideration of specifying a normalization for the purposes of data in CF netCDF?
Text encoding probably deserves its own section in the CF document, perhaps under data types. The topic of text encoding can be very foreign to someone who thinks that "plain text" is a thing that exists in computing.
@DocOtak, for general text data, I think UTF-8 normalization is more of a best practice than a necessity for CF purposes. Therefore I suggest that CF remain silent about that, but include it if you feel strongly. Normalization becomes important for efficient string matching, which is why netCDF object names are restricted.
@Dave-Allured I don't know enough about the consequences of requiring a specific normalization. There is some interesting information on the Unicode website about normalization, which suggests that over 99% of Unicode text on the web is already in NFC. Also interesting is that combining NFC-normalized strings may not result in a new string that is normalized. It is also stated in the FAQ that "Programs should always compare canonical-equivalent Unicode strings as equal", so it's probably not an issue as long as the controlled vocabulary attributes have values with code points in the U+0000 to U+007F range (control chars excluded).
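Python's standard library can illustrate the NFC point: a decomposed sequence ('e' plus a combining acute accent) compares unequal to the precomposed 'é', but both normalize to the same NFC form. This is a sketch using `unicodedata` (the helper name is mine), not anything CF mandates:

```python
import unicodedata

decomposed = "e\u0301"   # 'e' followed by COMBINING ACUTE ACCENT
precomposed = "\u00e9"   # LATIN SMALL LETTER E WITH ACUTE

print(decomposed == precomposed)  # False: different code point sequences
print(unicodedata.normalize("NFC", decomposed) == precomposed)  # True

# Per the Unicode FAQ, programs should compare canonical-equivalent
# strings as equal; normalizing both sides before comparing does that.
def canonical_equal(a: str, b: str) -> bool:
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)

print(canonical_equal(decomposed, precomposed))  # True
```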
@Dave-Allured and @DocOtak,
1) Most of the character/string attributes in the CF conventions contain a concatenation of sub-strings selected from a standardized vocabulary, variable names, and some numbers and separator symbols. It seems that for those attributes the discussion about the encoding is not so relevant, as these sub-strings contain only a very basic set of characters (assuming that variable names are not allowed to contain extended characters). Even for `flag_meanings` the CF conventions state "Each word or phrase should consist of characters from the alphanumeric set and the following five: '_', '-', '.', '+', '@'." If the alphanumeric set doesn't include extended characters, this again doesn't create any problems for encoding. The only attributes that might contain extended characters (and thus be influenced by this encoding choice) are attributes like long_name, institution, title, history, and so on. However, CF inherits most of them from the NetCDF User Guide, which explicitly states that they should be stored as character arrays (see NUG Appendix A). So, is it then up to CF to allow strings here? In short, I'm not sure the encoding is important for string/character attributes at this moment.
2) I initially raised the encoding topic in the related issue #139 because we want our model users to use local names for observation points and they will end up in label variables. In that context I would like to make sure that what I store is properly described.
@hrajagers Thanks for the pointer to NUG Appendix A. It's interesting to see in that text that character array, character string, and string are used somewhat interchangeably. I'm curious to know if the NUG authors looked at this section in light of allowing the `string` type.
I think we are making good progress on this. I checked the Appendix A table of CF attributes and I think the following attributes can be allowed to hold `string` values as well as `char`:
comment
external_variables
_FillValue
flag_meanings
flag_values
history
institution
long_name
references
source
title
All the other attributes should hold `char` values to maximize backward compatibility.
@ajelenak-thg Are you suggesting that the other attributes must always be of type `char`, or that they should only contain the ASCII subset of characters?
Based on the concern for backward compatibility expressed so far, I suggested the former: always be of type `char`. Leave the character set and encoding unspecified, since the values of those attributes are controlled by the convention.
On the string encoding issue, CF data can currently be stored in two file formats: netCDF Classic and HDF5. String encoding information cannot be directly stored in the netCDF Classic format, and the spec defines a special variable attribute `_Encoding` for that in future implementations. The values of this attribute are not specified, so anything could be used.
In the HDF5 case, string encoding is an intrinsic part of the HDF5 string datatype and can only be ASCII or UTF-8. Both `char` and `string` datatypes in the context of this discussion are stored as HDF5 strings. This effectively limits what could be allowed as values of the (future) `_Encoding` attribute for maximal data interoperability between the two file formats.
@hrajagers said:

> However CF inherits most of them [attributes] from the NetCDF User Guide which explicitly states that they should be stored as character arrays (see NUG Appendix A). So, is it then up to CF to allow strings here?
Yes, NUG Appendix A literally allows only `char` type attributes. My sense is that proponents believe that the `string` type is compatible with the intent of the NUG, and also that strings have enough advantages to warrant departure from the NUG.
Personally, I think `string` type attributes are fine within collaborations where everyone is ready for any needed code upgrades. For exchanged and published data, `char` type CF attributes should be preferred explicitly by CF.
@ajelenak-thg said:

> In the HDF5 case, string encoding is an intrinsic part of the HDF5 string datatype and can only be ASCII or UTF-8. Both char and string datatypes in the context of this discussion are stored as HDF5 strings.
Actually, the ASCII/UTF-8 restriction is not enforced by the HDF5 library. This is used intentionally by netCDF developers to support arbitrary character sets in the netCDF-4 data type `char`, for both attributes and variables. See netCDF issue 298. Therefore, data type `char` remains fully interoperable between netCDF-3 and netCDF-4 formats.
For example, this netCDF-4 file contains a `char` attribute and a `char` variable in an alternate character set. You will need an app or console window enabled for ISO-8859-1 to properly view the ncdump of this file.
Dear Jim
Thanks for addressing these issues. In fact you've raised two issues: the use of strings, and the encoding. These can be decided separately, can't they?
On strings, I agree with your proposal and subsequent comments by others that we should allow `string`, but we should recommend the continued use of `char`, giving as the reason that `char` will maximise the usability of the data, because of the existence of software that isn't expecting `string`. "Recommend" means that the cf-checker will give a warning if `string` is used. However, it's not an error, and a given project could decide to use `string`.
For the attributes whose contents are standardised by CF, e.g. `coordinates`, if `string` is used we should require a scalar string. This is because software will not expect arrays of strings. These attributes are often critical, and so it's essential that they can be interpreted. For CF attributes whose contents aren't standardised, e.g. `comment`, is there a strong use-case for allowing arrays of strings?
I recall that at the meeting in Reading the point was made that arrays would be natural for `flag_values` and `flag_meanings`. I agree that the argument is stronger in that case, because the words in those two attributes correspond one-to-one. Still, it would break existing software to permit it. Is there a strong need for arrays?
Best wishes
Jonathan
@JonathanGregory I agree with you. I think it would be fine to leave `string` array attributes out of the running for now. I also prefer the recommendation route.
Regarding the encoding, it seems to me that we could avoid a lot of complexity for now by making a simple requirement that all CF-defined terms and whitespace delimiters in string-valued attributes or variables be composed of characters from the ASCII character set. It wouldn't matter if people used Latin-1 (ISO-8859-1) or UTF-8 for any free text or free-text portions of contents, because both contain the ASCII character set as a subset. The parts that software would be looking for would be parseable.
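A requirement like this is easy to check mechanically. Here is a sketch (the function name is my own) of the test a checker could apply to CF-defined terms and delimiters, while leaving free-text portions unconstrained:

```python
def is_ascii_only(text: str) -> bool:
    """True if every character is in the ASCII range U+0000..U+007F.

    Under the proposal above, CF-defined terms and whitespace delimiters
    would have to pass this test; free-text portions would not.
    """
    return all(ord(ch) < 128 for ch in text)

print(is_ascii_only("lon lat"))            # True: a typical coordinates value
print(is_ascii_only("Zürich monitoring"))  # False: free text, non-ASCII char
```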
@JonathanGregory Another use-case (that I think came up during the Reading meeting) had the `history` attribute as a string array, so that each element could contain the description of an individual processing step. I think easier machine readability was mentioned as a motivation.
Regarding the encoding, I agree that for attribute contents which are standardised by CF it is fine to restrict ourselves to ASCII, in both `char` and `string`. For these attributes, we prescribe the possible values (they have controlled vocabulary), and so we don't need to make a rule in the convention about it for the sake of the users of the convention. If we put it in the convention, it would be as guidance for future authors of the convention. I don't have a view about whether we should do this. It would be worth noting to users that whitespace, which often appears in a "blank-separated list of words", should be ASCII space. I agree that UTF-8 is fine for contents which aren't standardised.
Regarding arrays of strings, I realise I wasn't thinking clearly yesterday, sorry. As we've agreed, `string` attributes will not be expected by much existing software. Hence software has to be rewritten to support the use of strings in any case, and support for arrays of strings could be added at the same time, if it's really valuable. I don't see the particular value of `string` arrays for `comment` - do other people? For `flag_meanings`, the argument was that it would allow a meaning to be a string which contained spaces (instead of being joined up with underscores, as is presently necessary); that is, it would be an enhancement to functionality.
Happy weekend - Jonathan
I meant to write: I don't see the particular value of `string` arrays for `history`, which Ethan reminded us of. Why would this be more machine-readable?
@JonathanGregory The use of an array of strings for history would simplify denoting where each entry begins and ends as entries are added, appending or prepending a new array element each time rather than the myriad different ways people do it now. This would actually be a good thing to standardize.
I think we can just not mention string array attributes right now. The multi-valued attributes (other than history, perhaps) pretty much all specify how they are to be formed and delimited.
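For illustration, the contrast between today's single delimited history string and the proposed array style can be sketched in plain Python. The function names and the timestamp format are mine, not anything CF defines:

```python
def append_history_scalar(history, entry, timestamp):
    """Current practice: one string; the newline separator used here is
    only one of the many conventions seen in the wild."""
    line = f"{timestamp}: {entry}"
    return f"{history}\n{line}" if history else line

def append_history_array(history, entry, timestamp):
    """Proposed style: one array element per processing step, so entry
    boundaries are unambiguous."""
    return history + [f"{timestamp}: {entry}"]

h = append_history_array([], "regridded to 1x1 degree", "2018-07-27T00:00:00Z")
print(h)  # ['2018-07-27T00:00:00Z: regridded to 1x1 degree']
```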
> I think we can just not mention string array attributes right now.
Do we currently allow arrays of CHAR (i.e. 2D arrays) for attributes? According to the netCDF docs:

> The current version treats all attributes as vectors; scalar values are treated as single-element vectors.

Which makes me think no, that's not possible.
I think allowing the string type should not change what’s allowable.
BTW, I suspect some client software (e.g. py-netCDF4) treats char and string the same ....
-CHB
Let me throw a big wrench into this argument about not allowing string arrays.
I'm not suggesting the use of all these use cases, but this relatively small change can go a long way to improve the standard and future use of the data.
OK, I've made my case; I'll be quiet now.
Ken
I would also add:

- the `source` attribute when holding many filenames, or
- the `references` attribute with more than one reference identifier.

One long concatenated string is not the most appropriate container for a collection of string-valued things.
As @Dave-Allured's recent post and example file illustrate, specifying string encoding for the `char` datatype is burdened by the past. Adding the `string` datatype provides us a chance to do it a little bit better by explicitly stating the Unicode character set and UTF-8 encoding.
A larger issue lurking in the background is how to signal file content that breaks backward compatibility. This is something we discussed at the Reading workshop but no way forward was laid out. This proposal is not going to be the only backward-incompatible one. For example, group hierarchies are coming.
@JimBiardCics said:

> I'm curious to know if the NUG authors looked at this section in light of allowing `string` type.
No, the NUG has NOT been systematically reviewed with respect to the `string` type or other enhanced data model features. Clearly, the NUG should support the use of enhanced data model types and features (with appropriate cautions about backward compatibility and broad usability) and leave further restrictions to conventions. So, NUG Appendix A should probably clarify that the values of those attributes are strings that can be encoded in netCDF-3 as `char` arrays and in netCDF-4 as either `char` arrays or the `string` type.
The Unidata netCDF group will work on updating the NUG (with user community input) in the fairly near-term.
Dear Ken et al.
I think we should consider the case of each attribute individually, since the uses and arguments are different for each. Perhaps it would be simpler first of all to agree on Jim's proposal to allow strings as equivalent to char arrays in attributes, without introducing arrays of strings. Once that is agreed, we can talk about whether to allow arrays in separate issues for various attributes.
Best wishes
Jonathan
Here is a different compromise approach, in respect of multiple requests for `string` arrays. If this were a new design, then both scalar and array `string` attributes would be natural. Also, `string` support of any flavor will require code upgrades. I would prefer to make code upgrades once rather than twice. Adding `string` array support is not much harder than `string` scalar support by itself. Therefore:

- Allow `string` scalar and array attributes.
- `char` attributes are preferred for backward compatibility.
- Don't mix parsing rules between `char` and `string` attributes. Require that CF simple lists be stored only as `string` arrays, not `string` scalars with delimiters.

The no-mix rule should make it easy to make general-purpose parsing functions for CF simple list attributes, such that they can blindly distinguish and process both data types.
This approach sacrifices round trip generic conversions between attributes of the two data types. You would need to either have CF-aware utilities, or else simply don't convert. This restriction is not a problem for me.
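A minimal sketch of such a "blind" parsing function in Python, assuming the no-mix rule (the function name and the whitespace delimiter are my choices; a real implementation would branch on the attribute's file data type as reported by the netCDF API rather than on Python types):

```python
def parse_simple_list(value):
    """Return a CF simple list attribute as a list of tokens.

    Under the proposed no-mix rule:
      - a char (scalar string) attribute holds a delimited list -> split it;
      - a string-array attribute is already a list -> use it as-is.
    """
    if isinstance(value, str):            # char attribute, delimited
        return value.split()
    if isinstance(value, (list, tuple)):  # string array, no delimiters
        return list(value)
    raise TypeError(f"unexpected attribute type: {type(value).__name__}")

print(parse_simple_list("lon lat"))       # ['lon', 'lat']
print(parse_simple_list(["lon", "lat"]))  # ['lon', 'lat']
```

Application code calling only this function would not need to know which data type the file actually used.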
Dear Dave
Maybe you'd do it like that if we were starting from scratch, but we aren't. We have to bear in mind the needs of users of the convention who write their own ad-hoc code. I would rather stick to our usual principle of not adding new possibilities in the convention unless there is a strong use-case, and even more so in situations, like this, when there is already an encoding that works fine. I'm sorry if that seems frustratingly conservative, but I believe it's a principle that has worked well for CF. There have been plenty of occasions when we've decided not to add a new way of doing something because we already have a satisfactory although less attractive way to do it.
Best wishes
Jonathan
@JonathanGregory, above I did not mean to exclude the current encoding of simple lists in `char` attributes. I meant to say:

Don't mix parsing rules between `char` and `string` attributes. Require that CF simple list attributes be stored as either:

- `char` attributes with delimiters, or
- `string` arrays without delimiters,

but not scalar strings with delimiters.

With that clarification, do you still find the option of `string` array attributes to be more objectionable than scalar strings with delimiters?
> With that clarification, do you still find the option of string array attributes to be more objectionable than scalar strings with delimiters?
I do.
CHAR vs String is a relatively low-level implementation detail for encoding text. So a netCDF lib (with no concern with CF) can transparently convert either one into a native string type.

For example, py-netCDF4 can present the user with a Python string object in either case, so a delimited CHAR or String would look exactly the same to client code. And CF-aware client code can then deal with the delimiters appropriately.

In order for a library to present an array of strings the same way as a delimited CHAR array, it would need to be CF-aware at a low level.
I think the basic principle should be to not add netCDF4-only features until netCDF4 can be assumed — presumably in CF2
I am unfamiliar with py-netCDF4 and Python. In py-netCDF4, is there an existing function that parses CF simple list `char` attributes such as `coordinates` or `flag_values` into component strings? How common is it for application-level code to parse these attributes directly, as opposed to using the library function?
@Dave-Allured This is my own personal experience: I do most of my netCDF work using the python xarray library, which is a wrapper around a few netCDF libraries, including the one from Unidata. The `char` vs `string` distinction in attributes is abstracted away, such that I didn't even know that the netCDF "text" attributes weren't strings, as they are cast/coerced into native python strings. This is different from how `string` vs `char` is handled in variable data in xarray. I very rarely use the python-netCDF4 library directly.
@DocOtak, "Abstractions are good." How does xarray handle the `coordinates` and `flag_values` attributes?
Oops, I meant, when these text attributes contain multiple values with delimiters in the input file, does xarray return them as the original single python string, or as an array of strings?
I'm pretty sure xarray does something "smart" with coordinates, but not with flag_values. It is not intended to comply with the CF data model. iris may handle flag_values for you -- though the docs (https://scitools.org.uk/iris/docs/latest/) mention:

> When reading and writing NetCDF data, the CF 'flag' attributes, "flag_masks", "flag_meanings" and "flag_values" are now preserved through Iris load and save.

which makes me think no -- other than preserving them.
However, that's not really the point -- we should expect folks to be able to work reasonably with non-cf, but netcdf-aware tools.
And a given programing environment may not have distinct CHAR and String data types, so they should be treated the same in CF.
-CHB
@Dave-Allured coordinates are handled specially, in that they are interpreted and kept around for various operations you might want to do with them. See http://xarray.pydata.org/en/stable/data-structures.html#coordinates

It will attempt to "decode CF" by default (http://xarray.pydata.org/en/stable/generated/xarray.decode_cf.html), but xarray is not a CF-specific library. It doesn't do anything special as far as I know with `flag_values`, or with `ancillary_variables` for that matter. The xarray maintainers have recommended iris if you want a fully CF-aware tool.
@Dave-Allured I did some tests: a string attribute with multiple entries will be presented as an array of strings by xarray in python. I don't think it has any concept of delimiters within the string itself (e.g. break on whitespace).
As for the actual topic of adding string attributes to CF netCDF: are the CF version numbers meant to be semantic? (see https://semver.org/) If the answer is even close to "yes", then it would probably exclude adding the ability to have a string type represent any of the existing attributes currently defined in CF-1.x. Since all the more "complicated" values rely on some sort of character delimiter already, allowing them to exist in more than one data type is just added complexity without much benefit.
@DocOtak, thanks for testing python xarray. You said "a string attribute with multiple entries". Please clarify. CF example 5.2 shows this attribute, which is data type `char` in common ncdump syntax:

`T:coordinates = "lon lat" ;`

Do you mean xarray currently presents this as a python array of two strings?
@Dave-Allured coordinates are a bad example, as xarray by default will remove the attribute and instead present a special `coords` property with a python dictionary (mapping data structure) with references to the actual data variables.
Assuming that it won't do the above, this is the behavior I've observed:

- `T:coordinates = "lon lat" ;` will be a python string `"lon lat"`
- `string T:coordinates = "lon lat" ;` will be a python string `"lon lat"`
- `string T:coordinates = "lon", "lat" ;` will be a python list with strings `["lon", "lat"]`
- a python list with a single string `["lon lat"]` appears to be encoded as a char array: `T:coordinates = "lon lat" ;`

I don't know how much of this is xarray doing magic, or the result of the python-netCDF4 library. I must admit that the last example would be very nice for the enumerated values (e.g. flag defs).
Do you or anyone else know what MATLAB does?
@DocOtak, I agree the "coordinates" attribute in xarray is a bad example of simply reading a text attribute. But it is also a good example of a lower level fully encapsulating that functionality, therefore hiding the details. Encapsulating functions are part of my thinking about allowing `string` arrays for CF simple lists.
I do not know how MATLAB handles character and string attributes. However, I found that NCL automatically converts `char` attributes to scalar strings. Because of this and the lack of another inquiry function, there is no good way at the NCL user level to distinguish `char` and `string` file attributes. The same may be true with python-netCDF4 and some other programming languages.
This ability to distinguish would be essential for my `string` array proposal to work. I come from a Fortran perspective, where the raw file data type is right up front. Making a library function to handle this distinction would be natural. I feel this could be done for CF simple list attributes, for all languages, without much trouble.
@Dave-Allured @DocOtak @JonathanGregory Chris Barker (I'm finally back at it.) Thanks for your thoughts and investigations! As far as I'm aware, most general-purpose packages don't parse scalar string or char attributes into string arrays or anything of that sort. I think it's a good point that Chris and Andrew made that many modern netCDF APIs actively hide the difference between string and char attributes, in some cases making it hard to create a char attribute.
So, given all that, I like something along the lines of Jonathan's suggestion. Allow scalar string attributes as interchangeable with char attributes. Don't mention array string attributes. Note that older software may not handle string attributes. (Panoply, python-netCDF4, IDL, and MATLAB all handle string attributes well.) Leave the more "exotic" concepts (using arrays for multiple-element things like flag_meanings and Conventions) to CF 2.0.
+1 on @JimBiardCics's proposal.
@JimBiardCics et al, I think `string` arrays for simple list attributes are the best single choice for the long term. It is likely that CF2 and other conventions will favor `string` arrays in the future. If you choose scalar strings for CF1, this will probably commit to two different ways to handle `string` attributes later, in addition to the existing delimited `char` type. This is a messy future scenario that I want to avoid.
I assert without proof that the necessary upgrades to languages and user code for `string` arrays will be simple and straightforward. Add a function to detect the attribute's file data type, as needed. Use the data type, and nothing else, to decide when to parse on delimiters and when to assume an array. This way, there will be no future need to involve the convention version for this purpose.
There will be some short-term inconveniences in adapting to `string` arrays. Code can be adapted gradually as `string` arrays are encountered in new data sets. Also, as I said earlier, this entire process can be encapsulated in a CF-aware function for the specific list attributes, to simplify user code upgrades.
My "vote" is I am abstaining from the consensus on this. Please take my comments as suggestions, and I leave the choice up to the rest of this capable group.
So, per @ChrisBarker-NOAA's comment on #139, I like the idea of stating that char attributes are constrained to ASCII (Latin-1?), and that string attributes should be treated as UTF-8. There's always the possibility of adding an encoding attribute at some later date if there is demand.
As much as I like @Dave-Allured's suggestion above, I think it's probably best to leave string array attributes to CF 2.0, or at least until a later date. It's a pretty pervasive change. It's not hard from a technical standpoint (and my organizational brain loves the idea!), but I think it will be confusing to a number of 'less technical' scientists I encounter who find netCDF and CF terribly confusing already. There are also quite a few questions that would need to be resolved about how `cell_methods` and other attributes like it would be affected.
Thoughts?
Attributes with a type of string are now possible with netCDF-4, and many examples of attributes with this type are "in the wild". As an example of how this is happening, IDL creates an attribute with this type if you select its version of `string` type instead of `char` type. It seems that people often assume that `string` is the correct type to use because they wish to store strings, not characters.

I propose to add verbiage to the Conventions to allow attributes that have a type of `string`. There are two ramifications to allowing attributes of this type, the second of which impacts string variables as well.

1. A `string` attribute can contain 1D atomic string arrays. We need to decide whether or not we want to allow these or limit them (at least for now) to atomic string scalars. Attributes with arrays of strings could allow for cleaner delimiting of multiple parts than spaces or commas do now (e.g. flag_values and flag_meanings could both be arrays), but this would be a significant stretch for current software packages.
2. A `string` attribute (and a `string` variable) can contain UTF-8 Unicode strings. UTF-8 uses variable-length characters, with the standard ASCII characters as the 1-byte subset. According to the Unicode standard, a UTF-8 string can be signaled by the presence of a special non-printing three-byte sequence known as a Byte Order Mark (BOM) at the front of the string, although this is not required. IDL (again, for example) writes this BOM sequence at the beginning of every attribute or variable element of type `string`.

Allowing attributes containing arrays of strings may open up useful future directions, but it will be more of a break from the past than attributes that have only single strings. Allowing attributes (and variables) to contain UTF-8 will free people to store non-English content, but it might pose headaches for software written in older languages such as C and FORTRAN.
To finalize the change to support `string` type attributes, we need to decide:

1. whether to allow arrays of strings in `string` attributes, and
2. whether to allow UTF-8 in `string` attributes and (by extension) variables.
attributes and (by extension) variables?Now that I have the background out of the way, here's my proposal.
Allow `string` attributes. Specify that the attributes defined by the current CF Conventions must be scalar (contain only one string).

Allow UTF-8 in attribute and variable values. Specify that the current CF Conventions use only ASCII characters (which are a subset of UTF-8) for all terms defined within. That is, the controlled vocabulary of CF (standard names and extensions, cell_methods terms other than free-text elements of comments(?), area type names, time units, etc.) is composed entirely of ASCII characters. Free-text elements (comments, long names, flag_meanings, etc.) may use any UTF-8 character.
Trac ticket: #176 but that's just a placeholder - no discussion took place there (@JonathanGregory 26 July 2024)