cf-convention / cf-conventions

AsciiDoc Source
http://cfconventions.org/cf-conventions/cf-conventions
Creative Commons Zero v1.0 Universal

Add support for attributes of type string #141

Open JimBiardCics opened 6 years ago

JimBiardCics commented 6 years ago

Attributes with a type of string are now possible with netCDF-4, and many examples of attributes with this type are "in the wild". As an example of how this is happening, IDL creates an attribute with this type if you select its version of string type instead of char type. It seems that people often assume that string is the correct type to use because they wish to store strings, not characters.

I propose to add verbiage to the Conventions to allow attributes that have a type of string. There are two ramifications to allowing attributes of this type, the second of which impacts string variables as well.

  1. A string attribute can contain 1D atomic string arrays. We need to decide whether or not we want to allow these or limit them (at least for now) to atomic string scalars. Attributes with arrays of strings could allow for cleaner delimiting of multiple parts than spaces or commas do now (e.g. flag_values and flag_meanings could both be arrays), but this would be a significant stretch for current software packages.
  2. A string attribute (and a string variable) can contain UTF-8 Unicode strings. UTF-8 uses variable-length characters, with the standard ASCII characters as the 1-byte subset. According to the Unicode standard, a UTF-8 string can be signaled by the presence of a special non-printing three byte sequence known as a Byte Order Mark (BOM) at the front of the string, although this is not required. IDL (again, for example) writes this BOM sequence at the beginning of every attribute or variable element of type string.

Allowing attributes containing arrays of strings may open up useful future directions, but it will be more of a break from the past than attributes that have only single strings. Allowing attributes (and variables) to contain UTF-8 will free people to store non-English content, but it might pose headaches for software written in older languages such as C and FORTRAN.
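To illustrate the BOM behavior described above, here is a minimal Python sketch (not tied to any netCDF API; `strip_bom` and the sample bytes are hypothetical) of how a reader could tolerate strings written both with and without the mark:

```python
# The UTF-8 Byte Order Mark is the three-byte sequence EF BB BF.
UTF8_BOM = b"\xef\xbb\xbf"

def strip_bom(raw: bytes) -> str:
    """Decode attribute bytes as UTF-8, dropping a leading BOM if present."""
    if raw.startswith(UTF8_BOM):
        raw = raw[len(UTF8_BOM):]
    return raw.decode("utf-8")

# A writer such as IDL may prepend the BOM; a tolerant reader accepts both forms.
print(strip_bom(b"\xef\xbb\xbfdegrees_north"))  # degrees_north
print(strip_bom(b"degrees_north"))              # degrees_north
```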

To finalize the change to support string type attributes, we need to decide:

  1. Do we explicitly forbid string array attributes?
  2. Do we place any restrictions on the content of string attributes and (by extension) variables?

Now that I have the background out of the way, here's my proposal.

Allow string attributes. Specify that the attributes defined by the current CF Conventions must be scalar (contain only one string).

Allow UTF-8 in attribute and variable values. Specify that the current CF Conventions use only ASCII characters (which are a subset of UTF-8) for all terms defined within. That is, the controlled vocabulary of CF (standard names and extensions, cell_methods terms other than free-text elements of comments(?), area type names, time units, etc) is composed entirely of ASCII characters. Free-text elements (comments, long names, flag_meanings, etc) may use any UTF-8 character.
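As a sketch of the proposed rule, a checker could verify that controlled-vocabulary terms stay within ASCII while leaving free-text attributes unrestricted (the function name and sample values below are illustrative, not part of CF):

```python
def is_ascii_only(text: str) -> bool:
    """True if every character falls in the 7-bit ASCII range (U+0000..U+007F)."""
    return all(ord(ch) <= 0x7F for ch in text)

# A controlled term such as a standard name stays ASCII...
print(is_ascii_only("air_temperature"))       # True
# ...while free text (long_name, comment, etc.) may use any UTF-8 character.
print(is_ascii_only("Température de l'air"))  # False
```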

Trac ticket: #176 but that's just a placeholder - no discussion took place there (@JonathanGregory 26 July 2024)

Dave-Allured commented 6 years ago

I am generally in support of this string attributes proposal, including UTF-8 characters. However, for CF controlled attributes, I recommend an explicit preference for type char rather than string. This is for compatibility with large amounts of existing user code that access critical attributes directly, and would need to be reworked for type string.

I suggest not including a constraint for scalar strings, simply because it seems redundant. I think that existing CF language strongly implies single strings in most cases of CF defined attributes.

ghost commented 6 years ago

How different is reading values from a string attribute compared to a string variable? If some software supports string variables shouldn't it support string attributes as well? If the CF is going to recommend char datatype for string-valued attributes, shouldn't the same be done for string-valued variables?

Prefixing the bytes of a UTF-8 encoded string with the BOM sequence is an odd practice. Although it is permitted, afaik it is not recommended.

Since what gets stored is always the bytes of one string in some encoding, always assuming UTF-8 should take care of the ASCII character set, too. This could cause issues if someone used another one-byte encoding (e.g. the ISO 8859 family), but I don't see how such cases could be easily resolved.

Storing Unicode strings using the string datatype makes more sense, since the number of bytes for such strings in UTF-8 encoding is variable.
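The ambiguity between one-byte encodings and UTF-8 can be sketched in Python. The fallback below is only a heuristic (many Latin-1 byte sequences are also valid UTF-8), which is exactly why assuming a single encoding is preferable:

```python
def decode_text(raw: bytes) -> str:
    """Try UTF-8 first; fall back to Latin-1 (ISO 8859-1) on invalid UTF-8.

    This cannot be reliable in general: some Latin-1 byte sequences
    also happen to be valid UTF-8, so they decode silently but wrongly.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("latin-1")

print(decode_text("naïve".encode("utf-8")))    # naïve
print(decode_text("naïve".encode("latin-1")))  # naïve
```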

JimBiardCics commented 6 years ago

This issue and issue https://github.com/cf-convention/cf-conventions/issues/139 are intertwined. There may be overlapping discussion in both.

JimBiardCics commented 6 years ago

@ajelenak-thg So I did some digging. I wrote a file with IDL and read it with C. There are no BOM prefixes. I guess some languages (such as Python) make assumptions one way or another about string attributes and variables, but it appears that it's all pretty straightforward.

JimBiardCics commented 6 years ago

@ajelenak-thg I agree that we should state that char attributes and variables should contain only ASCII characters.

JimBiardCics commented 6 years ago

@Dave-Allured When you say "CF-controlled attributes", are you referring to any values they may have, or to values that are from controlled vocabularies? It is true that applications written in C or FORTRAN will require code changes to handle string because the API and what is returned for string attributes and variables is different from that for char attributes and variables. Would a warning about avoiding string for maximum compatibility be sufficient?

Dave-Allured commented 6 years ago

@JimBiardCics, by "CF-controlled attributes", I mean CF-defined attributes within "the controlled vocabulary of CF" as you described above. By implication I am referring to any values they may have, including but not limited to values from controlled vocabularies.

A warning about avoiding data type string is notification. An explicit preference is advocacy. I believe the compatibility issue is important enough that CF should adopt the explicit preference for type char for key attributes.

Dave-Allured commented 6 years ago

The restriction that char attributes and variables should contain only ASCII characters is not warranted. The Netcdf-C library is agnostic about the character set of data stored within char attributes and char variables. UTF-8 and other character sets are easily embedded within strings stored as char data.

Therefore I suggest no mention of a character set restriction, outside of the CF controlled vocabulary. Alternatively you could establish the default interpretation of string data (both char and string data types) as the ASCII/UTF-8 conflation.

DocOtak commented 6 years ago

Hi all, I wasn't quite able to form this into coherent paragraphs, so here are some things to keep in mind re: UTF-8 vs other encodings:

My personal recommendation is that the only encoding for text in CF netCDF be UTF-8 in NFC with no byte order mark. For attributes where there is a desire to restrict what is allowed (through controlled vocabulary or other limitations), the restriction should be specified using Unicode code points, e.g. "only printing characters between U+0000 and U+007F are allowed in controlled attributes".

Text which is in controlled vocabulary attributes should continue to be char arrays. Freeform attributes (mostly those in 2.6.2. Description of file contents), could probably be either string or char arrays.

Dave-Allured commented 6 years ago

@DocOtak, you said "the netCDF classic format required UTF8 to be in Normalization Form Canonical Composition (NFC)". This restriction is only for netCDF named objects, i.e. the names of dimensions, variables, and attributes. There is no such restriction for data stored within variables or attributes.

DocOtak commented 6 years ago

@Dave-Allured yes, I reread the section, and object names do appear to be what it restricts. Should there be some consideration of specifying a normalization for the purposes of data in CF netCDF?

Text encoding probably deserves its own section in the CF document, perhaps under data types. The topic of text encoding can be very foreign to someone who thinks that "plain text" is a thing that exists in computing.

Dave-Allured commented 6 years ago

@DocOtak, for general text data, I think UTF-8 normalization is more of a best practice than a necessity for CF purposes. Therefore I suggest that CF remain silent about that, but include it if you feel strongly. Normalization becomes important for efficient string matching, which is why netCDF object names are restricted.

DocOtak commented 6 years ago

@Dave-Allured I don't know enough about the consequences of requiring a specific normalization. There is some interesting information on the Unicode website about normalization, which suggests that over 99% of Unicode text on the web is already in NFC. Also interesting is that combining NFC-normalized strings may not result in a new string that is normalized. It is also stated in the FAQ that "Programs should always compare canonical-equivalent Unicode strings as equal", so it's probably not an issue as long as the controlled vocabulary attributes have values with code points in the U+0000 to U+007F range (control characters excluded).
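The composed/decomposed distinction behind NFC can be shown with Python's standard unicodedata module:

```python
import unicodedata

# "é" can be stored as one code point (U+00E9, the NFC form) or as
# "e" plus a combining acute accent (U+0065 U+0301, the decomposed form).
composed = "\u00e9"
decomposed = "e\u0301"

# Byte-for-byte the two representations differ, even though they render the same.
print(composed == decomposed)                                # False
# Normalizing to NFC makes a direct comparison succeed.
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```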

hrajagers commented 6 years ago

@Dave-Allured and @DocOtak,

1) Most of the character/string attributes in the CF conventions contain a concatenation of sub-strings selected from a standardized vocabulary, variable names, and some numbers and separator symbols. It seems that for those attributes the discussion about the encoding is not so relevant, as these sub-strings contain only a very basic set of characters (assuming that variable names are not allowed to contain extended characters). Even for flag_meanings the CF conventions state "Each word or phrase should consist of characters from the alphanumeric set and the following five: '_', '-', '.', '+', '@'." If the alphanumeric set doesn't include extended characters, this again doesn't create any problems for encoding. The only attributes that might contain extended characters (and thus be influenced by this encoding choice) are attributes like long_name, institution, title, history, ... However, CF inherits most of them from the NetCDF User Guide, which explicitly states that they should be stored as character arrays (see NUG Appendix A). So, is it then up to CF to allow strings here? In short, I'm not sure the encoding is important for string/character attributes at this moment.

2) I initially raised the encoding topic in the related issue #139 because we want our model users to use local names for observation points and they will end up in label variables. In that context I would like to make sure that what I store is properly described.

JimBiardCics commented 6 years ago

@hrajagers Thanks for the pointer to NUG Appendix A. It's interesting to see in that text that character array, character string, and string are used somewhat interchangeably. I'm curious to know if the NUG authors looked at this section in light of allowing string type.

ghost commented 6 years ago

I think we are making good progress on this. I checked the Appendix A table of CF attributes and I think the following attributes can be allowed to hold string values as well as char:

All the other attributes should hold char values to maximize backward compatibility.

JimBiardCics commented 6 years ago

@ajelenak-thg Are you suggesting the other attributes must always be of type char, or that they should only contain the ASCII subset of characters?

ghost commented 6 years ago

Based on the expressed concern so far for backward compatibility I suggested the former: always be of type char. Leave the character set and encoding unspecified since the values of those attributes are controlled by the convention.

ghost commented 6 years ago

On the string encoding issue, CF data can be currently stored in two file formats: NetCDF Classic, and HDF5. String encoding information cannot be directly stored in the netCDF Classic format and the spec defines a special variable attribute _Encoding for that in future implementations. The values of this attribute are not specified so anything could be used.

In the HDF5 case, string encoding is an intrinsic part of the HDF5 string datatype and can only be ASCII or UTF-8. Both char and string datatypes in the context of this discussion are stored as HDF5 strings. This effectively limits what could be allowed values of the (future) _Encoding attribute for maximal data interoperability between the two file formats.

Dave-Allured commented 6 years ago

@hrajagers said: However CF inherits most of them [attributes] from the NetCDF User Guide which explicitly states that they should be stored as character arrays (see NUG Appendix A) So, is it then up to CF to allow strings here?

Yes, NUG Appendix A literally allows only char type attributes. My sense is that proponents believe that string type is compatible with the intent of the NUG, and also strings have enough advantages to warrant departure from the NUG.

Personally I think string type attributes are fine within collaborations where everyone is ready for any needed code upgrades. For exchanged and published data, char type CF attributes should be preferred explicitly by CF.

Dave-Allured commented 6 years ago

@ajelenak-thg said: In the HDF5 case, string encoding is an intrinsic part of the HDF5 string datatype and can only be ASCII or UTF-8. Both char and string datatypes in the context of this discussion are stored as HDF5 strings.

Actually the ASCII/UTF-8 restriction is not enforced by the HDF5 library. This is used intentionally by netcdf developers to support arbitrary character sets in netcdf-4 data type char, both attributes and variables. See netcdf issue 298. Therefore, data type char remains fully interoperable between netcdf-3 and netcdf-4 formats.

For example, this netcdf-4 file contains a char attribute and a char variable in an alternate character set. You will need an app or console window enabled for ISO-8859-1 to properly view the ncdump of this file.

JonathanGregory commented 6 years ago

Dear Jim

Thanks for addressing these issues. In fact you've raised two issues: the use of strings, and the encoding. These can be decided separately, can't they?

On strings, I agree with your proposal and subsequent comments by others that we should allow string, but we should recommend the continued use of char, giving as the reason that char will maximise the usability of the data, because of the existence of software that isn't expecting string. Recommend means that the cf-checker will give a warning if string is used. However it's not an error, and a given project could decide to use string.

For the attributes whose contents are standardised by CF e.g. coordinates, if string is used we should require a scalar string. This is because software will not expect arrays of strings. These attributes are often critical and so it's essential they can be interpreted. For CF attributes whose contents aren't standardised e.g. comment, is there a strong use-case for allowing arrays of strings?

I recall that at the meeting in Reading the point was made that arrays would be natural for flag_values and flag_meanings. I agree that the argument is stronger in that case because the words in those two attributes correspond one-to-one. Still, it would break existing software to permit it. Is there a strong need for arrays?

Best wishes

Jonathan

JimBiardCics commented 6 years ago

@JonathanGregory I agree with you. I think it would be fine to leave string array attributes out of the running for now. I also prefer the recommendation route.

Regarding the encoding, it seems to me that we could avoid a lot of complexity for now by making a simple requirement that all CF-defined terms and whitespace delimiters in string-valued attributes or variables be composed of characters from the ASCII character set. It wouldn't matter if people used Latin-1 (ISO-8859-1) or UTF-8 for any free text or free-text portions of contents, because they both contain the ASCII character set as a subset. The parts that software would be looking for would be parseable.
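A sketch of that parsing guarantee (the attribute values below are made up for illustration): as long as the delimiter is an ASCII space, splitting works the same regardless of how any free-text parts are encoded:

```python
# Hypothetical flag attributes; the delimiters are plain ASCII spaces.
flag_values = "0 1 2"
flag_meanings = "good suspect bad"

# Pair each value with its meaning by splitting on the ASCII space.
meanings = dict(zip(flag_values.split(" "), flag_meanings.split(" ")))
print(meanings)  # {'0': 'good', '1': 'suspect', '2': 'bad'}
```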

ethanrd commented 6 years ago

@JonathanGregory Another use-case (that I think came up during the Reading meeting) had the history attribute as a string array so that each element could contain the description of an individual processing step. I think easier machine readability was mentioned as a motivation.

JonathanGregory commented 6 years ago

Regarding the encoding, I agree that for attribute contents which are standardised by CF it is fine to restrict ourselves to ASCII, in both char and string. For these attributes, we prescribe the possible values (they have controlled vocabulary) and so we don't need to make a rule in the convention about it for the sake of the users of the convention. If we put it in the convention, it would be as guidance for future authors of the convention. I don't have a view about whether we should do this. It would be worth noting to users that whitespace, which often appears in a "blank-separated list of words", should be ASCII space. I agree that UTF-8 is fine for contents which aren't standardised.

Regarding arrays of strings, I realise I wasn't thinking clearly yesterday, sorry. As we've agreed, string attributes will not be expected by much existing software. Hence software has to be rewritten to support the use of strings in any case, and support for arrays of strings could be added at the same time, if it's really valuable. I don't see the particular value for the use of string arrays for comment - do other people? For flag_meanings, the argument was that it would allow a meaning to be a string which contained spaces (instead of being joined up with underscores, as is presently necessary); that is, it would be an enhancement to functionality.

Happy weekend - Jonathan

JonathanGregory commented 6 years ago

I meant to write, I don't see the particular value for the use of string arrays for history, which Ethan reminded us of. Why would this be more machine-readable?

JimBiardCics commented 6 years ago

@JonathanGregory The use of an array of strings for history would simplify denoting where each entry begins and ends as entries are added, appending or prepending a new array element each time rather than the myriad different ways people do it now. This would actually be a good thing to standardize.

I think we can just not mention string array attributes right now. The multi-valued attributes (other than history, perhaps) pretty much all specify how they are to be formed and delimited.
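The two history styles can be sketched as follows (the function and values are hypothetical, not a CF or netCDF API): with a scalar attribute the writer must pick some delimiter, while a string array makes each processing step a separate element:

```python
def append_history(history, entry, stamp):
    """Append one processing step to a history attribute.

    history: either a list of strings (string-array style) or a single
    string (scalar char/string style, newline-delimited here -- just one
    of the many ad-hoc delimiter conventions in current use).
    """
    line = f"{stamp}: {entry}"
    if isinstance(history, list):
        return history + [line]                       # string-array style
    return f"{history}\n{line}" if history else line  # scalar style

print(append_history(["2018-07-26: created"], "regridded to 1x1 deg", "2018-07-27"))
print(append_history("2018-07-26: created", "regridded to 1x1 deg", "2018-07-27"))
```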

cf-metadata-list commented 6 years ago

I think we can just not mention string array attributes right now.

Do we currently allow array of CHAR (i.e. 2D array) for attributes?

According to the netcdf docs;

The current version treats all attributes as vectors; scalar values are treated as single-element vectors.

Which makes me think no, that’s not possible.

I think allowing the string type should not change what’s allowable.

BTW, I suspect some client software (e.g. py_netCDF4) treat char and string the same ....

-CHB

kenkehoe commented 6 years ago

Let me throw a big wrench into this argument about not allowing string arrays.

  1. I would prefer a consistent decision and standard about the use of char vs. string so a user does not need to know where to use char array, scalar string, or string arrays.
  2. Use of string arrays with flag_meanings (not sure it would be needed with flag_values?) will solve many problems for my program to actually merge our standards with CF. Currently with char arrays we need to connect all the words for a single flag with underscores for space delimiting. Many of our variable names and attribute names contain underscores, so when the flag description is parsed and changed to be more human readable, the attribute and variable names are not preserved. Automated tools can no longer replace attribute or variable names with the attribute or variable value. We do this a lot. We also have lengthy descriptions for our flag_meanings. I would prefer to use flag_masks, flag_values and flag_meanings, as that general method is better than the one we currently employ.
  3. I do see the benefit of storing history as string arrays. Without checking date stamps, I can see how many times the file has been modified by checking the list length. It also removes any ambiguity about separators in the history attribute, which differ from the CF standard of space separation and are often institution defined. The current definition of the history attribute is "List of the applications that have modified the original data." In the Python world the use of "list" is different than the intended definition.
  4. I'm starting to get a lot of more complicated data that are multidimensional but do not share the same units. We would need to work with udunits, but Cf/Radial is proposing a new standard for complex data, which often have different units for different indexes of a second dimension. If we allowed string arrays in units we could store complex data or other data structures more native to the intended use, since udunits interprets space characters as multiplication, not as a delimiter.
  5. missing_value or _FillValue currently allow one value. For string-type data, allowing string arrays would permit multiple fill values, and by extension numeric data could also have multiple fill values defined; I'm sure there are many data sets that use multiple fill values but do not define them correctly in the data file.
  6. valid_range can be used with string data type
  7. Conventions attribute could group multiple indicators with the same class of conventions. For example ["CF-1.7", "Cf/Radial instrument_parameters radar_parameters", "ARM-1.3"]
  8. and on and on ....

I'm not suggesting the use of all these use cases, but this relatively small change can go a long way to improve the standard and future use of the data.

OK, I've made my case. I'll be quiet now.

Ken

-- Kenneth E. Kehoe, Research Associate, University of Oklahoma Cooperative Institute for Mesoscale Meteorological Studies, ARM Climate Research Facility Data Quality Office

ghost commented 6 years ago

I would also add:

  1. source attribute when holding many filenames, or
  2. references attribute with more than one reference identifier.

One long concatenated string is not the most appropriate container for a collection of string-valued things.

As @Dave-Allured's recent post and example file illustrate, specifying string encoding for the char datatype is burdened by the past. Adding the string datatype provides us a chance to do it a little bit better by explicitly stating Unicode character set and UTF-8 encoding.

A larger issue lurking in the background is how to signal file content that breaks backward compatibility. This is something we discussed at the Reading workshop but no way forward was laid out. This proposal is not going to be the only backward-incompatible one. For example, group hierarchies are coming.

ethanrd commented 6 years ago

@JimBiardCics said:

I'm curious to know if the NUG authors looked at this section in light of allowing string type.

No, the NUG has NOT been systematically reviewed with respect to the string type or other enhanced data model features. Clearly, the NUG should support the use of enhanced data model types and features (with appropriate cautions about backward compatibility and broad usability) and leave further restrictions to conventions. So, NUG Appendix A should probably clarify that the values of those attributes are strings that can be encoded in netCDF-3 as char arrays and in netCDF-4 as either char arrays or string type.

The Unidata netCDF group will work on updating the NUG (with user community input) in the fairly near-term.

JonathanGregory commented 6 years ago

Dear Ken et al.

I think we should consider the case of each attribute individually, since the uses and arguments are different for each. Perhaps it would be simpler first of all to agree Jim's proposal to allow strings as equivalent to char arrays in attributes, without introducing arrays of strings. Once that is agreed, we can talk about whether to allow arrays in separate issues for various attributes.

Best wishes

Jonathan

Dave-Allured commented 6 years ago

Here is a different compromise approach, in light of the multiple requests for string arrays. If this were a new design, then both scalar and array string attributes would be natural. Also, string support of any flavor will require code upgrades. I would prefer to make code upgrades once rather than twice, and adding string array support is not much harder than scalar string support by itself. Therefore:

The no-mix rule should make it easy to make general purpose parsing functions for CF simple list attributes, such that they can blindly distinguish and process both data types.

This approach sacrifices round trip generic conversions between attributes of the two data types. You would need to either have CF-aware utilities, or else simply don't convert. This restriction is not a problem for me.
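One plausible reading of such a no-mix parsing function, sketched in Python (the function is illustrative, not an existing library call): a scalar attribute holds a blank-separated list, a string array holds one item per element, and the parser blindly distinguishes the two:

```python
def parse_list_attribute(value):
    """Return the components of a CF simple-list attribute.

    Under a 'no-mix' assumption: a scalar (char or string) attribute
    holds a blank-separated list, while a string-array attribute holds
    one item per element, with no further splitting.
    """
    if isinstance(value, (list, tuple)):  # string-array attribute
        return list(value)
    return value.split()                  # scalar: blank-separated words

print(parse_list_attribute("lon lat"))       # ['lon', 'lat']
print(parse_list_attribute(["lon", "lat"]))  # ['lon', 'lat']
```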

JonathanGregory commented 6 years ago

Dear Dave

Maybe you'd do it like that if we were starting from scratch, but we aren't. We have to bear in mind the needs of users of the convention who write their own ad-hoc code. I would rather stick to our usual principle of not adding new possibilities in the convention unless there is a strong use-case, and even more so in situations, like this, when there is already an encoding that works fine. I'm sorry if that seems frustratingly conservative, but I believe it's a principle that has worked well for CF. There have been plenty of occasions when we've decided not to add a new way of doing something because we already have a satisfactory although less attractive way to do it.

Best wishes

Jonathan

Dave-Allured commented 6 years ago

@JonathanGregory, above I did not mean to exclude the current encoding of simple lists in char attributes. I meant to say:

With that clarification, do you still find the option of string array attributes to be more objectionable than scalar strings with delimiters?

cf-metadata-list commented 6 years ago

With that clarification, do you still find the option of string array attributes to be more objectionable than scalar strings with delimiters?

I do.

CHAR vs String is a relatively low-level implementation detail for encoding text.

So a netCDF lib (with no concern for CF) can transparently convert either one into a native string type.

For example py-netCDF4 can present the user with a Python string object in either case. So a delimited CHAR or String would look exactly the same to client code.

And CF aware client code can now deal with the delimiters appropriately.

In order for a library to present an array of strings the same way as a delimited CHAR array, it would need to be CF aware at a low level.

I think the basic principle should be to not add netCDF4-only features until netCDF4 can be assumed — presumably in CF2

Dave-Allured commented 6 years ago

I am unfamiliar with py-netCDF4 and python. In py-netCDF4, is there an existing function that parses CF simple list char attributes such as coordinates or flag_values into component strings? How common is it for application level code to parse these attributes directly, as opposed to using the library function?

DocOtak commented 6 years ago

@Dave-Allured This is my own personal experience, I do most of my netCDF work using the python xarray library, which is a wrapper around a few netCDF libraries, including the one from unidata. The char vs string in attributes is abstracted away such that I didn't even know that the netCDF "text" attributes weren't strings as they are cast/coerced into native python strings. This is different from how string vs char is handled in variable data in xarray. I very rarely use the python-netCDF4 library directly.

Dave-Allured commented 6 years ago

@DocOtak, "Abstractions are good." How does xarray handle coordinates and flag_values attributes?

Dave-Allured commented 6 years ago

Oops, I meant, when these text attributes contain multiple values with delimiters in the input file, does xarray return them as the original single python string, or as an array of strings?

cf-metadata-list commented 6 years ago

On Fri, Aug 3, 2018 at 12:54 PM, Dave Allured notifications@github.com wrote:

@DocOtak https://github.com/DocOtak, "Abstractions are good." How does xarray handle coordinates and flag_values attributes?

I'm pretty sure xarray does something "smart" with coordinates, but not with flag_values. It is not intended to comply with the CF data model. iris may handle flag_values for you -- though the docs mention:

When reading and writing NetCDF data, the CF ‘flag’ attributes, “flag_masks”, “flag_meanings” and “flag_values” are now preserved through Iris load and save.

makes me think no -- other than preserving them.

https://scitools.org.uk/iris/docs/latest/

However, that's not really the point -- we should expect folks to be able to work reasonably with non-cf, but netcdf-aware tools.

And a given programing environment may not have distinct CHAR and String data types, so they should be treated the same in CF.

-CHB

-- Christopher Barker, Ph.D., Oceanographer, NOAA/NOS/OR&R Emergency Response Division

DocOtak commented 6 years ago

@Dave-Allured coordinates are handled specially in that they are interpreted and kept around for various operations you might want to do with them. See http://xarray.pydata.org/en/stable/data-structures.html#coordinates

It will attempt to "decode CF" by default (http://xarray.pydata.org/en/stable/generated/xarray.decode_cf.html), but xarray is not a CF-specific library. It doesn't do anything special as far as I know with flag_values, or with ancillary_variables for that matter. The xarray maintainers have recommended iris if you want a fully CF-aware tool.

DocOtak commented 6 years ago

@Dave-Allured I did some tests: a string attribute with multiple entries will be presented as an array of strings by xarray in Python. I don't think it has any concept of delimiters within the string itself (e.g. breaking on whitespace).

As for the actual topic of adding string attributes to CF netCDF: are the CF version numbers meant to be semantic? (See https://semver.org/.) If the answer is even close to "yes", then it would probably exclude allowing a string type to represent any of the existing attributes currently defined in CF-1.x. Since all the more "complicated" values rely on some sort of character delimiter already, allowing them to exist in more than one data type is just added complexity without much benefit.
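As an aside, the delimiter-based status quo that the "complicated" values rely on can be sketched in a few lines of Python. `flag_map` is my name for illustration, not a library function; per CF, flag_meanings is a blank-separated list paired element-wise with the flag_values array:

```python
def flag_map(flag_values, flag_meanings):
    """Pair a CF flag_values array with a blank-separated
    flag_meanings char attribute, returning a dict."""
    meanings = flag_meanings.split()  # CF delimits meanings with blanks
    if len(meanings) != len(flag_values):
        raise ValueError("flag_values and flag_meanings length mismatch")
    return dict(zip(flag_values, meanings))

flag_map([0, 1, 2], "good questionable bad")
# {0: 'good', 1: 'questionable', 2: 'bad'}
```

A string-array flag_meanings would make the split step unnecessary, which is exactly the "cleaner delimiting" argument above.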

Dave-Allured commented 6 years ago

@DocOtak, thanks for testing python xarray. You said "a string attribute with multiple entries ". Please clarify. CF example 5.2 shows this attribute, which is data type char in common ncdump syntax:

T:coordinates = "lon lat" ;

Do you mean xarray currently presents this as a python array of two strings?

DocOtak commented 6 years ago

@Dave-Allured coordinates are a bad example, as xarray by default will remove the attribute and instead present a special coords property: a python dictionary (mapping data structure) with references to the actual data variables.

Assuming that it won't do the above, this is the behavior I've observed:

T:coordinates = "lon lat" ; will be a python string "lon lat"
string T:coordinates = "lon lat" ; will be a python string "lon lat"
string T:coordinates = "lon", "lat" ; will be a python list of strings ["lon", "lat"]

A python list with a single string ["lon lat"] appears to be encoded as a char array: T:coordinates = "lon lat" ;

I don't know how much of this is xarray doing magic, or the result of the python-netCDF4 library. I must admit that the last example would be very nice for enumerated values (e.g. flag definitions).

Do you or anyone else know what MATLAB does?

Dave-Allured commented 6 years ago

@DocOtak, I agree the "coordinates" attribute in xarray is a bad example of simply reading a text attribute. But it is also a good example of a lower layer fully encapsulating that functionality and thereby hiding the details. Encapsulating functions are part of my thinking about allowing string arrays for CF simple lists.

I do not know how MATLAB handles character and string attributes. However, I found that NCL automatically converts char attributes to scalar strings. Because of this, and the lack of an inquiry function, there is no good way at the NCL user level to distinguish char and string file attributes. The same may be true of python-netCDF4 and some other programming languages.

This ability to distinguish would be essential for my string array proposal to work. I come from a Fortran perspective where the raw file data type is right up front. Making a library function to handle this distinction would be natural. I feel this could be done for CF simple list attributes, for all languages, without much trouble.

JimBiardCics commented 6 years ago

@Dave-Allured @DocOtak @JonathanGregory Chris Barker (I'm finally back at it.) Thanks for your thoughts and investigations! As far as I'm aware, most general-purpose packages don't parse scalar string or char attributes into string arrays or anything of that sort. I think it's a good point that Chris and Andrew made that many modern netCDF APIs actively hide the difference between string and char attributes, in some cases making it hard to create a char attribute.

So, given all that, I like something along the lines of Jonathan's suggestion. Allow scalar string attributes as interchangeable with char attributes. Don't mention array string attributes. Note that older software may not handle string attributes. (Panoply, python-netCDF4, IDL, and MATLAB all handle string attributes well.) Leave the more "exotic" concepts (using arrays for multiple-element things like flag_meanings and Conventions) to CF 2.0.

ChrisBarker-NOAA commented 6 years ago

+1 on @JimBiardCics's proposal.

Dave-Allured commented 6 years ago

@JimBiardCics et al, I think string arrays for simple list attributes are the best single choice for the long term. It is likely that CF2 and other conventions will favor string arrays in the future. If you choose scalar strings for CF1, this will probably commit us to two different ways of handling string attributes later, in addition to the existing delimited character type. This is a messy future scenario that I want to avoid.

I assert without proof that the necessary upgrades to languages and user code for string arrays will be simple and straightforward. Add a function to detect the attribute's file data type, as needed. Use the data type, and nothing else, to decide when to parse on delimiters and when to assume an array. This way, there will be no future need to involve the convention version for this purpose.

There will be some short-term inconveniences in adapting to string arrays. Code can be adapted gradually as string arrays are encountered in new data sets. Also, as I said earlier, this entire process can be encapsulated in a CF-aware function for the specific list attributes, to simplify user code upgrades.
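A minimal pure-Python sketch of such an encapsulating function, assuming (per the python-netCDF4/xarray behavior observed earlier in this thread) that the netCDF layer surfaces char and scalar-string attributes as a Python str and string-array attributes as a sequence of str; `attribute_as_list` is an illustrative name, not an existing API:

```python
def attribute_as_list(value):
    """Dispatch on the attribute's data type, and nothing else:
    a str is the legacy char/scalar-string form and is split on
    blanks; anything else is assumed to be a string array and is
    used as-is."""
    if isinstance(value, str):
        return value.split()          # "lon lat" -> ["lon", "lat"]
    return [str(v) for v in value]    # string array passes through

attribute_as_list("lon lat")        # ['lon', 'lat']
attribute_as_list(["lon", "lat"])   # ['lon', 'lat']
```

Both encodings normalize to the same list, so user code above this function never needs to consult the convention version.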

My "vote" is that I am abstaining from the consensus on this. Please take my comments as suggestions; I leave the choice up to the rest of this capable group.

JimBiardCics commented 6 years ago

So, per @ChrisBarker-NOAA's comment on #139, I like the idea of stating that char attributes are constrained to ASCII (Latin-1?) and that string attributes should be treated as UTF-8. There's always the possibility of adding an encoding attribute at some later date if there is demand.
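For the UTF-8 side, handling the optional Byte Order Mark that writers like IDL prepend (noted at the top of this issue) is a one-liner. A minimal sketch, with `decode_string_attr` as a hypothetical helper name:

```python
def decode_string_attr(raw):
    """Decode a netCDF string attribute as UTF-8, stripping an
    optional leading Byte Order Mark (written by e.g. IDL).
    The BOM is the bytes EF BB BF, i.e. U+FEFF once decoded."""
    if isinstance(raw, bytes):
        raw = raw.decode("utf-8")
    return raw.lstrip("\ufeff")

decode_string_attr("\ufefflon lat")          # 'lon lat'
decode_string_attr(b"\xef\xbb\xbflon lat")   # 'lon lat'
```

Plain ASCII content passes through unchanged, since ASCII is the 1-byte subset of UTF-8.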

As much as I like @Dave-Allured's suggestion above, I think it's probably best to leave string array attributes to CF 2.0, or at least until a later date. It's a pretty pervasive change. It's not hard from a technical standpoint (and my organizational brain loves the idea!), but I think it will be confusing to the many 'less technical' scientists I encounter who already find netCDF and CF terribly confusing. There are also quite a few questions that would need to be resolved about how cell_methods and other attributes like it would be affected.

Thoughts?