cf-convention / discuss

A forum for any discussion about interpretation, clarification, and proposals for changes or extensions to the CF conventions.

Composite/array values in string valued global attributes #340

Closed benjwadams closed 1 month ago

benjwadams commented 2 months ago

Hi, I'm getting a number of questions on whether string valued global attributes such as "references" can be composed of multiple values in https://cfconventions.org/Data/cf-conventions/cf-conventions-1.9/cf-conventions.html#description-of-file-contents. Should such attributes only consist of one value, or are two or more values considered acceptable as well?

Referencing issue (among others):

https://github.com/ioos/compliance-checker/issues/1093

JonathanGregory commented 2 months ago

Do you mean, can they be arrays of strings?

benjwadams commented 2 months ago

Do you mean, can they be arrays of strings?

Yes.

My understanding if I have read the NetCDF data model UML correctly is that attributes are arrays anyhow: https://docs.unidata.ucar.edu/netcdf-c/current/netcdf_data_model.html

Assigning dataset.some_attribute = ["value"] and then querying dataset.some_attribute will return "value" instead of a list in NetCDF4 Python. Obviously, that is library specific, but it occurs from time to time.
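Because a one-element array reads back as a plain string, code that consumes such an attribute may receive either a `str` or a list. A minimal sketch of a normalising helper follows. The name `as_string_list` is hypothetical (it is not part of netCDF4 Python), and the sketch assumes the attribute value has already been read into an ordinary Python object.

```python
def as_string_list(value):
    """Return an attribute value as a list of strings.

    Hypothetical helper: accepts a single str (which is what a
    one-element attribute reads back as) or a sequence of str.
    """
    if isinstance(value, str):
        return [value]
    return [str(item) for item in value]


print(as_string_list("value"))
print(as_string_list(["y", "z"]))
```

With a helper like this, downstream code can iterate over the result without caring which form the library returned.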

Other than the common single-element case above, are arrays of two or more elements allowed in these attributes?

JonathanGregory commented 2 months ago

I don't remember that we have considered this question. The text was formulated before strings were introduced. Arrays of strings previously could only be represented by two-dimensional character arrays. The document does not mention that two-dimensional character arrays might be expected in these global attributes, so I guess that no-one thought they would be needed. That's not the same thing as prohibiting them, but I imagine that existing programs which access these attributes will expect them to contain a single string. If the program was written using netCDF4 Python, it would expect a string to be returned, and receiving a list of strings would probably lead to some error.

Therefore I'm inclined to think that we should recommend that these attributes (history, references, etc.) not be arrays of strings or multidimensional character arrays, unless there is a persuasive need for them - is there? The text notes that you can embed newlines in them, and in fact recommends doing so when appropriate. What do you and others think?

taylor13 commented 2 months ago

I agree that we should explicitly say that when a CF attribute value is supposed to be a string, it should not be an array of strings. Of course, a compelling use case might justify relaxing this rule, but, as Jonathan noted, that would likely break some existing software.

ChrisBarker-NOAA commented 2 months ago

Agree on all counts -- there's a lot of precedent for using strings to represent multiple items via whitespace, delimiters, etc. As long as you can embed newlines, then there should be no need for an array.
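As a rough illustration of the delimiter approach, the sketch below packs several entries into one newline-separated string and recovers them again. The function names and the sample entries are placeholders invented for this example, not anything defined by CF.

```python
def join_entries(entries):
    """Pack several entries into one newline-separated string."""
    for entry in entries:
        if "\n" in entry:
            # An embedded newline would make the round trip ambiguous.
            raise ValueError("entries must not contain newlines")
    return "\n".join(entries)


def split_entries(text):
    """Recover the entries from a newline-separated attribute value."""
    return [line for line in text.split("\n") if line.strip()]


packed = join_entries(["first entry", "second entry"])
print(repr(packed))
print(split_entries(packed))
```

The obvious cost is that the delimiter can no longer appear inside an entry, which is the kind of limitation raised later in the thread.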

DocOtak commented 2 months ago

Pretty long discussion in https://github.com/cf-convention/cf-conventions/issues/141

The discussed use cases were for the history attribute, anything in CF that is "space separated" e.g. coordinates, flag definitions, etc...

pp-mo commented 2 months ago

FWIW my understanding of the CF spec is indeed that all attributes essentially are a 1-D "array" of values, but this is only "somewhat like" an array, since it is 1-dimensional at most.
In the Python interface, at least, a single element is also treated as a scalar -- i.e. crudely, it reads and writes as "x" not ["x"]. And if ["x"] is written, "x" will be read back.

However, my further understanding is that attributes containing multiple strings are not fully supported at least in the Python interface.
From experiment, string array attributes are entirely unsupported for NetCDF 3 files -- that is, you get an error if you attempt to create one. So, it has long been practice to encode multiple strings as a single "\n"-separated character array.

Also from experiment, multi-element string arrays can now be assigned to attributes, but these are encoded as variable-length strings, not an array of 'ordinary' strings. So the datatype of a single string and a string array are effectively different. (The Python interface now also has "setncattr_string", which is specific to the variable-length string type.) For demonstration:

>>> import os
>>> import netCDF4 as nc
>>> ds = nc.Dataset("tmp.nc", "w")
>>> ds.setncattr("q", ["one string"])
>>> ds.setncattr("a", "abc def\nfgh")
>>> ds.setncattr("x", ["y", "z"])
>>> ds.setncattr_string("a2", "abc def\nfgh")
>>> ds.setncattr_string("x2", ["y", "z"])
>>> 
>>> print(repr(ds.getncattr("q")))
'one string'
>>> print(repr(ds.getncattr("a")))
'abc def\nfgh'
>>> print(repr(ds.getncattr("x")))
['y', 'z']
>>> 
>>> ds.close()
>>> os.system("ncdump -h tmp.nc")
netcdf tmp {

// global attributes:
        :q = "one string" ;
        :a = "abc def\nfgh" ;
        string :x = "y", "z" ;
        string :a2 = "abc def\nfgh" ;
        string :x2 = "y", "z" ;
}

I have almost no experience of other language APIs, but I can see that in the C interface, nc_put_att_string and nc_get_att_string pass a "char **", analogous to the "int *" of nc_put_att_int and nc_get_att_int. And we are instructed to use nc_inq_attlen to determine how many elements there are. So, that does not rule out multiple "regular" strings, and possibly that limitation is confined to the Python library. Possibly, also, this was essentially different in the NetCDF3 API?

HTH, and I'd be keen to hear the situation in other languages from someone less Python-centric.

benjwadams commented 2 months ago

Also from experiment, multi-element string arrays can now be assigned to attributes, but these are encoded as variable-length strings, not an array of 'ordinary' strings. So the datatype of a single string and a string array are effectively different. (The Python interface now also has "setncattr_string", which is specific to the variable-length string type.) For demonstration:

Yes, this issue is more or less about variable-length string attributes with two or more elements. While the question mentions global attributes, it should nonetheless be extended to cover string/char attributes in CF in general.

DocOtak commented 2 months ago

@pp-mo what you are seeing there are differences between the netCDF classic and the netCDF enhanced data models. The classic data model has no support for variable length data types (aka strings). The only slightly unusual thing I've seen that might be python specific, is the netCDF4 python library will force the data type of attributes containing character arrays that have any chars outside the ASCII code point range into a string vlen type rather than a char array.

JonathanGregory commented 2 months ago

Like for the global attributes, we don't have statements in the convention about arrays of strings for variable attributes, since they didn't exist when much of the text was written. The CF string-valued attributes are all expected to contain a word or a blank-separated list of words. I'm not aware of any reason for needing to use an array of strings, and this would certainly not work with some existing software.

From the above discussion and cf-convention/cf-conventions#141, I'm inclined to think we should insert explicit statements in the convention text and correspondingly in the conformance document to say that any CF string-valued attributes must be either a 1D character array or a scalar string. Would anyone disagree with that? As with all aspects, this could be reconsidered in future.
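A checker for such a requirement could look roughly like the sketch below. It is only an illustration: `find_string_array_attributes` is a hypothetical name, and the object passed in is assumed to expose `ncattrs()` and `getncattr()` in the way netCDF4 Python `Dataset` and `Variable` objects do. Since that library returns a one-element string array as a plain string, this approach can only detect arrays of two or more elements.

```python
def find_string_array_attributes(obj, names):
    """Return the names in `names` whose value on `obj` is not a single str.

    `names` should list only attributes that are expected to be
    string-valued; numeric attributes would be reported as well.
    """
    present = set(obj.ncattrs())
    offenders = []
    for name in names:
        if name not in present:
            continue  # an absent attribute cannot violate the rule
        value = obj.getncattr(name)
        if not isinstance(value, str):
            offenders.append(name)
    return offenders
```

For an open dataset `ds`, a call such as `find_string_array_attributes(ds, ["history", "references"])` would list any of those attributes that read back as something other than a single string.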

taylor13 commented 2 months ago

Consistent with my earlier comment, I support Jonathan's comment immediately above.

DocOtak commented 1 month ago

Arrays of vlen strings as attributes have been a thing in netCDF for about 16 years now. It has been 6 years since cf-convention/cf-conventions#141 was started, and 4 years since that conversation went dormant. I think the future to consider supporting these types of values is now (or soon). For my own work, having to parse some CF custom string syntax (see e.g. cell_methods) or mangling my flag definitions has caused more issues than just dealing with features that have been in netCDF itself for my entire professional career. I'd support keeping the Conventions attribute as a 1d char array attribute, since that is what you would need to read to know if you might encounter more "advanced" features in a file.
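To make the flag-definition point concrete: under the current conventions, `flag_meanings` is a single blank-separated string, so a multi-word meaning has to be written with underscores or similar characters. The sketch below pairs such a string with `flag_values`; the function name and the sample values are made up for illustration.

```python
def decode_flags(flag_values, flag_meanings):
    """Map each flag value to its meaning.

    `flag_meanings` is assumed to be one blank-separated string, with
    multi-word meanings joined by underscores.
    """
    meanings = flag_meanings.split()
    if len(meanings) != len(flag_values):
        raise ValueError("flag_values and flag_meanings differ in length")
    return dict(zip(flag_values, meanings))


print(decode_flags([1, 2, 3], "good probably_good bad"))
```

With an array of strings, the meanings could contain spaces and no splitting step would be needed, which is the convenience being argued for here.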

I guess the above is just asking, if not now, when?

ChrisBarker-NOAA commented 1 month ago

Hmm -- is there any talk of CF 2.0?

In which we could start expecting "modern" netcdf, and stop supporting old stuff from COARDS (at least when talking about it in the docs, etc?)

We wouldn't want to do any major breaking changes, but a chance to remove some of the cruft of the past would be nice, and then we wouldn't need to have these debates ....

DocOtak commented 1 month ago

My feeling is that CF 2.0 gets talked about in a "one day" sense that just stops the conversation. In CF 1.11 right now netCDF4 groups and the vlen string dtype for variables are supported so it's already "modern" in some ways. Groups can be especially breaking if programs aren't expecting them. I know when I first started working with CF, outdated (dare I say false) statements like "dtype x aren't supported in netCDF" that used to be in the document made me suspicious that the conventions were not being updated or maintained.

Aside, I think for a CF 2.0, it would be neat to break it up into different parts that are "independent" which are then opted into via the Conventions attribute: e.g. "CF-2.0 CF-DSG" for a data file that is following the "core CF conventions" with the discrete sampling geometry extension or perhaps some externally defined standard like "CF-2.0 CF-Radial" for the radar data.
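A reader for such an opt-in scheme would start by splitting the Conventions attribute into its parts, roughly as sketched below. The names "CF-2.0" and "CF-DSG" are the hypothetical values from the comment above, and `parse_conventions` is an illustrative name. The sketch follows the existing rule that the list is space-separated, or comma-separated when a comma is present.

```python
def parse_conventions(value):
    """Split a Conventions attribute into individual convention names."""
    if "," in value:
        parts = value.split(",")
    else:
        parts = value.split()
    # Drop surrounding blanks and any empty pieces.
    return [part.strip() for part in parts if part.strip()]


print(parse_conventions("CF-2.0 CF-DSG"))
```

A program could then check whether the extension it cares about appears in the returned list before relying on the corresponding features.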

JonathanGregory commented 1 month ago

@DocOtak pointed out that it's been six years since cf-convention/cf-conventions#141 was started. I hope we can finish it now! I'm on a mission to bring ancient issues to conclusion, admittedly rather slowly.

I propose that we split the preamble of 2.6 on Attributes into two paragraphs, since it's quite a long paragraph, with a new sentence to begin the second paragraph, shown in bold in the following. The rest is all existing text. Also, I show in italic a sentence which I have moved from the very end to near the start and updated it to recognise there's now more than one list.

This standard describes many attributes (some mandatory, others optional). *See Appendices A "Attributes", F "Grid Mappings", and K "Mesh Topologies" for lists of attributes defined by this standard.* A file may also contain non-standard attributes. Such attributes do not represent a violation of this standard. Application programs should ignore attributes that they do not recognise or which are irrelevant for their purposes. Conventional attribute names should be used wherever applicable. Non-standard names should be as meaningful as possible. Before introducing an attribute, consideration should be given to whether the information would be better represented as a variable. In general, if a proposed attribute requires ancillary data to describe it, is multidimensional, requires any of the defined netCDF dimensions to index its values, or requires a significant amount of storage, a variable should be used instead.

**Any global or variable string-valued attribute described by this standard may be stored either as a netCDF scalar string or as a netCDF 1D character array; arrays of strings are not allowed for CF attributes.** When this standard defines string attributes that may take various prescribed values, the possible values are generally given in lower case. However, applications programs should not be sensitive to case in these attributes. Several string attributes are defined by this standard to contain "blank-separated lists". Consecutive words in such a list are separated by one or more adjacent spaces. The list may begin and end with any number of spaces.
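The two rules in that paragraph (blank-separated lists and case-insensitive prescribed values) can be sketched in a few lines. The function names are hypothetical and the sample values are only illustrative.

```python
def parse_blank_separated(value):
    """Split a blank-separated list on runs of one or more spaces.

    Leading and trailing spaces are ignored, as the text allows.
    """
    return [word for word in value.split(" ") if word]


def matches_prescribed(value, allowed):
    """Compare a string attribute to prescribed values, ignoring case."""
    return value.lower() in {item.lower() for item in allowed}


print(parse_blank_separated("  lat   lon "))
print(matches_prescribed("Point", ["point", "timeSeries"]))
```

Splitting on the space character alone, rather than on any whitespace, follows the wording "one or more adjacent spaces"; a more lenient reader might choose to accept tabs and newlines as well.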

In section 2.6 of the conformance document, I propose that we insert a new requirement: String-valued attributes defined by CF must be scalar strings or 1D character arrays; arrays of strings are not allowed.

Enough support has already been expressed for making this change in principle. What do you all think about the above proposals?

DocOtak commented 1 month ago

I think the restriction should be limited to only attributes that have their values controlled by the conventions, not simply if the convention defines the attribute name.

davidhassell commented 1 month ago

I think the restriction should be limited to only attributes that have their values controlled by the conventions, not simply if the convention defines the attribute name.

This seems a bit dangerous to me - e.g. we wouldn't want to find a long_name attribute that was an array of strings, would we? (I wouldn't!)

taylor13 commented 1 month ago

I too think relaxing the restriction would complicate things with very little benefit.

JonathanGregory commented 1 month ago

I agree. I think we should apply this requirement to all the string-valued attributes listed by CF, including those which are in the NUG, such as long_name and units. Such attributes are included in Appendix A. The requirement shouldn't apply to any attributes not described by CF.

larsbarring commented 1 month ago

Having just come to this issue and read cf-convention/cf-conventions#141, which @DocOtak pointed to, I find that my thoughts almost exactly match what Andrew writes.

The argument that some software will break if a new feature is introduced has, in my opinion, limited reach. I agree that it is a valid argument if something is really new and time is needed to adjust. But software simply has to be adapted to new challenges and requirements, be they urgent security threats or more slowly evolving user requirements. In this particular case, the fundamental library that CF almost totally depends on (netCDF) has offered this feature for quite some time (16 years), and in cf-convention/cf-conventions#141 (6 years ago) there were several voices in favour of adding arrays of strings to the acceptable CF attribute values, with concrete examples of why.

If the free text attribute values (i.e. not CVs) are considered as something directed (only) for human consumption, then having all information collected (or crammed) as space separated into an array of chars or one string is maybe acceptable. But if the attribute value is intended to also be parsed by software this easily becomes severely limiting.

Without a more concrete argument against it than "some software will break", I see no strong argument for not allowing arrays of strings. If this were to be introduced in CF-1.12 I do not expect that its use would suddenly explode, meaning that software developers should have time to make the necessary adjustments.

Currently CF does not have a mechanism for alerting software developers to what is coming, but in lieu of that, the 16 years or 6 years is maybe some sort of indication. Without such an alerting mechanism, I am afraid the argument that "some software will break" puts CF at risk of being preserved as a fossil. But this is a topic for another Discussion.

Speaking of Discussion, would it be possible to move this issue over to the new Discussions?