NCAR / CfRadial

NetCDF CF Conventions for radial coordinate data for RADAR and LIDAR
BSD 2-Clause "Simplified" License

Need to clarify NetCDF type for string attributes #1

Open marack opened 5 years ago

marack commented 5 years ago

The current CfRadial 2 draft (2019-02-03) specifies the type of attributes as either string, int, float, double, string[], or array of same type as field data.

For most of these types the mapping to a concrete NetCDF data type is obvious (for example, int to NC_INT, float to NC_FLOAT, and double to NC_DOUBLE).

For the string attributes things are a little more complicated, because the NetCDF API has two functions for writing string-based attributes: nc_put_att_text and nc_put_att_string.

The nc_put_att_text function writes the attribute as a 1D array of NC_CHAR. This is the traditional and most common way to write a scalar string attribute.

The nc_put_att_string function writes the attribute as an NC_STRING. This API allows us to output arrays of strings, and is thus the only option for our string[] attributes.

These two methods of writing strings result in fundamentally different types in the output file. The difference is visible in ncdump output:

    short KDP(time, range) ;
        string KDP:ancillary_variables = "foo" ;
        KDP:legend_xml = "bar" ;

Here the ancillary_variables attribute was written with nc_put_att_string (with a single string passed), while legend_xml was written with nc_put_att_text.
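To make the distinction concrete, here is a minimal sketch that classifies an attribute line from ncdump output by how it was stored. The function name `attribute_storage` and the regular expression are illustrative assumptions, not part of any NetCDF tool; the sketch relies only on the convention shown above, where NC_STRING attributes carry a `string` prefix and NC_CHAR attributes appear as a bare quoted value.

```python
import re

# Matches one attribute line of ncdump (CDL) output, for example:
#   string KDP:ancillary_variables = "foo" ;
#   KDP:legend_xml = "bar" ;
_ATTR_LINE = re.compile(
    r'^\s*(?:(?P<type>\w+)\s+)?(?P<var>\w*):(?P<name>\w+)'
    r'\s*=\s*(?P<value>.+?)\s*;\s*$'
)


def attribute_storage(line):
    """Return (attribute_name, kind) for a CDL attribute line.

    kind is 'NC_STRING' when the line has a 'string' type prefix,
    'NC_CHAR' when it has no prefix and a quoted value, else 'other'.
    """
    m = _ATTR_LINE.match(line)
    if m is None:
        raise ValueError(f"not an attribute line: {line!r}")
    if m.group("type") == "string":
        kind = "NC_STRING"
    elif m.group("type") is None and m.group("value").startswith('"'):
        kind = "NC_CHAR"
    else:
        kind = "other"
    return m.group("name"), kind
```

Run over the two lines above, this reports ancillary_variables as NC_STRING and legend_xml as NC_CHAR, which is the difference a reader of the file has to cope with.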

We probably need to clarify that string in the CfRadial 2 specification maps to the traditional array of NC_CHAR as output by nc_put_att_text, while string[] maps to an array of the NC_STRING type as output by nc_put_att_string.

kenkehoe commented 5 years ago

After looking into this with Python and xarray I agree that "string" should map to character array. Initially I was going to suggest we encourage using scalar strings, but I think that will be difficult for users to implement consistently with third-party tools. But does this mean we discourage the use of the scalar string type?

marack commented 5 years ago

I've just realized that there are also string variables specified by CfRadial. The variables API also has dedicated functions corresponding to NC_CHAR and NC_STRING (nc_put_var_text and nc_put_var_string).

Unfortunately, using NC_CHAR for a variable would require adding a dimension to capture the length of the string. It would also make the strings fixed length, so a user could never append to them. I think this means that for variables we are forced to use the NC_STRING type.
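The fixed-length constraint can be illustrated without the NetCDF library at all. The sketch below mimics how strings would have to be laid out in a 2D NC_CHAR variable: every row is padded to a shared width (the extra string-length dimension), and a value longer than that width cannot be stored. The function name `pack_fixed_width` and the sample values are hypothetical, chosen only for illustration.

```python
def pack_fixed_width(strings, width=None):
    """Pack strings into rows of exactly `width` bytes, NUL-padded.

    `width` plays the role of the extra string-length dimension that an
    NC_CHAR variable needs. If omitted, it is set to the longest string.
    """
    encoded = [s.encode("utf-8") for s in strings]
    if width is None:
        width = max((len(b) for b in encoded), default=0)
    rows = []
    for b in encoded:
        if len(b) > width:
            # Once the dimension is fixed, longer values no longer fit.
            raise ValueError(f"{b!r} does not fit in string length {width}")
        rows.append(b.ljust(width, b"\x00"))
    return width, rows


# Illustrative values only.
width, rows = pack_fixed_width(["sweep", "rhi"])
# width is 5; "rhi" is padded out to b"rhi\x00\x00"
```

Calling `pack_fixed_width(["azimuth_surveillance"], width=5)` raises ValueError, which is the situation a writer hits when a later value is longer than the dimension chosen up front.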

This means that only scalar attributes would be NC_CHAR, which now feels inconsistent.

To complicate matters, the parent CF standard (1.8 draft) explicitly defines strings as being character arrays. This means that users will expect attributes that come from CF (title, standard_name, etc.) to be of the NC_CHAR type.

kenkehoe commented 5 years ago

If we are only talking about CF/Radial 2, then I think switching to strings for attributes is reasonable. Since CF/Radial uses netCDF4 with groups, we can consider making other larger changes that will break backwards compatibility. CF is not going to adopt string attributes soon. Maybe in CF-2.0. But if we do decide to use strings for scalars, we might as well switch all attributes that do use or could use multiple elements to string or string arrays. Currently the attributes that have multiple pieces of information use the CF space delimiter. I'm in favor of using scalar strings and string arrays in CF/Radial 2. It will most likely break some tools, but they can be updated.
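As a rough illustration of the trade-off between the CF space delimiter and a string array, the sketch below converts between the two forms. The helper names `split_cf_list` and `join_cf_list` and the field names in the example are hypothetical. The point it shows is that a space-delimited attribute cannot carry an item that itself contains whitespace, whereas a string array has no such restriction.

```python
def split_cf_list(value):
    """Split a CF-style space-delimited attribute into its items."""
    return value.split()


def join_cf_list(items):
    """Join items into one space-delimited attribute value.

    Raises ValueError for an item that is empty or contains whitespace,
    because it could not be recovered by splitting on whitespace.
    """
    for item in items:
        if not item or any(ch.isspace() for ch in item):
            raise ValueError(f"cannot space-delimit item: {item!r}")
    return " ".join(items)


# Hypothetical field names, for illustration only.
names = split_cf_list("KDP_quality KDP_error")
```

Here `names` is `["KDP_quality", "KDP_error"]`, and `join_cf_list(names)` gives back the original single string. A value such as `"two words"` is rejected by `join_cf_list`, while it would be a perfectly valid element of a string[] attribute.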

The difficult part is that many netCDF writers will need to be updated to force the attributes to be strings instead of char arrays. My test with xarray and Python required me to convert an attribute that was read from a char array to a string before writing. This will make it harder for users writing data to be CF/Radial compliant. In xarray it was not hard to write string array attributes, but I don't know if it will be easy (or possible) to force xarray to write a scalar string. I think it defaults to character array for scalars.

mike-dixon commented 5 years ago

I believe for CfRadial-2 we should specify the use of NetCDF-4 type strings everywhere, rather than the use of char[].

One thing we could review is our use of strings instead of enums, as in platform_type, instrument_type etc. Perhaps it would be cleaner to use enums, and some people have commented on that. The main reason to stick with strings is to match the philosophy of CfRadial-1.

kenkehoe commented 5 years ago

Just an FYI. Using xarray with Python I was not able to write a scalar string attribute. Even if the string is a scalar in the xarray object in memory, when xarray writes it to a netCDF4 file it will auto-convert it to a char array. I think this is fine, but we should not prohibit the use of a character array for writing a scalar string attribute, since I don't currently know how to avoid that with xarray and Python.
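If both storage forms end up being allowed, a reader can treat them uniformly once the value is in memory. The sketch below is one way that might look; the function name `as_string_list` is hypothetical, and the assumption that a reading library hands back either a single str or bytes value (char array or single string) or a sequence of them (string array) should be checked against whichever library is actually used.

```python
def as_string_list(value):
    """Normalize an attribute value to a list of str.

    Accepts a single str or bytes value, or a sequence of them, so the
    caller does not need to know how the attribute was stored on disk.
    """
    if isinstance(value, bytes):
        return [value.decode("utf-8")]
    if isinstance(value, str):
        return [value]
    return [v.decode("utf-8") if isinstance(v, bytes) else str(v)
            for v in value]
```

With this, `as_string_list("foo")` and `as_string_list(["foo"])` both yield `["foo"]`, so downstream code does not depend on whether the writer was able to emit a true scalar string.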