Conventions for string and character array encoding

rsignell-usgs commented 7 years ago

As discussed here https://github.com/Unidata/netcdf4-python/issues/654#issuecomment-298284181, there is a need for conventions to specify the encoding of strings and character arrays in netcdf.

There is also a need to specify whether char arrays in NetCDF3 contain strings or character arrays.

@BobSimons addressed these issues in an enhancement to CF conventions that would specify charset for NetCDF3 and _Encoding for NetCDF4, and the Unidata gang (@DennisHeimbigner, @WardF, @ethanrd and @cwardgar) agreed with the concept, but suggested this be handled in the NUG and we came up with this slightly different proposal that would still accomplish Bob's goals of making it easy for software to figure out what is stuffed in those char or string arrays!

Proposal:

Use _CharType variable attribute with allowed values ['STRING', 'CHAR_ARRAY'] to specify if a char array variable should be interpreted as a string or as an array of individual characters. If _CharType is missing, default is 'STRING'.
Use _Encoding variable attribute with allowed values ['ISO-8859-1', 'ISO-8859-15', 'UTF-8'] to specify the encoding. If _Encoding is missing for _CharType='STRING', default is 'UTF-8'. If _Encoding is missing for _CharType='CHAR_ARRAY', default is 'ISO-8859-15'.

ethanrd commented 7 years ago

Should an _Encoding attribute on a 'char' typed variable be restricted to a 7- or 8-bit encoding?

ethanrd commented 7 years ago

As @DennisHeimbigner mentions here https://github.com/Unidata/netcdf4-python/issues/654#issuecomment-298735562, this proposal does deals only with char or String typed variables, not char or String typed attributes.

jswhit commented 7 years ago

Why wouldn't _Encoding apply to attributes as well as variable data?

DennisHeimbigner commented 7 years ago

Because netcdf does not support attributes for attributes. We would need to come up with some kind of convention for this: a second attribute that could be interpreted as applying to the string/char attribute. Alternate is to define a global encoding for all attributes. =Dennis Heimbigner Unidata

On 5/3/2017 8:01 AM, Jeff Whitaker wrote:

Why wouldn't |_Encoding| apply to attributes as well as variable data?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Unidata/netcdf-c/issues/402#issuecomment-298919846, or mute the thread https://github.com/notifications/unsubscribe-auth/AA3P2yhdO2_-QxIAH57sQUlvJMbiB3I-ks5r2IjKgaJpZM4NOoaO.

rsignell-usgs commented 7 years ago

@DennisHeimbigner, I guessing you can answer @ethanrd question:

Should an _Encoding attribute on a char typed variable be restricted to a 7- or 8-bit encoding?

DennisHeimbigner commented 7 years ago

Should an _Encoding attribute on a char typed variable be restricted to a 7- or 8-bit encoding?

If an _Encoding is specified, then that technically determines 7 vs 8 bit. E.g. Ascii is 7 bit, but ISO-Latin-8859-1 is 8-bit. The tricky case is when something like utf-8 encoding is specified. Technically, the single character subset of utf-8 is 7-bit ascii. But, it is clear that some users treat an array of chars as a string, in which case any legal utf-8 char bit pattern should be legal. IMO, we should always treat char as essentially equivalent to unsigned byte so that a char can hold any 8-bit bit pattern so that e.g. _Encoding = "iso-latin-8859-1" is legal and does not lose information.

ethanrd commented 7 years ago

I suggested always indicating the encoding/charset with the same attribute (_Encoding), whether string or character, thinking it simplified things. Now that restrictions on allowed encodings has come up, I'm seeing the wisdom of @BobSimons original proposal to the CF list with one attribute that gives a string encoding for use when interpreting strings and one that gives the character set for use when interpreting individual characters. (It avoids the need for different restrictions on the value of _Encoding depending on the situation.)

So, as an alternate to the above proposal, I'll restate Bob's proposal here with a change or two given the target is the NUG rather than CF:

Use the _CharSet variable attribute to indicate that a char array should be interpreted as individual 8-bit characters. The value of the attribute gives the 8-bit character set to use when interpreting the 8-bit characters. (E.g., 'ISO-8859-15'.)
Use the _Encoding variable attribute to indicate which character encoding should be used to interpret a string variable. Used with a char array, the attribute indicates that it should be interpreted as a string (or an array of strings).

Reviewing Bob's original proposal brought up a number of questions on how the netCDF-4 and HDF5 libraries handle string encoding (if they enforce the encoding or not, etc.). I'm still digging and will report back when I get somewhere.

Also, there was some question in the CF discussion on whether an explicit indicator was needed to differentiate between whether a char array should be interpreted as individual 8-bit characters or as a string(s). Since the current proposal is suggesting a change to the NUG, I'm not sure if this question will or should play out the same as in a CF discussion.

rsignell-usgs commented 7 years ago

@lesserwhirls, do you have any thoughts here? @ethanrd are you still looking at this, or can we propose the above changes to NUG?

DennisHeimbigner commented 7 years ago

I do not understand the need for the _CharSet attribute. The type of the variable (char vs String) and the _Encoding attribute seem to me to encompass _CharSet. That is _Encoding for char type == CharSet

rsignell-usgs commented 7 years ago

@DennisHeimbigner, the problem is that while netcdf4 has char or string, netcdf3 has only char. So we don't know whether the netcdf3 char holds a string or an array of 8 bit characters.

BobSimons commented 7 years ago

@DennisHeimbigner, this is an alternative proposal. In this proposal _CharSet and _Encoding apply to different situations, have different options, and are used differently (mandatory vs optional):

_CharSet would be for char variables when they are to be interpreted as individual chars. The options are ISO-8859-1 and ISO-8859-15.
_CharSet is mandatory if the chars should be interpreted as individual chars.

_Encoding would be for String variables (e.g., in nc4) and char variables in nc3 which should be interpreted as Strings. The options are ISO-8859-1, ISO-8859-15, and UTF-8. [Different!] _Encoding is optional. The default is UTF-8. [Different!]

A further advantage is that only one attribute is needed per variable, not two.

Think of it from a software reader's point of view:

Is there a _CharSet attribute? Then these are chars and I now know the charset.
Is there an _Encoding attribute? Then these are strings and I now know the encoding.
Is there neither? Then these are UTF-8 strings.

DennisHeimbigner commented 7 years ago

I think the term "mandatory" is being misused here since a default is defined. But the real issue in Bob's proposal has to do with whether a character typed variable (or attribute?) is to be treated as if it was a surrogate for the lack of a String type in netcdf-3 and do we want a special attribute to mark that case. Personally undecided on that issue.

DennisHeimbigner commented 7 years ago

One other question. If we had an attribute to indicate that a char array should be treated like a string, do we want to limit the use of that attribute to netcdf-3 only. Since netcdf-4 has string type, that attribute is technically not needed.

ethanrd commented 7 years ago

@BobSimons Given the backward compatibility issues, I'm not sure the NUG should specify how character arrays are interpreted when the proposed attributes are not used. At least not at the level of a MUST.

rsignell-usgs commented 7 years ago

The proposed default behavior is to assume that a netcdf3 char array is a string. With Bob's proposal, if a_CharSet attribute is found, we know it's not a string.

BobSimons commented 7 years ago

With the original proposal, an nc3 file might have:

  someMonths(a=5, b=10)
    _CharType="STRING"
    _Encoding="UTF-8"
  someStatus(c=4, d=2)
    _CharType="CHAR_ARRAY"
    _Encoding="ISO-8859-1"

With the alternative proposal, that nc3 file would have:

  someMonths(a=5, b=10)
    _Encoding="UTF-8"
  someStatus(c=4, d=2)
    _CharSet="ISO-8859-1"

because _Encoding now says two things (this var is a String var and the encoding is ...) and _CharSet likewise says two things (this var has individual chars and the charset is ...).

thehesiod commented 7 years ago

@BobSimons I don't think the default for NC3 can be UTF-8 because there are existing NC3 files w/o _Encoding which are not UTF-8. Existing NC3 files w/o _Encoding are ambiguous (broken) for strings due to bad spec.

rsignell-usgs commented 7 years ago

I think what @BobSimons means is that the convention will be to assume string and UTF-8 for char arrays without any attributes because, as you say, it's ambiguous, and software will have to do something!

BobSimons commented 7 years ago

And I leave it to everyone else to say what the default should be. There are advantages and disadvantages to every choice. ISO-8859-1 probably makes sense from a safe, backward-looking sense. UTF-8 would be nice in a forward-looking sense.

On Tue, May 9, 2017 at 12:15 PM, Rich Signell notifications@github.com wrote:

I think what @BobSimons https://github.com/BobSimons means is that the convention will be to assume string and UTF-8 for char arrays without any attributes because, as you say, it's ambiguous, and software will have to do something!

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Unidata/netcdf-c/issues/402#issuecomment-300272683, or mute the thread https://github.com/notifications/unsubscribe-auth/ABarOJxsouo4DKsLxduKi4Jw0rrQI2Gbks5r4LtjgaJpZM4NOoaO .

-- Sincerely,

Bob Simons IT Specialist Environmental Research Division NOAA Southwest Fisheries Science Center 99 Pacific St., Suite 255A (New!) Monterey, CA 93940 (New!) Phone: (831)333-9878 (New!) Fax: (831)648-8440 Email: bob.simons@noaa.gov

The contents of this message are mine personally and do not necessarily reflect any position of the Government or the National Oceanic and Atmospheric Administration. <>< <>< <>< <>< <>< <>< <>< <>< <><

dopplershift commented 7 years ago

If utf-8 is an option, why are we restricting the rest of the list to iso-8859-1/15?

BobSimons commented 7 years ago

Under either proposal, for char variables that will be interpreted as individual characters (which will be stored as individual bytes in .nc files), UTF-8 isn't and can't be an option because most UTF-8 characters are represented as more than one byte.

Under either proposal, for char variables that will be interpreted as Strings, UTF-8 is a valid option.

That said, why just 2 other options? It can't be open-ended because then all software which tries to read a .nc file is responsible for being ready to read every possible encoding. There's a question of what are the "correct" or at least valid names -- different systems seem to use slightly different names. Different computer languages support different options. So there needs to be a defined list of acceptable options. Right now, that list is short.

ISO-8859-1 is nice because it is the same as the first 256 characters of Unicode. So it is the closest to what netcdf library has been doing when writing just the low byte of a Unicode character. ISO-8859-1 has been widely used. ISO-8859-15 is nice because it is the modern version of ISO-8859-1. ISO-8859-15 has been fairly widely used.

Support for options other than UTF-8 is a way of dealing with legacy files. There are millions (billions?) of .nc files that aren't going to be re-written, so it would be nice if there were a way to specify the encoding if it is known. If it is known, it could be specified by adding an, e.g., _Encoding attribute with NCO or on-the-fly with NCML without having to write a program to read the file and write the file out with the attribute specifying the encoding.

I personally am open to allowing other options if the need arises, But I don't know which other options are needed. If others are added, we need to agree on the specific names.

On Tue, May 9, 2017 at 1:04 PM, Ryan May notifications@github.com wrote:

If utf-8 is an option, why are we restricting the rest of the list to iso-8859-1/15?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Unidata/netcdf-c/issues/402#issuecomment-300285127, or mute the thread https://github.com/notifications/unsubscribe-auth/ABarOPt83LfG-df6K_lr_lMzAHArBl46ks5r4MbCgaJpZM4NOoaO .

-- Sincerely,

Bob Simons IT Specialist Environmental Research Division NOAA Southwest Fisheries Science Center 99 Pacific St., Suite 255A (New!) Monterey, CA 93940 (New!) Phone: (831)333-9878 (New!) Fax: (831)648-8440 Email: bob.simons@noaa.gov

The contents of this message are mine personally and do not necessarily reflect any position of the Government or the National Oceanic and Atmospheric Administration. <>< <>< <>< <>< <>< <>< <>< <>< <><

dopplershift commented 7 years ago

Well the problem with ISO-8859-1 (aka latin-1) and ISO-8859-15 (aka latin-9) is that they're distinctly focused on western European languages. We should at a minimum look at something like koi8-r and cp1251 to encompass eastern european/cyrillic characters. You should also be able to declare ascii itself to indicate that you only intend to use the lowest 7-bits.

BobSimons commented 7 years ago

Every charset has a different focus. I suggested ISO-8859-1 and -15 because I know they have been widely used. If you know of files that use koi8-r and cp1251, then let's add them to the list of acceptable charsets/encodings.

I don't like the idea of allowing an ASCII (7-bit) option because the data is 8-bits. A reader has to be ready to deal with 8-bit data. (Or we could say that ASCII is a valid option but if the file has a character using the 8th bit, the file is invalid. I suspect we would get a lot of invalid files from non-ASCII apostrophes and hyphens that the file authors aren't even aware of.) I also don't see the need for an ASCII option because, if the author really believes the characters are all ASCII, then ISO-8859-1 can be specified (since the first 128 chars are the same).

On Tue, May 9, 2017 at 2:46 PM, Ryan May notifications@github.com wrote:

Well the problem with ISO-8859-1 (aka latin-1) and ISO-8859-15 (aka latin-9) is that they're distinctly focused on western European languages. We should at a minimum look at something like koi8-r and cp1251 to encompass eastern european/cyrillic characters. You should also be able to declare ascii itself to indicate that you only intend to use the lowest 7-bits.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Unidata/netcdf-c/issues/402#issuecomment-300311352, or mute the thread https://github.com/notifications/unsubscribe-auth/ABarOM4wyqAA0LWh7tNd-oYMRudXDGwfks5r4N61gaJpZM4NOoaO .

-- Sincerely,

Bob Simons IT Specialist Environmental Research Division NOAA Southwest Fisheries Science Center 99 Pacific St., Suite 255A (New!) Monterey, CA 93940 (New!) Phone: (831)333-9878 (New!) Fax: (831)648-8440 Email: bob.simons@noaa.gov

The contents of this message are mine personally and do not necessarily reflect any position of the Government or the National Oceanic and Atmospheric Administration. <>< <>< <>< <>< <>< <>< <>< <>< <><

jswhit commented 7 years ago

for multidimensional character arrays that are to be interpreted as strings, is there a standard way to interpret the dimensions? Should the last dimension be interpreted as the length of the strings? If so, is there a convention for naming that dimension?

BobSimons commented 7 years ago

"Yes" for your first two questions: CF 1.6 (and previous) section 2.2 http://cfconventions.org/cf-conventions/v1.6.0/cf-conventions.html#_data_types says that is always the last dimension that holds the number of characters "NetCDF does not support a character string type, so these must be represented as character arrays. In this document, a one dimensional array of character data is simply referred to as a "string". An n-dimensional array of strings must be implemented as a character array of dimension (n,max_string_length), with the last (most rapidly varying) dimension declared large enough to contain the longest string in the array. All the strings in a given array are therefore defined to be equal in length. For example, an array of strings containing the names of the months would be dimensioned (12,9) in order to accommodate "September", the month with the longest name."

"No" for your third question: As far as I know, there is no standard for how that dimension should be named. CF section 2.3 says "This convention does not standardize any variable or dimension names. "

On Tue, May 16, 2017 at 10:02 AM, Jeff Whitaker notifications@github.com wrote:

for multidimensional character arrays that are to be interpreted as strings, is there a standard way to interpret the dimensions? Should the last dimension be interpreted as the length of the strings? If so, is there a convention for naming that dimension?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Unidata/netcdf-c/issues/402#issuecomment-301848278, or mute the thread https://github.com/notifications/unsubscribe-auth/ABarOBxdV5--h9sFvQr0O2SK2gAUvWuFks5r6da6gaJpZM4NOoaO .

-- Sincerely,

Bob Simons IT Specialist Environmental Research Division NOAA Southwest Fisheries Science Center 99 Pacific St., Suite 255A (New!) Monterey, CA 93940 (New!) Phone: (831)333-9878 (New!) Fax: (831)648-8440 Email: bob.simons@noaa.gov

The contents of this message are mine personally and do not necessarily reflect any position of the Government or the National Oceanic and Atmospheric Administration. <>< <>< <>< <>< <>< <>< <>< <>< <><

rsignell-usgs commented 7 years ago

@dopplershift , are you satisfied with the explanation @BobSimons provided?
I'd like to push this one to closure and not just leave it hanging...

jswhit commented 7 years ago

I added automatic detection of the _Encoding attribute in netcdf4-python (https://github.com/Unidata/netcdf4-python/pull/665). For string variables, if _Encoding is set it is used to encode the strings into bytes when writing to the file, and to decode the bytes into strings when reading from the file. If _Encoding is not specified, utf-8 is used (which was the previous behavior). When reading data from character variables _Encoding is used to convert the character array to an array of fixed length strings, assuming the last dimension is the length of the strings. When writing data to character variables, _Encoding is used to encode the string arrays into bytes, creating an array of individual bytes with one more dimension. For character variables, if _Encoding is not set, an array of bytes is returned.

BobSimons commented 7 years ago

This seems to be significantly different from the original proposal or the alternate proposal.

"When writing data to character variables, _Encoding is used to encode the string arrays into bytes, creating an array of individual characters with one more dimension. For character variables, if _Encoding is not set, an array of characters is returned."

I'm confused. Since netcdf4 has separate char and String data types, why are you adding a dimension when writing chars to a char variable? Is this your way of allowing chars in a char variable to be encoded with UTF-8 (and thus perhaps take up multiple bytes / char)? That would expand the usage of chars significantly.

And when reading a char variable from an nc4 file, won't an array of chars always be returned? (Or, again, is this your way of expanding the usage of chars to include UTF-8 encoding?) And can't _Encoding be used to indicate the charset of the returned characters (e.g., ISO-8859-1)?

This usage seems oriented to just reading and writing netcdf-4 files. It doesn't solve the problem of how to interpret a char variable in a netcdf-3 file (as strings? as separate chars?). One of the complaints in the CF discussion was: someone writing code to read a file shouldn't have to know whether they are reading an nc3 file or an nc4 file in order to know how to interpret the data. It would be nice to have a system that works with nc3 and nc4 files.

On Wed, May 17, 2017 at 8:49 AM, Jeff Whitaker notifications@github.com wrote:

I added automatic detection of the _Encoding attribute in netcdf4-python ( Unidata/netcdf4-python#665 https://github.com/Unidata/netcdf4-python/pull/665). For string variables, if _Encoding is set it is used to encode the strings into bytes when writing to the file, and to decode the bytes into strings when reading from the file. If _Encoding is not specified, utf-8 is used (which was the previous behavior). When reading data from character variables _Encoding is used to convert the character array to an array of fixed length strings, assuming the last dimension is the length of the strings. When writing data to character variables, _Encoding is used to encode the string arrays into bytes, creating an array of individual characters with one more dimension. For character variables, if _Encoding is not set, an array of characters is returned.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Unidata/netcdf-c/issues/402#issuecomment-302134033, or mute the thread https://github.com/notifications/unsubscribe-auth/ABarOGWkde0uKwdnXJdxGsaZ0Sez40vlks5r6xcmgaJpZM4NOoaO .

-- Sincerely,

Bob Simons IT Specialist Environmental Research Division NOAA Southwest Fisheries Science Center 99 Pacific St., Suite 255A (New!) Monterey, CA 93940 (New!) Phone: (831)333-9878 (New!) Fax: (831)648-8440 Email: bob.simons@noaa.gov

The contents of this message are mine personally and do not necessarily reflect any position of the Government or the National Oceanic and Atmospheric Administration. <>< <>< <>< <>< <>< <>< <>< <>< <><

dopplershift commented 7 years ago

@rsignell-usgs @BobSimons Well I was thinking ascii as a nice option for "I don't care about the 8th bit", but I can see the rationale behind forcing a choice for the 8th bit--I'm just guessing most users are not going to care or even understand anything beyond ascii and are just going to pick the option that let's them write without errors. Either way, so long as we make our restricted list as inclusive as possible--was just trying to make us less US/Western Europe-centric.

DennisHeimbigner commented 7 years ago

What I wish was the case was this:

char type is an alias for unsigned byte - this guarantees that 8 bits of data must always be preserved
The _Encoding is a suggestion about how programs (ncdump, etc) should interpret the characters in the event that they have to print them (or read them from text). The same should also hold for strings in that strings are equivalent to variable length sequences of unsigned bytes. The reason I wish this were the case is that the _Encoding is AFIAK irrelevant except when reading or writing text.

DennisHeimbigner commented 7 years ago

Also, with respect to using the rightmost dim to encode (fixed-length) strings: this is purely an external convention and is certainly not part of the netcdf spec. It raises a question? Who actually makes use of this convention? I know only one place: the conversion of DAP2 string typed vars into netcdf-3 character typed variables. Is it used anywhere else?

BobSimons commented 7 years ago

--- External? I've always been confused about the relationship of netcdf and CF so I don't know if you consider CF external, but using the rightmost dim to encode strings in char variables is part of the CF specification (section 2.2).

--- Where is this relevant? Doesn't netcdf-java always use the rightmost dim when you use NetcdfFileWriter.addStringVariable() and NetcdfFileWriter.writeStringData() when writing an nc3 file? And doesn't it use the rightmost dim when you use NetcdfFile.read(), readData(), and readSection()? (When reading nc3 files, do those/how do those distinguish char variables that should be read as individual chars from char variables that should be read as Strings?) Doesn't netcdf-c do the same?

Some other software (e.g., some of mine) also uses the rightmost dimension system explicitly in places that were written before (or before my awareness of) writeStringData().

On Wed, May 17, 2017 at 11:26 AM, DennisHeimbigner <notifications@github.com

wrote:

Also, with respect to using the rightmost dim to encode (fixed-length) strings: this is purely an external convention and is certainly not part of the netcdf spec. It raises a question? Who actually makes use of this convention? I know only one place: the conversion of DAP2 string typed vars into netcdf-3 character typed variables. Is it used anywhere else?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Unidata/netcdf-c/issues/402#issuecomment-302185109, or mute the thread https://github.com/notifications/unsubscribe-auth/ABarOHFtmcd4UftG-NkFEHtbHcYanZGCks5r6zvKgaJpZM4NOoaO .

-- Sincerely,

Bob Simons IT Specialist Environmental Research Division NOAA Southwest Fisheries Science Center 99 Pacific St., Suite 255A (New!) Monterey, CA 93940 (New!) Phone: (831)333-9878 (New!) Fax: (831)648-8440 Email: bob.simons@noaa.gov

The contents of this message are mine personally and do not necessarily reflect any position of the Government or the National Oceanic and Atmospheric Administration. <>< <>< <>< <>< <>< <>< <>< <>< <><

DennisHeimbigner commented 7 years ago

WRT to jswhit proposal above.

As part of the python strings -> netcdf-4 string translation rules, the rule for strings seems reasonable.
The character part of that proposal is relevant only to the issue of python strings <-> netcdf-3 char array translation rules.. It is consistent with DAP2 and CF translation rules.
For translating netcdf4-char arrays to python, #2 also seems appropriate. I have a question for the python people. Is there any situation in which a python string would be translated into a netcdf- char array? I infer that this case is prohibited using the jswhit rules for python <-> netcdf-4

DennisHeimbigner commented 7 years ago

At this point, there seems to be agreement about strings: _Encoding specifies the character set and if missing, utf-8 should be assumed.

So we can focus on the character type as an eight bit value. I am not concerned here with translation rules (e.g. python strings <-> netcdf character arrays).

1 _Encoding applies to individual 8-bit characters but the only legal _Encodings are those that are inherently 8-bit or less: iso-8859, ascii being prevalent. Converting a vector of such characters to a string (via some rule) should produce a legal string of that encoding.

_Encoding applies to individual 8-bit characters and specify only the expected bit patterns. This allows utf-8 _Encoding since the set of legal utf8 bit patterns are known). Note that this does not mean that converting (via some rule) a vector of chars to a String would necessarily produce a legitimate utf8 encoded string. the default encoding is to allow any 8-bit pattern. Personally I prefer #2 since it at least allows (again via some reasonable rule) to convert a utf8 string to a vector of 8-bit characters. Choosing #1 would preclude that possibility at all and an error would have to be thrown.

jswhit commented 7 years ago

@BobSimons, regarding your comment that the python implemention deviates from your original proposal...

In the situation when a user tries to write an array of python fixed length strings to a character variable with _Encoding set, the python interface will convert that array of fixed length strings to an array of single characters (bytes) with one more dimension (equal to the length of the fixed length strings, and the rightmost dimension of the character variable) then write that array of characters to the file.

I thought this was in the spirit of the CF convention - and this is what a user would have to do manually to write the strings to the character variable. One could certainly argue that this is a too much 'magic' though.

The same happens in reverse when data is read from a char variable with _Encoding set.

jswhit commented 7 years ago

@DennisHeimbigner, regarding your question "Is there any situation in which a python string would be translated into a netcdf- char array?"....

The answer is yes, if you are writing a single string into a character array like this

>>> v
<type 'netCDF4._netCDF4.Variable'>
|S1 strings(n1, n2, nchar)
    _Encoding: ascii
unlimited dimensions: n1
current shape = (0, 10, 12)
filling on, default _FillValue of  used
>>> v[0,0,:] = 'foobar'

The string foobar will get converted into an array of 12 characters (with trailing blanks appended) and then written to the file resulting in

netcdf tst_stringarr {
dimensions:
    n1 = UNLIMITED ; // (1 currently)
    n2 = 10 ;
    nchar = 12 ;
variables:
    char strings(n1, n2, nchar) ;
        strings:_Encoding = "ascii" ;
data:

 strings =
  "foobar",

BobSimons commented 7 years ago

Your approach is internally consistent -- if someone writes files with your system and reads them with your system, all is well. But there are other nc files created by other software, which I think don't mesh with your approach.

I don't know if your system is for netcdf-4 only, but if netcdf-3 files are included, the problem is: there are nc3 files with char variables where the chars are meant to be read as individual chars without collapsing the rightmost dimension. The Argo program has 100's of 1000's (millions?) of these files. They have variables like char POSITION_QC(N_PROF=254); where there is one QC character per profile. (Yes, there's a more CF-way to do this now, but they started doing this many years ago.) I think it is a reasonable reading of the CF convention (section 2.2) to say that these are legit char variables, not to be interpreted as Strings (by collapsing the rightmost dimension).

A goal of this proposal is to make it simple for a software reader to read a file (including an Argo file) and know quickly and easily if a given char variable in an nc3 file is meant to be interpreted as individual chars (not collapsing the rightmost dimension) or as Strings (by collapsing the rightmost dimension). With nc4 files that is trivial because there are explicit char and String data types. The problem is with disambiguating char variables in nc3 files.

Stated another way, it is a goal that netcdf-java library's NetcdfFile.read() should be able to know quickly and easily whether it should return an ArrayChar (by not collapsing the rightmost dimension) or an ArrayString (by collapsing the rightmost dimension) (and also be able to properly deal with the charset/encoding of the stored characters).

On Wed, May 17, 2017 at 3:40 PM, Jeff Whitaker notifications@github.com wrote:

@BobSimons https://github.com/bobsimons, regarding your comment that the python implemention deviates from your original proposal...

In the situation when a user tries to write an array of python fixed length strings to a character variable with _Encoding set, the python interface will convert that array of fixed length strings to an array of single characters (bytes) with one more dimension (equal to the length of the fixed length strings, and the rightmost dimension of the character variable) then write that array of characters to the file.

I thought this was in the spirit of the CF convention - and this is what a user would have to do manually to write the strings to the character variable. One could certainly argue that this is a too much 'magic' though.

The same happens in reverse when data is read from a char variable with _Encoding set.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Unidata/netcdf-c/issues/402#issuecomment-302251264, or mute the thread https://github.com/notifications/unsubscribe-auth/ABarOAqdua1ezYBcjktrLDRQH7DevGtTks5r63dBgaJpZM4NOoaO .

-- Sincerely,

Bob Simons IT Specialist Environmental Research Division NOAA Southwest Fisheries Science Center 99 Pacific St., Suite 255A (New!) Monterey, CA 93940 (New!) Phone: (831)333-9878 (New!) Fax: (831)648-8440 Email: bob.simons@noaa.gov

The contents of this message are mine personally and do not necessarily reflect any position of the Government or the National Oceanic and Atmospheric Administration. <>< <>< <>< <>< <>< <>< <>< <>< <><

jswhit commented 7 years ago

for nc3 or nc4 files, if _Encoding is not set the individual chars will be returned by the python interface without collapsing the rightmost dimension. I presume this is the case for those ARGO files. I thought from your proposal that if _Encoding was set, then the client should interpret the char array as strings. Did I misread that?

BobSimons commented 7 years ago

Ah. Thank you. I misunderstood.

On Wed, May 17, 2017 at 4:46 PM, Jeff Whitaker notifications@github.com wrote:

for nc3 or nc4 files, if _Encoding is not set the individual chars will be returned by the python interface without collapsing the rightmost dimension. I thought from your proposal that if _Encoding was set, then the client should interpret the char array as strings. Did I misread that?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Unidata/netcdf-c/issues/402#issuecomment-302261668, or mute the thread https://github.com/notifications/unsubscribe-auth/ABarOD3-7mfaiaH_o3mt7JsVy0ZDdhNEks5r64bjgaJpZM4NOoaO .

-- Sincerely,

Bob Simons IT Specialist Environmental Research Division NOAA Southwest Fisheries Science Center 99 Pacific St., Suite 255A (New!) Monterey, CA 93940 (New!) Phone: (831)333-9878 (New!) Fax: (831)648-8440 Email: bob.simons@noaa.gov

The contents of this message are mine personally and do not necessarily reflect any position of the Government or the National Oceanic and Atmospheric Administration. <>< <>< <>< <>< <>< <>< <>< <>< <><

rsignell-usgs commented 7 years ago

@BobSimons, would @jswhit's approach with NetCDF-Python work for you in ERDDAP to disambiguate string and char array handling in NetCDF3 and NetCDF4?

Seems like it does, right?

BobSimons commented 7 years ago

Sorry. I'm on vacation for the next 2 weeks and not available to evaluate this. I was confused by his original email. So I don't think I understand his proposal. I stand by my proposal.

On Sun, Jun 4, 2017 at 12:25 PM, Rich Signell notifications@github.com wrote:

@BobSimons https://github.com/bobsimons, would @jswhit https://github.com/jswhit's approach with NetCDF-Python work for you in ERDDAP to disambiguate string and char array handling in NetCDF3 and NetCDF4?

Seems like it does, right?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/Unidata/netcdf-c/issues/402#issuecomment-306050428, or mute the thread https://github.com/notifications/unsubscribe-auth/ABarOIukT81RvoM10PX0NCVIdr3y1o_Oks5sAtprgaJpZM4NOoaO .

-- Sincerely,

Bob Simons IT Specialist Environmental Research Division NOAA Southwest Fisheries Science Center 99 Pacific St., Suite 255A (New!) Monterey, CA 93940 (New!) Phone: (831)333-9878 (New!) Fax: (831)648-8440 Email: bob.simons@noaa.gov

The contents of this message are mine personally and do not necessarily reflect any position of the Government or the National Oceanic and Atmospheric Administration. <>< <>< <>< <>< <>< <>< <>< <>< <><

rsignell-usgs commented 7 years ago

Okay, I'll discuss with you when you get back from vacation.

Unidata / netcdf-c

Conventions for string and character array encoding #402