Open rsignell-usgs opened 7 years ago
Should an _Encoding attribute on a 'char' typed variable be restricted to a 7- or 8-bit encoding?
As @DennisHeimbigner mentions here https://github.com/Unidata/netcdf4-python/issues/654#issuecomment-298735562, this proposal deals only with char or String typed variables, not char or String typed attributes.
Why wouldn't _Encoding
apply to attributes as well as variable data?
Because netcdf does not support attributes on attributes. We would need to come up with some kind of convention for this: a second attribute that could be interpreted as applying to the string/char attribute. An alternative is to define a global encoding for all attributes.
@DennisHeimbigner, I'm guessing you can answer @ethanrd's question: Should an _Encoding attribute on a char typed variable be restricted to a 7- or 8-bit encoding?
Should an _Encoding attribute on a char typed variable be restricted to a 7- or 8-bit encoding?
If an _Encoding is specified, then that technically determines 7 vs 8 bit. E.g. Ascii is 7 bit, but ISO-Latin-8859-1 is 8-bit. The tricky case is when something like utf-8 encoding is specified. Technically, the single character subset of utf-8 is 7-bit ascii. But, it is clear that some users treat an array of chars as a string, in which case any legal utf-8 char bit pattern should be legal. IMO, we should always treat char as essentially equivalent to unsigned byte so that a char can hold any 8-bit bit pattern so that e.g. _Encoding = "iso-latin-8859-1" is legal and does not lose information.
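Dennis's point that a char treated as an unsigned byte loses no information under an 8-bit encoding can be illustrated with a few lines of Python (a sketch, not part of any netCDF API):

```python
# Every 8-bit bit pattern is a valid ISO-8859-1 character, so treating a
# netCDF char as an unsigned byte loses nothing:
raw = bytes(range(256))                  # all 256 possible byte patterns
text = raw.decode("iso-8859-1")          # always succeeds, one char per byte
assert text.encode("iso-8859-1") == raw  # and round-trips losslessly

# UTF-8, by contrast, rejects many 8-bit patterns outright:
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
assert not utf8_ok
```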
I suggested always indicating the encoding/charset with the same attribute (_Encoding), whether string or character, thinking it simplified things. Now that restrictions on allowed encodings have come up, I'm seeing the wisdom of @BobSimons's original proposal to the CF list: one attribute that gives a string encoding for use when interpreting strings, and one that gives the character set for use when interpreting individual characters. (It avoids the need for different restrictions on the value of _Encoding depending on the situation.)
So, as an alternative to the above proposal, I'll restate Bob's proposal here with a change or two, given that the target is the NUG rather than CF:
Use the _CharSet variable attribute to indicate that a char array should be interpreted as individual 8-bit characters. The value of the attribute gives the 8-bit character set to use when interpreting those characters (e.g., 'ISO-8859-15').

Use the _Encoding variable attribute to indicate which character encoding should be used to interpret a string variable. Used with a char array, the attribute indicates that it should be interpreted as a string (or an array of strings).
Reviewing Bob's original proposal brought up a number of questions on how the netCDF-4 and HDF5 libraries handle string encoding (if they enforce the encoding or not, etc.). I'm still digging and will report back when I get somewhere.
Also, there was some question in the CF discussion on whether an explicit indicator was needed to differentiate whether a char array should be interpreted as individual 8-bit characters or as a string (or strings). Since the current proposal is suggesting a change to the NUG, I'm not sure if this question will or should play out the same as in a CF discussion.
@lesserwhirls, do you have any thoughts here? @ethanrd are you still looking at this, or can we propose the above changes to NUG?
I do not understand the need for the _CharSet attribute. The type of the variable (char vs. String) and the _Encoding attribute seem to me to encompass _CharSet. That is, _Encoding for the char type == _CharSet.
@DennisHeimbigner, the problem is that while netcdf4 has both char and string, netcdf3 has only char. So we don't know whether a netcdf3 char variable holds a string or an array of 8-bit characters.
@DennisHeimbigner, this is an alternative proposal. In this proposal _CharSet and _Encoding apply to different situations, have different options, and are used differently (mandatory vs optional):
_CharSet would be for char variables when they are to be interpreted as individual chars.
The options are ISO-8859-1 and ISO-8859-15.
_CharSet is mandatory if the chars should be interpreted as individual chars.
_Encoding would be for String variables (e.g., in nc4) and char variables in nc3 which should be interpreted as Strings. The options are ISO-8859-1, ISO-8859-15, and UTF-8. [Different!] _Encoding is optional. The default is UTF-8. [Different!]
A further advantage is that only one attribute is needed per variable, not two.
Think of it from a software reader's point of view:
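That reader's-eye view can be sketched in a few lines of Python (the function name and attribute-dict interface are hypothetical illustrations, not a real netCDF API):

```python
def reader_view(attrs):
    """Hypothetical reader logic for the alternative proposal:
    _CharSet present -> individual 8-bit chars in the named charset;
    _CharSet absent  -> a string, decoded per _Encoding (default UTF-8)."""
    if "_CharSet" in attrs:
        return ("individual chars", attrs["_CharSet"])
    return ("string", attrs.get("_Encoding", "UTF-8"))

# A char variable with no attributes defaults to a UTF-8 string;
# one carrying _CharSet is unambiguously an array of single chars.
assert reader_view({}) == ("string", "UTF-8")
assert reader_view({"_CharSet": "ISO-8859-1"}) == ("individual chars", "ISO-8859-1")
```

Each attribute answers both questions at once (string vs. chars, and which charset), which is the "only one attribute per variable" advantage noted above.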
I think the term "mandatory" is being misused here, since a default is defined. But the real issue in Bob's proposal is whether a character typed variable (or attribute?) is to be treated as a surrogate for the lack of a String type in netcdf-3, and whether we want a special attribute to mark that case. Personally, I'm undecided on that issue.
One other question. If we had an attribute to indicate that a char array should be treated like a string, do we want to limit the use of that attribute to netcdf-3 only. Since netcdf-4 has string type, that attribute is technically not needed.
@BobSimons Given the backward compatibility issues, I'm not sure the NUG should specify how character arrays are interpreted when the proposed attributes are not used. At least not at the level of a MUST.
The proposed default behavior is to assume that a netcdf3 char array is a string.
With Bob's proposal, if a _CharSet attribute is found, we know it's not a string.
With the original proposal, an nc3 file might have:

    someMonths(a=5, b=10)
        _CharType="STRING"
        _Encoding="UTF-8"
    someStatus(c=4, d=2)
        _CharType="CHAR_ARRAY"
        _Encoding="ISO-8859-1"
With the alternative proposal, that nc3 file would have:

    someMonths(a=5, b=10)
        _Encoding="UTF-8"
    someStatus(c=4, d=2)
        _CharSet="ISO-8859-1"
because _Encoding now says two things (this var is a String var and the encoding is ...) and _CharSet likewise says two things (this var has individual chars and the charset is ...).
@BobSimons I don't think the default for NC3 can be UTF-8, because there are existing NC3 files w/o _Encoding which are not UTF-8. Existing NC3 files w/o _Encoding are ambiguous (broken) for strings due to the bad spec.
I think what @BobSimons means is that the convention will be to assume string and UTF-8 for char arrays without any attributes because, as you say, it's ambiguous, and software will have to do something!
And I leave it to everyone else to say what the default should be. There are advantages and disadvantages to every choice. ISO-8859-1 probably makes sense from a safe, backward-looking sense. UTF-8 would be nice in a forward-looking sense.
If utf-8 is an option, why are we restricting the rest of the list to iso-8859-1/15?
Under either proposal, for char variables that will be interpreted as individual characters (which will be stored as individual bytes in .nc files), UTF-8 isn't and can't be an option because most UTF-8 characters are represented as more than one byte.
Under either proposal, for char variables that will be interpreted as Strings, UTF-8 is a valid option.
That said, why just two other options? It can't be open-ended, because then all software that tries to read a .nc file is responsible for being ready to read every possible encoding. There's also the question of what the "correct", or at least valid, names are -- different systems seem to use slightly different names, and different computer languages support different options. So there needs to be a defined list of acceptable options. Right now, that list is short.
ISO-8859-1 is nice because it is the same as the first 256 characters of Unicode, so it is the closest to what the netcdf library has been doing when writing just the low byte of a Unicode character. ISO-8859-1 has been widely used. ISO-8859-15 is nice because it is the modern version of ISO-8859-1, and it has been fairly widely used.
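Both facts behind these choices are easy to check in Python: ISO-8859-1 maps byte value n straight to Unicode code point n, while most non-ASCII characters need more than one byte in UTF-8.

```python
# ISO-8859-1 (latin-1) is exactly the first 256 characters of Unicode:
latin1_matches_unicode = (
    bytes(range(256)).decode("iso-8859-1") == "".join(chr(n) for n in range(256))
)
assert latin1_matches_unicode

# Most non-ASCII characters take more than one byte in UTF-8, which is
# why UTF-8 cannot describe a one-byte-per-character variable:
assert len("é".encode("utf-8")) == 2
assert len("é".encode("iso-8859-1")) == 1
```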
Support for options other than UTF-8 is a way of dealing with legacy files. There are millions (billions?) of .nc files that aren't going to be re-written, so it would be nice if there were a way to specify the encoding when it is known. If it is known, it could be specified by adding an _Encoding attribute, e.g., with NCO, or on-the-fly with NCML, without having to write a program that reads the file and writes it back out with the attribute specifying the encoding.
I personally am open to allowing other options if the need arises, but I don't know which other options are needed. If others are added, we need to agree on the specific names.
Well, the problem with ISO-8859-1 (aka Latin-1) and ISO-8859-15 (aka Latin-9) is that they're distinctly focused on Western European languages. We should at a minimum look at something like KOI8-R and CP1251 to encompass Eastern European/Cyrillic characters. You should also be able to declare ASCII itself to indicate that you only intend to use the lowest 7 bits.
Every charset has a different focus. I suggested ISO-8859-1 and -15 because I know they have been widely used. If you know of files that use koi8-r and cp1251, then let's add them to the list of acceptable charsets/encodings.
I don't like the idea of allowing an ASCII (7-bit) option, because the data is 8 bits; a reader has to be ready to deal with 8-bit data. (Or we could say that ASCII is a valid option, but if the file has a character using the 8th bit, the file is invalid. I suspect we would get a lot of invalid files from non-ASCII apostrophes and hyphens that the file authors aren't even aware of.) I also don't see the need for an ASCII option because, if the author really believes the characters are all ASCII, then ISO-8859-1 can be specified (since the first 128 chars are the same).
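The non-ASCII apostrophe problem is concrete: a word processor's typographic "curly" apostrophe sits outside 7-bit ASCII, so a file declared ASCII that contains one would be invalid (illustrative Python):

```python
# "it’s" with a typographic apostrophe (U+2019) -- common, and easy for
# a file author to miss:
curly = "it\u2019s"
assert max(curly.encode("utf-8")) > 127   # needs the 8th bit once encoded

try:
    curly.encode("ascii")                 # 7-bit ASCII can't represent it
    fits_in_ascii = True
except UnicodeEncodeError:
    fits_in_ascii = False
assert not fits_in_ascii
```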
For multidimensional character arrays that are to be interpreted as strings, is there a standard way to interpret the dimensions? Should the last dimension be interpreted as the length of the strings? If so, is there a convention for naming that dimension?
"Yes" to your first two questions: CF 1.6 (and previous) section 2.2 http://cfconventions.org/cf-conventions/v1.6.0/cf-conventions.html#_data_types says that it is always the last dimension that holds the number of characters: "NetCDF does not support a character string type, so these must be represented as character arrays. In this document, a one dimensional array of character data is simply referred to as a "string". An n-dimensional array of strings must be implemented as a character array of dimension (n,max_string_length), with the last (most rapidly varying) dimension declared large enough to contain the longest string in the array. All the strings in a given array are therefore defined to be equal in length. For example, an array of strings containing the names of the months would be dimensioned (12,9) in order to accommodate "September", the month with the longest name."
"No" to your third question: as far as I know, there is no standard for how that dimension should be named. CF section 2.3 says "This convention does not standardize any variable or dimension names."
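The months example from CF section 2.2 can be reproduced with plain Python (no netCDF library needed) to show where the (12, 9) shape comes from:

```python
# An array of 12 strings becomes a char array of shape (n, max_string_length),
# with every string padded to the length of the longest ("September"):
months = ["January", "February", "March", "April", "May", "June",
          "July", "August", "September", "October", "November", "December"]
maxlen = max(len(m) for m in months)                  # 9, for "September"
char_array = [list(m.ljust(maxlen)) for m in months]  # pad to equal length

assert (len(char_array), maxlen) == (12, 9)           # dimensioned (12, 9)
assert "".join(char_array[8]).rstrip() == "September"
```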
@dopplershift, are you satisfied with the explanation @BobSimons provided?
I'd like to push this one to closure and not just leave it hanging...
I added automatic detection of the _Encoding attribute in netcdf4-python (https://github.com/Unidata/netcdf4-python/pull/665). For string variables, if _Encoding is set, it is used to encode the strings into bytes when writing to the file, and to decode the bytes into strings when reading from the file. If _Encoding is not specified, utf-8 is used (which was the previous behavior). When reading data from character variables, _Encoding is used to convert the character array to an array of fixed-length strings, assuming the last dimension is the length of the strings. When writing data to character variables, _Encoding is used to encode the string arrays into bytes, creating an array of individual bytes with one more dimension. For character variables, if _Encoding is not set, an array of bytes is returned.
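The conversion described above can be sketched in pure Python (the real library operates on numpy arrays; these function names are illustrative, not the netcdf4-python API):

```python
def strings_to_chars(strings, nchar, encoding):
    """Encode strings into rows of single bytes, adding one dimension
    (the rightmost, of length nchar), as the writer described above does."""
    return [list(s.encode(encoding).ljust(nchar)) for s in strings]

def chars_to_strings(rows, encoding):
    """Collapse the rightmost dimension back into fixed-length strings."""
    return [bytes(row).decode(encoding).rstrip() for row in rows]

rows = strings_to_chars(["May", "June"], 9, "ascii")
assert len(rows) == 2 and len(rows[0]) == 9   # one extra dimension, length 9
assert chars_to_strings(rows, "ascii") == ["May", "June"]
```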
This seems to be significantly different from the original proposal or the alternate proposal.
"When writing data to character variables, _Encoding is used to encode the string arrays into bytes, creating an array of individual characters with one more dimension. For character variables, if _Encoding is not set, an array of characters is returned."
I'm confused. Since netcdf4 has separate char and String data types, why are you adding a dimension when writing chars to a char variable? Is this your way of allowing chars in a char variable to be encoded with UTF-8 (and thus perhaps take up multiple bytes / char)? That would expand the usage of chars significantly.
And when reading a char variable from an nc4 file, won't an array of chars always be returned? (Or, again, is this your way of expanding the usage of chars to include UTF-8 encoding?) And can't _Encoding be used to indicate the charset of the returned characters (e.g., ISO-8859-1)?
This usage seems oriented to just reading and writing netcdf-4 files. It doesn't solve the problem of how to interpret a char variable in a netcdf-3 file (as strings? as separate chars?). One of the complaints in the CF discussion was: someone writing code to read a file shouldn't have to know whether they are reading an nc3 file or an nc4 file in order to know how to interpret the data. It would be nice to have a system that works with nc3 and nc4 files.
@rsignell-usgs @BobSimons Well, I was thinking of ascii as a nice option for "I don't care about the 8th bit", but I can see the rationale behind forcing a choice for the 8th bit--I'm just guessing most users are not going to care about or even understand anything beyond ascii and are just going to pick the option that lets them write without errors. Either way, so long as we make our restricted list as inclusive as possible--I was just trying to make us less US/Western Europe-centric.
What I wish was the case was this:
Also, with respect to using the rightmost dim to encode (fixed-length) strings: this is purely an external convention and is certainly not part of the netcdf spec. It raises a question: who actually makes use of this convention? I know of only one place: the conversion of DAP2 string typed vars into netcdf-3 character typed variables. Is it used anywhere else?
--- External? I've always been confused about the relationship of netcdf and CF so I don't know if you consider CF external, but using the rightmost dim to encode strings in char variables is part of the CF specification (section 2.2).
--- Where is this relevant? Doesn't netcdf-java always use the rightmost dim when you use NetcdfFileWriter.addStringVariable() and NetcdfFileWriter.writeStringData() when writing an nc3 file? And doesn't it use the rightmost dim when you use NetcdfFile.read(), readData(), and readSection()? (When reading nc3 files, do those/how do those distinguish char variables that should be read as individual chars from char variables that should be read as Strings?) Doesn't netcdf-c do the same?
Some other software (e.g., some of mine) also uses the rightmost dimension system explicitly in places that were written before (or before my awareness of) writeStringData().
WRT @jswhit's proposal above.
At this point, there seems to be agreement about strings: _Encoding specifies the character set and if missing, utf-8 should be assumed.
So we can focus on the character type as an eight bit value. I am not concerned here with translation rules (e.g. python strings <-> netcdf character arrays).
1. _Encoding applies to individual 8-bit characters, but the only legal _Encodings are those that are inherently 8-bit or less: ISO-8859 and ASCII being prevalent. Converting a vector of such characters to a string (via some rule) should produce a legal string of that encoding.
@BobSimons, regarding your comment that the python implementation deviates from your original proposal...
In the situation when a user tries to write an array of python fixed-length strings to a character variable with _Encoding set, the python interface will convert that array of fixed-length strings to an array of single characters (bytes) with one more dimension (equal to the length of the fixed-length strings, and the rightmost dimension of the character variable), then write that array of characters to the file.
I thought this was in the spirit of the CF convention - and this is what a user would have to do manually to write the strings to the character variable. One could certainly argue that this is too much 'magic', though.
The same happens in reverse when data is read from a char variable with _Encoding set.
@DennisHeimbigner, regarding your question "Is there any situation in which a python string would be translated into a netcdf char array?"...
The answer is yes, if you are writing a single string into a character array like this
>>> v
<type 'netCDF4._netCDF4.Variable'>
|S1 strings(n1, n2, nchar)
_Encoding: ascii
unlimited dimensions: n1
current shape = (0, 10, 12)
filling on, default _FillValue of used
>>> v[0,0,:] = 'foobar'
The string foobar will get converted into an array of 12 characters (with trailing blanks appended) and then written to the file, resulting in
netcdf tst_stringarr {
dimensions:
n1 = UNLIMITED ; // (1 currently)
n2 = 10 ;
nchar = 12 ;
variables:
char strings(n1, n2, nchar) ;
strings:_Encoding = "ascii" ;
data:
strings =
"foobar",
Your approach is internally consistent -- if someone writes files with your system and reads them with your system, all is well. But there are other nc files created by other software, which I think don't mesh with your approach.
I don't know if your system is for netcdf-4 only, but if netcdf-3 files are included, the problem is: there are nc3 files with char variables where the chars are meant to be read as individual chars without collapsing the rightmost dimension. The Argo program has 100's of 1000's (millions?) of these files. They have variables like char POSITION_QC(N_PROF=254); where there is one QC character per profile. (Yes, there's a more CF-way to do this now, but they started doing this many years ago.) I think it is a reasonable reading of the CF convention (section 2.2) to say that these are legit char variables, not to be interpreted as Strings (by collapsing the rightmost dimension).
A goal of this proposal is to make it simple for a software reader to read a file (including an Argo file) and know quickly and easily if a given char variable in an nc3 file is meant to be interpreted as individual chars (not collapsing the rightmost dimension) or as Strings (by collapsing the rightmost dimension). With nc4 files that is trivial because there are explicit char and String data types. The problem is with disambiguating char variables in nc3 files.
Stated another way, it is a goal that netcdf-java library's NetcdfFile.read() should be able to know quickly and easily whether it should return an ArrayChar (by not collapsing the rightmost dimension) or an ArrayString (by collapsing the rightmost dimension) (and also be able to properly deal with the charset/encoding of the stored characters).
For nc3 or nc4 files, if _Encoding is not set, the individual chars will be returned by the python interface without collapsing the rightmost dimension. I presume this is the case for those ARGO files. I thought from your proposal that if _Encoding was set, then the client should interpret the char array as strings. Did I misread that?
Ah. Thank you. I misunderstood.
@BobSimons, would @jswhit's approach with NetCDF-Python work for you in ERDDAP to disambiguate string and char array handling in NetCDF3 and NetCDF4?
Seems like it does, right?
Sorry. I'm on vacation for the next 2 weeks and not available to evaluate this. I was confused by his original email. So I don't think I understand his proposal. I stand by my proposal.
Okay, I'll discuss with you when you get back from vacation.
As discussed here https://github.com/Unidata/netcdf4-python/issues/654#issuecomment-298284181, there is a need for conventions to specify the encoding of strings and character arrays in netcdf.
There is also a need to specify whether char arrays in NetCDF3 contain strings or character arrays.

@BobSimons addressed these issues in an enhancement to CF conventions that would specify charset for NetCDF3 and _Encoding for NetCDF4, and the Unidata gang (@DennisHeimbigner, @WardF, @ethanrd and @cwardgar) agreed with the concept, but suggested this be handled in the NUG, and we came up with this slightly different proposal that would still accomplish Bob's goals of making it easy for software to figure out what is stuffed in those char or string arrays!

Proposal:

- _CharType variable attribute with allowed values ['STRING', 'CHAR_ARRAY'] to specify whether a char array variable should be interpreted as a string or as an array of individual characters. If _CharType is missing, the default is 'STRING'.
- _Encoding variable attribute with allowed values ['ISO-8859-1', 'ISO-8859-15', 'UTF-8'] to specify the encoding. If _Encoding is missing for _CharType='STRING', the default is 'UTF-8'. If _Encoding is missing for _CharType='CHAR_ARRAY', the default is 'ISO-8859-15'.
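A reader implementing the proposal's defaulting rules could be sketched like this (the attribute names come from the proposal; the function and dict interface are hypothetical):

```python
def interpret(attrs):
    """Apply the proposal's defaults: _CharType defaults to 'STRING', and
    _Encoding defaults to 'UTF-8' for strings, 'ISO-8859-15' for char arrays."""
    char_type = attrs.get("_CharType", "STRING")
    default_enc = "UTF-8" if char_type == "STRING" else "ISO-8859-15"
    return char_type, attrs.get("_Encoding", default_enc)

# A bare char variable defaults to a UTF-8 string; a declared char array
# defaults to ISO-8859-15; explicit attributes always win.
assert interpret({}) == ("STRING", "UTF-8")
assert interpret({"_CharType": "CHAR_ARRAY"}) == ("CHAR_ARRAY", "ISO-8859-15")
assert interpret({"_Encoding": "ISO-8859-1"}) == ("STRING", "ISO-8859-1")
```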