Unidata / thredds

THREDDS Data Server v4.6
https://www.unidata.ucar.edu/software/tds/v4.6/index.html

ArrayChar should be backed by byte[] and honor the "_Encoding" attribute #788

Open cwardgar opened 7 years ago

cwardgar commented 7 years ago

Background

  1. In NetCDF, the NC_CHAR type is 8 bits.
  2. In NetCDF-Java, we read data from NC_CHAR variables into the ArrayChar class.
  3. The backing storage of ArrayChar is a Java array of type char[].
  4. However, the char type in Java is 16 bits, not 8 bits. It is interpreted by the JVM as a UTF-16 code unit.

So, we have a size mismatch. This isn't a problem when reading from a file: the 8-bit value is simply zero-extended (padded with high-order zero bits) to 16 bits. But what about when we need to narrow the character from 16 bits to 8 bits for writing? How should that conversion be done?
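The widening step is subtler than it looks in plain Java (this is a sketch of the conversion, not the library's actual code): a Java byte is signed, so it must be masked with 0xFF before the cast, or the value is sign-extended instead of zero-extended.

```java
// Widening an 8-bit NC_CHAR value into a 16-bit Java char: mask with 0xFF
// so the (signed) byte is zero-extended rather than sign-extended.
public class ByteWidening {
    public static void main(String[] args) {
        byte fromFile = (byte) 0xE9;            // 8-bit value read from disk
        char wrong = (char) fromFile;           // sign-extends: U+FFE9
        char right = (char) (fromFile & 0xFF);  // zero-extends: U+00E9 ('é')
        System.out.printf("U+%04X U+%04X%n", (int) wrong, (int) right); // prints U+FFE9 U+00E9
    }
}
```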

Currently, we do the conversion the same for both NetCDF-3 and NetCDF-4: the chars are simply cast to bytes. The cast discards the upper 8 bits and keeps the lower 8 bits.

That means that if the Java char is in the range U+0000–U+00FF, no loss of data occurs and the UTF-16 code unit is effectively converted to an ISO-8859-1 character. This works because ISO-8859-1 was incorporated as the first 256 code points of Unicode.
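The current narrowing behavior can be demonstrated with a cast (a standalone illustration, not code from ArrayChar itself):

```java
// Casting a Java char to a byte keeps only the low 8 bits.
public class CharNarrowing {
    public static void main(String[] args) {
        char inRange = '\u00E9';     // 'é', fits in 8 bits
        char outOfRange = '\u2713';  // CHECK MARK, does not fit

        byte b1 = (byte) inRange;    // 0xE9: the ISO-8859-1 code for 'é' -- lossless
        byte b2 = (byte) outOfRange; // 0x13: upper 8 bits silently discarded -- data loss

        System.out.printf("0x%02X 0x%02X%n", b1 & 0xFF, b2 & 0xFF); // prints 0xE9 0x13
    }
}
```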

Problem

However, if the Java char is outside of that range (i.e. it can't fit into 8 bits), there will be data loss. No replacement character (e.g. ?) is emitted; instead we just spit out the low 8 bits. Is this the best solution? Perhaps not. Instead, we could convert to ASCII with replacement characters, as that encoding is the most portable across platforms. Plus, in netcdf-c, NC_CHAR on NetCDF-4 is now interpreted as ASCII.
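The ASCII-with-replacement alternative described above can be sketched with the standard charset machinery: encoding a String to US-ASCII via getBytes(Charset) substitutes '?' for each unmappable character instead of silently mangling it.

```java
import java.nio.charset.StandardCharsets;

// Encode to ASCII, letting unmappable characters become the
// replacement character '?'.
public class AsciiReplacement {
    public static void main(String[] args) {
        String s = "temp \u2713";  // contains a non-ASCII CHECK MARK
        byte[] ascii = s.getBytes(StandardCharsets.US_ASCII);
        System.out.println(new String(ascii, StandardCharsets.US_ASCII)); // prints "temp ?"
    }
}
```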

Proposed solution

A better approach is to change the backing storage of ArrayChar from char[] to byte[]. That way, we would avoid the need to convert 16-bit characters to 8-bit altogether.

With this change, we should mostly just treat ArrayChar as a bunch of bytes, and leave it at that. When we are required to interpret the bytes – e.g. in ncdump – we should look for the special variable attribute _Encoding (see CDL Data Types, last paragraph) and process the bytes accordingly. If the attribute is missing, we should interpret the bytes as US-ASCII.
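The decoding rule could look roughly like this (a minimal sketch; decodeCharData is a hypothetical helper, not an existing NetCDF-Java method):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Interpret a char variable's backing byte[] using the charset named by its
// "_Encoding" attribute, falling back to US-ASCII when the attribute is absent.
public class EncodingAware {
    static String decodeCharData(byte[] data, String encodingAttr) {
        Charset cs = (encodingAttr != null) ? Charset.forName(encodingAttr)
                                            : StandardCharsets.US_ASCII;
        return new String(data, cs);
    }

    public static void main(String[] args) {
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        System.out.println(decodeCharData(utf8, "UTF-8")); // prints "café"
        System.out.println(decodeCharData(utf8, null));    // ASCII fallback: é bytes are unmappable
    }
}
```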

lesserwhirls commented 7 years ago

Things to do:

Task for THREDDS v6.