Background

The NC_CHAR type is 8 bits. In NetCDF-Java, we read data from NC_CHAR variables into the ArrayChar class. The backing storage of ArrayChar is a Java array of type char[]. However, the char type in Java is 16 bits, not 8 bits. It is interpreted by the JVM as a UTF-16 code unit.
So, we have a size mismatch. This isn't a problem when reading from a file: the 8-bit value is simply widened (zero-extended) to 16 bits. But what about when we need to narrow the character from 16 bits to 8 bits for writing? How should that conversion be done?
Currently, we do the conversion the same way for both NetCDF-3 and NetCDF-4: the chars are simply cast to bytes. The cast discards the upper 8 bits and keeps the lower 8 bits. That means that if the Java char is in the range 0000-00FF, no data is lost and the UTF-16 code unit is effectively converted to an ISO-8859-1 character. This works because ISO-8859-1 was incorporated as the first 256 code points of Unicode.
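A minimal illustration of those semantics (plain Java, not the actual library code; the names are ours):

```java
public class RoundTrip {
    public static void main(String[] args) {
        // Reading: the 8-bit value from the file is widened (zero-extended) to 16 bits.
        byte fromFile = (byte) 0xE9;             // 0xE9 is 'é' in ISO-8859-1
        char widened = (char) (fromFile & 0xFF); // U+00E9 'é'

        // Writing: the char is narrowed back to 8 bits with a plain cast.
        byte narrowed = (byte) widened;          // 0xE9 again; lossless for 0000-00FF
        System.out.printf("0x%02X -> %c -> 0x%02X%n",
            fromFile & 0xFF, widened, narrowed & 0xFF);
    }
}
```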
Problem

However, if the Java char is outside of that range (i.e. it can't fit into 8 bits), there will be data loss. No replacement character (e.g. ?) is emitted; instead we just spit out the low 8 bits. Is this the best solution? Perhaps not. Instead, we could convert to ASCII with replacement characters, as that encoding is the most portable across platforms. Plus, in netcdf-c, NC_CHAR on NetCDF-4 is now interpreted as ASCII.
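A short sketch contrasting the two behaviors, using standard java.nio rather than the library's actual write path:

```java
import java.nio.charset.StandardCharsets;

public class NarrowingLoss {
    public static void main(String[] args) {
        char snowman = '\u2603'; // SNOWMAN: outside 0000-00FF, can't fit in 8 bits

        // Current behavior: a plain cast silently keeps only the low 8 bits.
        byte cast = (byte) snowman; // 0x03 -- data loss, no replacement character

        // Alternative: encode as US-ASCII; unmappable chars become '?' (0x3F).
        byte[] ascii = String.valueOf(snowman).getBytes(StandardCharsets.US_ASCII);

        System.out.printf("cast: 0x%02X, ascii: 0x%02X%n", cast & 0xFF, ascii[0]);
    }
}
```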
Proposed solution
A better approach is to change the backing storage of ArrayChar from char[] to byte[]. That way, we would avoid the need to convert 16-bit characters to 8-bit altogether.
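A hypothetical sketch of what byte-backed storage could look like (the real ArrayChar has a much larger API surface; this only shows the storage change):

```java
// Hypothetical sketch only: ArrayChar backed by byte[] instead of char[].
public class ArrayChar {
    private final byte[] storage; // raw 8-bit NC_CHAR data; nothing is widened

    public ArrayChar(int size) {
        this.storage = new byte[size];
    }

    // Reads and writes move bytes straight through; no 16-bit conversion anywhere.
    public byte getByte(int index) { return storage[index]; }
    public void setByte(int index, byte value) { storage[index] = value; }
}
```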
With this change, we should mostly just treat ArrayChar as a bunch of bytes, and leave it at that. When we are required to interpret the bytes – e.g. in ncdump – we should look for the special variable attribute _Encoding (see CDL Data Types, last paragraph) and process the bytes accordingly. If the attribute is missing, we should interpret the bytes as US-ASCII.
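For the decoding step, a sketch of what ncdump-style code might do (assuming the standard ucar.nc2 Variable.findAttribute and Attribute.getStringValue accessors; Charset.forName can throw for unrecognized names, which is not handled here):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import ucar.nc2.Attribute;
import ucar.nc2.Variable;

public class CharDecoding {
    // Pick the charset named by _Encoding, defaulting to US-ASCII when absent.
    static Charset charsetFor(Variable var) {
        Attribute enc = var.findAttribute("_Encoding");
        if (enc != null && enc.getStringValue() != null) {
            return Charset.forName(enc.getStringValue());
        }
        return StandardCharsets.US_ASCII;
    }

    // Interpret the raw bytes only at display time, e.g. for ncdump output.
    static String decode(Variable var, byte[] raw) {
        return new String(raw, charsetFor(var));
    }
}
```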