Alexander-Barth / NCDatasets.jl

Load and create NetCDF files in Julia
MIT License

use _Unsigned attribute #133

Open visr opened 3 years ago

visr commented 3 years ago

I read this StackOverflow question: https://stackoverflow.com/q/68135528, and through it I found out that there are apparently netCDF files out there whose variables are of type short but carry an attribute _Unsigned with value "true", in which case the data is supposed to be interpreted as unsigned short (a type that netCDF-4 also supports). I read some background in https://github.com/Unidata/netcdf4-python/issues/656, and it seems this is a bit of a legacy from netCDF-3.

Since readers in other languages seem to support this, I guess perhaps we should too?

EDIT: see also my SO answer for an example file with some code.
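
To make the issue concrete, here is a minimal sketch in plain Julia (no NetCDF file involved) of how the same 16-bit patterns read as short versus unsigned short. The array `raw` is made-up illustration data, not taken from the file in the SO question.

```julia
# Values as they come out of a variable declared as `short` (Int16).
raw = Int16[-1, -32768, 100]

# Reinterpret the same 16 bits per element as unsigned integers.
# No numeric conversion happens; only the element type changes.
as_unsigned = reinterpret(UInt16, raw)

# Widen to Int so the values print in decimal rather than hexadecimal.
println(Int.(as_unsigned))  # prints [65535, 32768, 100]
```

A reader that ignores _Unsigned would report -1 and -32768 here, while the writer meant 65535 and 32768. Values below 32768 are unaffected, which is why the problem can go unnoticed.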

Alexander-Barth commented 3 years ago

Thanks, @visr, for helping the user on SO! The _Unsigned attribute does not seem to be part of the CF conventions.

Many links to the NetCDF best practices are broken, but I found them here:

  1. To be completely safe with unknown readers, widen the data type, or use floating point.
  2. You can use the corresponding signed types to store unsigned data only if all client programs know how to interpret this correctly.
  3. A new proposed convention is to create a variable attribute _Unsigned = "true" to indicate that integer data should be treated as unsigned.

I think that point 2 is also interesting. It could be read as saying that such files (using the corresponding signed types to store unsigned data) should not be distributed publicly, where you do not control the client program.

Does anybody know whether the work on the "new proposed convention" is still ongoing? It also seems that this is specific to files written by old versions of NetCDF-Java. Does anybody know since which version NetCDF-Java has used native unsigned types?

visr commented 3 years ago

Indeed, this is not a CF convention, only an old "proposed convention", which in practice still seems to be used (the example is new data), even though it shouldn't be. It seems that NetCDF-Java has had unsigned capabilities for a while; users are just not taking advantage of them when updating their old data models.

The most commonly used clients seem to have implemented support for this. In a way it's quite unfortunate, since it leads to potentially misinterpreted data like in the SO post. I'm not sure how difficult it would be to add support for _Unsigned; potentially we can use reinterpret, like in my SO answer, and avoid copying the data. If we decide not to support it, then perhaps we should throw an error when we encounter it.
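
A rough sketch of both options, assuming the attributes can be looked up like a dictionary. The names `unsigned_type`, `maybe_unsigned` and the `support` keyword are hypothetical and not part of NCDatasets.jl; a plain `Dict` stands in for a variable's attributes.

```julia
# Hypothetical mapping from signed integer types to their unsigned counterparts.
unsigned_type(::Type{Int8}) = UInt8
unsigned_type(::Type{Int16}) = UInt16
unsigned_type(::Type{Int32}) = UInt32
unsigned_type(::Type{Int64}) = UInt64

# Hypothetical helper, not part of NCDatasets.jl. `attrib` stands in for a
# variable's attributes; a plain Dict keeps the sketch self-contained.
function maybe_unsigned(data::AbstractArray{T}, attrib::AbstractDict;
                        support::Bool = true) where {T<:Union{Int8,Int16,Int32,Int64}}
    # The proposed convention stores the flag as the string "true".
    flagged = get(attrib, "_Unsigned", "false") == "true"
    flagged || return data
    # Alternative option: refuse the variable instead of returning signed values.
    support || throw(ArgumentError("_Unsigned attribute is not supported"))
    # reinterpret wraps the existing array; the data is not copied.
    return reinterpret(unsigned_type(T), data)
end

raw = Int16[-1, 0, 1]
u = maybe_unsigned(raw, Dict("_Unsigned" => "true"))

# Writing to `raw` is visible through `u`, which shows the memory is shared.
raw[2] = Int16(-2)
println(Int.(u))  # prints [65535, 65534, 1]
```

Calling `maybe_unsigned(raw, attrib; support = false)` on a flagged variable throws an `ArgumentError`, which corresponds to the "throw an error when we encounter it" option. Variables without the attribute are returned unchanged in both modes.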