aodn / python-aodntools

Repository for templates and code relating to generating standard NetCDF files for the Australian Ocean Data Network
GNU Lesser General Public License v3.0

Allow attribute type to be specified in JSON #13

Closed: mhidas closed this issue 3 years ago

mhidas commented 6 years ago

Currently global & variable attributes are just specified as key:value pairs, with the value either a string, a list, or a number, to be converted into netCDF attributes by the netCDF4 library's setncattr and setncatts methods. This is fine in most cases, except when we explicitly want to set the type of an attribute to e.g. double (e.g. to match the data type of the variable).

Within Python code it's easy to specify the attribute type by setting its value to the appropriate numpy object (e.g. np.float32(1.234)). However, when numeric values are specified in a JSON template, they are automatically converted into int, long, or float (https://docs.python.org/2/library/json.html#json-to-py-table). We need to allow attribute values in the template to be specified as another (JSON) object, with properties "type" and "data".
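A minimal sketch of what such a typed attribute could look like. The helper name `parse_attribute`, the attribute names, and the idea that "type" holds a numpy dtype name are assumptions for illustration, not something settled in this issue:

```python
import json
import numpy as np

# Template fragment: one attribute uses the proposed {"type", "data"} object,
# the others are plain JSON values.
template = json.loads("""
{
    "valid_min": {"type": "float32", "data": -5.0},
    "valid_max": 40.0,
    "comment": "plain values pass through unchanged"
}
""")


def parse_attribute(value):
    """Hypothetical helper: turn a {"type": ..., "data": ...} object into a numpy value."""
    if isinstance(value, dict) and set(value) == {"type", "data"}:
        # Assumption: "type" is a numpy dtype name such as "float32" or "int16".
        # Indexing with () turns a 0-d array into a numpy scalar and leaves
        # list data as an array.
        return np.array(value["data"], dtype=np.dtype(value["type"]))[()]
    return value


attributes = {name: parse_attribute(value) for name, value in template.items()}
```

Here `valid_min` becomes a `np.float32`, while `valid_max` stays a plain Python float as produced by the json module.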

ocehugo commented 6 years ago

Maybe unnecessary!?

If you are putting numbers in attributes, you don't care much about their precision anyway...

ggalibert commented 6 years ago

You do care actually, one example is the use of CF packed data.

mhidas commented 6 years ago

Yeah, that's essentially the use case I'm thinking of here, though I think _FillValue, valid_min, valid_max, and valid_range should always be the same type as the variable, not just for packed data. (I haven't tested it, but I would think this would already be ensured by the netCDF4 package for _FillValue).

So, perhaps rather than making the template schema more complicated, the behaviour of DatasetTemplate could be to automatically cast to the variable's type either

This of course wouldn't work for global attributes, but perhaps @ocehugo's comment applies to them? I can't think of any particular case where we'd want to specify a global attribute's type.

ggalibert commented 6 years ago

My preference is that any numeric attribute that is related to a particular variable should have the same type as that variable. Exceptions are the scale_factor / add_offset variable attributes for packed variables, where their type dictates the type/precision of the unpacked data.
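To illustrate the packed-data exception, here is a small sketch of unpacking with numpy. The function `unpack` and the sample values are made up for the example; the formula is the usual CF one, unpacked = packed * scale_factor + add_offset:

```python
import numpy as np

# Packed values as they would be stored in a 16-bit integer variable.
packed = np.array([0, 100, 2000], dtype=np.int16)


def unpack(packed, scale_factor, add_offset):
    """Unpack values, taking the result type from the two attributes."""
    dtype = np.result_type(scale_factor, add_offset)
    # Cast explicitly so the output type does not depend on promotion rules.
    return packed.astype(dtype) * scale_factor + add_offset


as_float32 = unpack(packed, np.float32(0.01), np.float32(10.0))
as_float64 = unpack(packed, np.float64(0.01), np.float64(10.0))
```

The same packed integers unpack to a float32 array in the first call and a float64 array in the second, which is why the attribute type matters in this case.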

For global attributes, things like geospatial_lat_min/max should have the same type as the latitude variable for example.

lbesnard commented 6 years ago

This of course wouldn't work for global attributes, but perhaps @ocehugo's comment applies to them? I can't think of any particular case where we'd want to specify a global attribute's type.

One example is WMO codes in global attributes. You want them as integers, but they could be wrongly written as floats in a NetCDF file.

mhidas commented 5 years ago

My preference is that any numeric attribute that is related to a particular variable should have the same type.

There is one tricky possibility: what if you want an integer attribute for a floating-point variable? (e.g. some kind of overall status flag)

For global attributes, things like geospatial_lat_min/max should have the same type as the latitude variable for example.

I'm not sure that's necessary, but it would automatically be the case if we do #17.

One example is WMO codes in global attributes. You want them as integers, but they could be wrongly written as floats in a NetCDF file.

This can already be done via the JSON. If a number is specified with no decimal point, it's read as an integer.
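A quick check of that behaviour with the standard json module (the attribute names are made up for the example):

```python
import json

# The same number written without and with a decimal point.
attrs = json.loads('{"wmo_platform_code": 56789, "written_as_float": 56789.0}')

code_type = type(attrs["wmo_platform_code"])   # int
float_type = type(attrs["written_as_float"])   # float
```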

mhidas commented 5 years ago

Actually, following on from my last comment above, the rule for variable attributes could be to cast all floating point attributes to the variable's type.
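A rough sketch of that rule, not the actual DatasetTemplate behaviour. The helper `cast_float_attributes` is hypothetical, and restricting the cast to floating-point variables (so that e.g. scale_factor on an integer variable is left alone) is an added assumption:

```python
import numpy as np


def cast_float_attributes(attributes, dtype):
    """Hypothetical helper: cast plain float attributes to the variable's dtype."""
    dtype = np.dtype(dtype)
    if dtype.kind != "f":
        # Assumption: only floating-point variables trigger the cast.
        return dict(attributes)
    result = {}
    for name, value in attributes.items():
        if type(value) is float:
            # Plain Python floats are what the json module produces.
            result[name] = dtype.type(value)
        elif isinstance(value, list) and value and all(type(v) is float for v in value):
            result[name] = np.array(value, dtype=dtype)
        else:
            # Integers, strings, and already-typed numpy values are untouched.
            result[name] = value
    return result


template_attrs = {
    "valid_min": -5.0,
    "valid_range": [-5.0, 40.0],
    "status_flag": 1,
    "units": "degC",
}
cast_attrs = cast_float_attributes(template_attrs, "float32")
```

With this rule an integer attribute such as the status flag mentioned above stays an integer, even on a floating-point variable.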

By the way, remember that this is only an issue for values specified in a JSON template. Attribute types can be easily controlled in Python code.

ocehugo commented 5 years ago

My simpleton argument was too shallow. My point is that this feature is unnecessary.

First, we should avoid automatic internal casting/conversion, because the Python json package already does the right thing (it converts/recasts automatically).

I can see a point in providing a string in the template (something like "np.float32(1234.0005)"), but this would create code boilerplate and eval calls, and would change the precision unnecessarily. As said, the Python json package already casts to the correct precision.

Maybe this issue is due to a confusion regarding type names:

py.float is actually np.double, because a Python float is actually a C double. Hence, json can read float32, float64, and ints at any precision up to 64 bits. Everything is actually a float64 if it is not an integer. I can hardly see anyone using float128 and remembering how many digits they have to type in the template to match the exact precision. We should dump the JSON format anyway in that case...
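The type-name point can be checked with a few lines of numpy (a sketch only; the sample value is the one used earlier in this comment):

```python
import json
import numpy as np

value = json.loads("1234.0005")

# A Python float and numpy's float64 are the same 64-bit type.
same_type = np.dtype(float) == np.dtype(np.float64)

# Storing the value as float64 keeps it unchanged ...
kept_as_float64 = float(np.float64(value)) == value

# ... while float32 rounds it to the nearest representable 32-bit value.
kept_as_float32 = float(np.float32(value)) == value
```

So a value read from JSON is already a float64, and a narrower type such as float32 only comes into play if the value is cast explicitly.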

You do care actually, one example is the use of CF packed data.

The packed attributes are nonsense for the template. "scale_factor" and "add_offset" are usually set up automatically for compression. You have to know the data to set up both. At template time the data is not known, so setting them up is pretty much noise, given that any later attempt to compress will change these values.

My preference is that any numeric attribute that is related to a particular variable should have the same type as that variable. Exceptions are the scale_factor / add_offset variable attributes for packed variables, where their type dictates the type/precision of the unpacked data.

We can't do it in the package [there are ints, float32, float64, boolean and bytes from CDL, for example], but you can enforce the style in the IMOS template, for example...

Maybe the only thing the JSON parser is not good at is providing bytes (CDL allows them in attributes, I think).