When you say writing objects, do you mean serializing the Python class, or writing the file formats? For the file types, many specify an encoding ahead of time that we would have to comply with. For serialization of the Python objects, I guess the question would be how often people write the actual objects to disk.
I mean writing file formats. In fact, file types specify an encoding, but some of them also give some choice. CfRadial, for instance, allows fields to be stored as ncbyte, short, int, float, or double (all signed). So the question is how to let the user choose, comply with the file types, and at the same time not fill up the function calls with options.
Agreed, for file formats which specify an encoding (MDV), the only option is to cast the values in the arrays into the appropriate type after attempting to find reasonable scale/offset parameters if these are not provided by the user.
For file types which do allow a choice of encoding, the question of what to do becomes more complicated. I believe that with the compression of variables available in NetCDF4 there is little benefit to encoding values into integer types to save space while maintaining the original data precision, that is, without "destroying data". In addition, when a loss of precision is acceptable there are more efficient ways of reducing the size of the data than encoding it into other data types.
For example, many radar moments in Sigmet files are stored with 16 bits of precision but Py-ART decodes these into more usable 32-bit floating point values. When these are written to uncompressed NetCDF files (for example the NetCDF3 format, which does not support compression), the file sizes are typically over two times the size of the original file. When compression is turned on (the default for NETCDF4 files written with Py-ART), the file sizes are significantly reduced, although still typically larger than the original raw size.
The Cf/Radial file size can be further reduced by converting the field variables to integer values, but this is nearly always a lossy process which reduces the precision of the data. I consider an implicit loss of precision to be highly undesirable in scientific software and something that should be avoided. Explicitly reducing the precision when a user requests such an operation is acceptable, but the documentation should warn about the consequences.
Another option when serializing data using the netcdf4-python library is to use the least_significant_digit parameter. This truncates the values in the array, allowing for more efficient compression, which can result in reductions in file size similar to or better than those possible by encoding the values in a different format. Py-ART supports this type of truncation on a per-variable basis by setting the least_significant_digit key in a field or variable dictionary (see line 634 in cfradial.py). This feature is not well documented, which should be addressed. In addition, there are other options passed to the netcdf4-python library that can affect file size that should be exposed as parameters in the write_cfradial function.
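For reference, a minimal sketch of how that per-variable truncation can be requested, assuming a Radar object read from some input file (the file name and field name here are placeholders, not from this issue):

```python
import pyart

# Hedged sketch: truncate reflectivity to one decimal place so that the zlib
# compression used for NETCDF4 files is more effective.
radar = pyart.io.read("example_radar_volume.sigmet")
radar.fields["reflectivity"]["least_significant_digit"] = 1
pyart.io.write_cfradial("example_output_cfradial.nc", radar, format="NETCDF4")
```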
I'm aware of these possibilities, however I'm still not convinced; I will make some tests and return to this. In all my experience, casting to integers has been standard procedure and the NetCDF library has made it even more practical. Furthermore, when the input data is already scaled I usually don't consider it to be any loss.
But let me make another argument: Py-ART should concentrate on data manipulation, not on reading and writing operations. That is, read/write should be as transparent as possible and allow the user indirect access to the main (if not all) tools of the NetCDF4 library, the data type being one of those. Otherwise this may become an inconvenient limitation, especially if one is running an OPeNDAP/THREDDS server (like the ECMWF does), which requires visualization products to be saved in files (e.g. NetCDF, HDF5, etc.) before clients request them.
That is why I have always believed the connection between an object and its file (NetCDF) version should be as close as possible, even to the point of considering the object as the file itself in a different form. In this context, not providing int types seems like an awful limitation, while providing float on the Python side and int on the file side is an interesting functionality. By the way, I was actually surprised to find out that the Radar and Grid structures don't carry information about the dimensions of every variable, but this is a topic for another discussion.
Anyway, I wouldn't worry about misuse of this functionality; one could also put it outside the variable dict so that the standard behavior of not casting could only be changed by explicit user intervention. Further, although inconvenient, warnings could be shown every time a cast is done.
P.S. MDV does allow a choice between uint8, uint16 and float32, and if I recall correctly Sigmet also allows a choice between 16 bits and 8 bits.
@gamaanderson After reflecting on this topic this evening I'm fully in support of adding better support in Py-ART for encoding variables as integers when writing to NetCDF files. As you pointed out, this is a standard procedure in the field and is transparently supported in the netcdf4-python library. In addition, downstream tools may expect integer-encoded moments, which need to be supported. Even if truncation and compression result in smaller files at times, that method is neither as common nor as widely supported.
I still think the Radar object should store data as floats after performing any scaling and offset calculation. The Radar class sits at the top of the abstraction of numerous radar data sources and is expected to be independent of the original source of the data. As with most abstractions of this type, the choice must be made either to support all the features of the various file formats and determine how to indicate that the original file did not support a specific feature, or to choose a subset of the features and throw out data from some formats when they contain data outside of this subset. The Radar object chooses the latter; certain parameters available in some file formats cannot be captured by the object. Although this places some limits on what can be done with the class, it also simplifies the use of the object in functions which operate on it. The various correction, plotting, and mapping routines work on radar data regardless of the source specifically because the Radar object offers a uniform representation of the data. Dimensions do not need to be specified in the Radar and Grid classes because they are inferred from the location of the data in the class, in most cases; some of the optional attributes are a bit more flexible, for example the instrument_parameters attribute, which is a bit messy.
So in short, the disconnect between the Radar object and the file is intentional. If you do want/need fine control over a radar file you need to drop down a level of abstraction to the file-specific classes (SigmetFile, MDVFile, netCDF4.Dataset, etc.). These provide more complete access to all the features of the file, but at a loss of the user friendliness and generality of the Radar class.
Back to the question at hand: how should encoding to integers when writing NetCDF files (and other files?) be accomplished? I agree option 1 is the easiest to implement but a pain to use. Options 2 and 3 require more work but would be much more powerful and simpler to use. Since certain keys in the variable dictionaries already have special meaning, i.e. 'data' contains the numerical data and 'least_significant_digit' specifies truncation, I would be in favor of adding a few additional keys as you mentioned. 'write_as_dtype' would work for specifying the type; others could be added for specifying compression type and level, non-default chunking, etc. I think some of these are stored as "special" or "virtual" variable attributes in the NetCDF file and can be viewed with ncdump -s.
The various write functions would not be required to use these keys if they did not support the functionality, but those that did could. I also think that a function-level parameter that would override or set all of these keys may be useful.
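As a rough illustration only (the key name and behavior are still a proposal, not an implemented API), a field dictionary carrying such a key might look like this:

```python
import numpy as np

# Proposal sketch: the existing special keys plus the proposed 'write_as_dtype'
# key. A writer that supports the key would pack the float data into int16
# using scale_factor/add_offset; writers that do not would simply ignore it.
# The numbers here are made up for illustration.
reflectivity = {
    "data": np.ma.masked_invalid(np.full((360, 500), 35.5, dtype=np.float32)),
    "units": "dBZ",
    "_FillValue": -9999.0,
    "scale_factor": 0.01,
    "add_offset": 16.0,
    "write_as_dtype": "i2",  # proposed key: NetCDF short / numpy int16
}
```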
I've seen a few NetCDF files specify an "unsigned" boolean attribute to accommodate the storage of unsigned data, but this has always seemed like a kludge rather than an elegant solution. I have no problem reading data with this attribute as unsigned but would prefer not to produce files with this attribute unless a compelling case can be made.
I forgot about the multiple data types in MDV; it has been a while since I've looked into that code, so it will be necessary to decide how this works before too much more work is done on an MDV writer.
@jjhelmus I'm happy we came to an agreement. I didn't know of this "unsigned" attribute, but I agree we should not use it. However, I would allow functions to recalculate scale and offset (if given) in order to fit signed types, instead of raising an error or falling back to the standard float. As for HDF5, some conventions expect unsigned types, so, as you said, every function shall choose whether or not to honor the given "write_as_dtype" (this would be a numpy dtype string, right?).
I had a look at the virtual attributes of NetCDF and they appear to be a good solution for further control over the file; a description of them is found in the man page for ncdump. However, I'm not sure if assigning them is enough; I think we will have to extract them and re-pass them as arguments to createVariable, just to be safe.
The only problem that still remains is that the function that governs the Cf/Radial convention is write_cfradial, and the one that calls createVariable is _create_ncvar. So if some variable like azimuth (which should always be float) comes with "write_as_dtype": "int", how will write_cfradial prevent _create_ncvar from casting this variable? My solution would be an extra argument: _create_ncvar(..., allow_casting=False).
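To make the proposal concrete, here is a minimal sketch of what that extra argument could look like. This is not the actual Py-ART _create_ncvar, which also handles masked arrays, attributes, and the virtual attributes discussed above:

```python
# Sketch of the allow_casting proposal; packing via scale_factor/add_offset
# and attribute handling are omitted for brevity.
def _create_ncvar(dic, dataset, name, dimensions, allow_casting=True):
    data = dic["data"]
    if allow_casting and "write_as_dtype" in dic:
        dtype = dic["write_as_dtype"]  # proposed key, e.g. 'i2'
    else:
        dtype = data.dtype             # keep the in-memory (float) type
    ncvar = dataset.createVariable(
        name, dtype, dimensions, fill_value=dic.get("_FillValue"))
    ncvar[:] = data
    return ncvar

# write_cfradial could then protect coordinate-like variables, for example:
# _create_ncvar(azimuth_dic, dataset, "azimuth", ("time",), allow_casting=False)
```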
The virtual attributes should be detected in the variable dictionaries and explicitly added to the createVariable call. I do not think assigning them after creation of the variable will have the desired effect. In addition, it may be necessary to exclude them when setting the NetCDF variable attributes. This should be checked with some sample files.
Also, since the other virtual attributes begin with an underscore and a capital letter (_X), I think it would be reasonable to have the key specifying the dtype to write be _Write_as_dtype; would you agree?
As for the value of this key, I think any valid input to the datatype parameter in netCDF4's createVariable function should be allowed.
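For example, createVariable accepts both numpy dtype objects and their string codes; a small standalone illustration (the file and variable names are arbitrary):

```python
import numpy as np
import netCDF4

# Any datatype accepted by netCDF4's createVariable could serve as the value
# of the proposed key, e.g. a string code or a numpy dtype object.
dataset = netCDF4.Dataset("example_datatypes.nc", "w", format="NETCDF4")
dataset.createDimension("time", 10)
dataset.createVariable("short_var", "i2", ("time",))        # 16-bit signed int
dataset.createVariable("float_var", np.float32, ("time",))  # 32-bit float
dataset.close()
```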
I'm of the opinion that if a user sets the write_as_dtype key for a variable in a manner that breaks the Cf/Radial convention, the write_cfradial function should take no action to correct this except perhaps issuing a warning. Py-ART should not get in the way when a user explicitly specifies an action, even if it breaks a convention.
No comments, I agree with all points.
As this is settled, please consider PR #266 out of date. I plan to implement the points discussed here at some point in the next weeks, but if someone wants to run ahead and do it, please just tell me.
I have a small problem: when the user asks for scaling but does not specify scale_factor and add_offset, I calculate them, and in the process I also calculate a _FillValue. However, the variable dict may already have a _FillValue. I believe this original _FillValue is only valid for the float data, so I should overwrite it. Do you agree?
Would casting the current _FillValue to the encoding type work? The _FillValue in the default configuration is -9999.0, which works with signed 32- and 16-bit integers. Casting the default value will overflow an 8-bit byte/char/int field and cause issues with unsigned types. For those a different solution would be needed, perhaps using the largest expressible value?
I believe it is a little more complicated than that, and even related to the missing_value and _FillValue discussion. That is because, at least in the CfRadial convention, _FillValue refers to the scaled data. That is, I should not cast 999.0 to 999, but rather calculate what the scaled equivalent is (i.e. int((999.0 - offset) / scale)), which could, and most probably would, overflow.
The question is whether _FillValue is just a file technicality to replace masked arrays, or relevant to the user itself. The name with an underscore indicates the first option, and therefore a user that did not specify scale_factor and add_offset should not be surprised that a more suitable _FillValue was used.
Just to be specific, when calculating scale_factor and add_offset, I do reserve the least expressible value for _FillValue.
Also, that should not be a problem, since we are using masked arrays and the user should not work directly with _FillValue. If, on the other hand, the user wants to reserve a value for some special meaning, it should be left unmasked and defined with another attribute, like 'missing_value', 'no_echo' or 'not_scanned_region', etc.
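For concreteness, a rough sketch of that calculation, assuming a signed integer target type; this is not the actual PR #266 code:

```python
import numpy as np

def compute_packing(data, dtype=np.int16):
    """Sketch: derive scale_factor, add_offset and _FillValue for packing
    float data into a signed integer type, reserving the minimum expressible
    value for _FillValue as described above."""
    info = np.iinfo(dtype)
    fill_value = info.min                      # reserved for _FillValue
    valid_min, valid_max = info.min + 1, info.max
    dmin, dmax = float(np.ma.min(data)), float(np.ma.max(data))
    if dmax == dmin:                           # avoid a zero scale factor
        return 1.0, dmin, fill_value
    scale_factor = (dmax - dmin) / (valid_max - valid_min)
    add_offset = dmin - valid_min * scale_factor
    return scale_factor, add_offset, fill_value
```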
I thought _FillValue referred to the original unscaled values, as the scaled values would be more difficult to mask due to numerical precision in the scaling. I don't have the materials with me at the moment to check on this, so I cannot say for certain. I'll investigate this, and the question of whether _FillValue is a user or file specification, in detail when I'm back in my office on Monday.
Either way, as you mention, this should be transparent to the users. They will be working with MaskedArrays which hide these details.
Specifying other values for special meaning is difficult and should be handled with care. I've been leaning towards using an additional field variable to store this type of data. Some formats (NEXRAD that I know for certain) use different sentinel values in the raw files to indicate "region not scanned" and "below threshold" which are currently both expressed by masking the data in the returned Radar field. Differentiating between these two cases is currently not possible using only the information in the Radar object.
Just to clarify, I didn't intend to suggest Py-ART should use values with special meanings; that was just an example in a more general context.
According to section 2.5.1 of the CF Conventions, version 1.6, the value in _FillValue should be interpreted relative to the raw values, not those that result after the scale and offset are applied:
The missing values of a variable with scale_factor and/or add_offset attributes (see Section 8.1, "Packed Data") are interpreted relative to the variable's external values, i.e., the values stored in the netCDF file (a.k.a. the packed values, the raw values, the values stored in the netCDF file), not the values that result after the scale and offset are applied. Applications that process variables that have attributes to indicate both a transformation (via a scale and/or offset) and missing values should first check that a data value is valid, and then apply the transformation. Note that values that are identified as missing should not be transformed. Since the missing value is outside the valid range it is possible that applying a transformation to it could result in an invalid operation. For example, the default _FillValue is very close to the maximum representable value of IEEE single precision floats, and multiplying it by 100 produces an "Infinity" (using single precision arithmetic).
From that document it does seem that this value can be set by users, but there are library-defined default values. In Py-ART we should respect a non-default value set by the user. In addition, since the default values in the netCDF library (see netcdf.h) do not match the examples provided in the CF convention documentation (-999.9) nor those typically used by the community (-9999, -999, -999.9), I do not think the default values should factor into our decision much. The current "default" fill value in Py-ART is -9999.0, which I think is reasonable provided that it is cast appropriately and not used for low-precision nor unsigned data types.
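A small standalone illustration of the rule quoted above (the packed numbers and scaling are made up): validity is checked against _FillValue on the packed integers first, and only then is the transformation applied:

```python
import numpy as np

# Check validity against the packed (raw) values first, then unpack.
packed = np.array([-32768, -4800, 0, 4800], dtype=np.int16)
fill_value = np.int16(-32768)
scale_factor, add_offset = 0.01, 16.0

valid = np.ma.masked_equal(packed, fill_value)      # step 1: mask fill values
unpacked = valid * scale_factor + add_offset        # step 2: apply scale/offset
```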
I've converted PR #266 to implement what was discussed here.
With PR #266 merged is there anything further that needs to be addressed in this issue or can it be closed?
Yes, after PR #266 this can be closed, but it has not been merged yet.
Right you are, sorry for jumping the gun. I'll try to have a look at PR #266 next week.
I would like to open a discussion on this topic, since I have already run into it twice (NetCDF and MDV). It is about, when writing objects, encoding variables as int (8 or 16 bits) instead of float in order to reduce file size. The problem is how to decide the type; I see three possibilities:
1. A utility function that casts the data before writing.
2. A special attribute (key) in the variable dictionary specifying the type to write.
3. An extra option in each write routine.
A utility for number 1 is already in PR #266, but this is far from optimal since it may destroy data.
Number 2 is my favorite and would just be a question of defining the attribute name and implementing it.
I am not a fan of number 3, since it would require adding an option to all write routines; however, one thing plays in its favor: NetCDF uses only signed ints and MDV only unsigned ones. Still, I believe we could recalculate the scale and offset to solve this problem without user intervention.
What do you think? Do you see any problems or other alternatives?