Documenting how cdms2 handles packed data

jypeter commented 7 years ago

I have just remembered that I have sometimes had problems with packed data. I thought I had an old issue about that somewhere, but I have not found it on GitHub. On the other hand, I have found https://github.com/UV-CDAT/uvcdat/issues/420 and I wonder where/if this writePacked function is available.

If that's not already the case, it would be nice to document somewhere how cdms2 handles packed data. And document this in a way I can easily find the information next time I need it

cdms2 transparently unpacks the data by default, so that the end-user does not have to take care of this (or even know that the data is packed). That's a neat feature!
How can you read the packed part of the data without unpacking it, e.g. when you want to check things or unpack the data yourself? Is this what the raw option is for? (I vaguely remember a raw option, though I'm not sure I have ever used it.)
Can you now officially/automatically/easily write packed data?
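For reference, the unpacking that a reader library does transparently follows the CF convention formula `unpacked = packed * scale_factor + add_offset`. The sketch below is plain Python, not cdms2 code; the `unpack` helper and the sample attribute values are hypothetical and only illustrate the arithmetic.

```python
def unpack(packed_values, scale_factor=1.0, add_offset=0.0):
    """Apply the CF packed-data formula: packed * scale_factor + add_offset."""
    return [p * scale_factor + add_offset for p in packed_values]


# Hypothetical 16-bit packed values and attributes, e.g. a temperature in K
packed = [-32768, 0, 32767]
scale_factor = 0.001
add_offset = 280.0

unpacked = unpack(packed, scale_factor, add_offset)
print(unpacked)
```

Reading the "raw" values would mean getting `packed` back unchanged, together with the `scale_factor` and `add_offset` attributes, so that this step can be done by hand.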

I have googled cdms add_offset (and cdms scale_factor) and it brings you to CHAPTER 6 Climate Data Markup Language (CDML), which I'm not sure is the best answer... In a way, it's even worse if I google cdms2 add_offset because I don't even find the Chapter 6 above! The fact that using cdms or cdms2 in a search string does not return the same results may also be a problem...

I mostly use CMIPn data that does not use packing, so I don't know if this kind of data is common. But this is documented in netcdf4-python (search scale_factor and add_offset).

jypeter commented 7 years ago

I'm also wondering what happens when you read/write a file variable in a packed file.
Could we have a file with packed data in uvcdat/sample_data? Unless there is already one. See also #141

dnadeau4 commented 7 years ago

@jypeter I used ncdump on every file in sample_data for cdms and could not find a packed data array. This is really missing from our testbed and needs to be added. I don't know if cdms writes packed data, maybe @doutriaux1 can help.

@jypeter Can you provide me with one of your packed data files?

durack1 commented 7 years ago

@dnadeau4 @jypeter there is functionality to pack data in netcdf4, which effectively reduces the precision to the short type; take a peek here

dnadeau4 commented 7 years ago

@durack1 So it takes the min/max to compute the scale/offset.

This seems to be the best practice from the netCDF implementation.
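As a rough sketch of that min/max approach, the formulas below follow the Unidata netCDF best-practice recipe for packing into an n-bit signed integer. This is not the cdms2 source; the function names `compute_packing` and `pack` are hypothetical, and missing values are ignored for simplicity.

```python
def compute_packing(values, n_bits=16):
    """Derive scale_factor and add_offset from the data min/max."""
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        # Constant field: any non-zero scale works, pick 1.0
        return 1.0, vmin
    scale_factor = (vmax - vmin) / (2 ** n_bits - 1)
    add_offset = vmin + 2 ** (n_bits - 1) * scale_factor
    return scale_factor, add_offset


def pack(values, scale_factor, add_offset):
    """Convert floats to packed integers (inverse of the unpack formula)."""
    return [int(round((v - add_offset) / scale_factor)) for v in values]


data = [250.0, 275.5, 300.0]
scale_factor, add_offset = compute_packing(data)
packed = pack(data, scale_factor, add_offset)

# Unpacking recovers the data to within half a quantization step
restored = [p * scale_factor + add_offset for p in packed]
```

With 16 bits, the minimum maps to -32768 and the maximum to 32767, so the full range of the short type is used.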

@jypeter you just need to pass pack=True in cdms2.write()

doutriaux1 commented 7 years ago

@dnadeau4 yes pack=True, but if I remember correctly this does not work well for an extended dimension (like time) if you do many writes in a row, because the min/max/scale/offset obviously changes between writes.
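To illustrate why successive writes are a problem, here is a small plain-Python sketch. It assumes the scale/offset are recomputed from the min/max of each chunk and that the variable keeps a single pair of attributes; the helper `packing_params` and the sample chunks are hypothetical, and this does not reproduce what cdms2 actually does internally.

```python
def packing_params(values, n_bits=16):
    """Scale/offset from min/max, as in the netCDF best-practice recipe."""
    vmin, vmax = min(values), max(values)
    scale = (vmax - vmin) / (2 ** n_bits - 1)
    offset = vmin + 2 ** (n_bits - 1) * scale
    return scale, offset


# Two time chunks written one after the other (hypothetical values)
chunk_jan = [270.0, 275.0, 280.0]
chunk_jul = [285.0, 295.0, 305.0]

scale_jan, offset_jan = packing_params(chunk_jan)
scale_jul, offset_jul = packing_params(chunk_jul)

# January is packed with its own parameters...
packed_jan = [int(round((v - offset_jan) / scale_jan)) for v in chunk_jan]

# ...but if the attributes are overwritten by the July write,
# January is later decoded with the wrong scale/offset.
wrong_jan = [p * scale_jul + offset_jul for p in packed_jan]
print(chunk_jan[0], "is read back as", wrong_jan[0])
```

One way around this would be to choose a fixed scale/offset covering the whole expected range before the first write, at the cost of some precision.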

jypeter commented 7 years ago

@dnadeau4 after much looking around, I have found the following file that has packed data inside a nc4 compressed file! One of our PhD students had problems with it a few years ago... netcdf4_compressed_example.nc

@dnadeau4 and @durack1 thanks for pointing out the pack option and the matching code. The doc string for pack should probably be updated, because it says:

pack :: (False/True/numpy/numpy.int8/numpy.int16/numpy.int32/numpy.int64) pack the data to save up space

It should probably mention the http://cfconventions.org/cf-conventions/v1.6.0/cf-conventions.html#packed-data link and @doutriaux1's warning about multiple writes. And it should probably only list the False/True options, because, unless I'm mistaken, the source code only uses

if pack:

This is probably low priority, because we don't run across packed data very often, but at least the issue is listed

CDAT / cdms

Documenting how cdms2 handles packed data #140