NCAS-CMS / cfa-conventions

NetCDF Climate and Forecast Aggregation (CFA) Conventions
https://github.com/NCAS-CMS/cfa-conventions/blob/main/source/cfa.md

Clarify what "packed" and "compressed" mean #8

Closed davidhassell closed 3 years ago

davidhassell commented 3 years ago

"Packing" and "compression" in this context refers to reduction in dataset size by convention, as opposed to by native netCDF compression techniques or by general-purpose data compression utilities such as gzip.

To quote CF:

"By packing we mean altering the data in a way that reduces its precision. By compression we mean techniques that store the data more efficiently and result in no precision loss."

Packing is lossy. This definition of compression covers only lossless compression, but it should be extended to include lossy compression in light of https://github.com/cf-convention/discuss/issues/37 (Lossy Compression by Coordinate Sampling).
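To make the distinction concrete, here is a minimal sketch of CF packing using the standard unpack formula (unpacked = packed * scale_factor + add_offset); the values and attribute choices are illustrative only:

```python
import numpy as np

# CF packing: store low-precision integers plus scale_factor and add_offset
# attributes; unpack on read as  unpacked = packed * scale_factor + add_offset
original = np.array([273.15, 274.72, 280.01], dtype="float32")

scale_factor = np.float32(0.01)
add_offset = np.float32(270.0)

# Packing is lossy: the quantisation to int16 reduces precision
packed = np.round((original - add_offset) / scale_factor).astype("int16")

# Unpacking recovers an approximation of the original values, not the values themselves
unpacked = packed * scale_factor + add_offset
```

Compression, by contrast, changes only how the bytes are stored, not the values or their data type.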

davidhassell commented 3 years ago

Compression

I think we can allow compressed aggregated variables.

If an aggregated variable is defined as compressed (e.g. by virtue of one of its dimension coordinate variables having the compress attribute) then I think we can state that the aggregated data is also compressed - i.e. the fragments are fragments of the compressed array rather than the uncompressed one.

Uncompression of the aggregated variable can then take place as usual, if required.

I think this makes logical sense, and also allows DSGs to be aggregated (#3).
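As an illustration of what this means for compression by gathering, here is a minimal numpy sketch of the uncompression step; the dimension sizes, list indices and data values are made up:

```python
import numpy as np

# Compression by gathering: a "list" dimension's coordinate variable carries a
# compress attribute naming the dimensions it replaces, and its values are
# flattened (C-order) indices into those dimensions.
lat, lon = 3, 4                              # the compressed dimensions
list_index = np.array([1, 5, 6, 10])         # the list variable's values
gathered = np.array([0.1, 0.2, 0.3, 0.4])    # aggregated (gathered) data

# Uncompression scatters the gathered values back onto the full grid; if the
# aggregated variable is gathered, its fragments hold gathered values and this
# step happens after the aggregated array has been constructed.
uncompressed = np.full((lat, lon), np.nan)
uncompressed[np.unravel_index(list_index, (lat, lon))] = gathered
```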

Packing

I think that packing should, however, be disallowed on aggregated variables: the fragments are, by definition, unpacked, and there is no guarantee that the data types of the fragments are consistent with the data type implied by the aggregated variable's packing definition, thereby making the packing poorly defined.


Note that, in any event, we presume that fragments are uncompressed/unpacked prior to construction of the aggregated data array, regardless of the compression status of the aggregated array.

davidhassell commented 3 years ago

Re. Packing.

I'm having second thoughts about disallowing it - I think we can allow it with caveats, and these caveats are not really any different to those that already exist in the specification of packing (https://cfconventions.org/Data/cf-conventions/cf-conventions-1.8/cf-conventions.html#packed-data).

Perhaps we could allow it if we said that the usual data type casting rule (if a fragment has a different data type to that of the aggregated variable then the fragment's data must be cast to the aggregated variable's data type) doesn't apply in this case.

I.e. if the aggregated variable defines packed data then the data type of a fragment must be one that is consistent with the data types of the aggregated variable and its scale_factor and add_offset attributes.

Note that we would always unpack a packed fragment prior to use, as the storage of fragments is "out of scope".

If the fragment data type is wrong then this would be an error.
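A hypothetical sketch of the consistency check described above (the function name and the exact rule are illustrative, not part of the CFA specification):

```python
import numpy as np

def fragment_dtype_ok(fragment_dtype, aggregated_dtype, scale_factor, add_offset):
    """Return True if a fragment's data type is consistent with the
    aggregated variable's packing definition, i.e. it matches either the
    packed (integer) type or the unpacked type implied by the
    scale_factor/add_offset attributes."""
    packed_type = np.dtype(aggregated_dtype)
    unpacked_type = np.result_type(np.asarray(scale_factor),
                                   np.asarray(add_offset))
    return np.dtype(fragment_dtype) in (packed_type, unpacked_type)

# An int16 aggregated variable with float32 scale_factor/add_offset would
# accept int16 or float32 fragments; anything else would be the error case.
fragment_dtype_ok("int16", "int16", np.float32(0.01), np.float32(270.0))    # True
fragment_dtype_ok("float64", "int16", np.float32(0.01), np.float32(270.0))  # False
```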

bnlawrence commented 3 years ago

I'd like to ensure that whatever we do here we support the concept of compression "along" array dimensions as well as "of" values ... this means we can imagine compression across any (or all) dimensions in a fragment. In an abstract sense this is just a different kind of packing (using compression/decompression algorithms instead of packing algorithms) - both need to apply some function to some or all of a fragment (maybe a chunk within a fragment).

We also need to have an eye to being able to pass the algorithm name down through the API into active storage ... à la ExCALIStore.

davidhassell commented 3 years ago

Hi @bnlawrence,

I missed this earlier, sorry. Does the new definition in #14 allow for this?

A fragment may be stored in any compressed form, i.e. stored using fewer bits than its original uncompressed representation, for which the uncompression algorithm is encoded as part of the fragment's metadata and so is available to the application program that is managing the aggregation.

davidhassell commented 3 years ago

We also need to have an eye to being able to pass the algorithm name down through the API into active storage ... à la ExCALIStore.

I'm pretty sure I'm missing something key here, but as yet I don't quite follow this. Is this not the preserve of the library that actually opens and reads the file (e.g. netCDF4-python), rather than the library that asks for the file to be read (e.g. cf-python)? Whatever form the compression takes, it will be known to the library that actually opens the file and looks at the metadata ... ?
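For instance, with netCDF4-python the library that opens a netCDF fragment can read the compression settings directly from the variable's own metadata (the file and variable names here are made up):

```python
from netCDF4 import Dataset

# Open a fragment file and inspect how its data are compressed; the reading
# library uncompresses transparently when the data are accessed.
with Dataset("fragment_0.nc", "r") as nc:
    var = nc.variables["tas"]
    print(var.filters())   # e.g. {'zlib': True, 'shuffle': True, 'complevel': 4, ...}
    data = var[...]
```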

nmassey001 commented 3 years ago

I might be a bit late here, but to me "packing" means transforming the data within the file to reduce its precision, e.g. packing a 32-bit float into a 16-bit int. This is lossy. "Compression" means applying an external algorithm to compress the data, without changing the data's representation. Whether this algorithm is applied by the netCDF4 library, using DEFLATE, or externally by gzip, pkzip, bzip2, etc. doesn't matter. The compression could also be lossy or lossless; that also doesn't matter, as long as the data type isn't transformed.

In summary:

- pack a float32 into an int16 and you will always get an int16;
- compress a float32 and you will always get a float32 when you decompress.
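A minimal sketch of that summary with numpy and zlib (zlib standing in for whichever DEFLATE-style compressor is actually used; the values are illustrative):

```python
import zlib
import numpy as np

data = np.array([273.15, 274.72, 280.01], dtype="float32")

# Packing: the representation changes (float32 -> int16) and precision is lost
packed = np.round((data - 270.0) / 0.01).astype("int16")

# Compression: the bytes are stored more efficiently, but decompression gives
# back exactly the same float32 values
compressed = zlib.compress(data.tobytes())
restored = np.frombuffer(zlib.decompress(compressed), dtype="float32")
assert restored.dtype == np.dtype("float32")
assert np.array_equal(restored, data)
```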