NCAS-CMS / cfa-conventions

NetCDF Climate and Forecast Aggregation (CFA) Conventions
https://github.com/NCAS-CMS/cfa-conventions/blob/main/source/cfa.md

Compression and packing #14

Closed davidhassell closed 3 years ago

davidhassell commented 3 years ago

Fixes: #3 Fixes: #8

nmassey001 commented 3 years ago

I think it is worth pointing out what packing and compression mean in this instance. I presume you mean that packing is using scale_factor and add_offset to pack data into a smaller data type? What do you mean by "compression"? Do you mean by an external program, like gzip, or pkzip, or zstd? Is the file uncompressed before reading into the aggregated data and how? By an external program called by the library? Or do you mean using the inbuilt zlib netCDF4 compression?

(I realise most of this is implementation detail!)
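For readers unfamiliar with CF packing, a stdlib-only sketch of the `scale_factor`/`add_offset` convention (the attribute values here are illustrative, not mandated by CF):

```python
# CF packing stores floats as small integers via two attributes.
# CF unpacking rule:           unpacked = packed * scale_factor + add_offset
# Packing rule (its inverse):  packed = round((unpacked - add_offset) / scale_factor)

scale_factor = 0.01  # illustrative values only
add_offset = 0.0

def pack(value):
    """Pack a float into an integer suitable for a smaller data type."""
    return round((value - add_offset) / scale_factor)

def unpack(packed):
    """Recover the (approximate) original float from the packed integer."""
    return packed * scale_factor + add_offset

packed = pack(12.34)  # -> 1234, which fits comfortably in an int16
print(packed, unpack(packed))
```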

davidhassell commented 3 years ago

Hi @nmassey001,

Again, I'm guilty of assuming that CF is a given - apologies! I shall update the text to better define compression/packing in this context.

davidhassell commented 3 years ago

Hi @nmassey001, how about now?

This allows any "internal" compression (= reduction in dataset size), e.g., HDF5 compression, compression by gathering, scale_factor/add_offset packing, WGDOS (for PP files!), etc., but excludes "external" compression, e.g. a gzipped fragment file.

Exclusion of external compression made sense to me, as there is no guarantee that it can be recognized by the application program, nor that it is possible to uncompress it on any hardware/OS.
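To illustrate the internal/external distinction with stdlib `gzip` (a sketch; the fragment bytes are made up):

```python
import gzip

# "External" compression wraps the whole fragment file: the netCDF library
# cannot open it directly, so the application would have to recognize the
# format and decompress it first -- which is why it is excluded here.
fragment_bytes = b"pretend this is a netCDF fragment file"
wrapped = gzip.compress(fragment_bytes)

# The wrapped bytes are opaque until explicitly unwrapped:
assert wrapped != fragment_bytes
assert gzip.decompress(wrapped) == fragment_bytes

# "Internal" compression (e.g. HDF5/zlib chunk compression, or
# scale_factor/add_offset packing) lives *inside* the file and is undone
# transparently by the reading library, so no external step is needed.
```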

davidhassell commented 3 years ago

Hi @bnlawrence , the idea was to say that when an aggregation variable defines compression by convention, then the aggregated data is also compressed by convention, so that a fragment is a part of the compressed array, rather than the uncompressed one. (Another hard-to-parse sentence!)

That probably didn't help, but writing it made me think that the sentiment is actually obvious, so perhaps this whole paragraph can just be removed:

~If the aggregated variable is defined as being compressed by convention (with, for instance, compression by gathering, discrete sampling geometry ragged array representations, packing via the scale_factor and add_offset attributes, etc.) then the uncompressed fragment comprises a part of the compressed aggregated data.~

davidhassell commented 3 years ago

Hi @sadielbartholomew - thanks for your review. This was a big change in approach - and I should explain myself ...

I initially thought that the cases of "CF packing" and "CF compression" did indeed require separate treatment here, because I thought that we could allow CF compression on aggregation variables, but not CF packing. When I realized that CF packing and CF compression defined at the aggregation variable level should be allowed, and the aggregated data would reflect this, it became clear that the distinction between CF packed and CF compressed fragments was not needed, as we just insist that the fragment is uncompressed before use in all cases.

This led me to the wholly generic definition of "compression" (= stored using fewer bits than its original uncompressed representation), that encompasses CF packing/compression and any other forms such as HDF5, WGDOS, ...

davidhassell commented 3 years ago

I don't find it particularly obvious

Fair enough! Obviousness is obviously a very subjective concept, and it may seem more so to me just because I've been thinking about it a lot recently.

What would a user do when confronted with an aggregation variable that defines CF packing or CF compression? To me, the current text would lead them to say that the aggregated data is exactly what they would have expected to have found in the dataset had the variable been a normal, non-aggregation one that contains its own data in the usual netCDF fashion. As that is completely defined in CF, we don't need to go further here. What do you think?

Clarifying this in the text would not be a problem, but I felt the existing text that attempted to do so created more confusion than it sought to dispel, given that it is already stated (I think I mean "stated", rather than "implied" - more subjectiveness, perhaps).

sadielbartholomew commented 3 years ago

What would a user do

Second-guessing this is always the difficult part :laughing:

I felt the existing text that attempted to do so created more confusion than it sought to dispel

Fair enough, that could well be true and having re-read the post-changeset text I suppose the necessary context does follow logically on from what is there without that paragraph. Unless @bnlawrence has strong opinions to the contrary let's leave it out!

bnlawrence commented 3 years ago

Is this right?

Unless the aggregated variable is defined as having packed data via the scale_factor and add_offset attributes, a fragment may have a different data type to that of the aggregated variable. In this case the fragment's data must be cast to the aggregated variable's data type.

Are you saying that we can have 32 bit reals in the fragments and a 64 bit real aggregation variable? I would have thought we don't want that (it goes beyond the conceptual framework I had in mind, which doesn't mean it's wrong of course).

bnlawrence commented 3 years ago

I've suggested to David that the obviousness or not of this discussion would be helped by an example ...

davidhassell commented 3 years ago

Hi @bnlawrence

Are you saying that we can have 32 bit reals in the fragments and a 64 bit real aggregation variable? I would have thought we don't want that (it goes beyond the conceptual framework I had in mind, which doesn't mean it's wrong of course).

Thanks for querying this. Yes, I am generally allowing (e.g.) 32 bit reals in the fragments and a 64 bit real aggregation variable. This ended up in the spec simply because it's one of those things that cf-python has been doing for years "because it can", rather than because it has been requested or is necessary.

I can see why this feature could be undesirable, as an individual number in the aggregated data might be different to the corresponding number in a fragment. Also, by disallowing it we could remove mentions of the special case of CF packing in the compression section.

However, from the analysis viewpoint, you may well not have control over the fragments, so, e.g., saving an aggregation at a different precision to the fragments would not be possible. Also from an analysis stance, maybe you want to aggregate fragments from different sources (models?) that happened to be stored in different precisions. In memory you would do exactly this - i.e. cast to a common precision. I'd be very interested to hear what you think of these use cases. Are they real enough ... ?
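A stdlib-only illustration of why individual values can differ when 32-bit fragments are cast into a 64-bit aggregation (the value here is just an example):

```python
import struct

# A value written to a 32-bit fragment and later cast up to 64 bits does
# not round-trip exactly: the fragment's float32 value differs from the
# float64 value the aggregation would otherwise hold.
value64 = 0.1                                                 # Python float = float64
value32 = struct.unpack("<f", struct.pack("<f", value64))[0]  # via float32

print(value64)  # 0.1
print(value32)  # 0.10000000149011612

# Casting the fragment back up to float64 cannot recover the lost bits:
assert value32 != value64
```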

davidhassell commented 3 years ago

Examples will follow .....

sadielbartholomew commented 3 years ago

The examples added look great to me, thanks @davidhassell, consider my review approval concrete now rather than tentative.

davidhassell commented 3 years ago

Fair enough - I had suggested just deleting that whole paragraph (https://github.com/davidhassell/cfa-conventions/pull/14#issuecomment-832753157) - would that work?

nmassey001 commented 3 years ago

I think deleting it is fine. The previous paragraph is understandable and clear, so I don't think this one is necessary.