cf-convention / cf-conventions


Metadata to encode quantization properties #403

Closed czender closed 4 weeks ago

czender commented 2 years ago

Metadata to encode lossy compression properties

Moderator

@davidhassell

Moderator Status Review [last updated: 2024-04-18]

  1. Submit PR #519
  2. Write "Technical Proposal Current 2024 Draft"
  3. Upload "Original Technical Proposal Options Discussed at 2022 CF Workshop"

Requirement Summary

The final proposal should make the lossy compression properties of data variables clear to interested users.

Technical Proposal Summary Current 2024 Draft

The current draft is Option 2 of the five discussed at the 2022 CF workshop (those five options are retained below for completeness). The framework of attributes and controlled vocabularies (CVs) to encode lossy compression is thought to be sufficiently general and flexible to handle a wide variety of lossy compression algorithms. However, the initial PR describes the attributes and CVs required for only the four forms of quantization: DigitRound, BitGroom, BitRound, and Granular BitRound. The latter three algorithms are also currently implemented in the netCDF library.
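For readers unfamiliar with these algorithms, the core idea of BitRound can be sketched in a few lines of numpy. This is an illustration only, not the netCDF library's implementation; it assumes float32 input and 1 <= NSB <= 22, and it ignores NaN/Inf and overflow edge cases:

import numpy as np

def bitround(data, nsb):
    """Illustrative BitRound: keep nsb explicit significand bits of
    float32 values, rounding to nearest by adding half of the last
    retained ULP before masking off the discarded bits."""
    bits = np.ascontiguousarray(data, dtype=np.float32).view(np.uint32)
    shift = 23 - nsb                    # float32 has 23 explicit mantissa bits
    half = np.uint32(1 << (shift - 1))  # 0.5 ULP at the retained precision
    mask = np.uint32((0xFFFFFFFF << shift) & 0xFFFFFFFF)
    return ((bits + half) & mask).view(np.float32)

The zeroed trailing mantissa bits are what make the subsequent lossless stage (e.g., DEFLATE or Zstandard) so much more effective.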

Six properties of lossy compression were agreed to be good metadata candidates based on their utility to help data users understand precisely how the data were modified, and the statistical differences expected and/or achieved between the raw and the lossily compressed data. Keeping with CF precedents, the CV values are case-insensitive with white space replaced by underscores. The properties and CVs supported in the initial PR are:

  1. family: quantize
  2. algorithm: depends on the family above. Allowed values for quantize currently include bitgroom, bitround, and granular_bitround.
  3. implementation: free-form text that concisely conveys the algorithm provenance, including the name of the library or client that performed the quantization, the software version, and the name of the author(s) if deemed relevant.
  4. parameters: the parameters used in applying the algorithm to each variable. The CV of bitround parameters is NSB or number_of_significant_bits or keep_bits. The CV of the bitgroom and granular_bitround algorithms is NSD or number_of_significant_digits.

Additionally, two categories of error metrics merit consideration for inclusion as optional attributes. These categories, referred to here as prior_metrics and post_metrics, both quantify the rounding errors incurred by quantizing the raw data. Neither category is included in the initial PR:

  1. prior_metrics: depends on algorithm. The CV for bitround is maximum_relative_error (computed as 2^(-NSB-1)). The CV for bitgroom and granular_bitround could be maximum_absolute_error (computed as formula TBD).
  2. post_metrics: these metrics can be computed for all algorithms after the lossy compression has occurred (see the sketch below). The CV could include maximum_absolute_error, DSSIM, data_structural_similarity_image_metric, mean_absolute_error, snr, signal_to_noise_ratio, standard_deviation_of_difference, rmsd, root_mean_square_difference.
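
To make the post metrics concrete, here is a minimal sketch of how several of them could be computed from the raw and quantized arrays. The names follow the candidate CV above; DSSIM is omitted because it needs an image-similarity implementation, and the SNR definition shown is only one of several in use:

import numpy as np

def post_metrics(raw, quantized):
    # Illustrative only: raw and quantized are same-shape float arrays
    diff = quantized - raw
    return {
        "maximum_absolute_error": float(np.max(np.abs(diff))),
        "mean_absolute_error": float(np.mean(np.abs(diff))),
        "root_mean_square_difference": float(np.sqrt(np.mean(diff ** 2))),
        "standard_deviation_of_difference": float(np.std(diff)),
        # One common SNR definition, in decibels
        "signal_to_noise_ratio": float(10.0 * np.log10(np.mean(raw ** 2) / np.mean(diff ** 2))),
    }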

Benefits

Users (and producers) would benefit from knowing whether, how, and by how much their data has been distorted by lossy compression. Controlled vocabularies, algorithm parameters, and metrics will make this possible. Stakeholders include data repositories, MIP organizers, and downstream users.

Status Quo

The netCDF 4.9.2 library applies a single library attribute of the form _QuantizeBitGroomNumberOfSignificantDigits=3. This approach lacks some desirable properties, such as extensibility, a controlled vocabulary, and algorithm provenance. Beyond that, there are currently no known metadata standards for lossy compression being employed for CF-compliant geophysical datasets.
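
For example, with netCDF4-python (version 1.6 or later, built against netcdf-c 4.9 or later), the library-level quantization that produces this attribute is requested at variable creation time. The file and variable names below are illustrative:

import numpy as np
from netCDF4 import Dataset

with Dataset("example.nc", "w") as nc:  # hypothetical file name
    nc.createDimension("x", 100)
    # quantize_mode may be "BitGroom", "GranularBitRound", or "BitRound"
    var = nc.createVariable("t", "f4", ("x",),
                            significant_digits=3, quantize_mode="BitGroom")
    var[:] = np.random.rand(100).astype("f4")
# ncdump then shows: t:_QuantizeBitGroomNumberOfSignificantDigits = 3 ;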

Associated pull request

#519

Detailed Proposal

See #519.

Technical Proposal Options Discussed at 2022 CF Workshop

There was a consensus among all who participated in the relevant discussions at the 2022 CF workshop that CF should standardize the metadata that describes the properties of lossy compression algorithms that have been applied to data variables. This presentation gave background information on the topic, and notes from the ensuing breakout session and hackathon are available here. This issue summarizes the points of agreement reached and presents the outcome of these discussions for feedback from the wider CF community.

The hackathon produced five candidate schemes (below) for encoding lossy compression properties. We encourage interested researchers to indicate their opinions on any schemes they think are especially promising or poisonous for CF to (modify and) adopt. Suggestions for other approaches are also welcome.

Six properties of lossy compression were agreed to be good metadata candidates based on their utility to help data users understand precisely how the data were modified, and the statistical differences expected and/or achieved between the raw and the lossily compressed data. The properties are

  1. Lossy algorithm family (e.g., Quantize or unknown)
  2. Algorithm name (BitGroom, BitRound, Granular BitRound, BitShave)
  3. Implementation (potentially including library name, client name, author name, software version)
  4. Algorithm input parameters (e.g., NSD, NSB)
  5. A priori metrics (e.g., maximum relative or absolute error can be predicted/guaranteed for some algorithms)
  6. A posteriori metrics (e.g., the DSSIM structural similarity image metric; these can be computed for any algorithm by comparing the raw and compressed data)

The acceptable values for some properties would best be set by controlled vocabularies (CVs). Keeping with CF precedents, the CV values could be case-insensitive with white space replaced by underscores. Candidate CVs for the lossy algorithms introduced in netCDF 4.9.X follow. These CVs would be expanded to support other families, algorithms, parameters, and metrics:

  1. family: quantize or unknown
  2. algorithm: depends on the family above. Allowed values for quantize currently include bitgroom, bitround, and granular_bitround.
  3. implementation: a CV might not fit the implementation properties well?
  4. parameters: depends on algorithm. The CV for bitround is NSB or number_of_significant_bits or keep_bits. The CV for bitgroom and granular_bitround is NSD or number_of_significant_digits.
  5. prior_metrics: depends on algorithm. The CV for bitround is maximum_relative_error (computed as formula TBD). The CV for bitgroom and granular_bitround could be maximum_absolute_error (computed as formula TBD).
  6. post_metrics: independent of algorithm. The CV could include DSSIM, data_structural_similarity_image_metric, mean_absolute_error, snr, signal_to_noise_ratio, standard_deviation_of_difference, rmsd, root_mean_square_difference.

The same lossy compression algorithm is often applied with either minor or no changes to the input parameters across the set of target data variables. Hence the optimal convention for recording the lossy compression needs to allow for per-variable differences in lossy compression properties while ensuring that compression metadata does not overwhelm or reduce the visibility of other metadata such as standard_name, units, _FillValue, etc.

The five current candidate metadata schemes are:

# Schemes #1, #2, and #3 include a scalar container variable to store common algorithm properties shared amongst multiple variables. The container variable is named in the lossy_compression attribute of the data variables.
char compression_info ; // scalar container of arbitrary data type, no data
 // NB: the container's family attribute includes "lossy_compression" in its name to clarify its purpose
    compression_info:lossy_compression_family = "quantize" ;
    compression_info:algorithm = "bitgroom" ; 
    compression_info:implementation = "library: netcdf version: 56.0 
                                       processor: fred client???" ;

# 1. Data variable references container variable and has motley extra attributes for per-variable properties
float data_1(x, y) ;
    data_1:lossy_compression = "compression_info" ; // shared across variables
    data_1:nsd = 4 ; // per-variable attributes...
    data_1:key1 = 34.5 ;
    data_1:key2 = 0b;

# 2. Data variable references container variable and has explicitly named "lossy_compression_XXX" attributes for per-variable properties
float data_2(x, y) ;
    data_2:lossy_compression = "compression_info" ;
    data_2:lossy_compression_nsd = 4 ;       
    data_2:lossy_compression_key1 = 34.5 ;   
    data_2:lossy_compression_key2 = 0b ;

# 3. Data variable has single string-valued attribute that references container variable and key-value pairs of per-variable properties
float data_3(x, y) ;
    data_3:lossy_compression = "compression: compression_info
                                   nsd: 4
                                   key1: 34.5
                                   key2: 0b" ;

# 4. Data variable has single string-valued attribute comprising multiple key-value pairs including a parameter list for HDF5-style filter invocation
float data_4(x, y) ;
    data_4:lossy_compression = "family: quantize
                                algorithm: bitgroom                           
                                parameters: (nsd: 4 key1: 34.5 key2: 0) // parameter list for HDF-style filter invocation
                                implementation: (a: b c:d)" ;

# 5. Data variable has single string-valued attribute comprising multiple key-value pairs with all parameters unrolled into single key-value pairs
float data_5(x, y) ;
    data_5:lossy_compression = "family: quantize
                                algorithm: bitgroom
                                nsd: 4
                                key1: 34.5
                                key2: 0
                                implementation: <not sure, here>" ;
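
For comparison with the CDL above, a netCDF4-python sketch of what scheme #2 would ask a producer to write; the attribute names follow the scheme, while the data values and implementation string are placeholders:

import numpy as np
from netCDF4 import Dataset

with Dataset("scheme2.nc", "w") as nc:  # hypothetical file name
    nc.createDimension("x", 4)
    nc.createDimension("y", 2)

    # Scalar container variable holding the shared algorithm properties
    info = nc.createVariable("compression_info", "S1")
    info.lossy_compression_family = "quantize"
    info.algorithm = "bitgroom"
    info.implementation = "libnetcdf version 4.9.2"

    # Data variable naming the container, plus per-variable parameters
    data = nc.createVariable("data_2", "f4", ("x", "y"))
    data.lossy_compression = "compression_info"
    data.lossy_compression_nsd = 4
    data[:] = np.zeros((4, 2), "f4")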
JonathanGregory commented 1 year ago

Dear Charlie @czender

Thanks to you and others at the hackathon for considering this.

I agree with you that it is a good idea to split up the description of the compression into a number of pieces rather than gluing them together in something like _QuantizeBitGroomNumberOfSignificantDigits. Your suggestions for the keywords look CF-friendly. I generally favour spelling things out rather than using abbreviations, unless it's unbearably cumbersome or the abbreviation is widely known.

I may be misunderstanding your CDL. I can only see two metadata schemes sketched out: a container variable with attributes, or a string-valued attribute with key-value pairs. Either of those is CF-compatible. Of those two schemes, I think the container variable is better, because (a) it can be shared by many data variables, thus reducing redundancy, (b) it makes the data variables themselves more readable, (c) since some of the values are numeric, it feels nicer not to have to extract them from a string.

What would it mean if the family is unknown? Is this there because you foresee there might be some kinds of lossy compression which don't work by quantisation? Do you have some in mind? If not, then maybe we don't need this level yet. It could be introduced when there is a need for it.

Would the vocabulary for these schemes be defined by an appendix of the convention, or by other documents (like the standard name table), do you think?

Best wishes

Jonathan

czender commented 1 year ago

Apologies everyone for my long absence from updating this issue. Thanks @JonathanGregory for your comments last October. I updated the issue in December 2022, and then @davidhassell provided more feedback in March 2023 that included comments on issues Jonathan had raised earlier (above). I responded to David's feedback in May 2023 and I wanted to retain the content of that offline exchange so that others could follow our discussion. With his permission, below I quote David's comments from March (marked with >), each followed by my response.

My responses to your message of 20230316 are interleaved:

> I think that option #2 is preferable for the same reasons a), b), and c) that Jonathan describes.

I favor #2 as well.

> Also, when the container is shared between 2+ variables in one dataset, it is likely (as opposed to unlikely) that the parameters will differ (such as NSD, NSB).

I expect many users will apply the same (overly conservative set of) lossy compression parameters to save themselves the time it would take to optimize the parameters on a per-variable basis. Still, it is necessary to allow the flexibility to have per-variable parameters.

> I am not so concerned about cluttering up the variable with extra attributes, as the parameters are of key scientific importance. As they are logically properties (in the CF data model sense), an application would present them as such in any case, however they were stored.

Agreed.

> If lossy compression is known to have been applied, by the presence of some underscore attribute, then providing a lossy compression container should be mandatory.

OK by me for CF-compliance. Writing the container and the per-variable attributes is likely to be the main chore for developers. The lossy algorithms themselves are often/usually library functions that already exist and are callable by software that knows nothing about CF or netCDF.

> The relevant parameters should be mandatory when a lossy compression container is provided.

Agreed. Any variable referencing the container variable must have the required per-variable attributes to be CF compliant.

> The prior and post metrics should be optional. If present then they may inform the user whether they want to use the data, and if so how they use it. The data is still fully defined if they are missing.

Agreed. Good point. This reduces the number of required parameters from a minimum of six to a minimum of four, three of which would be in the container. Hence a minimum of two attributes per variable: one to point to the container, and one to provide a per-variable parameter like NSD.

> The six properties of lossy compression ...
>
>   1. Lossy algorithm family (e.g., Quantize or unknown)
>   2. Algorithm name (BitGroom, BitRound, Granular BitRound, BitShave)
>   3. Implementation (potentially including library name, client name, author name, software version)
>   4. Algorithm input parameters (e.g., NSD, NSB)
>   5. A priori metrics (e.g., maximum relative or absolute error can be predicted/guaranteed for some algorithms)
>   6. A posteriori metrics: DSSIM Structural Similarity Image Metric (these can be computed for any algorithm by comparing the raw and compressed data)
>
> ... should be described in a new appendix, in a similar way to how grid mappings are described. Only the algorithms detailed in this appendix are allowed.

Agreed

> 1. and 2. would have a CV of allowed string values. What about 3. - CV or free text?

I can't see any practical way around allowing free text for the implementation attribute.

Could the name "lossy compression" not be the best one? What we are describing is not really compression, right? - lossless compression is applied later.

Names are important and I am certainly open to alternative names. You are correct that quantization algorithms only (lossily) pre-condition data for lossless compression, they do not compress anything themselves. That said, the only purpose for quantization that I am aware of is subsequent compression. And it is intended (though not required) that this proposed CF convention be extensible to non-quantization lossy codecs (e.g., Sz, Zfp). Hopefully whatever terminology is adopted can meet these dual roles.

And let me also respond directly to Jonathan's question which has not yet been directly addressed:

> What would it mean if the family is unknown? Is this there because you foresee there might be some kinds of lossy compression which don't work by quantisation?

Yes

> Do you have some in mind? If not, then maybe we don't need this level yet. It could be introduced when there is a need for it.

The proposal is currently only fully fleshed out for quantization. Other potential lossy compression algorithms include Sz and Zfp. I know little about those algorithms other than their homepages. Whether they might comprise their own "families", or both be members of a single family, is unclear to me. While Sz and Zfp seem like two likely candidates for codec species to implement support for under this proposed convention, I'm still unsure what the next logical "family" of lossy algorithms is. In that sense I agree with Jonathan that other values of family are currently unknown, and thus the family parameter could be optional until there is a well-defined implementation for a family of non-quantization algorithms. Feedback welcome.

czender commented 6 months ago

Hi @davidhassell and @JonathanGregory,

I am back in Santander today and all next week, courtesy of @cofinoa :) My goal next week is to draft and submit the CF PR for this issue. It would be helpful to have some guidance on a few high-level questions beforehand, so I start off on the right foot with the PR:

  1. Shall I leave the Technical Proposal Summary in this issue as is, extend it to explain that after due consideration in the comments below we settled on method number 2 above, or instead revise and condense it to include only the necessary details for method number 2?

  2. For the PR, I was thinking of adding a Section 8.4 "Lossy Compression by Quantization". The only methods I plan to discuss are quantization-based, even though the proposed encoding accommodates more general algorithms. And an Appendix L on "Metadata to Encode Lossy Compression Properties", in which the controlled vocabularies and examples for the 3 supported quantization methods would be detailed. Does that sound about right? Or...? Of course, the titles and sections could be changed during review.

There is also one recent update to report. NCO has supported the method number 2 draft proposal since version 5.2.0. So if you use NCO to perform lossy compression, the output dataset will contain the container variable and the minimal amount of lossy metadata in the proposed format. My upcoming EGU24 poster shows some of this:

EGU24-13651, "Why and How to Increase Dataset Compression in RDIs and MIPs like CMIP7", by Charles Zender. The display is Thursday, 18 April 2024, 08:30-12:30. Please visit between 10:30-12:30 if you'll be at EGU.

Here is an example:

zender@spectral:~$ ncks -O -7 --cmp='btr|shf|zst' ~/nco/data/in.nc ~/foo.nc
zender@spectral:~$ ncks -m --hdn -C -v prs_sfc,compression_info ~/foo.nc
netcdf foo {
  dimensions:
    lat = 2 ;
    lon = 4 ;
    time = UNLIMITED ; // (10 currently)

  variables:
    char compression_info ;
      compression_info:family = "quantize" ;
      compression_info:algorithm = "BitRound" ;
      compression_info:implementation = "libnetcdf version 4.9.3-development" ;

    float prs_sfc(time,lat,lon) ;
      prs_sfc:_QuantizeBitRoundNumberOfSignificantBits = 9 ;
      prs_sfc:lossy_compression = "compression_info" ;
      prs_sfc:lossy_compression_nsb = 9 ;
      prs_sfc:long_name = "Surface pressure" ;
      prs_sfc:units = "pascal" ;
      prs_sfc:_Storage = "chunked" ;
      prs_sfc:_ChunkSizes = 1145, 2, 4 ;
      prs_sfc:_Filter = "32015,3" ;
      prs_sfc:_Shuffle = "true" ;
      prs_sfc:_Endianness = "little" ;
} // group /
JonathanGregory commented 6 months ago

Dear Charlie @czender

Thanks for the update and for being willing to work on this.

There's not a definite rule about what to do with the first posting in the issue when it gets outdated. My own view is that it's good to keep the proposal as originally stated, because it is needed in order to understand how the discussion began. If I were you, I would make a new posting to this issue, perhaps with some of the headings and text repeated from the first posting if that suits your purposes, and edit the first posting itself to give a link to the new version e.g. under the latest moderator summary section. But I think any way of proceeding which is clear and suitable would be acceptable, to be honest!

I think your plan for a new subsection and appendix are fine. Whether to have an appendix depends on how long the detail is. If it's not huge, it could be in the subsection e.g. all the detail about cell_methods, which is quite complicated, is in Sect 7.3, not an appendix. Appendices are useful for large detailed sections, like Appendices H I and J, or for lists which might get added to, like Appendix A (and most others).

That's great news that you've already implemented the draft proposal!

I hope you enjoy your time in Spain.

Cheers

Jonathan

davidhassell commented 5 months ago

Hi Charlie,

This is good news! I agree with Jonathan's comments.

In particular (largely just restating items Jonathan said :))

All the best, David

davidhassell commented 5 months ago

Hi @czender - is the PR ready for review?

czender commented 5 months ago

@davidhassell not yet. It still needs formatting, examples, and clean-up. Maybe next week. Don't worry, I'll request your review when ready.

czender commented 5 months ago

Dear All,

I submitted #519 to implement the core of this proposal. As noted above, it does not include anything about lossy compression error metrics. It could, but I was running out of steam. I would appreciate any feedback on the core of #519 before deciding whether to extend it to include error metrics.

One issue with error metrics is that there is a large menagerie of them. I think someone with a better background in statistics could do a better job than me, at least with the "Post metrics". There is one (and only one, AFAICT) "Pre metric" that is easy to implement and to understand. Specifically, it is the maximum relative error incurred by BitRound. This error is simply 2^(-NSB-1) for all values. Hence it is easy to understand, and to compute either before or after quantization. I have added this single error metric to the reference implementation in NCO. The output looks like this:

zender@spectral:~$ ncks -O -7 -C -v ps,ts --qnt_alg=btr --qnt default=9 --qnt ps=13 --cmp='shf|zst' ~/nco/data/in.nc ~/foo2.nc
zender@spectral:~$ ncks -m -C -v ps,ts,compression_info ~/foo2.nc
netcdf foo2 {
  dimensions:
    lat = 2 ;
    lon = 4 ;
    time = UNLIMITED ; // (10 currently)

  variables:
    char compression_info ;
      compression_info:family = "quantize" ;
      compression_info:algorithm = "bitround" ;
      compression_info:implementation = "NCO version 5.2.5-alpha02" ;

    float ps(time,lat,lon) ;
      ps:standard_name = "surface_air_pressure" ;
      ps:units = "Pa" ;
      ps:lossy_compression = "compression_info" ;
      ps:lossy_compression_nsb = 13 ;
      ps:lossy_compression_maximum_relative_error = 6.103516e-05f ;

    float ts(time) ;
      ts:standard_name = "surface_temperature" ;
      ts:units = "K" ;
      ts:lossy_compression = "compression_info" ;
      ts:lossy_compression_nsb = 9 ;
      ts:lossy_compression_maximum_relative_error = 0.0009765625f ;
} // group /
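
As a quick sanity check, the maximum_relative_error values in this output match 2^(-NSB-1) exactly:

# ps: NSB=13 -> 2^-14 = 6.103515625e-05; ts: NSB=9 -> 2^-10 = 0.0009765625
for nsb in (13, 9):
    print(nsb, 2.0 ** -(nsb + 1))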

Here at EGU24 today, a few people mentioned they would be more inclined to use lossy compression if the error metrics were included with the data. Any of the metrics mentioned above are possible to add if the software has access to both the quantized and raw arrays. So you can imagine, e.g., correlations, SNR, etc. being computed and added by software that wants to do so (and has access to raw and quantized data). Thoughts on whether to add the maximum_relative_error, and whether to add more complex metrics, to the PR are welcome. Personally I am only inclined to add the maximum_relative_error myself, and only to add it to the convention once the initial PR has been finalized (because I'm learning, again, how much work it is to draft these conventions!).

sethmcg commented 5 months ago

Hi Charlie,

I think it makes sense to wait to add error metrics until after the initial PR is finished. It sounds like a topic that warrants a separate discussion of its own. (And it also sounds like that will help keep this issue from stalling out, which is good.)

davidhassell commented 5 months ago

Hi Charlie, Thanks for getting this together - I'm about to dive into the PR (today/tomorrow). I'm also good with tackling the metrics later. David

davidhassell commented 5 months ago

Hi Charlie,

I'm currently reviewing the text, but before I carry on, I thought it would be good to mention the term lossy compression variable. I'm not sure that this is a good name. I think that it's fine to describe quantization in chapter 8, because its primary use case is to compress, but the container variable described is not generic to lossy compression (e.g., lossy compression by coordinate subsampling also has a container variable, called the interpolation variable). Rather, it is specific to quantization, so wouldn't quantization variable be better? I would follow this through to other uses of "lossy compression", e.g., I might call the per-variable attribute quantization.

What do you think?

Cheers, David

czender commented 5 months ago

Hi David,

Thanks for editing this. Let me explain the intent. And then you and @JonathanGregory can decide:

A bit further on in the "family" section the PR says: "Other potential families of lossy algorithms include rounding, packing, zfp, and fpzip." And that's not a comprehensive list. Other lossy algorithm families could be, e.g., discrete cosine transform, layer packing, logarithmic packing...AFAICT, the lossy_compression variable as outlined in the PR could fit all of these families, and would always signal to the user that lossy compression of some sort had been employed, with the family and algorithm attributes bringing more specificity.

If we change the name to quantization variable, then we may as well eliminate the family attribute completely. And, if CF is extended later to include other lossy compression algorithms, then we could expect to have a zfp variable, an fpzip variable, a logarithmic packing variable. In other words, turning the generic lossy compression variable into a more specific name like quantization variable sets a precedent that would naturally be followed by subsequent algorithms placed in CF. If that sounds desirable to folks, fine, I'll change the PR accordingly.

Note that we discussed what to name the container variable at the 2022 workshop. I think there were folks on both sides of the question you are (re-?) raising. The current PR implements what we hammered out at that workshop. Of course, it's better to change things now if in hindsight you/we think the workshop conclusions could be improved.

sethmcg commented 5 months ago

As a data producer / consumer, I strongly favor having a single lossy_compression variable/attribute that could be relied on as an indicator of its presence.

It's much easier to check for that and then decide how to deal with it than to check for a whole bunch of different indicators, and the failure state is much better if the compression uses an algorithm that you didn't know existed. It's also more communicative to folks who may have never encountered the issue before; if I'm scanning through the headers of a file, the term lossy_compression is going to catch my attention in a way that quantization or zfp would not.
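
This check is straightforward in practice. For instance, a generic consumer needs only a few lines of netCDF4-python to detect lossy compression without knowing any algorithm in advance (a sketch; attribute names follow the draft proposal and error handling is minimal):

from netCDF4 import Dataset

def report_lossy_compression(path):
    # Sketch: list variables carrying the proposed lossy_compression
    # attribute and print the referenced container's properties
    with Dataset(path) as nc:
        for name, var in nc.variables.items():
            container = getattr(var, "lossy_compression", None)
            if container is None:
                continue
            info = nc.variables[container]
            print(name, getattr(info, "family", "?"), getattr(info, "algorithm", "?"))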

davidhassell commented 5 months ago

Hi Charlie and Seth,

You both put the reasons for using lossy_compression well, and I agree with them. Thanks for explaining things.

However I'm concerned that we are being misleading in the case that a creator wants to remove false precision for scientific reasons, but doesn't want to use native compression (or indeed can't - netcdf-3). In that case we haven't done any lossy compression, but have done a lossy filter (is that the right word?).

I retract my suggested renaming, but (thinking out loud, here!) wonder if we need two containers: "lossy_compression", and another one for algorithms that have changed the data without compressing it ("filter", perhaps?). The latter would not be needed at this time, as there is no use case yet.

Thanks, David

sethmcg commented 5 months ago

Hi David,

That's an interesting point, and it will deserve thoughtful consideration if and when we get a use case for it. There are a lot of potential uses for filtering, and I think if we're going to represent them in CF, we want to set ourselves up to use an umbrella approach that can handle all of them in a unified way, rather than dealing with each of them completely independently of the others. So we should try to remember this (potential) use case if/when that comes up.

(I think CF is starting to bump up against the issue of explosive combinatorics making namespaces unmanageable in a number of areas, and I'm keen to forestall more of these problems in the future.)

But since we aren't currently (AFAIK) considering any use cases for filtering, we should leave that aside for now.

Cheers, --Seth

czender commented 5 months ago

Hi All,

@davidhassell your suggestion is absolutely correct. And thank you @sethmcg for your thoughts. Here's my response to your suggestion: Quantization per se does not compress, and is perhaps better termed a pre-conditioner (as I describe it in the PR) or a filter (in HDF5-speak) whose raison d'être is to be followed by a lossless compression scheme. Quantization can therefore be applied to netCDF3 files, as you mention. A few pertinent technical qualifications before responding: I think the netCDF library implementation will not quantize netCDF3-format datasets (it either ignores the request or fails with a suitable error, I can't remember which). The NCO implementation does support quantization of netCDF3 files (because, why not?).

@JonathanGregory raised a similar point, above, which I copy along with my response here:

Could the name "lossy compression" not be the best one? What we are describing is not really compression, right? - lossless compression is applied later.

> Names are important and I am certainly open to alternative names. You are correct that quantization algorithms only (lossily) pre-condition data for lossless compression, they do not compress anything themselves. That said, the only purpose for quantization that I am aware of is subsequent compression. And it is intended (though not required) that this proposed CF convention be extensible to non-quantization lossy codecs (e.g., Sz, Zfp). Hopefully whatever terminology is adopted can meet these dual roles.

It might be better to segregate, as you suggest, these dual roles into different container variables. Let's consider the implications of a distinct container variable to describe transformations that do no compression. One example is the Shuffle filter. Byte-Shuffle and bit-Shuffle rearrange bytes and bits, respectively, to enhance subsequent compression, similar to quantization. However, shuffling leaves the values in a non-IEEE format that cannot be stored in a netCDF3 file. A reverse transformation (unshuffle?) is required before using the data, so Shuffle must be (and is) implemented as an HDF5 filter that reverses the transformation transparently for the user. Quantization is IEEE-compatible and has no such requirement, so it is fine for netCDF3 files. This raises the question: do we really want a container variable for transformations that do no compression? Which transformations are you thinking of? Sure, quantization fits the bill. But is that enough? Are there other transformation algorithms you have in mind that would justify creating a generic container variable to hold them?

Deciding on the right level of generality is indeed hard, as there are bound to be trade-offs. Perhaps quantization is just in a category by itself, which would support your recent suggestion of naming the container "quantization" instead of "lossy compression". I wrote the current PR based on the discussion at the 2022 workshop. Let me reiterate that I'll go along with (and rewrite the PR for) whatever you and @JonathanGregory and any other interested people like @sethmcg and @cofinoa decide is best for CF.

czender commented 4 months ago

We seem to be stalled on this issue. Would everyone interested in seeing this proposal adopted, in some form, please respond with the changes, if any, you would like to see in the current PR? Helpful responses include, though are not limited to:

  1. No Major Changes to PR, though possibly minor changes (please include those minor suggestions)
  2. Replace lossy_compression container with quantization-specific container as suggested (and retracted) by @davidhassell above
  3. Replace lossy_compression container with something like transformation or filter container that could be shared by other algorithms that alter though do not compress data as suggested by @davidhassell
  4. Other (please be as specific as possible)

Mahalo, Charlie

sethmcg commented 4 months ago

Hi Charlie, 1: The PR looks good to me as-is.

JonathanGregory commented 3 months ago

Dear Charlie @czender

I'm sorry that I've kept you waiting. Below are some minor review comments. None of these imply any change to the proposed convention; they're just about the text. I hope they make sense and you can work out what each applies to.

Regarding the question of what the container attribute should be called, I tend to think that "lossy compression" sounds too general, because Sect 8.3 is about "lossy compression" as well, and the basic "packing" of Sect 8.1 is a kind of lossy compression. Furthermore, as you write in the preamble of the new section, this convention is actually not about compression at all, but about modifying the data to make it suitable for compression. The title of the new section being "Lossy Compression by Quantization" makes me prefer "quantization" as a name for the container, and to dispense with family, as you suggest in your earlier comment.

You also sensibly remark that there may be other algorithms that come along, which could also be described as lossy compression and might want similar container variables. That's true, but also they might not! I don't think we are tying our hands. The emergence of new use-case in future, which needs a similar treatment but can't be described as quantization, might suggest a suitable name for a container that could handle both quantization and the new method. We could then, for example, make quantization into an alias.

The new attributes should be described in Appendix A, within which you will also need to invent a new "Type" value for the container variable.

Many thanks for developing this addition. I support its going ahead. I hope it won't take long.

Best wishes

Jonathan

Convention

"codec". Could you replace this word with less technical English?

"define a metadata framework to provide quantization properties". Maybe replace "provide" with "describe" or "record"?

"irrecoverable". It's possible it still exists, but that is beside the point. Could we say "differ from the original unquantized data, which are not stored in the dataset and may no longer exist"?

I suggest omitting the paragraph "Software can use ...". This isn't part of the purpose or description of the convention; it's an explanation of why you proposed this form of it. It's useful in making the proposal, but we don't usually preserve such reasoning in the conventions document.

"lossy compression" should be in ordinary font, although it's a CF technical term. We use verbatim for the names and contents of attributes and variables.

"If and when algorithms ...". Actually we would probably not do that, since we have a generous interpretation of backward compatibility, in order to minimise misinterpretations, in principle 9 of Sect 1.2: "Because many datasets remain in use for a long time after production, it is desirable that metadata written according to previous versions of the convention should also be compliant with and have the same interpretation under later versions." Following this principle, we would never make family mandatory, but we could make quantize its default.

"The final attribute ...". As well as "free-form", you could say "unstandardised".

"Each variable that ...". I think this should be "each data variable". It might also be useful, for the sake of clarity, to state in the preamble that this convention is only for data variables, if that's the case.

"... must include at least two attributes to describe the lossy compression. First, all such data variables must have a lossy_compression attribute containing the name of the lossy compression variable describing the algorithm." That is to explain what "associated" means explicitly. Also I suggest omitting the sentence "This attribute is attached ..." because, again, that is an explanation of the design. I would run into the next para with "Second, all such variables must record ...", and end the para after "preserved by the algorithm". That makes this para apply generally. It leaves two more paras with the requirements specific to the algorithms.

"This section briefly describes ... for BitRound". I think "thus" should be omitted before "bias-free". This word implies that BitRound is free of bias because the IEEE adopted it!

Conformance

I would say, "The value of algorithm must be one of the values permitted by thie section." I think that's sound, and it is less likely to have to be updated (one fewer thing to go wrong).

The value of implementation. I would omit this, because it can't be verified by a checker program (unless it's truly intelligent). If you think it's essential to include, I would say, "must be a string that concisely".

The value of lossy_compression must be the name of the lossy compression container variable which exists in the file.

"The value of lossy_compression_nsb must be in the range ..." and similarly for lossy_compression_nsd, because you've already required them to be integer attributes.

czender commented 3 months ago

Thank you for these comments @JonathanGregory. I'll try to incorporate them in time for the 1.12 release.

czender commented 3 months ago

@davidhassell and @JonathanGregory and @sethmcg I have addressed and merged the comments from Jonathan's recent review in #519. I hope I interpreted all the suggestions as intended. Please take another look and LMK what further changes are desired. (FYI I have also updated the reference implementation in NCO to adhere to this updated version).

davidhassell commented 3 months ago

Hi Charlie, I'm away for a week from tomorrow, but look forward to taking a good look when I get back. Cheers, David

JonathanGregory commented 3 months ago

Dear Charlie @czender

Many thanks for making changes following my comments. I hope they made sense to you.

You have some text explaining why it would not be a good idea to quantize grid metrics. That's new text, isn't it? I think this is useful background to include, although it's not strictly part of the convention. However, it would be helpful to clarify that this text is intended to explain why quantization is not allowed for coordinate variables, bounds variables, etc. (as indicated by Appendix A, where the quantization attribute is allowed only for data variables). As it stands, this text could be read to imply that it is allowed, although not advisable. Have I interpreted your intention correctly?

If so, I would suggest moving and modifying (in bold) a few sentences in the preamble, like this:

The CF conventions of this section define a metadata framework to record quantization properties alongside quantized floating-point data variables. The goals are twofold. First, to inform interested users how, and to what degree, the quantized data differ from the original unquantized data, which are not stored in the dataset and may no longer exist. Second, to provide the necessary provenance metadata for users to reproduce the data transformations on the same or other raw data. These conventions also allow users to better understand the precision that data producers expect from source models or measurements. Use of these conventions ensures that all quantized variables are clearly marked as such, and thus alerts users to cases where these guidelines have not been followed.

These conventions must not be used with data variables of integer type, or any other kind of CF variable. This is because fields that describe idealized or reference coordinate grids, or grid transformations, are often known to the highest precision possible. These fields can include spatial and temporal coordinate variables (e.g., latitude, longitude, time) and properties derived from these coordinates (e.g., area, volume). Degrading the precision of such grid properties may have unintended side effects on the accuracy of subsequent operations such as regridding, interpolation, and conservation checks, which should generally be performed with the highest precision possible.

Finally, you have the sentence, "In general, we recommend against quantizing any coordinate variable, bounds variable, cell_measures variables, and any variables employed in formula_terms." If my interpretation is correct, the convention isn't allowed for coordinate, bounds or cell measures, so we don't need to recommend against it. Therefore we can omit the sentence, except that some variables named by formula terms are data variables, for example the surface pressure field required by an atmosphere sigma coordinate. Do you intend to recommend against quantizing such fields? If so, that would require another recommendation in the conformance document.

I hadn't previously noted the point that it's only for floating-point data (which makes sense, of course). I think that ought to be a requirement in the conformance document.

Best wishes and thanks for your patience

Jonathan

czender commented 3 months ago

@JonathanGregory Thanks for your additional suggestions. Responses interleaved...

> You have some text explaining why it would not be a good idea to quantize grid metrics. That's new text, isn't it?

No, that text was in the original PR.

> I think this is useful background to include, although it's not strictly part of the convention. However, it would be helpful to clarify that this text is intended to explain why quantization is not allowed for coordinate variables, bounds variables, etc. (as indicated by Appendix A, where the quantization attribute is allowed only for data variables). As it stands, this text could be read to imply that it is allowed, although not advisable. Have I interpreted your intention correctly?

Yes, you have.

> If so, I would suggest moving and modifying (in bold) a few sentences in the preamble, like this ...:

Agreed. Done.

> Finally, you have the sentence, "In general, we recommend against quantizing any coordinate variable, bounds variable, cell_measures variables, and any variables employed in formula_terms." If my interpretation is correct, the convention isn't allowed for coordinate, bounds or cell measures, so we don't need to recommend against it. Therefore we can omit the sentence

Agreed. Done.

> , except that some variables named by formula terms are data variables, for example the surface pressure field required by an atmosphere sigma coordinate. Do you intend to recommend against quantizing such fields? If so, that would require another recommendation in the conformance document.

Yes, I think that CF should recommend against quantizing all data variables that are in formula_terms since, as I understand it, formula_terms is only used to define grid variables like vertical coordinates. I just now added this to the conformance document:

"Data variables that appear in formula_terms attributes should not be quantized"

> I hadn't previously noted the point that it's only for floating-point data (which makes sense, of course). I think that ought to be a requirement in the conformance document.

I just added this to the conformance document:

"Only floating-point type data variables can be quantized."

The PR now contains all of the above changes. Let's keep this momentum going and resolve all the issues. More feedback welcome.

Charlie

JonathanGregory commented 3 months ago

Dear Charlie

Thanks very much for explanations and improvements. I think this is essentially fine! Hoping that you have patience, I would like to make the following minor suggestions:

Seth was already happy and I hope that he still is. I expect we will hear soon from David.

Best wishes

Jonathan

czender commented 3 months ago

Aloha @JonathanGregory, Thank you for the additional suggestions. I agree they improve consistency, clarity, and completeness. I have implemented them verbatim. Mahalo, Charlie

czender commented 2 months ago

Dear All, I have now heard from and addressed comments from a number of helpful reviewers. The associated PR appears likely to be merged without any further substantive changes. This is a last call for feedback before that occurs.

JonathanGregory commented 2 months ago

> I have now heard from and addressed comments from a number of helpful reviewers. The associated PR appears likely to be merged without any further substantive changes. This is a last call for feedback before that occurs.

Thanks, Charlie. In that case, since the proposal has already received sufficient support to be adopted, according to the rules, we can "start the clock", by saying it will be accepted (and merged) in three weeks from today (on 9th August), if no-one raises any concerns before then.

JonathanGregory commented 1 month ago

Three weeks ago I said that this proposal would be accepted if no further comments were made. Two weeks ago @davidhassell and Charlie @czender had a discussion in the pull request #519 about two matters of substance. For the sake of clarity I give here a digest of the conversation. It would be good if an agreement can swiftly be reached so that the proposal can be accepted in time for the next release.


David proposed a requirement that "Variables that are referenced by any other variable must not have a quantization attribute." He suggested that Charlie's sentence: "These conventions must not be used with data variables of integer type, or any other kind of CF variable." should be replaced with:

> These conventions must only be used with variables of floating-point type that are not part of a domain defined by a domain variable or data variable, i.e. are not acting as coordinates (of any type), cell measures, nor are terms in the definition of a parametric vertical coordinate.

To which Charlie replied: Your suggested replacement sentence might be OK. However, it's hard for me to parse and I do not understand the logic of excluding variables that belong to a domain from being quantized. (NB: I just learned about domain variables today so I do not understand them that well). I agree that coordinate (of any type) variables that define a domain should never be quantized. Is that what you mean by variables that are "part of a domain"? To me "variables that are part of a domain" naturally includes data variables that occupy the domain, and I do not see the logic in proscribing quantization of these data variables. Please explain your reasoning or suggest a rewording that allows data variables in a domain to be quantized. Or are data variables and domain variables mutually exclusive so "data variables in a domain" is oxymoronic?

David and Charlie agreed that ancillary variables, which are data variables in their own right, should not be excluded from quantization, although they are referred to by the data variable. On the other hand, formula_terms can reference data variables as well, and these are involved in calculating coordinates. It's also possible that a variable referenced as cell_measures might be a data variable in its own right. If I understand correctly, David's concern is that no variable that might be involved in computing coordinates or metrics of the domain should be quantized, and I suspect that Charlie agrees.

If my understanding is correct, I suggest that the simplest and clearest thing may be to spell out which kinds of variable can or should not be quantized. This could be done by replacing

> These conventions must not be used with data variables of integer type, or any other kind of CF variable. This is because variables that describe metadata are often known to the highest precision possible, and degrading ...

with

> These conventions must not be used with data variables of integer type. They must not be used with any variable, even if it is also a data variable, that serves as a coordinate variable, or is named by a coordinates, formula_terms or cell_measures attribute of any other variable. This is because variables that provide metadata or are used in computation of domain metrics are often known to the highest precision possible, and degrading ...
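
If text along these lines is adopted, the corresponding conformance check can be sketched as follows: collect every variable name that serves the domain and require that none of them carries the quantization attribute. This sketch is illustrative only; it follows the attribute spellings in the quoted text and ignores complications such as bounds variables and groups:

from netCDF4 import Dataset

def variables_that_must_not_be_quantized(nc):
    # Names referenced by coordinates, formula_terms, or cell_measures
    # attributes, plus classic coordinate variables themselves
    excluded = set()
    for name, var in nc.variables.items():
        if name in nc.dimensions:  # classic coordinate variable
            excluded.add(name)
        for attr in ("coordinates", "formula_terms", "cell_measures"):
            value = getattr(var, attr, "")
            # formula_terms and cell_measures hold "key: name" pairs;
            # coordinates is a plain blank-separated list of names
            excluded.update(t for t in value.split() if not t.endswith(":"))
    return excluded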


David suggested deleting the conformance recommendation:

> The value of implementation should be specified as library version number, e.g., libnetcdf version 4.9.2 or as client version number, e.g., NCO version 5.2.6.

because Charlie's text states that the implementation attribute can also contain any information that is useful, beyond the library version number. But if any such information were included, a checker program would give a warning, because of the above recommendation.

Charlie replied: Ahh, I see. That's fair. My intent was to recommend that the implementation attribute begin with "library version number" or "client version number", and then be followed by any additional information. It seems like all conceivable implementations will have either an unambiguous client or library, together with a version. Is a recommendation to put those before any other disambiguating information not worth including? If you think checking that would be too hard or error-prone then I agree the recommendation should be removed from the conformance doc.

I suggest that the optional additional information could be put within ( ... ) after the software implementation. That's the syntax we use for cell_methods. Then the checker can tell whether something has been provided which isn't "additional", although I don't think it could check that it's the name of a valid library.


What do you think, David and Charlie? Please could you reply in this issue rather than in the pull request, to keep the record straight. Thanks. :smiley:

czender commented 1 month ago

@JonathanGregory Thank you for moving this along so it has a chance of being included in the next CF release.

I agree with your two suggestions, and will add notes to underscore some points you make in the summary of the discussion that @davidhassell and I have recorded in the PR. First I'll note that the discussion was in the PR, not on this issue page, because it started in response to specific wording suggestions in David's review. I am not familiar with the etiquette of CF discussions that are substantive, so I responded in the PR. Thank you for summarizing this discussion in the issue.

Regarding

> ... If I understand correctly, David's concern is that no variable that might be involved in computing coordinates or metrics of the domain should be quantized, and I suspect that Charlie agrees.

You suspect correctly. I share (with you and David) the concern that no coordinates or metrics be quantized. I thought that David's suggestion risked prohibiting quantization of any data variables that occupy a domain, which would be overly restrictive in my opinion.

Finally, regarding the Conformance recommendation for implementation, I am more inclined to retain it with your suggested format of ( ... ) than to remove it completely. I think there is substantial value in a string that can be unambiguously parsed into two required pieces of information and one optional piece, namely the name of the client or library, its version number/identifier, and an optional part to disambiguate it from potentially similar sources. But I also understand that there is a price to be paid in terms of the complexity of compliance checkers, and I would go along with removing this Conformance recommendation completely if others are in favor of that.

Charlie

JonathanGregory commented 1 month ago

Dear Charlie @czender

Thanks for following up. I'm happy with the idea that implementation should take the form "library version-number [( optional-information )]". Probably all the checker could reasonably do would be to ensure there were two blank-separated words before the optional information, but that doesn't sound onerous and I agree it could be useful.
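
The check described here is indeed simple to sketch, assuming the form "library version-number [( optional-information )]"; the example strings below are illustrative:

def implementation_is_well_formed(value):
    # Require at least two blank-separated words (a library or client
    # name plus a version) before any optional "( ... )" suffix
    head = value.split("(", 1)[0].strip()
    return len(head.split()) >= 2

assert implementation_is_well_formed("libnetcdf version 4.9.3-development")
assert implementation_is_well_formed("NCO version 5.2.6 (built by czender)")
assert not implementation_is_well_formed("unknown")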

Best wishes

Jonathan

davidhassell commented 1 month ago

Hello Charlie and Jonathan,

Thanks, Jonathan, for moderating here - it has brought me back to thinking about it. I'm very happy with the current suggestions, specifically:

> These conventions must not be used with data variables of integer type. They must not be used with any variable, even if it is also a data variable, that serves as a coordinate variable, or is named by a coordinates, formula_terms or cell_measures attribute of any other variable. This is because variables that provide metadata or are used in computation of domain metrics are often known to the highest precision possible, and degrading ...

I like this new, clear text.

> [The] implementation should take the form "library version-number [( optional-information )]".

I'm happy with this, and therefore that it should be a requirement in the conformance document.

David

czender commented 1 month ago

@JonathanGregory and @davidhassell and any others interested: The latest PR has been updated to include these suggestions. As far as I can tell, it includes all requested changes, reads pretty well, and the formatting looks good. Please point out any remaining issues at your leisure. Charlie

JonathanGregory commented 1 month ago

Dear Charlie @czender

I'm entirely happy with the new version. Thanks for your hard work. We will merge it three weeks from today (5th September) if @davidhassell agrees and there are no further concerns raised.

Best wishes

Jonathan

czender commented 4 weeks ago

@davidhassell I just merged two recent changes to master into my branch csz_qnt to resolve conflicts. Three weeks have elapsed without further concerns being raised, so #519 can now be merged into master.

JonathanGregory commented 4 weeks ago

Thank you, Charlie @czender and @davidhassell. I will merge it now.

davidhassell commented 4 weeks ago

Marvelous! Thanks to Charlie for persisting with this over many years, and also to Antonio for working on the final PR.

czender commented 3 weeks ago

Thanks for all your editorial help @JonathanGregory and @davidhassell. Pinging WIP members @durack1 @taylor13 @mauzey1 who know that quantization and Zstandard were recently enabled in CMOR (https://github.com/PCMDI/cmor/pull/751). WIP folks: If CMOR were to implement this CF convention on top of the quantization/Zstandard work already done, then CMOR could implement the CMIP7 quantization proposal that I presented and we discussed in January. Seems like an opportune time to further consider that proposal. This could save a lot of disk space in a transparent, CF-compliant way without sacrificing any scientifically meaningful data!