Hello @taylor13, @AndersMS,
It might be better to continue the conversation from #37 on the precision of interpolation calculations (the comment thread starting at https://github.com/cf-convention/discuss/issues/37#issuecomment-832142697) here in this issue, as this is now the main place for discussing the PR containing the details of this proposal, of which this precision question is one.
I hope that's alright, thanks, David
Hi @taylor13
Thank you very much for your comments. We did have a flaw or a weakness in the algorithm, which we have corrected following your comments.
To briefly explain: the method of the proposal stores coordinates at a set of tie points, from which the coordinates in the target domain may then be reconstituted by interpolation. The source of the problem was the computation of the squared distance between two such tie points. The distance will never be zero and could, for example, be on the order of a few kilometers. As the line between the two tie points forms a right triangle with two other lines of known length, the fastest way to compute the squared distance is to use Pythagoras's theorem. However, as the two other sides are both significantly longer than the one we wish to calculate, the result was very sensitive to rounding in 32-bit floating-point calculations and occasionally returned zero. We have now changed the algorithm to compute the squared distance as `x*x + y*y + z*z`, where `(x, y, z)` is the vector between the two tie points. This expression does not have the weakness explained above and has now been tested to work well.
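To illustrate the difference with a minimal numpy sketch (made-up illustrative numbers, not our actual code):

```python
import numpy as np

# Two tie points ~7000 km from the origin and 500 m apart, arranged so
# that the segment between them is one leg of a right triangle whose
# hypotenuse (|p2|) and other leg (|p1|) are known.
p1 = np.array([7000e3, 0.0,   0.0], dtype=np.float32)
p2 = np.array([7000e3, 500.0, 0.0], dtype=np.float32)

# Pythagoras: d^2 = |p2|^2 - |p1|^2. Both squared lengths are ~4.9e13,
# where a 32-bit float has a spacing of ~4e6, so the small difference
# is lost to rounding.
d2_pythagoras = np.dot(p2, p2) - np.dot(p1, p1)

# Direct form: d^2 = x*x + y*y + z*z of the vector between the tie
# points. No large intermediate values, hence no cancellation.
x, y, z = p2 - p1
d2_direct = x*x + y*y + z*z

print(d2_pythagoras)  # 0.0      -- catastrophic cancellation
print(d2_direct)      # 250000.0 -- exact
```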
In terms of how accurately the method reconstitutes the original coordinates, the change improved the performance of the internal calculations carried out in 32-bit floating-point, although the errors are still a couple of times larger than when using 64-bit floating-point calculations.
I would therefore support the proposal put forward by @davidhassell. The proposal avoids setting a general rule, which, as you point out, may not cover all cases. It permits setting a requirement when needed to reconstitute data with the accuracy intended by the data creator.
Once again, thank you very much for your comments – further comments from your side on the proposal would be highly welcome!
Cheers Anders
For convenience, here is the proposal for specifying the precision to be used for the interpolation calculations (slightly robustified):
If the `interpolation_precision` attribute has been set to a numerical value then the precision should match the precision of the given numerical value.

```
// Interpolation variable with NO 'interpolation_precision' attribute
// => the creator is saying "you can use whatever precision you like when uncompressing"
char interp ;
  interp:interpolation_name = "bi_linear" ;

// Interpolation variable with 'interpolation_precision' attribute
// => the creator is saying "you must use the precision I have specified when uncompressing"
char interp ;
  interp:interpolation_name = "bi_linear" ;
  interp:interpolation_precision = 0D ; // use double precision when uncompressing
```
Do you think that this might work, @taylor13?
Thanks, David
Thanks @AndersMS for the care taken to address my concern, and thanks @davidhassell for the proposed revision. A few minor comments:
1. Regarding the `interpolation_precision` attribute: some users might wish to carry out the interpolation calculations at a higher precision than the one specified. I would hate to think that the interpolation method would be degraded by doing so. I suggest, therefore, replacing "the precision should match" with "the precision should match or exceed" or something similar. Also, a comma should follow the introductory clause, "if the interpolation_precision attribute has been set to a numerical value", and the typo in "calculatins" should be corrected.
2. A name like `computational_precision` might describe the attribute better.
3. Rather than a numerical value, the `calculational_precision` attribute should be defined and set to one of the names defined by the IEEE 754 technical standard for floating point arithmetic (e.g., "decimal32", "decimal64", "decimal128"). If the `calculational_precision` attribute has been defined, all interpolation calculations should be executed at the specified precision (or higher). In the example, then, "0D" would be replaced by "decimal64".
Hi @taylor13,
1: I agree that higher precisions should be allowed. A modified description (which could do with some rewording, but the intent is clear for now, I hope): "If the `computational_precision` attribute has been set then the precision should match or exceed the precision specified by the `computational_precision` attribute. <some text about allowed values and their interpretation>"

2: `computational_precision` is indeed better. You mention "calculational_precision" in 3 - was that intentional? That term is also OK for me.
3: A controlled vocabulary is certainly clearer than my original proposal, both in terms of defining the concept and the encoding, and the IEEE standard does indeed provide what we need. I wonder if it might be good to define the (subset of) IEEE terms ourselves in a table (I'm reminded of https://cfconventions.org/Data/cf-conventions/cf-conventions-1.8/cf-conventions.html#table-supported-units) rather than relying on the contents of the external standard, to avoid the potential governance issues we always have when standards outside of CF's influence are brought in. Would the "binary" terms be valid, as well as the "decimal" ones?
Yes, `calculational_precision` was a mistake; I prefer `computational_precision`. Also, I'd be happy with not referring to an external standard and, for now, just suggesting that two values, "decimal32" and "decimal64", are supported, unless someone thinks others are needed at this time.
Thank you @taylor13 for the proposals and @davidhassell for the implementation details.
I fully agree with your points 1, 2 and 3.
There is possibly one situation that might need attention. If the coordinates subject to compression are stored in decimal64, typically we would require the computations to be in decimal64 too, rather than decimal32.
We could deal with that either by:
A. Using the scheme proposed above, requiring the data creator to set the `computational_precision` accordingly.
B. Requiring that the interpolation calculations are never carried out at a lower precision than that of the coordinates subject to compression, even if the `computational_precision` is not set.
Probably A would be the cleanest; what do you think?
Thanks, @taylor13 and @AndersMS,
I, too, would favour A (using the scheme proposed above, requiring the data creator to set the `computational_precision` accordingly).
I'm starting to think that we need to be clear about what "decimal64" (or 32, 128, etc.) would mean. I'm fairly sure that we only want to specify a precision, rather than also insisting/implying that the user should use decimal64 floating-point format numbers in their calculations. The same issue would arise with "binary64", although I suspect that most code would use double precision floating-point by default.
Could the answer be to define our own vocabulary of "16", "32", "64", and "128"?
Or am I overcomplicating things?
I don't understand the difference between decimal64 and binary64 or what they precisely mean. If these terms specify things beyond precision, it's probably not appropriate to use them here, so I would support defining our own vocabulary, which would not confuse precision with anything else.
And I too would favor (or favour) A over B.
Hi @taylor13 and @davidhassell,
I am not fully up to date on the data types, but following the links that David sent, it appears that decimal64 is a base-10 floating-point number representation that is intended for applications where it is necessary to emulate decimal rounding exactly, such as financial and tax computations. I think we can disregard that for now.
binary32 and binary64 are the new official IEEE 754 names for what used to be called single- and double-precision floating-point numbers respectively, and are what most of us are familiar with.
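For anyone who wants to see what the two levels of precision mean in practice, here is a quick numpy check (numpy's float32/float64 correspond to IEEE 754 binary32/binary64):

```python
import numpy as np

# Machine epsilon and approximate significant decimal digits for the
# two binary floating-point formats.
for dtype in (np.float32, np.float64):
    info = np.finfo(dtype)
    print(dtype.__name__, info.eps, info.precision)
# float32 1.1920929e-07          6   (~7 significant decimal digits)
# float64 2.220446049250313e-16  15  (~16 significant decimal digits)
```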
I would suggest that we do not require a specific floating-point arithmetic standard to be used, but rather a level of precision. If we adopt the naming convention proposed by David, it could look like:
By default, the user may use any floating-point arithmetic precision they like for the interpolation calculations. If the `computational_precision` attribute has been set, then the precision should match or exceed the precision specified by the `computational_precision` attribute.
The allowed values of the `computational_precision` attribute are:
(table)
"32": 32-bit base-2 floating-point arithmetic, such as IEEE 754 binary32 or equivalent
"64": 64-bit base-2 floating-point arithmetic, such as IEEE 754 binary64 or equivalent
I think that would achieve what we are after, while leaving the implementers the freedom to use what their programming language and computing platform offer.
What do you think?
Looks good to me. Can we omit "base-2" from the descriptions, or is that essential? Might even reduce the description to, for example:
"32": 32-bit floating-point arithmetic
Leaving out "base-2" is fine. Shortening the description further as you suggest would also be fine with me.
I am wondering if we could change the wording to:
"The floating-point arithmetic precision should match or exceed the precision specified by the `computational_precision` attribute. The allowed values of the `computational_precision` attribute are:
(table)
"32": 32-bit floating-point arithmetic
"64": 64-bit floating-point arithmetic
If the `computational_precision` attribute has not been set, then the default value "32" applies."
That would ensure that we can assume a minimum precision on the user side, which would be important. Practically speaking, high-level languages that support 16-bit floating-point variables typically use 32-bit floating-point arithmetic for them anyway (a consequence of CPU design).
@taylor13 by the way I'm still on the prowl for a moderator for this discussion. As I see you've taken an interest, would you be willing to take on that role? I'd be able to do it as well, but as I've been involved in this proposal for quite some time it would be nice to have a fresh set of eyes on it.
Hi Anders,
"The floating-point arithmetic precision should match or exceed the precision specified by computational_precision attribute. The allowed values of computational_precision attribute are:
(table) "32": 32-bit floating-point arithmetic "64": 64-bit floating-point arithmetic
This is good for me.
If the computational_precision attribute has not been set, then the default value "32" applies."
That would ensure that we can assume a minimum precision on the user side, which would be important. Practically speaking, high-level languages that support 16-bit floating-point variables typically use 32-bit floating-point arithmetic for them anyway (a consequence of CPU design).
I'm not so sure about having a default value. In the absence of guidance from the creator, I'd probably prefer that the user is free to use whatever precision they would like.
Thanks, David
Hi David,
Fine, I take your advice regarding not having a default value. That is probably also simpler - one rule less.
Anders
Hi Anders - thanks, it sounds like we're currently in agreement - do you want to update the PR?
Hi David,
Yes, I would be happy to update the PR. However, I still have one concern regarding the `computational_precision` attribute.
In the introduction to Lossy Compression by Coordinate Sampling in chapter 8, I am planning to change the last sentence from
The creator of the compressed dataset can control the accuracy of the reconstituted coordinates through the degree of subsampling and the choice of interpolation method, see [Appendix J].
to
The creator of the compressed dataset can control the accuracy of the reconstituted coordinates through the degree of subsampling, the choice of interpolation method (see [Appendix J]) and the choice of computational precision (see Section X).
where Section X will be a new short section in chapter 8 describing the `computational_precision` attribute.
Recalling that we also write in the introduction to Lossy Compression by Coordinate Sampling in chapter 8 that
The metadata that define the interpolation formula and its inputs are complete, so that the results of the coordinate reconstitution process are well defined and of a predictable accuracy.
I think it would be more consistent if we make the computational_precision attribute mandatory and not optional. Otherwise the accuracy would not be predictable.
Would that be agreeable?
Hi Anders,
I think it would be more consistent if we make the computational_precision attribute mandatory and not optional. Otherwise the accuracy would not be predictable.
That's certainly agreeable to me, as is your outline of how to change chapter 8.
Thanks, David
Wouldn't the statement be correct as is (perhaps rewritten slightly; see below), if we indicated that if the computational_precision attribute is not specified, a default precision of "32" should be assumed? I would think that almost always the default precision would suffice, so for most data writers, it would be simpler if we didn't require this attribute. (But I don't feel strongly about this.)
Not sure how to word this precisely. Perhaps:
The attributes and default values defined for the interpolation formula and its inputs ensure
that the results of the coordinate reconstitution process are reproducible and of predictable
accuracy.
Hi @taylor13 and @davidhassell,
Regarding the `computational_precision` attribute, it appears that we currently have two proposals: either an optional attribute with a default value, or a mandatory attribute.
I have written two versions of the new Section 8.3.8, one for each of the two proposals. I hope that will help with deciding!
Anders
Optional attribute version:
8.3.8 Computational Precision
The accuracy of the reconstituted coordinates will depend on the degree of subsampling, the choice of interpolation method and the choice of the floating-point arithmetic precision with which the interpolation method is applied.
To ensure that the results of the coordinate reconstitution process are reproducible and of predictable accuracy, the creator of the compressed dataset may specify the floating-point arithmetic precision by setting the interpolation variable's `computational_precision` attribute to one of the following values:
(table)
"32": 32-bit floating-point arithmetic (default)
"64": 64-bit floating-point arithmetic
For the coordinate reconstitution process, the floating-point arithmetic precision should (or shall?) match or exceed the precision specified by the `computational_precision` attribute, or match or exceed 32-bit floating-point arithmetic if the `computational_precision` attribute has not been set.
Mandatory attribute version:
8.3.8 Computational Precision
The accuracy of the reconstituted coordinates will depend on the degree of subsampling, the choice of interpolation method and the choice of the floating-point arithmetic precision with which the interpolation method is applied.
To ensure that the results of the coordinate reconstitution process are reproducible and of predictable accuracy, the creator of the compressed dataset must specify the floating-point arithmetic precision by setting the interpolation variable's `computational_precision` attribute to one of the following values:
(table)
"32": 32-bit floating-point arithmetic
"64": 64-bit floating-point arithmetic
For the coordinate reconstitution process, the floating-point arithmetic precision should (or shall?) match or exceed the precision specified by the `computational_precision` attribute.
I have a preference for "optional" because I suspect in most cases 32-bit will be sufficient, and this would relieve data writers from having to include this attribute. There may be good reasons for making it mandatory; what are they?
Not sure about this, but I think "should" rather than "shall" is better.
Dear all
I've studied the text of proposed changes to Sect 8, as someone not at all involved in writing it or using these kinds of technique. (It's easier to read the files in Daniel's repo than the pull request in order to see the diagrams in place.) I think it all makes sense. It's well-designed and consistent with the rest of CF. Thanks for working it out so thoughtfully and carefully. The diagrams are very good as well.
I have not yet reviewed Appendix J or the conformance document. I'm going to be on leave next week, so I thought I'd contribute just this part before going.
Best wishes
Jonathan
There is one point where I have a suggestion for changing the content of the proposal, although probably you've already discussed this possibility. If I understand correctly, you must always have both the `tie_point_dimensions` and `tie_point_indices` attributes of the interpolation variable, and they must refer to the same tie point dimensions. Therefore I think a simpler design, easier for both the data-writer and data-reader to use, would combine these two attributes into one attribute, whose contents would be "interpolation_dimension: tie_point_interpolation_dimension tie_point_index_variable [interpolation_zone_dimension] [interpolation_dimension: ...]".
Also, I have some suggestions for naming:
If you adopt my suggestion for a single attribute to replace `tie_point_dimensions` and `tie_point_indices`, an obvious name for it would be `tie_points`. You've used that name for the attribute of the data variable. However, I would suggest that the attribute of the data variable could equally well be called `interpolation`, since it names the interpolation variable, and signals that interpolation is to be used.
Your terminology has "tie point interpolation dimension" and "interpolation dimension", but the former is not a special case of the latter. That could be confusing, in the same way that (unfortunately) in CF terminology an auxiliary coordinate variable is not a special kind of coordinate variable. I suggest you rename "tie point interpolation dimension" as e.g. "tie point reduced dimension" to avoid this misunderstanding.
A similar possible confusion is that a tie point index variable is not a special kind of tie point variable. To avoid this confusion and add clarity, I suggest you could rename "tie point variable" as "tie point coordinate variable".
The terms "interpolation zone" and "interpolation area" are unhelpful because it's not obvious from the words which one is bigger, so it's hard to remember. If you stick with "zone" for the small one, for area it would be better to use something which is more obviously much bigger, such as "province" or "realm"! Or perhaps you could use "division" or "department", since the defining characteristic is the discontinuity.
In the first paragraph of Sect 8 we distinguish three methods of reduction of dataset size. I would suggest minor clarifications:
There are three methods for reducing dataset size: packing, lossless compression, and lossy compression. By packing we mean altering the data in a way that reduces its precision (but has no other effect on accuracy). By lossless compression we mean techniques that store the data more efficiently and result in no loss of precision or accuracy. By lossy compression we mean techniques that store the data more efficiently and retain its precision but result in some loss in accuracy.
Then I think we could start a new paragraph with "Lossless compression only works in certain circumstances ...". By the way, isn't it the case that HDF supports per-variable gzipping? That wasn't available in the old netCDF data format for which this section was first written, so it's not mentioned, but perhaps it should be now.
There are a few points where I found the text of Sect 8.3 possibly unclear or difficult to follow:
"This form of compression may also be used on a domain variable with the same effect." I think this is an unclear addition. If I understand you correctly, insead of this final sentence you could begin the paragraph with "For some applications the coordinates of a data variable or a domain variable can require considerably more storage than the data in its domain."
Tie Point Dimensions Attribute. If you adopt my suggestion above, this subsection would change its name to "Tie points attribute". It would be good to begin the section by saying what the attribute is for. As it stands, it plunges straight into details. The second sentence in particular, about interpolation zones, bewildered me - I didn't know what it was talking about.
I follow this sentence: "For instance, interpolation dimension dimension1 could be mapped to two different tie point interpolation dimensions with dimension1: tp_dimension1 dimension1: tp_dimension2." But I don't understand the next sentence: "This is necessary when different tie point variables for a particular interpolation dimension do not contain the same number of tie points, and therefore define different numbers of interpolation zones, as is the case in Multiple interpolation variables with interpolation parameter attributes." The situation described does not occur in the example quoted, I think. I wonder if it should say, "This occurs when data variables that share an interpolation dimension and interpolation variable have different tie points for that dimension."
Instead of "A tie point variable must span at most one of the tie point interpolation dimensions associated with a given interpolation dimension." I would add a sentence to the first para of "Interpolation and non-interpolation dimension", which I would rewrite as follows:
For each interpolation variable identified in the tie_points attribute, all the associated tie point variables must share the same set of one or more dimensions. Each of the dimensions of a tie point variable must be either a dimension of the data variable, or a dimension which is to be interpolated to a dimension of the data variable. A tie point variable must not have more than one dimension corresponding to any given dimension of the data variable, and may have fewer dimensions than the data variable. Dimensions of the tie point variable which are interpolated are called tie point reduced dimensions, and the corresponding data variable dimensions are called interpolation dimensions, while those for which no interpolation is required, being the same in the data variable and the tie point variable, are called non-interpolation dimensions. The size of a tie point reduced dimension must be less than or equal to the size of the corresponding interpolation dimension.
In one place, you say "For each interpolation dimension, the number of interpolation zones is equal to the number of tie points minus the number of interpolation areas," and in another place, "An interpolation zone must span at least two points of each of its corresponding interpolation dimensions." It seems to me that "at least" is wrong - it should be "exactly two".
"The dimensions of an interpolation parameter variable must be a subset of zero or more of the ...".
I suggest a rewriting of the part about the dimensions of interpolation paramater variable, for clarity, if I've understood it correctly, as follows:
Where an interpolation zone dimension is provided, the variable provides a single value along that dimension for each interpolation zone, assumed to be defined at the centre of interpolation zone.
Where a tie point reduced dimension is provided, the variable provides a value for each tie point along that dimension. The value applies to the two interpolation zones on either side of the tie point, and is assumed to be defined at the interpolation zone boundary (figure 3).
In both cases, the implementation of the interpolation method should assume that an interpolation parameter variable applies equally to all interpolation zones along any interpolation dimension which it does not span.
For "The bounds of a tie point must be the same as the bounds of the corresponding target grid cells," I would suggest, "The bounds of a tie point must be the same as the bounds of the target grid cells whose coordinates are specified as the tie point."
I don't understand this sentence: "In this case, though, the tie point index variables are the identifying target domain cells to which the bounds apply, rather than bounds values themselves." A tie point index variable could not possibly contain bounds values.
In Example 8.5, you need only one (or maybe two) data variables since they're all the same in structure.
Dear @JonathanGregory
Thank you very much for your rich and detailed comments and suggestions, very appreciated.
The team behind the proposal met today and discussed all the points you raised. We have prepared or are in the process of preparing replies to each of the points. However, before sharing these here, we would like to update the proposal text accordingly via pull requests, in order to see if the changes have other effects on the overall proposal, which we have not yet identified.
Best regards, Anders
Dear All,
Following a discussion yesterday in the team behind the proposal, we propose the `computational_precision` attribute to be optional. Here is the proposed text, which now has a reference to [IEEE_754]. Feel free to comment.
Anders
8.3.8 Computational Precision
The accuracy of the reconstituted coordinates will depend on the degree of subsampling, the choice of interpolation method and the choice of the floating-point arithmetic precision with which the interpolation method is applied.
To ensure that the results of the coordinate reconstitution process are reproducible and of predictable accuracy, the creator of the compressed dataset may specify the floating-point arithmetic precision by setting the interpolation variable's `computational_precision` attribute to one of the following values:
(table)
"32": 32-bit floating-point arithmetic (default), comparable to the binary32 standard in [IEEE_754]
"64": 64-bit floating-point arithmetic, comparable to the binary64 standard in [IEEE_754]
For the coordinate reconstitution process, the floating-point arithmetic precision should match or exceed the precision specified by the `computational_precision` attribute, or match or exceed 32-bit floating-point arithmetic if the `computational_precision` attribute has not been set.
References
[IEEE_754] "IEEE Standard for Floating-Point Arithmetic," in IEEE Std 754-2019 (Revision of IEEE 754-2008) , vol., no., pp.1-84, 22 July 2019, doi: 10.1109/IEEESTD.2019.8766229.
Thank you, Anders. I'm very happy with this.
A minor suggestion - perhaps change:
"...may specify the floating-point arithmetic precision by setting ..."
to
... may specify the floating-point arithmetic precision to be used in the interpolation calculations by setting ...
just to be extra clear which precision is being specified.
Good idea David.
Should we perhaps use computation instead of calculation to match the attribute name? Here I have updated the first two paragraphs and added an example:
8.3.8 Computational Precision
"The accuracy of the reconstituted coordinates will depend on the degree of subsampling, the choice of interpolation method and the choice of the floating-point arithmetic precision used in the interpolation method computations.
To ensure that the results of the coordinate reconstitution process are reproducible and of predictable accuracy, the creator of the compressed dataset may specify the floating-point arithmetic precision to be used in the interpolation method computations by setting the interpolation variable’s computational_precision
attribute to one of the following values:
(table) "32": 32-bit floating-point arithmetic (default), comparable to the binary32 standard in [IEEE_754] "64": 64-bit floating-point arithmetic, comparable to the binary64 standard in [IEEE_754]
For the coordinate reconstitution process, the floating-point arithmetic precision should match or exceed the precision specified by the `computational_precision` attribute, or match or exceed 32-bit floating-point arithmetic if the `computational_precision` attribute has not been set.
As an example, `computational_precision = "64"` would specify that the floating-point arithmetic precision should match or exceed 64-bit floating-point arithmetic.
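To make the intent concrete, a data user honouring the attribute might do something like the following (a hypothetical reader-side sketch, not part of the proposed convention text):

```python
import numpy as np

# Map the computational_precision attribute value to a numpy dtype
# that matches the requested precision; exceeding it is also allowed.
PRECISION_DTYPES = {"32": np.float32, "64": np.float64}

def working_dtype(computational_precision="32"):
    # "32" is the default when the attribute has not been set
    if computational_precision not in PRECISION_DTYPES:
        raise ValueError("unsupported computational_precision: %r"
                         % (computational_precision,))
    return PRECISION_DTYPES[computational_precision]

# Cast the tie point coordinates up front so that all interpolation
# arithmetic runs at (at least) the requested precision.
tie_points = np.array([0.0, 10.0, 20.0], dtype=np.float32)
tie_points = tie_points.astype(working_dtype("64"))
```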
References
[IEEE_754] IEEE Standard for Floating-Point Arithmetic, in IEEE Std 754-2019 (Revision of IEEE 754-2008) , vol., no., pp.1-84, 22 July 2019, doi: 10.1109/IEEESTD.2019.8766229.
That looks good to me, Anders. The word computation is good.
I agree. This specification of precision is good.
Editorial suggestion: In the statement,
To ensure that the results of the coordinate reconstitution process are reproducible and of
predictable accuracy, the creator of the compressed dataset may specify the floating-point
arithmetic precision to be used in the interpolation method computations by ....
I think we should replace "reproducible and of predictable accuracy" with "reproducible with sufficient accuracy" (or something similar). The accuracy might for some algorithms be improved by using a higher precision than specified by the `computational_precision` attribute, but such higher accuracy might be considered unwarranted for a given dataset. So the accuracy really isn't totally determined by the attribute (i.e., it isn't predictable), because the user is free to perform the calculation at a higher precision.
(Hope this is correct and understandable.)
Hi @taylor13,
Your point is valid. I guess there would be two alternative solutions:
1. Keep the wording "reproducible and of predictable accuracy", with the expectation that the user performs the computations at the precision specified by the data creator.
2. Adopt your suggested wording, "reproducible with sufficient accuracy", leaving the user free to exceed the specified precision.
Personally, I think that ensuring that the results of the coordinate reconstitution process are reproducible and of predictable accuracy is very valuable, and my preference would be option 1.
I believe that if a data creator has judged that `computational_precision = "32"` is sufficient and appropriate for the data product, it would typically also imply that there is only limited scope for real improvements on the user side by going to 64-bit floating-point arithmetic. That would also support option 1.
What do you think?
Anders
Dear @JonathanGregory
We have progressed with preparing the replies to your proposals. Although there are still a couple of open points, we thought it would be useful to share what we already have.
We have numbered your proposals as Proposed Changes 1-16 and treated each of them separately below. For each of the Proposed Changes, you will find a reply to the proposed change as well as the related commit(s).
We are still working on a reply to Proposed Change 15; the other replies are complete.
We are still working on completing the corresponding document updates in the form of commits for Proposed Changes 1, 2, 8, 13 and 14; the other document commits are complete.
We will notify you once all replies and document commits are complete.
Best regards Anders
Proposed Change 1 – Combining the tie_point_dimensions and tie_point_indices attributes
There is one point where I have a suggestion for changing the content of the proposal, although probably you've already discussed this possibility. If I understand correctly, you must always have both the tie_point_dimensions and tie_point_indices attributes of the interpolation variable, and they must refer to the same tie point dimensions. Therefore I think a simpler design, easier for both the data-writer and data-reader to use, would combine these two attributes into one attribute, whose contents would be "interpolation_dimension: tie_point_interpolation_dimension tie_point_index_variable [interpolation_zone_dimension] [interpolation_dimension: ...]".
Reply to Proposed Change 1
We agree with combining the tie_point_dimensions and tie_point_indices attributes in a single attribute as you suggest, but propose to put the tie_point_index_variable before the dimensions:
interpolated_dimension: tie_point_index_variable tie_point_interpolation_dimension [interpolation_subarea_dimension] [interpolated_dimension: ...].
Commit(s) related to Proposed Change 1 e5feea3 9807518
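To illustrate how a data reader might consume the combined attribute, here is a rough Python sketch (hypothetical code, not part of the proposal) that parses the form given above, including the case discussed under Proposed Change 9 where an interpolated dimension is repeated:

```python
def parse_tie_point_mapping(value):
    """Parse 'interpolated_dimension: tie_point_index_variable
    tie_point_interpolation_dimension [interpolation_subarea_dimension] ...'
    into (dimension, index_variable, tp_dimension, subarea_dimension)
    tuples. A list is returned rather than a dict, because an
    interpolated dimension may appear more than once."""
    groups, current = [], None
    for token in value.split():
        if token.endswith(":"):          # a new interpolated dimension
            current = [token[:-1]]
            groups.append(current)
        else:
            current.append(token)
    parsed = []
    for group in groups:
        if len(group) == 3:              # subarea dimension omitted
            parsed.append((group[0], group[1], group[2], None))
        elif len(group) == 4:
            parsed.append(tuple(group))
        else:
            raise ValueError("malformed mapping: " + " ".join(group))
    return parsed

# parse_tie_point_mapping("dim1: tp_index1 tp_dim1 dim2: tp_index2 tp_dim2 i_dim2")
# -> [('dim1', 'tp_index1', 'tp_dim1', None),
#     ('dim2', 'tp_index2', 'tp_dim2', 'i_dim2')]
```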
Proposed Change 2 – Naming combined tie_point_dimensions and tie_point_indices to tie_points and existing tie_points to interpolation
Also, I have some suggestions for naming: • If you adopt my suggestion for a single attribute to replace tie_point_dimensions and tie_point_indices, an obvious name for it would be tie_points. You've used that name for the attribute of the data variable. However, I would suggest that the attribute of the data variable could equally well be called interpolation, since it names the interpolation variable, and signals that interpolation is to be used.
Reply to Proposed Change 2
We propose renaming the `tie_points` attribute of the data variable to `coordinate_interpolation`, as this makes the name more descriptive. We propose to use the name `tie_point_mapping` for the attribute of the interpolation variable resulting from combining the `tie_point_dimensions` and `tie_point_indices` attributes. We favor this over `tie_points`, as the attribute does not contain or reference tie point coordinate variables.
Commit(s) related to Proposed Change 2 ea5268b f8cd983 e5feea3
Proposed Change 3 - Rename term "tie point interpolation dimension" to e.g. "tie point reduced dimension"
• Your terminology has "tie point interpolation dimension" and "interpolation dimension", but the former is not a special case of the latter. That could be confusing, in the same way that (unfortunately) in CF terminology an auxiliary coordinate variable is not a special kind of coordinate variable. I suggest you rename "tie point interpolation dimension" as e.g. "tie point reduced dimension" to avoid this misunderstanding.
Reply to Proposed Change 3
For the coordinate dimensions in the target domain, we suggest replacing the terms “interpolating dimension” and “non-interpolating dimension” with the terms “interpolated dimension” and “non-interpolated dimension”.
Further to this, we propose to change the term "tie point interpolation dimension" to "subsampled dimension".
We maintain the term "interpolation subarea dimension".
Commit(s) related to Proposed Change 3 7ab0c4f db0eb4e 3501992
Proposed Change 4 - Rename term "tie point variable" to "tie point coordinate variable"
• A similar possible confusion is that a tie point index variable is not a special kind of tie point variable. To avoid this confusion and add clarity, I suggest you could rename "tie point variable" as "tie point coordinate variable".
Reply to Proposed Change 4
We agree to rename the term "tie point variable" to "tie point coordinate variable".
Commit(s) related to Proposed Change 4 a6d37b4
Proposed Change 5 – Renaming of term "interpolation area"
• The terms "interpolation zone" and "interpolation area" are unhelpful because it's not obvious from the words which one is bigger, so it's hard to remember. If you stick with "zone" for the small one, for area it would be better to use something which is more obviously much bigger, such as "province" or "realm"! Or perhaps you could use "division" or "department", since the defining characteristic is the discontinuity.
Reply to Proposed Change 5
We propose replacing the terms “interpolation area(s)”, each consisting of one or more “interpolation zone(s)”, with “continuous area(s)”, each consisting of one or more “interpolation subarea(s)”.
Commit(s) related to Proposed Change 5 d6d7ea3
Proposed Change 6 - Rewording of first paragraph of Section 8
In the first paragraph of Sect 8 we distinguish three methods of reduction of datset size. I would suggest minor clarifications: There are three methods for reducing dataset size: packing, lossless compression, and lossy compression. By packing we mean altering the data in a way that reduces its precision (but has no other effect on accuracy). By lossless compression we mean techniques that store the data more efficiently and result in no loss of precision or accuracy. By lossy compression we mean techniques that store the data more efficiently and retain its precision but result in some loss in accuracy.
Then I think we could start a new paragraph with "Lossless compression only works in certain circumstances ...". By the way, isn't it the case that HDF supports per-variable gzipping? That wasn't available in the old netCDF data format for which this section was first written, so it's not mentioned, but perhaps it should be now.
Reply to Proposed Change 6
We agree with your proposed text and have updated the text accordingly. Additionally, we have opened an issue (Rework intro to Section 8: Accuracy & precision · Issue #330 · cf-convention/cf-conventions (github.com)) to address per-variable gzipping as well as to verify that the usage of the terms precision and accuracy is correct.
Commit(s) related to Proposed Change 6 27d1733
Proposed Change 7 – Clarification of effect for domain variables
There are a few points where I found the text of Sect 8.3 possibly unclear or difficult to follow: • "This form of compression may also be used on a domain variable with the same effect." I think this is an unclear addition. If I understand you correctly, instead of this final sentence you could begin the paragraph with "For some applications the coordinates of a data variable or a domain variable can require considerably more storage than the data in its domain."
Reply to Proposed Change 7
We propose to remove the sentence "This form of compression may also be used on a domain variable with the same effect." and not replace it, as the definition of the domain variable already allows for this compression.
Commit(s) related to Proposed Change 7 971bfbe
Proposed Change 8 – Rewording of section on Tie Point Dimensions Attribute
• Tie Point Dimensions Attribute. If you adopt my suggestion above, this subsection would change its name to "Tie points attribute". It would be good to begin the section by saying what the attribute is for. As it stands, it plunges straight into details. The second sentence in particular, about interpolation zones, bewildered me - I didn't know what it was talking about.
Reply to Proposed Change 8
As we have agreed to combine the `tie_point_dimensions` and `tie_point_indices` attributes (Proposed Changes 1 and 2), we must also reorganise the old Section 8.3.5, "Tie Point Dimensions Attribute", and Section 8.3.6, "Tie Point Indices", to reflect this change. When doing this, we will also improve the wording.
Commit(s) related to Proposed Change 8 e5feea3 2becd52
Proposed Change 9
• I follow this sentence: "For instance, interpolation dimension dimension1 could be mapped to two different tie point interpolation dimensions with dimension1: tp_dimension1 dimension1: tp_dimension2." But I don't understand the next sentence: "This is necessary when different tie point variables for a particular interpolation dimension do not contain the same number of tie points, and therefore define different numbers of interpolation zones, as is the case in Multiple interpolation variables with interpolation parameter attributes." The situation described does not occur in the example quoted, I think. I wonder if it should say, "This occurs when data variables that share an interpolation dimension and interpolation variable have different tie points for that dimension."
Reply to Proposed Change 9
We will delete the reference to the example.
We will update the text as follows:
A single interpolated dimension may be associated with multiple tie point interpolation dimensions by repeating the interpolated dimension in the `tie_point_mapping` attribute. For instance, interpolated dimension `dimension1` could be mapped to two different tie point interpolation dimensions with `dimension1: tp_index_variable1 tp_dimension1 dimension1: tp_index_variable2 tp_dimension2`. This is necessary when two or more tie point coordinate variables have different tie point index variables corresponding to the same interpolated dimension. A tie point coordinate variable must span at most one of the tie point interpolation dimensions associated with a given interpolated dimension.
Commit(s) related to Proposed Change 9 3d4348f
Proposed Change 10
• Instead of "A tie point variable must span at most one of the tie point interpolation dimensions associated with a given interpolation dimension." I would add a sentence to the first para of "Interpolation and non-interpolation dimension", which I would rewrite as follows: For each interpolation variable identified in the tie_points attribute, all the associated tie point variables must share the same set of one or more dimensions. Each of the dimensions of a tie point variable must be either a dimension of the data variable, or a dimension which is to be interpolated to a dimension of the data variable. A tie point variable must not have more than one dimension corresponding to any given dimension of the data variable, and may have fewer dimensions than the data variable. Dimensions of the tie point variable which are interpolated are called tie point reduced dimensions, and the corresponding data variable dimensions are called interpolation dimensions, while those for which no interpolation is required, being the same in the data variable and the tie point variable, are called non-interpolation dimensions. The size of a tie point reduced dimension must be less than or equal to the size of the corresponding interpolation dimension.
Reply to Proposed Change 10
We propose the new wording: For each interpolation variable identified in the coordinate_interpolation attribute, all of the associated tie point coordinate variables must share the same set of one or more dimensions. This set of dimensions must correspond to the set of dimensions of the uncompressed coordinate or auxiliary coordinate variables, such that each of these dimensions must be either the uncompressed dimension itself, or a dimension that is to be interpolated to the uncompressed dimension.
Dimensions of the tie point coordinate variable which are to be interpolated are called tie point interpolation dimensions, and the corresponding data variable dimensions are called interpolated dimensions, while those for which no interpolation is required, being the same in the data variable and the tie point coordinate variable, are called non-interpolated dimensions. The dimensions of a tie point coordinate variable must include at least one tie point interpolation dimension, and for each such dimension the corresponding interpolated dimension must not also be included.
Commit(s) related to Proposed Change 10 190fdff 7ab0c4f
Proposed Change 11
• In one place, you say "For each interpolation dimension, the number of interpolation zones is equal to the number of tie points minus the number of interpolation areas," and in another place, "An interpolation zone must span at least two points of each of its corresponding interpolation dimensions." It seems to me that "at least" is wrong - it should be "exactly two".
Reply to Proposed Change 11
Both sentences are true, but we can see that the wording is easily misunderstood.
The “span at least two points” refers to points in the interpolated dimension of the target domain. With Proposed Change 3, the proposed rewording of the second sentence is "An interpolation zone must span at least two points in each of its corresponding interpolated dimensions".
Commit(s) related to Proposed Change 11 d715552 5cfae45
Proposed Change 12
• "The dimensions of an interpolation parameter variable must be a subset of zero or more of the ...".
Reply to Proposed Change 12
We agree.
Commit(s) related to Proposed Change 12 fdeef67
Proposed Change 13
• I suggest a rewriting of the part about the dimensions of interpolation parameter variable, for clarity, if I've understood it correctly, as follows: Where an interpolation zone dimension is provided, the variable provides a single value along that dimension for each interpolation zone, assumed to be defined at the centre of interpolation zone. Where a tie point reduced dimension is provided, the variable provides a value for each tie point along that dimension. The value applies to the two interpolation zones on either side of the tie point, and is assumed to be defined at the interpolation zone boundary (figure 3).
In both cases, the implementation of the interpolation method should assume that an interpolation parameter variable applies equally to all interpolation zones along any interpolation dimension which it does not span.
Reply to Proposed Change 13
We propose changing this existing text:
“The dimensions of an interpolation parameter variable must be a subset of zero or more the tie point variable dimensions, with the possibility of a tie point interpolation dimension being replaced with the corresponding interpolation zone dimension. The interpretation of an interpolation parameter variable depends on which of its dimensions are tie point interpolation dimensions, and which are interpolation zone dimensions:
• If no tie point interpolation dimensions are spanned, then the variable provides values for every interpolation zone. This case is akin to values being defined at the centre of interpolation zones.
• If at least one dimension is a tie point interpolation dimension, then the variable’s values are to be shared by the interpolation zones that are adjacent along each of the specified tie point interpolation dimensions. This case is akin to the values being defined at the interpolation zone boundaries, and therefore equally applicable to the interpolation zones that share that boundary (figure 3).
In both cases, the implementation of the interpolation method should assume that an interpolation parameter variable is broadcast to any interpolation zones that it does not span.”
with this new text:
“The interpolation parameter variable dimensions must include, for all of the interpolation dimensions, either the associated tie point interpolation dimension or the associated interpolation subarea dimension. Additionally, any subset of zero or more of the non-interpolation dimensions of the tie point coordinate variable are permitted as interpolation parameter variable dimensions.
The application of an interpolation parameter variable is independent of its non-interpolation dimensions, but depends on its set of tie point interpolation dimensions and interpolation subarea dimensions:
In figure 3, the fourth example will be deleted, and the broadcast-type application of interpolation parameter variable values is no longer supported, as it was difficult to define accurately.
Commit(s) related to Proposed Change 13 ba4a65e 3501992
Proposed Change 14
• For "The bounds of a tie point must be the same as the bounds of the corresponding target grid cells," I would suggest, "The bounds of a tie point must be the same as the bounds of the target grid cells whose coordinates are specified as the tie point."
Reply to Proposed Change 14
We agree with the proposed new text: "The bounds of a tie point must be the same as the bounds of the target grid cells whose coordinates are specified as the tie point."
Commit(s) related to Proposed Change 14 aceb987
Proposed Change 15
• I don't understand this sentence: "In this case, though, the tie point index variables are the identifying target domain cells to which the bounds apply, rather than bounds values themselves." A tie point index variable could not possibly contain bounds values.
Reply to Proposed Change 15
A completely rewritten section "Interpolation of Cell Boundaries" has been introduced, see f3de508.
Commit(s) related to Proposed Change 15 f3de508
Proposed Change 16
• In Example 8.5, you need only one (or maybe two) data variables since they're all the same in structure.
Reply to Proposed Change 16
In Example 8.5, we propose deleting the data variables `I01_radiance` and `I01_reflectance`, and keeping `I04_radiance` and `I04_brightness_temperature` to demonstrate the reuse of the interpolation variable for data variables with different units.
Commit(s) related to Proposed Change 16 f439ee5
Hi again @JonathanGregory
Just to add that the figures have not yet been updated; I think we will do this when all text changes have been agreed.
Anders
Dear All,
I believe the following paragraph from our chapter 8 is no longer relevant, now that we have moved all the dimension-related attributes from the data variable to the interpolation variable.
The tie point variables `lat` and `lon` spanning dimension `tp_dimension1` and tie point variable `time` spanning dimension `tp_dimension2` must each have their own interpolation variable.
Would you agree?
Anders
The same interpolation variable may be multiply mapped from different sets of tie point coordinate variables. For instance, if tie point variables `lat` and `lon` span dimension `tp_dimension1` and tie point variable `time` spans dimension `tp_dimension2`, and all three are to be interpolated according to interpolation variable `linear`, then the `coordinate_interpolation` attribute could be `lat: lon: linear time: linear`. In this case it is not possible to simultaneously map all three tie point coordinate variables to the linear interpolation variable because they do not all span the same axes.
Hi Anders,
I believe the following paragraph from our chapter 8 is no longer relevant
I do agree.
David
I have removed the paragraph "The same interpolation variable may be multiply mapped ...." as proposed here.
Commit 485d3b8
Dear @AndersMS and colleagues
Thanks very much for taking my comments so seriously and for the modifications and explanations. I agree with all these improvements, with two reservations:
Do you somewhere state that the size of a tie point interpolation dimension must be less than or equal to the size of the corresponding interpolated dimension? I suggested this sentence somewhere but you haven't included it there. Maybe it is somewhere else. It seems obvious but is nonetheless worth stating.
While I appreciate you want to relate things to interpolation, I would urge you to use a different word from "interpolated", because you're depending on a very attentive reader in sentences such as "A single interpolated dimension may be associated with multiple tie point interpolation dimensions." My suggestion of "reduced" is not necessarily a good one, but it is noticeably different from "interpolation". Also, "interpolated" doesn't seem quite right to me. You mean, it's going to be interpolated. It hasn't yet been interpolated, though.
I will study the appendix and conformance document next week sometime.
Best wishes
Jonathan
Dear @JonathanGregory,
Thank you for the feedback.
Yes, we had a sentence saying that the size of a tie point interpolation dimension must be less than or equal to the size of the corresponding interpolated dimension. I actually deleted it, since it is a consequence of other constraints rather than a constraint of its own. But it makes sense to state it and I will re-introduce the sentence.
I like the term interpolated dimension that we proposed - it has made it easier for me to read and write the text. And I am hesitant to introduce an additional term like "reduced" in the text. Note that it is "tie point interpolation dimension" versus "interpolated dimension", so the difference is more than just the difference between interpolation and interpolated. The way I memorize it is that a tie point interpolation dimension is available for interpolation, whereas the corresponding dimension in the target domain is, once it exists, an interpolated dimension. Non-interpolated dimensions are the same in both the tie point domain and the target domain and are never interpolated.
I will discuss the last point with the rest of the group when we meet tomorrow.
Best regards, Anders
Dear @AndersMS
In your proposed change 10, you used the word "uncompressed", and "compression" is in the title of this proposal. I think it would be clearer to speak of a "compressed dimension" of the tie point variable corresponding to an "uncompressed dimension" of the data variable, or perhaps an "expanded dimension", with the other dimensions being non-interpolation/non-interpolated.
Best wishes
Jonathan
Dear @JonathanGregory
That's an interesting suggestion, thank you. We will discuss it in the group tomorrow.
Best regards, Anders
Dear @JonathanGregory et al. (@AndersMS @davidhassell @oceandatalab @ajelenak)
Concerning terminology, following discussion in the group, these terms seem good candidates:
At tie point level: "subsampled dimension", "non-interpolated dimension"
At reconstituted level: "interpolated dimension", "non-interpolated dimension"
("non-interpolated dimension" is repeated because it is shared across the two domains.)
Would this be an improvement in your view?
Regarding `computational_precision`: this attribute should be mandatory, for data producers to specify the precision in which interpolation should take place. This means that there should be no default value; the creator specifies it. It is up to the user to interpret that, and using a precision that deviates from the recommendation would not prompt an arrest by the CF police, but would mean that they might have deviations in the interpolated data. This should be clearly described so that the reader of the Conventions understands the potential impacts (no need to wax eloquent here, though).
Discussion on bounds will be on the agenda.
Dear @AndersMS, Daniel (@erget), et al.,
Concerning terminology, following discussion in the group, these terms seem good candidates:
At tie point level: "subsampled dimension", "non-interpolated dimension"
At reconstituted level: "interpolated dimension", "non-interpolated dimension"
Yes, that terminology seems clear and self-explanatory to me. Thanks for your patience and carefulness.
Actionees!
I take on myself an action to review the rest of the proposal (Appendix and conformance document) this week.
Best wishes
Jonathan
Dear All,
Considering that we have now renamed the term tie point interpolation dimension to subsampled dimension, should we possibly change the title
Lossy Compression by Coordinate Sampling
to
Lossy Compression by Coordinate Subsampling
and replace the occurrences of sample/sampled in the text with subsample/subsampled?
Anders
Hi,
Here is a new take on the computational precision paragraph:
The accuracy of the reconstituted coordinates will depend on the degree of subsampling, the choice of interpolation method and the choice of the floating-point arithmetic precision used in the interpolation method computations.
Implementation details of the interpolation methods and hardware can also have an impact on the accuracy of the reconstituted coordinates.
The creator of the compressed dataset must check that the coordinates reconstituted using the interpolation parameters specified in the file have sufficient accuracy compared to the coordinates at full resolution.
Although it may depend on the software and hardware used by the creator, the floating-point arithmetic precision used during this validation step must be specified in the `computational_precision` attribute of the interpolation variable, as an indication of potential floating-point precision issues during the interpolation computations.
The `computational_precision` attribute is mandatory and accepts the following values:
(table)
"32": 32-bit floating-point arithmetic, comparable to the binary32 standard in [IEEE_754]
"64": 64-bit floating-point arithmetic, comparable to the binary64 standard in [IEEE_754]
For the coordinate reconstitution process, using a floating-point arithmetic precision matching or exceeding the precision specified by `computational_precision` is likely to produce results with an accuracy similar to what the creator obtained during the validation of the dataset, but this cannot be guaranteed due to the software/hardware factors mentioned above.
As an example, `computational_precision = "64"` would specify that, using the same software and hardware as the creator of the compressed dataset, sufficient accuracy could not be reached when using a floating-point precision lower than 64-bit floating-point arithmetic in the interpolation computations required to reconstitute the coordinates.
References
[IEEE_754] IEEE Standard for Floating-Point Arithmetic, in IEEE Std 754-2019 (Revision of IEEE 754-2008) , vol., no., pp.1-84, 22 July 2019, doi: 10.1109/IEEESTD.2019.8766229.
The accuracy of the interpolation methods depends not only on the choices made by the data producer (tie point density, area subdivisions, interpolation method parameters, etc.) but also on the software (programming language, libraries) and on the hardware (CPU/FPU) used by the data consumers.
The data producers only know about their own software and hardware, so the computational_precision attribute can only mean that the data producer used this floating point precision when they validated these data using their implementation of the interpolation method, not that using this floating point precision on any software/hardware combination will produce exactly the same results.
I think the computational_precision attribute can only be considered as a hint provided by the data producer regarding numerical issues they encountered when trying to reconstruct the target variables at their full resolution with their implementation of the interpolation method: if the computational_precision exceeds the precision of the data type (e.g. a "64" computational_precision used when interpolating a float variable), then users know that the data producer did not obtain satisfactory results when using a lower precision, hence they should be wary of underflow/overflow errors when they interpolate these data. So computational_precision is more of an informational hint than a compulsory instruction given to the users (unless @erget's CF police becomes a reality), and it is not a reproducibility guarantee either.
Yet it is still a useful piece of information and no one except the data producer can provide it since you need access to the original data at their native resolution to make actual checks on the accuracy of the interpolation method. As the information cannot be derived from the content of the file it makes sense to require that data producers include this attribute systematically: the computational_precision should be mandatory.
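As an illustration of this "hint" reading (again, not from the proposal: the helper below is hypothetical), a reader could compare the attribute against the storage precision of the tie point variables and warn when the creator needed wider arithmetic than the stored data type carries:

import numpy as np

def precision_hint(tie_point_dtype, computational_precision):
    # Return a warning when the creator validated with wider arithmetic
    # than the tie points are stored with, else None.
    stored_bits = np.dtype(tie_point_dtype).itemsize * 8
    if int(computational_precision) > stored_bits:
        return ("creator validated with %s-bit arithmetic but tie points are "
                "stored as %d-bit floats; be wary of rounding problems"
                % (computational_precision, stored_bits))
    return None

# float32 tie points validated with 64-bit arithmetic => warn the user
print(precision_hint(np.float32, "64"))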
Sylvain
@AndersMS: yes I think replacing "sample/sampled" with "subsample/subsampled" would make the text more consistent.
Dear Sylvain (@oceandatalab)
Thank you very much for your proposed wording of the Computational Precision text, which I think is a sound way to formulate the meaning and usage of the computational_precision attribute.
I like the detailed rationale you have provided and support making the computational_precision attribute of the interpolation variable mandatory.
Possibly we could shorten the text slightly and still convey the message? Would the following possibly do the job?
8.3.8 Computational Precision
The accuracy of the reconstituted coordinates will mainly depend on the degree of subsampling, the choice of interpolation method and the choice of the floating-point arithmetic precision used in the interpolation method computations.
The accuracy of the reconstituted coordinates may also depend on details of the interpolation method implementation and on the computer platform, meaning that the results of the coordinate reconstitution process may not be fully reproducible.
However, to enable the data user to reconstitute the coordinates to an accuracy comparable to the accuracy intended by the data creator, the data creator shall specify the floating-point arithmetic precision used during the preparation and validation of the compressed coordinates by setting the interpolation variable’s computational_precision attribute to one of the following values:
(table) "32": 32-bit floating-point arithmetic, comparable to the binary32 standard in [IEEE_754] "64": 64-bit floating-point arithmetic, comparable to the binary64 standard in [IEEE_754]
As an example, a computational_precision = "64" would provide the guidance to the data user that using 64-bit floating-point arithmetic will reconstitute the coordinates with an accuracy comparable to the accuracy intended by the data creator.
@oceandatalab (Sylvain) & @AndersMS - I am in favour of the shorter text; in fact, perhaps one could combine these 3 paragraphs into 1:
The accuracy of the reconstituted coordinates will mainly depend on the degree of subsampling, the choice of interpolation method and the choice of the floating-point arithmetic precision used in the interpolation method computations.
The accuracy of the reconstituted coordinates may also depend on details of the interpolation method implementation and on the computer platform, meaning that the results of the coordinate reconstitution process may not be fully reproducible.
However, to enable the data user to reconstitute the coordinates to an accuracy comparable to the accuracy intended by the data creator, the data creator shall specify the floating-point arithmetic precision used during the preparation and validation of the compressed coordinates by setting the interpolation variable’s computational_precision attribute to one of the following values:
Please take that as a suggestion from my side - if you feel the 3 paragraphs are better, I wouldn't feel strongly enough to call the CF police.
Thank you for the comments @AndersMS and @erget.
I like the concise version too; I would just keep my version of the "As an example ..." paragraph even if it is more verbose, because it states exactly what the attribute means, hopefully leaving no room for misinterpretation. The "[...] using 64-bit floating-point arithmetic will reconstitute [...]" in the shorter version is misleading from my point of view because it glosses over the software/hardware factor (though I agree it will not be an issue in most cases).
As for regrouping the 3 paragraphs into one, I think we should keep them separated so that the content of paragraph 2 stands out: it is really important to state that exact reproducibility is not what is offered here so that users don't have unrealistic expectations.
Hi all,
Sylvain's descriptions and rationale are very good, I think. I am wondering, however, if we are making overly bold claims about accuracy when we have no control over the interpolation method's implementation. A user's technique may differ from the creator's (that's OK), but if one technique was numerically ill-conditioned and the other not, even using the same precision could lead to inaccurate results.
With that in mind, here's another suggestion (I think I prefer the one-paragraph approach, as it helps connect the constituent points, but I don't have a strong opinion on that):
The accuracy of the reconstituted coordinates depends mainly on the degree of subsampling and the choice of interpolation method, both of which are set by the creator of the dataset. The accuracy will also depend, however, on how the interpolation method is implemented and on the computer platform carrying out the computations. There are no restrictions on the choice of interpolation method implementation for either the data creator or the data user, but the floating-point arithmetic precision used by the data creator during the preparation and validation of the compressed coordinates must be specified by setting the interpolation variable’s computational_precision attribute to one of the following values:
(table) "32": 32-bit floating-point arithmetic, comparable to the binary32 standard in [IEEE_754] "64": 64-bit floating-point arithmetic, comparable to the binary64 standard in [IEEE_754]
Using the given computational precision in the interpolation computations is a necessary, but not sufficient, condition for the data user to be able to reconstitute the coordinates to an accuracy comparable to that intended by the data creator. For instance, a computational_precision value of "64" would specify that, using the same software and hardware as the creator of the compressed dataset, sufficient accuracy could not be reached when using a floating-point precision lower than 64-bit floating-point arithmetic in the interpolation computations required to reconstitute the coordinates.
Dear All,
As proposed above, I will go ahead and replace all occurrences and forms of sampled with subsampled in the present PR #326, including the headings of chapter 8.3, Appendix J and chapter 8.3 of the conformance document, unless I receive reservations against the proposal by tomorrow end of business.
The new title of Chapter 8.3 would then become Lossy Compression by Coordinate Subsampling.
Best regards, Anders
Hi @davidhassell,
I am in favor of your version of the "computational precision" paragraph: it conveys all the required information while remaining concise, and yet clearly warns users about the limited scope of the computational_precision attribute.
Title
Lossy Compression by Coordinate Sampling
Moderator
@JonathanGregory
Moderator Status Review [last updated: YYYY-MM-DD]
Brief comment on current status, update periodically
Requirement Summary
The spatiotemporal, spectral, and thematic resolutions of Earth science data are increasing rapidly. This presents a challenge for all types of Earth science data, whether derived from models, in-situ measurements, or remote sensing observations.
In particular, when coordinate information varies with time, the domain definition can be many times larger than the (potentially already very large) data which it describes. This is often the case for remote sensing products, such as swath measurements from a polar orbiting satellite (e.g. slide 4 in https://cfconventions.org/Meetings/2020-workshop/Subsampled-coordinates-in-CF-netCDF.pdf).
Such datasets are often prohibitively expensive to store, and so some form of compression is required. However, native compression, such as is available in the HDF5 library, does not generally provide enough of a saving, due to the nature of the values being compressed (e.g. few missing or repeated values).
An alternative form of compression-by-convention amounts to storing only a small subsample of the coordinate values, alongside an interpolation algorithm that describes how the subsample can be used to regenerate the original, full-resolution set of coordinates. This form of compression has been shown to outperform native compression by "orders of magnitude" (e.g. slide 6 in https://cfconventions.org/Meetings/2020-workshop/Subsampled-coordinates-in-CF-netCDF.pdf).
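For intuition only, here is a toy one-dimensional sketch of the idea (keep every 16th value, reconstitute by linear interpolation); it does not follow the encoding or the standardized interpolation methods defined in PR #326, and all values are made up.

import numpy as np

# A smoothly varying 1-d coordinate at full resolution (made-up values).
n = 1441
full = np.linspace(0.0, 50.0, n) + 0.05 * np.sin(np.linspace(0.0, 20.0, n))

# "Compress" by keeping only every 16th value: these are the tie points,
# and are all that would be stored in the file.
stride = 16
tie_indices = np.arange(0, n, stride)
tie_points = full[tie_indices]

# Reconstitute the full-resolution coordinate by interpolation.
reconstituted = np.interp(np.arange(n), tie_indices, tie_points)

# The compression is lossy: the error grows with the stride and with the
# curvature of the coordinate, so the data creator picks the degree of
# subsampling that keeps it within acceptable limits.
print("max abs error:", np.abs(reconstituted - full).max())

In this toy case, 91 tie points stand in for 1441 values, roughly a 16-fold saving; in two dimensions the saving compounds to roughly 256-fold.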
Various implementations following this broad methodology are currently in use (see https://github.com/cf-convention/discuss/issues/37#issuecomment-608459133 for examples); however, the steps that are required to reconstitute the full resolution coordinates are not necessarily well defined within a dataset.
This proposal offers a standardized approach covering the complete end-to-end process, including a detailed description of the required steps. At the same time it is a framework where new methods can be added or existing methods can be extended.
Unlike compression by gathering, this form of compression is lossy, due to rounding and approximation errors in the required interpolation calculations. However, the loss in accuracy is a function of the degree to which the coordinates are subsampled and of the choice of interpolation algorithm (of which there are configurable standardized and non-standardized options), and so may be determined by the data creator to be within acceptable limits. For example, in one application with cell sizes of approximately 750 metres by 750 metres, interpolation of a stored subsample comprising every 16th value in each dimension was able to recreate the original coordinate values to a mean accuracy of ~1 metre. (Details of this test are available.)
Whilst remote sensing applications are the motivating concern for this proposal, the approach presented has been designed to be fully general, and so can be applied to structured coordinates describing any domain, such as one describing model outputs.
Technical Proposal Summary
See PR #326 for details. In summary:
The approach and encoding are fully described in the new section 8.3 "Lossy Compression by Coordinate Sampling" of Chapter 8: Reduction of Dataset Size.
A new appendix J describes the standardized interpolation algorithms, and includes guidance for data creators.
Appendix A has been updated for a new data and domain variable attribute.
The conformance document has new checks for all of the new content.
The new "interpolation variable" has been included in the Terminology in Chapter 1.
The list of examples in toc-extra.adoc has been updated for the new examples in section 8.3.
Benefits
Anyone who has prohibitively large domain descriptions, and for whom absolute accuracy of cell locations is not an issue, may benefit.
Status Quo
The storage of large, structured domain descriptions is either prohibitively expensive or handled in non-standardized ways.
Associated pull request
PR #326
Detailed Proposal
PR #326
Authors
This proposal has been put together by (in alphabetical order):
Aleksandar Jelenak, Anders Meier Soerensen, Daniel Lee, David Hassell, Lucile Gaultier, Sylvain Herlédan, Thomas Lavergne