Hello @taylor13, @AndersMS,
It might be better to continue the conversation from #37 on the precision of interpolation calculations (the comment thread starting at https://github.com/cf-convention/discuss/issues/37#issuecomment-832142697) here in this issue, as this is now the main place for discussing the PR containing the details of this proposal, of which this precision question is one.
I hope that's alright, thanks, David
Hi @taylor13
Thank you very much for your comments. We did have a flaw or a weakness in the algorithm, which we have corrected following your comments.
To briefly explain: the method of the proposal stores coordinates at a set of tie points, from which the coordinates in the target domain may then be reconstituted by interpolation. The source of the problem was the computation of the squared distance between two such tie points. The distance will never be zero and could, for example, be on the order of a few kilometers. As the line between the two tie points forms a right triangle with two other lines of known length, the fastest way to compute the squared distance is to use Pythagoras's theorem. However, as the two other sides are both significantly longer than the one we wish to calculate, the result was very sensitive to rounding in 32-bit floating-point calculations and occasionally returned zero. We have now changed the algorithm to compute the squared distance as `x*x + y*y + z*z`, where `(x, y, z)` is the vector between the two tie points. This expression does not have the weakness explained above and has now been tested to work well.
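To illustrate the difference with a minimal numpy sketch (made-up illustrative numbers, not our actual code):

```python
import numpy as np

# Two tie points ~7000 km from the origin and 500 m apart, arranged so
# that the segment between them is one leg of a right triangle whose
# hypotenuse (|p2|) and other leg (|p1|) are known.
p1 = np.array([7000e3, 0.0,   0.0], dtype=np.float32)
p2 = np.array([7000e3, 500.0, 0.0], dtype=np.float32)

# Pythagoras: d^2 = |p2|^2 - |p1|^2. Both squared lengths are ~4.9e13,
# where a 32-bit float has a spacing of ~4e6, so the small difference
# is lost to rounding.
d2_pythagoras = np.dot(p2, p2) - np.dot(p1, p1)

# Direct form: d^2 = x*x + y*y + z*z of the vector between the tie
# points. No large intermediate values, hence no cancellation.
x, y, z = p2 - p1
d2_direct = x*x + y*y + z*z

print(d2_pythagoras)  # 0.0      -- catastrophic cancellation
print(d2_direct)      # 250000.0 -- exact
```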
In terms of how accurately the method reconstitutes the original coordinates, the change improved the performance of the internal calculations carried out in 32-bit floating-point, although the errors are still a couple of times larger than when using 64-bit floating-point calculations.
I would therefore support the proposal put forward by @davidhassell. The proposal avoids setting a general rule, which, as you point out, may not cover all cases. It permits setting a requirement when needed to reconstitute data with the accuracy intended by the data creator.
Once again, thank you very much for your comments – further comments from your side on the proposal would be highly welcome!
Cheers Anders
For convenience, here is the proposal for specifying the precision to be used for the interpolation calculations (slightly robustified):
If the `interpolation_precision` attribute has been set to a numerical value then the precision should match the precision of the given numerical value.

```
// Interpolation variable with NO 'interpolation_precision' attribute
// => the creator is saying "you can use whatever precision you like when uncompressing"
char interp ;
  interp:interpolation_name = "bi_linear" ;

// Interpolation variable with 'interpolation_precision' attribute
// => the creator is saying "you must use the precision I have specified when uncompressing"
char interp ;
  interp:interpolation_name = "bi_linear" ;
  interp:interpolation_precision = 0D ; // use double precision when uncompressing
```
Do you think that this might work, @taylor13?
Thanks, David
Thanks @AndersMS for the care taken to address my concern, and thanks @davidhassell for the proposed revision. A few minor comments:
1. Regarding the `interpolation_precision` attribute: some users might wish to carry out the interpolation calculations at a higher precision than the one specified. I would hate to think that the interpolation method would be degraded by doing so. I suggest, therefore, replacing "the precision should match" with "the precision should match or exceed" or something similar. Also, a comma should follow the introductory clause, "if the interpolation_precision attribute has been set to a numerical value", and the typo in "calculatins" should be corrected.
2. A name like `computational_precision` might describe the attribute better.
3. Rather than a numerical value, the `calculational_precision` attribute should be defined and set to one of the names defined by the IEEE 754 technical standard for floating point arithmetic (e.g., "decimal32", "decimal64", "decimal128"). If the `calculational_precision` attribute has been defined, all interpolation calculations should be executed at the specified precision (or higher). In the example, then, "0D" would be replaced by "decimal64".
Hi @taylor13,
1: I agree that higher precisions should be allowed. A modified description (which could do with some rewording, but the intent is clear for now, I hope): "If the `computational_precision` attribute has been set then the precision should match or exceed the precision specified by the `computational_precision` attribute. <some text about allowed values and their interpretation>"

2: `computational_precision` is indeed better. You mention "calculational_precision" in 3 - was that intentional? That term is also OK for me.
3: A controlled vocabulary is certainly clearer than my original proposal, both in terms of defining the concept and the encoding, and the IEEE standard does indeed provide what we need. I wonder if it might be good to define the (subset of) IEEE terms ourselves in a table (I'm reminded of https://cfconventions.org/Data/cf-conventions/cf-conventions-1.8/cf-conventions.html#table-supported-units) rather than relying on the contents of the external standard, to avoid the potential governance issues we always have when standards outside of CF's influence are brought in. Would the "binary" terms be valid, as well as the "decimal" ones?
Yes, `calculational_precision` was a mistake; I prefer `computational_precision`. Also, I'd be happy with not referring to an external standard and, for now, just suggesting that two values, "decimal32" and "decimal64", are supported, unless someone thinks others are needed at this time.
Thank you @taylor13 for the proposals and @davidhassell for the implementation details.
I fully agree with your points 1, 2 and 3.
There is possibly one situation that might need attention. If the coordinates subject to compression are stored in decimal64, typically we would require the computations to be in decimal64 too, rather than decimal32.
We could deal with that either by:
A. Using the scheme proposed above, requiring the data creator to set the `computational_precision` accordingly.
B. Requiring that the interpolation calculations are never carried out at a lower precision than that of the coordinates subject to compression, even if the `computational_precision` is not set.
Probably A would be the cleanest; what do you think?
Thanks, @taylor13 and @AndersMS,
I, too, would favour A (using the scheme proposed above, requiring the data creator to set the `computational_precision` accordingly).
I'm starting to think that we need to be clear about what "decimal64" (or 32, 128, etc.) would mean. I'm fairly sure that we only want to specify a precision, rather than also insisting/implying that the user should use decimal64 floating-point format numbers in their calculations. The same issue would arise with "binary64", although I suspect that most code would use double precision floating-point by default.
Could the answer be to define our own vocabulary of "16", "32", "64", and "128"?
Or am I overcomplicating things?
I don't understand the difference between decimal64 and binary64 or what they precisely mean. If these terms specify things beyond precision, it's probably not appropriate to use them here, so I would support defining our own vocabulary, which would not confuse precision with anything else.
And I too would favor (or favour) A over B.
Hi @taylor13 and @davidhassell,
I am not fully up to date on the data types, but following the links that David sent, it appears that decimal64 is a base-10 floating-point number representation that is intended for applications where it is necessary to emulate decimal rounding exactly, such as financial and tax computations. I think we can disregard that for now.
binary32 and binary64 are the new official IEEE 754 names for what used to be called single- and double-precision floating-point numbers respectively, and are what most of us are familiar with.
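For anyone who wants to see what the two levels of precision mean in practice, here is a quick numpy check (numpy's float32/float64 correspond to IEEE 754 binary32/binary64):

```python
import numpy as np

# Machine epsilon and approximate significant decimal digits for the
# two binary floating-point formats.
for dtype in (np.float32, np.float64):
    info = np.finfo(dtype)
    print(dtype.__name__, info.eps, info.precision)
# float32 1.1920929e-07          6   (~7 significant decimal digits)
# float64 2.220446049250313e-16  15  (~16 significant decimal digits)
```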
I would suggest that we do not require a specific floating-point arithmetic standard to be used, but rather a level of precision. If we adopt the naming convention proposed by David, it could look like:
By default, the user may use any floating-point arithmetic precision they like for the interpolation calculations. If the `computational_precision` attribute has been set, then the precision should match or exceed the precision specified by the `computational_precision` attribute.
The allowed values of the `computational_precision` attribute are:
(table)
"32": 32-bit base-2 floating-point arithmetic, such as IEEE 754 binary32 or equivalent
"64": 64-bit base-2 floating-point arithmetic, such as IEEE 754 binary64 or equivalent
I think that would achieve what we are after, while leaving the implementers the freedom to use what their programming language and computing platform offer.
What do you think?
Looks good to me. Can we omit "base-2" from the descriptions, or is that essential? Might even reduce the description to, for example:
"32": 32-bit floating-point arithmetic
Leaving out "base-2" is fine. Shortening the description further as you suggest would also be fine with me.
I am wondering if we could change the wording to:
"The floating-point arithmetic precision should match or exceed the precision specified by the `computational_precision` attribute. The allowed values of the `computational_precision` attribute are:
(table)
"32": 32-bit floating-point arithmetic
"64": 64-bit floating-point arithmetic
If the `computational_precision` attribute has not been set, then the default value "32" applies."
That would ensure that we can assume a minimum precision on the user side, which would be important. Practically speaking, high-level languages that support 16-bit floating-point variables typically use 32-bit floating-point arithmetic for them anyway (a consequence of CPU design).
@taylor13 by the way I'm still on the prowl for a moderator for this discussion. As I see you've taken an interest, would you be willing to take on that role? I'd be able to do it as well, but as I've been involved in this proposal for quite some time it would be nice to have a fresh set of eyes on it.
Hi Anders,
"The floating-point arithmetic precision should match or exceed the precision specified by computational_precision attribute. The allowed values of computational_precision attribute are:
(table) "32": 32-bit floating-point arithmetic "64": 64-bit floating-point arithmetic
This is good for me.
If the computational_precision attribute has not been set, then the default value "32" applies."
That would ensure that we can assume a minimum precision on the user side, which would be important. Practically speaking, high-level languages that support 16-bit floating-point variables typically use 32-bit floating-point arithmetic for them anyway (a consequence of CPU design).
I'm not so sure about having a default value. In the absence of guidance from the creator, I'd probably prefer that the user is free to use whatever precision they would like.
Thanks, David
Hi David,
Fine, I take your advice regarding not having a default value. That is probably also simpler - one rule less.
Anders
Hi Anders - thanks, it sounds like we're currently in agreement - do you want to update the PR?
Hi David,
Yes, I would be happy to update the PR. However, I still have one concern regarding the `computational_precision` attribute.
In the introduction to Lossy Compression by Coordinate Sampling in chapter 8, I am planning to change the last sentence from
The creator of the compressed dataset can control the accuracy of the reconstituted coordinates through the degree of subsampling and the choice of interpolation method, see [Appendix J].
to
The creator of the compressed dataset can control the accuracy of the reconstituted coordinates through the degree of subsampling, the choice of interpolation method (see [Appendix J]) and the choice of computational precision (see Section X).
where Section X will be a new short section in chapter 8 describing the `computational_precision` attribute.
Recalling that we also write in the introduction to Lossy Compression by Coordinate Sampling in chapter 8 that
The metadata that define the interpolation formula and its inputs are complete, so that the results of the coordinate reconstitution process are well defined and of a predictable accuracy.
I think it would be more consistent if we make the computational_precision attribute mandatory and not optional. Otherwise the accuracy would not be predictable.
Would that be agreeable?
Hi Anders,
I think it would be more consistent if we make the computational_precision attribute mandatory and not optional. Otherwise the accuracy would not be predictable.
That's certainly agreeable to me, as is your outline of how to change chapter 8.
Thanks, David
Wouldn't the statement be correct as is (perhaps rewritten slightly; see below), if we indicated that if the computational_precision attribute is not specified, a default precision of "32" should be assumed? I would think that almost always the default precision would suffice, so for most data writers, it would be simpler if we didn't require this attribute. (But I don't feel strongly about this.)
Not sure how to word this precisely. Perhaps:
The attributes and default values defined for the interpolation formula and its inputs ensure
that the results of the coordinate reconstitution process are reproducible and of predictable
accuracy.
Hi @taylor13 and @davidhassell,
Regarding the `computational_precision` attribute, it appears that we currently have two proposals: either an optional attribute with a default value, or a mandatory attribute.
I have written two versions of the new Section 8.3.8, one for each of the two proposals. I hope that will help with deciding!
Anders
Optional attribute version:
8.3.8 Computational Precision
The accuracy of the reconstituted coordinates will depend on the degree of subsampling, the choice of interpolation method and the choice of the floating-point arithmetic precision with which the interpolation method is applied.
To ensure that the results of the coordinate reconstitution process are reproducible and of predictable accuracy, the creator of the compressed dataset may specify the floating-point arithmetic precision by setting the interpolation variable's `computational_precision` attribute to one of the following values:
(table)
"32": 32-bit floating-point arithmetic (default)
"64": 64-bit floating-point arithmetic
For the coordinate reconstitution process, the floating-point arithmetic precision should (or shall?) match or exceed the precision specified by the `computational_precision` attribute, or match or exceed 32-bit floating-point arithmetic if the `computational_precision` attribute has not been set.
Mandatory attribute version:
8.3.8 Computational Precision
The accuracy of the reconstituted coordinates will depend on the degree of subsampling, the choice of interpolation method and the choice of the floating-point arithmetic precision with which the interpolation method is applied.
To ensure that the results of the coordinate reconstitution process are reproducible and of predictable accuracy, the creator of the compressed dataset must specify the floating-point arithmetic precision by setting the interpolation variable's `computational_precision` attribute to one of the following values:
(table)
"32": 32-bit floating-point arithmetic
"64": 64-bit floating-point arithmetic
For the coordinate reconstitution process, the floating-point arithmetic precision should (or shall?) match or exceed the precision specified by the `computational_precision` attribute.
I have a preference for "optional" because I suspect in most cases 32-bit will be sufficient, and this would relieve data writers from having to include this attribute. There may be good reasons for making it mandatory; what are they?
Not sure about this, but I think "should" rather than "shall" is better.
Dear all
I've studied the text of proposed changes to Sect 8, as someone not at all involved in writing it or using these kinds of technique. (It's easier to read the files in Daniel's repo than the pull request in order to see the diagrams in place.) I think it all makes sense. It's well-designed and consistent with the rest of CF. Thanks for working it out so thoughtfully and carefully. The diagrams are very good as well.
I have not yet reviewed Appendix J or the conformance document. I'm going to be on leave next week, so I thought I'd contribute just this part before going.
Best wishes
Jonathan
There is one point where I have a suggestion for changing the content of the proposal, although probably you've already discussed this possibility. If I understand correctly, you must always have both the `tie_point_dimensions` and `tie_point_indices` attributes of the interpolation variable, and they must refer to the same tie point dimensions. Therefore I think a simpler design, easier for both the data-writer and data-reader to use, would combine these two attributes into one attribute, whose contents would be "interpolation_dimension: tie_point_interpolation_dimension tie_point_index_variable [interpolation_zone_dimension] [interpolation_dimension: ...]".
Also, I have some suggestions for naming:
If you adopt my suggestion for a single attribute to replace `tie_point_dimensions` and `tie_point_indices`, an obvious name for it would be `tie_points`. You've used that name for the attribute of the data variable. However, I would suggest that the attribute of the data variable could equally well be called `interpolation`, since it names the interpolation variable, and signals that interpolation is to be used.
Your terminology has "tie point interpolation dimension" and "interpolation dimension", but the former is not a special case of the latter. That could be confusing, in the same way that (unfortunately) in CF terminology an auxiliary coordinate variable is not a special kind of coordinate variable. I suggest you rename "tie point interpolation dimension" as e.g. "tie point reduced dimension" to avoid this misunderstanding.
A similar possible confusion is that a tie point index variable is not a special kind of tie point variable. To avoid this confusion and add clarity, I suggest you could rename "tie point variable" as "tie point coordinate variable".
The terms "interpolation zone" and "interpolation area" are unhelpful because it's not obvious from the words which one is bigger, so it's hard to remember. If you stick with "zone" for the small one, for area it would be better to use something which is more obviously much bigger, such as "province" or "realm"! Or perhaps you could use "division" or "department", since the defining characteristic is the discontinuity.
In the first paragraph of Sect 8 we distinguish three methods of reduction of dataset size. I would suggest minor clarifications:
There are three methods for reducing dataset size: packing, lossless compression, and lossy compression. By packing we mean altering the data in a way that reduces its precision (but has no other effect on accuracy). By lossless compression we mean techniques that store the data more efficiently and result in no loss of precision or accuracy. By lossy compression we mean techniques that store the data more efficiently and retain its precision but result in some loss in accuracy.
Then I think we could start a new paragraph with "Lossless compression only works in certain circumstances ...". By the way, isn't it the case that HDF supports per-variable gzipping? That wasn't available in the old netCDF data format for which this section was first written, so it's not mentioned, but perhaps it should be now.
There are a few points where I found the text of Sect 8.3 possibly unclear or difficult to follow:
"This form of compression may also be used on a domain variable with the same effect." I think this is an unclear addition. If I understand you correctly, insead of this final sentence you could begin the paragraph with "For some applications the coordinates of a data variable or a domain variable can require considerably more storage than the data in its domain."
Tie Point Dimensions Attribute. If you adopt my suggestion above, this subsection would change its name to "Tie points attribute". It would be good to begin the section by saying what the attribute is for. As it stands, it plunges straight into details. The second sentence in particular, about interpolation zones, bewildered me - I didn't know what it was talking about.
I follow this sentence: "For instance, interpolation dimension dimension1 could be mapped to two different tie point interpolation dimensions with dimension1: tp_dimension1 dimension1: tp_dimension2." But I don't understand the next sentence: "This is necessary when different tie point variables for a particular interpolation dimension do not contain the same number of tie points, and therefore define different numbers of interpolation zones, as is the case in Multiple interpolation variables with interpolation parameter attributes." The situation described does not occur in the example quoted, I think. I wonder if it should say, "This occurs when data variables that share an interpolation dimension and interpolation variable have different tie points for that dimension."
Instead of "A tie point variable must span at most one of the tie point interpolation dimensions associated with a given interpolation dimension." I would add a sentence to the first para of "Interpolation and non-interpolation dimension", which I would rewrite as follows:
For each interpolation variable identified in the tie_points attribute, all the associated tie point variables must share the same set of one or more dimensions. Each of the dimensions of a tie point variable must be either a dimension of the data variable, or a dimension which is to be interpolated to a dimension of the data variable. A tie point variable must not have more than one dimension corresponding to any given dimension of the data variable, and may have fewer dimensions than the data variable. Dimensions of the tie point variable which are interpolated are called tie point reduced dimensions, and the corresponding data variable dimensions are called interpolation dimensions, while those for which no interpolation is required, being the same in the data variable and the tie point variable, are called non-interpolation dimensions. The size of a tie point reduced dimension must be less than or equal to the size of the corresponding interpolation dimension.
In one place, you say "For each interpolation dimension, the number of interpolation zones is equal to the number of tie points minus the number of interpolation areas," and in another place, "An interpolation zone must span at least two points of each of its corresponding interpolation dimensions." It seems to me that "at least" is wrong - it should be "exactly two".
"The dimensions of an interpolation parameter variable must be a subset of zero or more of the ...".
I suggest a rewriting of the part about the dimensions of interpolation paramater variable, for clarity, if I've understood it correctly, as follows:
Where an interpolation zone dimension is provided, the variable provides a single value along that dimension for each interpolation zone, assumed to be defined at the centre of interpolation zone.
Where a tie point reduced dimension is provided, the variable provides a value for each tie point along that dimension. The value applies to the two interpolation zones on either side of the tie point, and is assumed to be defined at the interpolation zone boundary (figure 3).
In both cases, the implementation of the interpolation method should assume that an interpolation parameter variable applies equally to all interpolation zones along any interpolation dimension which it does not span.
For "The bounds of a tie point must be the same as the bounds of the corresponding target grid cells," I would suggest, "The bounds of a tie point must be the same as the bounds of the target grid cells whose coordinates are specified as the tie point."
I don't understand this sentence: "In this case, though, the tie point index variables are the identifying target domain cells to which the bounds apply, rather than bounds values themselves." A tie point index variable could not possibly contain bounds values.
In Example 8.5, you need only one (or maybe two) data variables since they're all the same in structure.
Dear @JonathanGregory
Thank you very much for your rich and detailed comments and suggestions, very appreciated.
The team behind the proposal met today and discussed all the points you raised. We have prepared or are in the process of preparing replies to each of the points. However, before sharing these here, we would like to update the proposal text accordingly via pull requests, in order to see if the changes have other effects on the overall proposal, which we have not yet identified.
Best regards, Anders
Dear All,
Following a discussion yesterday in the team behind the proposal, we propose the `computational_precision` attribute to be optional. Here is the proposed text, which now has a reference to [IEEE_754]. Feel free to comment.
Anders
8.3.8 Computational Precision
The accuracy of the reconstituted coordinates will depend on the degree of subsampling, the choice of interpolation method and the choice of the floating-point arithmetic precision with which the interpolation method is applied.
To ensure that the results of the coordinate reconstitution process are reproducible and of predictable accuracy, the creator of the compressed dataset may specify the floating-point arithmetic precision by setting the interpolation variable's `computational_precision` attribute to one of the following values:
(table)
"32": 32-bit floating-point arithmetic (default), comparable to the binary32 standard in [IEEE_754]
"64": 64-bit floating-point arithmetic, comparable to the binary64 standard in [IEEE_754]
For the coordinate reconstitution process, the floating-point arithmetic precision should match or exceed the precision specified by the `computational_precision` attribute, or match or exceed 32-bit floating-point arithmetic if the `computational_precision` attribute has not been set.
References
[IEEE_754] "IEEE Standard for Floating-Point Arithmetic," in IEEE Std 754-2019 (Revision of IEEE 754-2008) , vol., no., pp.1-84, 22 July 2019, doi: 10.1109/IEEESTD.2019.8766229.
Thank you, Anders. I'm very happy with this.
A minor suggestion - perhaps change:
"...may specify the floating-point arithmetic precision by setting ..."
to
... may specify the floating-point arithmetic precision to be used in the interpolation calculations by setting ...
just to be extra clear which precision is being specified.
Good idea David.
Should we perhaps use computation instead of calculation to match the attribute name? Here I have updated the first two paragraphs and added an example:
8.3.8 Computational Precision
"The accuracy of the reconstituted coordinates will depend on the degree of subsampling, the choice of interpolation method and the choice of the floating-point arithmetic precision used in the interpolation method computations.
To ensure that the results of the coordinate reconstitution process are reproducible and of predictable accuracy, the creator of the compressed dataset may specify the floating-point arithmetic precision to be used in the interpolation method computations by setting the interpolation variable’s computational_precision
attribute to one of the following values:
(table) "32": 32-bit floating-point arithmetic (default), comparable to the binary32 standard in [IEEE_754] "64": 64-bit floating-point arithmetic, comparable to the binary64 standard in [IEEE_754]
For the coordinate reconstitution process, the floating-point arithmetic precision should match or exceed the precision specified by the `computational_precision` attribute, or match or exceed 32-bit floating-point arithmetic if the `computational_precision` attribute has not been set.
As an example, `computational_precision = "64"` would specify that the floating-point arithmetic precision should match or exceed 64-bit floating-point arithmetic.
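To make the intent concrete, a data user honouring the attribute might do something like the following (a hypothetical reader-side sketch, not part of the proposed convention text):

```python
import numpy as np

# Map the computational_precision attribute value to a numpy dtype
# that matches the requested precision; exceeding it is also allowed.
PRECISION_DTYPES = {"32": np.float32, "64": np.float64}

def working_dtype(computational_precision="32"):
    # "32" is the default when the attribute has not been set
    if computational_precision not in PRECISION_DTYPES:
        raise ValueError("unsupported computational_precision: %r"
                         % (computational_precision,))
    return PRECISION_DTYPES[computational_precision]

# Cast the tie point coordinates up front so that all interpolation
# arithmetic runs at (at least) the requested precision.
tie_points = np.array([0.0, 10.0, 20.0], dtype=np.float32)
tie_points = tie_points.astype(working_dtype("64"))
```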
References
[IEEE_754] IEEE Standard for Floating-Point Arithmetic, in IEEE Std 754-2019 (Revision of IEEE 754-2008) , vol., no., pp.1-84, 22 July 2019, doi: 10.1109/IEEESTD.2019.8766229.
That looks good to me, Anders. The word computation is good.
I agree. This specification of precision is good.
Editorial suggestion: In the statement,
To ensure that the results of the coordinate reconstitution process are reproducible and of
predictable accuracy, the creator of the compressed dataset may specify the floating-point
arithmetic precision to be used in the interpolation method computations by ....
I think we should replace "reproducible and of predictable accuracy" with "reproducible with sufficient accuracy" (or something similar). The accuracy might for some algorithms be improved by using a higher precision than specified by the `computational_precision` attribute, but such higher accuracy might be considered unwarranted for a given dataset. So the accuracy really isn't totally determined by the attribute (i.e., it isn't predictable), because the user is free to perform the calculation at a higher precision.
(Hope this is correct and understandable.)
Hi @taylor13,
Your point is valid. I guess there would be two alternative solutions:
1. Keep the wording "reproducible and of predictable accuracy", with the expectation that the user performs the computations at the precision specified by the data creator.
2. Adopt your suggested wording, "reproducible with sufficient accuracy", leaving the user free to exceed the specified precision.
Personally, I think that ensuring that the results of the coordinate reconstitution process are reproducible and of predictable accuracy is very valuable, and my preference would be option 1.
I believe that if a data creator has judged that `computational_precision = "32"` is sufficient and appropriate for the data product, it would typically also imply that there is only limited scope for real improvements on the user side by going to 64-bit floating-point arithmetic. That would also support option 1.
What do you think?
Anders
Dear @JonathanGregory
We have progressed with preparing the replies to your proposals. Although there are still a couple of open points, we thought it would be useful to share what we already have.
We have numbered your proposals as Proposed Changes 1-16 and treated each of them separately below. For each of the Proposed Changes, you will find a reply to the proposed change as well as the related commit(s).
We are still working on a reply to Proposed Change 15; the other replies are complete.
We are still working on completing the corresponding document updates in the form of commits for Proposed Changes 1, 2, 8, 13 and 14; the other document commits are complete.
We will notify you once all replies and document commits are complete.
Best regards Anders
Proposed Change 1 – Combining the tie_point_dimensions and tie_point_indices attributes
There is one point where I have a suggestion for changing the content of the proposal, although probably you've already discussed this possibility. If I understand correctly, you must always have both the tie_point_dimensions and tie_point_indices attributes of the interpolation variable, and they must refer to the same tie point dimensions. Therefore I think a simpler design, easier for both the data-writer and data-reader to use, would combine these two attributes into one attribute, whose contents would be "interpolation_dimension: tie_point_interpolation_dimension tie_point_index_variable [interpolation_zone_dimension] [interpolation_dimension: ...]".
Reply to Proposed Change 1
We agree with combining the tie_point_dimensions and tie_point_indices attributes in a single attribute as you suggest, but propose to put the tie_point_index_variable before the dimensions:
interpolated_dimension: tie_point_index_variable tie_point_interpolation_dimension [interpolation_subarea_dimension] [interpolated_dimension: ...].
Commit(s) related to Proposed Change 1 e5feea3 9807518
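To illustrate how a data reader might consume the combined attribute, here is a rough Python sketch (hypothetical code, not part of the proposal) that parses the form given above, including the case discussed under Proposed Change 9 where an interpolated dimension is repeated:

```python
def parse_tie_point_mapping(value):
    """Parse 'interpolated_dimension: tie_point_index_variable
    tie_point_interpolation_dimension [interpolation_subarea_dimension] ...'
    into (dimension, index_variable, tp_dimension, subarea_dimension)
    tuples. A list is returned rather than a dict, because an
    interpolated dimension may appear more than once."""
    groups, current = [], None
    for token in value.split():
        if token.endswith(":"):          # a new interpolated dimension
            current = [token[:-1]]
            groups.append(current)
        else:
            current.append(token)
    parsed = []
    for group in groups:
        if len(group) == 3:              # subarea dimension omitted
            parsed.append((group[0], group[1], group[2], None))
        elif len(group) == 4:
            parsed.append(tuple(group))
        else:
            raise ValueError("malformed mapping: " + " ".join(group))
    return parsed

# parse_tie_point_mapping("dim1: tp_index1 tp_dim1 dim2: tp_index2 tp_dim2 i_dim2")
# -> [('dim1', 'tp_index1', 'tp_dim1', None),
#     ('dim2', 'tp_index2', 'tp_dim2', 'i_dim2')]
```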
Proposed Change 2 – Naming combined tie_point_dimensions and tie_point_indices to tie_points and existing tie_points to interpolation
Also, I have some suggestions for naming: • If you adopt my suggestion for a single attribute to replace tie_point_dimensions and tie_point_indices, an obvious name for it would be tie_points. You've used that name for the attribute of the data variable. However, I would suggest that the attribute of the data variable could equally well be called interpolation, since it names the interpolation variable, and signals that interpolation is to be used.
Reply to Proposed Change 2
We propose renaming the `tie_points` attribute of the data variable to `coordinate_interpolation`, as this makes the name more descriptive. We propose to use the name `tie_point_mapping` for the attribute of the interpolation variable resulting from combining the `tie_point_dimensions` and `tie_point_indices` attributes. We favor this over `tie_points`, as the attribute does not contain or reference tie point coordinate variables.
Commit(s) related to Proposed Change 2 ea5268b f8cd983 e5feea3
Proposed Change 3 - Rename term "tie point interpolation dimension" to e.g. "tie point reduced dimension"
• Your terminology has "tie point interpolation dimension" and "interpolation dimension", but the former is not a special case of the latter. That could be confusing, in the same way that (unfortunately) in CF terminology an auxiliary coordinate variable is not a special kind of coordinate variable. I suggest you rename "tie point interpolation dimension" as e.g. "tie point reduced dimension" to avoid this misunderstanding.
Reply to Proposed Change 3
For the coordinate dimensions in the target domain, we suggest replacing the terms “interpolating dimension” and “non-interpolating dimension” with the terms “interpolated dimension” and “non-interpolated dimension”.
Further to this, we propose to change the term "tie point interpolation dimension" to "subsampled dimension".
We maintain the term "interpolation subarea dimension".
Commit(s) related to Proposed Change 3 7ab0c4f db0eb4e 3501992
Proposed Change 4 - Rename term "tie point variable" to "tie point coordinate variable"
• A similar possible confusion is that a tie point index variable is not a special kind of tie point variable. To avoid this confusion and add clarity, I suggest you could rename "tie point variable" as "tie point coordinate variable".
Reply to Proposed Change 4
We agree to rename the term "tie point variable" to "tie point coordinate variable".
Commit(s) related to Proposed Change 4 a6d37b4
Proposed Change 5 – Renaming of term "interpolation area"
• The terms "interpolation zone" and "interpolation area" are unhelpful because it's not obvious from the words which one is bigger, so it's hard to remember. If you stick with "zone" for the small one, for area it would be better to use something which is more obviously much bigger, such as "province" or "realm"! Or perhaps you could use "division" or "department", since the defining characteristic is the discontinuity.
Reply to Proposed Change 5
We propose replacing the terms “interpolation area(s)”, each consisting of one or more “interpolation zone(s)”, with “continuous area(s)”, each consisting of one or more “interpolation subarea(s)”.
Commit(s) related to Proposed Change 5 d6d7ea3
Proposed Change 6 - Rewording of first paragraph of Section 8
In the first paragraph of Sect 8 we distinguish three methods of reduction of datset size. I would suggest minor clarifications: There are three methods for reducing dataset size: packing, lossless compression, and lossy compression. By packing we mean altering the data in a way that reduces its precision (but has no other effect on accuracy). By lossless compression we mean techniques that store the data more efficiently and result in no loss of precision or accuracy. By lossy compression we mean techniques that store the data more efficiently and retain its precision but result in some loss in accuracy.
Then I think we could start a new paragraph with "Lossless compression only works in certain circumstances ...". By the way, isn't it the case that HDF supports per-variable gzipping? That wasn't available in the old netCDF data format for which this section was first written, so it's not mentioned, but perhaps it should be now.
Reply to Proposed Change 6
We agree with your proposed text and have updated the text accordingly. Additionally, we have opened an issue (Rework intro to Section 8: Accuracy & precision · Issue #330 · cf-convention/cf-conventions (github.com)) to address per-variable gzipping as well as to verify that the usage of the terms precision and accuracy is correct.
Commit(s) related to Proposed Change 6 27d1733
Proposed Change 7 – Clarification of effect for domain variables
There are a few points where I found the text of Sect 8.3 possibly unclear or difficult to follow: • "This form of compression may also be used on a domain variable with the same effect." I think this is an unclear addition. If I understand you correctly, instead of this final sentence you could begin the paragraph with "For some applications the coordinates of a data variable or a domain variable can require considerably more storage than the data in its domain."
Reply to Proposed Change 7
We propose to remove the sentence "This form of compression may also be used on a domain variable with the same effect." and not replace it, as the definition of the domain variable already allows for this compression.
Commit(s) related to Proposed Change 7 971bfbe
Proposed Change 8 – Rewording of section on Tie Point Dimensions Attribute
• Tie Point Dimensions Attribute. If you adopt my suggestion above, this subsection would change its name to "Tie points attribute". It would be good to begin the section by saying what the attribute is for. As it stands, it plunges straight into details. The second sentence in particular, about interpolation zones, bewildered me - I didn't know what it was talking about.
Reply to Proposed Change 8
As we have agreed to combine the `tie_point_dimensions` and `tie_point_indices` attributes (Proposed Changes 1 and 2), we must also reorganise the old Section 8.3.5, "Tie Point Dimensions Attribute", and Section 8.3.6, "Tie Point Indices", to reflect this change. When doing this, we will also improve the wording.
Commit(s) related to Proposed Change 8 e5feea3 2becd52
Proposed Change 9
• I follow this sentence: "For instance, interpolation dimension dimension1 could be mapped to two different tie point interpolation dimensions with dimension1: tp_dimension1 dimension1: tp_dimension2." But I don't understand the next sentence: "This is necessary when different tie point variables for a particular interpolation dimension do not contain the same number of tie points, and therefore define different numbers of interpolation zones, as is the case in Multiple interpolation variables with interpolation parameter attributes." The situation described does not occur in the example quoted, I think. I wonder if it should say, "This occurs when data variables that share an interpolation dimension and interpolation variable have different tie points for that dimension."
Reply to Proposed Change 9
We will delete the reference to the example.
We will update the text as follows:
A single interpolated dimension may be associated with multiple tie point interpolation dimensions by repeating the interpolated dimension in the `tie_point_mapping` attribute. For instance, interpolated dimension `dimension1` could be mapped to two different tie point interpolation dimensions with `dimension1: tp_index_variable1 tp_dimension1 dimension1: tp_index_variable2 tp_dimension2`. This is necessary when two or more tie point coordinate variables have different tie point index variables corresponding to the same interpolated dimension. A tie point coordinate variable must span at most one of the tie point interpolation dimensions associated with a given interpolated dimension.
Commit(s) related to Proposed Change 9 3d4348f
Proposed Change 10
• Instead of "A tie point variable must span at most one of the tie point interpolation dimensions associated with a given interpolation dimension." I would add a sentence to the first para of "Interpolation and non-interpolation dimension", which I would rewrite as follows: For each interpolation variable identified in the tie_points attribute, all the associated tie point variables must share the same set of one or more dimensions. Each of the dimensions of a tie point variable must be either a dimension of the data variable, or a dimension which is to be interpolated to a dimension of the data variable. A tie point variable must not have more than one dimension corresponding to any given dimension of the data variable, and may have fewer dimensions than the data variable. Dimensions of the tie point variable which are interpolated are called tie point reduced dimensions, and the corresponding data variable dimensions are called interpolation dimensions, while those for which no interpolation is required, being the same in the data variable and the tie point variable, are called non-interpolation dimensions. The size of a tie point reduced dimension must be less than or equal to the size of the corresponding interpolation dimension.
Reply to Proposed Change 10
We propose the new wording: For each interpolation variable identified in the coordinate_interpolation attribute, all of the associated tie point coordinate variables must share the same set of one or more dimensions. This set of dimensions must correspond to the set of dimensions of the uncompressed coordinate or auxiliary coordinate variables, such that each of these dimensions must be either the uncompressed dimension itself, or a dimension that is to be interpolated to the uncompressed dimension.
Dimensions of the tie point coordinate variable which are to be interpolated are called tie point interpolation dimensions, and the corresponding data variable dimensions are called interpolated dimensions, while those for which no interpolation is required, being the same in the data variable and the tie point coordinate variable, are called non-interpolated dimensions. The dimensions of a tie point coordinate variable must include at least one tie point interpolation dimension, and for each such dimension the corresponding interpolated dimension must not also be included.
Commit(s) related to Proposed Change 10 190fdff 7ab0c4f
Proposed Change 11
• In one place, you say "For each interpolation dimension, the number of interpolation zones is equal to the number of tie points minus the number of interpolation areas," and in another place, "An interpolation zone must span at least two points of each of its corresponding interpolation dimensions." It seems to me that "at least" is wrong - it should be "exactly two".
Reply to Proposed Change 11
Both sentences are true, but we can see that the wording is easily misunderstood.
The “span at least two points” refers to points in the interpolated dimension of the target domain. With Proposed Change 3, the proposed rewording of the second sentence is "An interpolation zone must span at least two points in each of its corresponding interpolated dimensions".
Commit(s) related to Proposed Change 11 d715552 5cfae45
Proposed Change 12
• "The dimensions of an interpolation parameter variable must be a subset of zero or more of the ...".
Reply to Proposed Change 12
We agree.
Commit(s) related to Proposed Change 12 fdeef67
Proposed Change 13
• I suggest a rewriting of the part about the dimensions of interpolation parameter variable, for clarity, if I've understood it correctly, as follows: Where an interpolation zone dimension is provided, the variable provides a single value along that dimension for each interpolation zone, assumed to be defined at the centre of interpolation zone. Where a tie point reduced dimension is provided, the variable provides a value for each tie point along that dimension. The value applies to the two interpolation zones on either side of the tie point, and is assumed to be defined at the interpolation zone boundary (figure 3).
In both cases, the implementation of the interpolation method should assume that an interpolation parameter variable applies equally to all interpolation zones along any interpolation dimension which it does not span.
Reply to Proposed Change 13
We propose changing this existing text:
“The dimensions of an interpolation parameter variable must be a subset of zero or more the tie point variable dimensions, with the possibility of a tie point interpolation dimension being replaced with the corresponding interpolation zone dimension. The interpretation of an interpolation parameter variable depends on which of its dimensions are tie point interpolation dimensions, and which are interpolation zone dimensions:
• If no tie point interpolation dimensions are spanned, then the variable provides values for every interpolation zone. This case is akin to values being defined at the centre of interpolation zones.
• If at least one dimension is a tie point interpolation dimension, then the variable’s values are to be shared by the interpolation zones that are adjacent along each of the specified tie point interpolation dimensions. This case is akin to the values being defined at the interpolation zone boundaries, and therefore equally applicable to the interpolation zones that share that boundary (figure 3).
In both cases, the implementation of the interpolation method should assume that an interpolation parameter variable is broadcast to any interpolation zones that it does not span.”
with this new text:
“The interpolation parameter variable dimensions must include, for all of the interpolation dimensions, either the associated tie point interpolation dimension or the associated interpolation subarea dimension. Additionally, any subset of zero or more of the non-interpolation dimensions of the tie point coordinate variable are permitted as interpolation parameter variable dimensions.
The application of an interpolation parameter variable is independent of its non-interpolation dimensions, but depends on its set of tie point interpolation dimensions and interpolation subarea dimensions:
In figure 3, the fourth example will be deleted, and the broadcast-type application of interpolation parameter variable values is no longer supported, as it was difficult to define accurately.
Commit(s) related to Proposed Change 13 ba4a65e 3501992
Proposed Change 14
• For "The bounds of a tie point must be the same as the bounds of the corresponding target grid cells," I would suggest, "The bounds of a tie point must be the same as the bounds of the target grid cells whose coordinates are specified as the tie point."
Reply to Proposed Change 14
We agree with the proposed new text: "The bounds of a tie point must be the same as the bounds of the target grid cells whose coordinates are specified as the tie point."
Commit(s) related to Proposed Change 14 aceb987
Proposed Change 15
• I don't understand this sentence: "In this case, though, the tie point index variables are the identifying target domain cells to which the bounds apply, rather than bounds values themselves." A tie point index variable could not possibly contain bounds values.
Reply to Proposed Change 15
A completely rewritten section "Interpolation of Cell Boundaries" has been introduced, see f3de508.
Commit(s) related to Proposed Change 15 f3de508
Proposed Change 16
• In Example 8.5, you need only one (or maybe two) data variables since they're all the same in structure.
Reply to Proposed Change 16
In Example 8.5, we propose deleting the data variables `I01_radiance` and `I01_reflectance`, and keeping `I04_radiance` and `I04_brightness_temperature` to demonstrate the reuse of the interpolation variable for data variables with different units.
Commit(s) related to Proposed Change 16 f439ee5
Hi again @JonathanGregory
Just to add that the figures have not yet been updated; I think we will do this when all text changes have been agreed.
Anders
Dear All,
I believe the following paragraph from our chapter 8 is no longer relevant, now that we have moved all the dimension-related attributes from the data variable to the interpolation variable.
The tie point variables `lat` and `lon` spanning dimension `tp_dimension1` and tie point variable `time` spanning dimension `tp_dimension2` must each have their own interpolation variable.
Would you agree?
Anders
The same interpolation variable may be multiply mapped from different sets of tie point coordinate variables. For instance, if tie point variables `lat` and `lon` span dimension `tp_dimension1` and tie point variable `time` spans dimension `tp_dimension2`, and all three are to be interpolated according to interpolation variable `linear`, then the `coordinate_interpolation` attribute could be `lat: lon: linear time: linear`. In this case it is not possible to simultaneously map all three tie point coordinate variables to the linear interpolation variable because they do not all span the same axes.
Hi Anders,
I believe the following paragraph from our chapter 8 is no longer relevant
I do agree.
David
I have removed the paragraph "The same interpolation variable may be multiply mapped ...." as proposed here.
Commit 485d3b8
Dear @AndersMS and colleagues
Thanks very much for taking my comments so seriously and for the modifications and explanations. I agree with all these improvements, with two reservations:
Do you somewhere state that the size of a tie point interpolation dimension must be less than or equal to the size of the corresponding interpolated dimension? I suggested this sentence somewhere but you haven't included it there. Maybe it is somewhere else. It seems obvious but is nonetheless worth stating.
While I appreciate you want to relate things to interpolation, I would urge you to use a different word from "interpolated", because you're depending on a very attentive reader in sentences such as "A single interpolated dimension may be associated with multiple tie point interpolation dimensions." My suggestion of "reduced" is not necessarily a good one, but it is noticeably different from "interpolation". Also, "interpolated" doesn't seem quite right to me. You mean, it's going to be interpolated. It hasn't yet been interpolated, though.
I will study the appendix and conformance document next week sometime.
Best wishes
Jonathan
Dear @JonathanGregory,
Thank you for the feedback.
Yes, we had a sentence saying that the size of a tie point interpolation dimension must be less than or equal to the size of the corresponding interpolated dimension. I actually deleted it, since it is a consequence of other constraints rather than a constraint of its own. But it makes sense to state it and I will re-introduce the sentence.
I like the term interpolated dimension that we proposed - it has made it easier for me to read and write the text. And I am hesitant to introduce an additional term like "reduced" in the text. Note that it is "tie point interpolation dimension" versus "interpolated dimension", so the difference is more than just the difference between interpolation and interpolated. The way I memorize it is that a tie point interpolation dimension is available for interpolation, whereas the corresponding dimension in the target domain is, once it exists, an interpolated dimension. Non-interpolated dimensions are the same in both the tie point domain and the target domain and are never interpolated.
I will discuss the last point with the rest of the group when we meet tomorrow.
Best regards, Anders
Dear @AndersMS
In your proposed change 10, you used the word "uncompressed", and "compression" is in the title of this proposal. I think it would be clearer to speak of a "compressed dimension" of the tie point variable corresponding to an "uncompressed dimension" of the data variable, or perhaps an "expanded dimension", with the other dimensions being non-interpolation/non-interpolated.
Best wishes
Jonathan
Dear @JonathanGregory
That's an interesting suggestion, thank you. We will discuss it in the group tomorrow.
Best regards, Anders
Dear @JonathanGregory et al. (@AndersMS @davidhassell @oceandatalab @ajelenak)
Concerning terminology, following discussion in the group, these terms seem good candidates:
At tie point level: "subsampled dimension", "non-interpolated dimension"
At reconstituted level: "interpolated dimension", "non-interpolated dimension"
("non-interpolated dimension" is repeated because it is shared across the two domains.)
Would this be an improvement in your view?
Regarding `computational_precision`: this attribute should be mandatory, for data producers to specify the precision in which interpolation should take place. This means that there should be no default value; the creator specifies it. It is up to the user to interpret that, and using a precision that deviates from the recommendation would not prompt an arrest by the CF police, but would mean that they might have deviations in the interpolated data. This should be clearly described so that the reader of the Conventions understands the potential impacts (no need to wax eloquent here, though).
Discussion on bounds will be on the agenda.
Dear @AndersMS, Daniel (@erget), et al.,
Concerning terminology, following discussion in the group, these terms seem good candidates:
At tie point level: "subsampled dimension", "non-interpolated dimension"
At reconstituted level: "interpolated dimension", "non-interpolated dimension"
Yes, that terminology seems clear and self-explanatory to me. Thanks for your patience and carefulness.
Actionees!
I take on myself an action to review the rest of the proposal (Appendix and conformance document) this week.
Best wishes
Jonathan
Dear All,
Considering that we have now renamed the term tie point interpolation dimension to subsampled dimension, should we possibly change the title
Lossy Compression by Coordinate Sampling
to
Lossy Compression by Coordinate Subsampling
and replace the occurrences of sample/sampled in the text with subsample/subsampled?
Anders
Hi,
Here is a new take on the computational precision paragraph:
The accuracy of the reconstituted coordinates will depend on the degree of subsampling, the choice of interpolation method and the choice of the floating-point arithmetic precision used in the interpolation method computations.
Implementation details of the interpolation methods and hardware can also have an impact on the accuracy of the reconstituted coordinates.
The creator of the compressed dataset must check that the coordinates reconstituted using the interpolation parameters specified in the file have sufficient accuracy compared to the coordinates at full resolution.
Although it may depend on the software and hardware used by the creator, the floating-point arithmetic precision used during this validation step must be specified in the `computational_precision` attribute of the interpolation variable, as an indication of potential floating-point precision issues during the interpolation computations.
The `computational_precision` attribute is mandatory and accepts the following values:
(table)
"32": 32-bit floating-point arithmetic, comparable to the binary32 standard in [IEEE_754]
"64": 64-bit floating-point arithmetic, comparable to the binary64 standard in [IEEE_754]
For the coordinate reconstitution process, using a floating-point arithmetic precision matching or exceeding the precision specified by `computational_precision` is likely to produce results with an accuracy similar to what the creator obtained during the validation of the dataset, but this cannot be guaranteed due to the software/hardware factors mentioned above.
As an example, `computational_precision = "64"` would specify that, using the same software and hardware as the creator of the compressed dataset, sufficient accuracy could not be reached when using a floating-point precision lower than 64-bit floating-point arithmetic in the interpolation computations required to reconstitute the coordinates.
References
[IEEE_754] IEEE Standard for Floating-Point Arithmetic, in IEEE Std 754-2019 (Revision of IEEE 754-2008) , vol., no., pp.1-84, 22 July 2019, doi: 10.1109/IEEESTD.2019.8766229.
The accuracy of the interpolation methods depends not only on the choices made by the data producer (tie point density, area subdivisions, interpolation method parameters, etc.) but also on the software (programming language, libraries) and on the hardware (CPU/FPU) used by the data consumers.
The data producers only know about their own software and hardware, so the computational_precision attribute can only mean that the data producer used this floating point precision when they validated these data using their implementation of the interpolation method, not that using this floating point precision on any software/hardware combination will produce exactly the same results.
I think the computational_precision attribute can only be considered as a hint provided by the data producer regarding numerical issues they encountered when trying to reconstruct the target variables at their full resolution with their implementation of the interpolation method: if the computational_precision exceeds the precision of the data type (e.g. a "64" computational_precision used when interpolating a float variable), then users know that the data producer did not obtain satisfactory results when using a lower precision, hence they should be wary of underflow/overflow errors when they interpolate these data. So computational_precision is more of an informational hint than a compulsory instruction given to the users (unless @erget's CF police becomes a reality), and it is not a reproducibility guarantee either.
Yet it is still a useful piece of information and no one except the data producer can provide it since you need access to the original data at their native resolution to make actual checks on the accuracy of the interpolation method. As the information cannot be derived from the content of the file it makes sense to require that data producers include this attribute systematically: the computational_precision should be mandatory.
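As an illustration of this "hint" reading (again, not from the proposal: the helper below is hypothetical), a reader could compare the attribute against the storage precision of the tie point variables and warn when the creator needed wider arithmetic than the stored data type carries:

import numpy as np

def precision_hint(tie_point_dtype, computational_precision):
    # Return a warning when the creator validated with wider arithmetic
    # than the tie points are stored with, else None.
    stored_bits = np.dtype(tie_point_dtype).itemsize * 8
    if int(computational_precision) > stored_bits:
        return ("creator validated with %s-bit arithmetic but tie points are "
                "stored as %d-bit floats; be wary of rounding problems"
                % (computational_precision, stored_bits))
    return None

# float32 tie points validated with 64-bit arithmetic => warn the user
print(precision_hint(np.float32, "64"))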
Sylvain
@AndersMS: yes I think replacing "sample/sampled" with "subsample/subsampled" would make the text more consistent.
Dear Sylvain (@oceandatalab)
Thank you very much for your proposed wording of the Computational Precision text, which I think is a sound way to formulate the meaning and usage of the computational_precision attribute.
I like the detailed rationale you have provided and support making the computational_precision attribute of the interpolation variable mandatory.
Possibly we could shorten the text slightly and still convey the message? Would the following possibly do the job?
8.3.8 Computational Precision
The accuracy of the reconstituted coordinates will mainly depend on the degree of subsampling, the choice of interpolation method and the choice of the floating-point arithmetic precision used in the interpolation method computations.
The accuracy of the reconstituted coordinates may also depend on details of the interpolation method implementation and on the computer platform, meaning that the results of the coordinate reconstitution process may not be fully reproducible.
However, to enable the data user to reconstitute the coordinates to an accuracy comparable to the accuracy intended by the data creator, the data creator shall specify the floating-point arithmetic precision used during the preparation and validation of the compressed coordinates by setting the interpolation variable’s computational_precision attribute to one of the following values:
(table) "32": 32-bit floating-point arithmetic, comparable to the binary32 standard in [IEEE_754] "64": 64-bit floating-point arithmetic, comparable to the binary64 standard in [IEEE_754]
As an example, a computational_precision = "64" would provide the guidance to the data user that using 64-bit floating-point arithmetic will reconstitute the coordinates with an accuracy comparable to the accuracy intended by the data creator.
@oceandatalab (Sylvain) & @AndersMS - I am in favour of the shorter text; in fact, perhaps one could combine these 3 paragraphs into 1:
The accuracy of the reconstituted coordinates will mainly depend on the degree of subsampling, the choice of interpolation method and the choice of the floating-point arithmetic precision used in the interpolation method computations.
The accuracy of the reconstituted coordinates may also depend on details of the interpolation method implementation and on the computer platform, meaning that the results of the coordinate reconstitution process may not be fully reproducible.
However, to enable the data user to reconstitute the coordinates to an accuracy comparable to the accuracy intended by the data creator, the data creator shall specify the floating-point arithmetic precision used during the preparation and validation of the compressed coordinates by setting the interpolation variable’s computational_precision attribute to one of the following values:
Please take that as a suggestion from my side - if you feel the 3 paragraphs are better, I wouldn't feel strongly enough to call the CF police.
Thank you for the comments @AndersMS and @erget.
I like the concise version too; I would just keep my version of the "As an example ..." paragraph even if it is more verbose, because it states exactly what the attribute means, hopefully leaving no room for misinterpretation. The "[...] using 64-bit floating-point arithmetic will reconstitute [...]" in the shorter version is misleading from my point of view because it glosses over the software/hardware factor (though I agree it will not be an issue in most cases).
As for regrouping the 3 paragraphs into one, I think we should keep them separated so that the content of paragraph 2 stands out: it is really important to state that exact reproducibility is not what is offered here so that users don't have unrealistic expectations.
Hi all,
Sylvain's descriptions and rationale are very good, I think. I am wondering, however, if we are making overly bold claims about accuracy when we have no control over the interpolation method's implementation. A user's technique may differ from the creator's (that's OK), but if one technique was numerically ill-conditioned and the other not, even using the same precision could lead to inaccurate results.
With that in mind, here's another suggestion (I think I prefer the one-paragraph approach, as it helps connect the constituent points, but I don't have a strong opinion on that):
The accuracy of the reconstituted coordinates depends mainly on the degree of subsampling and the choice of interpolation method, both of which are set by the creator of the dataset. The accuracy will also depend, however, on how the interpolation method is implemented and on the computer platform carrying out the computations. There are no restrictions on the choice of interpolation method implementation for either the data creator or the data user, but the floating-point arithmetic precision used by the data creator during the preparation and validation of the compressed coordinates must be specified by setting the interpolation variable’s computational_precision attribute to one of the following values:
(table) "32": 32-bit floating-point arithmetic, comparable to the binary32 standard in [IEEE_754] "64": 64-bit floating-point arithmetic, comparable to the binary64 standard in [IEEE_754]
Using the given computational precision in the interpolation computations is a necessary, but not sufficient, condition for the data user to be able to reconstitute the coordinates to an accuracy comparable to that intended by the data creator. For instance, a computational_precision value of "64" would specify that, using the same software and hardware as the creator of the compressed dataset, sufficient accuracy could not be reached when using a floating-point precision lower than 64-bit floating-point arithmetic in the interpolation computations required to reconstitute the coordinates.
Dear All,
As proposed above, I will go ahead and replace all occurrences and forms of sampled with subsampled in the present PR #326, including the headings of chapter 8.3, Appendix J and chapter 8.3 of the conformance document, unless I receive reservations against the proposal by tomorrow end of business.
The new title of Chapter 8.3 would then become Lossy Compression by Coordinate Subsampling.
Best regards, Anders
Hi @davidhassell,
I am in favor of your version of the "computational precision" paragraph: it conveys all the required information while remaining concise, and yet clearly warns users about the limited scope of the computational_precision attribute.
Title
Lossy Compression by Coordinate Sampling
Moderator
@JonathanGregory
Moderator Status Review [last updated: YYYY-MM-DD]
Brief comment on current status, update periodically
Requirement Summary
The spatiotemporal, spectral, and thematic resolutions of Earth science data are increasing rapidly. This presents a challenge for all types of Earth science data, whether derived from models, in-situ measurements, or remote sensing observations.
In particular, when coordinate information varies with time, the domain definition can be many times larger than the (potentially already very large) data which it describes. This is often the case for remote sensing products, such as swath measurements from a polar orbiting satellite (e.g. slide 4 in https://cfconventions.org/Meetings/2020-workshop/Subsampled-coordinates-in-CF-netCDF.pdf).
Such datasets are often prohibitively expensive to store, and so some form of compression is required. However, native compression, such as is available in the HDF5 library, does not generally provide enough of a saving, due to the nature of the values being compressed (e.g. few missing or repeated values).
An alternative form of compression-by-convention amounts to storing only a small subsample of the coordinate values, alongside an interpolation algorithm that describes how the subsample can be used to regenerate the original, full-resolution set of coordinates. This form of compression has been shown to outperform native compression by "orders of magnitude" (e.g. slide 6 in https://cfconventions.org/Meetings/2020-workshop/Subsampled-coordinates-in-CF-netCDF.pdf).
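For intuition only, here is a toy one-dimensional sketch of the idea (keep every 16th value, reconstitute by linear interpolation); it does not follow the encoding or the standardized interpolation methods defined in PR #326, and all values are made up.

import numpy as np

# A smoothly varying 1-d coordinate at full resolution (made-up values).
n = 1441
full = np.linspace(0.0, 50.0, n) + 0.05 * np.sin(np.linspace(0.0, 20.0, n))

# "Compress" by keeping only every 16th value: these are the tie points,
# and are all that would be stored in the file.
stride = 16
tie_indices = np.arange(0, n, stride)
tie_points = full[tie_indices]

# Reconstitute the full-resolution coordinate by interpolation.
reconstituted = np.interp(np.arange(n), tie_indices, tie_points)

# The compression is lossy: the error grows with the stride and with the
# curvature of the coordinate, so the data creator picks the degree of
# subsampling that keeps it within acceptable limits.
print("max abs error:", np.abs(reconstituted - full).max())

In this toy case, 91 tie points stand in for 1441 values, roughly a 16-fold saving; in two dimensions the saving compounds to roughly 256-fold.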
Various implementations following this broad methodology are currently in use (see https://github.com/cf-convention/discuss/issues/37#issuecomment-608459133 for examples); however, the steps that are required to reconstitute the full resolution coordinates are not necessarily well defined within a dataset.
This proposal offers a standardized approach covering the complete end-to-end process, including a detailed description of the required steps. At the same time it is a framework where new methods can be added or existing methods can be extended.
Unlike compression by gathering, this form of compression is lossy, due to rounding and approximation errors in the required interpolation calculations. However, the loss in accuracy is a function of the degree to which the coordinates are subsampled and of the choice of interpolation algorithm (of which there are configurable standardized and non-standardized options), and so may be determined by the data creator to be within acceptable limits. For example, in one application with cell sizes of approximately 750 metres by 750 metres, interpolation of a stored subsample comprising every 16th value in each dimension was able to recreate the original coordinate values to a mean accuracy of ~1 metre. (Details of this test are available.)
Whilst remote sensing applications are the motivating concern for this proposal, the approach presented has been designed to be fully general, and so can be applied to structured coordinates describing any domain, such as one describing model outputs.
Technical Proposal Summary
See PR #326 for details. In summary:
The approach and encoding are fully described in the new section 8.3 "Lossy Compression by Coordinate Sampling" of Chapter 8: Reduction of Dataset Size.
A new appendix J describes the standardized interpolation algorithms, and includes guidance for data creators.
Appendix A has been updated for a new data and domain variable attribute.
The conformance document has new checks for all of the new content.
The new "interpolation variable" has been included in the Terminology in Chapter 1.
The list of examples in toc-extra.adoc has been updated for the new examples in section 8.3.
Benefits
Anyone who has prohibitively large domain descriptions, and for whom absolute accuracy of cell locations is not an issue, may benefit.
Status Quo
The storage of large, structured domain descriptions is either prohibitively expensive or handled in non-standardized ways.
Associated pull request
PR #326
Detailed Proposal
PR #326
Authors
This proposal has been put together by (in alphabetical order):
Aleksandar Jelenak, Anders Meier Soerensen, Daniel Lee, David Hassell, Lucile Gaultier, Sylvain Herlédan, Thomas Lavergne