Should fragments be allowed different data types?

davidhassell commented 3 years ago

Should fragments be allowed to have different data types to the aggregation variable?

This question was raised by @bnlawrence over at PR #14 (https://github.com/davidhassell/cfa-conventions/pull/14#issuecomment-835259466, https://github.com/davidhassell/cfa-conventions/pull/14#issuecomment-835403860), but perhaps warrants it's own issue. The discussion thus far is as follows:

The conventions currently say:

A fragment may have a different data type to that of the aggregated variable. The fragment's data must be cast to the aggregated variable's data type [by the application program that is managing the aggregation].

A conversation ensued:

@bnlawrence: Are you saying that we can have 32 bit reals in the fragments and a 64 bit real aggregation variable? I would have thought we don't want that (it goes beyond the conceptual framework I had in mind, which doesn't mean it's wrong of course).

@davidhassell: Yes, I am generally allowing (e.g.) 32 bit reals in the fragments and a 64 bit real aggregation variable. This ended up in the spec simply because it's one of those things that cf-python has been doing for years "because it can", rather than because it has been requested or is necessary.

I can see why this feature could be undesirable, as an individual number in the aggregated data might be different to the corresponding number in a fragment. Also, by disallowing it we could remove mentions of the special case of CF packing in the compression section.

However, from the analysis viewpoint, you may well not have control over the fragments, so, e.g., saving an aggregation at a different precision to the fragments would not be possible. Also from an analysis stance, may be you want to aggregate fragments from different sources (models?) that happened to be stored in different precisions. In memory you would do exactly this - i.e. cast to a common precision. I'd be very interested to hear what you think of these use cases. Are they real enough ... ?

davidhassell commented 3 years ago

Example 7 in https://github.com/davidhassell/cfa-conventions/blob/6baae68d1f54408a443e50046300c12de346f5db/source/cfa.md is of note, here. In this packed aggregation variable case, we need the fragments to have the same data type as aggregation variable. I reckon. See the CF rules on packing.

nmassey001 commented 3 years ago

While I'd normally think "keep things as simple as possible", and disallow this, I can see the value in doing it for the use case of aggregating from different models / instrument sources, with different precisions. Would we define "compatible types"? e.g. fragments can be a mix of type float16, float32 or float64, but not a mix of types float32, int32, complex32?

davidhassell commented 3 years ago

This was discussed off-line by @nmassey001, @sadielbartholomew and @davidhassell

Guaranteeing that the numbers in a fragment are transferred unchanged to the aggregated data is robust and well defined.

We thought of a variety of cases where allowing the user to do data type casting could scientifically change the information contained in the fragment's data (as opposed to just adding some "noise"):

The fragment's data has undergone a precision-preserving bit grooming (or similar)
The fragment's data contains values outside of the dynamic range of the aggregation variable's data type
The aggregation variable defines linear packing by netCDF add_offset and scale_factor attributes
probably some other I have forgotten about!

The use case of an aggregation of multiple models with models with different data types (https://github.com/davidhassell/cfa-conventions/issues/17#issue-890941878) was considered to not be sufficiently compelling because this does not line up with long term archival practices, it is unlikely that the models are otherwise aggregatable (although pre-processing may have put them on the same grid, e.g. CMIP6), and such an aggregation is easily done by the user during analysis (well it is is if you use cf-python, at least!).

In summary, we think that @bnlawrence is right and we should insist that fragment data types are the same as the aggregation variable data type.

Have I missed anything?

PR to follow.

NCAS-CMS / cfa-conventions

Should fragments be allowed different data types? #17