NCAS-CMS / PyActiveStorage

Python implementation of Active Storage

Accounting for missing values in active storage operations. #18

Open davidhassell opened 2 years ago

davidhassell commented 2 years ago

No answers yet, just a statement of need.

Missing values need to be accounted for during active operations. For instance, a land-surface temperature minimum needs to ignore a missing_value of -1e20 over the oceans. Therefore the missing values (of which there can be 0 to many) need to be passed to the active storage, similarly to how the data type needs to be passed.

Things get complicated because there are many different ways of specifying missing values (https://docs.unidata.ucar.edu/nug/current/attribute_conventions.html), some of which are not simple numbers:

- _FillValue: a scalar fill value
- missing_value: a scalar, or a vector of one or more missing values
- valid_min and valid_max: values outside these bounds are treated as missing
- valid_range: a two-element vector, equivalent to setting both valid_min and valid_max

All of these methods are used in the wild.

The fixed missing values are typically floats which need to match exactly with values in the data, so a string decimal representation created by the client might not convert back to the exact binary representation on the storage. Does DAP deal with this, I wonder?
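For reference, a minimal sketch of pulling those NUG attributes off a variable with netCDF4-python (the file and variable names here are made up, purely for illustration):

```python
import netCDF4

# Hypothetical file and variable names, for illustration only.
ds = netCDF4.Dataset("surface_temperature.nc")
var = ds.variables["tas"]

# Collect whichever of the NUG missing-data attributes are present on the variable.
nug_attrs = ("_FillValue", "missing_value", "valid_min", "valid_max", "valid_range")
missing_spec = {name: var.getncattr(name)
                for name in nug_attrs if name in var.ncattrs()}
print(missing_spec)  # e.g. {'_FillValue': -1e+20, 'missing_value': -1e+20}
```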

bnlawrence commented 2 years ago

Oh bugger. I had forgotten about all the edge cases ...

I don't think DAP has to handle this, does it? Insofar as the DAP server has the netCDF file itself and the netCDF semantics available server-side ... the active storage will not.

bnlawrence commented 2 years ago

This could be an 80/20 situation:

bnlawrence commented 2 years ago

We should at least force "normal" operations for now, if any of these are present in metadata.

davidhassell commented 2 years ago

Sounds like a good way forward. CMIP6 metadata mandates that you should use both _FillValue and missing_value and that they both should have the same value. This is of course not necessarily general practice elsewhere, but for model data I would have thought it is (almost) always the case.

Looking further ahead, providing a single number to the storage is probably no harder than providing "a few" but, as you say, no need to worry about that at this moment.

bnlawrence commented 2 years ago

I suggest we make a few dummy files by extending dummy_data.py to explore the range of these possible missing value options, and that we introduce some code to detect them all ... and reject them for active storage processing for now. When we have that, we can start unpicking them one-by-one, starting with the CMIP6 use case.
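Something along these lines, perhaps (names are illustrative only, not the eventual PyActiveStorage API; active_reduce stands in for whatever the active-storage code path ends up being):

```python
import numpy as np

MISSING_ATTRS = ("_FillValue", "missing_value", "valid_min", "valid_max", "valid_range")

def has_missing_metadata(var):
    """True if the netCDF variable declares any of the NUG missing-data attributes."""
    return any(name in var.ncattrs() for name in MISSING_ATTRS)

def reduce_variable(var, method=np.ma.min):
    """Fall back to a plain client-side read when missing data may be present."""
    if has_missing_metadata(var):
        # Reject for active storage for now: read the data and reduce locally.
        return method(var[:])
    return active_reduce(var, method)  # hypothetical active-storage code path
```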

valeriupredoi commented 2 years ago

Very good question, David! I think the missing data value (whether it be _FillValue or missing_value) should be extracted from the file's metadata, as Bryan says, so we should check for either; if they are both present but the actual float values differ, then we choose 1.e+20 :grin:

davidhassell commented 2 years ago

if they are both present but the actual float value differs then we choose 1.e+20

In this case, we process on the client, surely, as netCDF4-python deals with all cases.
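(For context, a minimal sketch of that client-side path, with a made-up file name: netCDF4-python's default auto-masking already turns all of these conventions into a mask on read.)

```python
import numpy as np
import netCDF4

ds = netCDF4.Dataset("surface_temperature.nc")  # hypothetical file name
tas = ds.variables["tas"]

# With auto-masking on (the default), _FillValue, missing_value and valid_*
# are all converted into a mask on read, so the reduction just works.
data = tas[:]             # numpy.ma.MaskedArray
result = np.ma.min(data)  # missing points are ignored
```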

bnlawrence commented 2 years ago

Yes, we need to process on the client in all cases where the server can't handle it directly ...

bnlawrence commented 2 years ago

What about error handling? What if a chunk is all missing? I think the right answer would be to return a missing value, and that has to be handled above.

davidhassell commented 2 years ago

What if a chunk is all missing? I think the right answer would be to return a missing value, and that has to be handled above.

(Edited - sent prematurely)

I think that makes sense, as that also handles the case that all chunks are missing, for which the reduced answer is the mdi. That implies that the methods (like np.sum) should be their masked counterparts (like np.ma.sum).
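A quick illustration of that behaviour with numpy's masked routines (the values here are made up):

```python
import numpy as np

fill = -1e20
chunk = np.ma.masked_equal(np.array([[271.3, fill], [fill, fill]]), fill)
print(np.ma.min(chunk))        # 271.3 -- the fill values are ignored

all_missing = np.ma.masked_equal(np.full((2, 2), fill), fill)
print(np.ma.min(all_missing))  # masked -- i.e. the reduced answer is itself the mdi
```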

bnlawrence commented 2 years ago

After today's conversation, we decided that a reasonable option to avoid a potentially infinite-length vector of "missing values" would be to support up to four missing-data indicators: valid_min, valid_max, missing_value, and _FillValue. If there were a vector of missing values in play, we'd simply default to "non-computational" storage.
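A rough sketch of how a mask could be built from just those four scalars (the function name and signature are placeholders, not agreed API):

```python
import numpy as np

def mask_from_attrs(chunk, missing_value=None, fill_value=None,
                    valid_min=None, valid_max=None):
    """Combine up to four scalar missing-data indicators into one boolean mask."""
    mask = np.zeros(chunk.shape, dtype=bool)
    if missing_value is not None:
        mask |= (chunk == missing_value)
    if fill_value is not None:
        mask |= (chunk == fill_value)
    if valid_min is not None:
        mask |= (chunk < valid_min)
    if valid_max is not None:
        mask |= (chunk > valid_max)
    return mask
```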

bnlawrence commented 2 years ago

@valeriupredoi Can you please look and see if we have access to those missing value attributes in the zarr dataset object itself? (i.e. will it be easy for us to pass them to _decode_chunk?)

valeriupredoi commented 2 years ago

They are inside the bellows - see e.g. here - but accessing and manipulating them from the API is a different dish of curry. I will investigate in more detail next week, ESMValX-releases permitting :+1:
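For what it's worth, a zarr array does expose an .attrs mapping and a .fill_value, so a first poke might look like the untested sketch below (the store path is hypothetical, and whether the netCDF attributes actually survive into our zarr view depends on how PyActiveStorage builds it):

```python
import zarr

z = zarr.open_array("example.zarr", mode="r")  # hypothetical store
print(z.fill_value)   # the zarr-level fill value for unwritten chunks
print(dict(z.attrs))  # user attributes -- missing_value / valid_* would appear
                      # here only if they were written into the zarr metadata
```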

bnlawrence commented 2 years ago

Argh, the interpretation of _FillValue is not as straightforward as you might think. See this issue, although I think the netCDF user guide has since been updated (and netcdf4-python no longer does that), so I don't think we want to replicate the use of _FillValue as a max or min ... but recording this here so that we put something in the code and anyone falling over this in the future will be aware.