davidhassell opened this issue 2 years ago
Oh bugger. I had forgotten about all the edge cases ...
I don't think DAP has to handle this, does it? DAP servers have the netCDF file itself and the netCDF semantics available server-side ... the active storage will not.
This could be an 80/20 situation:

- If we handle `_FillValue` and `missing_value` (scalar), do we think there are many cases where both might be present? A `missing_value` is likely to occur in the presence of a `_FillValue`, so we are always dealing with a vector of possible values to treat as missing.
- If we handle min and max, do we think there are many cases where `valid_range` might be present?
In the situation where we can't handle it, we default to normal storage operations of course ...
We should at least force "normal" operations for now, if any of these are present in metadata.
Sounds like a good way forward. CMIP6 metadata mandates that you should use both `_FillValue` and `missing_value`, and that they both should have the same value. This is of course not necessarily general practice elsewhere, but for model data I would have thought it is (almost) always the case.
Looking further ahead, providing a single number to the storage is probably no harder than providing "a few" but, as you say, no need to worry about that at this moment.
I suggest we make a few dummy files by extending `dummy_data.py` to explore the range of these possible missing value options, and that we introduce some code to detect them all ... and reject them for active storage processing for now. When we have that, we can start unpicking them one-by-one, starting with the CMIP6 use case.
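The detect-and-reject step could be sketched as follows. This is a minimal illustration, not project code: `var_attrs` is assumed to be a plain dict of a variable's netCDF attributes, and the function names are hypothetical. The attribute names are the ones from the netCDF attribute conventions.

```python
# Sketch: detect any missing-value metadata and fall back to normal
# (non-active) storage operations until each convention is supported.

MISSING_VALUE_ATTRS = (
    "_FillValue",
    "missing_value",
    "valid_min",
    "valid_max",
    "valid_range",
)


def has_missing_metadata(var_attrs):
    """Return True if any missing-value convention attribute is present."""
    return any(name in var_attrs for name in MISSING_VALUE_ATTRS)


def choose_storage(var_attrs):
    """Reject active storage for now whenever missing values may occur."""
    return "normal" if has_missing_metadata(var_attrs) else "active"
```

Unpicking the conventions one-by-one would then amount to removing entries from `MISSING_VALUE_ATTRS` as each becomes supported.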
Very good question, David! I think the missing data value (whether it be `_FillValue` or `missing_value`) should be extracted from the file's metadata as Bryan says, so we should check for either; if they are both present but the actual float value differs then we choose 1.e+20 :grin:
> if they are both present but the actual float value differs then we choose 1.e+20
In this case, we process on the client, surely, as netCDF4-python deals with all cases.
Yes, we need to process on the client in all cases where the server can't handle it directly ...
What about error handling? What if a chunk is all missing? I think the right answer would be to return a missing value, and that has to be handled above.
> What if a chunk is all missing? I think the right answer would be to return a missing value, and that has to be handled above.
(Edited - sent prematurely)
I think that makes sense, as that also handles the case that all chunks are missing, for which the reduced answer is the mdi. That implies that the methods (like `np.sum`) should be their masked counterparts (like `np.ma.sum`).
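The all-missing chunk behaviour falls out of numpy's masked reductions directly, which a small example can show:

```python
import numpy as np

# An all-missing chunk: reducing it with the masked counterpart of the
# method (np.ma.sum rather than np.sum) yields the `masked` constant,
# which the layer above can translate back into the mdi.
chunk = np.ma.masked_all((4,), dtype="f8")
result = np.ma.sum(chunk)
print(result is np.ma.masked)  # True: the reduced answer is "missing"

# A partially masked chunk is reduced over the unmasked values only.
data = np.ma.masked_equal(np.array([1.0, 1e20, 3.0]), 1e20)
print(np.ma.sum(data))  # 4.0
```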
After today's conversation, we decided a reasonable option to avoid a potentially infinite-length vector of "missing values" would be to support up to four numbers of missing information: `valid_min`, `valid_max`, `missing_value`, and `_FillValue`. If there were a vector of missing numbers in play, we'd simply default to "non-computational" storage.
@valeriupredoi Can you please look and see if we have access to those missing value attributes in the zarr dataset object itself? (i.e. will it be easy for us to pass them to `_decode_chunk`?)
they are inside the bellows - see e.g. here, but accessing and manipulating them from the API is a different dish of curry. I will investigate in more detail next week, ESMValX-releases permitting :+1:
Argh, the interpretation of `_FillValue` is not as straightforward as you might think. See this issue, although I think the netCDF user guide has since been updated (and netCDF4-python no longer does that), so I don't think we want to replicate the use of `_FillValue` as a max or min ... but I'm recording this here so we put something in the code, and anyone falling over this in the future will be aware.
No answers yet, just a statement of need.
Missing values need to be accounted for during active operations. For instance, a land-surface temperature minimum needs to ignore a `missing_value` of -1e20 over the oceans. Therefore the missing values (of which there can be 0 to many) need to be passed to the active storage, similarly to how the data type needs to be passed.

Things get complicated because there are many different ways of specifying missing values (https://docs.unidata.ucar.edu/nug/current/attribute_conventions.html), some of which are not simple numbers:

- `_FillValue`
- `missing_value` (which may be a scalar or vector)
- the `valid_min` number, or the first of the `valid_range` numbers
- the `valid_max` number, or the second of the `valid_range` numbers

All of these methods are used in the wild.
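Combining these conventions into a single mask could look roughly like this. A minimal sketch, assuming `attrs` is a plain dict standing in for a variable's attributes, and following the valid_min/valid_max/valid_range semantics from the conventions document linked above:

```python
import numpy as np

def missing_mask(data, attrs):
    """Boolean mask of elements to treat as missing, per NUG conventions."""
    mask = np.zeros(data.shape, dtype=bool)
    if "_FillValue" in attrs:
        mask |= data == attrs["_FillValue"]
    # missing_value may be a scalar or a vector of values.
    for mv in np.atleast_1d(attrs.get("missing_value", [])):
        mask |= data == mv
    # valid_range supplies both bounds; otherwise use valid_min/valid_max.
    vmin = attrs.get("valid_min")
    vmax = attrs.get("valid_max")
    if "valid_range" in attrs:
        vmin, vmax = attrs["valid_range"]
    if vmin is not None:
        mask |= data < vmin
    if vmax is not None:
        mask |= data > vmax
    return mask
```

The awkward part for active storage is that this logic (or at least the resolved set of values and bounds) has to travel to the server along with the request, rather than being applied client-side by netCDF4-python.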
The fixed missing values are typically floats which need to match exactly with values in the data, so a string decimal representation created by the client might not convert back to the exact binary representation on the storage. Does DAP deal with this, I wonder?
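The round-trip hazard is easy to demonstrate: the shortest decimal string that round-trips to a given float32 does not, in general, parse to the same binary value as a float64. So a client sending the fill value as text could fail to match the data bit-for-bit if the storage parses it at a different precision. A small numpy illustration (the 0.1 value is just an example):

```python
import numpy as np

fv32 = np.float32(0.1)     # not exactly 0.1 in binary
as_text = str(fv32)        # shortest decimal repr, e.g. "0.1"

print(np.float32(as_text) == fv32)  # True: round-trips at 32 bits
print(np.float64(as_text) == fv32)  # False: parsed as float64 it differs
```

This suggests the missing values should be transmitted as raw bytes of the variable's dtype (as presumably the data type itself already is) rather than as decimal text.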