Best practices for missing values (aka _FillValues) within VLENs?

czender commented 6 years ago

Does Unidata have guidelines to represent missing values within VLENs, rather than the VLENs themselves? My understanding is that VLEN _FillValues must themselves be VLENs.
This is fine. I want to understand the best practices for indicating and treating missing values within each VLEN ragged array. Here I use _FillValue for the attribute name, and "missing value" for values to "ignore" within a VLEN.

The semantics are confusing. I hope my description is clear. As I see it, VLENs are incomplete without a well-understood way of specifying a "missing value" for elements within the VLEN. Hopefully this is well-trod ground and I can learn from previous internal Unidata discussions on this topic. Otherwise I can revise this and post to the netCDF group for discussion.

There are at least three options:

1. Treat element values that equal the base-type NC_FILL_* as missing. This allows a disconnect between the _FillValue (a VLEN) and the value treated as "missing" for data within a VLEN. This means NC_FILL_* for the VLEN base type is always missing, regardless of whether the user wants that. It prevents users from assigning a custom value for missing data within a VLEN.
2. Treat the first element of the VLEN _FillValue as missing. This connects the VLEN _FillValue to the "missing value" for elements within a VLEN. The user controls _FillValue, and can set it to any desired value, e.g., _FillValue={-999.0}.
3. Use _FillValue for the VLEN, and missing_value for elements within the VLEN. Both are independently configurable.
4. Other/Combinations of the above.
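
The first three options listed above differ only in where a reader looks up the per-element sentinel. A minimal C sketch of that lookup, assuming a float base type: `vlen_t` is a local stand-in that mirrors the layout of `nc_vlen_t` in netcdf.h, `DEFAULT_FILL_FLOAT` repeats the value of `NC_FILL_FLOAT`, and the enum, the function name, and the fallback to the base-type default are all hypothetical choices made for illustration.

```c
#include <stddef.h>

/* Local stand-in mirroring nc_vlen_t (netcdf.h): a length plus a data pointer. */
typedef struct { size_t len; void *p; } vlen_t;

/* Same value as NC_FILL_FLOAT, repeated so the sketch needs no netCDF headers. */
#define DEFAULT_FILL_FLOAT 9.9692099683868690e+36f

/* Hypothetical labels for the three options in the list. */
typedef enum {
    OPT_BASETYPE_DEFAULT,   /* option 1: base-type NC_FILL_* */
    OPT_FIRST_OF_FILLVALUE, /* option 2: first element of the VLEN _FillValue */
    OPT_MISSING_VALUE_ATT   /* option 3: separate missing_value attribute */
} missing_opt;

/* Return the element-level missing value under the chosen option.
 * fill: the variable's _FillValue (a VLEN); may be NULL or empty.
 * missing_value_att: value of a missing_value attribute; may be NULL.
 * Assumption: fall back to the base-type default when the source is absent. */
static float resolve_missing(missing_opt opt, const vlen_t *fill,
                             const float *missing_value_att)
{
    switch (opt) {
    case OPT_FIRST_OF_FILLVALUE:
        if (fill != NULL && fill->len > 0 && fill->p != NULL)
            return ((const float *)fill->p)[0];
        break;
    case OPT_MISSING_VALUE_ATT:
        if (missing_value_att != NULL)
            return *missing_value_att;
        break;
    case OPT_BASETYPE_DEFAULT:
        break;
    }
    return DEFAULT_FILL_FLOAT;
}
```

With _FillValue={-999.0}, option 2 yields -999.0, while option 1 yields the base-type default no matter what the producer set.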

Recommendations?

WardF commented 6 years ago

Interesting question; it hasn't come up in my time here yet, so I'll have to consult the documentation. @DennisHeimbigner any thoughts?

WardF commented 6 years ago

You have asked an excellent question and we are still reviewing it.

DennisHeimbigner commented 6 years ago

Charlie-

The short answer is that there is no special policy about VLEN and fillvalue because none is needed.

But I need to make sure I understand the question.

There are two possible places where VLEN and fillvalue might interact.

First, suppose I have the following definitions.

netcdf ... {
dimensions: d1 = 1;
types: int(*) istar_t;
variables: istar_t v(d1);
...
}

I can set a _FillValue for v in the usual way. I might write code like this, where ncid is the id of the file and varid is the id of variable v.

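{
    /* A three-element int sequence to use as the fill value for v. */
    int seq[3] = {1,2,3};
    nc_vlen_t fillval;
    fillval.len = 3;          /* number of elements in the sequence */
    fillval.p = (void*)seq;   /* pointer to the sequence data */
    /* NC_FILL leaves fill mode enabled (NC_NOFILL would disable it). */
    nc_def_var_fill(ncid,varid,NC_FILL,&fillval);
}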

BTW, the default fill value for a vlen is a zero-length vlen: {0,NULL}.

But I suspect this is not what you are referring to.

The important thing to note is that the data associated with the VLEN instance (pointed to by nc_vlen_t.p) is a sequence, not an array. That means that there can be no "holes" in the sequence. This is because there is no way in the netcdf API to read or write a single element in the sequence. One can only read or write the whole sequence as one object [IMO, which is why using UNLIMITED is better].

This means that there can be no concept of a missing-value that needs a fillvalue. All of the values in the sequence are defined as having whatever value was put there by the user's program. If the user chooses some value to represent a missing value, then that is their choice and the netcdf library has no knowledge of it.

Possibly you are being misled by the two functions nc_get_vlen_element and nc_put_vlen_element in netcdf.h. These functions are technically there only to support access to VLENs by Fortran code. It appears the documentation for these functions is incorrect or at best misleading. They do not allow the read/write of a specific element in a VLEN sequence. I will submit an issue to correct the documentation.

Let me know if I am not addressing your question.

czender commented 6 years ago

Thank you, Dennis. I understand that there is no policy needed by the netCDF library. However, users will need a best practice to indicate which values in a VLEN sequence are to be arithmetically ignored. Consider storing all temperature measurements from a set of N drifting buoys in a VLEN array of size N. Each sequence has its own number of measurements. A VLEN array is presumably a suitable storage strategy for ragged data like this. Each sequence can have gaps with bad data that should be ignored. What is the best practice for designating the value that marks bad data? For all atomic NC_TYPEs we do this with _FillValue, but that does not work directly for elements of a VLEN sequence.

The best practice is something that NCO, not netCDF, needs to implement in order to produce meaningful averages etc. of VLEN sequences. I think the lack of a best practice will lead to confusion and continued non-adoption of VLENs. This may be an OK outcome, but the first satellite data (S5P) to use VLEN is now using it for radiance data and we need some best practices to handle the data. I can define a best practice by implementing it in NCO. This is Unidata's chance to voice its preference. Voicing no preference is also fine. I just want there to be a consensus among those who are affected by this issue.

Some would suggest using "missing_value" to indicate ignorable elements. That would open another can of worms, though at least it would be defensible. Currently I favor using the first element of any user-defined _FillValue to indicate data to arithmetically ignore in a VLEN sequence.

DennisHeimbigner commented 6 years ago

I see, and as a best practice it is important for users. My immediate reaction is that the preferred fill value for the elements of a VLEN sequence should be the default fill value for the base type of the VLEN. However, the problem is what to do if the user specifies an explicit fill value for some variable whose type is a VLEN. There is no defined way to specify a fill value for a type (as opposed to a variable). So there is no way to set a fill value for the base type of the VLEN. This can only be defined by some convention, as you note.

You propose this:

Currently I favor using the first element of any user-defined _FillValue to indicate data to arithmetically ignore in a VLEN sequence.

but I am not sure I understand it. Can you give an example?

czender commented 6 years ago

_FillValue={1,bp} where bp points to the basetype sequence whose first value would be used, e.g., 1.0e36 for floating point types. I could also be convinced that using the default NC_FILL for the basetype should be treated prima facie as indicating to ignore such data. However, I prefer placing the choice and responsibility of the indicator on the dataset producer, to avoid implicit assumptions.
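
A rough C sketch of how a consumer might apply this convention when averaging one sequence, assuming a float base type. `vlen_t` is a local stand-in that mirrors the layout of `nc_vlen_t` in netcdf.h, and `vlen_mean_float` is a hypothetical helper, not an NCO or netCDF function.

```c
#include <stddef.h>

/* Local stand-in mirroring nc_vlen_t (netcdf.h): a length plus a data pointer. */
typedef struct { size_t len; void *p; } vlen_t;

/* Average the elements of seq, skipping any element equal to the first
 * element of fill (the variable's _FillValue, itself a VLEN).
 * Returns the number of elements averaged; *mean is written only when
 * that count is nonzero. If fill is NULL or empty, nothing is skipped. */
static size_t vlen_mean_float(const vlen_t *seq, const vlen_t *fill, double *mean)
{
    const float *v = (const float *)seq->p;
    int have_missing = (fill != NULL && fill->len > 0 && fill->p != NULL);
    float missing = have_missing ? ((const float *)fill->p)[0] : 0.0f;
    double sum = 0.0;
    size_t n = 0;

    for (size_t i = 0; i < seq->len; i++) {
        if (have_missing && v[i] == missing)
            continue; /* element flagged as bad data by the producer */
        sum += v[i];
        n++;
    }
    if (n > 0)
        *mean = sum / (double)n;
    return n;
}
```

The exact equality test means a NaN sentinel would never match; a producer who wants NaN as the indicator would need a different comparison.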

DennisHeimbigner commented 6 years ago

If I understand this correctly, then your solution rings a bell. It is in fact in line with how we specify fill values for variables whose base type is compound. That is, the attribute _FillValue must be of the same compound type and a single instance. The values of the fields of the single instance are used to (recursively) define the fill value for the fields of the compound type. I will need to do some research to see if this approach for compound types is well described in the documentation and if there is some corresponding documentation for VLENs. I suspect neither compound nor VLEN fill values are described (at least in detail). Note that your solution also should work recursively if the VLEN base type is a compound or VLEN type itself. In sum, this seems like the right solution and is consistent with existing (probably undocumented) practice for compound types.
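
To make the recursive case concrete, here is a small C sketch for a VLEN whose base type is compound. Everything in it is an assumption made for illustration: `vlen_t` is a local stand-in mirroring `nc_vlen_t`, `obs_t` is a hypothetical compound type, and reading each field of the first _FillValue element as that field's ignore indicator is one interpretation of the convention discussed here, not documented library behavior.

```c
#include <stddef.h>

/* Local stand-in mirroring nc_vlen_t (netcdf.h): a length plus a data pointer. */
typedef struct { size_t len; void *p; } vlen_t;

/* Hypothetical compound base type: one buoy observation. */
typedef struct { float temp; int qc; } obs_t;

/* Count the elements of seq whose temp field is usable, meaning it differs
 * from the temp field of the first element of fill (the VLEN _FillValue).
 * If fill is NULL or empty, no sentinel is declared and all elements count. */
static size_t count_valid_temp(const vlen_t *seq, const vlen_t *fill)
{
    const obs_t *obs = (const obs_t *)seq->p;
    const obs_t *f;
    size_t n = 0;

    if (fill == NULL || fill->len == 0 || fill->p == NULL)
        return seq->len;

    f = (const obs_t *)fill->p; /* first element carries the per-field sentinels */
    for (size_t i = 0; i < seq->len; i++)
        if (obs[i].temp != f->temp)
            n++;
    return n;
}
```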

czender commented 6 years ago

Thank you for the extensive feedback and explanation of _FillValue for compound. Based on that, NCO will follow the method I propose above to indicate ignorable elements for VLEN. As you say, it is in the same spirit as the _FillValue solution for compound.

Unidata / netcdf-c

Best practices for missing values (aka _FillValues) within VLENs? #1011