Closed sethaxen closed 9 months ago
If the _FillValue attribute is not set in a NetCDF file, it seems this package defaults to all arrays being read with Union{Missing,T} eltypes, i.e. a missing fill value.
This should not be the case (unless a missing_value
is defined). Can you give a (small) example where you see this behaviour? Currently we look for the _FillValue
and missing_value
attributes.
julia> using NCDatasets;
# example from xarray
typeof(NCDataset("/home/abarth/.local/lib/python3.8/site-packages/xarray/tests/data/example_1.nc")["temp"][:])
Array{Float32, 4}
Ah, I don't have an example of this specific issue right now, but here's a perhaps related one. In this dataset, _FillValue
is set to NaN
, yet the eltype NCDatasets returns still has Missing
:
julia> using NCDatasets
julia> f = Base.download("https://ndownloader.figshare.com/files/24067472");
julia> ds = NCDataset(f);
julia> g = ds.group["posterior"]["g"]
g (2 × 500 × 4)
Datatype: Float64
Dimensions: g_coef × draw × chain
Attributes:
_FillValue = NaN
julia> eltype(Array(g))
Union{Missing, Float64}
julia> close(ds);
OK, but in this example _FillValue
is indeed set. We use missing
because, NaN does not work for integers for example.
In your case, what you can do is one of the following:
julia> g = cfvariable(ds.group["posterior"],"g",fillvalue = nothing)
g (2 × 500 × 4)
Datatype: Float64
Dimensions: g_coef × draw × chain
Attributes:
_FillValue = NaN
julia> eltype(g)
Float64
julia> g = ds.group["posterior"]["g"].var
g (2 × 500 × 4)
Datatype: Float64
Dimensions: g_coef × draw × chain
Attributes:
_FillValue = NaN
julia> eltype(g)
Float64
The second approach ignores all CF conversions (add_offset, scale_factor and time conversion).
fillvalue = nothing
ignores the fillvalue that is set, right? What would be a good way to mimic xarray behavior, to convert the missing data to NaN
instead of missing
? Regardless of whether _FillValue
is NaN
or say -1
? This would only make sense for floating point data.
For the case where _FillValue
is NaN
, like your example, that would amount to the same thing as cfvariable(ds.group["posterior"],"g",fillvalue = nothing)
Why choose Union{T,Missing}
instead of Union{T,Float64}
when _FillValue=NaN
? For T<:Float64
, there is then no type union, and for Int
arrays, one avoids injecting missing
s when the file specified NaN
s?
fillvalue = nothing ignores the fillvalue that is set, right?
It is correct. If somebody would want to extend cfvariable
to support e.g.
cfvariable(ds.group["posterior"],"g",sentinelvalue = NaN)
where all _FillValue
gets replaced by sentinelvalue
, that would be nice (It is unlikely that I find the time myself in the near term and I am not sure if this approach would be very convenient to use).
for Int arrays, one avoids injecting missings when the file specified NaNs?
I don't think that you cannot specify a FillValue of NaN
for an Int array. The error message would be Not a valid data type or _FillValue type mismatch
. Also julia quickly promotes Int to Floats when combined in an array with NaN
s
julia> vcat([1,2],[NaN])
3-element Vector{Float64}:
1.0
2.0
NaN
This leads to issue similar to these: https://github.com/pydata/xarray/issues/1194
If we would use Union{T,Float64}
when _FillValue=NaN
then a user would need to check constantly with ismissing
and with isnan
if a value is valid or not or keep the _FillValue
in the NetCDF file around.
Originally NCDatasets used DataArrays.jl which got deprecated in favor of Union{T,Missing}
. This blog post explains well the rational of this approach.
Missing is also used in other packages like DataFrames.jl.
We have also the function nomissing
if you want to use a different sentinel value:
v_with_nan = nomissing(ds["var"][:],NaN)
v_with_nan = nomissing(ds["var"][:]) # error if there is a missing value
The type signaling that an array may contain missing value can also be used for dispatch.
method(a::Vector{Union{Missing,Float64}}) = fast_method(fill_missing(a))
method(a::Vector{Float64}) = fast_method(a)
If we would use NaN
as the only missing value (and substitute it to a different value when writing to a file), we would also consider its an impact on the size:
julia> Base.summarysize(Vector{Union{Int8,Missing}}(undef,100))
240
julia> Base.summarysize(Vector{Union{Int8,Float32}}(undef,100)) # for NaN32
540
julia> Base.summarysize(Vector{Union{Int8,Float64}}(undef,100))
940
Beside, integer and floats, NCDatasets can also return an array of Char
, String
, and DateTime
. Having a NaN as missing value among those looks weird to me.
That being said: in other packages that I wrote, I also use NaN as missing value, because I can be sure to deal only with floating point numbers. However, for NCDatasets, I think a more generic approach is better.
Previous discussion: https://github.com/Alexander-Barth/NCDatasets.jl/issues/132
Thanks for the detailed answer. I agree that missing
makes most sense as the default missing value.
Good to hear you'd be open to something like
cfvariable(ds.group["posterior"],"g",sentinelvalue = NaN)
Though I suspect for most cases this is also fine:
v_with_nan = nomissing(ds["var"][:],NaN)
I implemented alternative to missing values in a branch of NCDatasets.jl and CommonDataModel.jl (called emv
).
missing
is still the default, but a user can provide an different value like NaN
.
Here is the documentation for this features (will possible caveats):
https://github.com/Alexander-Barth/NCDatasets.jl/blob/emv/docs/src/fillvalue.md
Do you have a suggestion how this keyword argument can be named (rather than _experimental_missing_value
) ?
(fillvalue
and missing_value
are already used to override the attributes defined in the NetCDF files). Maybe sentinelvalue
as mentioned before? Or any better name?
This should not be a breaking changes and tests in GRIBDatasets
still pass.
CC: @visr @sethaxen @rafaqz @Datseris @joa-quim
As it that can be used per variable, maybe just var_missing_value
?
Also in fillvalue.md
it should be "allows one to..." instead of "allows to"
The function nomissing
should be named replace_missing
imo.
It is difficult to keep track of three names: missing_value
, fillvalue
and the new sentinelvalue
(and I find it weird how some have _
and some do not, better to be consistent with naming and have _
in all). How can it be that we really need three keyword arguments? I believe I have not understood the reason for that and what the three names each achieve that can't be done by a combination of the other two.
(edit: and of course, thanks for the work and for tagging me!)
Thanks this will be good to have.
In Rasters.jl the keyword will be replace_missing=missing
to suggest that we are actively changing the missing value from what it is on disk (and theres already a function with that name that dies that).
Not that important but I would prefer the name nodata
or noadavalue
or variations of this with underscores. The issue for me with replace_missing
is that the function will not be replacing any missing because there are no missings
in the nertCDF file.
I had a look to all keyword arguments in NCDatasets (including un-exported functions). We usually do not use _. The only exceptions are missing_value
, scale_factor
and add_offset
. They are using in cfvariable
do shadow the corresponding attributes as defined by the CF convention (and we use the same names as defined in the CF convention with _, but we did change _FillValue
to fillvalue
as the leading underscore can be understood as a private keyword argument).
julia> function list_kwargs(mod)
mm = reduce(vcat,[methods(getproperty(mod,p)) for p in names(mod,all=true)])
return unique(reduce(vcat,[Base.kwarg_decl(a) for a in mm]))
end
list_kwargs (generic function with 1 method)
julia> list_kwargs(NCDatasets)
30-element Vector{Symbol}:
:...
:format
:share
:diskless
:persist
:memory
:_experimental_missing_value
:attrib
:parentdataset
Symbol("kwargs...")
:locale
:_v
:_parentname
:fillvalue
:missing_value
:scale_factor
:add_offset
:units
:calendar
:chunksizes
:shuffle
:deflatelevel
:checksum
:nofill
:typename
:parents
:newfname
:size
:nelems
:preemption
How can it be that we really need three keyword arguments?
The CF standard defines two ways to specify an absent data with the attributes _FillValue
and missing_value
(corresponding to keyword arguments fillvalue
and missing_value
of the function cfvariable
). A CF conforming software has to use both attributes (this is also the case python-netCDF4). There can thus be multiple values to mark absent data. The function cfvariable
allows use to override these attributes in case you have a malformed NetCDF file (or you want to mask additional values). So the fillvalue
and missing_value
correspond to the values in the NetCDF file. It would have been simpler if the CF conventions would not have standardized missing_value
and just use _FillValue
but this is unfortunately not the case.
The new keyword argument (_experimental_missing_value
to be renamed) correspond to the what should be the value marking absent data in the julia array.
So I would rather have new name without underscore.
Other idea: maskedvalue
, masking
(python-netCDF4 used masked arrays to handle fill values)
@lupemba : this is the change that I mentioned to you this morning, for your information :-)
Just to chip in, I'd also enjoy an easy way to get an Array{T}
instead of Array{Union{Missing,T}}
back. I'd vote for a ignore_missing
which just reads out the bits as they are and interprets them as T
in combination with replace_missing
which you can set to NaN
as described above. Would an interface like the following work?
julia> using NCDatasets
julia> ds = NCDataset("test_missing.nc","c")
julia> data = Float32[1 2; 3 4]
2×2 Matrix{Float32}:
1.0 2.0
3.0 4.0
julia> v = defVar(ds,"temperature",data,("lon","lat"), attrib=Dict("_FillValue"=> 1.0f0))
julia> close(ds)
Then read it as
julia> ds = NCDataset("test_missing.nc")
julia> ignore_missing(ds["temperature"])[:,:]
2×2 Matrix{Float32}:
1.0 2.0
3.0 4.0
julia> replace_missing(ds["temperature"],NaN)[:,:]
2×2 Matrix{Float32}:
NaN 2.0
3.0 4.0
I wanted to CC also @tcarion as this additional option would also be available to users GRIBDatasets
.
To change the value for absent data at a dataset level, GRIBDatasets
would need to specify a function like:
_experimental_missing_value(ds::GRIBDataset) = ... something ...
where "something" would be the value of a keyword argument to GRIBDatasets
, as in here:
Note method _experimental_missing_value
name will be changed, maybe the same as the keyword argument of cfvariable
or with a default
prefix (such as maskedvalue
or defaultmaskedvalue
).
This method currently defaults to missing
:
https://github.com/JuliaGeo/CommonDataModel.jl/blob/4a3a9fc3fac07c053b584b3a9c5e242785ecb261/src/dataset.jl#L123
So this will not a breaking change. If you prefer to maintain missing
as the only option, you do not need to do anything.
The default replacement value for missing data could then be stored in the GRIBDataset
struct:
This is at least how I implemented it in NCDatasets
. Once we agree on a name, I can make a PR to GRIBDatasets
. This should be just a minimal change.
I am here at the Julia EO 2024 workshop where I showed the NCDatasets.jl and CommonDataModel.jl package. People are quite happy here to hear that they can read GRIB and netcdf files with the same API :-)
I like @milankl 's suggestion with the exception that it should be
ignore_missing(ds["temperature"])
without [:,:]
. This isn't necessary, one calls the function anyways because one wants to actually load the numerical data.
I also like @milankl 's suggestion.
@Datseris I think the original proposal works better with lazy loading. I might what to ignore/replace missing for a variable and then just load subsets of it later.
Example
temperatures = ignore_missing(ds["temperature"]) #no loading of data
subset_a = temperatures[1:1000] # load first data
if condition(subset_a)
subset_b = temperatures[1001:2000] # load more data
end
I explained to @milankl that we have already ds["temperature"].var
for ignore_missing(ds["temperature"])
off-line. I am not sure that we need a separate API for that. Having replace_missing
functionality available in the cfvariable
functions allows to do different related operation in one step. For example if someone forgot to declare the fill value attribute e.g. -9999.
of a NetCDF variable , one could declare it a posteriori and use NaN as replacement when loading:
ncvar = cfvariable(ds,"data_with_forgotten_fillvalue",fillvalue =-9999., maskedvalue = NaN)
Or put differently, a user might be surprised that overriding an attribute it done via a keyword argument but setting the replacement value is done via a function call.
Also we have already currently nomissing
(see the doc that I linked), it would be a bit confusing to have also ignore_missing
and replace_missing
.
@Alexander-Barth Thanks for the clarification. I have now read the documentation and the question is now much clearer. https://github.com/Alexander-Barth/NCDatasets.jl/blob/emv/docs/src/fillvalue.md
I have two suggestion. First one, is is possible to just name the keyword argument the same as the function nomissing
. I think this would make an intuitive interface since they do the same thing. So A and B would be the same in this case.
A = nomissing(ds["var"][:,:],NaN)
B = cfvariable(ds,"var",nomissing= NaN)[:,:]
The second suggestion is just to use missing_replacement=NaN
. Maybe there should also be a fill_replacement
since the fill value and missing value is two different things in the CF conventions.
ds = NCDataset("example.nc","r", missing_replacement=NaN, fill_replacement=0)
hi Simon :-) For coherence with other keyword arguments I think we should avoid _ in names (unless there are attributes from the CF convention).
The NetCDF attributes _FillValue
and missing_value
are currently treated the same way (all corresponding values are replaced by Missing
). As far as I know python-netCDF4, xarray and R's ncdf4 do the same thing. It would be kind of a headache to implement this. I would rather wait once we have specific use case.
Indeed the keyword argument can also be called nomissing
. I am wondering if we should soft deprecate the function nomissing(ds["var"][:,:],NaN)
as the new approach would typically require less memory and avoid an intermediate array. But is some cases it is still useful. When you call it without a replacement value:
data = nomissing(ds["var"][:,:])
it just check if there are no missing value in the array and transforms the for e.g. Array{Union{Float64,Missing},2}
into Array{Float64,2}
. If there is a missing, an error is raised. In this case nomissing
more similar to an assert statement.
This is now implemented as described here https://alexander-barth.github.io/NCDatasets.jl/dev/other/#Fill-values-and-missing-values
Thank you for your suggestions and feedback (on-line and off-line).
If the
_FillValue
attribute is not set in a NetCDF file, it seems this package defaults to all arrays being read withUnion{Missing,T}
eltypes, i.e. amissing
fill value. If writing the NetCDF file, I can change_FillValue
to change this behavior, but is there also a way to specify the default fill value when the file is opened in read-only mode?My main use case is either having the default fill value be
NaN
or having only a typeunion withMissing
if values are actually missing.