Alexander-Barth / NCDatasets.jl

Load and create NetCDF files in Julia
MIT License
149 stars 32 forks source link

Specifying the fill value when reading a file #188

Closed sethaxen closed 9 months ago

sethaxen commented 2 years ago

If the _FillValue attribute is not set in a NetCDF file, it seems this package defaults to all arrays being read with Union{Missing,T} eltypes, i.e. a missing fill value. If writing the NetCDF file, I can change _FillValue to change this behavior, but is there also a way to specify the default fill value when the file is opened in read-only mode?

My main use case is either having the default fill value be NaN or having only a typeunion with Missing if values are actually missing.

Alexander-Barth commented 2 years ago

If the _FillValue attribute is not set in a NetCDF file, it seems this package defaults to all arrays being read with Union{Missing,T} eltypes, i.e. a missing fill value.

This should not be the case (unless a missing_value is defined). Can you give a (small) example where you see this behaviour? Currently we look for the _FillValue and missing_value attributes.

julia> using NCDatasets;
# example from xarray 
typeof(NCDataset("/home/abarth/.local/lib/python3.8/site-packages/xarray/tests/data/example_1.nc")["temp"][:])
Array{Float32, 4}
sethaxen commented 2 years ago

Ah, I don't have an example of this specific issue right now, but here's a perhaps related one. In this dataset, _FillValue is set to NaN, yet the eltype NCDatasets returns still has Missing:

julia> using NCDatasets

julia> f = Base.download("https://ndownloader.figshare.com/files/24067472");

julia> ds = NCDataset(f);

julia> g = ds.group["posterior"]["g"]
g (2 × 500 × 4)
  Datatype:    Float64
  Dimensions:  g_coef × draw × chain
  Attributes:
   _FillValue           = NaN

julia> eltype(Array(g))
Union{Missing, Float64}

julia> close(ds);
Alexander-Barth commented 2 years ago

OK, but in this example _FillValue is indeed set. We use missing because, NaN does not work for integers for example.

In your case, what you can do is one of the following:

julia> g = cfvariable(ds.group["posterior"],"g",fillvalue = nothing)
g (2 × 500 × 4)
  Datatype:    Float64
  Dimensions:  g_coef × draw × chain
  Attributes:
   _FillValue           = NaN

julia> eltype(g)
Float64

julia> g = ds.group["posterior"]["g"].var
g (2 × 500 × 4)
  Datatype:    Float64
  Dimensions:  g_coef × draw × chain
  Attributes:
   _FillValue           = NaN

julia> eltype(g)
Float64

The second approach ignores all CF conversions (add_offset, scale_factor and time conversion).

visr commented 2 years ago

fillvalue = nothing ignores the fillvalue that is set, right? What would be a good way to mimic xarray behavior, to convert the missing data to NaN instead of missing? Regardless of whether _FillValue is NaN or say -1? This would only make sense for floating point data.

For the case where _FillValue is NaN, like your example, that would amount to the same thing as cfvariable(ds.group["posterior"],"g",fillvalue = nothing)

sethaxen commented 2 years ago

Why choose Union{T,Missing} instead of Union{T,Float64} when _FillValue=NaN? For T<:Float64, there is then no type union, and for Int arrays, one avoids injecting missings when the file specified NaNs?

Alexander-Barth commented 2 years ago

fillvalue = nothing ignores the fillvalue that is set, right?

It is correct. If somebody would want to extend cfvariable to support e.g.

cfvariable(ds.group["posterior"],"g",sentinelvalue = NaN)

where all _FillValue gets replaced by sentinelvalue, that would be nice (It is unlikely that I find the time myself in the near term and I am not sure if this approach would be very convenient to use).

for Int arrays, one avoids injecting missings when the file specified NaNs?

I don't think that you cannot specify a FillValue of NaN for an Int array. The error message would be Not a valid data type or _FillValue type mismatch. Also julia quickly promotes Int to Floats when combined in an array with NaNs

julia> vcat([1,2],[NaN])
3-element Vector{Float64}:
   1.0
   2.0
 NaN

This leads to issue similar to these: https://github.com/pydata/xarray/issues/1194

If we would use Union{T,Float64} when _FillValue=NaN then a user would need to check constantly with ismissing and with isnan if a value is valid or not or keep the _FillValue in the NetCDF file around.

Originally NCDatasets used DataArrays.jl which got deprecated in favor of Union{T,Missing}. This blog post explains well the rational of this approach. Missing is also used in other packages like DataFrames.jl.

We have also the function nomissing if you want to use a different sentinel value:

v_with_nan = nomissing(ds["var"][:],NaN)
v_with_nan = nomissing(ds["var"][:]) # error if there is a missing value

The type signaling that an array may contain missing value can also be used for dispatch.

method(a::Vector{Union{Missing,Float64}}) = fast_method(fill_missing(a))
method(a::Vector{Float64}) = fast_method(a)

If we would use NaN as the only missing value (and substitute it to a different value when writing to a file), we would also consider its an impact on the size:

julia> Base.summarysize(Vector{Union{Int8,Missing}}(undef,100))
240

julia> Base.summarysize(Vector{Union{Int8,Float32}}(undef,100)) # for NaN32
540

julia> Base.summarysize(Vector{Union{Int8,Float64}}(undef,100))
940

Beside, integer and floats, NCDatasets can also return an array of Char, String, and DateTime. Having a NaN as missing value among those looks weird to me.

That being said: in other packages that I wrote, I also use NaN as missing value, because I can be sure to deal only with floating point numbers. However, for NCDatasets, I think a more generic approach is better.

Previous discussion: https://github.com/Alexander-Barth/NCDatasets.jl/issues/132

visr commented 2 years ago

Thanks for the detailed answer. I agree that missing makes most sense as the default missing value.

Good to hear you'd be open to something like

cfvariable(ds.group["posterior"],"g",sentinelvalue = NaN)

Though I suspect for most cases this is also fine:

v_with_nan = nomissing(ds["var"][:],NaN)
Alexander-Barth commented 10 months ago

I implemented alternative to missing values in a branch of NCDatasets.jl and CommonDataModel.jl (called emv). missing is still the default, but a user can provide an different value like NaN. Here is the documentation for this features (will possible caveats):

https://github.com/Alexander-Barth/NCDatasets.jl/blob/emv/docs/src/fillvalue.md

Do you have a suggestion how this keyword argument can be named (rather than _experimental_missing_value) ? (fillvalue and missing_value are already used to override the attributes defined in the NetCDF files). Maybe sentinelvalue as mentioned before? Or any better name?

This should not be a breaking changes and tests in GRIBDatasets still pass.

CC: @visr @sethaxen @rafaqz @Datseris @joa-quim

ctroupin commented 10 months ago

As it that can be used per variable, maybe just var_missing_value?

Also in fillvalue.md it should be "allows one to..." instead of "allows to"

Datseris commented 10 months ago

The function nomissing should be named replace_missing imo.

It is difficult to keep track of three names: missing_value, fillvalue and the new sentinelvalue (and I find it weird how some have _ and some do not, better to be consistent with naming and have _ in all). How can it be that we really need three keyword arguments? I believe I have not understood the reason for that and what the three names each achieve that can't be done by a combination of the other two.

(edit: and of course, thanks for the work and for tagging me!)

rafaqz commented 10 months ago

Thanks this will be good to have.

In Rasters.jl the keyword will be replace_missing=missing to suggest that we are actively changing the missing value from what it is on disk (and theres already a function with that name that dies that).

joa-quim commented 10 months ago

Not that important but I would prefer the name nodata or noadavalue or variations of this with underscores. The issue for me with replace_missing is that the function will not be replacing any missing because there are no missings in the nertCDF file.

Alexander-Barth commented 10 months ago

I had a look to all keyword arguments in NCDatasets (including un-exported functions). We usually do not use _. The only exceptions are missing_value, scale_factor and add_offset. They are using in cfvariable do shadow the corresponding attributes as defined by the CF convention (and we use the same names as defined in the CF convention with _, but we did change _FillValue to fillvalue as the leading underscore can be understood as a private keyword argument).

julia> function list_kwargs(mod)
          mm = reduce(vcat,[methods(getproperty(mod,p)) for p in names(mod,all=true)])
          return unique(reduce(vcat,[Base.kwarg_decl(a) for a in mm]))
       end
list_kwargs (generic function with 1 method)

julia> list_kwargs(NCDatasets)
30-element Vector{Symbol}:
 :...
 :format
 :share
 :diskless
 :persist
 :memory
 :_experimental_missing_value
 :attrib
 :parentdataset
 Symbol("kwargs...")
 :locale
 :_v
 :_parentname
 :fillvalue
 :missing_value
 :scale_factor
 :add_offset
 :units
 :calendar
 :chunksizes
 :shuffle
 :deflatelevel
 :checksum
 :nofill
 :typename
 :parents
 :newfname
 :size
 :nelems
 :preemption

How can it be that we really need three keyword arguments?

The CF standard defines two ways to specify an absent data with the attributes _FillValue and missing_value (corresponding to keyword arguments fillvalue and missing_value of the function cfvariable). A CF conforming software has to use both attributes (this is also the case python-netCDF4). There can thus be multiple values to mark absent data. The function cfvariable allows use to override these attributes in case you have a malformed NetCDF file (or you want to mask additional values). So the fillvalue and missing_value correspond to the values in the NetCDF file. It would have been simpler if the CF conventions would not have standardized missing_value and just use _FillValue but this is unfortunately not the case.

The new keyword argument (_experimental_missing_value to be renamed) correspond to the what should be the value marking absent data in the julia array.

So I would rather have new name without underscore.

Other idea: maskedvalue, masking (python-netCDF4 used masked arrays to handle fill values)

Alexander-Barth commented 10 months ago

@lupemba : this is the change that I mentioned to you this morning, for your information :-)

milankl commented 10 months ago

Just to chip in, I'd also enjoy an easy way to get an Array{T} instead of Array{Union{Missing,T}} back. I'd vote for a ignore_missing which just reads out the bits as they are and interprets them as T in combination with replace_missing which you can set to NaN as described above. Would an interface like the following work?

julia> using NCDatasets
julia> ds = NCDataset("test_missing.nc","c")
julia> data = Float32[1 2; 3 4]
2×2 Matrix{Float32}:
 1.0  2.0
 3.0  4.0

julia> v = defVar(ds,"temperature",data,("lon","lat"), attrib=Dict("_FillValue"=> 1.0f0))
julia> close(ds)

Then read it as

julia> ds = NCDataset("test_missing.nc")
julia> ignore_missing(ds["temperature"])[:,:]
2×2 Matrix{Float32}:
 1.0       2.0
 3.0       4.0

julia> replace_missing(ds["temperature"],NaN)[:,:]
2×2 Matrix{Float32}:
 NaN       2.0
 3.0       4.0
Alexander-Barth commented 10 months ago

I wanted to CC also @tcarion as this additional option would also be available to users GRIBDatasets. To change the value for absent data at a dataset level, GRIBDatasets would need to specify a function like:

_experimental_missing_value(ds::GRIBDataset) = ... something ...

where "something" would be the value of a keyword argument to GRIBDatasets, as in here:

https://github.com/Alexander-Barth/NCDatasets.jl/commit/514559ed39b193b2f27a83d32ebdf1840c4f2403#diff-4123c3f14d4a74a8fd32785a9b3f65d0cb1ea2ada4137952345f0751bbdcbfeaR69

Note method _experimental_missing_value name will be changed, maybe the same as the keyword argument of cfvariable or with a default prefix (such as maskedvalue or defaultmaskedvalue).

This method currently defaults to missing: https://github.com/JuliaGeo/CommonDataModel.jl/blob/4a3a9fc3fac07c053b584b3a9c5e242785ecb261/src/dataset.jl#L123

So this will not a breaking change. If you prefer to maintain missing as the only option, you do not need to do anything.

The default replacement value for missing data could then be stored in the GRIBDataset struct:

https://github.com/JuliaGeo/GRIBDatasets.jl/blob/9c7e442ed5b0e0cdbc6b2b5690b6f05e1386b09f/src/dataset.jl#L144

This is at least how I implemented it in NCDatasets. Once we agree on a name, I can make a PR to GRIBDatasets. This should be just a minimal change.

I am here at the Julia EO 2024 workshop where I showed the NCDatasets.jl and CommonDataModel.jl package. People are quite happy here to hear that they can read GRIB and netcdf files with the same API :-)

Datseris commented 10 months ago

I like @milankl 's suggestion with the exception that it should be

ignore_missing(ds["temperature"])

without [:,:]. This isn't necessary, one calls the function anyways because one wants to actually load the numerical data.

lupemba commented 10 months ago

I also like @milankl 's suggestion.

@Datseris I think the original proposal works better with lazy loading. I might what to ignore/replace missing for a variable and then just load subsets of it later.

Example

temperatures = ignore_missing(ds["temperature"]) #no loading of data
subset_a = temperatures[1:1000] # load first data
if condition(subset_a)
    subset_b = temperatures[1001:2000] # load more data
end
Alexander-Barth commented 10 months ago

I explained to @milankl that we have already ds["temperature"].var for ignore_missing(ds["temperature"]) off-line. I am not sure that we need a separate API for that. Having replace_missing functionality available in the cfvariable functions allows to do different related operation in one step. For example if someone forgot to declare the fill value attribute e.g. -9999. of a NetCDF variable , one could declare it a posteriori and use NaN as replacement when loading:

ncvar = cfvariable(ds,"data_with_forgotten_fillvalue",fillvalue =-9999., maskedvalue = NaN)

Or put differently, a user might be surprised that overriding an attribute it done via a keyword argument but setting the replacement value is done via a function call.

Also we have already currently nomissing (see the doc that I linked), it would be a bit confusing to have also ignore_missing and replace_missing.

lupemba commented 10 months ago

@Alexander-Barth Thanks for the clarification. I have now read the documentation and the question is now much clearer. https://github.com/Alexander-Barth/NCDatasets.jl/blob/emv/docs/src/fillvalue.md

I have two suggestion. First one, is is possible to just name the keyword argument the same as the function nomissing. I think this would make an intuitive interface since they do the same thing. So A and B would be the same in this case.

A =  nomissing(ds["var"][:,:],NaN)
B = cfvariable(ds,"var",nomissing= NaN)[:,:]

The second suggestion is just to use missing_replacement=NaN. Maybe there should also be a fill_replacement since the fill value and missing value is two different things in the CF conventions.

ds = NCDataset("example.nc","r", missing_replacement=NaN, fill_replacement=0)
Alexander-Barth commented 10 months ago

hi Simon :-) For coherence with other keyword arguments I think we should avoid _ in names (unless there are attributes from the CF convention).

The NetCDF attributes _FillValue and missing_value are currently treated the same way (all corresponding values are replaced by Missing). As far as I know python-netCDF4, xarray and R's ncdf4 do the same thing. It would be kind of a headache to implement this. I would rather wait once we have specific use case.

Indeed the keyword argument can also be called nomissing. I am wondering if we should soft deprecate the function nomissing(ds["var"][:,:],NaN) as the new approach would typically require less memory and avoid an intermediate array. But is some cases it is still useful. When you call it without a replacement value:

data = nomissing(ds["var"][:,:])

it just check if there are no missing value in the array and transforms the for e.g. Array{Union{Float64,Missing},2} into Array{Float64,2}. If there is a missing, an error is raised. In this case nomissing more similar to an assert statement.

Alexander-Barth commented 9 months ago

This is now implemented as described here https://alexander-barth.github.io/NCDatasets.jl/dev/other/#Fill-values-and-missing-values

Thank you for your suggestions and feedback (on-line and off-line).