Closed ryofurue closed 9 months ago
(A workaround is to explicitly set
_FillValue
when you create the file.)
That's the intended behaviour. If you have a variable with missing values, you have to declare a fill value. The 1e37 was inserted by the netcdf c library because the corresponding elements where not initialized. Consider the case in Julia when you have an array with non- initialized elements. They are not getting transformed to missings either.
That's the intended behaviour. If you have a variable with missing values, you have to declare a fill value.
That's not the case!
using NCDatasets
tvals = collect(map(Float64, 0:3))
var = [3.0, missing, 9.1, 4.2]
NCDataset("tmp.nc", "c") do ds
defVar(ds, "time", tvals, ("time",))
defVar(ds, "var", var, ("time",))
end
NCDatasets
is kind enough to set _FillValue
for the user! It's much better than the underlying netCDF library! It's much kinder to the user.
The 1e37 was inserted by the netcdf c library because the corresponding elements where not initialized.
But the netcdf C library doesn't recognize the CF convention. If the CF convention were part of the netcdf C library, it should set _FillValue=1e37
for the variable, to avoid creating a non-CF file.
NCDatasets
is not just a wrapper to the netcdf C library. It creates (or tries to create whenever possible) netCDF files that conform to the CF convention.
Consider the case in Julia when you have an array with non- initialized elements. They are not getting transformed to missings either.
What you are saying is: In your example, NCDatasets
cannot set _FillValue
for the user even if it wants to help the user. (I'm sure that you are not saying you do not want to help the user.)
But, in the example I initially posted, NCDatasets
can help the user by automatically setting _FillValue = 1e37
, can't it? Recall the example I show above in this message of mine, where NCDatasets
helps the user by automatically converting missing
to a floating point value and setting that value to _FillValue
. Almost exactly the same treatment can be applied to my initial example.
I think NCDatasets
should create a correct CF-conforming netCDF file whenever possible with minimal efforts (and knowledge) from the user.
In order to have type stable code, one must know at the call of defVar
if the element type is Float64
or Union{Missing,Float64}
(likewise for other types).
In the example you copied:
using NCDatasets
tvals = collect(map(Float64, 0:3))
var = [3.0, missing, 9.1, 4.2]
NCDataset("tmp.nc", "c") do ds
defVar(ds, "time", tvals, ("time",))
defVar(ds, "var", var, ("time",))
end
defVar
knows that be must set the fill value.
However, when you compare these two examples:
example 1.
using NCDatasets
tvals = collect(map(Float64, 0:2))
var = tvals
NCDataset("tmp.nc", "c") do ds
defDim(ds,"time", Inf)
defVar(ds, "time", tvals, ("time",))
v = defVar(ds, "var", Float64, ("time",))
v[3] = var[3] # <- Skipping other time steps
end
example 2.
using NCDatasets
tvals = collect(map(Float64, 0:2))
var = tvals
NCDataset("tmp.nc", "c") do ds
defDim(ds,"time", Inf)
defVar(ds, "time", tvals, ("time",))
v = defVar(ds, "var", Float64, ("time",))
v[3] = var[3] # <- Skipping other time steps
v[1] = 1
v[2] = 1
end
In example 1, clearly _FillValue must be set but not in example 2. The function defVar does not know whether the user will define all elements or not.
But the netcdf C library doesn't recognize the CF convention. If the CF convention were part of the netcdf C library, it should set _FillValue=1e37 for the variable, to avoid creating a non-CF file.
The only way I see to implement this is to keep track of all elements that have been written to and add the attribute _FillValue
if necessary when the file is closed. But this could use potentially a lot of memory, make it unsuitable for the case where the arrays are langer than the available RAM and slow down write operations with NCDatasets.
Setting _FillValue unconditionally, would be quite annoying as future users of these files would need to handle arrays with Union{Float64,Missing} even if there is no missing value.
Yes, what you say makes a lot of sense. I now see the problem better.
But,
Setting _FillValue unconditionally, would be quite annoying as future users of these files would need to handle arrays with Union{Float64,Missing} even if there is no missing value.
Would it really cause any practical problems? Who would be annoyed in what way?
need to handle arrays with Union{Float64,Missing} even if there is no missing value.
I don't get this point. In most use cases, you just don't care whether it's Union{Float64,Missing}
or Float64
when there is no missing value. When there are missing values, you are glad that it's Union{Float64,Missing}
:-)
To me, Union{Float64,Missing}
is one of the greatest features of Julia.
So, I would say, set _FillValue = NaN
to all floating-point variables unless the user opts out.
This is all consistent with the netcdf C library: It puts undefined values as 1e37. So, the CF convention should say that all variable shall have _FillValue = 1e37
by default. As long as one uses netCDF files, one should regard missing values as an integral part.
I notice that in this issue tracker, there is a problem with calling an external Python function with Union{Float64,Missing}
. But that has nothing to do with NCDatasets
. Missing
is a part of the Julia language and so the writer of the python-calling function has to be prepared to deal with Missing
to map Julia's Missing
to Python's mask.
Having an array of Union{Float64,Missing}
requires the compiler add a special cases for missing elements, so it is typically slower (see below). You need 1 byte more per element and it is problematic on a GPU (https://github.com/Alexander-Barth/NCDatasets.jl/issues/132). Beside the issue with PyPlot and PythonPlot, it makes also the interoperability with C libraries difficult. A Matrix{Float64}
can be passed to a C function without copying which is not the case for a Matrix{Union{Float64,Missing}}
.
Union{Float64,Missing}
has its place when there are actually missing value, but I think it would be heavy-handed to impose it in every case.
Additional point to consider: the CF convention does not allow _FillValue for coordinate variables (https://cfconventions.org/cf-conventions/cf-conventions.html#attribute-appendix). The user is free rename a variable, so that it becomes a coordinate variable after the variable is defined in the NetCDF file.
julia> using Missings
julia> A = zeros(1000_0000);
julia> function addone!(A)
for i in eachindex(A)
A[i] += 1
end
end
addone! (generic function with 1 method)
julia> B = allowmissing(A);
julia> @code_typed addone!(A)
CodeInfo(
1 ── %1 = Base.arraysize(A, 1)::Int64
│ %2 = Base.slt_int(%1, 0)::Bool
│ %3 = Core.ifelse(%2, 0, %1)::Int64
│ %4 = Base.slt_int(%3, 1)::Bool
└─── goto #3 if not %4
2 ── goto #4
3 ── goto #4
4 ┄─ %8 = φ (#2 => true, #3 => false)::Bool
│ %9 = φ (#3 => 1)::Int64
│ %10 = φ (#3 => 1)::Int64
│ %11 = Base.not_int(%8)::Bool
└─── goto #10 if not %11
5 ┄─ %13 = φ (#4 => %9, #9 => %23)::Int64
│ %14 = φ (#4 => %10, #9 => %24)::Int64
│ %15 = Base.arrayref(true, A, %13)::Float64
│ %16 = Base.add_float(%15, 1.0)::Float64
│ Base.arrayset(true, A, %16, %13)::Vector{Float64}
│ %18 = (%14 === %3)::Bool
└─── goto #7 if not %18
6 ── goto #8
7 ── %21 = Base.add_int(%14, 1)::Int64
└─── goto #8
8 ┄─ %23 = φ (#7 => %21)::Int64
│ %24 = φ (#7 => %21)::Int64
│ %25 = φ (#6 => true, #7 => false)::Bool
│ %26 = Base.not_int(%25)::Bool
└─── goto #10 if not %26
9 ── goto #5
10 ┄ return nothing
) => Nothing
julia> @code_typed addone!(B)
CodeInfo(
1 ── %1 = Base.arraysize(A, 1)::Int64
│ %2 = Base.slt_int(%1, 0)::Bool
│ %3 = Core.ifelse(%2, 0, %1)::Int64
│ %4 = Base.slt_int(%3, 1)::Bool
└─── goto #3 if not %4
2 ── goto #4
3 ── goto #4
4 ┄─ %8 = φ (#2 => true, #3 => false)::Bool
│ %9 = φ (#3 => 1)::Int64
│ %10 = φ (#3 => 1)::Int64
│ %11 = Base.not_int(%8)::Bool
└─── goto #20 if not %11
5 ┄─ %13 = φ (#4 => %9, #19 => %42)::Int64
│ %14 = φ (#4 => %10, #19 => %43)::Int64
│ %15 = Base.arrayref(true, A, %13)::Union{Missing, Float64}
│ %16 = (isa)(%15, Missing)::Bool
└─── goto #7 if not %16
6 ── goto #10
7 ── %19 = (isa)(%15, Float64)::Bool
└─── goto #9 if not %19
8 ── %21 = π (%15, Float64)
│ %22 = Base.add_float(%21, 1.0)::Float64
└─── goto #10
9 ── Core.throw(ErrorException("fatal error in type inference (type bound)"))::Union{}
└─── unreachable
10 ┄ %26 = φ (#6 => true, #8 => false)::Bool
│ %27 = φ (#6 => false, #8 => true)::Bool
│ %28 = φ (#8 => %22)::Float64
└─── goto #12 if not %26
11 ─ Base.arrayset(true, A, $(QuoteNode(missing)), %13)::Vector{Union{Missing, Float64}}
└─── goto #15
12 ─ goto #14 if not %27
13 ─ Base.arrayset(true, A, %28, %13)::Vector{Union{Missing, Float64}}
└─── goto #15
14 ─ Core.throw(ErrorException("fatal error in type inference (type bound)"))::Union{}
└─── unreachable
15 ┄ %37 = (%14 === %3)::Bool
└─── goto #17 if not %37
16 ─ goto #18
17 ─ %40 = Base.add_int(%14, 1)::Int64
└─── goto #18
18 ┄ %42 = φ (#17 => %40)::Int64
│ %43 = φ (#17 => %40)::Int64
│ %44 = φ (#16 => true, #17 => false)::Bool
│ %45 = Base.not_int(%44)::Bool
└─── goto #20 if not %45
19 ─ goto #5
20 ┄ return nothing
) => Nothing
julia> @btime addone!(A)
3.034 ms (0 allocations: 0 bytes)
julia> @btime addone!(B)
9.824 ms (0 allocations: 0 bytes)
Having an array of Union{Float64,Missing} requires the compiler add a special cases for missing elements, so it is typically slower
I'm aware of that, but do you use a netCDF array in performance critical loops? I wouldn't.
I would copy a section of the array from the netCDF file into the main memory for performance:
v = copy(ds["myvar"][:, :, 3,:]) # v is Array{Union{Float64,Missing}}
# work on v[:,:,:] in the loop
So, if the performance degradation of Missing is really problematic for you, you would change the above code to
v = replace(ds["myvar"][:,:,3,:], missing=>NaN) # v is Array{Float64}
# work on v in the loop.
I mean, if you need performance, you copy the data from the netCDF file into a native array anyway. So, if the overhead of Missing really matters, you just replace missing
to NaN
when copying your data. The cost of this replacement is negligible compared to the read from the file into memory.
Or, do you really mean that a netCDF array is as fast as a native array and so people actually use netCDF arrays in performance critical loops without copying them into native arrays?
A Matrix{Float64} can be passed to a C function without copying which is not the case for a Matrix{Union{Float64,Missing}}.
But you have to copy the data from the netCDF file to a native array anyway. I mean, you have to copy ds["myvar"][:,:,:]
into Array{Float64,3}
and send the latter to the C function. Is that right? The copy is necessary whether the netCDF array has Missing or not. So, in a situation where Missing doesn't work, you copy into a Float64 array, not into a Union{Missing, Float64} array.
Additional point to consider: the CF convention does not allow _FillValue for coordinate variables . . .
I agree that that's a real problem. (You could scan the coordinate variables and delete _FillValue before saving it, but it's ugly, I admit.)
I'm aware of that, but do you use a netCDF array in performance critical loops? I wouldn't.
Once you index a NCVariable.CFVariable
(e.g. ds["var"][:,:]) you get an plain Julia array (and directly suitable for number crunching) and no longer a NetCDF array.
In your example the copy
is unnecessary.
If you use only small arrays, then the need to copy would not matter to you. But for very large arrays (or systems with low memory), this can be an issue (I had past issue report about this).
Concerning the copy, assuming that you copy an array from a netcdf file (without fill value), currently the flow is:
input.nc -> Array{Float64,3} -> output.nc
With a fill value set it is:
input.nc -> Array{Float64,3} -> Array{Union{Float64,Missing},3} -> Array{Float64,3} -> output.nc
The netCDF C code can only deal with Array{Float64,3}
.
I am closing this issue because I think that the current behavior best serves the most users.
Sorry for my silence for this issue. This message is just for discussion. I don't mean to re-open the issue.
If you use only small arrays, then the need to copy would not matter to you. But for very large arrays (or systems with low memory), this can be an issue (I had past issue report about this).
What do you mean by "this"? If the entire array doesn't fit into the memory, you want to use the netCDF array without copying it. But in that case, you'll use the netCDF array without copying it into the memory. That's what you are saying. I that right?
But you don't want to that in a tight loop because a lot of time will be then spend to read pieces of the big array from the file.
So, it seems to me you are confusing these things:
In cases 1 & 2, you handle Missing
when copying the (slice of the) array into memory.
In case 3, the performance degradation due to Missing
doesn't matter. If you use a big netCDF array within the loop, the time spent to read the pieces from the file overwhelms the time to handle Missing
. If you don't like the low performance, you have to go to method 1 or 2.
So, that was my argument. Performance hit due to Missing
doesn't matter for a netCDF array. If it matters, you copy the data from the file and remove Missing
when copying.
Describe the bug
In some cases,
NCDatasets
writes its own "missing" values without setting_FillValue
. A netCDF file created like that doesn't work correctly. (A workaround is to explicitly set_FillValue
when you create the file.)To Reproduce
Environment
versioninfo()
: See below.using Pkg; Pkg.status(mode=PKGMODE_MANIFEST)
: See belowFull output
versioninfo()
: