Open Alexander-Barth opened 7 years ago
Yes, this should have been done a while ago. The scale_factor part should be quite simple. The other part would need a good API, maybe a keyword argument would be sufficient like: ncread(...,missingas=:dataarray)
or ncread(...,missingas=:NaN)
I hope to find some time during the Christmas break to do this.
Yes, this would be very useful. For the lower-level API, might one could have something like this:
nc = NetCDF.open("some_file.nc")
regular_array = nc.vars["variable_name"][:,:]; # as implemented now
instance_of_data_array = nc.data["variable_name"][:,:]; # data array with scaling
If this API looks fine, I can try to implement it and make a pull request (if I succeed).
Just wanted to note that rather than using DataArray
, we probably should use NullableArray
instead. DataFrames
master has already switched over, NullableArray
should work better with Julia's type system.
Has there been any progress on this @Alexander-Barth ? Is it something that I could easily submit a PR for, where would you want it to go? @meggart
I started doing this and then it turned out to be more difficult than I thought to do properly. The main problem is that currently NcVar
s have type parameters like Julia arrays, which means that e.g. reading from an NcVar{Float32,3}
returns an Array{Float32,3}
and methods like readvar! and ncread! can work in-place.
When we have a scale factor and offset, it is common that the return type is not the same as the NetCDF storage type, for example a variable might be NC_UBYTE on disk with a scale_factor of 3.f0, so the data has to be converted to floating-point when reading.
This all got very confusing and I was not really happy with the resulting interface, so I stopped working on it until I have some time to think about it. Maybe you have some good ideas how to do this, here are my attempts so far: https://github.com/JuliaGeo/NetCDF.jl/commit/1713963675cd0112ed08012f02a2f437e6d3a149
Another question would be how to deprecate the old behaviour. I think a change like this will inevitably break lots of existing code.
Indeed, it is not easy to implement this efficiently. Maybe one can implement two sets of types (for NetCDF variable (NcVarRaw, NcVarScaled) and NetCDF files (NcFileRaw, NcFileScaled). Indexing a NcFileRaw type would produce a NcVarRaw and indexing a NcVarRaw would return the same type as in the NetCDF files (ignoring all scaling and _FillValues).
However, indexing NcFileScaled (or a better name) and NcVarScaled would return always a DataArray/NullableArray of Float64 for numbers (with scaling and masking _FillValues) and a DataArray/NullableArray of chars for characters (ignoring the scaling but masking _FillValues).
I'm completely new to Julia, but I work with netCDFs a lot so I need this functionality if I'm going to continue. Has there been any more progress on this? Otherwise I'm willing to have an attempt at implementing something.
In case, you are interested. Here is my attempt: https://github.com/Alexander-Barth/NCDatasets.jl
@Alexander-Barth I have looked at your new interface, it looks good, and I think I will try it soon. Do you think it would make sense to factor out the netcdf_c.jl and build.jl into its own package (NetCDF-C) so that you could depend on this one? Then it would be easier to experiment with new interfaces for reading and writing NetCDF files, while minimizing code duplication.
In fact, I started to change netcdf_c.jl quite bit and I am not sure if you are interested in these changes. I think that this file is an automatically generated file (right?). The changes I made are mainly returning e.g. scalar values instead of changing the elements of a one element vector. I also changed some parameter types from Cint to Csize_t if in the corresponding C API size_t is used.
@Alexander-Barth Interested in trying this out.
Ok, I think then it is better to leave the auto-generated file as it is. In general, I think it would be good if you keep posting progress on your new interface, especially once you register it as a package, then we could mention it as an alternative in the README.
Ok, I agree. The other aspect (similar to scaling) that I have implemented is returning time variables (if they have a unit attribute of the form "days/hours/.. since start date) as arrays of DateTime
:
https://github.com/Alexander-Barth/NCDatasets.jl/blob/master/src/NCDatasets.jl#L146
In fact, I recently registered NCDatasets. I added the following to its README file:
https://github.com/Alexander-Barth/NCDatasets.jl/commit/daeda94c3de25d3876b436c720353a1abb1e8431 I hope that this is OK for you.
@Alexander-Barth thanks for showing NCDatasets, it wasn't on my radar before. We should indeed link to it in our readme as well.
Regarding netcdf_c.jl
, it is currently auto generated, followed by several manual steps, which is is briefly documented in wrap.jl. Ideally this becomes fully automatic, such that updating to a newer NetCDF version is easy as possible, but it requires some work on both wrap.jl and Clang.jl. Manual edits are much quicker to get something that works though.
In my opinion we should aim towards either:
Having NCDatasets.jl depend on NetCDF.jl, to reduce double effort, and NCDatasets can use the NetCDF low level API to build any more convenient API. This is more or less what is the idea behind GDAL.jl and ArchGDAL.jl, where GDAL.jl is pretty much a 1 to 1 translation of the GDAL API to Julia, on which ArchGDAL.jl can build to make a friendlier API. The advantage is that if someone has a need for a different API, they can relatively easily build a new API on top of GDAL.jl, without having to do the (not very interesting) plumbing.
Try to merge the packages together? I understand that starting a separate package can be a great way to quickly work out new ideas, so I'm happy that you did that, and we can learn from each other. That said, I'm not sure there are any fundamental differences in where we want to go.
What are your thoughts? @meggart @Alexander-Barth
Another motivating factor for having one of the packages depend on the other is to have a single binary (and build file) shared across both packages
Initially, I considered indeed to depend NCDatasets.jl on NetCDF.jl but I did not choose to follow to road:
ncidp = Vector{Cint}(1)
nc_create(path,cmode,ncidp)
# do something with ncidp[1]
While I wanted to use simply ncid = nc_create(path,cmode)
.
Your intermediate interface (which is indeed defines nc_create the way I want to call it) makes extensive use of global variables. I try to avoid global variables (essentially to make the code easier to maintain and to avoid potential thread-safety issues).
Way too often I am in the situation where I want to install software A, but which requires software B, but which require software C, ... (not just for julia package but in general). And all of them have different requirements and which makes it very hard for the user to install them sometimes. The benefit of the dependency must be quite high compared to cost of the potential maintenance headaches in the future. Given the simplicity of the function like nc_create
I do not think that the benefit is large enough.
The library libnetcdf.so
is installed for both packages with apt-get or conda and it is thus shared between both packages.
About the merging option:
Of course I see also the overall benefit of combining all forces in a single NetCDF package, but I did not see how I could change NetCDF.jl the way I wanted it to be without massively breaking the user's current code.
Thanks for the clarification and sorry for the long response time. A general comment first: Writing this package was my first attempt at learning Julia in 2012, so there are indeed many ways to improve the code, and maybe a complete re-write as you started would make sense.
Regarding the globals, I see the problem when it comes to threading, the main reason they were introduced was to avoid unnecessary memory allocations. I don't have any objections to reverting this, I think this was over-optimizing and we should, now that Julia has threads put thread safety over saving allocations.
I am still not completely convinced of your dependency argument, since the NetCDF-C.jl package would be a very small package without any other Julia dependencies, containing just a single netcdf-c.jl source file.
I support the idea of having separate packages for imperative reading and writing data vs a Object-Dataset like approach, and definitely think you should go on and register your package as an alternative.
In Python and Matlab/Octave, the NetCDF variables scale_factor and add_offset are applied when these attributes are defined, according to this simple rule:
variable2 = scale_factor * variable + add_offset
As far as I can see, this is also the default for the R package RNetCDF. Is it possible to add this behaviour to ncread and to getindex/setindex! functions? (I actually prefer the later API because indexing is much more natural).
In a similar vein, is it also possible to set values to NaN which are equal to _FillValue or maybe use DataArrays?