scale_factor, add_offset and _FillValue

Alexander-Barth commented 7 years ago

In Python and Matlab/Octave, the NetCDF variables scale_factor and add_offset are applied when these attributes are defined, according to this simple rule:

variable2 = scale_factor * variable + add_offset

As far as I can see, this is also the default for the R package RNetCDF. Is it possible to add this behaviour to ncread and to getindex/setindex! functions? (I actually prefer the later API because indexing is much more natural).

In a similar vein, is it also possible to set values to NaN which are equal to _FillValue or maybe use DataArrays?

meggart commented 7 years ago

Yes, this should have been done a while ago. The scale_factor part should be quite simple. The other part would need a good API, maybe a keyword argument would be sufficient like: ncread(...,missingas=:dataarray) or ncread(...,missingas=:NaN)

I hope to find some time during the Christmas break to do this.

Alexander-Barth commented 7 years ago

Yes, this would be very useful. For the lower-level API, might one could have something like this:

nc = NetCDF.open("some_file.nc")
regular_array = nc.vars["variable_name"][:,:]; # as implemented now
instance_of_data_array = nc.data["variable_name"][:,:]; # data array with scaling

If this API looks fine, I can try to implement it and make a pull request (if I succeed).

visr commented 7 years ago

Just wanted to note that rather than using DataArray, we probably should use NullableArray instead. DataFrames master has already switched over, NullableArray should work better with Julia's type system.

acrosby commented 7 years ago

Has there been any progress on this @Alexander-Barth ? Is it something that I could easily submit a PR for, where would you want it to go? @meggart

meggart commented 7 years ago

I started doing this and then it turned out to be more difficult than I thought to do properly. The main problem is that currently NcVars have type parameters like Julia arrays, which means that e.g. reading from an NcVar{Float32,3} returns an Array{Float32,3} and methods like readvar! and ncread! can work in-place.

When we have a scale factor and offset, it is common that the return type is not the same as the NetCDF storage type, for example a variable might be NC_UBYTE on disk with a scale_factor of 3.f0, so the data has to be converted to floating-point when reading.

This all got very confusing and I was not really happy with the resulting interface, so I stopped working on it until I have some time to think about it. Maybe you have some good ideas how to do this, here are my attempts so far: https://github.com/JuliaGeo/NetCDF.jl/commit/1713963675cd0112ed08012f02a2f437e6d3a149

Another question would be how to deprecate the old behaviour. I think a change like this will inevitably break lots of existing code.

Alexander-Barth commented 7 years ago

Indeed, it is not easy to implement this efficiently. Maybe one can implement two sets of types (for NetCDF variable (NcVarRaw, NcVarScaled) and NetCDF files (NcFileRaw, NcFileScaled). Indexing a NcFileRaw type would produce a NcVarRaw and indexing a NcVarRaw would return the same type as in the NetCDF files (ignoring all scaling and _FillValues).

However, indexing NcFileScaled (or a better name) and NcVarScaled would return always a DataArray/NullableArray of Float64 for numbers (with scaling and masking _FillValues) and a DataArray/NullableArray of chars for characters (ignoring the scaling but masking _FillValues).

squaregoldfish commented 6 years ago

I'm completely new to Julia, but I work with netCDFs a lot so I need this functionality if I'm going to continue. Has there been any more progress on this? Otherwise I'm willing to have an attempt at implementing something.

Alexander-Barth commented 6 years ago

In case, you are interested. Here is my attempt: https://github.com/Alexander-Barth/NCDatasets.jl

meggart commented 6 years ago

@Alexander-Barth I have looked at your new interface, it looks good, and I think I will try it soon. Do you think it would make sense to factor out the netcdf_c.jl and build.jl into its own package (NetCDF-C) so that you could depend on this one? Then it would be easier to experiment with new interfaces for reading and writing NetCDF files, while minimizing code duplication.

Alexander-Barth commented 6 years ago

In fact, I started to change netcdf_c.jl quite bit and I am not sure if you are interested in these changes. I think that this file is an automatically generated file (right?). The changes I made are mainly returning e.g. scalar values instead of changing the elements of a one element vector. I also changed some parameter types from Cint to Csize_t if in the corresponding C API size_t is used.

acrosby commented 6 years ago

@Alexander-Barth Interested in trying this out.

meggart commented 6 years ago

Ok, I think then it is better to leave the auto-generated file as it is. In general, I think it would be good if you keep posting progress on your new interface, especially once you register it as a package, then we could mention it as an alternative in the README.

Alexander-Barth commented 6 years ago

Ok, I agree. The other aspect (similar to scaling) that I have implemented is returning time variables (if they have a unit attribute of the form "days/hours/.. since start date) as arrays of DateTime: https://github.com/Alexander-Barth/NCDatasets.jl/blob/master/src/NCDatasets.jl#L146

In fact, I recently registered NCDatasets. I added the following to its README file:

https://github.com/Alexander-Barth/NCDatasets.jl/commit/daeda94c3de25d3876b436c720353a1abb1e8431 I hope that this is OK for you.

visr commented 6 years ago

@Alexander-Barth thanks for showing NCDatasets, it wasn't on my radar before. We should indeed link to it in our readme as well.

Regarding netcdf_c.jl, it is currently auto generated, followed by several manual steps, which is is briefly documented in wrap.jl. Ideally this becomes fully automatic, such that updating to a newer NetCDF version is easy as possible, but it requires some work on both wrap.jl and Clang.jl. Manual edits are much quicker to get something that works though.

In my opinion we should aim towards either:

Having NCDatasets.jl depend on NetCDF.jl, to reduce double effort, and NCDatasets can use the NetCDF low level API to build any more convenient API. This is more or less what is the idea behind GDAL.jl and ArchGDAL.jl, where GDAL.jl is pretty much a 1 to 1 translation of the GDAL API to Julia, on which ArchGDAL.jl can build to make a friendlier API. The advantage is that if someone has a need for a different API, they can relatively easily build a new API on top of GDAL.jl, without having to do the (not very interesting) plumbing.

Try to merge the packages together? I understand that starting a separate package can be a great way to quickly work out new ideas, so I'm happy that you did that, and we can learn from each other. That said, I'm not sure there are any fundamental differences in where we want to go.

What are your thoughts? @meggart @Alexander-Barth

yeesian commented 6 years ago

Another motivating factor for having one of the packages depend on the other is to have a single binary (and build file) shared across both packages

Alexander-Barth commented 6 years ago

Initially, I considered indeed to depend NCDatasets.jl on NetCDF.jl but I did not choose to follow to road:

the interface in netcdf_c.jl is a bit inconveniet to use e.g.

ncidp = Vector{Cint}(1)
nc_create(path,cmode,ncidp)
# do something with ncidp[1]

While I wanted to use simply ncid = nc_create(path,cmode).

Your intermediate interface (which is indeed defines nc_create the way I want to call it) makes extensive use of global variables. I try to avoid global variables (essentially to make the code easier to maintain and to avoid potential thread-safety issues).
Way too often I am in the situation where I want to install software A, but which requires software B, but which require software C, ... (not just for julia package but in general). And all of them have different requirements and which makes it very hard for the user to install them sometimes. The benefit of the dependency must be quite high compared to cost of the potential maintenance headaches in the future. Given the simplicity of the function like nc_create I do not think that the benefit is large enough.
The library libnetcdf.so is installed for both packages with apt-get or conda and it is thus shared between both packages.

About the merging option:

NetCDF.jl has already 3 interfaces (high-level, medium-level and the low-level wrapper). Adding a 4th interface would be quite confusing.
There is the risk that the user combines the different interfaces which be probably not work.

Of course I see also the overall benefit of combining all forces in a single NetCDF package, but I did not see how I could change NetCDF.jl the way I wanted it to be without massively breaking the user's current code.

meggart commented 6 years ago

Thanks for the clarification and sorry for the long response time. A general comment first: Writing this package was my first attempt at learning Julia in 2012, so there are indeed many ways to improve the code, and maybe a complete re-write as you started would make sense.

Regarding the globals, I see the problem when it comes to threading, the main reason they were introduced was to avoid unnecessary memory allocations. I don't have any objections to reverting this, I think this was over-optimizing and we should, now that Julia has threads put thread safety over saving allocations.

I am still not completely convinced of your dependency argument, since the NetCDF-C.jl package would be a very small package without any other Julia dependencies, containing just a single netcdf-c.jl source file.

I support the idea of having separate packages for imperative reading and writing data vs a Object-Dataset like approach, and definitely think you should go on and register your package as an alternative.

JuliaGeo / NetCDF.jl

scale_factor, add_offset and _FillValue #39