JuliaGeo / NetCDF.jl

NetCDF support for the julia programming language
http://juliageo.org/NetCDF.jl/
MIT License
115 stars 28 forks source link

scale_factor, add_offset and _FillValue #39

Open Alexander-Barth opened 7 years ago

Alexander-Barth commented 7 years ago

In Python and Matlab/Octave, the NetCDF variables scale_factor and add_offset are applied when these attributes are defined, according to this simple rule:

variable2 = scale_factor * variable + add_offset

As far as I can see, this is also the default for the R package RNetCDF. Is it possible to add this behaviour to ncread and to getindex/setindex! functions? (I actually prefer the later API because indexing is much more natural).

In a similar vein, is it also possible to set values to NaN which are equal to _FillValue or maybe use DataArrays?

meggart commented 7 years ago

Yes, this should have been done a while ago. The scale_factor part should be quite simple. The other part would need a good API, maybe a keyword argument would be sufficient like: ncread(...,missingas=:dataarray) or ncread(...,missingas=:NaN)

I hope to find some time during the Christmas break to do this.

Alexander-Barth commented 7 years ago

Yes, this would be very useful. For the lower-level API, might one could have something like this:

nc = NetCDF.open("some_file.nc")
regular_array = nc.vars["variable_name"][:,:]; # as implemented now
instance_of_data_array = nc.data["variable_name"][:,:]; # data array with scaling

If this API looks fine, I can try to implement it and make a pull request (if I succeed).

visr commented 7 years ago

Just wanted to note that rather than using DataArray, we probably should use NullableArray instead. DataFrames master has already switched over, NullableArray should work better with Julia's type system.

acrosby commented 7 years ago

Has there been any progress on this @Alexander-Barth ? Is it something that I could easily submit a PR for, where would you want it to go? @meggart

meggart commented 7 years ago

I started doing this and then it turned out to be more difficult than I thought to do properly. The main problem is that currently NcVars have type parameters like Julia arrays, which means that e.g. reading from an NcVar{Float32,3} returns an Array{Float32,3} and methods like readvar! and ncread! can work in-place.

When we have a scale factor and offset, it is common that the return type is not the same as the NetCDF storage type, for example a variable might be NC_UBYTE on disk with a scale_factor of 3.f0, so the data has to be converted to floating-point when reading.

This all got very confusing and I was not really happy with the resulting interface, so I stopped working on it until I have some time to think about it. Maybe you have some good ideas how to do this, here are my attempts so far: https://github.com/JuliaGeo/NetCDF.jl/commit/1713963675cd0112ed08012f02a2f437e6d3a149

Another question would be how to deprecate the old behaviour. I think a change like this will inevitably break lots of existing code.

Alexander-Barth commented 7 years ago

Indeed, it is not easy to implement this efficiently. Maybe one can implement two sets of types (for NetCDF variable (NcVarRaw, NcVarScaled) and NetCDF files (NcFileRaw, NcFileScaled). Indexing a NcFileRaw type would produce a NcVarRaw and indexing a NcVarRaw would return the same type as in the NetCDF files (ignoring all scaling and _FillValues).

However, indexing NcFileScaled (or a better name) and NcVarScaled would return always a DataArray/NullableArray of Float64 for numbers (with scaling and masking _FillValues) and a DataArray/NullableArray of chars for characters (ignoring the scaling but masking _FillValues).

squaregoldfish commented 6 years ago

I'm completely new to Julia, but I work with netCDFs a lot so I need this functionality if I'm going to continue. Has there been any more progress on this? Otherwise I'm willing to have an attempt at implementing something.

Alexander-Barth commented 6 years ago

In case, you are interested. Here is my attempt: https://github.com/Alexander-Barth/NCDatasets.jl

meggart commented 6 years ago

@Alexander-Barth I have looked at your new interface, it looks good, and I think I will try it soon. Do you think it would make sense to factor out the netcdf_c.jl and build.jl into its own package (NetCDF-C) so that you could depend on this one? Then it would be easier to experiment with new interfaces for reading and writing NetCDF files, while minimizing code duplication.

Alexander-Barth commented 6 years ago

In fact, I started to change netcdf_c.jl quite bit and I am not sure if you are interested in these changes. I think that this file is an automatically generated file (right?). The changes I made are mainly returning e.g. scalar values instead of changing the elements of a one element vector. I also changed some parameter types from Cint to Csize_t if in the corresponding C API size_t is used.

acrosby commented 6 years ago

@Alexander-Barth Interested in trying this out.

meggart commented 6 years ago

Ok, I think then it is better to leave the auto-generated file as it is. In general, I think it would be good if you keep posting progress on your new interface, especially once you register it as a package, then we could mention it as an alternative in the README.

Alexander-Barth commented 6 years ago

Ok, I agree. The other aspect (similar to scaling) that I have implemented is returning time variables (if they have a unit attribute of the form "days/hours/.. since start date) as arrays of DateTime: https://github.com/Alexander-Barth/NCDatasets.jl/blob/master/src/NCDatasets.jl#L146

In fact, I recently registered NCDatasets. I added the following to its README file:

https://github.com/Alexander-Barth/NCDatasets.jl/commit/daeda94c3de25d3876b436c720353a1abb1e8431 I hope that this is OK for you.

visr commented 6 years ago

@Alexander-Barth thanks for showing NCDatasets, it wasn't on my radar before. We should indeed link to it in our readme as well.

Regarding netcdf_c.jl, it is currently auto generated, followed by several manual steps, which is is briefly documented in wrap.jl. Ideally this becomes fully automatic, such that updating to a newer NetCDF version is easy as possible, but it requires some work on both wrap.jl and Clang.jl. Manual edits are much quicker to get something that works though.

In my opinion we should aim towards either:

Having NCDatasets.jl depend on NetCDF.jl, to reduce double effort, and NCDatasets can use the NetCDF low level API to build any more convenient API. This is more or less what is the idea behind GDAL.jl and ArchGDAL.jl, where GDAL.jl is pretty much a 1 to 1 translation of the GDAL API to Julia, on which ArchGDAL.jl can build to make a friendlier API. The advantage is that if someone has a need for a different API, they can relatively easily build a new API on top of GDAL.jl, without having to do the (not very interesting) plumbing.

Try to merge the packages together? I understand that starting a separate package can be a great way to quickly work out new ideas, so I'm happy that you did that, and we can learn from each other. That said, I'm not sure there are any fundamental differences in where we want to go.

What are your thoughts? @meggart @Alexander-Barth

yeesian commented 6 years ago

Another motivating factor for having one of the packages depend on the other is to have a single binary (and build file) shared across both packages

Alexander-Barth commented 6 years ago

Initially, I considered indeed to depend NCDatasets.jl on NetCDF.jl but I did not choose to follow to road:

While I wanted to use simply ncid = nc_create(path,cmode).

About the merging option:

Of course I see also the overall benefit of combining all forces in a single NetCDF package, but I did not see how I could change NetCDF.jl the way I wanted it to be without massively breaking the user's current code.

meggart commented 6 years ago

Thanks for the clarification and sorry for the long response time. A general comment first: Writing this package was my first attempt at learning Julia in 2012, so there are indeed many ways to improve the code, and maybe a complete re-write as you started would make sense.

Regarding the globals, I see the problem when it comes to threading, the main reason they were introduced was to avoid unnecessary memory allocations. I don't have any objections to reverting this, I think this was over-optimizing and we should, now that Julia has threads put thread safety over saving allocations.

I am still not completely convinced of your dependency argument, since the NetCDF-C.jl package would be a very small package without any other Julia dependencies, containing just a single netcdf-c.jl source file.

I support the idea of having separate packages for imperative reading and writing data vs a Object-Dataset like approach, and definitely think you should go on and register your package as an alternative.