Performance and array semantics

simonbyrne commented 5 years ago

Calling functions which iterate over array elements (e.g. Statistics.mean) on netcdf datasets can be very slow. It would be useful to have some performance tips to e.g. first convert a dataset to an Array.

On that note, I noticed that simply calling Array(dataset) is also slow. I take it from the examples in the manual that the suggested way to convert is to call dataset[:]. However this has different behaviour from ordinary Julia multidimensional arrays, which return a vector:

julia> X = rand(3,3);

julia> X[:]
9-element Array{Float64,1}:
 0.1443316789104081
 0.8768272466421989
 0.01240381950170022
 0.6732249425627772
 0.4128628021781906
 0.2112718766221251
 0.9459472658879167
 0.8996044010964837
 0.47806340748050236

Alexander-Barth commented 5 years ago

Thank you for your insightful comments!

Indeed, Statistics.mean (and friends) are not overloaded and thus they are fetching every element individually from the netCDF file which is slow. I added a section in the documentation about performance as suggested.

It never occurred to me, to load an array in RAM with the Array constructor, but it makes sense and I added a method for this which give a speed-up as expected. If myvar is a 100x100 Float64 Matrix, before we had:

julia> @btime Array(ds["myvar"]) 
  115.359 ms (170042 allocations: 8.93 MiB)

Now after my recent commit:

julia> @btime Array(ds["myvar"]);
  69.404 μs (49 allocations: 170.43 KiB)

It is true that ncvar[:] is not consistent with the behaviour of regular Julia arrays, but it was a price that I was willing to pay for a concise syntax. In fact, the first NetCDF library that I used was the netcdf matlab toolbox (now abandoned) which implemented exactly the same (slightly inconsistent) behaviour in matlab.

What alternatives could be implemented?

ds["myvar"][:,:]  # does not work so well if the dimensionality of myvar is not known a priori
Array(ds["myvar"]) # works now, but it feels a bit wordy to me
ds["myvar"][] # like references, but it looks a bit obscure to most people I guess

simonbyrne commented 5 years ago

Great, thanks!

I don't think there's anything wrong with Array(ds["myvar"]). If you want something that would work for scalars as well, it would probably work to overload copy?

Alexander-Barth commented 5 years ago

I had a look at HDF5.jl (https://github.com/JuliaIO/HDF5.jl/blob/master/doc/hdf5.md#reading-and-writing-data) and they use read, for example:

variable = read(ds["myvar"])

Alexander-Barth / NCDatasets.jl

Performance and array semantics #33