Closed alex-s-gardner closed 1 year ago
Thanks for sharing the report. I really cannot reproduce your fast Python timings; with my internet connection both versions take ages. In my tests I reduced the number of time steps to read and got very similar timings:
grid_cell_v = ds.v.isel(mid_date=slice(0,100), x=1, y=1)   # Python / xarray
grid_cell_v.mean()
foo = v[1,1,1:100]                                         # Julia / Zarr.jl
Are you sure you are not running into some lazy vs. eager evaluation issues, for example that dask is doing the mean computation lazily? If you can confirm this is not the case and the Python version really prints a result this fast, the next things to check would be:
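For reference, the lazy-vs-eager distinction is easy to see with dask directly (dask backs xarray's chunked arrays); the array and chunk sizes here are illustrative, not from the thread:

```python
import numpy as np
import dask.array as da

# Building the expression is near-instant: no data is read or reduced yet.
arr = da.from_array(np.arange(1000, dtype="float64"), chunks=100)
lazy_mean = arr.mean()            # a lazy dask expression, not a number

# Only an explicit compute (or a coercion like float()) does the work,
# so timing the line above alone would wildly understate the real cost.
value = float(lazy_mean.compute())
print(value)                      # 499.5
```

Timing only the first two lines is the trap suspected here: the stopwatch stops before any bytes are decompressed.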
1. Multithreaded decompression: AFAIK numcodecs does multithreaded data decompression by default. In Julia you can globally set the number of threads Blosc uses with: using Blosc; Blosc.set_num_threads(n=CPU_CORES)
2. Multithreaded reading: v2 = view(v,1,1,1:100)
function threaded_read(xin)
    xout = similar(xin)
    # Read each chunk on its own thread; eachchunk yields the chunk index ranges.
    Threads.@threads for i in map(c -> c.indices, Zarr.DiskArrays.eachchunk(xin))
        xout[i...] = xin[i...]
    end
    xout
end
xout = threaded_read(v2)
Alternatively, using @async or similar might result in comparable speedups. If this helps, I would be very interested in the benchmarks, and one could try to make this the default when reading data.
@meggart thanks for the quick response. I did a bit more testing; it seems there were multiple reasons for the difference between Python (xarray) and Julia (Zarr.jl):
There are some lazy vs. eager differences; accounting for this, the Python read time is ~23 s [not 5 s].
The equivalent read in Julia (v[1,1,:]) takes 80 s.
Using your threaded_read(v2) takes only 20 s... a 4x speedup, on par with the Python read times.
Maybe multithreading should be made the default?
> Maybe multithreading should be made the default?
Yes, especially useful for clusters with nodes > 30 cores.
Can we view this as resolved through #106?
I haven't had a chance to test, but if multithreading is now the default with #106, then I see this issue as closed.
Thanks a ton for the great package... very happy to see support for Zarr being added to Julia. Zarr is taking off, so I suspect this package is going to get a lot of use. Working with Zarr.jl, I've noticed very slow read times relative to what I can achieve with xarray (with the zarr engine) in Python.
Reading a column of data in Julia takes 1.7 min vs. 5 sec using xarray in Python... I really want to start moving to Julia and am very keen to see the tools improve.
------------------ Julia Zarr Read ------------------
------------------ Python Zarr read ------------------