JuliaIO / Zarr.jl


Very slow S3 read relative to xarray #65

Closed alex-s-gardner closed 1 year ago

alex-s-gardner commented 2 years ago

Thanks a ton for the great package... very happy to see support for Zarr being added to Julia. Zarr is taking off so I suspect this package is going to get a lot of use. Working with Zarr.jl I've noticed very slow read times relative to what I can achieve with xarray (with zarr engine) in Python.

Reading a column of data in Julia takes 1.7 min vs. 5 s using xarray in Python... I really want to start moving to Julia and am very keen to see the tools improve.

------------------ Julia Zarr Read ------------------

using Zarr, AWS

# set up aws configuration
AWS.global_aws_config(AWSConfig(creds=nothing, region = "us-west-2"))

# path to datacube
path = "s3://its-live-data.jpl.nasa.gov/datacubes/v1/N60W040/ITS_LIVE_vel_EPSG3413_G0120_X-150000_Y-2250000.zarr"

# map into datacube
z = zopen(path, consolidated=true)

# map to specific variables
v = z["v"]

# read data
foo = v[1,1,:] # takes ~1.7 min
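
For reference, a minimal way to time the read on the Julia side (using the v handle above; the first call also pays compilation cost, so time a second run):

# time the column read; run it twice, since the first call includes compile time
@time foo = v[1,1,:]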

------------------ Python Zarr Read ------------------

import s3fs 
import xarray as xr

output_dir = "s3://its-live-data.jpl.nasa.gov/datacubes/v1/N60W040/ITS_LIVE_vel_EPSG3413_G0120_X-150000_Y-2250000.zarr"

s3_in = s3fs.S3FileSystem(anon=True)

cube_store = s3fs.S3Map(root=output_dir, s3=s3_in, check=False)
with xr.open_dataset(cube_store, decode_timedelta=False, engine='zarr', consolidated=True) as ds:
    grid_cell_v = ds.v.isel(x=1, y=1)
    grid_cell_v.mean()
meggart commented 2 years ago

Thanks for sharing the report. I really cannot reproduce your fast Python timings; with my internet connection both versions take ages. In my tests I reduced the number of time steps to read and got very similar timings:

Python:

    grid_cell_v = ds.v.isel(mid_date=slice(0,100), x=1, y=1)
    grid_cell_v.mean()

Julia:

    foo = v[1,1,1:100]

Are you sure you are not running into some lazy vs. eager evaluation issues, for example that dask is doing the mean computation lazily? If you can confirm this is not the case and you actually see a result printed by the Python version this quickly, the next things to check would be:

1. Multithreaded decompression: AFAIK numcodecs does multithreaded decompression by default. You can globally set the number of threads Blosc uses with: using Blosc; Blosc.set_num_threads(n) (where n is the number of CPU cores).

2. Multithreaded data transfer: Another difference is that xarray reads different chunks in a multi-threaded manner, which could help if you have a fast connection to the data. If you want to, try something like this:

    v2 = view(v, 1, 1, 1:100)
    function threaded_read(xin)
        xout = similar(xin)
        Threads.@threads for i in map(i -> i.indices, Zarr.DiskArrays.eachchunk(xin))
            xout[i...] = xin[i...]
        end
        xout
    end
    xout = threaded_read(v2)

    Alternatively, using @async or similar might result in comparable speedups; see the sketch below. If this helps, I would be very interested in the benchmarks, and one could try to make this the default when reading data.
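
    A minimal sketch of that task-based variant, assuming the v2 view and the Zarr.DiskArrays.eachchunk iterator from the snippet above (async_read is a hypothetical name, not a Zarr.jl function):

    # Hypothetical task-based variant: spawn one task per chunk so the
    # S3 requests can overlap instead of running one after another.
    function async_read(xin)
        xout = similar(xin)
        @sync for i in map(i -> i.indices, Zarr.DiskArrays.eachchunk(xin))
            Threads.@spawn xout[i...] = xin[i...]
        end
        xout
    end
    xout = async_read(v2)

    Whether this beats Threads.@threads depends mostly on network latency, since the per-chunk work here is dominated by the S3 request rather than by CPU.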

alex-s-gardner commented 2 years ago

@meggart thanks for the quick response. So I did a bit more testing; it seems there were multiple reasons for the difference between Python (xarray) and Julia (Zarr.jl):

I think there are some lazy vs. eager differences; accounting for this, the Python read time is ~23 s [not 5 s].

The equivalent read in Julia (v[1,1,:]) takes 80 s.

Using your threaded_read(v2) takes only 20 s... so a 4x speedup, which is on par with the Python read time.

Maybe multithreading should be made the default?

Balinus commented 1 year ago

> Maybe multithreading should be made the default?

Yes, especially useful for clusters with nodes > 30 cores.
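
For any of the threaded approaches above to help, Julia has to be started with multiple threads (e.g. julia -t auto, or via the JULIA_NUM_THREADS environment variable); a quick sanity check:

# should report > 1 for Threads.@threads / Threads.@spawn to parallelize
Threads.nthreads()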

meggart commented 1 year ago

Can we view this as resolved through #106?

alex-s-gardner commented 1 year ago

I haven't had a chance to test, but if multithreading is now the default with #106, then I consider this issue closed.