SlideRuleEarth / sliderule

Server and client framework for on-demand science data processing in the cloud
https://slideruleearth.io

Using H5Coro in stand alone mode #270

Closed betolink closed 1 year ago

betolink commented 1 year ago

I wonder if it is possible to extract H5Coro into a standalone library and use it the same way we use h5py (understanding that H5Coro doesn't support certain operations). My primary interest would be to accelerate access to HDF files on S3.

If we use the h5py library with S3, a very common access pattern would be something like this (with fsspec file-like objects):


import h5py
import s3fs

# S3FS sessions are created with temporary S3 credentials and support async reads
fs = s3fs.S3FileSystem()
with fs.open("s3://HDF_URL", "rb") as file_stream:
    # h5py issues reads sequentially, so S3FS async reads go unused and performance is very slow
    with h5py.File(file_stream, "r") as h5file:
        data = h5file["/group1"]

Would it eventually be possible to use H5Coro as a drop-in replacement for h5py/S3FS for read-only operations on S3?

# reads would be either consolidated or concurrent
with h5coro.File("s3://HDF_URL", "r") as h5file:
    data = h5file["/group1"]
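
To make "consolidated or concurrent" a bit more concrete, here is a minimal sketch of the general idea. It is not H5Coro's implementation or API: `consolidate_ranges`, `read_range`, and `read_concurrently` are hypothetical names, and an in-memory `bytes` object stands in for an S3 object so the example runs without credentials.

```python
from concurrent.futures import ThreadPoolExecutor


def consolidate_ranges(ranges, max_gap=0):
    """Merge (offset, length) ranges that overlap or lie within max_gap bytes of each other."""
    merged = []
    for start, length in sorted(ranges):
        end = start + length
        if merged and start <= merged[-1][1] + max_gap:
            # Extend the previous range instead of issuing a separate request
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e - s) for s, e in merged]


def read_range(blob, offset, length):
    """Hypothetical stand-in for a single ranged GET against object storage."""
    return blob[offset:offset + length]


def read_concurrently(blob, ranges, max_workers=8):
    """Issue all range reads at once and return the results in request order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(read_range, blob, off, n) for off, n in ranges]
        return [f.result() for f in futures]


# Fake 1024-byte "object" and a set of byte ranges a reader might need
blob = bytes(range(256)) * 4
wanted = [(0, 16), (16, 16), (100, 8), (104, 8), (900, 4)]

merged = consolidate_ranges(wanted)        # fewer, larger requests
chunks = read_concurrently(blob, merged)   # fetched in parallel
```

With a real object store, `read_range` would be replaced by a ranged request, and a non-zero `max_gap` would trade a few wasted bytes for fewer round trips.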

scottyhq commented 1 year ago

Wanted to link to your nice discussion on the topic of general cloud access to NASA data @betolink https://github.com/nsidc/earthaccess/discussions/251 :) We've discussed the utility of pulling out h5coro as a stand-alone tool, but it would likely be scoped to ICESat-2 only, rather than tackling all possible HDF files out there. But perhaps someone could run with it or extend over time...

jpswinski commented 1 year ago

@betolink, at the prompting of @scottyhq and @tsutterley, we've created a pure Python implementation of H5Coro. It is still very early in its development, but as of last week we opened up the git repo and are ready to let others take a look at it.

You can find the git repo at: https://github.com/ICESat2-SlideRule/h5coro

Alternatively, you can install the Python package h5coro via pip or conda (from conda-forge).

So far, I've been able to use it to successfully read ICESat-2 ATL03 and ATL06 data, and GEDI L2 data. It is also showing a significant speedup over using s3fs, though it still isn't as fast as using SlideRule. There are still features that need to be added, and the interface needs some work to make it an easier drop-in for h5py... but those are all coming.

Let us know if you have any suggestions or find any issues. I'd be happy to continue the discussion here, or offline.

jpswinski commented 1 year ago

Future discussions on h5coro will take place within the repo at https://github.com/ICESat2-SlideRule/h5coro