ICESAT-2HackWeek / h5cloud


xarray is reading more data than h5coro #4

Closed: andypbarrett closed this issue 1 year ago

andypbarrett commented 1 year ago

I was looking at h5coro last night. It doesn't read attributes, just a single numpy array for the variable of interest.

xarray reads the coordinates and attributes associated with a variable, which may mean reading several more datasets from the HDF5 file.

So I think we need to compare h5coro against h5py.
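
To illustrate the gap, here is a minimal sketch (the local file path is only a placeholder for an ATL03 granule; the group layout follows /gt2l/heights):

import h5py
import xarray as xr

# h5coro-style read: a single dataset comes back as a bare numpy array;
# no coordinate datasets or attributes are touched
with h5py.File("ATL03_example.h5", "r") as f:   # placeholder filename
    h_ph = f["gt2l/heights/h_ph"][:]

# xarray read: opening the same group also loads the coordinate datasets
# (delta_time, lat_ph, lon_ph, ...) and all of their attributes
ds = xr.open_dataset(
    "ATL03_example.h5",
    group="gt2l/heights",
    engine="h5netcdf",
    phony_dims="access",  # often needed for HDF5 files without dimension scales
)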

abarciauskas-bgse commented 1 year ago

This makes sense to me in terms of benchmarking apples to apples.

Should we perhaps compare xarray or pandas DataFrames with h5coro plus some Python object-creation steps?

I was imagining we would want to understand the total time to create something that can be used for analysis. For example, @asteiker in https://github.com/nsidc/cloud-optimized-icesat2/issues/2 suggested a time series of Jakobshavn surface height. If you just have a numpy array of surface height, without additional metadata such as timestamps and geo coordinates, you couldn't create a meaningful time series or visualization.

However, if we can just provide that numpy array of time series data, perhaps that is the simplest way to compare performance between data access methods.
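
As a rough sketch of the "object creation" half, assuming the raw read (h5coro, h5py, or otherwise) has already produced numpy arrays for h_ph, delta_time, lat_ph, and lon_ph from a /gtx/heights group (the arrays below are made up purely for illustration):

import numpy as np
import pandas as pd

# Stand-ins for arrays returned by whichever reader is being benchmarked
rng = np.random.default_rng(0)
h_ph = rng.normal(100.0, 5.0, 1_000)        # photon heights (m)
delta_time = np.linspace(0.0, 10.0, 1_000)  # seconds since the ATLAS epoch
lat_ph = np.linspace(69.0, 69.2, 1_000)
lon_ph = np.linspace(-49.6, -49.4, 1_000)

# Convert delta_time to timestamps (ATLAS standard data product epoch: 2018-01-01)
time = pd.Timestamp("2018-01-01") + pd.to_timedelta(delta_time, unit="s")

# The analysis-ready object: a time-indexed frame that supports a surface-height
# time series or a quick lat/lon visualization
df = pd.DataFrame({"h_ph": h_ph, "lat": lat_ph, "lon": lon_ph}, index=time)
print(df["h_ph"].resample("1s").mean().head())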

abarciauskas-bgse commented 1 year ago

Interested in what @rwegener2 @wildintellect and @jpswinski think as well.

weiji14 commented 1 year ago

Just got this ATL03 read to work with the H5DataFrame class from https://github.com/MAAP-Project/gedi-subsetter:

# !pip install git+https://github.com/MAAP-Project/gedi-subsetter.git
import h5py
import earthaccess

from gedi_subset.h5frame import H5DataFrame

# Authenticate with Earthdata and open an S3 filesystem for the NSIDC DAAC
auth = earthaccess.login()
s3 = earthaccess.get_s3fs_session(daac="NSIDC")

%%timeit
# this cell is timed by the %%timeit cell magic (run as its own Jupyter cell)
s3url_atl03 = "s3://nsidc-cumulus-prod-protected/ATLAS/ATL03/005/2020/01/01/ATL03_20200101053635_00840606_005_01.h5"
h5 = h5py.File(name=s3.open(s3url_atl03, 'rb'))
df = H5DataFrame(h5["gt2l/heights"])  # pandas-like view of group '/gt2l/heights'
# print(df["h_ph"])
df["h_ph"].mean()

Reported timings:

765 ms ± 63.7 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Better than xarray's 1 min 3 s, and not too far off from h5coro's 595 ms reported at https://github.com/ICESAT-2HackWeek/h5cloud/blob/1b792e0b8af1217ae23615458cd3c73e8ef96c70/first-benchmark.ipynb

Let me tidy the code up, and I can submit a PR. But yes, you'll want to compare h5coro against h5py to be fair.
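
For that apples-to-apples case (single dataset, no DataFrame construction), the matching h5py-only timing could look something like this sketch, reusing s3 and s3url_atl03 from the snippet above:

%%timeit
# Read only /gt2l/heights/h_ph as a bare numpy array: no coordinates, no
# attributes, no DataFrame, i.e. the closest h5py equivalent of the h5coro run
with h5py.File(s3.open(s3url_atl03, 'rb'), mode='r') as h5:
    h5["gt2l/heights/h_ph"][:].mean()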

weiji14 commented 1 year ago

Ok, updated benchmarks at #5.

abarciauskas-bgse commented 1 year ago

We had a few discussions on this and I think we concluded that we need tests which both read data and produce a meaningful result. @andypbarrett do you think we can close this issue for now?

andypbarrett commented 1 year ago

Yes. It was more a note to ensure that we remember that reading data and creating data structures are two different operations.
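
A sketch of how a benchmark could keep those two operations separate (imports aside, this reuses s3 and s3url_atl03 from weiji14's snippet above; the split into two timers is the point, not the particular reader):

import time

import h5py
import pandas as pd

t0 = time.perf_counter()
# Operation 1: reading data -- pull raw arrays out of the HDF5 file
with h5py.File(s3.open(s3url_atl03, 'rb'), mode='r') as h5:
    h_ph = h5["gt2l/heights/h_ph"][:]
    delta_time = h5["gt2l/heights/delta_time"][:]
t1 = time.perf_counter()

# Operation 2: creating the data structure -- build an analysis-ready series
heights = pd.Series(
    h_ph,
    index=pd.Timestamp("2018-01-01") + pd.to_timedelta(delta_time, unit="s"),
)
t2 = time.perf_counter()

print(f"read: {t1 - t0:.2f} s, build: {t2 - t1:.2f} s")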
