ICESAT-2HackWeek / h5cloud

BSD 3-Clause "New" or "Revised" License

Investigate why access performance isn't improved uniformly by repacking metadata #19

Open · betolink opened 1 year ago

betolink commented 1 year ago

On the files we tested over Antarctica, repacking the metadata with h5repack didn't improve access times dramatically, especially for xarray and h5py. These granules contained a lot of data; each was around 6–7 GB in size, with ~7 MB of metadata. They were selected and processed using this notebook

e.g. ATL03_20181120182818_08110112_006_02.h5 ~7GB in size and 7MB of metadata

Note: The S3 bucket with the original data is gone but can be easily recreated.

![arr_mean_bar_plot](https://raw.githubusercontent.com/ICESAT-2HackWeek/h5cloud/1f3441190951e5a2da74611f1196a657db7035bd/notebooks/arr_mean_bar_plot.png)

However, for other granules with less data, repacking yielded a 10x improvement for xarray.

e.g. ATL03_20220201060852_06261401_005_01.h5 ~500MB in size and 3MB of metadata

After applying h5repack to both files, the access time for the first one does not improve for xarray, but for the second granule it drops from 1 minute to 5 seconds. Why?


```python
import s3fs
import xarray as xr

s3 = s3fs.S3FileSystem()
file = 's3://...'  # placeholder: S3 URL of the granule under test

group = '/gt2l/heights'
variable = 'h_ph'

with s3.open(file, 'rb') as file_stream:
    ds = xr.open_dataset(file_stream, group=group, engine='h5netcdf')
    variable_mean = ds[variable].mean()
```
I'm going to repack the original files and put them on a more durable bucket, along with more examples from other NASA datasets.
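For reference, a minimal sketch of the repacking step: `-S PAGE` switches the file space strategy to paged aggregation and `-G` sets the file space page size. The 8 MiB page size and the file names are illustrative assumptions, not the values used in the tests above.

```python
import shutil
import subprocess

def repack_cmd(src, dst, page_size=8 * 1024 * 1024):
    """Build an h5repack invocation that enables paged aggregation."""
    return [
        "h5repack",
        "-S", "PAGE",          # file space strategy: paged aggregation
        "-G", str(page_size),  # file space page size in bytes
        src,
        dst,
    ]

cmd = repack_cmd("ATL03_in.h5", "ATL03_repacked.h5")
print(" ".join(cmd))
# To actually run it (requires h5repack on PATH and the input file):
# if shutil.which("h5repack"):
#     subprocess.run(cmd, check=True)
```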

Maybe @ajelenak has some clues on why this may be happening.

### Tasks
- [ ] https://github.com/ICESAT-2HackWeek/h5cloud/issues/28
- [ ] https://github.com/ICESAT-2HackWeek/h5cloud/issues/29
- [ ] https://github.com/ICESAT-2HackWeek/h5cloud/issues/27
- [ ] https://github.com/ICESAT-2HackWeek/h5cloud/issues/25
- [ ] https://github.com/ICESAT-2HackWeek/h5cloud/issues/24
- [ ] https://github.com/ICESAT-2HackWeek/h5cloud/issues/23
- [ ] https://github.com/ICESAT-2HackWeek/h5cloud/issues/26
### Tasks
ajelenak commented 1 year ago

Hi @betolink,

Repacking the file is the necessary first step, but then libhdf5 must be told to use the features available in the repacked file. I know this can be done from h5py, but I have not yet verified whether the same is possible from xarray and h5netcdf. It probably is, because I've seen xarray code where backend storage engine options are set in the open_dataset() call.

The variable mean calculation example reads all the data for the /gt2l/heights/h_ph dataset only once and then discards it, which means the available libhdf5 caches may not help much in this use case.

betolink commented 1 year ago

The curious thing is that in some instances repacked files see faster access times than their non-repacked originals without passing any special parameters to h5py or xarray.

ajelenak commented 1 year ago

That's probably because of the paged aggregation applied to the repacked file, which forces libhdf5 to make S3 requests only at the file page size. Those pages bring back much more data (likely quite a few chunks per request) than the original file, where libhdf5 can make S3 requests as small as 8 bytes.
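The effect can be illustrated without S3 at all. The sketch below (pure Python, with a made-up in-memory "file") counts how many range requests it takes to cover the same bytes with 8-byte reads versus page-sized reads:

```python
import io

class CountingReader:
    """Wrap a byte stream and count read calls (stand-ins for S3 range requests)."""
    def __init__(self, data):
        self._buf = io.BytesIO(data)
        self.requests = 0

    def read_range(self, offset, size):
        self.requests += 1
        self._buf.seek(offset)
        return self._buf.read(size)

data = bytes(1024 * 1024)  # a pretend 1 MiB file

# Unpaged file: libhdf5 may issue many tiny reads, as small as 8 bytes each.
tiny = CountingReader(data)
for offset in range(0, len(data), 8):
    tiny.read_range(offset, 8)

# Paged file: reads happen in whole pages, each covering many chunks at once.
page_size = 64 * 1024
paged = CountingReader(data)
for offset in range(0, len(data), page_size):
    paged.read_range(offset, page_size)

print(tiny.requests, paged.requests)  # 131072 vs 16 requests
```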

betolink commented 9 months ago

We had a very interesting conversation/brainstorming session with @ajelenak during AGU23. He is developing tools to trace the behavior of h5py over the network: https://github.com/ajelenak/ros3vfd-log-info. We'll use them to get a better idea of how repacking and page aggregation affect file access times. I'm not sure whether this tool can be used with h5py over fsspec or only with the ros3 driver.

ajelenak commented 9 months ago

Currently it can only parse libhdf5's ros3 driver logs. I was interested in those because they are the most accurate record of where in a file, and how many bytes, libhdf5 is reading. An fsspec log parser can certainly be added. Do you have one to share?

betolink commented 8 months ago

Working on it! @ajelenak, fsspec logs are too verbose, and I'm figuring out how we can filter them before they get flushed to match what this tool needs.
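One way to cut the verbosity before records are flushed is a standard `logging.Filter` attached to the s3fs logger. A minimal sketch; the substring matched here ("range") is an assumption about the log message format, not something the tool requires:

```python
import logging

class RangeRequestFilter(logging.Filter):
    """Keep only records that mention a byte-range read; drop everything else."""
    def filter(self, record):
        # Assumption: read-related records mention "range" in their message.
        return "range" in record.getMessage().lower()

handler = logging.StreamHandler()
handler.addFilter(RangeRequestFilter())

logger = logging.getLogger("s3fs")  # fsspec's S3 filesystem logger
logger.setLevel(logging.DEBUG)
logger.addHandler(handler)
```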