HDFGroup / nasa_cloud

Apache License 2.0

Improve hsload --link performance #5

Closed jreadey closed 1 year ago

jreadey commented 1 year ago

Use H5Dchunk_iter rather than chunk_info in https://github.com/HDFGroup/h5pyd/blob/master/h5pyd/_apps/utillib.py. Testing shows that this can give a speedup of over 500x for datasets with a large number of chunks. Check the HDF5 library version and fall back to chunk_info if H5Dchunk_iter is not available.
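A minimal sketch of the fallback described above, not the actual utillib.py code. It assumes `dsid` behaves like an h5py `DatasetID` (i.e. `dset.id`), that h5py only exposes `chunk_iter` when the underlying HDF5 library provides H5Dchunk_iter, and that the chunk records carry `chunk_offset`, `byte_offset` and `size` fields. The function name `collect_chunk_locations` is hypothetical.

```python
def collect_chunk_locations(dsid):
    """Map each chunk's logical offset to its (byte_offset, size) in the file.

    `dsid` is assumed to look like an h5py DatasetID for a chunked dataset.
    """
    chunks = {}

    def record(info):
        # `info` is assumed to expose chunk_offset, byte_offset and size,
        # as h5py's StoreInfo records do.
        chunks[tuple(info.chunk_offset)] = (info.byte_offset, info.size)

    if hasattr(dsid, "chunk_iter"):
        # Preferred path: one pass over the chunk index via H5Dchunk_iter.
        dsid.chunk_iter(record)
    else:
        # Fallback for older HDF5 builds: one lookup per chunk index.
        for index in range(dsid.get_num_chunks()):
            record(dsid.get_chunk_info(index))

    return chunks
```

With h5py this would be called as `collect_chunk_locations(dset.id)` on a chunked dataset. Checking for the method rather than parsing a version string is a design choice of this sketch; an explicit version check would work as well.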

mattjala commented 1 year ago

My initial test actually showed a slight slowdown when I tried it on the ATL03 file - 25.3 minutes with get_chunk_info, 26.8 minutes with chunk_iter.

ajelenak commented 1 year ago

Try with the cloud-optimized version of that file.

mattjala commented 1 year ago

> Try with the cloud-optimized version of that file.

Are you talking about using a paging strategy with a page size of 4-8 MB? The ATL03 file uses the H5F_FSPACE_STRATEGY_FSM_AGGR strategy, and the documentation says the strategy and page size are immutable for an already created file. If there are external tools that can alter this, I'm not aware of them.

mattjala commented 1 year ago

Implemented in HDFGroup/h5pyd#148