NCAS-CMS / pyfive

A pure Python HDF5 file reader
BSD 3-Clause "New" or "Revised" License

HDF5 file layout and performance #3

Open bnlawrence opened 7 months ago

bnlawrence commented 7 months ago

Here are two comparisons of opening a file on a POSIX file system using h5py and pyfive:

python opening_speed.py 
File Opening Time Comparison
h5py:    0.015273
pyfive:  0.005531
Additional times:  0.000124,  0.003239

File Opening Time Comparison
h5py:    0.054081
pyfive:  0.387869
Additional times:  0.000317,  0.000853

This almost certainly has something to do with how the file is laid out, in terms of where the indexes etc. go, but the performance difference is heavily exacerbated when the file is on S3. It would be good to have the capability to diagnose this sort of thing. Could we modify pyfive to provide a "layout diagnostic view"?
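As a rough starting point for such a diagnostic, one option is to wrap the file object and log every read. This is only a sketch, not something pyfive provides: `ReadLogger` and `summary` are hypothetical names, and it assumes the reader gets its bytes through plain `read()` calls (anything else, such as `readinto()`, is passed through unlogged).

```python
import io


class ReadLogger:
    """Wrap a binary file object and record (offset, nbytes) for every read()."""

    def __init__(self, fh):
        self._fh = fh
        self.reads = []  # list of (offset, bytes_returned)

    def read(self, size=-1):
        offset = self._fh.tell()
        data = self._fh.read(size)
        self.reads.append((offset, len(data)))
        return data

    def __getattr__(self, name):
        # Delegate seek, tell, close, etc. to the wrapped file object.
        return getattr(self._fh, name)

    def summary(self):
        """Count reads, total bytes, and reads that did not follow on
        directly from the end of the previous read."""
        jumps = 0
        expected = None
        for offset, nbytes in self.reads:
            if expected is not None and offset != expected:
                jumps += 1
            expected = offset + nbytes
        return {
            "reads": len(self.reads),
            "bytes": sum(n for _, n in self.reads),
            "jumps": jumps,
        }


# Stand-in for a real file: 100 zero bytes held in memory.
demo = ReadLogger(io.BytesIO(bytes(100)))
demo.read(8)    # at offset 0
demo.seek(64)
demo.read(16)   # at offset 64: a jump
demo.read(4)    # at offset 80: contiguous with the previous read
print(demo.summary())
```

For a real file, the wrapper would go around `open(path, "rb")` and be handed to the reader in place of a filename, assuming the reader accepts file-like objects. On S3 each jump is potentially a separate request, so the jump count is probably the number to watch.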

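The additional times are h4-h3 and h5-h4 from the following, where f2 is the open pyfive file instance:

h3 = time.time()
v = f2['var']              # look up the variable
d = v._dataobjects         # its underlying data object
h4 = time.time()
d._get_chunk_addresses()   # read the chunk index
h5 = time.time()

For the second file both are tiny next to the 0.387869 taken by the pyfive open, which suggests the b-tree read itself is very fast.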
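If more of these timings get added, the h3/h4/h5 pattern could be wrapped in a small helper. A minimal sketch, where `timed` and `results` are hypothetical names; it uses `time.perf_counter()` rather than `time.time()`, since the former is meant for measuring short intervals. The timed bodies here are stand-ins so the snippet runs without a file.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Store the elapsed wall-clock seconds for the enclosed block in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = time.perf_counter() - start


results = {}
with timed("lookup", results):
    total = sum(range(1000))   # stand-in for: d = f2['var']._dataobjects
with timed("chunk_index", results):
    time.sleep(0.01)           # stand-in for: d._get_chunk_addresses()

for label, seconds in results.items():
    print(f"{label}: {seconds:.6f}")
```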

bnlawrence commented 7 months ago

For the record, these files differ significantly. The ncdump outputs are in file1.txt and file2.txt.

The former is smaller and only has one variable. The header and b-tree layouts will be significantly different.