HDFGroup / h5pyd

h5py distributed - Python client library for HDF Rest API

Speed improvements to loading HDF5 trees #25

Open rayosborn opened 7 years ago

rayosborn commented 7 years ago

In the nexusformat API, we load the entire HDF5 file tree by recursively walking through the groups in h5py, without reading in data values except for scalars and small arrays. On a local file, we can load files containing hundreds of objects without a significant time delay. For example, a file with 80 objects (groups, datasets, and attributes) takes 0.05s to load on my laptop. However, on h5pyd, the same load takes over 20s.
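A minimal sketch of the kind of walk described above, not the nexusformat implementation. `load_tree` and `max_inline_size` are hypothetical names, and the code assumes h5py-style objects: groups expose `.items()`, datasets expose `.shape`, `.dtype`, `.size`, and `obj[()]`. Attributes and named datatypes are left out for brevity.

```python
def load_tree(group, max_inline_size=1000):
    """Recursively collect metadata for every object under `group`.

    Values are read only for datasets with at most `max_inline_size`
    elements (a scalar has size 1); larger datasets keep metadata only.
    The threshold is an illustrative assumption.
    """
    tree = {}
    for name, obj in group.items():
        if hasattr(obj, "items"):
            # Group-like object: recurse into its children.
            tree[name] = load_tree(obj, max_inline_size)
        else:
            # Dataset-like object: record shape and dtype without reading data.
            entry = {"shape": obj.shape, "dtype": str(obj.dtype)}
            if obj.size is not None and obj.size <= max_inline_size:
                entry["value"] = obj[()]  # read the whole (small) dataset
            tree[name] = entry
    return tree


# Possible usage with a local file ("example.nxs" is a placeholder path):
# import h5py
# with h5py.File("example.nxs", "r") as f:
#     tree = load_tree(f)
```

With a remote backend, every `items()` call and every metadata access in this loop can turn into one or more HTTP requests, which is where the time goes.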

A call to load all the items in an HDF5 group requires two GET requests, and sometimes three, for each object, so there could be an improvement if all the metadata (shape, dtype, etc.) for each object were returned in a single call, and an even more significant one if all the items in a group could be returned with one GET request. Loading one group of 10 objects took 29 requests in my tests.

Binary data reads are fast, though.
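To make the request counts above concrete, here is a toy model, not the h5pyd or h5serv code. `ToyServer` and its methods are hypothetical. It assumes two GETs per object (link and object metadata) plus a third only when the object has attributes, and it ignores the request needed to list the group's links.

```python
class ToyServer:
    """In-memory stand-in for a REST server that counts GET requests."""

    def __init__(self, group):
        # group maps name -> {"shape": ..., "dtype": ..., "attrs": {...}}
        self._group = group
        self.gets = 0

    def get_link(self, name):
        self.gets += 1
        return {"title": name}

    def get_object(self, name):
        self.gets += 1
        obj = self._group[name]
        return {"shape": obj["shape"], "dtype": obj["dtype"],
                "attr_count": len(obj["attrs"])}

    def get_attributes(self, name):
        self.gets += 1
        return dict(self._group[name]["attrs"])

    def get_group(self):
        # Hypothetical batched endpoint: everything in one response.
        self.gets += 1
        return {name: dict(obj) for name, obj in self._group.items()}


def load_per_object(server, names):
    """Two GETs per object, three when the object has attributes."""
    loaded = {}
    for name in names:
        server.get_link(name)
        meta = server.get_object(name)
        if meta["attr_count"] > 0:
            meta["attrs"] = server.get_attributes(name)
        loaded[name] = meta
    return loaded


def load_batched(server):
    """One GET for the whole group."""
    return server.get_group()


def make_group(n):
    """Build n toy objects; odd-numbered ones carry one attribute."""
    return {f"obj{i}": {"shape": (i,), "dtype": "float64",
                        "attrs": {"units": "m"} if i % 2 else {}}
            for i in range(n)}


if __name__ == "__main__":
    group = make_group(10)
    per_object = ToyServer(group)
    load_per_object(per_object, list(group))
    batched = ToyServer(group)
    load_batched(batched)
    print(per_object.gets, batched.gets)  # prints "25 1" for this toy group
```

Since each GET costs at least one network round trip, the per-object pattern scales with the number of objects, while the batched pattern scales with the number of groups.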

jreadey commented 7 years ago

I've added some caching logic to the group class. Try out this latest checkin: https://github.com/HDFGroup/h5pyd/commit/19994179a7bcbc23304057647e2fa953f9ccf57c.

This is not a single-operation recursive load, but I saw a speedup of about 4x when walking the tree for the sample Nexus file. This was measured using the hsls.py script in the app directory.

jreadey commented 7 years ago

@rayosborn - did you get a chance to try this out?

rayosborn commented 7 years ago

I have tested it, but I wasn't sure of the previous speeds because I forgot to do a proper timing before upgrading. I need to revert to the old version. However, I don't think I saw a factor of four. It might have been a factor of two.

jreadey commented 7 years ago

There will be some variability based on the latency between client and server. My testing was with a server running on the same LAN. Also, the test driver is different.

Did the NeXpy GUI need a lot of mods to work with h5serv? I could set it up in my environment.

rayosborn commented 7 years ago

I haven't made any changes to the NeXpy GUI yet. In the latest development version on my own clone of the nexusformat API, the nxremote branch has an added file, which subclasses the NXFile class for remote access. I was thinking of pushing this version to PyPI, since it is a test feature that only users with h5pyd would even be able to access. I'll let you know when I've done that.

jreadey commented 7 years ago

If you push the branch to GitHub, I can just grab it from there.

How would I use it to list the contents of a Nexus file?

rayosborn commented 7 years ago

The nxremote branch has been published on my GitHub. You can load a file by typing:

>>> a=nxloadremote(filepath, domain='exfac.org', server='some.server:5000')
>>> print(a.tree)

The file path is the path relative to the data directory. The module converts that to a domain name. The top domain is currently 'exfac.org' to match the test repository.

jreadey commented 1 year ago

@rayosborn - some updates on this old issue... By default, h5pyd.File(filepath) returns all the metadata for the domain in the response to the open request. h5pyd caches this, so attribute reads and link accesses don't need to talk to the server. The server limits the number of objects fetched this way to 500, so that the GET request doesn't take an inordinate amount of time for domains with lots of attributes and/or links.
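As a rough illustration of the prefetch-and-cache idea, here is a toy sketch, not the h5pyd internals. `prefetch`, `MetadataCache`, and `fetch_one` are hypothetical names. The fallback to a per-object request for anything beyond the limit is an assumption made for the example.

```python
import itertools


def prefetch(all_metadata, max_objects=500):
    """Toy server response: metadata for at most `max_objects` objects."""
    return dict(itertools.islice(all_metadata.items(), max_objects))


class MetadataCache:
    """Client-side cache seeded from the prefetch response."""

    def __init__(self, prefetched, fetch_one):
        self._cache = dict(prefetched)
        self._fetch_one = fetch_one  # callable used on a cache miss
        self.server_calls = 0        # requests made after the initial open

    def get(self, obj_id):
        if obj_id not in self._cache:
            # Assumed fallback: ask the server for this one object.
            self.server_calls += 1
            self._cache[obj_id] = self._fetch_one(obj_id)
        return self._cache[obj_id]


if __name__ == "__main__":
    domain = {f"obj{i}": {"shape": (i,)} for i in range(600)}
    cache = MetadataCache(prefetch(domain), fetch_one=domain.__getitem__)
    cache.get("obj0")    # served from the prefetched metadata
    cache.get("obj550")  # beyond the limit, so one extra request
    print(cache.server_calls)
```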

To compare against performance without the prefetch, you can use: h5pyd.File(filepath, use_cache=False). This will return information on the root group only.
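One way to run that comparison is a small timing helper like the sketch below. `time_call` is a hypothetical name, and the commented usage assumes a reachable server, a valid `filepath`, and a `walk` function of your own that traverses the tree. Timing the open together with the walk matters, because the prefetch moves work into the open call.

```python
import time


def time_call(fn, repeats=3):
    """Return the best wall-clock time in seconds over `repeats` calls to fn()."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


# Possible usage (not run here; needs h5pyd and a server):
# import h5pyd
#
# def open_and_walk(use_cache):
#     f = h5pyd.File(filepath, use_cache=use_cache)
#     try:
#         walk(f)  # whatever tree traversal you want to measure
#     finally:
#         f.close()
#
# print(time_call(lambda: open_and_walk(True)))
# print(time_call(lambda: open_and_walk(False)))
```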

rayosborn commented 1 year ago

Thanks, @jreadey. I can't look into this for a couple of weeks, but I plan to soon.