ADicksonLab / wepy

Weighted Ensemble simulation framework in Python
https://adicksonlab.github.io/wepy/index.html
MIT License

Reading weights from the HDF5 is slow #123

Open alexrd opened 7 months ago

alexrd commented 7 months ago

For weighted ensemble analyses it is common to require access to all of the weights at once. Currently, this takes tens of minutes to read the weights from reasonably-sized HDF5 files. In contrast, reading a newly computed observable can be done in seconds.
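A minimal sketch of the access pattern in question, assuming the per-trajectory layout `runs/<run>/trajectories/<traj>/weights` (the exact paths and file name in your case may differ):

```python
import h5py

# Read every trajectory's weights dataset in full; with one tiny chunk per
# cycle this pays HDF5 chunk overhead for every single value.
with h5py.File("wepy_results.h5", "r") as h5:
    trajs = h5["runs/0/trajectories"]
    all_weights = {traj_id: trajs[traj_id]["weights"][:] for traj_id in trajs}
```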

Could this potentially be helped by arranging them all in their own group, e.g. ['runs/0/weights']? Or perhaps there is another way of ensuring that the weights are written to some contiguous region of the disk?

alexrd commented 7 months ago

This is related to #37 , but not entirely.

salotz commented 7 months ago

Can you post a specific snippet, just so we can separate the different pieces that might be slow? I suspect there are multiple bottlenecks, one being #37.

The other problem, which this issue addresses, is that the weights (or any field, really) are appended to the data structures as the simulation proceeds, and we don't usually know how much to pre-allocate ahead of time. Under the hood every dataset/array in HDF5 is chunked into smaller arrays, and storage gets allocated a chunk at a time even if you don't use the whole chunk. Since we don't know how big a field array could get, we just extend the arrays one cycle's worth of data at a time. For position frames, having the chunk size correspond to a single frame isn't so bad, but for small arrays or scalars like the weights it is a problem, and that is what leads to the observed slowness.
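An illustrative sketch of that append pattern (not the wepy writer itself; the dataset path, shapes, and values are placeholders): a resizable h5py dataset needs an explicit chunk shape, and here each chunk holds just one cycle's worth of scalar weights.

```python
import h5py

with h5py.File("example.h5", "w") as h5:
    weights = h5.create_dataset(
        "runs/0/trajectories/0/weights",
        shape=(0, 1), maxshape=(None, 1), dtype="f8",
        chunks=(1, 1),              # one cycle (one scalar) per chunk
    )
    for cycle in range(1000):       # extend by one cycle's worth of data each time
        weights.resize(weights.shape[0] + 1, axis=0)
        weights[-1, 0] = 1.0 / 48   # placeholder weight value
```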

Having a single weights group runs/0/weights won't actually help much, as it really has the same problem. You would get some extra chunking by having a full cycle's worth of weights chunked together. However, I would guess the benefit of this is still small (num_walkers is typically a lot smaller than the number of cycles), and as you mentioned, reading the computed observables is fast even though they are scalars. This approach also breaks down if you have a variable number of walkers, for which better support is already implemented and lingering in #86, waiting to be pulled out for a new version so as to preserve backwards compatibility.

Solutions:

First we add support for specifying the chunk sizes for each field when creating a new HDF5 file. Then with that we can have the following strategies (not mutually exclusive):

  1. For small, well-known fields like weights, we set a chunk size (N) much larger than 1, maybe 100 or 1000. This should make reads much faster at the expense of always allocating num_walkers * N elements of disk space at the start of any simulation. This can be the default "guess" strategy for file initialization and is useful for time-based simulations (see the sketch after this list).
  2. If the user specifies a number of cycles in the sim manager, we can use some reasonably large divisor of that number of frames as the chunk size. In most cases I would guess that 2000 cycles is a large simulation, and that is still not a very large array of floats.
  3. Add a CLI tool that can automatically rechunk and otherwise process a completed dataset. I've experimented with using the h5tools for this, and you can rechunk the weights into a good layout, but it is quite a slow process when starting from a poorly chunked dataset; it is much better suited to optimizing an already reasonably chunked one. Such a tool could also trim chunk space that was allocated but never used.
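A rough sketch of strategy 1 with h5py; the chunk length N and the dataset path are illustrative guesses, not wepy defaults:

```python
import h5py

# Give small per-cycle fields a chunk length N much larger than 1 along
# the cycle axis, so a read touches far fewer chunks.
N = 1000

with h5py.File("example.h5", "w") as h5:
    weights = h5.create_dataset(
        "runs/0/trajectories/0/weights",
        shape=(0, 1), maxshape=(None, 1), dtype="f8",
        chunks=(N, 1),   # ~1000 cycles of weights per chunk
    )
```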

I was thinking in general that we need a CLI tool for merging HDF5s, extracting data, listing info, etc., and this kind of tool should fit nicely into that.
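As a rough sketch of what the rechunking piece of such a tool might look like (the helper name, dataset path, and chunk length are hypothetical, not an existing wepy CLI):

```python
import h5py

def rechunk_dataset(in_path, out_path, dset_path, chunk_len=1000):
    """Copy a poorly chunked dataset into a new file with larger chunks.

    Hypothetical helper, not part of wepy; arguments are examples only.
    """
    with h5py.File(in_path, "r") as src, h5py.File(out_path, "w") as dst:
        data = src[dset_path][:]   # one full read; slow if the source is badly chunked
        n_rows = max(1, data.shape[0])
        dst.create_dataset(
            dset_path,
            data=data,
            chunks=(min(chunk_len, n_rows), *data.shape[1:]),
        )

rechunk_dataset("wepy_results.h5", "wepy_results_rechunked.h5",
                "runs/0/trajectories/0/weights")
```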

salotz commented 7 months ago

This is pretty high on the priority list for me as well, and the pre-allocation solution probably wouldn't be too hard to implement.