amusecode / amuse

Astrophysical Multipurpose Software Environment. This is the main repository for AMUSE
http://www.amusecode.org
Apache License 2.0

write_set_to_file: allow use of latest HDF5 library in h5py #1001

Closed brentmaas closed 8 months ago

brentmaas commented 8 months ago

Is your feature request related to a problem? Please describe. I've noticed that long-running scripts that iteratively write data using amuse.io.write_set_to_file tend to become noticeably slower the further into the run they get, even though the workload of each iteration should take constant time. After some digging, I found that this is a known problem with older HDF5 library versions in h5py. The solution in that Stack Overflow thread suggests adding the parameter libver='latest' to the instantiation of h5py.File, to use a more recent version of the HDF5 file format in which this issue is fixed.
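For reference, libver is an existing argument of h5py.File; a minimal illustration (the file name here is just an example):

import h5py

# libver='latest' lets HDF5 use its newest on-disk format features; with
# the default setting, repeated appends slow down as described above.
with h5py.File("example.h5", "a", libver="latest") as f:
    f.create_dataset("x", data=[1, 2, 3])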

This can be tested using a simple script:

import numpy as np
from amuse.io import write_set_to_file
import amuse.lab as al
import amuse.units.units as au
import time

particles = al.Particles(5)
particles.x = np.arange(5) | au.pc
t = time.time()
for i in range(1, 10001):
    # Emulate some kind of constant-time process that changes particles.x
    time.sleep(0.001)
    particles.x = (np.arange(5) + i) | au.pc

    write_set_to_file(particles.savepoint(i | au.yr), "test.amuse", "amuse", append_to_file=True)

    if i % 1000 == 0:
        t2 = time.time()
        print(f"{i-1000}-{i}: {1000/(t2-t):.1f} iterations per second")
        t = t2

On my unedited version of AMUSE, this will output the following:

0-1000: 417.8 iterations per second
1000-2000: 366.1 iterations per second
2000-3000: 327.4 iterations per second
3000-4000: 294.0 iterations per second
4000-5000: 267.4 iterations per second
5000-6000: 245.5 iterations per second
6000-7000: 220.0 iterations per second
7000-8000: 207.4 iterations per second
8000-9000: 200.0 iterations per second
9000-10000: 186.9 iterations per second

However, if I edit amuse.io.store_v2.py to add libver='latest' on lines 743, 745, and 747, as the Stack Overflow solution suggests, I get the following:

0-1000: 448.1 iterations per second
1000-2000: 439.7 iterations per second
2000-3000: 434.9 iterations per second
3000-4000: 440.3 iterations per second
4000-5000: 429.4 iterations per second
5000-6000: 434.1 iterations per second
6000-7000: 427.1 iterations per second
7000-8000: 439.2 iterations per second
8000-9000: 432.8 iterations per second
9000-10000: 422.9 iterations per second

That is a significant speed improvement. The output file is also almost half the size: 60 MB, compared to the 111 MB of the first run.
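For concreteness, the edit has roughly this shape (a sketch: the mode-selection logic assumed here may differ from the actual code around lines 743-747 of store_v2.py):

import h5py

def open_hdf5(filename, append_to_file, open_for_writing):
    # Sketch of the assumed mode selection; the change itself is just the
    # added libver='latest' in each h5py.File instantiation.
    if append_to_file:
        return h5py.File(filename, 'a', libver='latest')
    if open_for_writing:
        return h5py.File(filename, 'w', libver='latest')
    return h5py.File(filename, 'r', libver='latest')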

It is important to note the reason the Stack Overflow solution gives for this not being the default behaviour in h5py: compatibility. While an unedited read_set_from_file still seems to read both output files fine, I cannot rule out breaking changes.

Describe the solution you'd like Given the clear positive effect but unknown probability of breaking changes, I think adding a flag (e.g. h5py_use_latest_libver) to write_set_to_file would be the correct way to go: the default (False) would keep the current behaviour, and True would add libver='latest' to any underlying h5py.File instantiation. Alternatively, a parameter (e.g. h5py_libver) could be added that passes its value directly to libver in h5py.File, so that the other features of libver can also be used if someone wants them; see the sketch below.
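A minimal sketch of that second variant; the helper below only illustrates the intended plumbing and is not AMUSE's actual internals:

import h5py

def open_store(filename, mode, h5py_libver=None):
    # Hypothetical plumbing: forward the caller's libver choice to
    # h5py.File only when one was given, so the default (None) keeps
    # the current behaviour.
    if h5py_libver is not None:
        return h5py.File(filename, mode, libver=h5py_libver)
    return h5py.File(filename, mode)

# Opting in uses the newest file format; omitting the argument changes nothing.
with open_store("demo.h5", "w", h5py_libver="latest") as f:
    f.attrs["written_with"] = "libver=latest"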

Additional context
Operating system version: Linux 6.5.9-arch2-1
Compiler version: GCC 13.2.1
Python version: 3.11.5
AMUSE version: commit 6510b63f4e6fe26e7828d629a2c12e2dd60120f3 (24 September 2023); effectively latest for the entirety of amuse.io
h5py version: 3.10.0
NumPy version: 1.25.2

rieder commented 8 months ago

Thanks for alerting us to this, @brentmaas! I think adding a flag may indeed be a good solution here, and we can probably even make using the latest HDF5 library version the default.

LourensVeen commented 8 months ago

Hi Brent, nice find, and thanks for reporting it!

I disagree about adding a flag. We don't know how many issues switching from an older to a newer library version will cause, but we also don't know how many issues it will solve (it may be more than just this one).

The solution to that is having a test suite with good coverage, and operating under the assumption that if it isn't being tested, it's broken anyway. Therefore, if a change doesn't break any tests (and fixes an issue for which you've just added a test), then it has made things better and should be applied. If a change increases the complexity of the software and the user interface (as adding a flag would), then you need to consider whether that increase in complexity is worth the added functionality.

So, in this particular instance, I would just make the change. We have no evidence that doing so would break anything, and so no justification for the added complexity of adding a flag. If a user shows up whose script got broken, then they can temporarily downgrade to the old version until we implement an appropriate fix for their issue, be that a flag or something else.

rieder commented 8 months ago

Thanks @LourensVeen - you're probably right :). Let's make this change in a PR and see if anything breaks. @brentmaas since you reported it, would you want to create the PR?

brentmaas commented 8 months ago

Sure, I'll have a go at it!

rieder commented 8 months ago

Though I would say that passing additional kwargs from write_set_to_file to h5py doesn't seem like such a bad idea to me either... But perhaps that's not for this issue.
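That could look something like this (illustrative names only, not AMUSE's actual code):

import h5py

def write_with_kwargs(filename, mode="a", **h5py_kwargs):
    # Illustrative pass-through: any extra keyword arguments, e.g.
    # libver='latest' or driver='core', go straight to h5py.File.
    return h5py.File(filename, mode, **h5py_kwargs)

f = write_with_kwargs("kwargs_demo.h5", mode="w", libver="latest")
f.close()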

LourensVeen commented 8 months ago

Also, it looks like this will break compatibility with HDF5 versions older than 1.10 (released in March 2016), but only if the user writes their files using 1.10 or later first. That doesn't seem a very likely scenario.

rieder commented 8 months ago

No, that doesn't seem likely. Perhaps if we were to write initial condition files on one computer (with a recent HDF5) and then use them on a supercomputer with an older HDF5.

LourensVeen commented 8 months ago

Ah yes, that could happen. Well, we'll see.

rieder commented 8 months ago

Seems like there would be a workaround even for this case, e.g. just writing the file with AMUSE format v1.
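A sketch of that workaround; whether write_set_to_file selects the file-format version via a version keyword is an assumption here:

import numpy as np
from amuse.io import write_set_to_file
from amuse.lab import Particles
from amuse.units import units

particles = Particles(5)
particles.x = np.arange(5) | units.pc

# Hypothetical: write with the v1 AMUSE format so that a machine with an
# older HDF5 can still read the file; the 'version' keyword is assumed.
write_set_to_file(particles, "initial_conditions.amuse", "amuse", version="1.0")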