JohannesBuchner / UltraNest

Fit and compare complex models reliably and rapidly. Advanced nested sampling.
https://johannesbuchner.github.io/UltraNest/

Resuming run does not write progress to hdf5 file when terminated, restarting always repeats the same iterations #58

Closed · curriem closed this issue 2 years ago

curriem commented 2 years ago

Description

I am in a situation where I have to run UltraNest on a checkpoint queue on a cluster, where the job gets interrupted every 5 hours. As a result, I have been using the resume feature, but I'm having some difficulty understanding how it works. The run seems to resume at an ncalls value lower than the one it left off at, and it goes back to the same ncalls nearly every time it is interrupted and resumed, so the run is rarely able to move past that number of ncalls. I will attach my debug.log file here so you can see what I mean. Thank you in advance for your help!

debug.log

JohannesBuchner commented 2 years ago

It's supposed to append to an HDF5 file, with pointstore.add in _create_point() in integrator.py: https://github.com/JohannesBuchner/UltraNest/blob/master/ultranest/integrator.py#L1742

ncalls is also stored, so that upon resuming, the sampler knows the total number of ncalls already invested in this problem. From the log it seems that ncalls and the iteration count are both still increasing.

So it all looks good to me. Probably just a slow likelihood?

curriem commented 2 years ago

Thanks for the quick reply! You're right that it's a very slow likelihood.

What I'm confused about is that sometimes when it is interrupted and restarts, it seems to lose the progress it made. Compare the last two interrupt/restart cycles in the debug.log file, for example: they are almost identical, both starting at 203644 ncalls and progressing to a similar ncalls/iteration, and then all of that progress is lost and the next cycle begins again at 203644 ncalls. What I think should happen is that the last restart cycle begins where the previous cycle left off (iteration=4237, ncalls=226871), but I'm not seeing this. Maybe I'm misunderstanding how it works; could you take another look? I really appreciate your help!

JohannesBuchner commented 2 years ago

It repeatedly says "Resuming from 4662 stored points", which is what it finds in the points.hdf5 file. Can you confirm that the content of that file remains unchanged as it restarts?

Seems like the file system is not syncing, or the HDF5 library is not flushing?

The HDF5 writing code is here: https://github.com/JohannesBuchner/UltraNest/blob/master/ultranest/store.py#L152

Maybe add a self.fileobj.flush() to add()?

reference: https://docs.h5py.org/en/stable/high/file.html
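To illustrate the suggested fix, here is a minimal sketch of per-append flushing with h5py. The file name and dataset layout below are illustrative only, not UltraNest's actual points.hdf5 schema; the point is that h5py's File.flush() pushes the library's write buffers out after every row, so a hard kill cannot silently discard stored points.

```python
import h5py
import numpy as np

path = "points_demo.hdf5"
with h5py.File(path, "w") as f:
    # resizable dataset standing in for the point store (layout is made up)
    dset = f.create_dataset("points", shape=(0, 3), maxshape=(None, 3))
    for i in range(5):
        dset.resize(dset.shape[0] + 1, axis=0)  # append one row
        dset[-1] = [i, 0.5 * i, 0.25 * i]
        f.flush()  # push HDF5 buffers out after every stored point

# even if the process had been killed right after the loop, a fresh
# reader would now see all 5 rows on disk
with h5py.File(path, "r") as f:
    assert f["points"].shape == (5, 3)
```

In the scenario above, resuming would then find all rows written before the interrupt instead of an older, stale copy of the file.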

curriem commented 2 years ago

Yes, the points file was remaining unchanged as it restarted.

Adding a self.fileobj.flush() to add() seems to fix the issue! Thank you!

JohannesBuchner commented 2 years ago

OK, I don't think I will make this change in UltraNest, though, because in other cases many points are added per second, and flushing after every point would probably slow everything down.

From my point of view, the problem should be addressed in the underlying HDF5 library.