The HDF5 file should be updated and synced on every nested sampling iteration. If you are very worried, maybe copy it away regularly during the run.
Otherwise, set the maximum number of likelihood evaluations to a value so that the run stops before the runtime limit is hit.
> The HDF5 file should be updated and synced on every nested sampling iteration. If you are very worried, maybe copy it away regularly during the run.
I have tried this today, but as expected, I ended up with corrupted checkpoint files. FWIW, the `SIGALRM` approach does not really seem to work when running with MPI anyway (I had to resort to sandboxing the `run()` command into its own `Process`, which I could `.join()` with a timeout, just in case somebody else wants to try and go down that road).
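In case it is useful to anyone, the sandboxing looked roughly like this (a minimal sketch with a placeholder likelihood, prior, `log_dir`, and time limit; the actual MPI wiring is omitted):

```python
import multiprocessing

import numpy as np
import ultranest

MAX_RUNTIME = 6 * 3600  # placeholder: soft runtime limit in seconds


def prior_transform(cube):
    return 10 * cube - 5  # placeholder prior: uniform on [-5, 5]


def loglike(params):
    return -0.5 * np.sum(params ** 2)  # placeholder Gaussian log-likelihood


def run_to_convergence():
    sampler = ultranest.ReactiveNestedSampler(
        ["x"], loglike, prior_transform, log_dir="my-run", resume=True
    )
    sampler.run()


if __name__ == "__main__":
    # Sandbox the run() call in its own process so we can enforce a timeout
    process = multiprocessing.Process(target=run_to_convergence)
    process.start()
    process.join(timeout=MAX_RUNTIME)
    if process.is_alive():
        # Time limit reached: kill the child. Note that this can still hit the
        # sampler mid-checkpoint-write, so it is not a clean solution either.
        process.terminate()
        process.join()
```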
> Otherwise, set the maximum number of likelihood evaluations to a value so that the run stops before the runtime limit is hit.
Thinking in this direction, I arrived at the following approach for now:
```python
import time
from copy import copy

import ultranest


def run_sampler_with_timelimit(
    sampler: ultranest.ReactiveNestedSampler,
    max_runtime: int,
) -> bool:

    start_time = time.time()

    # Define a magic number of likelihood evaluations to run
    # between checking the runtime constraint
    MAGIC_NUMBER = 10_000

    while True:

        # Run for a given number of likelihood evaluations
        n_call_before = copy(sampler.ncall)
        sampler.run(
            max_ncalls=n_call_before + MAGIC_NUMBER,
            # ...
        )

        # Check if we have converged
        if sampler.ncall == n_call_before:
            print("Sampling complete!")
            return True

        # Check if the timeout is reached
        if time.time() - start_time > max_runtime:
            print("Timeout reached, stopping sampler!")
            return False
```
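For context, this is roughly how I am calling it from my job script (the likelihood, prior, `log_dir`, and time limit below are just placeholders):

```python
import numpy as np
import ultranest


def prior_transform(cube):
    return 10 * cube - 5  # placeholder prior: uniform on [-5, 5]


def loglike(params):
    return -0.5 * np.sum(params ** 2)  # placeholder Gaussian log-likelihood


# resume=True picks up the HDF5 checkpoint from a previous job, if one exists
sampler = ultranest.ReactiveNestedSampler(
    ["x"], loglike, prior_transform, log_dir="my-run", resume=True
)

# Placeholder: a 23-hour soft limit for a 24-hour cluster job
finished = run_sampler_with_timelimit(sampler, max_runtime=23 * 3600)

if not finished:
    # Not converged yet: exit cleanly, then resubmit the job so that it
    # resumes from the checkpoint in log_dir
    print("Resubmit this job to continue from the checkpoint.")
```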
Does that make any sense to you?
It succeeds in limiting the runtime even when running with MPI (although choosing the magic number does not feel very elegant, and there is quite some variance in the actual runtimes of different iterations), but looking at the logs it produces I am not 100% sure yet that this really works as intended.
Yes, something like that.
For the previous suggestion, I was more thinking along the lines of
```bash
for i in $(seq 100000); do cp points.h5 points_${i}.h5; sleep 1m; done
```
I am closing this for now. Please reopen if it is an issue.
Apologies for the late feedback — I've been running the code I posted above for the past week, and so far, it looks like it gets the job done 🙂 Thanks for the help!
Hi @JohannesBuchner!
I was wondering if you had any advice on how to limit the runtime of the `run()` call of a `ReactiveNestedSampler`. Basically, I want to run the sampler to convergence (however many likelihood evaluations etc. that takes), but due to limitations of the cluster I'm using, I need to (soft-)limit the runtime of each individual job and restart from a checkpoint in case the sampler hasn't finished yet.

I am worried that a naive solution like this one might lead to corrupted checkpoint files if the `SIGALRM` interrupts UltraNest while it is saving its state to HDF.

Thanks a lot in advance! — Timothy
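P.S. For concreteness, the kind of naive `SIGALRM` timeout I have in mind looks roughly like this (purely an illustrative sketch, with a `time.sleep()` standing in for the actual `sampler.run()` call):

```python
import signal
import time


class Timeout(Exception):
    """Raised when the SIGALRM fires."""


def handle_alarm(signum, frame):
    raise Timeout()


# SIGALRM is only available on Unix-like systems
signal.signal(signal.SIGALRM, handle_alarm)
signal.alarm(10)  # placeholder runtime limit in seconds

try:
    # In the real script this would be sampler.run(); a sleep stands in here
    time.sleep(60)
except Timeout:
    # The exception can be raised at any point inside the sampler, including
    # in the middle of writing the HDF5 checkpoint, hence my worry above
    print("Runtime limit reached, exiting.")
finally:
    signal.alarm(0)  # cancel any pending alarm
```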