JohannesBuchner / UltraNest

Fit and compare complex models reliably and rapidly. Advanced nested sampling.
https://johannesbuchner.github.io/UltraNest/

[Question] How to limit the runtime of `run()` #129

Closed timothygebhard closed 3 months ago

timothygebhard commented 3 months ago

Hi @JohannesBuchner!

I was wondering if you had any advice on how to limit the runtime of the `run()` call of a `ReactiveNestedSampler`. Basically, I want to run the sampler to convergence (however many likelihood evaluations that takes), but due to limitations of the cluster I'm using, I need to soft-limit the runtime of each individual job and restart from a checkpoint in case the sampler hasn't finished yet.

I am worried that a naive solution like this one might lead to corrupted checkpoint files if the SIGALRM interrupts UltraNest while it is saving its state to HDF.
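For context, the kind of SIGALRM-based approach I mean would look roughly like this (just a sketch of the general pattern, not tested with UltraNest; `job` stands in for a call to `sampler.run()`):

```python
import signal
import time


class Timeout(Exception):
    """Raised when the SIGALRM fires."""


def _handler(signum, frame):
    raise Timeout


def run_with_alarm(job, max_runtime: int) -> bool:
    """Run `job()` and abort it after `max_runtime` seconds.

    Returns True if the job finished on its own, False if it was
    interrupted. Only works in the main thread of the main process.
    """
    signal.signal(signal.SIGALRM, _handler)
    signal.alarm(max_runtime)  # deliver SIGALRM after `max_runtime` seconds
    try:
        job()
        return True
    except Timeout:
        return False
    finally:
        signal.alarm(0)  # cancel any pending alarm


# In practice, `job` would be something like `lambda: sampler.run(...)`
print(run_with_alarm(lambda: time.sleep(2), max_runtime=1))  # False
```

The worry above is precisely that the `Timeout` exception can be raised at an arbitrary point inside `job()`, including mid-write of the checkpoint file.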

Thanks a lot in advance! — Timothy

JohannesBuchner commented 3 months ago

The HDF5 file should be updated and synced on every nested sampling iteration. If you are very worried, maybe copy it away regularly during the run.

Otherwise, set the maximum number of likelihood evaluations to a value so that the run stops before the runtime limit is hit.

timothygebhard commented 3 months ago

> The HDF5 file should be updated and synced on every nested sampling iteration. If you are very worried, maybe copy it away regularly during the run.

I have tried this today, but as expected, I ended up with corrupted checkpoint files. FWIW, the SIGALRM approach does not really seem to work when running with MPI anyway (I had to resort to sandboxing the `run()` command into its own `Process` which I could `.join()` with a timeout — just in case somebody else wants to try and go down that road).

> Otherwise, set the maximum number of likelihood evaluations to a value so that the run stops before the runtime limit is hit.

Thinking in this direction, I arrived at the following approach for now:

```python
import time
from copy import copy

import ultranest


def run_sampler_with_timelimit(
    sampler: ultranest.ReactiveNestedSampler,
    max_runtime: int,
) -> bool:

    start_time = time.time()

    # Define a magic number of likelihood evaluations to run
    # between checks of the runtime constraint
    MAGIC_NUMBER = 10_000

    while True:

        # Run for (at most) another MAGIC_NUMBER likelihood evaluations
        n_call_before = copy(sampler.ncall)
        sampler.run(
            max_ncalls=n_call_before + MAGIC_NUMBER,
            # ...
        )

        # If the sampler did not use up the additional call budget,
        # it stopped on its own, i.e., it has converged
        if sampler.ncall == n_call_before:
            print("Sampling complete!")
            return True

        # Check if the timeout is reached
        if time.time() - start_time > max_runtime:
            print("Timeout reached, stopping sampler!")
            return False
```
Does that make any sense to you?

It succeeds in limiting the runtime even when running with MPI (although choosing the magic number does not feel very elegant, and there is quite some variance in the actual runtimes of different iterations), but looking at the logs it produces I am not 100% sure yet that this really works as intended.

JohannesBuchner commented 3 months ago

Yes, something like that.

For the previous suggestion, I was more thinking along the lines of

```shell
for i in $(seq 100000); do cp points.h5 points_${i}.h5; sleep 1m; done
```

JohannesBuchner commented 3 months ago

I am closing this for now. Please reopen if it is an issue.

timothygebhard commented 3 months ago

Apologies for the late feedback — I've been running the code I posted above for the past week, and so far, it looks like it gets the job done 🙂 Thanks for the help!