JohannesBuchner / UltraNest

Fit and compare complex models reliably and rapidly. Advanced nested sampling.
https://johannesbuchner.github.io/UltraNest/

ARM bug: overflow from excessive live points #123

Closed fcotizelati closed 5 months ago

fcotizelati commented 5 months ago

Summary: I've encountered an issue while running Bayesian inference on an ARM architecture where the sampler attempts to allocate an unrealistically high number of live points (9223372036854775407, nearly the maximum value for a 64-bit integer), leading to an overflow error and halting the process. This issue does not occur on an Intel architecture under identical conditions, suggesting a potential bug in UltraNest's handling of live points specifically on ARM architectures.

Environment details: UltraNest Version: 4.1.6 Python Version: 3.11 System: Darwin (release 23.3.0) Version: Darwin Kernel Version 23.3.0: Wed Dec 20 21:31:10 PST 2023; root:xnu-10002.81.5~7/RELEASE_ARM64_T6031 Machine: arm64

Attachments: I've attached the debug logs obtained through BXASolver for both Intel ('debug_intel.log') and ARM ('debug_ARM.log') architectures.

I would appreciate any guidance on resolving this issue or any temporary workarounds that might exist. Thank you for your help and for maintaining UltraNest.

debug_intel.log debug_ARM.log

JohannesBuchner commented 5 months ago

Thank you for reporting this.

In the log I see the number of live points varies between 1 and inf; most iterations (190/359) have 1.

The inf makes me suspicious. I think that this line https://github.com/JohannesBuchner/UltraNest/blob/master/ultranest/integrator.py#L1722 computes inf, perhaps because the log is evaluated at zero? The subsequent line tries to catch invalid computations, but +inf is not flagged.

This would occur when widthratio = 0, which occurs in the line before when logweights[1:,0] - logweights[:-1,0] is zero.

This occurs when all posterior points have equal weight, probably because the likelihood always returned the same number during the run. This can happen when no data points are analysed, or when a very, very large portion of the prior is marked with a special invalid likelihood number (e.g., -1e300 in BXA when the Fit.statistic is not a finite number).
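A minimal NumPy sketch of the suspected chain (assumed shapes and names taken from the comment above; this is not the actual integrator code): equal posterior weights make the consecutive log-weight differences zero, so widthratio becomes zero, and a log of it evaluates to -inf, which downstream arithmetic can turn into the +inf seen in the run log.

```python
import numpy as np

# Hypothetical sketch, not the actual integrator code: when every
# posterior point has the same weight, the consecutive log-weight
# differences vanish, so widthratio ends up exactly zero.
logweights = np.zeros((6, 1))                         # equal weights
widthratio = logweights[1:, 0] - logweights[:-1, 0]   # all zeros

with np.errstate(divide="ignore"):
    # log(0) evaluates to -inf; downstream arithmetic (negation,
    # division) can then produce a +inf live point target.
    logval = np.log(widthratio)

print(np.isneginf(logval).all())
```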

JohannesBuchner commented 5 months ago

Btw, you can circumvent this with the max_num_improvement_loops=0 argument to run().

JohannesBuchner commented 5 months ago

I think a possible solution could be

to change nlive[~(nlive > 1)] = 1 to nlive[~np.logical_and(nlive > 1, np.isfinite(nlive))] = 1
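To see why this mask change matters, here is a small self-contained check (plain NumPy, independent of UltraNest): since inf > 1 is True, the original mask leaves infinite entries untouched, while the isfinite variant resets them to 1.

```python
import numpy as np

# inf > 1 is True, so the original mask ~(nlive > 1) does not
# select infinite entries; adding np.isfinite catches them.
nlive = np.array([0.5, 3.0, np.inf])

mask_old = ~(nlive > 1)                                    # [True, False, False]
mask_new = ~np.logical_and(nlive > 1, np.isfinite(nlive))  # [True, False, True]

a = nlive.copy(); a[mask_old] = 1   # inf survives -> overflows on a later int cast
b = nlive.copy(); b[mask_new] = 1   # inf replaced by 1

print(a)  # [ 1.  3. inf]
print(b)  # [1. 3. 1.]
```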

JohannesBuchner commented 5 months ago

I guess the reason you see this on ARM but not Intel is that np.array([inf]).astype(int) gives different values?
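This is easy to check directly: casting a float infinity to a 64-bit integer is undefined behaviour in C, so the hardware decides the result (x86 typically yields INT64_MIN, ARM saturates to INT64_MAX). A quick probe:

```python
import numpy as np
import warnings

# Casting inf to int64 is undefined behaviour in C; the result is
# platform-dependent (x86: INT64_MIN, ARM: INT64_MAX).
with warnings.catch_warnings():
    warnings.simplefilter("ignore")  # newer NumPy warns about the invalid cast
    value = np.array([np.inf]).astype(np.int64)[0]

info = np.iinfo(np.int64)
print(value in (info.min, info.max))  # True on both architectures
```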

fcotizelati commented 5 months ago

Thank you for your prompt reply.

Changing the line in integrator.py to nlive[~np.logical_and(nlive > 1, np.isfinite(nlive))] = 1 while still retaining max_num_improvement_loops=-1 in the run() argument worked for me!

Now I will evaluate the robustness of the result, but in the meantime, it's excellent that the sampling procedure did not halt.

fcotizelati commented 5 months ago

I've just noticed that on the Intel architecture, infinity converts to -9223372036854775808, while on the ARM architecture it converts to 9223372036854775807. These are the minimum and maximum values for a 64-bit signed integer (int64).

So I guess different CPU architectures have different default behaviors when converting floating-point numbers to integers, especially for special values like infinity.
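A generic defensive pattern that sidesteps this (a sketch, not UltraNest code): replace non-finite values before any integer cast, so the result is deterministic on every architecture.

```python
import numpy as np

# Make the float array finite before casting to int, so the result
# does not depend on how the hardware handles inf/nan in the cast.
nlive = np.array([1.0, 400.0, np.inf, np.nan])

safe = np.where(np.isfinite(nlive), nlive, 1.0).astype(np.int64)
print(safe)  # [  1 400   1   1]
```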

JohannesBuchner commented 5 months ago

I released ultranest 4.1.7, please test it and let me know if it solves this issue for you.

fcotizelati commented 5 months ago

It does, thank you. Even in the initially problematic cases, the analysis now returns entirely reasonable corner plots for bxa and posterior distributions for the excess variance for bexvar.