Closed johannesulf closed 3 months ago
Nice work on putting together a minimal reproducer and identifying what I agree is a very sensible fix (including your idea to raise an error for unexpectedly negative numbers). Maybe something like OCCLIP = -1e-3
would catch all reasonable rounding errors, while also catching actual bugs. I have to do this sort of thing all the time lately since most JAX code is single precision, and this seems reasonable here as well.
Thanks for offering to lead the fix. I'm happy to review the PR so if you'd like a second pair of eyes on the code then just tag me when you push it up.
Credit goes to Kaustav Mitra for finding the bug and producing the minimal reproducer.
The following code creates an error. Note that all numbers matter. Different choices for model parameters likely won't produce the error.
The underlying reason seems to be that the Cacciatio09 satellite occupation model produces an extremely small negative mean expected occupation for some halos. This is likely due to some rounding error since the value is something like -3e-47. This negative expectation value than correctly creates an NaN when calling
pdtrik
here. However, in the same part of the code, this NaN is then explicitly converted into an in int which is -9223372036854775808. Ultimately, the expected galaxy table length is then negative, creating the crash.Here's a potential solution: Right before calling
pdtrik
to create draws from a Poisson distribution, we should check whether expectation values are negative. If so but they're effectively 0 to within some precision, we could force negative expectation values to be 0 exactly. If however, some values are negative beyond some reasonable value for numerical noise, the code should throw an error to alert the user that the model produced an unphysical occupation. @aphearin I can write this if you think that's a good solution.