astropy / halotools

Python package for studying large scale structure, cosmology, and galaxy evolution using N-body simulations and halo models
http://halotools.rtfd.org
99 stars 63 forks source link

Crash in mock generation due to rounding errors #1081

Closed johannesulf closed 2 weeks ago

johannesulf commented 1 month ago

The following code creates an error. Note that all numbers matter. Different choices for model parameters likely won't produce the error.

from halotools.empirical_models import PrebuiltHodModelFactory
from halotools.sim_manager import CachedHaloCatalog

model = PrebuiltHodModelFactory(
    'cacciato09', threshold=8.5, prim_haloprop_key='halo_mvir',
    prim_galprop_key='luminosity')
model.param_dict['log_M_1'] = 11.2
model.param_dict['log_L_0'] = 9.95
model.param_dict['gamma_1'] = 3.5
model.param_dict['gamma_2'] = 0.25
model.param_dict['a_1'] = 0.8
model.param_dict['a_2'] = 0.0

halocat = CachedHaloCatalog(simname='bolplanck', redshift=0,
                            halo_finder='rockstar')
model.populate_mock(halocat)
Traceback (most recent call last):

  File ~/.venv/lib64/python3.12/site-packages/spyder_kernels/py3compat.py:356 in compat_exec
    exec(code, globals, locals)

  File ~/Projects/halotools/scratch.py:16
    model.populate_mock(halocat)

  File ~/.venv/lib64/python3.12/site-packages/halotools/empirical_models/factories/hod_model_factory.py:1209 in populate_mock
    ModelFactory.populate_mock(self, halocat, **kwargs)

  File ~/.venv/lib64/python3.12/site-packages/halotools/empirical_models/factories/model_factory_template.py:254 in populate_mock
    self.mock.populate(**mockpop_kwargs)

  File ~/.venv/lib64/python3.12/site-packages/halotools/empirical_models/factories/hod_mock_factory.py:326 in populate
    self.allocate_memory(seed=seed)

  File ~/.venv/lib64/python3.12/site-packages/halotools/empirical_models/factories/hod_mock_factory.py:495 in allocate_memory
    self.galaxy_table[halocatkey] = np.zeros(

ValueError: negative dimensions are not allowed

The underlying reason seems to be that the Cacciatio09 satellite occupation model produces an extremely small negative mean expected occupation for some halos. This is likely due to some rounding error since the value is something like -3e-47. This negative expectation value than correctly creates an NaN when calling pdtrik here. However, in the same part of the code, this NaN is then explicitly converted into an in int which is -9223372036854775808. Ultimately, the expected galaxy table length is then negative, creating the crash.

Here's a potential solution: Right before calling pdtrik to create draws from a Poisson distribution, we should check whether expectation values are negative. If so but they're effectively 0 to within some precision, we could force negative expectation values to be 0 exactly. If however, some values are negative beyond some reasonable value for numerical noise, the code should throw an error to alert the user that the model produced an unphysical occupation. @aphearin I can write this if you think that's a good solution.

aphearin commented 1 month ago

Nice work on putting together a minimal reproducer and identifying what I agree is a very sensible fix (including your idea to raise an error for unexpectedly negative numbers). Maybe something like OCCLIP = -1e-3 would catch all reasonable rounding errors, while also catching actual bugs. I have to do this sort of thing all the time lately since most JAX code is single precision, and this seems reasonable here as well.

Thanks for offering to lead the fix. I'm happy to review the PR so if you'd like a second pair of eyes on the code then just tag me when you push it up.

johannesulf commented 1 month ago

Credit goes to Kaustav Mitra for finding the bug and producing the minimal reproducer.