21cmfast / 21cmFAST

Official repository for 21cmFAST: a code for generating fast simulations of the cosmological 21cm signal

Possible issue: halo finder attempting to reserve 400 GB for halo list #382

Closed by JasperSolt 2 months ago

JasperSolt commented 2 months ago

Hello, I'm running the halo sampler in the v4-prep branch to generate some lightcones, and I'm encountering the following issue:

Traceback (most recent call last):
  File "/ifs/CS/replicated/home/jsolt/EoR_NN/run_21cmFAST.py", line 139, in <module>
    lightcone = p21c.run_lightcone(
                ^^^^^^^^^^^^^^^^^^^
  File "/ifs/CS/replicated/home/jsolt/EoR_NN/torchenv/lib/python3.11/site-packages/py21cmfast/wrapper.py", line 3417, in run_lightcone
    halo_field = determine_halo_list(
                 ^^^^^^^^^^^^^^^^^^^^
  File "/ifs/CS/replicated/home/jsolt/EoR_NN/torchenv/lib/python3.11/site-packages/py21cmfast/wrapper.py", line 1190, in determine_halo_list
    return fields.compute(
           ^^^^^^^^^^^^^^^
  File "/ifs/CS/replicated/home/jsolt/EoR_NN/torchenv/lib/python3.11/site-packages/py21cmfast/outputs.py", line 369, in compute
    return self._compute(
           ^^^^^^^^^^^^^^
  File "/ifs/CS/replicated/home/jsolt/EoR_NN/torchenv/lib/python3.11/site-packages/py21cmfast/_utils.py", line 1485, in _compute
    exitcode = self._c_compute_function(*inputs, self())
                                                 ^^^^^^
  File "/ifs/CS/replicated/home/jsolt/EoR_NN/torchenv/lib/python3.11/site-packages/py21cmfast/_utils.py", line 764, in __call__
    self._init_cstruct()
  File "/ifs/CS/replicated/home/jsolt/EoR_NN/torchenv/lib/python3.11/site-packages/py21cmfast/_utils.py", line 738, in _init_cstruct
    self._init_arrays()
  File "/ifs/CS/replicated/home/jsolt/EoR_NN/torchenv/lib/python3.11/site-packages/py21cmfast/_utils.py", line 723, in _init_arrays
    setattr(self, k, fnc(shape, dtype=tp))
                     ^^^^^^^^^^^^^^^^^^^^
numpy.core._exceptions._ArrayMemoryError: Unable to allocate 392. GiB for an array with shape (35055203981, 3) and data type int32

The lightcones I'm attempting to run are 1 Gpc in size, with a box length of 256 Mpc and a redshift range of 6-16. I know that's large, but even so, 400 GB for a list of halos seems excessive. Is there a way to reduce the size of the halo list?
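
For context, a sketch of roughly the kind of call that triggers the allocation above, using the public wrapper API (the v4-prep signature may differ; `HII_DIM` and the flag settings here are illustrative placeholders, not the reporter's actual configuration):

```python
import py21cmfast as p21c

# Illustrative only: a 256 Mpc box evolved over z = 6-16 with the halo field
# enabled. HII_DIM and the flag values are placeholders, not the real setup.
lightcone = p21c.run_lightcone(
    redshift=6.0,                 # minimum redshift of the lightcone
    max_redshift=16.0,            # maximum redshift of the lightcone
    user_params={"BOX_LEN": 256, "HII_DIM": 128},
    flag_options={"USE_HALO_FIELD": True},
)
```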

JasperSolt commented 2 months ago

I ran the halo finder on an initial conditions box with BOX_LEN=32 to try to troubleshoot the issue. I found that the generated halo field had the parameters 'buffer_size': 68467197 and 'n_halos': 27140773, so the buffer is roughly 2.5 times the actual number of halos. Is it possible to add a parameter to manually set the buffer size, or to significantly reduce the buffer itself? The unused reserved memory makes generating larger cubes impossible.
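
A rough way to reproduce this check (a sketch against the public wrapper; the argument and attribute names are assumed and may differ on the v4-prep branch):

```python
import py21cmfast as p21c

# Generate a halo list for a small test box and compare the number of halos
# actually found to the size of the allocated coordinate buffer.
# (Attribute names `n_halos` and `halo_coords` are assumed here.)
halos = p21c.determine_halo_list(
    redshift=6.0,
    user_params={"BOX_LEN": 32, "HII_DIM": 32},
)
print("n_halos:", halos.n_halos)
print("buffer rows allocated:", halos.halo_coords.shape[0])  # (buffer_size, 3) int32
```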

JasperSolt commented 2 months ago

Over the weekend I did a bit more testing. I attempted to complete halo finding for differently sized boxes, just to see when the memory overhead became too great. For each job I requested 250 GB of memory on a compute node on Brown University's CSGrid cluster. I also tried to estimate how much memory the halo list theoretically should be using, with the back-of-the-envelope calculation memory = buffer_size * 4 bytes (int32) * 3 coordinates (worked through in the snippet after the table below). Additionally, I messed around a bit with how buffer_size is estimated in the Python wrapper, so don't put too much stock in the absolute values listed here.

Here's what came of it:

| Test | box_len | buffer_size | Estimated memory | n_halos | Outcome |
|------|---------|-------------|------------------|---------|---------|
| 1 | 128 | 2,190,950,249 | ~26 GB | 1,738,884,214 | complete |
| 2 | 192 | 9,859,276,307 | ~118 GB | 5,868,611,828 | complete |
| 3 | 256 or 224 (forgot to write it down) | 15,656,164,876 | ~188 GB | n/a | killed |
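
For reference, the estimates in the table come from the back-of-the-envelope calculation described above (3 int32 coordinates per buffered halo); the same arithmetic also reproduces the 392 GiB figure in the original traceback:

```python
# memory = buffer_size * 3 coordinates * 4 bytes (int32)
buffer_sizes = [
    2_190_950_249,   # Test 1
    9_859_276_307,   # Test 2
    15_656_164_876,  # Test 3
    35_055_203_981,  # array shape from the traceback in the original report
]
for n in buffer_sizes:
    nbytes = n * 3 * 4
    print(f"{n:>14,d} buffered halos -> {nbytes / 1e9:6.1f} GB ({nbytes / 2**30:6.1f} GiB)")
```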

It seems at least believable to me that Test 3 would fail, given that the estimated memory cost of the halo list is only about 60 GB short of the requested 250 GB. But maybe my perception is skewed because I'm used to dealing with such large in-memory objects, and gigabyte values have lost all meaning to me. Wouldn't be the first time.

My question is: does the halo list need to be this long? Is there a way I could make the halo sampler "coarser-grained" (for lack of a better term) for larger simulations? Other semi-numerical sims I've worked with that use halo finders don't have nearly this much memory overhead, and there's only so much memory I can request.

JasperSolt commented 2 months ago

I have discovered the sampler_min_mass parameter in global_params. Apologies for the foolishness.
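
For anyone who hits the same wall: raising the minimum sampled halo mass shrinks the halo list, and with it the buffer. A minimal sketch, assuming the parameter can be set through the usual `global_params` interface (the exact name, capitalization, and units are taken from the comment above and should be checked against the v4-prep source):

```python
import py21cmfast as p21c

# Sketch only: raise the minimum halo mass the sampler tracks so that far
# fewer halos (and a much smaller coordinate buffer) are generated.
# SAMPLER_MIN_MASS is assumed from the comment above; check the exact
# spelling and units in the v4-prep branch.
with p21c.global_params.use(SAMPLER_MIN_MASS=1e9):  # solar masses (assumed)
    lightcone = p21c.run_lightcone(
        redshift=6.0,
        max_redshift=16.0,
        user_params={"BOX_LEN": 256, "HII_DIM": 128},
        flag_options={"USE_HALO_FIELD": True},
    )
```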