Official repository for 21cmFAST: a code for generating fast simulations of the cosmological 21cm signal

[BUG] Memory issues in 21cmFast #337

Open jordanflitter opened 1 year ago

jordanflitter commented 1 year ago

Hello,

I've encountered numerous memory issues that prevent me from completing the simulation. I should stress that in my project I work with a lot more redshift samples than in the standard 21cmFast, which is why I'm having these issues. Here's a list of the bugs that I've found so far, sorted by their severity:

  1. In wrapper.py, in run_lightcone() and run_coeval(), every time a C-function is called (via initial_conditions(), perturb_field(), etc.), the "write" argument is not passed through. As a result, boxes are saved to the cache even if the user explicitly specifies not to save anything with write=False.
  2. In wrapper.py, in run_lightcone() and run_coeval(), perturb_field() is called in a different redshift loop than the main one (where spin_temperature() is called), and the list "perturb" grows in an uncontrolled manner. I believe this was done to support the old option of using the perturbed box with the lowest redshift throughout the entire calculation. However, this implementation can result in a segmentation fault when the list becomes huge. I moved the call to perturb_field() into the main redshift loop, and that fixed the segmentation fault I had.
  3. In wrapper.py, in run_lightcone() and run_coeval(), when either spin_temperature() or ionize_box() is called, the cleanup argument logic implies that allocated memory is freed only at the last iteration. This logic appears to be deliberate (maybe to speed up the calculation?), but I find it harmful when many redshift samples are needed. I have changed this logic so memory is freed at each iteration, and the runtime does not seem to be affected considerably.
  4. In BrightnessTemperatureBox.c, init_ps() is called at line 55, but free_ps() isn't called at the end of the calculation. In fact, the C-code in BrightnessTemperatureBox.c never uses anything initialized by init_ps(), so that line should be removed.
  5. In IonisationBox.c, init_ps() is called at line 112 but free_ps() isn't called at the end of the calculation. In addition, I found that the C-code in IonisationBox.c needs init_ps() only when mini-halos are turned on.
  6. In SpinTemperatureBox.c, the variable "log10_Mcrit_LW_ave_list" is allocated (only when mini-halos are turned on), but never freed.
  7. Finally, in wrapper.py, in run_lightcone() and run_coeval(), filenames of the calculated coeval boxes are appended to lists (again, in an uncontrolled manner), even when nothing was written to the cache (e.g. after fixing bug 1 in this list).

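To make point 3 concrete, here is a toy, stdlib-only sketch (the names and structure are invented for illustration and are not the actual 21cmFAST API) contrasting freeing each box per iteration against freeing everything only at the end:

```python
def make_box(n=25):
    # stand-in for a simulation box; ~15 kB per "box" here
    return bytearray(n ** 3)

def run_redshift_loop(n_redshifts, free_each_iteration=True):
    live = []      # buffers still allocated (mimics memory held by the C code)
    peak = 0
    for _ in range(n_redshifts):
        live.append(make_box())            # allocate this redshift's box
        peak = max(peak, len(live))
        if free_each_iteration and len(live) > 1:
            live.pop(0)                    # free the previous box right away
    live.clear()   # without per-iteration cleanup, everything is freed only here
    return peak

print(run_redshift_loop(84, free_each_iteration=True))   # 2: footprint stays flat
print(run_redshift_loop(84, free_each_iteration=False))  # 84: grows with every redshift
```

With many redshift samples, the difference between a peak of two live boxes and a peak of N live boxes is exactly what turns a runnable simulation into an out-of-memory one.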
Fixing all the above bugs seems to reduce the memory requirements of 21cmFast considerably. And yet, something still keeps me from finishing my simulation. As far as I understand, once we enter the main redshift loop in either run_lightcone() or run_coeval(), the memory used by the program should remain roughly constant, because new coeval boxes overwrite the previous ones at each iteration. However, this is certainly not what's happening, even with the above bugs fixed.

To investigate what's causing this behavior, I wrote a script that calls initial_conditions() multiple times and prints the memory used by the returned InitialConditions object (using the same random seed on each run). I found sys.getsizeof() to be unreliable for this task, so I used actualsize() instead, which can be found in this nice article. I was surprised to find that the memory used by each InitialConditions object was not constant, but rather grew with every call to initial_conditions(). Not only that, but the growth rate was not constant either. I suspect it might have to do with the basic class "StructWrapper" used by the wrapper, but I am not sure. This is the point where I thought I should get some help from the github community :)
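
As a small stdlib-only demonstration of why sys.getsizeof() alone is unreliable here (unrelated to the 21cmFAST objects themselves), note that it is shallow: it reports only the container's own size, not the memory held by the objects it references, which is why a recursive walk like actualsize() gives very different numbers:

```python
import sys

payload = bytearray(10 ** 6)        # ~1 MB buffer
wrapper = [payload]

shallow = sys.getsizeof(wrapper)    # just the list object: tens of bytes
deep = shallow + sys.getsizeof(payload)   # includes the referenced buffer

print(shallow < 1000)    # True: the 1 MB payload is invisible to the shallow size
print(deep > 10 ** 6)    # True
```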

Thanks to anyone who can help me with this issue!

Jordan

BradGreig commented 1 year ago

Hi @jordanflitter, before I can provide detailed answers to your questions, can you please provide an example of what you are specifically aiming to compute (e.g. the variables, setup, and version used)? That way we should be able to provide more useful information.

By memory issues, do you just mean you exhaust the available memory on your machine?

A brief response to some of your questions:

  1. I know this used to be the case, but this was fixed a while ago. I just did a quick simulation using run_lightcone setting write=False and nothing was saved to the cache. This makes me wonder if you have an older version? Or you are attempting to do something else.

  2. This has been identified in #221. This PR is an attempt to allow the user to estimate their memory requirements for their specific setup. It is not finalised yet, but you could try checking out that branch and it might give you some useful information for your setup.

The decision here was based on being able to purge the initial conditions, which individually require the most memory. But, if enough redshifts are required, this won't help for the identified reason. Likely a solution implementing both will be required (where the path taken depends on the user's setup).

  3. Hmm, this hasn't caused an issue previously. I've always found the memory requirements to be fairly stable with repeated calls to spin_temperature. I'll have to think about this some more.

4-6. This is probably true, but the memory requirements here, particularly w.r.t. init_ps(), are essentially zero. So this should not be causing any problems.

  7. Hmm, are you using the callback function? I think this should only occur if you are using the callback feature. I have never used this feature, so I cannot speak for how it behaves with respect to memory.

Not sure if these are useful answers, but hopefully more assistance can be provided with more information on your desired setup.
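
For a rough sense of the scales involved while #221 is being finalised, a back-of-the-envelope estimate can be sketched as follows (the function and the field counts here are purely illustrative assumptions, not what the PR actually computes):

```python
def box_gib(dim, n_fields=1, bytes_per_cell=4):
    """Rough size of n_fields float32 boxes of dim^3 cells, in GiB."""
    return dim ** 3 * bytes_per_cell * n_fields / 1024 ** 3

print(box_gib(256))               # 0.0625 GiB for a single 256^3 float32 box
print(box_gib(512, n_fields=10))  # 5.0 GiB if ten such 512^3 fields are live at once
```

The point is that total usage scales with how many boxes are simultaneously live, which is exactly what the cleanup logic and the purge-vs-keep decision control.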

steven-murray commented 1 year ago

Thank you very much @jordanflitter for the detailed bug report. I agree with @BradGreig's comments, and I think a few of these can be fixed easily (and as he says, I'm pretty sure the first is already fixed on master). Making 21cmFAST memory-robust is important to us, so this will be a priority. Part of the problem is that there are different use-cases (as you well know) that require different code-paths to be optimal.

jordanflitter commented 1 year ago

Thanks @BradGreig and @steven-murray for showing interest in this issue. After more tests, I found that the last issue (the one preventing me from completing the simulation in my project) had to do with my own modifications, rather than another bug in 21cmFast :)

As for your questions, @BradGreig, I'm working with a modified code based on version 3.1.3. Since most of the lines that were causing me memory issues* are still present in the current master code, I thought they could still lead to the same errors I've encountered, though I understand that a fix for the first bug on my list has already been implemented.

*By memory issues I mean that the program exhausted its available memory. For high resolutions I run the code on a Sun Grid Engine (SGE) system, and many of the jobs were killed/aborted by the system (on some occasions, a numpy ArrayMemoryError was raised). Before I fixed the 2nd bug on my list I used to get a segmentation fault error. I should say that when I work with a low-resolution box on my laptop I've never had any of these problems.

Before fixing the 3rd bug on my list, the memory usage of 21cmFast was increasing with each redshift step (here I used a low-resolution box with HII_DIM=25, no mini-halos, and the standard 84 redshift samples from z=35 to z=6; I made this plot with the wonderful tool psrecord. The first blue peak in the plot corresponds to the generation of the initial boxes).

[Figure: psrecord memory/CPU trace of the run — 21cmFAST_public]

Fixing the 3rd bug helps flatten the blue curve, but not entirely. Then, by fixing the 4th-6th bugs on my list I was able to bring the blue curve to a nearly** constant value. To be honest, I haven't checked what happens when the 7th bug on my list is fixed; I imagine it has a negligible effect on the total memory.

**Unfortunately, after making all these changes, the memory used by 21cmFast is still not perfectly constant, and it gradually increases with each redshift step. In an attempt to figure out what may be causing it, I wrote the following short script. It reports different memory usage for the three generated InitialConditions objects, which I find weird. I have to admit, though, that I don't have any prior experience with actualsize(), although it seems to do a better job than sys.getsizeof().

import sys
import gc
import numpy as np
import matplotlib.pyplot as plt
import py21cmfast as p21c

def actualsize(input_obj):
    """Recursively sum sys.getsizeof over every object reachable from input_obj."""
    memory_size = 0
    ids = set()
    objects = [input_obj]
    objects_memory = []
    while objects:
        new = []
        for obj in objects:
            if id(obj) not in ids:  # count each object only once
                ids.add(id(obj))
                memory_size += sys.getsizeof(obj)
                new.append(obj)
                objects_memory.append(sys.getsizeof(obj))
        objects = gc.get_referents(*new)  # descend to the objects these reference
    return memory_size, np.sort(objects_memory)

# Generate the same initial conditions three times (identical random seed)
# and compare the memory reported for each returned InitialConditions object.
for i in np.arange(3):
    ICs = p21c.initial_conditions(user_params={'HII_DIM': 25},
                                  regenerate=True,
                                  write=False,
                                  direc="_cache",
                                  random_seed=1)
    M, objects_memory = actualsize(ICs)
    print(f"Total memory of ICs is {M}")
    plt.plot(np.cumsum(objects_memory))

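
An alternative cross-check that sidesteps sys.getsizeof() entirely is the standard library's tracemalloc, which tracks actual Python-level allocations (note it will not see memory allocated inside the C extension itself). A minimal, self-contained sketch of the pattern:

```python
import tracemalloc

tracemalloc.start()
before, _ = tracemalloc.get_traced_memory()
data = [bytearray(10 ** 5) for _ in range(10)]   # ~1 MB of tracked allocations
after, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

grew = after - before
print(grew > 900_000)   # True: the allocations show up in the traced total
```

Wrapping each call to initial_conditions() in such a measurement would show whether the growth happens on the Python side or inside the C code.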
BradGreig commented 1 year ago

Hi @jordanflitter, excellent, good to know you were able to complete your simulations!

The usage graph you've provided looks fairly similar to one I produced a while back (https://github.com/21cmfast/21cmFAST/pull/221#issuecomment-810746666). Apologies for the difficulty viewing it; it runs all the way through the calculation. However, between that version, yours, and the current master there have been a few changes (mainly your 2nd point, which we are actively resolving).

Additionally, thanks for passing along the tools you've been using. They will be helpful (I was just scraping the information from top while the simulation was running).

We'll get to work on implementing a lot of these suggestions when we can. I think the dominant one will be point 2, but we'll check and see how everything else is performing.

jordanflitter commented 1 year ago

Thanks @BradGreig. I'm glad that you found my comment helpful. Please let me know if you find anything interesting in that context :)