21cmFAST

Official repository for 21cmFAST: a code for generating fast simulations of the cosmological 21cm signal

Callable memory estimate #221

Open BradGreig opened 3 years ago

BradGreig commented 3 years ago

Hey @steven-murray I've gone through and added a function to estimate the memory usage for a light-cone. Co-eval will follow a similar structure.

Outputting of information is still a work in progress (currently done with prints), but the main functionality is there.

I'll need to think about how best to check that these estimates actually match full runs of 21cmFAST. At the moment, the only thing I can think of is to check against the actual memory usage on my machine for an equivalent run. I guess this is the downside of these functions: they are not robust to changes elsewhere in the code and rely on me getting the behaviour correct.
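
For intuition, the kind of estimate involved is a sum over the 3D boxes each step allocates. A minimal sketch (purely illustrative: the box counts and the 4-byte float assumption are placeholders, not what the actual function uses):

```python
def rough_box_memory_gb(DIM, HII_DIM, n_hires_boxes=1, n_lores_boxes=10,
                        bytes_per_cell=4):
    """Memory taken by the 3D boxes alone, assuming float32 cells.

    The box counts here are illustrative placeholders, not the numbers
    tracked by the real estimator.
    """
    hires = n_hires_boxes * DIM ** 3 * bytes_per_cell
    lores = n_lores_boxes * HII_DIM ** 3 * bytes_per_cell
    return (hires + lores) / 1024 ** 3


print(rough_box_memory_gb(DIM=525, HII_DIM=175))  # ~0.74 GB for these counts
```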

Also, for the same reason, coming up with tests is problematic, as these tests will always pass unless someone manually changes these functions.

I'd greatly appreciate your thoughts/comments/feedback on what I have thus far.

Fixes #26

BradGreig commented 3 years ago

Hey @steven-murray, thanks for the quick feedback.

With respect to the tests, I think that would be great. However, I wonder how accurate that would be. Different machines/compilers might allocate different amounts of memory, so these functions may only be approximate relative to actual usage. You might have to allow some level of tolerance (maybe 10 per cent) to get the two numbers matching closely enough to pass the test.

I'm about to generate some runs to test how these functions perform relative to actual usage, and also to test how they scale with HII_DIM: if I get the scaling correct, that gives me confidence that the memory is being accurately captured by these functions.

steven-murray commented 3 years ago

@BradGreig: perfect. Yeah, we'll put in a fairly wide tolerance. The idea would just be that we should capture the broad trends. It would be unusual for someone to require the exact memory to within 10%.
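
For concreteness, such a test might look like the sketch below. The estimator name estimate_memory_lightcone and its return value are hypothetical stand-ins for what this PR adds, and peak RSS via the resource module is just one of several ways of measuring the actual usage:

```python
import resource

import pytest
import py21cmfast as p21c


def test_lightcone_memory_estimate():
    user_params = {"HII_DIM": 50, "DIM": 150}

    # Hypothetical name for the estimator added in this PR.
    estimate_bytes = estimate_memory_lightcone(user_params=user_params)

    p21c.run_lightcone(redshift=8.0, user_params=user_params)

    # ru_maxrss is the peak resident set size of this process
    # (reported in kB on Linux, in bytes on macOS).
    peak_bytes = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss * 1024

    # Fairly wide tolerance: we only want to capture the broad trends.
    assert peak_bytes == pytest.approx(estimate_bytes, rel=0.1)
```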

BradGreig commented 3 years ago

Hey @steven-murray, I have been playing around with comparing the light-cone memory function to actual usage. For larger boxes it seems to do relatively well (still testing that, though); however, for smaller boxes it ends up being an under-estimate.

The under-estimate primarily comes from the fact that, for small HII_DIM, the memory allocated for some small interpolation tables and other variables (which aren't included in this function) is comparable to that of the 3D boxes themselves. As HII_DIM increases this becomes less of a problem, since those allocations are dwarfed by the 3D boxes.

Therefore, I'm not sure one can simply have tests with a relatively large tolerance to capture this for the entire test suite.
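
To make the scaling concrete (illustrative numbers only; the actual table sizes aren't quantified here): a single float32 box grows as HII_DIM**3, while the interpolation tables do not scale with HII_DIM:

```python
# Size of one float32 box: fixed-size tables are comparable to the first of
# these but negligible next to the last.
for HII_DIM in (50, 128, 512):
    box_mb = HII_DIM ** 3 * 4 / 1024 ** 2
    print(f"HII_DIM={HII_DIM}: {box_mb:.1f} MB per box")  # 0.5, 8.0, 512.0 MB
```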

steven-murray commented 3 years ago

Hey @BradGreig, I guess that makes some sense. Two options:

1. Capture at least the interpolation tables in the memory function (not sure how difficult that is).
2. Just report in the print_ function that this is a lower estimate if HII_DIM < some_number, and when we test, just test that the estimate is below the actual memory usage (since we essentially only test small boxes).

BradGreig commented 3 years ago

Hey @steven-murray, yeah, it is possible to capture the interpolation tables. It's straightforward; it's just that those weren't relevant for the main usage of this (trying to fit large boxes into memory etc.), but I'll go away and add those. By the way, another reason why I wasn't tracking these tables is that their size is defined by a C-side variable that is not accessible to Python (i.e. not in global_params). I don't think those variables will ever change, though.

Upon further research into memory recording (with help from @qyx268), one thing I am realising is that accurately determining the memory usage of a process is non-trivial and potentially inaccurate. It's fine for a single-threaded process, but one can run into issues with multi-threading (e.g. what is considered shared vs. private). Also, we have no control over whether and at what point swap memory is used. Further, we can't guarantee that the compiler is being intelligent in how it is allocating/using page memory. Thus I think dynamically measuring the memory and comparing it against these functions is fraught with danger. One could do this with valgrind, but valgrind is super slow, so I'm not going to consider that.

Thus, I think the best way to think of this is that these functions provide an upper limit on the physical memory required to perform a calculation. That is, provided you have that much physical memory (or very close to it), the calculation should be able to proceed without crashing due to a lack of memory. That makes sense to me.

As an example (in the attached file), I am showing the tracked physical memory for a full light-cone (no minihaloes, but everything else on). This is for a DIM=525, HII_DIM=175 box, saving 4 light-cones. The horizontal dashed line corresponds to the peak physical memory estimated by my function, and the blue curve is the physical memory reported by ps. The coloured patches denote which section the code is in: grey = ICs, orange = perturbed field, green = spin temperature and purple = ionisation field. As you can see, the actual physical memory is always below my estimate. I don't anticipate the interpolation tables making up the difference, so I am confident the function serves its purpose of providing an upper limit. It appears to be correct for smaller mini-halo boxes too.
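
(For reference, one simple way to produce such a trace, not necessarily the exact script used here, is to poll ps for the resident set size of the running process:)

```python
import subprocess
import time


def track_rss(pid, interval=1.0, logfile="rss_log.txt"):
    """Poll `ps` for the resident set size (in kB) of `pid` until it exits."""
    with open(logfile, "w") as f:
        while True:
            out = subprocess.run(
                ["ps", "-o", "rss=", "-p", str(pid)],
                capture_output=True,
                text=True,
            )
            if out.returncode != 0:  # the process has finished
                break
            f.write(f"{time.time()} {out.stdout.strip()}\n")
            f.flush()
            time.sleep(interval)
```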

MemoryUsageOutput.pdf

steven-murray commented 3 years ago

Hey @BradGreig, that sounds perfectly reasonable. Yes, I don't think we should be trying to predict the behaviour of the OS! If we do add the interpolation tables, we can properly call it an upper limit, and I think that's the most useful thing we can do. Should be able to test that as well.

BradGreig commented 3 years ago

Hey @qyx268, while individual functions can be easily implemented, I'm not really sure how useful they'll be.

I can see some value for initial_conditions, but for the rest I don't see the benefit of having individual functions, particularly given that all the requisite information will be provided by run_coeval or run_lightcone.

But I guess it can be added if it is deemed important and useful enough.

BradGreig commented 3 years ago

Hey @steven-murray, following the completion of #220, I just want to clarify something. In the default case (i.e. MINIMIZE_MEMORY=False), does the memory usage remain as it was prior to #220? If so (and I am assuming it does), then it should only be a matter of adding a MINIMIZE_MEMORY=True output to what I have already done.

steven-murray commented 3 years ago

Hi @BradGreig, no, the memory is not quite the same for run_coeval or run_lightcone, because now the initial conditions (at least in most cases) are purged from memory before doing anything after the perturb field. So the general flow of these functions is now: generate the initial conditions, compute all of the perturbed fields, purge the initial conditions from memory, then compute the remaining boxes at each redshift.

This should in general reduce the peak memory. Also, each box now only initializes the arrays that it actually needs, given the input parameters (so for example, the _2LPT arrays in the ICs are only allocated if USE_2LPT=True). This will reduce the memory in most cases, not just the peak memory.

You can determine the exact OUTPUT memory required for each kind of output more easily now as well (without actually allocating it all). If you have an instantiated output object, you can use _array_structure to get the various box sizes, eg:

```python
init = p21c.outputs.InitialConditions(user_params={...}, cosmo_params={...})
print(init._array_structure)
```

Just to be clear, creating the init here does not actually allocate any arrays -- these are allocated only as required just before passing the object into C. The _array_structure is a dictionary of keys corresponding to arrays, and then the values are usually tuples giving the array shape, but are sometimes dicts themselves, with entries "shape" and "init", where the "init" is a function used to initialize the array (by default this is np.zeros but for some arrays it has to be np.ones).

So you can easily use this dictionary to compute the total output size of the struct. You'll then just need to add all the arrays that are allocated INSIDE the function, and of course for any of the _c_based_pointers in the struct you can't know their size before they are returned from C (at the moment this is limited to the halo field).
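
For instance, a rough helper along these lines (assuming 4-byte float cells, and ignoring the in-function arrays and _c_based_pointers as noted above):

```python
import numpy as np


def output_struct_nbytes(struct, itemsize=4):
    """Total size of the arrays listed in `struct._array_structure`,
    assuming `itemsize` bytes per cell (4 for float32)."""
    total = 0
    for name, spec in struct._array_structure.items():
        # Values are either a shape tuple, or a dict with "shape" and "init".
        shape = spec["shape"] if isinstance(spec, dict) else spec
        total += int(np.prod(shape)) * itemsize
    return total
```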

I think that covers all the bases? Lmk if you need any further info!

BradGreig commented 3 years ago

Hey @steven-murray, awesome thanks for the run-down. Just needed a quick idea to know what I was going to have to deal with, and how much time I might need to set aside. Seems like it might be a bit more involved than I was thinking, but good to know beforehand.

BradGreig commented 1 year ago

Hey @steven-murray once #320 gets merged in I might look at finally completing this PR. Not that it actually requires #320, but I think it'll be a clean/stable base to use.

steven-murray commented 1 year ago

@BradGreig that'd be great!

BradGreig commented 1 year ago

Hey @steven-murray, so I've tried to update all the memory functions I had following all the recent changes. But I've noticed a couple of issues; I'm not sure if it's my interpretation or whether something else is not quite right. To be honest, I didn't look too closely at the behaviour of the functions.

As of now, the functions return estimates of the necessary memory for various options. However, they do not resemble what I find when performing an actual run.

The primary issue appears to be with the purging (either it doesn't behave like it should, or I'm misinterpreting its behaviour).

Basically, the memory functions return a close match for the peak memory, but they don't remotely match the ongoing memory throughout the calculation. For example, under the logic you described above (regarding purging), after the perturbed fields are computed the ICs are removed from memory. Thus, the peak memory should always occur during the IC generation and/or the determination of the perturbed fields.

However, what I find in practice is that the memory only ever increases throughout a run. It never shows a significant drop following a purge of the ICs. Perhaps the compiler doesn't properly release the memory? That is, once it is allocated it never gets properly released to be used again.

This behaviour will result in an underestimate of the peak memory usage, since with purging it is assumed that the peak happens during IC generation. However, in practice, since it appears this memory isn't completely released, the original memory plus the memory of everything else results in a larger peak.

Secondly, if MINIMIZE_MEMORY is False, it appears all perturbed_field boxes are kept in memory at once. That is, for a regular light-cone (down to z = 5.5) that amounts to 88 perturbed fields held in memory. Is this desirable behaviour? I guess it comes about from the restructure of the code to purge the ICs, but I can see a few instances where this is far from ideal.
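
To put a rough number on it, assuming (illustratively) two float32 boxes per perturbed field at HII_DIM=175:

```python
n_redshifts, boxes_per_field, HII_DIM = 88, 2, 175
gb = n_redshifts * boxes_per_field * HII_DIM ** 3 * 4 / 1024 ** 3
print(f"{gb:.1f} GB just for the stored perturbed fields")  # ~3.5 GB
```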

steven-murray commented 1 year ago

@BradGreig what are you using to capture the memory usage? I know the OS sometimes keeps memory reserved for an application even after it has been deallocated, just in case it needs it again. I think there's some distinction between the kinds of memory the OS reports for different purposes, and I can never quite remember which kind of RAM estimate to use.

Certainly the idea is that once we do perturbed fields, the initial conditions should be dropped from memory, so you should get lower current memory usage then.

Keeping all the perturbed field boxes in memory sounds like a terrible idea, so I'll try fixing that.
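
For what it's worth, psutil (not something the code itself uses) makes the distinction reasonably clear: rss is the memory actually resident in RAM, while vms is the total virtual address space, which can include reserved-but-unused and swapped pages:

```python
import os

import psutil  # third-party: pip install psutil

mem = psutil.Process(os.getpid()).memory_info()
# rss: resident set size (physical RAM in use); vms: virtual memory size.
print(f"RSS = {mem.rss / 1024**2:.1f} MB, VMS = {mem.vms / 1024**2:.1f} MB")
```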

BradGreig commented 1 year ago

Hey @steven-murray, in this case I was scraping the information from top on the command line during the run. Additionally, I was grabbing the max memory from the HPC job monitor for the submitted job.

I think storing all the perturbed fields has some merit, but it is highly dependent on the actual setup, which should be determinable within the function. So if possible you probably want to allow both options.

For example, if the total memory required for all the perturbed fields is less than that of the ICs, then keeping them all is preferable. Otherwise, if the number of redshifts is too large, only holding two perturbed fields at any one time makes more sense.
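
A minimal sketch of that decision (the names are illustrative only):

```python
def keep_all_perturb_fields(n_redshifts, perturb_field_nbytes, ics_nbytes):
    """Keep every perturbed field in memory only if their total size is
    smaller than that of the initial conditions; otherwise hold just two."""
    return n_redshifts * perturb_field_nbytes < ics_nbytes
```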