amnsbr / cubnm

A toolbox for biophysical network modeling on GPUs
https://cubnm.readthedocs.io
BSD 3-Clause "New" or "Revised" License

[BUG] Memory leak in batch optimize #29

Open amnsbr opened 1 month ago

amnsbr commented 1 month ago

On an A100 device I was running a batch optimization with 32 CMAES optimizers, each with 64 particles (total of 2048 simulations per generation), and it failed at the 11th (of 80) generation with this error: `Error unknown error at line 1446 in file /path/to/bnm.cu`. This line is in `_init_gpu` and allocates memory for one of the arrays. Maximum memory usage in the job when it failed was 33 GB, whereas maximum memory usage in a test job running a single generation was 4 GB. This indicates a memory leak. A quick look at the code suggests two causes:

  1. In `optimize.batch_optimize`, the GPU session is reinitialized at every iteration. This is because each generation creates a new `MultiSimGroup` instance, so when its `run` method is called it is always that instance's first run, which forces `force_reinit` (in `SimGroup.run`) to be True and reinitializes the model in every generation.
  2. Regardless, even when the model is reinitialized in every generation, there should be no memory leak! There is a logical bug in the program flow: `Model::free_gpu` (in bnm.cu) is only called when a `Model` object is deleted, which happens only after a CPU ↔ GPU switch or a model name change. It is therefore not called when `force_reinit` is True and reinitialization happens on a later run without the model object being deleted, as happens here (see the sketch below).
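To illustrate the second cause, here is a minimal sketch of the leak pattern and one possible fix, namely releasing any existing device allocations at the top of the init routine instead of only in the destructor. The names and structure are simplified stand-ins, not the actual code in bnm.cu:

```cuda
// Minimal sketch of the leak; names and structure are hypothetical,
// only the pattern matches the bug in _init_gpu / Model::free_gpu.
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "Error %s at line %d in file %s\n",       \
                    cudaGetErrorString(err), __LINE__, __FILE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

struct Model {
    double *d_state = nullptr;  // device array allocated in init_gpu

    void init_gpu(size_t n) {
        // Fix: release the previous allocation before reallocating.
        // Without this call, a force_reinit on a later run leaks
        // d_state, because free_gpu only runs from the destructor
        // (i.e., on a CPU <-> GPU switch or a model name change).
        free_gpu();
        CUDA_CHECK(cudaMalloc(&d_state, n * sizeof(double)));
    }

    void free_gpu() {
        if (d_state != nullptr) {
            CUDA_CHECK(cudaFree(d_state));
            d_state = nullptr;
        }
    }

    ~Model() { free_gpu(); }
};
```

Leaking one full set of allocations per generation would also be roughly consistent with the observed growth from ~4 GB in the single-generation test job to 33 GB by the 11th generation.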
amnsbr commented 1 month ago

I reran the same job after the fix, and this error did not occur. However, the final memory usage was 32 GB, which might be related to rerunning the optimal simulations while saving them. This is done serially for each optimizer and is also not very efficient. Instead, it would be more beneficial to run the optimizers' optimal simulations in batch through a `MultiSimGroup` as well.
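As a generic illustration of why batching helps (a sketch with a placeholder kernel, not cubnm's actual API): rerunning each optimizer's optimal simulation in its own launch pays kernel-launch and synchronization overhead per optimizer and leaves most of the device idle, whereas a single batched launch covers all of them at once:

```cuda
// Generic sketch with a placeholder kernel; not cubnm's actual API.
#include <cuda_runtime.h>

// Stand-in for one simulation per block.
__global__ void run_sims(const double *params, double *out) {
    int sim = blockIdx.x;
    if (threadIdx.x == 0) {
        out[sim] = params[sim] * 2.0;  // placeholder "simulation"
    }
}

// Serial: one launch (and sync) per optimizer's optimal simulation.
void rerun_optima_serial(const double *d_params, double *d_out, int n_opt) {
    for (int i = 0; i < n_opt; i++) {
        run_sims<<<1, 32>>>(d_params + i, d_out + i);
        cudaDeviceSynchronize();
    }
}

// Batched: a single launch covering all n_opt optimal simulations,
// analogous to rerunning them through one MultiSimGroup.
void rerun_optima_batched(const double *d_params, double *d_out, int n_opt) {
    run_sims<<<n_opt, 32>>>(d_params, d_out);
    cudaDeviceSynchronize();
}
```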