amnsbr opened this issue 1 month ago
On an A100 device I was running a batch optimization with 32 CMAES optimizers, each with 64 particles (total number of simulations in each generation = 2048), and it failed at the 11th (of 80) generation with this error:

```
Error unknown error at line 1446 in file /path/to/bnm.cu
```

This line is in `_init_gpu` and is allocating memory for one of the arrays. Max memory usage in the job when it failed was 33 GB, whereas max memory usage in a test job running a single generation was 4 GB. This indicates a memory leak. From a quick look at the code, the causes of this leak include the following (a toy sketch of their combined effect follows the list):

- In `optimize.batch_optimize`, the GPU session is reinitialized at each iteration. This is because in each generation a new `MultiSimGroup` instance is created, and when its `run` method is called it is always in its first run, leading `force_reinit` (in `SimGroup.run`) to be `True`; the model is therefore reinitialized in every generation.
- `Model::free_gpu` (in `bnm.cu`) is only called when a `Model` object is deleted, which happens when a CPU ↔ GPU switch has occurred or the model name has changed. It is therefore not called when `force_reinit` is set to `True` and reinitialization happens in later runs without the `Model` object being deleted (as happens in this case).
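To make the combined effect of these two points concrete, below is a minimal, self-contained Python toy that mirrors the lifecycle described above. The names (`MultiSimGroup`, `run`, `init_gpu`, `free_gpu`, `force_reinit`) are taken from this issue, but the bodies are illustrative stand-ins, not cuBNM's actual implementation, and no real GPU calls are made.

```python
# Toy model of the leak: a fresh MultiSimGroup per generation forces a
# reinit on every run, and nothing on that path ever calls free_gpu.

allocated_arrays = 0  # stands in for the device buffers allocated in bnm.cu


class Model:
    """Illustrative stand-in for the C++ Model object in bnm.cu."""

    def init_gpu(self):
        # Corresponds to the allocations around line 1446 of bnm.cu.
        global allocated_arrays
        allocated_arrays += 1

    def free_gpu(self):
        # In bnm.cu this is only reached when the Model object is deleted
        # (CPU <-> GPU switch or model-name change), which never happens
        # in the loop below.
        global allocated_arrays
        allocated_arrays -= 1


model = Model()  # a single long-lived Model, as in the failing job


class MultiSimGroup:
    """Illustrative stand-in: batch_optimize builds a new one per generation."""

    def __init__(self):
        self.n_runs = 0  # a fresh instance has never run before

    def run(self):
        force_reinit = (self.n_runs == 0)  # always True on a new instance
        if force_reinit:
            # Reinitializes without freeing: free_gpu is not called on this
            # path, so the previous generation's arrays are never released.
            model.init_gpu()
        self.n_runs += 1


for generation in range(80):
    group = MultiSimGroup()  # new instance every generation ...
    group.run()              # ... so force_reinit is always True

print(allocated_arrays)  # 80: one set of leaked allocations per generation
```

Each generation thus holds on to another full set of device arrays, so memory grows with the generation count instead of staying flat, which is in line with the growth from ~4 GB for a single generation to 33 GB by the 11th.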
---

amnsbr commented:

I reran the same job after the fix, and this error did not occur. However, the final memory usage was 32 GB, which might be related to the rerunning of optimal simulations while saving them. This is done serially for each optimizer and is also not very efficient. Instead, it would be more beneficial to run the optimizers' optimal simulations through a `MultiSimGroup` in batch as well (a rough sketch follows).
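To illustrate that suggestion, the sketch below contrasts the serial pattern with a batched one. `MultiSimGroup` is the class named above, but the helper names used here (`make_sim_group`, `optimal_params`, `save_output`) are placeholders assumed for illustration, not cuBNM's actual API.

```python
# Hypothetical sketch: batch the optimal simulations of all optimizers
# instead of re-running and saving them one optimizer at a time.

def save_optimal_serial(optimizers, make_sim_group):
    # Current behavior: one simulation group per optimizer, run and
    # saved one after another (32 separate runs for 32 optimizers).
    for opt in optimizers:
        group = make_sim_group([opt.optimal_params])
        group.run()
        group.save_output()


def save_optimal_batched(optimizers, make_sim_group):
    # Suggested behavior: a single MultiSimGroup over all optimizers,
    # so the 32 optimal simulations run as one batch on the GPU.
    group = make_sim_group([opt.optimal_params for opt in optimizers])
    group.run()
    group.save_output()
```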