amnsbr opened this issue 1 month ago
On an A100 device I was running a batch optimization with 32 CMAES optimizers, each with 64 particles (total number of simulations in each generation = 2048), and it failed at the 11th (of 80) generation with this error:

```
Error unknown error at line 1446 in file /path/to/bnm.cu
```

This line is in `_init_gpu` and is allocating memory for one of the arrays. Max memory usage in the job when it failed was 33 GB, whereas max memory usage in a test job running a single generation was 4 GB. This indicates a memory leak. From a quick look at the code, the causes of this leak include the following (a toy sketch of their combined effect follows the list):

- In `optimize.batch_optimize`, the GPU session is reinitialized at each iteration. This is because in each generation a new `MultiSimGroup` instance is created, and when its `run` method is called it is always in its first run, leading `force_reinit` (in `SimGroup.run`) to be `True`; the model is therefore reinitialized in every generation.
- `Model::free_gpu` (in `bnm.cu`) is only called when a `Model` object is deleted, which happens when a CPU ↔ GPU switch has occurred or the model name has changed. It is therefore not called when `force_reinit` is set to `True` and reinitialization happens in later runs without the `Model` object being deleted (as happens in this case).
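To make the combined effect of these two points concrete, below is a minimal, self-contained Python toy that mirrors the lifecycle described above. The names (`MultiSimGroup`, `run`, `init_gpu`, `free_gpu`, `force_reinit`) are taken from this issue, but the bodies are illustrative stand-ins, not cuBNM's actual implementation, and no real GPU calls are made.

```python
# Toy model of the leak: a fresh MultiSimGroup per generation forces a
# reinit on every run, and nothing on that path ever calls free_gpu.

allocated_arrays = 0  # stands in for the device buffers allocated in bnm.cu


class Model:
    """Illustrative stand-in for the C++ Model object in bnm.cu."""

    def init_gpu(self):
        # Corresponds to the allocations around line 1446 of bnm.cu.
        global allocated_arrays
        allocated_arrays += 1

    def free_gpu(self):
        # In bnm.cu this is only reached when the Model object is deleted
        # (CPU <-> GPU switch or model-name change), which never happens
        # in the loop below.
        global allocated_arrays
        allocated_arrays -= 1


model = Model()  # a single long-lived Model, as in the failing job


class MultiSimGroup:
    """Illustrative stand-in: batch_optimize builds a new one per generation."""

    def __init__(self):
        self.n_runs = 0  # a fresh instance has never run before

    def run(self):
        force_reinit = (self.n_runs == 0)  # always True on a new instance
        if force_reinit:
            # Reinitializes without freeing: free_gpu is not called on this
            # path, so the previous generation's arrays are never released.
            model.init_gpu()
        self.n_runs += 1


for generation in range(80):
    group = MultiSimGroup()  # new instance every generation ...
    group.run()              # ... so force_reinit is always True

print(allocated_arrays)  # 80: one set of leaked allocations per generation
```

Each generation thus holds on to another full set of device arrays, so memory grows with the generation count instead of staying flat, which is in line with the growth from ~4 GB for a single generation to 33 GB by the 11th.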
---

amnsbr commented:

I reran the same job after the fix, and this error did not occur. However, the final memory usage was 32 GB, which might be related to the rerunning of optimal simulations while saving them. This is done serially for each optimizer and is also not very efficient. Instead, it would be more beneficial to run the optimizers' optimal simulations through a `MultiSimGroup` in batch as well (a rough sketch follows).
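To illustrate that suggestion, the sketch below contrasts the serial pattern with a batched one. `MultiSimGroup` is the class named above, but the helper names used here (`make_sim_group`, `optimal_params`, `save_output`) are placeholders assumed for illustration, not cuBNM's actual API.

```python
# Hypothetical sketch: batch the optimal simulations of all optimizers
# instead of re-running and saving them one optimizer at a time.

def save_optimal_serial(optimizers, make_sim_group):
    # Current behavior: one simulation group per optimizer, run and
    # saved one after another (32 separate runs for 32 optimizers).
    for opt in optimizers:
        group = make_sim_group([opt.optimal_params])
        group.run()
        group.save_output()


def save_optimal_batched(optimizers, make_sim_group):
    # Suggested behavior: a single MultiSimGroup over all optimizers,
    # so the 32 optimal simulations run as one batch on the GPU.
    group = make_sim_group([opt.optimal_params for opt in optimizers])
    group.run()
    group.save_output()
```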