Memory leak leads to signal 9 using scalar model

Amin-83 commented 2 years ago

Dear Thejs,

I am running the scalar field model as implemented in CLASS (I made no changes there) with the attached param file in MontePython, using the JLA data. Running:

mpirun -np 4 python montepython/MontePython.py run -o chains/jla_scf -p input/jla_scf.param -N 1000

successfully generates a covariance matrix and gives a good acceptance rate. However, when I then try to refine the run using

mpirun -np 8 python montepython/MontePython.py run -o chains/jla_scf -p input/jla_scf.param -N 20000 -c chains/jla_scf.covmat -b chains/jla_scf.bestfit

I run into a memory issue and the run crashes with the following message (I am using my desktop computer):

mpirun noticed that process rank 4 with PID 0 on node BOHR exited on signal 9 (Killed).

Could it be that in the second run a lot of points get rejected, so the parameter space gets sampled at a high frequency, which leads to a memory leak?

Thank you in advance, Amin

jla_scf_param.txt

brinckmann commented 2 years ago

Hi Amin,

There is probably a memory leak in CLASS that causes this once you sample a large number of points; it doesn't show up in a shorter run because you never reach the memory limit. In particular, I believe there are some problems with the scf implementation in CLASS that need to be fixed. You could try posting on the CLASS github instead, as this is not a MontePython problem.
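A rough way to check for this kind of leak is to call the same computation many times in one process and watch whether the peak memory keeps climbing. The sketch below is only an illustration: `peak_rss_mb` and `memory_growth` are hypothetical helper names, not part of MontePython or CLASS, and `evaluate` stands in for whatever single-point computation you want to test. It uses the standard-library `resource` module, so it assumes a Unix system.

```python
import resource
import sys


def peak_rss_mb():
    """Return the peak resident set size of this process in MB (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in bytes on macOS and in kilobytes on Linux.
    return peak / (1024 ** 2 if sys.platform == "darwin" else 1024)


def memory_growth(evaluate, n_calls, report_every=100):
    """Call evaluate() n_calls times and return the growth in peak RSS (MB).

    `evaluate` is a placeholder for one likelihood or model evaluation;
    how it is wired to the actual code is left to the user.
    """
    start = peak_rss_mb()
    for i in range(1, n_calls + 1):
        evaluate()
        if i % report_every == 0:
            print(f"call {i}: peak RSS grew by {peak_rss_mb() - start:.1f} MB")
    return peak_rss_mb() - start
```

If the reported growth keeps increasing roughly linearly with the number of calls instead of levelling off, memory is being allocated and never released. Peak RSS is used rather than Python-level tracing because a leak inside compiled C code would not be visible to tools that only track Python objects.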

Best, Thejs

Amin-83 commented 2 years ago

Hi Thejs,

Thanks for your reply. I have been fiddling with the upper and lower limits as well as the sigma values of the priors and managed to get it to run without a memory leak. However, it is very delicate: any change in the limits brings me back to square one. I will post the issue on the CLASS github as you suggested.

Thanks again.

Best, Amin

brinckmann / montepython_public, issue #281