brinckmann / montepython_public

Public repository for the Monte Python Code
MIT License

zombie chain files & Memory leak #321

Closed. Amlan1996 closed this issue 9 months ago

Amlan1996 commented 1 year ago

I have modified CLASS for a new dark energy model, and it works fine. I then ran it with MontePython. The first run went well and produced a covmat and a best-fit file. I used them to start a second, fresh run rather than restarting the first one (I am having some issues with restarting, as mentioned in #314). The mpirun command was the following:

mpirun -np 16 python montepython/MontePython.py run -p input/new.param -o chains/cham_ban1 -b chains/cham_ban.bestfit -c chains/cham_ban.covmat --superupdate 20 -N 300000

Then I started a third run and began to see this memory leak problem. After running for 2 to 3 hours, some chain files stopped receiving output and became zombie files. This spread progressively to more chains until the system automatically terminated the whole job, without writing anything to the error file and leaving only this at the end of the output file:

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 0 PID 3018839 RUNNING AT nova19
=   KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 1 PID 3018840 RUNNING AT nova19
=   KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================

I thought the covmat and best-fit files had been corrupted, so I started again from scratch and followed the same procedure as above, but from the second run onward the chains started to freeze and became zombies.

It would be great if anyone (@brinckmann, @dchooper, @lesgourg) or someone else from the community could help me out, as I have been struggling with this problem for months.

Best regards, Amlan
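The stalled ("zombie") chain files described in this report can be spotted by checking when each file in the output folder was last written to. The following is only a minimal sketch: the function name `find_stalled_chains`, the one-hour threshold, and the `*.txt` pattern for chain files are assumptions made for illustration, not part of MontePython.

```python
import time
from pathlib import Path


def find_stalled_chains(folder, max_idle_seconds=3600.0, pattern="*.txt", now=None):
    """Return (file name, idle seconds) for files not modified recently.

    The pattern and the threshold are illustrative assumptions; adjust them
    to the way the chain files are actually named and how often they update.
    """
    now = time.time() if now is None else now
    stalled = []
    for path in sorted(Path(folder).glob(pattern)):
        # Time elapsed since the file was last written to.
        idle = now - path.stat().st_mtime
        if idle > max_idle_seconds:
            stalled.append((path.name, idle))
    return stalled
```

Calling `find_stalled_chains("chains/cham_ban1")` while the job is running would list the files that have not grown for more than an hour, which helps to record when each chain stopped and how the stall spreads.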

DinorahBarbosa commented 9 months ago

I'm having the same issue. I was running CLASS + MontePython with CMB data just fine at first. Then, a few weeks later, when I ran some of them again, I ran into this issue, and now it seems unavoidable.

brinckmann commented 9 months ago

Hi,

There isn't enough information here to understand the issue, but in general, if you're running a modified version of CLASS, you should check whether you're running into parts of parameter space that are not allowed, or where the sampler gets stuck and can't take a next step. These issues can be tricky to understand and depend very much on the individual case. In general you should check that either CLASS can be run over the entire parameter range that you are sampling (especially for your new parameters), or that the parts of parameter space that aren't allowed are naturally disfavored by the likelihood, so that the code is very unlikely to go there. Otherwise you may need to put a prior range on your parameters that eliminates those parts of parameter space.

Best, Thejs
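The check suggested above, namely whether the model can be evaluated over the whole sampled range, can be sketched as a coarse grid scan. This is a minimal sketch under stated assumptions: `scan_parameter_box`, `toy_model`, and the parameter names `p1` and `p2` are hypothetical. In practice `run_model` would wrap a call into the modified CLASS, which is not shown here. Note also that a `try`/`except` only catches Python exceptions; a hard crash such as a segmentation fault would kill the scan itself, so each point would then have to be evaluated in a separate process.

```python
import itertools


def scan_parameter_box(run_model, ranges, points_per_axis=5):
    """Evaluate run_model on a regular grid and collect the points that fail.

    ranges maps each parameter name to a (low, high) tuple.
    points_per_axis must be at least 2.
    """
    names = list(ranges)
    axes = []
    for name in names:
        low, high = ranges[name]
        step = (high - low) / (points_per_axis - 1)
        axes.append([low + i * step for i in range(points_per_axis)])

    failures = []
    for values in itertools.product(*axes):
        point = dict(zip(names, values))
        try:
            run_model(point)
        except Exception as err:
            # Record where the model could not be evaluated, and why.
            failures.append((point, str(err)))
    return failures


def toy_model(point):
    # Hypothetical stand-in for the real model call: it refuses part of
    # the box so that the scan has something to report.
    if point["p1"] * point["p2"] > 1.0:
        raise ValueError("toy model: p1 * p2 > 1 is not allowed")
    return 0.0


failed_points = scan_parameter_box(
    toy_model, {"p1": (0.0, 2.0), "p2": (0.0, 2.0)}, points_per_axis=3
)
for point, message in failed_points:
    print(point, message)
```

Regions flagged this way are candidates for exclusion through a prior range, unless the likelihood already disfavors them strongly.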