brinckmann / montepython_public

Public repository for the Monte Python Code
MIT License

Excessive memory usage #322

Closed mcosta18 closed 9 months ago

mcosta18 commented 1 year ago

Hi, I have run into the following problem running MontePython (v3.5). Whenever I run a chain (without MPI) aiming for 1M steps, I always need to request a huge amount of memory from the cluster (of order 16-32 GB). Otherwise the job gets killed before the walltime with the following message:

PBS: job killed: mem 33569636kb exceeded limit 33554432kb

As far as I understand, such a high memory usage is not normal.

I have run into this problem using several combinations of CLASS versions (the official release and a custom version) and likelihoods (Planck2018TTTEEE, pybird BOSS data). This makes me suspect a memory deallocation problem whenever MontePython calls CLASS, but I could not get any further insight.
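One way to check the leak suspicion outside a full run is to repeat a single call many times and watch the peak resident memory of the process. The following is only a minimal sketch using the standard `resource` module: `track_peak_rss` and `peak_rss_kb` are hypothetical helper names, and `step` stands for whatever single evaluation is under suspicion (for example one CLASS computation followed by its cleanup). The demo at the bottom uses a deliberately leaky toy function, not CLASS.

```python
import resource


def peak_rss_kb():
    """Peak resident set size of this process.

    ru_maxrss is reported in kilobytes on Linux (bytes on macOS).
    """
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss


def track_peak_rss(step, n_steps, report_every=100):
    """Call step() n_steps times; sample (iteration, peak RSS) periodically.

    `step` is a placeholder for the call suspected of leaking.
    """
    samples = []
    for i in range(1, n_steps + 1):
        step()
        if i % report_every == 0:
            samples.append((i, peak_rss_kb()))
    return samples


if __name__ == "__main__":
    leaked = []

    def leaky_step():
        # Toy leak: keep a reference to 1 MiB of touched memory per call.
        leaked.append(bytearray(b"x" * (1024 * 1024)))

    for iteration, rss in track_peak_rss(leaky_step, 20, report_every=5):
        print(iteration, rss)
```

If the sampled peak keeps rising roughly linearly with the iteration count, memory is being retained on every call; a peak that flattens after the first few iterations would point away from a per-call leak.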

Some further details:
- I am using Python 2.7.18.
- The cluster uses PBS.
- I do not use MPI; I launch the run command multiple times in the same folder (the problem exists even when I run a single chain).
- This is an example of the MontePython command I am sending to the cluster:

#PBS -l nodes=1:ppn=8
#PBS -l mem=32GB
(other PBS stuff that is not relevant)
python /path/to/montepython/MontePython.py run -p /scratch200/diegor/montepython_public-3.5/input/test.param -o /path/to/chains/test_out --conf path/to/class/class.conf -N 1000000 -f 1.5 --update 30 --superupdate 20 > path/to/out/test.out

Thanks for the attention

Marco

mcosta18 commented 1 year ago

Any updates on this?

brinckmann commented 1 year ago

Hi Marco,

I'm sorry for not reacting to this; I didn't really have any good ideas about what the problem is. I agree it sounds like a memory leak. Since you're still having the problem, let's try some tests to narrow down the possibilities.

The usual culprit is CLASS modifications leading to a memory leak, but you say you tried with standard CLASS too.

Did you try with standard CLASS running LCDM with only Planck? Does it still do it then? That's a case that should work. Specifically, I'm wondering about the BOSS pybird likelihood, as I don't have experience with it, so I wanted you to try without that.

I don't know if CLASS 3.2 has any memory leaks, but I'd be a bit surprised, especially if it's ones that show up after only 1000 steps. Maybe you could also try standard CLASS 2.9?

I also wonder if superupdate has an unfortunate interaction with chains launched in parallel that wasn't caught during testing. Can you try without that option?

Best, Thejs

mcosta18 commented 1 year ago

Hi Thejs, thanks for the response.

Right now I am doing tests with the following setup: CLASS_public-3.2 and montepython_public-3.5, with the Planck likelihoods (compiled with the MKL libraries and ifort), as you suggested. More precisely, the data used are the following: data.experiments=['Planck_highl_TTTEEE', 'Planck_lowl_EE', 'Planck_lowl_TT']

I ran the qsub command 4 times with the example script from the first message, changing only the following specs:

#PBS -l nodes=1:ppn=4
#PBS -l mem=4GB

One of the 4 LCDM chains stopped prematurely (after roughly 10 hours) with the following error:

/var/spool/pbs/mom_priv/jobs/6334791.power8.tau.ac.il.SC: line 21: 31811 Segmentation fault python /path/to/montepython/montepython_public-3.5/montepython/MontePython.py run -p /path/to/montepython/montepython_public-3.5/input/base2018TTTEEE.param -o /path/to/montepython/montepython_public-3.5/chains/Planck_lcdm -N 1000000 -f 1.5 --update 30 --superupdate 20 > lcdm.out

The other 3 chains are still going after 20+ hours.

I am also doing 4 other LCDM runs without the superupdate option in the command. So far (5 hours in) they have had no problems and have reached a high enough acceptance rate.

I also did test runs with a modified CLASS version: 4 runs with superupdate and 4 without. Both sets crashed after roughly 4 hours with the memory error from the original post:

=>> PBS: job killed: mem 4237048kb exceeded limit 4194304kb

The only difference between these two latter cases (using the modified CLASS) is that without the superupdate the acceptance ratio of the chains did not reach a high value in the 4 hours.

To summarize: the modified CLASS has some issues that are not present in the public CLASS.

Thanks again for the attention

Best, Marco

JaelssonLima commented 4 months ago

I managed to solve a similar problem by limiting memory usage with the following code snippet, added to montepython_public[...]/montepython/MontePython.py:

##################
import resource

# Set the maximum memory usage limit in bytes (e.g. 1 GB = 1 * 1024 * 1024 * 1024 bytes)
limite_memoria = 24 * 1024 * 1024 * 1024  # 24 GB

# Set memory usage limit
resource.setrlimit(resource.RLIMIT_AS, (limite_memoria, limite_memoria))

#################
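Note that `RLIMIT_AS` caps the virtual address space of the process rather than freeing anything: once the cap is reached, further allocations fail (in pure Python this usually shows up as a `MemoryError`), so the job stops on its own terms instead of being killed by the scheduler. A slightly more defensive variant, as a sketch only, lowers just the soft limit and never asks for more than the existing hard limit. The helper names `clamp_soft_limit` and `cap_address_space` are hypothetical and not part of MontePython.

```python
import resource


def clamp_soft_limit(requested_bytes, hard_limit):
    """Return a soft limit that does not exceed the existing hard limit."""
    if hard_limit == resource.RLIM_INFINITY:
        # No hard cap is set, so any requested value is acceptable.
        return requested_bytes
    return min(requested_bytes, hard_limit)


def cap_address_space(requested_bytes):
    """Lower the soft RLIMIT_AS and leave the hard limit untouched.

    Returns the previous (soft, hard) pair so the caller can restore it.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_AS)
    new_soft = clamp_soft_limit(requested_bytes, hard)
    resource.setrlimit(resource.RLIMIT_AS, (new_soft, hard))
    return soft, hard


if __name__ == "__main__":
    limit_bytes = 24 * 1024 ** 3  # 24 GB, the value used in the snippet above
    previous = cap_address_space(limit_bytes)
    print("previous limits:", previous)
    print("current limits:", resource.getrlimit(resource.RLIMIT_AS))
```

Keeping the hard limit unchanged means the soft limit can still be raised again later in the same process, which is not possible once both values have been lowered.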

Regards, J.