brinckmann / montepython_public

Public repository for the Monte Python Code
MIT License

MP not running in parallel @ NERSC supercomputer. #326

Closed ClaudioNahmad closed 1 year ago

ClaudioNahmad commented 1 year ago

Hey, good morning.

I've run into an issue with MontePython's parallelization. I'm running MP on the NERSC cluster, where the equivalent of the mpirun command is srun. I submit the following script:

#!/bin/bash
#SBATCH -N 1
#SBATCH -C cpu
#SBATCH -q regular
#SBATCH -J b0b1b2_general
#SBATCH --mail-user=claudio.nahmad@gmail.com
#SBATCH --mail-type=ALL
#SBATCH -t 00:30:00

#OpenMP settings:
export OMP_NUM_THREADS=1
export OMP_PLACES=threads
export OMP_PROC_BIND=spread

#run the application:
srun -n 16 -c 16 --cpu_bind=cores python /global/homes/n/nahmad/COSMO/montepython_public-3.5/montepython/MontePython.py run -p ../input/0_b0b1b2_desi_priors.param --conf ../classb0b1b2.conf -o ../chains/b0b1b2_2023_4_27 -N 5000 --superupdate 20

The run fails with an error: MP seems to launch 16 independent copies, each trying to create the output directory '../chains/b0b1b2_2023_4_27', and it complains that the directory already exists. It seems like MP is not aware that srun is an MPI launcher, so I get 16 copies of the same job :(
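For what it's worth, a quick way to check whether srun is actually wiring the tasks into a shared MPI communicator is a one-liner sanity check (this assumes mpi4py is installed in the Python environment; the task count of 4 is arbitrary):

```shell
# Each task should report a distinct rank, e.g. "rank 0 of 4" ... "rank 3 of 4".
# If every task prints "rank 0 of 1", the processes are independent copies,
# which would explain 16 attempts to create the same output directory.
srun -n 4 python -c "from mpi4py import MPI; c = MPI.COMM_WORLD; print(f'rank {c.Get_rank()} of {c.Get_size()}')"
```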

Is there any way to fix this? Is MP only supposed to be run with mpi4py?

Thanks for the time to answer my issue!

brinckmann commented 1 year ago

Hi Claudio,

Can you try creating the directory with MontePython first (so a log.param is produced), e.g. by running the front end with -f 0 using the same param and conf files? When running on clusters it often gets confused otherwise, but if MontePython detects a log.param it should (hopefully) do things correctly. Let me know if that doesn't work; otherwise, I have some collaborators who use NERSC and can probably provide best-practice tips specific to NERSC, which is a system I'm not familiar with.
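If it helps, the suggested workflow might look like the sketch below, reusing the paths from the original job script (the assumption being that the serial -f 0 run creates the chain directory and its log.param, which the subsequent parallel run then detects):

```shell
# Step 1: serial front-end run (no srun) to create the output
# directory and write log.param.
python /global/homes/n/nahmad/COSMO/montepython_public-3.5/montepython/MontePython.py run \
    -p ../input/0_b0b1b2_desi_priors.param \
    --conf ../classb0b1b2.conf \
    -o ../chains/b0b1b2_2023_4_27 -f 0

# Step 2: parallel run as before; MontePython should now find the
# existing log.param instead of trying to create the directory 16 times.
srun -n 16 -c 16 --cpu_bind=cores \
    python /global/homes/n/nahmad/COSMO/montepython_public-3.5/montepython/MontePython.py run \
    -p ../input/0_b0b1b2_desi_priors.param \
    --conf ../classb0b1b2.conf \
    -o ../chains/b0b1b2_2023_4_27 -N 5000 --superupdate 20
```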

Best, Thejs

ClaudioNahmad commented 1 year ago

Hi Thejs,

This worked perfectly, thank you!