galsci / pysm

PySM 3: Sky emission simulations for Cosmic Microwave Background experiments
https://pysm3.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

problem with get_emission when running in parallel #70

Closed · NicolettaK closed this issue 3 years ago

NicolettaK commented 3 years ago

Hi,

I need to run pysm3 in parallel, but I'm encountering a problem I can't solve.

I have this simplified version of my code:

  1 import healpy as hp
  2 import numpy as np
  3 import pysm3
  4 import pysm3.units as u
  5 import os
  6
  7
  8 from mpi4py import MPI
  9 comm = MPI.COMM_WORLD
 10 rank = comm.Get_rank()
 11 size = comm.Get_size()
 12
 13 write_dir =f'./test/{rank}/'
 14 if not os.path.exists(write_dir):
 15     os.makedirs(write_dir)
 16 test_map = np.arange(hp.nside2npix(128))
 17 hp.write_map(f'{write_dir}/test_map_{rank}.fits', test_map, overwrite=True, dtype=np.float32)
 18 sky = pysm3.Sky(nside=128, preset_strings=["s1"])
 19 hp.write_map(f'{write_dir}/test_map_{rank}_after_sky.fits', test_map, overwrite=True, dtype=np.float32)
 20 sky_extrap = sky.get_emission(145.*u.GHz)
 21 hp.write_map(f'{write_dir}/test_map_{rank}_after_getem.fits', test_map, overwrite=True, dtype=np.float32)
 22 hp.write_map(f'{write_dir}/sky_extrep_{rank}.fits', sky_extrap, overwrite=True, dtype=np.float32)

I'm trying to run it in an interactive job at NERSC:

salloc -N 2 -C knl -q interactive -t 04:00:00

I then set export OMP_NUM_THREADS=2 and run the code with:

mpirun -np 100 python test_parallel.py

What happens is the following: the code correctly writes the 100 test maps from lines 17 and 19 into 100 different folders, but it writes the maps from lines 21 and 22 only for a subset of processes (between 4 and 6, depending on the run).

Note that this happens for "s0", "d0" and "d1" (I haven't tried the others), but not for "c1"!

Any idea why this could happen?

Thanks a lot!

zonca commented 3 years ago

Can you please provide a test code I can execute to reproduce the issue? Is there any error message?

NicolettaK commented 3 years ago

Hi @zonca, with the code I posted and the interactive job you should be able to reproduce the problem. There is no error message: the code keeps running but never writes anything to disk.
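
A minimal sketch (using only the standard-library faulthandler module, and assuming each rank writes to its own stderr) of one way to see where the hanging ranks are stuck: make every process dump its traceback after a timeout.

import faulthandler
import sys

# If this process is still running after 120 s, write the traceback of all its
# threads to stderr, and repeat every 120 s until cancelled.
faulthandler.dump_traceback_later(120, repeat=True, file=sys.stderr)

# ... the pysm3 / MPI code goes here ...

faulthandler.cancel_dump_traceback_later()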

zonca commented 3 years ago

Sorry @NicolettaK, I keep having higher-priority items step in front of this; it's going to take time.

zonca commented 3 years ago

@NicolettaK I think the issue is the number of Numba threads.

I tested with this and it worked fine, but it would be better if you test it yourself and confirm:

#!/bin/bash
#SBATCH --qos=debug
#SBATCH --time=30
#SBATCH --nodes=2
#SBATCH --tasks-per-node=50
#SBATCH --cpus-per-task=2
#SBATCH --constraint=knl

export OMP_PROC_BIND=true
export OMP_PLACES=threads
export OMP_NUM_THREADS=2
export NUMBA_NUM_THREADS=2
#export NUMBA_DISABLE_JIT=1

srun python run.py
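
As an alternative to the environment variables above, a minimal sketch of capping the Numba thread count from inside the Python script itself; it assumes numba >= 0.49, where numba.set_num_threads is available, and it must run before the first pysm3 call that triggers a parallel Numba kernel.

# Minimal sketch: cap the Numba threading layer per MPI process.
# Setting NUMBA_NUM_THREADS before numba is first imported is the safest route;
# numba.set_num_threads (numba >= 0.49) can then lower the count at runtime.
import os
os.environ.setdefault("NUMBA_NUM_THREADS", "2")

import numba
import pysm3
import pysm3.units as u

numba.set_num_threads(2)  # threads used by parallel Numba kernels in this process

sky = pysm3.Sky(nside=128, preset_strings=["s1"])
emission = sky.get_emission(145.0 * u.GHz)  # should no longer oversubscribe the node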

zonca commented 3 years ago

@NicolettaK have you had a chance to test this? I would like to add it to the docs once you confirm it works.