galsci / pysm

PySM 3: Sky emission simulations for Cosmic Microwave Background experiments
https://pysm3.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

Possible scaling issue #27

Closed: keskitalo closed this issue 5 years ago

keskitalo commented 5 years ago

I am running an SO-specific TOAST simulation with this preset:

SO_d0,SO_s0,SO_a0,SO_f0,SO_x1_cib,SO_x1_tsz,SO_x1_ksz,SO_x1_cmb_lensed_solardipole

or in the case of LF1 band:

SO_d0,SO_s0,SO_a0,SO_f0,SO_x1_tsz,SO_x1_ksz,SO_x1_cmb_lensed_solardipole

The simulation resolution is nside=512.

In the LF1 case, simulating roughly 12 days of data at 37Hz for 65 detectors took about 9 minutes using 32 MPI tasks distributed over 32 nodes. Each MPI task was given 8 threads. Each node was running a total of 8 independent PySM calculations simultaneously.

In the MFS1 case, simulating the same 12 days of data at the same 37Hz for 43 detectors, using 16 processes on 16 nodes with 8 threads each, did not finish in 2 hours. The test has only been run once.

Either the overall size of the jobs (32 processes vs. 256 processes, 518 threads vs. 5558 threads) causes a scaling issue, or the CIB component is very expensive to evaluate. I'll soon have numbers for an LF2 job that is nearly identical with the LF1 case but with CIB.

zonca commented 5 years ago

The CIB component is just a linear interpolator. The main issue is that it needs to read a lot of nside 4096 maps; see the available maps at https://portal.nersc.gov/project/cmb/so_pysm_models_data/websky/0.3/
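To illustrate what "linear interpolator" means here, the following is a minimal sketch, not the so_pysm_models implementation. It assumes the templates are tabulated at a sorted list of frequencies and blends the two maps that bracket the requested frequency. The function names, the frequencies, and the two-pixel "maps" are hypothetical; real maps would be NumPy arrays read from disk.

```python
from bisect import bisect_right


def interpolation_weights(freqs, target):
    """Return (i_low, i_high, w_high) for linear interpolation at `target`.

    `freqs` must be sorted ascending and contain at least two entries.
    """
    if not freqs[0] <= target <= freqs[-1]:
        raise ValueError("target frequency outside the tabulated range")
    # Index of the first tabulated frequency above `target`, clamped so
    # that target == freqs[-1] still has a valid upper neighbour.
    i_high = min(bisect_right(freqs, target), len(freqs) - 1)
    i_low = i_high - 1
    w_high = (target - freqs[i_low]) / (freqs[i_high] - freqs[i_low])
    return i_low, i_high, w_high


def interpolate_map(freqs, maps, target):
    """Blend the two bracketing maps pixel by pixel."""
    i_low, i_high, w = interpolation_weights(freqs, target)
    return [(1.0 - w) * lo + w * hi for lo, hi in zip(maps[i_low], maps[i_high])]


# Hypothetical tabulated frequencies (GHz) and toy two-pixel maps.
freqs = [20.0, 30.0, 40.0]
maps = [[0.0, 10.0], [2.0, 20.0], [4.0, 40.0]]
result = interpolate_map(freqs, maps, 25.0)
print(result)  # [1.0, 15.0]
```

The arithmetic is cheap; under this model the cost is dominated by obtaining the two bracketing maps, which is consistent with the map reading being the main issue.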

zonca commented 5 years ago

Is this on Cori-KNL?

keskitalo commented 5 years ago

Yes.

keskitalo commented 5 years ago

It is definitely the CIB. Going from 10 minutes to more than 2 hours (perhaps a lot more) seems like a big jump. Do you have any ideas on how to make the CIB run faster? Could some intermediate product be cached?

zonca commented 5 years ago

Are you saying this because you got the results for LF2, i.e. the exact same conditions with and without CIB? Is it possible that the nodes are swapping because too many nside 4096 CIB maps are cached in memory?

keskitalo commented 5 years ago

Yes, I got the LF2 results. I don't think swapping is enabled.

zonca commented 5 years ago

OK, good, I'll debug this!

keskitalo commented 5 years ago

Thanks!

zonca commented 5 years ago

Oops, I'm re-reading the maps for each channel... fix coming soon.

keskitalo commented 5 years ago

How many of these high resolution maps do you anticipate caching per process? If each map is 768MB and there are multiple processes running on a node, we'll quickly run out of memory.

zonca commented 5 years ago

It actually ud_grades them to nside 512 first, so it should be fine.
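As a rough check of the sizes involved, a HEALPix map has 12 * nside^2 pixels. The sketch below assumes a single-component map stored in single precision (4 bytes per pixel); under that assumption it reproduces the 768MB figure quoted above for nside 4096 and shows the footprint after downgrading to nside 512. The helper names are illustrative only.

```python
def npix(nside):
    """Number of pixels in a HEALPix map of the given nside."""
    return 12 * nside * nside


def map_size_mib(nside, bytes_per_pixel=4):
    """Size in MiB of one single-component map (float32 assumed by default)."""
    return npix(nside) * bytes_per_pixel / 2**20


print(map_size_mib(4096))  # 768.0
print(map_size_mib(512))   # 12.0
```

A cached nside 512 map is therefore 64 times smaller than the nside 4096 map it was derived from.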

zonca commented 5 years ago

@keskitalo this should be fixed, and all unit tests pass. It now stores only the available templates needed by a channel, so if all channels are at the same frequency, it shouldn't use much memory.
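The general idea of reading each template at most once and keeping only the ones in use can be sketched as follows. This is an illustration under assumptions, not the actual fix: `TemplateCache`, `fake_read`, and the channel frequencies are hypothetical, and `fake_read` stands in for the expensive read-and-downgrade step.

```python
class TemplateCache:
    """Load each template at most once, keyed by its tabulated frequency."""

    def __init__(self, loader):
        self._loader = loader      # callable: frequency -> map
        self._templates = {}

    def get(self, freq):
        # Only call the expensive loader on a cache miss.
        if freq not in self._templates:
            self._templates[freq] = self._loader(freq)
        return self._templates[freq]

    def keep_only(self, freqs):
        """Drop templates that are no longer needed, to bound memory use."""
        wanted = set(freqs)
        for freq in list(self._templates):
            if freq not in wanted:
                del self._templates[freq]

    def __len__(self):
        return len(self._templates)


read_calls = []


def fake_read(freq):
    """Hypothetical stand-in for reading and downgrading a map from disk."""
    read_calls.append(freq)
    return [freq] * 4


cache = TemplateCache(fake_read)
for channel_freq in (27.0, 27.0, 39.0, 27.0):
    cache.get(channel_freq)
print(read_calls)  # [27.0, 39.0]

cache.keep_only([39.0])
print(len(cache))  # 1
```

With four channel evaluations at two distinct frequencies, the loader runs twice instead of four times, and memory stays proportional to the number of distinct templates kept.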

keskitalo commented 5 years ago

The LF2 case with CIB now runs in 14 minutes. CIB is still expensive but not prohibitively so. Thanks Andrea!

zonca commented 5 years ago

I guess the rest of the time is spent reading the nside 4096 maps. We will speed this up once I solve https://github.com/simonsobs/so_pysm_models/issues/45