Closed by keskitalo 5 years ago
The CIB component is just a linear interpolator; the main issue is that it needs to read a lot of nside 4096 maps. See the available maps at https://portal.nersc.gov/project/cmb/so_pysm_models_data/websky/0.3/
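The interpolation step itself is cheap once the templates are in memory. A minimal numpy sketch of linear interpolation between tabulated frequency templates (the function name and signature here are hypothetical, not the actual so_pysm_models API):

```python
import numpy as np

def interpolate_cib(freq, freqs, templates):
    """Linearly interpolate between per-frequency CIB templates.

    freq      -- target frequency
    freqs     -- sorted array of tabulated frequencies
    templates -- list of maps (one numpy array per tabulated frequency)

    Hypothetical sketch: the real component reads nside 4096 HEALPix
    maps from disk, which is the expensive part.
    """
    freqs = np.asarray(freqs)
    # Index of the lower bracketing frequency, clipped to a valid pair
    i = np.clip(np.searchsorted(freqs, freq) - 1, 0, len(freqs) - 2)
    w = (freq - freqs[i]) / (freqs[i + 1] - freqs[i])
    return (1.0 - w) * templates[i] + w * templates[i + 1]
```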
is this on Cori-KNL?
Yes.
It is definitely the CIB. Going from 10 minutes to more than 2 hours (perhaps a lot more) seems like a big jump. Do you have any ideas on how to make the CIB run faster? Could some intermediate product be cached?
Are you saying this because you got the results for LF2, i.e. the same exact conditions with and without CIB? Is it possible that the nodes are swapping due to too many CIB maps at nside 4096 cached in memory?
Yes, I got the LF2 results. I don't think swapping is enabled.
ok, good, I'll debug this!
Thanks!
Oops, I'm re-reading the maps for each channel... fix coming soon.
How many of these high resolution maps do you anticipate caching per process? If each map is 768MB and there are multiple processes running on a node, we'll quickly run out of memory.
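The 768 MB figure follows directly from the HEALPix pixel count (npix = 12 * nside²) at float32 precision:

```python
# Memory footprint of one nside 4096 HEALPix map in float32
nside = 4096
npix = 12 * nside**2            # 201,326,592 pixels
bytes_per_map = npix * 4        # 4 bytes per float32 value
print(bytes_per_map / 2**20)    # 768.0 (MiB)
```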
It actually first `ud_grade`s them to nside 512, so it should be fine.
@keskitalo this should be fixed; all unit tests pass. It only stores the templates needed by a channel, so if all channels are at the same frequency, it shouldn't use much memory.
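A minimal sketch of that caching pattern (class and method names are hypothetical; the real code would use `healpy.ud_grade`, which the stand-in below only approximates, and only for NESTED-ordered maps):

```python
import numpy as np

class TemplateCache:
    """Read each high-resolution template once, degrade it to the
    working nside, and reuse it for every channel that needs it."""

    def __init__(self, read_map, target_nside):
        self.read_map = read_map          # callable: filename -> map array
        self.target_nside = target_nside
        self._cache = {}

    def get(self, filename):
        if filename not in self._cache:
            m = self.read_map(filename)   # expensive: nside 4096 read
            self._cache[filename] = self._degrade(m)
        return self._cache[filename]

    def _degrade(self, m):
        # Stand-in for healpy.ud_grade: average each group of child
        # pixels (valid only for NESTED ordering).
        npix_out = 12 * self.target_nside**2
        return m.reshape(npix_out, -1).mean(axis=1)
```

With this, repeated `get` calls for the same file hit the cache instead of re-reading the map from disk.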
The LF2 case with CIB now runs in 14 minutes. CIB is still expensive but not prohibitively so. Thanks Andrea!
I guess the rest is reading the nside 4096 maps, we will speed this up once I solve https://github.com/simonsobs/so_pysm_models/issues/45
I am running an SO-specific TOAST simulation with this preset:
or in the case of LF1 band:
The simulation resolution is nside=512. In the LF1 case, simulating roughly 12 days of data at 37 Hz for 65 detectors took about 9 minutes using 32 MPI tasks distributed over 32 nodes. Each MPI task was given 8 threads. Each node was running a total of 8 independent PySM calculations simultaneously.
In the MFS1 case, simulating the same 12 days of data at the same 37 Hz for 43 detectors, using 16 processes on 16 nodes with 8 threads each, did not finish in 2 hours. The test has only been run once.
Either the overall size of the jobs (32 vs. 256 processes, 518 vs. 5558 threads) causes a scaling issue, or the CIB component is very expensive to evaluate. I'll soon have numbers for an LF2 job that is nearly identical to the LF1 case but with CIB.