Open zonca opened 4 years ago
The PySM operator distributes the channels equally across groups and then runs within each group. It uses shared memory, so there is only one copy of the inputs per node and no redundant work. On each node, PySM should pick up a subset of the local TOD channels.
In the example in the image, we have 5 PySM channels per node, which are the first 5 of the 500 channels. The second group will have the next 5 channels.
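As a rough illustration of the equal split described above, here is a minimal sketch. It is not the operator's actual code: the function name `channels_for_group` is hypothetical, and it assumes channels are assigned in contiguous blocks with groups numbered from 0.

```python
def channels_for_group(n_channels, n_groups, group):
    """Return the contiguous range of channel indices owned by `group`.

    Hypothetical helper: assumes a contiguous, as-equal-as-possible split,
    with any remainder spread over the lowest-numbered groups.
    """
    if not 0 <= group < n_groups:
        raise ValueError("group index out of range")
    base, extra = divmod(n_channels, n_groups)
    start = group * base + min(group, extra)
    stop = start + base + (1 if group < extra else 0)
    return range(start, stop)


# Numbers from the example: 500 channels over 100 groups, 5 channels each.
first_group = list(channels_for_group(500, 100, 0))
second_group = list(channels_for_group(500, 100, 1))
last_group = list(channels_for_group(500, 100, 99))
```

With these numbers the first group owns channels 0 to 4 and the second group owns channels 5 to 9, matching the example.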
Once PySM has done bandpass integration for all the local channels, it broadcasts full maps across the group communicator to each node for its own PySM channels.
Then the maps of those channels, either 1 at a time or in chunks (configurable by the user), are broadcast across the rank communicator and put in shared memory, then rescanned locally by each process into the timelines. This is done in parallel on all the nodes of the first group, so we parallelize by a factor of 10. If we then do this broadcast for all 5 local PySM channels, we gain another factor of 5. So the loop over 5000 detectors becomes a loop over the 100 groups.
In fact, once group 0 is done, group 1 does the same with its 5 PySM channels, and so on until all the work is done.
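The chunking option and the loop-count arithmetic can be sketched as follows. This is only an illustration under stated assumptions: `chunked` is a hypothetical helper, no MPI calls are made, and the figure of 10 nodes per group is read off the "factor of 10" in the text.

```python
def chunked(channels, chunk_size=1):
    """Yield successive chunks of at most `chunk_size` channels.

    Hypothetical helper: each chunk stands for the set of maps that
    would be broadcast and placed in shared memory together.
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(channels), chunk_size):
        yield channels[start:start + chunk_size]


local_channels = [0, 1, 2, 3, 4]  # the 5 PySM channels of one group

one_at_a_time = list(chunked(local_channels, 1))
in_pairs = list(chunked(local_channels, 2))

# Loop-count arithmetic from the text (assumed: 10 nodes per group).
n_detectors = 5000
nodes_per_group = 10
channels_per_group = 5
outer_iterations = n_detectors // (nodes_per_group * channels_per_group)
```

A chunk size of 1 gives five broadcasts per group, a chunk size of 2 gives three, and the outer loop comes to 100 iterations, one per group.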
@keskitalo: please review the write-up above.
Exactly as I remember. This will be a huge improvement.
MSS-001 production run