Closed — rmjarvis closed this 1 year ago
For posterity, here is the timing of a simple test on perlmutter:
```
timing 128 processes...
128 5.450411558151245
timing 128 processes...
128 5.5901055335998535
timing 64 processes...
64 5.405612945556641
timing 32 processes...
32 7.624440431594849
timing 16 processes...
16 11.752846002578735
timing 8 processes...
8 20.898335933685303
timing 4 processes...
4 39.02970886230469
timing 2 processes...
2 81.31851077079773
```
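As a hedged sketch of how scaling numbers like these can be gathered, here is a minimal timing harness using `multiprocessing`. This is hypothetical illustration code, not the actual benchmark used above; `work` is a stand-in for a fixed amount of total work split across the processes:

```python
import multiprocessing as mp
import time

def work(_):
    # Stand-in for one unit of CPU-bound work.
    total = 0
    for i in range(200_000):
        total += i * i
    return total

def time_nproc(nproc, njobs=64):
    """Time njobs units of work distributed over nproc worker processes."""
    t0 = time.time()
    with mp.Pool(nproc) as pool:
        pool.map(work, range(njobs))
    return time.time() - t0

if __name__ == "__main__":
    for nproc in (2, 4, 8):
        print(f"timing {nproc} processes...")
        print(nproc, time_nproc(nproc))
```

With CPU-bound work and no contention, the elapsed time should roughly halve each time the process count doubles, until some shared bottleneck (such as the proxy logger traffic discussed below) flattens the curve, as happened here above ~64 processes.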
We have a class in GalSim config called `LoggerWrapper`, which is intended to avoid communicating with the main logger via the proxy when we aren't actually going to output anything, e.g. `logger.debug` statements when the logging level is only INFO or WARNING. This is generally a big improvement.
However, @beckermr and @jchiang87 have each recently found huge inefficiencies when running on Perlmutter with ~128 processes. I tracked it down to when we make a new `LoggerWrapper` instance and call `logger.getEffectiveLevel()`. This had always been going back to the original proxy logger instance, even when we already had a wrapped one. And with lots of processes, this can clog the inter-process communication completely unnecessarily.

This PR makes the fairly trivial change that when we re-wrap something that is already wrapped, it doesn't call `getEffectiveLevel` again; it just reuses the already-received value stored as an attribute. This seems to resolve the blockage that Matt (and, I presume, Jim) was seeing.
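To illustrate the idea, here is a simplified, hypothetical sketch of the wrapper pattern (not the actual GalSim implementation): the constructor caches the effective level, and re-wrapping an existing wrapper copies the cached value instead of calling `getEffectiveLevel()` again, which would otherwise round-trip through the multiprocessing proxy:

```python
import logging

class LoggerWrapper:
    """Simplified sketch of a level-caching logger wrapper.

    Caching the effective level lets no-op calls (e.g. debug() when the
    level is INFO) return immediately without touching the underlying
    logger, which may be a multiprocessing proxy object.
    """
    def __init__(self, logger):
        if isinstance(logger, LoggerWrapper):
            # The fix: reuse the cached level from the existing wrapper
            # rather than calling getEffectiveLevel() on the proxy again.
            self.logger = logger.logger
            self.level = logger.level
        else:
            self.logger = logger
            # Only this path pays the (possibly remote) call.
            self.level = logger.getEffectiveLevel()

    def debug(self, *args, **kwargs):
        if self.level <= logging.DEBUG:
            self.logger.debug(*args, **kwargs)

    def info(self, *args, **kwargs):
        if self.level <= logging.INFO:
            self.logger.info(*args, **kwargs)
```

With this, `LoggerWrapper(LoggerWrapper(proxy))` never contacts the proxy a second time, so stacking up wrappers across many worker processes no longer multiplies inter-process traffic.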