LSSTDESC / imSim

GalSim based Rubin Observatory image simulation package
https://lsstdesc.org/imSim
BSD 3-Clause "New" or "Revised" License

Setting nproc #439

Closed: esheldon closed this issue 2 months ago

esheldon commented 8 months ago

I have output.nproc: 1 but galsim is using 2 cores.

I can get it down to 1 core by setting OMP_NUM_THREADS to 1.

beckermr commented 8 months ago

Yes. This is a known problem. Cc @erykoff

esheldon commented 8 months ago

This could be an issue running imsim if you assume you should, for example, set nproc to the number of cores on the machine.

erykoff commented 8 months ago

Before running imsim or galsim you must set all the thread-count environment variables. I thought this would be put into imsim (galsim wants to keep the flexibility of implicit multithreading for reasons that I don't understand).

esheldon commented 8 months ago

Note using 1 core vs 2 cores gave very similar run times as well, so I'm not sure what's using the extra cpu time.

erykoff commented 8 months ago

https://github.com/lsst/utils/blob/main/python/lsst/utils/threads.py#L38-L57

It may be that @cwwalter is waiting for my standalone shut-it-all-down package which I'll put together during the break.

erykoff commented 8 months ago

Implicit multithreading takes more resources and only occasionally improves runtime. Often it greatly increases the runtime, by 10x or in some cases 100x. I hates it.

esheldon commented 8 months ago

Yes, that can happen if you end up oversubscribing the cores because each process requested via output.nproc uses more than one core.

Setting OMP_NUM_THREADS to 1 does force it to use one core per process, as set in output.nproc.
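
The oversubscription arithmetic can be sketched roughly as follows. This is only an illustration: the helper names `total_threads` and `max_nproc` are hypothetical and not part of imSim or GalSim, and it assumes every process keeps a fixed number of implicit threads busy.

```python
def total_threads(nproc, threads_per_proc):
    """Worst-case number of simultaneously busy threads."""
    return nproc * threads_per_proc


def max_nproc(cores, threads_per_proc):
    """Largest process count that does not oversubscribe `cores`."""
    return max(1, cores // threads_per_proc)


# Example: an 8-core machine where each process implicitly uses 2 threads,
# as in the report at the top of this thread.
cores = 8
print(total_threads(nproc=cores, threads_per_proc=2))  # 16 threads on 8 cores
print(max_nproc(cores, threads_per_proc=2))            # 4
print(max_nproc(cores, threads_per_proc=1))            # 8
```

With one thread per process, setting the process count equal to the number of cores no longer oversubscribes the machine.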

erykoff commented 8 months ago

Not just oversubscribing. Weird cache contention issues maybe. Unclear but it’s broken everywhere and should never be used.

cwwalter commented 8 months ago

When running on places like USDF with many cores we find we need to use

export OMP_NUM_THREADS=1
export NUMEXPR_MAX_THREADS=1
export OMP_PROC_BIND=false

and are telling people running at scale to use that right now. I haven't bothered on things like my laptop for testing (but maybe I should).

When @erykoff has his Rubin function ready to turn this all off, we will call that instead (too?). I think @jchiang87 may have a branch with some of this functionality if you want to try it instead. This is some basic issue with one of the libraries we use in Rubin and it also seems machine dependent.
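
For anyone driving a run from Python rather than a shell, the three exports above could be applied along these lines. This is a minimal sketch, not the function mentioned in this thread: `limit_implicit_threads` is a hypothetical name, and it assumes the variables are set before the numerical libraries are first imported, since threading runtimes typically read them when they are loaded.

```python
import os

# The same three settings as the shell exports quoted above.
THREAD_ENV = {
    "OMP_NUM_THREADS": "1",
    "NUMEXPR_MAX_THREADS": "1",
    "OMP_PROC_BIND": "false",
}


def limit_implicit_threads(override=True):
    """Set the thread-related variables in this process's environment.

    With override=False, values the user has already set are left alone.
    Returns the values in effect after the call.
    """
    in_effect = {}
    for name, value in THREAD_ENV.items():
        if override or name not in os.environ:
            os.environ[name] = value
        in_effect[name] = os.environ[name]
    return in_effect


# Call this first, before importing numpy, galsim, or similar packages.
settings = limit_implicit_threads()
for name, value in settings.items():
    print(f"{name}={value}")
```

Child processes started afterwards inherit these values, so the setting would also apply to worker processes launched from the same interpreter.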

cwwalter commented 2 months ago

I don't think there is more for us to do here on the imSim side. @jchiang87 do you have a comment?

jchiang87 commented 2 months ago

Right, I think this is handled by #441.