eth-cscs / DLA-Future

DLA-Future
https://eth-cscs.github.io/DLA-Future/master/
BSD 3-Clause "New" or "Revised" License

Limit fastcov jobs to avoid out of memory #1081

Closed: rasolca closed this 9 months ago

rasolca commented 9 months ago

After a couple of out-of-memory failures in the codecov tests (e.g. https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/4700071344751697/7514005670787789/-/jobs/5937332970#L5063), I found that Python does not handle CPU affinity correctly: multiprocessing.cpu_count() reports every CPU on the node, not just the cores assigned to the task.

srun -n 6 -c 4 python3 -c "import multiprocessing
print(multiprocessing.cpu_count())"
24
24
24
24
24
24

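fastcov uses multiprocessing.cpu_count() as its default number of jobs, which leads to a huge oversubscription: in the example above each task sees 24 CPUs although only 4 cores are assigned to it. https://github.com/RPGillespie6/fastcov/blob/master/fastcov.py#L940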
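
For reference, a minimal sketch of how an affinity-aware job count could be computed instead. It assumes Linux, where os.sched_getaffinity is available; the helper name usable_cpu_count is hypothetical and is not part of fastcov or of this change.

```python
import multiprocessing
import os


def usable_cpu_count() -> int:
    """Return the number of CPUs the current process is allowed to run on.

    Hypothetical helper: on Linux it honours the affinity mask (for example
    the one set by `srun -c 4`); elsewhere it falls back to cpu_count().
    """
    if hasattr(os, "sched_getaffinity"):
        # The affinity mask is the set of CPU ids this process may use.
        return len(os.sched_getaffinity(0))
    # No affinity API on this platform: fall back to the node-wide count.
    return multiprocessing.cpu_count()


if __name__ == "__main__":
    print("multiprocessing.cpu_count():", multiprocessing.cpu_count())
    print("usable_cpu_count():         ", usable_cpu_count())
```

Run under the same srun command as above, the second value should match the cores assigned to each task rather than the node total, and could then be passed to fastcov as an explicit job count.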

Generation of the report is slower: e.g.

rasolca commented 9 months ago

cscs-ci run

albestro commented 9 months ago

Together with @rasolca, we ran a quick test to check whether fastcov could still end up oversubscribing the same cores. It uses multiprocessing.Process, which relies on the OS scheduler, so the worker processes respect the binding given by Slurm.
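
A check of this kind could look roughly like the sketch below. It is not the test we actually ran, only an illustration assuming Linux (os.sched_getaffinity and the fork start method); the function names are hypothetical.

```python
import multiprocessing
import os


def report_affinity(queue):
    # Runs in the child: send back the CPU set the child is allowed to use.
    queue.put(sorted(os.sched_getaffinity(0)))


def child_affinity():
    """Start one child with multiprocessing.Process and return its CPU set."""
    # Use the fork start method explicitly so the sketch behaves the same
    # regardless of the interpreter's default start method.
    ctx = multiprocessing.get_context("fork")
    queue = ctx.Queue()
    child = ctx.Process(target=report_affinity, args=(queue,))
    child.start()
    result = queue.get()
    child.join()
    return result


if __name__ == "__main__":
    print("parent affinity:", sorted(os.sched_getaffinity(0)))
    print("child affinity: ", child_affinity())
```

If the two sets match, the children are confined to the cores Slurm bound the task to, even when more processes are started than there are cores.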