Limit fastcov jobs to avoid out of memory

rasolca commented 9 months ago

After a couple of out-of-memory failures in the codecov tests, e.g. https://gitlab.com/cscs-ci/ci-testing/webhook-ci/mirrors/4700071344751697/7514005670787789/-/jobs/5937332970#L5063

I found that Python doesn't take CPU affinity into account: multiprocessing.cpu_count() reports 24 on every rank, even though each rank is given only 4 cores.

srun -n 6 -c 4 python3 -c "import multiprocessing
print(multiprocessing.cpu_count())"
24
24
24
24
24
24
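For comparison, here is a minimal sketch of an affinity-aware count. It assumes Linux, where `os.sched_getaffinity` is available; the helper name `available_cpus` is hypothetical and not part of fastcov or of this PR.

```python
import multiprocessing
import os


def available_cpus() -> int:
    """Number of CPUs this process is allowed to run on."""
    try:
        # Honors the CPU mask set by the launcher (e.g. srun -c 4).
        return len(os.sched_getaffinity(0))
    except AttributeError:
        # os.sched_getaffinity does not exist on every platform;
        # fall back to the affinity-unaware count.
        return multiprocessing.cpu_count()


print("multiprocessing.cpu_count():", multiprocessing.cpu_count())
print("available_cpus():           ", available_cpus())
```

Run under the same `srun -n 6 -c 4` command, the second number should reflect the binding of each rank, while the first keeps reporting the whole node.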

fastcov uses multiprocessing.cpu_count() as the default number of jobs, which leads to heavy oversubscription when several ranks run on the same node. https://github.com/RPGillespie6/fastcov/blob/master/fastcov.py#L940
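One possible way to cap the job count is to pass it explicitly through fastcov's `--jobs` option. The sketch below is an illustration, not the change made in this PR: the helpers `fastcov_jobs` and `fastcov_command` are hypothetical, the output file name is a placeholder, and reading `SLURM_CPUS_PER_TASK` assumes the job was launched with `-c`.

```python
import os


def fastcov_jobs(env=os.environ) -> int:
    """Pick a job count that matches the cores given to this rank."""
    cpus_per_task = env.get("SLURM_CPUS_PER_TASK")
    if cpus_per_task:
        # Set by Slurm when -c/--cpus-per-task is requested.
        return int(cpus_per_task)
    # Assumption: Linux, where the affinity mask can be queried.
    return len(os.sched_getaffinity(0))


def fastcov_command(jobs: int, output: str) -> list:
    """Build (but do not run) a fastcov invocation with a fixed job count."""
    return ["fastcov", "--jobs", str(jobs), "--lcov", "-o", output]


# Placeholder output name, for illustration only.
print(" ".join(fastcov_command(fastcov_jobs(), "coverage.info")))
```

With `-c 4`, this would request 4 jobs per rank instead of the 24 that the default picks up.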

With the limit, report generation is slower (27 s vs. 17 s on MC, 79 s vs. 52 s on GPU in the runs below), e.g.

6 ranks MC

Start creating codecov reports from rank 1 at: 14:47:24 +0100 with 12 threads
Done creating codecov reports from rank 1 at: 14:47:51 +0100

vs.

Start creating codecov reports from rank 0 at: 14:21:59 +0100
Done creating codecov reports from rank 0 at: 14:22:16 +0100

6 ranks GPU

Start creating codecov reports from rank 5 at: 15:03:34 +0100 with 4 threads
Done creating codecov reports from rank 5 at: 15:04:53 +0100

vs.

Start creating codecov reports from rank 1 at: 14:37:42 +0100
Done creating codecov reports from rank 1 at: 14:38:34 +0100

rasolca commented 9 months ago

cscs-ci run

albestro commented 9 months ago

Together with @rasolca we ran a quick test to check whether fastcov could still end up oversubscribing the same cores. From a quick check, it uses multiprocessing.Process, which relies on the OS scheduler, so the worker processes respect the binding set by Slurm.
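The inheritance of the binding can be checked with a small sketch like the following. It assumes Linux (for `os.sched_getaffinity`) and uses the `fork` start method explicitly; the helper names are hypothetical and this is not the test we actually ran.

```python
import multiprocessing
import os


def _report_affinity(queue):
    # Runs in the child: send back the set of CPUs it may use.
    queue.put(os.sched_getaffinity(0))


def child_affinity():
    """Start one multiprocessing.Process and return its CPU affinity."""
    ctx = multiprocessing.get_context("fork")
    queue = ctx.Queue()
    proc = ctx.Process(target=_report_affinity, args=(queue,))
    proc.start()
    result = queue.get()  # read before join to avoid blocking on the pipe
    proc.join()
    return result


parent = os.sched_getaffinity(0)
child = child_affinity()
print("parent:", sorted(parent))
print("child: ", sorted(child))
```

If the child reports the same set as the parent, the workers stay on the cores assigned to the rank regardless of how many of them are started.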

eth-cscs / DLA-Future

Limit fastcov jobs to avoid out of memory #1081