NVIDIA / enroot

A simple yet powerful tool to turn traditional container/OS images into unprivileged sandboxes.

Wrong nproc value when running a PyTorch container with cpus-per-task set #175

Open itzsimpl opened 10 months ago

itzsimpl commented 10 months ago

When running a PyTorch container in Slurm with --cpus-per-task set, nproc reports an incorrect value (1).

$ srun --ntasks-per-node=3 bash -c 'echo "`nproc`/$SLURM_CPUS_ON_NODE"'
2/6
2/6
2/6

$ srun --exclusive --ntasks-per-node=3 bash -c 'echo "`nproc`/$SLURM_CPUS_ON_NODE"'
4/4
4/4
4/4

$ srun --cpus-per-task 10 --ntasks-per-node=3 bash -c 'echo "`nproc`/$SLURM_CPUS_ON_NODE"'
10/30
10/30
10/30

$ srun --cpus-per-task=32 --overcommit --ntasks-per-node=3 bash -c 'echo "`nproc`/$SLURM_CPUS_ON_NODE"'
32/32
32/32
32/32

and

$ srun --ntasks-per-node=3 --container-image=ubuntu:22.04 bash -c 'echo "`nproc`/$SLURM_CPUS_ON_NODE"'
pyxis: imported docker image: ubuntu:22.04
2/6
2/6
2/6

$ srun --exclusive --ntasks-per-node=3 --container-image=ubuntu:22.04 bash -c 'echo "`nproc`/$SLURM_CPUS_ON_NODE"'
4/4
4/4
4/4

$ srun --cpus-per-task=10 --ntasks-per-node=3 --container-image=ubuntu:22.04 bash -c 'echo "`nproc`/$SLURM_CPUS_ON_NODE"'
pyxis: imported docker image: ubuntu:22.04
10/30
10/30
10/30

$ srun --cpus-per-task=32 --overcommit --ntasks-per-node=3 --container-image=ubuntu:22.04 bash -c 'echo "`nproc`/$SLURM_CPUS_ON_NODE"'
pyxis: imported docker image: ubuntu:22.04
32/32
32/32
32/32

but

$ srun --mem=48G --ntasks-per-node=3 --container-image=nvcr.io/nvidia/pytorch:23.12-py3 bash -c 'echo "`nproc`/$SLURM_CPUS_ON_NODE"'
pyxis: imported docker image: nvcr.io/nvidia/pytorch:23.12-py3
1/6
1/6
1/6

$ srun --mem=48G --exclusive --ntasks-per-node=3 --container-image=nvcr.io/nvidia/pytorch:23.12-py3 bash -c 'echo "`nproc`/$SLURM_CPUS_ON_NODE"'
pyxis: imported docker image: nvcr.io/nvidia/pytorch:23.12-py3
1/4
1/4
1/4

$ srun --mem=48G --cpus-per-task=10 --ntasks-per-node=3 --container-image=nvcr.io/nvidia/pytorch:23.12-py3 bash -c 'echo "`nproc`/$SLURM_CPUS_ON_NODE"'
pyxis: imported docker image: nvcr.io/nvidia/pytorch:23.12-py3
1/30
1/30
1/30

$ srun --mem=48G --cpus-per-task=32 --overcommit --ntasks-per-node=3 --container-image=nvcr.io/nvidia/pytorch:23.12-py3 bash -c 'echo "`nproc`/$SLURM_CPUS_ON_NODE"'
pyxis: imported docker image: nvcr.io/nvidia/pytorch:23.12-py3
1/32
1/32
1/32

This is caused by the 50-slurm-pytorch.sh hook, which hardcodes OMP_NUM_THREADS to 1; I have opened a PR (https://github.com/NVIDIA/enroot/pull/174) with a fix based on current PyTorch multiprocessing best practices.
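A rough sketch of that direction (not the literal contents of PR #174; the SLURM_* variables are standard, but the fallback logic here is illustrative):

# Illustrative hook logic only, not the actual PR #174 diff.
if [ -n "${SLURM_CPUS_PER_TASK:-}" ]; then
  # Use the per-task CPU count Slurm already provides.
  export OMP_NUM_THREADS="${SLURM_CPUS_PER_TASK}"
elif [ -n "${SLURM_CPUS_ON_NODE:-}" ] && [ -n "${SLURM_NTASKS_PER_NODE:-}" ]; then
  # Otherwise, fall back to splitting the node's CPUs evenly across its tasks.
  export OMP_NUM_THREADS=$(( SLURM_CPUS_ON_NODE / SLURM_NTASKS_PER_NODE ))
fi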

flx42 commented 10 months ago

I'm surprised that nproc has this behavior, to be honest.

I'll review the PR, but it's a bit of a sensitive topic: setting the wrong number of threads can quickly cause performance issues one way or the other (too few cores in use vs. too many threads). I'll check with my colleagues to see what they think.

itzsimpl commented 10 months ago

@flx42 I agree, that surprised me too. For completeness, I'm referencing the issue that led me to this investigation: CPU oversubscription in NeMo (https://github.com/NVIDIA/NeMo/issues/8141). It turned out there is an issue in numba (https://github.com/numba/numba/issues/9387), which resets torch's num_threads whenever numba's num_threads is read or set.
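A quick way to observe that interaction from inside the container (assuming torch and numba are both importable, as in the NGC PyTorch image; per the numba report, the second printed value changes once numba's thread count is queried):

$ python -c 'import torch, numba; print(torch.get_num_threads()); numba.get_num_threads(); print(torch.get_num_threads())'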

My proposal is to keep the behaviour consistent, especially since PyTorch recommends setting num_threads to nCPU/nTasks, and because nproc picks up the value too (and some bash scripts rely on it). Do check with your colleagues, please.

itzsimpl commented 10 months ago

FWIW, nproc does take into account both OMP_NUM_THREADS and OMP_THREAD_LIMIT: https://www.gnu.org/software/coreutils/manual/html_node/nproc-invocation.html
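A quick illustration (per the manual above; the unmodified nproc value depends on the machine and any cgroup limits):

$ nproc                      # CPUs available to the current process
$ OMP_NUM_THREADS=3 nproc    # overridden by OMP_NUM_THREADS: prints 3
$ OMP_THREAD_LIMIT=2 nproc   # capped by OMP_THREAD_LIMIT: prints at most 2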