NVIDIA / pyxis

Container plugin for Slurm Workload Manager
Apache License 2.0

Enforcing CPU limits on containers #90

Closed. abuettner93 closed this issue 2 years ago.

abuettner93 commented 2 years ago

I have Slurm configured to use cgroups, and everything works great when running a normal job (srun, sbatch, salloc, etc.). However, when running containers via enroot and pyxis, I'm running into an issue where the requested CPUs aren't being enforced. For example, when running:

salloc -p job.standard -n 4
I end up "seeing" 4 CPUs using htop or lscpu (as expected).

salloc -p job.gpu -n 4 --gpus=1 --container-image=/projects/container_images/rhel7_gpu.sqsh
I end up seeing one GPU with nvidia-smi, but I see 64 CPUs when looking at htop, lscpu, and torch.

I'm not sure why this is happening, but I assume the number of allocated CPUs is not being passed to Pyxis/Enroot correctly.

Is this a Slurm config thing, or something I need to configure in Enroot/Pyxis?

flx42 commented 2 years ago
salloc -p job.standard -n 4, I end up "seeing" 4 CPUs using htop or lscpu (as expected)

I suppose those nodes have only 4 cores anyway, right?

salloc -p job.gpu -n 4 --gpus=1 --container-image=/projects/container_images/rhel7_gpu.sqsh,

This might or might not give you 4 cores; it depends on how Slurm was configured. Use nproc to check how many cores were allocated to the job.
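Something along these lines, run from inside the job step, works as a quick sanity check (a rough sketch; the grep patterns are just an illustration, and nothing here is specific to pyxis or enroot):

    nproc                                     # honors the CPU affinity mask Slurm applies to the step
    grep Cpus_allowed_list /proc/self/status  # the exact CPUs this process is allowed to run on
    lscpu | grep '^CPU(s):'                   # reads /proc and sysfs, so it still reports every CPU on the host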

Anyway, regarding lscpu and htop, this is normal behavior and is not related to pyxis, enroot, or Slurm: those tools read the host's /proc, which is not virtualized inside the container. This is why LXCFS (https://linuxcontainers.org/lxcfs/introduction/) exists.
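To sketch what LXCFS does (background only, not something pyxis or enroot sets up for you): lxcfs exposes cgroup-aware versions of a few /proc files, and the container runtime bind-mounts them over the kernel's copies, so tools that parse /proc see the cgroup limits instead of the whole host. Roughly:

    # illustrative only; assumes a stock lxcfs install mounted at /var/lib/lxcfs
    mount --bind /var/lib/lxcfs/proc/cpuinfo /proc/cpuinfo
    mount --bind /var/lib/lxcfs/proc/stat    /proc/stat
    # afterwards, tools that read /proc/cpuinfo and /proc/stat (htop, for example)
    # report only the CPUs in the job's cpuset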

abuettner93 commented 2 years ago

With some digging using nproc, it seems there is an issue with Slurm isolating resources on the job.gpu partition. This may turn out to be unrelated to the container and related only to the partition's configuration. If there is any follow-up related to Pyxis, I will open a new issue.
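For anyone who lands here with the same symptom, the knobs I'm reviewing on the Slurm side are the usual cgroup confinement settings (a sketch based on the Slurm documentation; exact values are site-specific):

    # slurm.conf
    ProctrackType=proctrack/cgroup
    TaskPlugin=task/affinity,task/cgroup

    # cgroup.conf
    ConstrainCores=yes
    ConstrainDevices=yes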

Thanks!