Closed abuettner93 closed 2 years ago
salloc -p job.standard -n 4, I end up "seeing" 4 CPUs using htop or lscpu (as expected)
I suppose those nodes have only 4 cores anyway, right?
salloc -p job.gpu -n 4 --gpus=1 --container-image=/projects/container_images/rhel7_gpu.sqsh
This might or might not give you 4 cores; it depends on how Slurm was configured. Use nproc to check how many cores were actually allocated to the job.
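As a quick check inside the allocation (a sketch; run it in the job shell that salloc gives you), you can compare what nproc reports against the raw CPU listing in /proc/cpuinfo:

```shell
# nproc honors the CPU affinity mask that Slurm's cgroup/affinity
# setup applies to the job, so it reports the allocated cores.
nproc

# /proc/cpuinfo is not namespaced, so it always lists every core
# on the node -- this is what lscpu and htop read.
grep -c '^processor' /proc/cpuinfo
```

In a correctly confined 4-core job the first command prints 4 while the second still prints the node's full core count.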
Anyway, regarding lscpu and htop: this is normal behavior and is not related to pyxis, enroot or Slurm. This is why LXCFS (https://linuxcontainers.org/lxcfs/introduction/) exists.
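The same distinction is visible from Python, which is likely why torch reported 64 CPUs: node-wide counts like os.cpu_count() ignore the job's limits, while the scheduler affinity reflects what was actually granted. A minimal sketch on Linux:

```python
import os

# Node-wide logical CPU count: what lscpu, htop, and cpu_count()-style
# calls report, regardless of any cgroup/affinity limits on the job.
total = os.cpu_count()

# CPUs this process may actually run on: Slurm's task affinity /
# cgroup confinement shrinks this set to the allocated cores.
allocated = len(os.sched_getaffinity(0))

print(f"visible: {total}, allocated: {allocated}")
```

In a confined 4-core allocation, allocated would be 4 while total stays at the node's 64; if both report 64 inside the container, the confinement is not being applied.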
With some digging using nproc, it seems there is an issue with Slurm isolating resources on the job.gpu cluster. This may turn out to be unrelated to the container and only related to the partition. If there is any follow-up related to Pyxis, I will open a new issue.
Thanks!
I have Slurm configured to use cgroups, and everything works great when running a normal job (srun, sbatch, salloc, etc.). However, when running containers via enroot and pyxis, I'm running into an issue where the requested CPUs aren't being enforced. For example, when running salloc -p job.standard -n 4, I end up "seeing" 4 CPUs using htop or lscpu (as expected). But with salloc -p job.gpu -n 4 --gpus=1 --container-image=/projects/container_images/rhel7_gpu.sqsh, I end up seeing one GPU with nvidia-smi, but I see 64 CPUs when looking at htop, lscpu, and torch.
I'm not sure why this is happening, but I assume the number of allocated CPUs is not being passed to Pyxis/Enroot correctly.
Is this a SLURM config thing, or something I need to configure in Enroot/Pyxis?
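For reference, CPU confinement on the Slurm side usually hinges on the task plugin and cgroup settings. A minimal sketch (these are real Slurm parameter names, but the right values depend on your site's setup, so check against your existing config):

```
# slurm.conf (fragment)
ProctrackType=proctrack/cgroup
TaskPlugin=task/cgroup,task/affinity

# cgroup.conf (fragment)
ConstrainCores=yes
ConstrainDevices=yes
```

If ConstrainCores is off (or the task/cgroup plugin isn't loaded) on the job.gpu partition's nodes, jobs there would see all 64 cores regardless of whether they run in a container.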