xinyx62 closed this issue 1 year ago
You probably configured Slurm to allocate only a subset of the cores to your job. This is not related to pyxis or enroot.
You can check which cores your job has access to with a command like this (from within the job):
$ grep Cpus_allowed_list /proc/self/status
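To turn that list into a core count, a small helper along these lines may help. This is only a sketch: count_cpus and allowed_cpu_list are hypothetical names, not part of Slurm, pyxis, or enroot, and it assumes the usual Linux list format such as 0-3,8.

```shell
# count_cpus LIST: count the CPUs named in a Linux CPU list such as "0-3,8,10-11".
# Hypothetical helper; the body runs in a subshell so its variables stay local.
count_cpus() (
    IFS=','
    total=0
    for part in $1; do
        case "$part" in
            # A range "lo-hi" contributes hi - lo + 1 CPUs.
            *-*) lo=${part%-*}; hi=${part#*-}; total=$(( total + hi - lo + 1 )) ;;
            # A single CPU id contributes one.
            *)   total=$(( total + 1 )) ;;
        esac
    done
    echo "$total"
)

# allowed_cpu_list: print the CPU list the current process is allowed to run on.
allowed_cpu_list() {
    awk '/^Cpus_allowed_list:/ {print $2}' /proc/self/status
}

# Usage (from within the job):
#   count_cpus "$(allowed_cpu_list)"
```

If the count is much smaller than the number of cores on the node, the restriction comes from the job allocation rather than from the container.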
Thanks, I have checked; the result is below.
Try numactl --show too.
Using numactl --show on the test node, the output is:
And after entering the container with srun -N 1 --container-image ./hpl+test.sqsh --pty bash:
Right, so it's Slurm allocating a single core to your job. You would get the same results without using --container-image.
So how can I resolve it?
You have to change the Slurm configuration, but that's not related to pyxis.
Hi team. I have a problem running the NVIDIA hpc-benchmark with pyxis/enroot.
$ cat /etc/os-release
NAME="Ubuntu"
VERSION="20.04.1 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.1 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal
I installed the following on all nodes:
slurm-19.05.5 (with pmix plugin)
nvslurm-plugin-pyxis-0.7.0-1
enroot-3.4.0
I then took the local image nvcr.io/nvidia/tensorflow:21.11-tf1-py3 and used enroot import to convert it into the hpl+test.sqsh file.
nvidia-smi topo -m
This is my command: srun -N 1 --ntasks-per-node=8 --cpu-bind=none --container-image ./hpl+test.sqsh hpl.sh --config dgx-a100 --dat /workspace/hpl-ai-linux-x86_64/sample-dat/HPL-dgx-a100-1N.dat
I got the error:
This command also fails: srun -N 1 --ntasks-per-node=8 --cpu-bind=none --container-image ./hpl+test.sqsh /workspace/hpl.sh --cpu-affinity 0-10:0-10:0-10:0-10:0-10:0-10:0-10:0-10 --mem-affinity 0:0:0:0:0:0:0:0 --config dgx-a100 --dat /workspace/hpl-ai-linux-x86_64/sample-dat/HPL-dgx-a100-1N.dat
Can you help with this problem?
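One way to sanity-check the second command before launching it is to compare the --cpu-affinity ranges against the cores the job can actually use (the Cpus_allowed_list value in /proc/self/status). The following is only a sketch: expand_cpus and affinity_fits are hypothetical helpers, and it assumes the colon-separated value lists one CPU range per rank, as the command above suggests.

```shell
# expand_cpus LIST: expand a CPU list like "0-2,5" into one CPU id per line.
# Hypothetical helper; the body runs in a subshell so IFS changes stay local.
expand_cpus() (
    IFS=','
    for part in $1; do
        case "$part" in
            *-*) seq "${part%-*}" "${part#*-}" ;;
            *)   echo "$part" ;;
        esac
    done
)

# affinity_fits ALLOWED SPEC: succeed only if every CPU named in SPEC
# (assumed format: one range per rank, separated by ':') is also in ALLOWED.
affinity_fits() (
    allowed=$(expand_cpus "$1")
    # Flatten the per-rank ranges into a single comma-separated list.
    requested=$(expand_cpus "$(printf '%s' "$2" | tr ':' ',')")
    for cpu in $requested; do
        printf '%s\n' "$allowed" | grep -qx "$cpu" || exit 1
    done
)

# Usage (from within the job):
#   allowed=$(awk '/^Cpus_allowed_list:/ {print $2}' /proc/self/status)
#   affinity_fits "$allowed" "0-10:0-10" || echo "affinity exceeds allowed cores"
```

If the check fails, the requested ranges name cores outside the job's allocation, which would be consistent with the job having been given only a subset of the node's cores.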