NVIDIA / pyxis

Container plugin for Slurm Workload Manager
Apache License 2.0

numactl error enroot/pyxis running nvidia hpc-benchmark #94

Closed xinyx62 closed 1 year ago

xinyx62 commented 1 year ago

Hi team. I have a problem running the NVIDIA HPC benchmark with pyxis/enroot.

cat /etc/os-release

NAME="Ubuntu"
VERSION="20.04.1 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.1 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

I installed the following on all nodes:

slurm-19.05.5 (with pmix plugin)

nvslurm-plugin-pyxis-0.7.0-1 and enroot-3.4.0. Then I took the local image nvcr.io/nvidia/tensorflow:21.11-tf1-py3 and imported it with enroot into an hpl+test.sqsh file.

Output of nvidia-smi topo -m: [image]

This is my command: srun -N 1 --ntasks-per-node=8 --cpu-bind=none --container-image ./hpl+test.sqsh hpl.sh --config dgx-a100 --dat /workspace/hpl-ai-linux-x86_64/sample-dat/HPL-dgx-a100-1N.dat

I got this error: [image]

This one also errors: srun -N 1 --ntasks-per-node=8 --cpu-bind=none --container-image ./hpl+test.sqsh /workspace/hpl.sh --cpu-affinity 0-10:0-10:0-10:0-10:0-10:0-10:0-10:0-10 --mem-affinity 0:0:0:0:0:0:0:0 --config dgx-a100 --dat /workspace/hpl-ai-linux-x86_64/sample-dat/HPL-dgx-a100-1N.dat [image]

Can you help with this problem?

flx42 commented 1 year ago

You probably configured Slurm to allocate a subset of the cores to your job, it's not related to pyxis or enroot.

You can check which cores your job has access to with a command like this (from within the job):

$ grep Cpus_allowed_list /proc/self/status
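For example, run from inside an interactive job step (a quick sketch; the `nproc` cross-check is an addition not mentioned above, and the exact values depend on your allocation):

```shell
# Show the CPU affinity of the current process as a list of CPU IDs.
# Inside a Slurm job step this reflects what Slurm granted the job.
grep Cpus_allowed_list /proc/self/status

# Cross-check: how many usable cores does this process actually see?
nproc
```

If `Cpus_allowed_list` reports a single CPU (e.g. `0`), the job was confined to one core by the scheduler, not by the container.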
xinyx62 commented 1 year ago

Thanks, I have checked it as below.

[image]

flx42 commented 1 year ago

Try numactl --show too

xinyx62 commented 1 year ago

Using numactl --show on the test node, the output is: [image]

While inside the container, entered with srun -N 1 --container-image ./hpl+test.sqsh --pty bash:

[image]

flx42 commented 1 year ago

Right, so it's Slurm allocating a single core to your job. You would get the same results without using --container-image.
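A hedged sketch of what the submission might look like once CPUs are requested explicitly (the `--cpus-per-task` value here is illustrative, not from this thread, and this only helps if the cluster's select plugin tracks CPUs as consumable resources):

```
# Illustrative only: explicitly request CPUs per task so Slurm
# allocates more than one core to the job. Adjust the value to
# the node's real core count; requires a cons_res/cons_tres
# SelectType in slurm.conf, otherwise the admin must change it.
srun -N 1 --ntasks-per-node=8 --cpus-per-task=16 --cpu-bind=none \
     --container-image ./hpl+test.sqsh \
     hpl.sh --config dgx-a100 \
     --dat /workspace/hpl-ai-linux-x86_64/sample-dat/HPL-dgx-a100-1N.dat
```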


inspurasc commented 2 weeks ago

So how can this be resolved?

flx42 commented 2 weeks ago

You have to change the Slurm configuration, but that's not related to pyxis.
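For reference, a hedged sketch of the slurm.conf settings typically involved (option names are from standard Slurm documentation; the node name and core counts below are illustrative and must match the site's actual hardware):

```
# slurm.conf (illustrative fragment, not a drop-in config)

# Track CPUs and memory as consumable resources so a job can be
# allocated more than one core (available since Slurm 19.05):
SelectType=select/cons_tres
SelectTypeParameters=CR_Core_Memory

# Bind tasks to their allocated CPUs:
TaskPlugin=task/affinity

# The node definition must describe the real topology, e.g.:
NodeName=dgx01 Sockets=2 CoresPerSocket=64 ThreadsPerCore=2
```

After editing slurm.conf, the daemons need to be reconfigured (e.g. `scontrol reconfigure`) for the change to take effect.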