NVIDIA / pyxis

Container plugin for Slurm Workload Manager
Apache License 2.0

slurmstepd: error: pyxis: seccomp filter failed #6

Closed fuqingping1 closed 4 years ago

fuqingping1 commented 4 years ago

Hi Felix, I got this error when running the example:

root@e:~/enroot# srun -p RTX8000 --container-image="centos" grep NAME /etc/os-release
slurmstepd: pyxis: running "enroot import" ...
slurmstepd: pyxis: running "enroot create" ...
slurmstepd: pyxis: running "enroot start" ...
slurmstepd: error: pyxis: seccomp filter failed
slurmstepd: error: spank: required plugin spank_pyxis.so: task_init() failed with rc=-1
slurmstepd: error: Failed to invoke spank plugin stack
srun: error: em-node0: task 0: Exited with exit code 1

Any suggestions? Thanks!

fuqingping1 commented 4 years ago

It works now if I delete these lines (I can run and install software in the container with srun, but I don't know whether there will be any other bad results):

ret = seccomp_set_filter();
if (ret < 0) {
        slurm_error("pyxis: seccomp filter failed");
        goto fail;
}

flx42 commented 4 years ago

Hi, thanks for the bug report!

Which distro are you running? cat /etc/os-release

Which kernel version do you have? uname -a

Is your kernel compiled with seccomp support? grep SECCOMP /boot/config-$(uname -r)

Thanks!

flx42 commented 4 years ago

Also, let's try with just enroot, without pyxis:

enroot import docker://ubuntu
enroot create ubuntu.sqsh
enroot start --root --rw ubuntu grep Seccomp /proc/self/status
fuqingping1 commented 4 years ago

The distro is Ubuntu 16.04, and CONFIG_SECCOMP_FILTER=y is set. Running with just enroot also works. I think the reason is that with srun I was submitting the task to the master node (which I also use as a compute node for testing). When I submit the task to a new compute node, it runs correctly. Thanks!

flx42 commented 4 years ago

> I think the reason is that with srun I was submitting the task to the master node (which I also use as a compute node for testing). When I submit the task to a new compute node, it runs correctly.

This is still weird, and I'm not sure what the cause would be. Maybe the errno value will provide a hint, so I'm now printing the error in the log: ba4e52d743774d9a107ce7c0cae13f12c75350ba

fuqingping1 commented 4 years ago

> I think the reason is that with srun I was submitting the task to the master node (which I also use as a compute node for testing). When I submit the task to a new compute node, it runs correctly.

> This is still weird, and I'm not sure what the cause would be. Maybe the errno value will provide a hint, so I'm now printing the error in the log: ba4e52d

Great, thanks! I will try it later.

flx42 commented 4 years ago

@fuqingping1 were you able to get a log with this patch?

Thanks!

flx42 commented 4 years ago

Hopefully fixed with https://github.com/NVIDIA/pyxis/commit/6bd15a0ba36e48c21bf22c494dec0ac8c6e895b6