Closed twh closed 3 months ago
It's due to this change: https://github.com/NVIDIA/pyxis/commit/aa901d9d81904c0c3e0b6af3af36917cc2e51cd5
With pyxis 0.16 you can undo this change by setting "container_scope=global" in the plugstack file, see https://github.com/NVIDIA/pyxis/wiki/Setup#slurm-plugstack-configuration
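For reference, a minimal sketch of what the override might look like in the plugstack file. The file location and the plugin path below are assumptions based on a default source install and will differ per site; only the container_scope=global argument comes from this thread.

```
# /etc/slurm/plugstack.conf.d/pyxis.conf  (location is an assumption)
required /usr/local/lib/slurm/spank_pyxis.so container_scope=global
```

slurmd typically needs to be restarted on the compute nodes for a plugstack change to take effect.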
Of course it should not fail, so I need to investigate, can you please share the content of the enroot.conf file?
From: Wayne Hendricks @.> Sent: Monday, September 25, 2023 10:55 AM To: NVIDIA/pyxis @.> Cc: Subscribed @.***> Subject: [NVIDIA/pyxis] epilog failures upgrading to v0.16 (Issue #123)
Upgrading from 0.15 to 0.16 we began getting these errors causing nodes to go in DRAIN state in slurm, and had to be backed out. Not sure what behavior has changed.
Sep 24 08:24:46 nda100v4-2 slurmd[8614]: error: [job 5888] epilog failed status=1:0
Sep 24 08:26:20 nda100v4-2 spank-epilog[38569]: error: pyxis: child 38570 failed with error code: 1
Sep 24 08:26:20 nda100v4-2 spank-epilog[38569]: error: pyxis: couldn't execute enroot command
Sep 24 08:26:20 nda100v4-2 spank-epilog[38569]: error: pyxis: printing enroot log file:
Sep 24 08:26:20 nda100v4-2 spank-epilog[38569]: error: pyxis: mkdir: cannot create directory '/run/enroot': Permission denied
Sep 24 08:26:20 nda100v4-2 spank-epilog[38569]: error: pyxis: couldn't get list of existing containers
Sep 24 08:26:20 nda100v4-2 spank-epilog[38569]: error: pyxis: couldn't cleanup pyxis containers for job 5886
Sep 24 08:26:20 nda100v4-2 spank-epilog[38569]: error: spank: required plugin spank_pyxis.so: job_epilog() failed with rc=-1
Sep 24 08:26:20 nda100v4-2 slurmd[8614]: error: spank/epilog returned status 0x0100
Sep 24 08:26:20 nda100v4-2 slurmd[8614]: error: [job 5886] epilog failed status=1:0
Apologies for the delay. Nothing too complicated, these are the only items we set outside of the default:
ENROOT_CONFIG_PATH $HOME/enroot
ENROOT_TEMP_PATH /tmp
ENROOT_RUNTIME_PATH=/scratch/enroot/runtime/uid-$(id -u)
ENROOT_CACHE_PATH=/scratch/enroot/cache/uid-$(id -u)
ENROOT_DATA_PATH=/scratch/enroot/data/uid-$(id -u)
ENROOT_MOUNT_HOME yes
ENROOT_ROOTFS_WRITABLE yes
We also have a few mounts:
/scratch /scratch none x-create=dir,rw,nosuid,bind,relatime,fs_passno=1 0 0
/hot/data /hot/data none x-create=dir,rw,nosuid,bind,relatime,fs_passno=1 0 0
/hot/images /hot/images none x-create=dir,rw,nosuid,bind,relatime,fs_passno=1 0 0
/warm/data /warm/data none x-create=dir,rw,nosuid,bind,relatime,fs_passno=1 0 0
/cold/datasets /cold/datasets none x-create=dir,rw,nosuid,bind,relatime,fs_passno=1 0 0
/cold/outputs /cold/outputs none x-create=dir,rw,nosuid,bind,relatime,fs_passno=1 0 0
This gets us the following error with pyxis v0.16:
Oct 02 17:19:27 test-01 spank-epilog[181573]: error: pyxis: printing enroot log file:
Oct 02 17:19:27 test-01 spank-epilog[181573]: error: pyxis: /usr/bin/enroot: line 44: HOME: unbound variable
Oct 02 17:19:27 test-01 spank-epilog[181573]: error: pyxis: mkdir: cannot create directory '/scratch': Permission denied
Oct 02 17:19:27 test-01 spank-epilog[181573]: error: pyxis: mkdir: cannot create directory '/scratch': Permission denied
Oct 02 17:19:27 test-01 spank-epilog[181573]: error: pyxis: mkdir: cannot create directory '/scratch': Permission denied
Oct 02 17:19:27 test-01 spank-epilog[181573]: error: pyxis: couldn't get list of existing containers
Oct 02 17:19:27 test-01 spank-epilog[181573]: error: pyxis: couldn't cleanup pyxis containers for job 570559
Oct 02 17:19:27 test-01 spank-epilog[181573]: error: spank: required plugin spank_pyxis.so: job_epilog() failed with rc=-1
Oct 02 17:19:27 test-01 slurmd[146729]: error: spank/epilog returned status 0x0100
Oct 02 17:19:27 test-01 slurmd[146729]: error: [job 570559] epilog failed status=1:0
I found that the only way to make it work is to make /scratch and /scratch/* mode 777; however, I'd like to pare this back to just the needed permissions. I tried adding the slurm user to the group that owns /scratch, but it still didn't work. I'm not sure what permissions the plugin wants for this directory.
Thanks, you should not need to set the mode to 777 on the folder, so there is clearly something wrong I need to fix.
If you can, please let me know if setting container_scope=global in the plugstack config works as a mitigation too.
Apologies for the delay, yes this did fix the issue.
@flx42, there's no srun/sbatch flag to change the scope as a one-off, right?
No, there is no flag for srun/sbatch.
Hi @flx42 !
Just wondering if there's been any update on this issue?
We've been hit by the same epilog problem after a recent update, and I can confirm that setting container_scope=global in the plugstack config works around the problem. But it would be nice to keep the container scope set to job and clean up the named containers during the epilog.
Thanks!
@kcgthb no update, I have not reproduced the problem on my side yet. I can revert the default to global again if that makes deployment slightly simpler (e.g. by not having to override the default pyxis config).
Is the error also coming from a mounted file-system? Is it like NFS with root-squashing? Or perhaps is the filesystem unmounted in the epilog? Also, any usage of another SPANK plugin that creates a mount namespace maybe?
The error is actually coming from a local filesystem (xfs).
We have the following enroot.conf:
# grep -Ev '^#|^$' /etc/enroot/enroot.conf
ENROOT_CACHE_PATH ${SCRATCH:-/tmp}/.enroot/cache
ENROOT_DATA_PATH /tmp/.enroot/data
ENROOT_SQUASH_OPTIONS -noI -noD -noF -noX -no-duplicates
ENROOT_MOUNT_HOME yes
But we also don't use pam_systemd, so our XDG_* variables are limited. We still define XDG_RUNTIME_DIR=/tmp in our user profile, though (it's a polyinstantiated directory, so it's user-specific).
This is the error we're seeing with global scope and no explicit value set for ENROOT_RUNTIME_PATH:
spank-epilog[27186]: error: pyxis: child 27187 failed with error code: 1
spank-epilog[27186]: error: pyxis: couldn't execute enroot command
spank-epilog[27186]: error: pyxis: printing enroot log file:
spank-epilog[27186]: error: pyxis: mkdir: cannot create directory '/run/enroot': Permission denied
spank-epilog[27186]: error: pyxis: couldn't get list of existing containers
spank-epilog[27186]: error: pyxis: couldn't cleanup pyxis containers for job 42870214
spank-epilog[27186]: error: spank: required plugin spank_pyxis.so: job_epilog() failed with rc=-1
Not exactly sure why it's trying to create a directory in /run :thinking: It's like the XDG_RUNTIME_DIR value is ignored.
And actually, if we uncomment ENROOT_RUNTIME_PATH=${XDG_RUNTIME_DIR}/enroot in enroot.conf (the default value), then the error becomes:
error: pyxis: printing enroot log file:
error: pyxis: /usr/bin/enroot: line 44: XDG_RUNTIME_DIR: unbound variable
error: pyxis: mkdir: cannot create directory '/run/enroot': Permission denied
error: pyxis: couldn't get list of existing containers
Yet XDG_RUNTIME_DIR is defined in the user environment, within the job.
As for other SPANK plugins, the only other one we're using is slurm-spank-lua which doesn't use namespaces.
In any case, the easiest is probably to explicitly set ENROOT_RUNTIME_PATH to a different location.
@kcgthb your issue does indeed seem to be about XDG_RUNTIME_DIR not being set in the epilog when enroot is executed. That might be a little different from the original issue. I need to check again whether we can extract and apply the full job environment from a SPANK epilog callback.
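The behavior described above can be reproduced outside of Slurm with a small sketch. It assumes that enroot expands enroot.conf values in a shell running with nounset (set -u), which the "unbound variable" message suggests but this thread does not confirm; expand_in_env is a hypothetical helper, not part of enroot or pyxis.

```shell
#!/bin/sh
# Hypothetical helper: expand a config-style value in a fresh shell that
# runs with "set -u" (nounset) and sees only the variables passed in.
# $1 = unexpanded value, remaining arguments = VAR=value pairs.
expand_in_env() {
    value=$1
    shift
    # env -i starts from an empty environment; sh -u aborts on unset variables.
    env -i PATH="$PATH" "$@" sh -uc 'eval "echo \"$1\""' sh "$value"
}

# Job-like environment: XDG_RUNTIME_DIR is defined, so the value expands.
expand_in_env '${XDG_RUNTIME_DIR}/enroot' XDG_RUNTIME_DIR=/tmp   # prints /tmp/enroot

# Epilog-like environment: the variable is absent, so the expansion aborts.
expand_in_env '${XDG_RUNTIME_DIR}/enroot' || echo "expansion failed"
```

The second call prints the shell's own unset-variable error on stderr before "expansion failed", which mirrors the situation in the epilog, where the job's environment is not available.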
(sorry, closed by mistake)
After taking another look at this, I can confirm that we are fairly limited in what we can do in the slurmd SPANK callback job_epilog(). In particular, it seems we can't retrieve any environment variable from the job, unlike in slurmstepd callbacks. That means pyxis doesn't have a way to properly support a config with ENROOT_RUNTIME_PATH=${MY_PATH}/enroot, as there is no easy way to get the value of ${MY_PATH}.
So it was probably a mistake to change the default to container_scope=job, as there are too many cases it can't handle today. I'm planning to change the default back to container_scope=global.
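For sites that want to keep container_scope=job in the meantime, a rough way to spot entries like the ENROOT_RUNTIME_PATH=${MY_PATH}/enroot case is to list the enroot.conf lines whose value references a shell variable. This is only a sketch: env_dependent_entries is a hypothetical helper, the sample ENROOT_RUNTIME_PATH line is made up for illustration, and a clean result does not guarantee that the epilog will succeed.

```shell
#!/bin/sh
# Hypothetical helper: print non-comment enroot.conf entries that reference
# a shell variable ($NAME or ${NAME...}). Command substitutions such as
# $(id -u) are not reported, because "(" cannot start a variable name.
env_dependent_entries() {
    grep -Ev '^[[:space:]]*(#|$)' "$1" | grep -E '\$[{]?[A-Za-z_]'
}

# Sample input: two entries quoted in this thread plus one illustrative
# fixed-path entry (an assumption, not taken from the thread).
conf=$(mktemp)
cat > "$conf" <<'EOF'
# comment lines and blank lines are skipped

ENROOT_CACHE_PATH ${SCRATCH:-/tmp}/.enroot/cache
ENROOT_DATA_PATH /tmp/.enroot/data
ENROOT_RUNTIME_PATH /run/enroot/user-$(id -u)
EOF

env_dependent_entries "$conf"   # prints only the ENROOT_CACHE_PATH line
rm -f "$conf"
```

Any line it reports depends on a variable that may exist in the job but not in the epilog, so it is a candidate for replacement with a fixed path.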
Sounds good, thanks @flx42 !