NVIDIA / pyxis

Container plugin for Slurm Workload Manager
Apache License 2.0

epilog failures upgrading to v0.16 #123

Closed: twh closed this issue 3 months ago

twh commented 9 months ago

After upgrading from 0.15 to 0.16 we began getting these errors, which caused nodes to go into the DRAIN state in Slurm, and the upgrade had to be backed out. I'm not sure what behavior has changed.

Sep 24 08:24:46 nda100v4-2 slurmd[8614]: error: [job 5888] epilog failed status=1:0
Sep 24 08:26:20 nda100v4-2 spank-epilog[38569]: error: pyxis: child 38570 failed with error code: 1
Sep 24 08:26:20 nda100v4-2 spank-epilog[38569]: error: pyxis: couldn't execute enroot command
Sep 24 08:26:20 nda100v4-2 spank-epilog[38569]: error: pyxis: printing enroot log file:
Sep 24 08:26:20 nda100v4-2 spank-epilog[38569]: error: pyxis: mkdir: cannot create directory '/run/enroot': Permission denied
Sep 24 08:26:20 nda100v4-2 spank-epilog[38569]: error: pyxis: couldn't get list of existing containers
Sep 24 08:26:20 nda100v4-2 spank-epilog[38569]: error: pyxis: couldn't cleanup pyxis containers for job 5886
Sep 24 08:26:20 nda100v4-2 spank-epilog[38569]: error: spank: required plugin spank_pyxis.so: job_epilog() failed with rc=-1
Sep 24 08:26:20 nda100v4-2 slurmd[8614]: error: spank/epilog returned status 0x0100
Sep 24 08:26:20 nda100v4-2 slurmd[8614]: error: [job 5886] epilog failed status=1:0

flx42 commented 9 months ago

It's due to this change: https://github.com/NVIDIA/pyxis/commit/aa901d9d81904c0c3e0b6af3af36917cc2e51cd5

With pyxis 0.16 you can undo this change by setting "container_scope=global" in the plugstack file, see https://github.com/NVIDIA/pyxis/wiki/Setup#slurm-plugstack-configuration
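
For reference, the plugstack entry would then look something like this (the plugin path is just an example, adjust it to your install):

required /usr/local/lib/slurm/spank_pyxis.so container_scope=global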

Of course it should not fail, so I need to investigate. Can you please share the content of your enroot.conf file?


twh commented 9 months ago

Apologies for the delay. Nothing too complicated, these are the only items we set outside of the default:

ENROOT_CONFIG_PATH $HOME/enroot
ENROOT_TEMP_PATH /tmp
ENROOT_RUNTIME_PATH=/scratch/enroot/runtime/uid-$(id -u)
ENROOT_CACHE_PATH=/scratch/enroot/cache/uid-$(id -u)
ENROOT_DATA_PATH=/scratch/enroot/data/uid-$(id -u)
ENROOT_MOUNT_HOME yes
ENROOT_ROOTFS_WRITABLE yes

We also have a few mounts:

/scratch /scratch none x-create=dir,rw,nosuid,bind,relatime,fs_passno=1 0 0
/hot/data /hot/data none x-create=dir,rw,nosuid,bind,relatime,fs_passno=1 0 0
/hot/images /hot/images none x-create=dir,rw,nosuid,bind,relatime,fs_passno=1 0 0
/warm/data /warm/data none x-create=dir,rw,nosuid,bind,relatime,fs_passno=1 0 0
/cold/datasets /cold/datasets none x-create=dir,rw,nosuid,bind,relatime,fs_passno=1 0 0
/cold/outputs /cold/outputs none x-create=dir,rw,nosuid,bind,relatime,fs_passno=1 0 0

This gets us the following error with pyxis v0.16:

Oct 02 17:19:27 test-01 spank-epilog[181573]: error: pyxis: printing enroot log file:
Oct 02 17:19:27 test-01 spank-epilog[181573]: error: pyxis: /usr/bin/enroot: line 44: HOME: unbound variable
Oct 02 17:19:27 test-01 spank-epilog[181573]: error: pyxis: mkdir: cannot create directory '/scratch': Permission denied
Oct 02 17:19:27 test-01 spank-epilog[181573]: error: pyxis: mkdir: cannot create directory '/scratch': Permission denied
Oct 02 17:19:27 test-01 spank-epilog[181573]: error: pyxis: mkdir: cannot create directory '/scratch': Permission denied
Oct 02 17:19:27 test-01 spank-epilog[181573]: error: pyxis: couldn't get list of existing containers
Oct 02 17:19:27 test-01 spank-epilog[181573]: error: pyxis: couldn't cleanup pyxis containers for job 570559
Oct 02 17:19:27 test-01 spank-epilog[181573]: error: spank: required plugin spank_pyxis.so: job_epilog() failed with rc=-1
Oct 02 17:19:27 test-01 slurmd[146729]: error: spank/epilog returned status 0x0100
Oct 02 17:19:27 test-01 slurmd[146729]: error: [job 570559] epilog failed status=1:0

I found that the only way to make it work is to make /scratch and /scratch/* mode 777; however, I’d like to pare this back to just the needed permissions. I tried adding the slurm user to the group that owns /scratch but it still didn’t work. I’m not sure what permissions the plugin wants for this directory.
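
For what it’s worth, one narrower layout than mode 777 (a sketch only, untested, and assuming enroot still runs as the job user during the epilog and only needs to create the per-uid subdirectories from the config above) would be sticky, world-writable parents similar to /tmp:

# Sketch: parents writable by all users but with the sticky bit set, so each user
# can create their own uid-$(id -u) subdirectory without being able to remove
# other users' directories. The /scratch/enroot prefix comes from the config above.
install -d -m 0755 /scratch/enroot
install -d -m 1777 /scratch/enroot/runtime /scratch/enroot/cache /scratch/enroot/data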

flx42 commented 9 months ago

Thanks, you should not need to set the mode to 777 on the folder, so there is clearly something wrong that I need to fix. If you can, please let me know whether setting container_scope=global in the plugstack config works as a mitigation too.

twh commented 7 months ago

Apologies for the delay; yes, this did fix the issue.

krono commented 5 months ago

@flx42 , there's no srun/sbatch flag to change the scope as a one-off, right?

flx42 commented 5 months ago

@flx42 , there's no srun/sbatch flag to change the scope as a one-off, right?

No, there is no flag for srun/sbatch.

kcgthb commented 3 months ago

Hi @flx42 !

Just wondering if there's been any update on this issue?

We've been hit by the same epilog problem after a recent update, and I can confirm that setting container_scope=global in the plugstack config works around the problem. But it would be nice to keep the container scope set to job and clean up the named containers during the epilog.

Thanks!

flx42 commented 3 months ago

@kcgthb no update, I haven't been able to reproduce the problem on my side yet. I can revert the default to global again if that makes deployment slightly simpler (e.g. by not having to override the default pyxis config).

Is the error also coming from a mounted file system? Is it something like NFS with root squashing? Or perhaps the filesystem is unmounted in the epilog? Also, are you using any other SPANK plugin that creates a mount namespace?

kcgthb commented 3 months ago

The error is actually coming from a local filesystem (xfs). We have the following enroot.conf:

# grep -Ev '^#|^$' /etc/enroot/enroot.conf
ENROOT_CACHE_PATH           ${SCRATCH:-/tmp}/.enroot/cache
ENROOT_DATA_PATH            /tmp/.enroot/data
ENROOT_SQUASH_OPTIONS       -noI -noD -noF -noX -no-duplicates
ENROOT_MOUNT_HOME           yes

But we also don't use pam_systemd, so our XDG_* variables are limited. We still define XDG_RUNTIME_DIR=/tmp in our user profile, though (it's a poly-instantiated directory, so it's user-specific).

This is the error we're seeing with global scope and no explicit value set for ENROOT_RUNTIME_PATH:

spank-epilog[27186]: error: pyxis: child 27187 failed with error code: 1
spank-epilog[27186]: error: pyxis: couldn't execute enroot command
spank-epilog[27186]: error: pyxis: printing enroot log file:
spank-epilog[27186]: error: pyxis:     mkdir: cannot create directory '/run/enroot': Permission denied
spank-epilog[27186]: error: pyxis: couldn't get list of existing containers
spank-epilog[27186]: error: pyxis: couldn't cleanup pyxis containers for job 42870214
spank-epilog[27186]: error: spank: required plugin spank_pyxis.so: job_epilog() failed with rc=-1

Not exactly sure why it's trying to create a directory in /run :thinking: It's as if the XDG_RUNTIME_DIR value is ignored.

And actually, if we uncomment ENROOT_RUNTIME_PATH=${XDG_RUNTIME_DIR}/enroot in enroot.conf (the default value), then the error becomes:

error: pyxis: printing enroot log file:
error: pyxis:     /usr/bin/enroot: line 44: XDG_RUNTIME_DIR: unbound variable
error: pyxis:     mkdir: cannot create directory '/run/enroot': Permission denied
error: pyxis: couldn't get list of existing containers

Yet XDG_RUNTIME_DIR is defined in the user environment, within the job.
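
(A quick sanity check from a job step on an affected node shows it, e.g.:

srun -N1 bash -c 'echo "XDG_RUNTIME_DIR in job: ${XDG_RUNTIME_DIR:-unset}"'

but the spank-epilog runs under slurmd, outside the user session, so presumably the variable isn't visible there.)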

As for other SPANK plugins, the only other one we're using is slurm-spank-lua, which doesn't use namespaces.

In any case, the easiest fix is probably to explicitly set ENROOT_RUNTIME_PATH to a different location.
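
For example (just a sketch, the exact path is site-specific), a per-user directory that doesn't depend on XDG_RUNTIME_DIR, in the same style as the other entries in our enroot.conf:

ENROOT_RUNTIME_PATH         /tmp/enroot/runtime/uid-$(id -u)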

flx42 commented 3 months ago

@kcgthb your issue does indeed seem to be about XDG_RUNTIME_DIR not being set in the epilog when enroot is executed. That might be a little different from the original issue. I need to check again whether we can extract and apply the full job environment from a SPANK epilog callback.

flx42 commented 3 months ago

(sorry, closed by mistake)

flx42 commented 3 months ago

After taking another look at this, I can confirm that we are fairly limited in what we can do in the slurmd SPANK callback job_epilog(). In particular, it seems we can't retrieve any environment variables from the job, unlike in slurmstepd callbacks. That means pyxis has no way to properly support a config with ENROOT_RUNTIME_PATH=${MY_PATH}/enroot, as there is no easy way to get the value of ${MY_PATH}.

In retrospect, it was probably a mistake to change the default to container_scope=job, as there are too many cases it can't handle today. So I'm planning to change the default back to container_scope=global.

kcgthb commented 3 months ago

Sounds good, thanks @flx42 !