NERSC / shifter

Shifter - Linux Containers for HPC

Issue with Slurm SPANK plugin using --image #224

Closed alanm-cray closed 6 years ago

alanm-cray commented 6 years ago

Shifter version: 16.08
Slurm version: 17.11.7

We noticed that a container is not initialized when --image is used with srun, sbatch, or salloc. The SPANK plugin is loaded, but an error shows up once debug logging is enabled in slurmd.

The Shifter SPANK plugin reports that it cannot find an image that is present on the node:

spank-prolog: debug: spank: /etc/opt/slurm/plugstack.conf:2: Loaded plugin shifter_slurm.so
spank-prolog: debug: shifter prolog, id after looking at args: docker:alpine:latest
spank-prolog: error: about to lookup image in prolog env
spank-prolog: debug: shifter prolog, id after looking at env: docker:alpine:latest
spank-prolog: debug3: Trying to load plugin /opt/slurm/17.11.7/lib64/slurm/auth_munge.so
spank-prolog: debug: Munge authentication plugin loaded
spank-prolog: debug3: Success.
spank-prolog: debug3: Trying to load plugin /opt/slurm/17.11.7/lib64/slurm/select_alps.so
spank-prolog: debug3: Success.
spank-prolog: debug3: Trying to load plugin /opt/slurm/17.11.7/lib64/slurm/select_cons_res.so
spank-prolog: Consumable Resources (CR) Node Selection plugin loaded with argument 52
spank-prolog: debug3: Success.
spank-prolog: debug3: Trying to load plugin /opt/slurm/17.11.7/lib64/slurm/select_cray.so
spank-prolog: Cray node selection plugin loaded
spank-prolog: debug3: Success.
spank-prolog: debug3: Trying to load plugin /opt/slurm/17.11.7/lib64/slurm/select_linear.so
spank-prolog: Linear node selection plugin loaded with argument 52
spank-prolog: debug3: Success.
spank-prolog: debug3: Trying to load plugin /opt/slurm/17.11.7/lib64/slurm/select_serial.so
spank-prolog: Serial Job Resource Selection plugin loaded with argument 52
spank-prolog: debug3: Success.
spank-prolog: debug3: Trying to load plugin /opt/slurm/17.11.7/lib64/slurm/select_cons_res.so
spank-prolog: Consumable Resources (CR) Node Selection plugin loaded with argument 52
spank-prolog: debug3: Success.
spank-prolog: debug: shifter prolog: got gid from getpwuid_r: (null)
spank-prolog: debug: shifter prolog: failed to get username from environment, trying getpwuid_r on 29289
spank-prolog: debug: shifter prolog: got username from getpwuid_r: alanm
spank-prolog: error: setupRoot arg 0: /opt/cray/shifter/default//sbin/setupRoot
spank-prolog: error: setupRoot arg 1: -U
spank-prolog: error: setupRoot arg 2: 29289
spank-prolog: error: setupRoot arg 3: -G
spank-prolog: error: setupRoot arg 4: 12790
spank-prolog: error: setupRoot arg 5: -u
spank-prolog: error: setupRoot arg 6: alanm
spank-prolog: error: setupRoot arg 7: -N
spank-prolog: error: setupRoot arg 8: nid00384/24
spank-prolog: error: setupRoot arg 9: docker
spank-prolog: error: setupRoot arg 10: alpine:latest
spank-prolog: error: setupRoot stderr: FAILED to get image alpine:latest of type docker
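
The plugin logs each setupRoot argument as its own entry, which makes the failing invocation tedious to read. As a rough aid, the Python sketch below rebuilds the command line from those entries. The helper name `setuproot_command` and the regular expression are illustrative assumptions and are not part of Shifter or Slurm.

```python
import re
import shlex

# Matches log entries such as "spank-prolog: error: setupRoot arg 3: -G".
# Assumption: every argument is a single whitespace-free token, as in this log.
ARG_RE = re.compile(r"setupRoot arg (\d+): (\S+)")


def setuproot_command(log_text):
    """Rebuild the argument list from 'setupRoot arg N: value' log entries."""
    args = {int(index): value for index, value in ARG_RE.findall(log_text)}
    return [args[index] for index in sorted(args)]


log = """\
spank-prolog: error: setupRoot arg 0: /opt/cray/shifter/default//sbin/setupRoot
spank-prolog: error: setupRoot arg 1: -U
spank-prolog: error: setupRoot arg 2: 29289
spank-prolog: error: setupRoot arg 3: -G
spank-prolog: error: setupRoot arg 4: 12790
spank-prolog: error: setupRoot arg 5: -u
spank-prolog: error: setupRoot arg 6: alanm
spank-prolog: error: setupRoot arg 7: -N
spank-prolog: error: setupRoot arg 8: nid00384/24
spank-prolog: error: setupRoot arg 9: docker
spank-prolog: error: setupRoot arg 10: alpine:latest
spank-prolog: error: setupRoot stderr: FAILED to get image alpine:latest of type docker
"""

command = setuproot_command(log)
# Print the command as a single shell-quoted line for copy and paste.
print(shlex.join(command))
```

Read this way, the last two arguments are the image type `docker` and the unresolved tag `alpine:latest`.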

From the same node I can find alpine:

nid00384:~ # shifterimg lookup alpine
9797e5e798a034d53525968de25bd25c913e7bb17c6d068ebc778cb33e3ff6e5

When I call setupRoot by hand, it works if I pass the id of alpine instead:

/opt/cray/shifter/default//sbin/setupRoot -U 29289 -G 12790 -u alanm -N nid00384/24 docker 9797e5e798a034d53525968de25bd25c913e7bb17c6d068ebc778cb33e3ff6e5

nid00384:~ # losetup | grep shifter
/dev/loop4 0 0 1 1 /lus/peel/shifter/saturn-p1/UDI/9797e5e798a034d53525968de25bd25c913e7bb17c6d068ebc778cb33e3ff6e5.squashfs 0
nid00384:~ # cat /var/udiMount/etc/os-release
NAME="Alpine Linux"
ID=alpine
VERSION_ID=3.7.0
PRETTY_NAME="Alpine Linux v3.7"
HOME_URL="http://alpinelinux.org"
BUG_REPORT_URL="http://bugs.alpinelinux.org"

Using the id as the argument to --image did not work. There I got an immediate error about not being able to find the image:

alanm@nid00384:~> srun --image=9797e5e798a034d53525968de25bd25c913e7bb17c6d068ebc778cb33e3ff6e5 cat /etc/os-release
srun: error: Failed to lookup image. Aborting.

Do arguments 9 and 10 to setupRoot look correct when called by the SPANK plugin? The next time I have a Slurm system available I'll add debugging to see why it can't find the image. My initial read is that alpine is never translated to 9797e5e..., so setupRoot.c ends up looking for the image at //alpine.meta or //alpine.squashfs. That would explain why it works when I call setupRoot manually with the image id.
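
To make the suspected failure concrete, here is a minimal Python sketch of the distinction between a tag and a resolved id. It assumes that a resolved id is 64 lowercase hex characters (as in the `shifterimg lookup` output above) and that the image file is named `<id>.squashfs` inside a site-specific directory (as in the `losetup` output). Both helper names are hypothetical and this is not how setupRoot.c is actually written.

```python
import re
from pathlib import PurePosixPath

# Assumption: resolved image ids are 64 lowercase hex characters.
HEX_ID_RE = re.compile(r"[0-9a-f]{64}")


def looks_resolved(identifier):
    """Return True if the identifier has the shape of a resolved image id."""
    return HEX_ID_RE.fullmatch(identifier) is not None


def expected_squashfs(image_dir, identifier):
    """Build the path where the squashfs file for this identifier would live."""
    return PurePosixPath(image_dir) / f"{identifier}.squashfs"


# Directory taken from the losetup output above; it is specific to this system.
image_dir = "/lus/peel/shifter/saturn-p1/UDI"

for identifier in ("alpine:latest",
                   "9797e5e798a034d53525968de25bd25c913e7bb17c6d068ebc778cb33e3ff6e5"):
    print(identifier, looks_resolved(identifier), expected_squashfs(image_dir, identifier))
```

With an unresolved tag the constructed path points at a file that does not exist, which matches the "FAILED to get image" message in the log.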

dmjacobsen commented 6 years ago

As you've identified, setupRoot only uses the fully resolved id of the container. The Slurm plugin attempts to run the shifterimg lookup at job submission time, and that must be failing. Perhaps it just runs "shifterimg" without consideration for the path.
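
The lookup step described here can be approximated outside the plugin. The Python sketch below wraps the `shifterimg lookup NAME` command shown earlier in the thread. It assumes that `shifterimg` is on `PATH` and that a failed lookup produces a non-zero exit status or empty output. Neither assumption is confirmed in this thread, and `resolve_image` is a hypothetical helper, not the plugin's code.

```python
import subprocess


def resolve_image(name, run=subprocess.run):
    """Return the id printed by `shifterimg lookup NAME`, or None on failure.

    The `run` parameter exists only so the command runner can be replaced
    in tests; by default the real shifterimg binary is executed.
    """
    result = run(["shifterimg", "lookup", name], capture_output=True, text=True)
    image_id = result.stdout.strip()
    # Assumption: a failed lookup yields a non-zero status or no output.
    if result.returncode != 0 or not image_id:
        return None
    return image_id
```

On a node where the image gateway is reachable, `resolve_image("alpine")` would be expected to return the same id that `shifterimg lookup alpine` prints.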

dmjacobsen commented 6 years ago

Hmm, it looks like it should use the prefix specified in your udiRoot.conf: https://github.com/NERSC/shifter/blob/48c9f6bd62f145a79539ad09f9738e4725dea4d3/src/ImageData.c#L120

alanm-cray commented 6 years ago

I think that is working, at least to identify images that really don't exist:

https://github.com/NERSC/shifter/blob/fd1ebb82382dd78406de541db5974ab211a820dc/wlm_integration/slurm/shifterSpank.c#L664-L668

That's likely the error I saw when I used the id instead of the image name:

alanm@nid00384:~> srun --image=9797e5e798a034d53525968de25bd25c913e7bb17c6d068ebc778cb33e3ff6e5 cat /etc/os-release
srun: error: Failed to lookup image. Aborting.

So I'll need to look at why that call is not saving the image id: had the lookup itself failed, the job would have aborted there with an error.

dmjacobsen commented 6 years ago

Ah, have you updated your Shifter builds? Slurm changed the SPANK interface in 17.11, which required some adjustments in Shifter.

alanm-cray commented 6 years ago

This Shifter is 16.08, but it is built against Slurm 17.11.7.

dmjacobsen commented 6 years ago

16.08.5 had SPANK updates; I suspect you are using 16.08.3. Of course, we have 18.03.0 tagged now, and only busyness has prevented me from tagging master as 18.03.1.
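
A quick way to check an installation against this threshold is to compare dotted version strings numerically. This is a minimal Python sketch: the function names are hypothetical, and the 16.08.5 minimum comes only from the comment above, not from any release notes.

```python
def parse_version(text):
    """Turn a dotted version such as '16.08.5' into a tuple of integers."""
    return tuple(int(part) for part in text.split("."))


def has_spank_updates(shifter_version, minimum="16.08.5"):
    """Return True if the version is at least the release with the SPANK updates."""
    # Tuples compare element by element, so (16, 8, 3) < (16, 8, 5) < (18, 3, 0).
    # A bare "16.08" parses to (16, 8) and compares as older than 16.08.5.
    return parse_version(shifter_version) >= parse_version(minimum)


for version in ("16.08.3", "16.08.5", "18.03.0"):
    print(version, has_spank_updates(version))
```

Comparing tuples rather than raw strings avoids surprises with leading zeros and multi-digit components.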

alanm-cray commented 6 years ago

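Yep, this is 16.08.3. Are the changes for the SPANK plugin the commits 74f1f3c (https://github.com/NERSC/shifter/commit/74f1f3cf405a31ab0de32fb18206927651424a28) and 4649185 (https://github.com/NERSC/shifter/commit/4649185bdc1ada607a9945077cc9408492520de6)? Does that mean SPANK_SHIFTER_IMAGETYPE and SPANK_SHIFTER_IMAGE need to be set instead of using the --image integration? Or is that still possible in 16.08.5?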

dmjacobsen commented 6 years ago

That looks correct (from a phone)

alanm-cray commented 6 years ago

I've upgraded Shifter to 16.08.5 and I see --image working with salloc now.

alanm@nid00032:~> salloc -N1 -w nid00032 --image=alpine
salloc: Granted job allocation 10169
alanm@nid00032:~> env | grep SHIFTER
SLURM_SPANK_SHIFTER_IMAGETYPE=id
SLURM_SPANK_SHIFTER_GID=12790
SLURM_SPANK_SHIFTER_IMAGE=9797e5e798a034d53525968de25bd25c913e7bb17c6d068ebc778cb33e3ff6e5
alanm@nid00032:~> srun hostname
nid00032
alanm@nid00032:~> ls /var/udiMount/
bin  dev  etc  home  lib  media  mnt  opt  proc  root  run  sbin  srv  sys  tmp  usr  var
alanm@nid00032:~> srun shifter cat /etc/os-release
NAME="Alpine Linux"
ID=alpine
VERSION_ID=3.7.0
PRETTY_NAME="Alpine Linux v3.7"
HOME_URL="http://alpinelinux.org"
BUG_REPORT_URL="http://bugs.alpinelinux.org"
alanm@nid00032:~>

I'm still confused about how srun is supposed to work with --image. Outside of an allocation:

alanm@nid00032:~> ls /var/udiMount
alanm@nid00032:~> env | grep SHIFTER
alanm@nid00032:~> srun -w nid00032 --image=alpine shifter cat /etc/os-release
No image specified, or specified incorrectly!

alanm@nid00032:~> srun -w nid00032 --image=alpine cat /etc/os-release
NAME="SLES"
VERSION="12-SP3"
VERSION_ID="12.3"
PRETTY_NAME="SUSE Linux Enterprise Server 12 SP3"
ID="sles"
ANSI_COLOR="0;32"
CPE_NAME="cpe:/o:suse:sles:12:sp3"

How is --image used correctly with srun? Am I missing some environment variables before I call srun? Is the shifter binary required with all srun commands?

dmjacobsen commented 6 years ago

srun within a job allocation will ignore the image option. This is because the integration is set up during the node prolog, which necessarily runs before any srun within the job allocation. I suppose we could change the behavior so that it at least sets the environment variables shifter needs.
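
One way to tell whether an allocation was set up with an image is to look for the variables that appeared in the salloc session above. The Python sketch below does that. It relies only on the variable names seen in that transcript (`SLURM_SPANK_SHIFTER_IMAGE` and `SLURM_SPANK_SHIFTER_IMAGETYPE`); `allocation_image` is a hypothetical helper, and whether shifter reads exactly these variables is an assumption.

```python
import os

# Variable names as they appeared in `env | grep SHIFTER` inside the allocation.
IMAGE_VAR = "SLURM_SPANK_SHIFTER_IMAGE"
TYPE_VAR = "SLURM_SPANK_SHIFTER_IMAGETYPE"


def allocation_image(env=None):
    """Return (image_type, image) if the environment names an image, else None."""
    env = os.environ if env is None else env
    image = env.get(IMAGE_VAR)
    image_type = env.get(TYPE_VAR)
    if not image or not image_type:
        return None
    return image_type, image


found = allocation_image()
if found is None:
    print("no Shifter image configured for this environment")
else:
    print("image type %s, image %s" % found)
```

In the salloc session above this would report type `id` with the resolved alpine id, while in the plain login shell it would report that no image is configured.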

alanm-cray commented 6 years ago

Issue resolved by upgrading Shifter to 16.08.5.