NERSC / shifter

Shifter - Linux Containers for HPC

FAILED to find requested image #249

Closed. qlux closed this issue 5 years ago

qlux commented 5 years ago

Hi, using the latest version of Shifter and Slurm 18.03, I cannot get a batch or salloc/srun job to start. The submit commands fail with: FAILED to find requested image

The submission command I use is:

sbatch -c1 --mem=6G --image=docker:bash:latest --wrap="shifter cat /etc/os-release"

or

salloc -c1 --mem=6G --image=docker:bash:latest

(also tested with --image=bash)

On the compute node, the slurmd log outputs

 [2019-04-12T14:48:59.462] error: about to lookup image in prolog env
 [2019-04-12T14:48:59.462] error: shifterConfig.json already exists!
 [2019-04-12T14:48:59.462] Serial Job Resource Selection plugin loaded with argument 20
 [2019-04-12T14:48:59.462] error: setupRoot arg 0: /ai/apps/shifter/sbin/setupRoot
 [2019-04-12T14:48:59.462] error: setupRoot arg 1: -U
 [2019-04-12T14:48:59.462] error: setupRoot arg 2: 20000
 [2019-04-12T14:48:59.462] error: setupRoot arg 3: -G
 [2019-04-12T14:48:59.462] error: setupRoot arg 4: 1500000000
 [2019-04-12T14:48:59.462] error: setupRoot arg 5: -u
 [2019-04-12T14:48:59.462] error: setupRoot arg 6: usertest
 [2019-04-12T14:48:59.462] error: setupRoot arg 7: -N
 [2019-04-12T14:48:59.462] error: setupRoot arg 8: computenode
 [2019-04-12T14:48:59.462] error: setupRoot arg 9: id
 [2019-04-12T14:48:59.462] error: setupRoot arg 10: 3d3a12fa4050f647721e66b54564562c7c90dcbcac6a729ae729e2543700613b
 [2019-04-12T14:48:59.601] error: waiting on setupRoot

 [2019-04-12T14:48:59.601] error: after setupRoot, exit code: 0
 [2019-04-12T14:48:59.839] _run_prolog: run job script took usec=397049
 [2019-04-12T14:48:59.839] _run_prolog: prolog with lock for job 384 ran for 0 seconds
[2019-04-12T14:48:59.892] [384.batch] debug level = 2
[2019-04-12T14:48:59.892] [384.batch] starting 1 tasks
[2019-04-12T14:48:59.893] [384.batch] task 0 (9414) started 2019-04-12T14:48:59
[2019-04-12T14:48:59.905] [384.batch] Can't propagate RLIMIT_NPROC of 120500 from submit host: Operation not permitted
[2019-04-12T14:48:59.925] [384.batch] task 0 (9414) exited with exit code 1.

No other apparent error is reported.

Running watch on lsblk, I can see /dev/loop4 mounted at /var/udiLoopMount:

root@computenode:/var/log/slurm# lsblk
NAME    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
loop0     7:0    0   56M  1 loop /snap/google-cloud-sdk/73
loop1     7:1    0 57.6M  1 loop /snap/google-cloud-sdk/77
loop2     7:2    0 89.3M  1 loop /snap/core/6673
loop3     7:3    0   91M  1 loop /snap/core/6405
loop4     7:4    0  5.5M  1 loop /var/udiLoopMount
sda       8:0    0   20G  0 disk
├─sda1    8:1    0 19.9G  0 part /
├─sda14   8:14   0    4M  0 part
└─sda15   8:15   0  106M  0 part /boot/efi
root@computenode:/# ll /var/udiLoopMount/
total 4
drwxr-xr-x 19 root root  232 Apr 11 19:44 ./
drwxr-xr-x 15 root root 4096 Apr 12 15:13 ../
drwxr-xr-x  2 root root 1096 Apr  9 23:37 bin/
drwxr-xr-x  2 root root    3 Apr  8 20:30 dev/
drwxr-xr-x 16 root root  548 Apr  9 23:37 etc/
drwxr-xr-x  2 root root    3 Apr  8 20:30 home/
drwxr-xr-x  5 root root  194 Apr  9 23:37 lib/
drwxr-xr-x  5 root root   53 Apr  8 20:30 media/
drwxr-xr-x  2 root root    3 Apr  8 20:30 mnt/
drwxr-xr-x  2 root root    3 Apr  8 20:30 opt/
drwxr-xr-x  2 root root    3 Apr  8 20:30 proc/
drwxr-xr-x  2 root root    3 Apr  8 20:30 root/
drwxr-xr-x  2 root root    3 Apr  8 20:30 run/
drwxr-xr-x  2 root root  911 Apr  8 20:30 sbin/
drwxr-xr-x  2 root root    3 Apr  8 20:30 srv/
drwxr-xr-x  2 root root    3 Apr  8 20:30 sys/
drwxrwxrwt  2 root root    3 Apr  9 23:37 tmp/
drwxr-xr-x  8 root root  110 Apr  9 23:37 usr/
drwxr-xr-x 11 root root  134 Apr  8 20:30 var/

I don't know where else it fails. If I try to mount the image manually as root:

/ai/apps/shifter/sbin/setupRoot -U 20000 -G 1500000000 -u usertest -N computenode id 3d3a12fa4050f647721e66b54564562c7c90dcbcac6a729ae729e2543700613b

it succeeds and /dev/loop4 is mounted. However, if I unmount it manually, I can see a lot of other filesystems subsequently mounted on the same loop device instead of grabbing another one:

loop4     7:4    0  5.5M  1 loop /var/udiMount/var/local
...umount /dev/loop4...
loop4     7:4    0  5.5M  1 loop /var/udiMount/var/lib
...umount /dev/loop4...
loop4     7:4    0  5.5M  1 loop /var/udiMount/var/cache
...umount /dev/loop4...
loop4     7:4    0  5.5M  1 loop /var/udiMount/usr
...umount /dev/loop4...
loop4     7:4    0  5.5M  1 loop /var/udiMount/srv
etc...

Can you please advise?
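
A note on the loop device observation above: the lsblk output here shows a single mountpoint per device, so if the same device is mounted in several places (for example through bind mounts under /var/udiMount), unmounting one of them would simply reveal the next. That reading is an interpretation and is not confirmed in this thread. A minimal sketch for listing every mountpoint backed by a device, assuming a Linux mount table; the function name mounts_for_device is hypothetical:

```shell
#!/bin/sh
# List every mountpoint whose source is the given device.
# Usage: mounts_for_device DEVICE [MOUNT_TABLE]
# MOUNT_TABLE defaults to /proc/self/mounts (Linux); the second argument
# exists only so the function can be tried on a saved copy of the table.
mounts_for_device() {
    dev=$1
    table=${2:-/proc/self/mounts}
    # Field 1 is the source device, field 2 is the mountpoint.
    awk -v dev="$dev" '$1 == dev { print $2 }' "$table"
}

# Example (on the compute node):
#   mounts_for_device /dev/loop4
```

If this prints more than one path, the mounts seen after each umount were probably already present rather than newly created.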

scanon commented 5 years ago

Do things work okay from the prompt? For example, can you do...

shifterimg pull bash
salloc -N 1 shifter --image=bash

qlux commented 5 years ago

Hi Shane, shifterimg can find the image from both the login node and the compute node, but for whatever reason shifter doesn't find it. Does shifter require access to the Mongo DB?

$ salloc --image=bash
...
usertest@node64503-hx85:~$ lsblk
NAME    MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
loop0     7:0    0   91M  1 loop /snap/core/6405
loop1     7:1    0   56M  1 loop /snap/google-cloud-sdk/73
loop2     7:2    0 89.3M  1 loop /snap/core/6673
loop3     7:3    0 57.6M  1 loop /snap/google-cloud-sdk/77
loop4     7:4    0  5.5M  1 loop /var/udiLoopMount
sda       8:0    0   20G  0 disk
├─sda1    8:1    0 19.9G  0 part /
├─sda14   8:14   0    4M  0 part
└─sda15   8:15   0  106M  0 part /boot/efi
usertest@node64503-hx85:~$ shifter hostname
FAILED to find requested image.
usertest@node64503-hx85:~$ shifterimg images
gcpnode    docker     READY    3d3a12fa40   2019-04-11T19:44:41 bash:latest
usertest@node64503-hx85:~$ srun hostname
node64503-hx85
usertest@node64503-hx85:~$ srun shifter hostname
FAILED to find requested image.
srun: error: gcp-c1-13g-kofu1: task 0: Exited with exit code 1
usertest@node64503-hx85:~$ shifter --image=bash
FAILED to find requested image.

EDIT: Actually, if I give read permissions to the images (on an NFS shared dir exported with the no_root_squash option), I now get this error:

usertest@node64503-hx85:~$ srun shifter hostname
Not running with root privileges, will fail.
srun: error: gcp-c1-13g-kofu1: task 0: Exited with exit code 1

I assume shifter does not need to be run as root, right? Does every user need sudo rights?

scanon commented 5 years ago

If you run shifterimg images on a compute node, do you see the list?

qlux commented 5 years ago

Yes, see above; node64503 refers to the compute node:

usertest@node64503-hx85:~$ shifterimg images
gcpnode    docker     READY    3d3a12fa40   2019-04-11T19:44:41 bash:latest
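
For scripting the same check, here is a minimal sketch that tests whether a given tag appears as READY in a listing. It assumes the column layout shown in this thread (system, type, status, id, timestamp, tag); the function name image_ready is hypothetical and not part of Shifter:

```shell
#!/bin/sh
# Succeed if the listing on stdin has a READY entry for the given tag.
# Assumes the column layout shown in this thread:
#   system  type  status  id  timestamp  tag
image_ready() {
    awk -v tag="$1" '$3 == "READY" && $NF == tag { found = 1 } END { exit !found }'
}

# Example (on a compute node):
#   shifterimg images | image_ready bash:latest && echo "image is ready"
```

A READY entry only shows that the image gateway knows about the image; as this thread shows, shifter itself can still fail for unrelated reasons.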

scanon commented 5 years ago

Is the shifter binary setuid on the compute nodes? Can you do an ls -l?

qlux commented 5 years ago

You're right, they are not:

usertest@login-node-1:~$ ll /ai/apps/shifter/bin/
total 1052
drwxrwxr-x 2 root root     39 Apr 11 15:00 ./
drwxrwxr-x 8 root root     79 Apr 11 15:00 ../
-rwxr-xr-x 1 root root 511288 Apr 11 15:00 shifter*
-rwxr-xr-x 1 root root 564192 Apr 11 15:00 shifterimg*

Does it need the s (setuid) bit? Is that a security concern?

qlux commented 5 years ago

You're right, it was the suid bit that was causing the issue! It must have been dropped somewhere when changing permissions to root. Thank you for your help, I'll read a bit more on it. Everything seems to be set up now; I hope to contribute to the project.
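
For readers hitting the same error: in the listing above, the shifter binary has mode -rwxr-xr-x, so it has no setuid bit, which fits the "Not running with root privileges" message. A minimal sketch of checking for the bit and restoring it follows. The function name has_setuid is hypothetical, the path is the install location from this thread, and the chmod has to be run as root:

```shell
#!/bin/sh
# Succeed if FILE exists and has its set-user-ID bit set.
has_setuid() {
    [ -u "$1" ]
}

# Example, using the install path from this thread (run the chmod as root):
#   has_setuid /ai/apps/shifter/bin/shifter || chmod u+s /ai/apps/shifter/bin/shifter
#   ls -l /ai/apps/shifter/bin/shifter    # mode should now start with -rws
```

The bit has no effect on filesystems mounted with the nosuid option, so the mount options are worth checking too when the install lives on a shared filesystem such as NFS.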