NERSC / podman-hpc

problems with --userns=keep-id for squashed images #83

Closed lastephey closed 1 year ago

lastephey commented 1 year ago

It seems we have lost some of our --userns=keep-id functionality for squashed images. This test works fine on a login node with an unsquashed image, but fails on a compute node using the squashed image:

stephey@muller:login01:~> podman-hpc pull ubuntu:latest
WARN[0000] Failed to decode the keys ["storage.options.overlay.squashmount"] from "/global/homes/s/stephey/.config/containers/storage.conf" 
WARN[0000] Failed to decode the keys ["storage.options.overlay.squashmount"] from "/global/u1/s/stephey/.config/containers/storage.conf" 
Resolved "ubuntu" as an alias (/global/homes/s/stephey/.cache/containers/short-name-aliases.conf)
Trying to pull docker.io/library/ubuntu:latest...
Getting image source signatures
Copying blob 3153aa388d02 done  
Copying config 5a81c4b850 done  
Writing manifest to image destination
Storing signatures
5a81c4b8502e4979e75bd8f91343b95b0d695ab67f241dbed0d1530a35bde1eb
INFO: Migrating image to /mscratch/sd/s/stephey/storage
stephey@muller:login01:~> podman-hpc run --rm -it --entrypoint=/bin/bash --userns=keep-id ubuntu:latest 
WARN[0000] Failed to decode the keys ["storage.options.overlay.squashmount"] from "/global/homes/s/stephey/.config/containers/storage.conf" 
WARN[0000] Failed to decode the keys ["storage.options.overlay.squashmount"] from "/global/u1/s/stephey/.config/containers/storage.conf" 
stephey@02f5975d0240:/$ exit
exit
stephey@muller:login01:~> salloc -N 1 -t 30 -C cpu -A nstaff -q interactive
salloc: Granted job allocation 352192
salloc: Waiting for resource configuration
salloc: Nodes nid001007 are ready for job
stephey@nid001007:~> podman-hpc run --rm -it --entrypoint=/bin/bash --userns=keep-id ubuntu:latest
WARN[0000] Failed to decode the keys ["storage.options.overlay.squashmount"] from "/global/homes/s/stephey/.config/containers/storage.conf" 
WARN[0000] Failed to decode the keys ["storage.options.overlay.squashmount"] from "/global/u1/s/stephey/.config/containers/storage.conf" 
Error: runc: runc create failed: unable to start container process: exec: "/bin/bash": stat /bin/bash: permission denied: OCI permission denied
stephey@nid001007:~> 

This is blocking some tests I'd like to run for Open MPI support. @scanon, do you think you could take a look and see if you can reproduce it?

danfulton commented 1 year ago

Do we have the ignore_chown_errors option on for the pull (or the squash) in this case? I'm starting to suspect it may be lossy.
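
One rough way to check that setting is sketched below. It assumes the option lives in the per-user storage.conf that appears in the warnings above (by default $HOME/.config/containers/storage.conf); podman-hpc may generate or override its storage configuration elsewhere, so this only covers that one file. The function name chown_errors_setting is hypothetical, not part of podman or podman-hpc.

```shell
# Hypothetical helper: report the ignore_chown_errors value in a storage.conf.
# Assumption: the key is written as `ignore_chown_errors = "true"` (quotes optional).
# Prints the value, "unset" if the key is absent or commented out,
# or "unreadable" if the file cannot be read.
chown_errors_setting() {
    conf="${1:-$HOME/.config/containers/storage.conf}"
    if [ ! -r "$conf" ]; then
        echo "unreadable"
        return 1
    fi
    # Keep only uncommented assignments; if the key appears more than once,
    # report the last occurrence.
    value=$(sed -n 's/^[[:space:]]*ignore_chown_errors[[:space:]]*=[[:space:]]*"\{0,1\}\([^"#[:space:]]*\)"\{0,1\}.*$/\1/p' "$conf" | tail -n 1)
    echo "${value:-unset}"
}
```

Calling chown_errors_setting with no argument inspects the default per-user file; passing a path checks a different file, such as a system-wide configuration. It does not say whether the option was in effect when a particular image was pulled or squashed.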

tylern4 commented 1 year ago

@balewski and I ran into the same issue when trying to get an image working on Perlmutter. An image built on one login node and then migrated gives the same kind of runc error when we try to run it on another login node.

[perlmutter-login31:~/scratch/microsoft]$ podman-hpc images
REPOSITORY               TAG         IMAGE ID      CREATED     SIZE        R/O
localhost/tylern/csharp  sdk         951a9362ead4  3 days ago  886 MB      true

[perlmutter-login31:~/scratch/microsoft]$ podman-hpc run --userns keep-id --rm -it tylern/csharp:sdk
Error: runc: runc create failed: unable to start container process: exec: "bash": executable file not found in $PATH: OCI runtime attempted to invoke a command that was not found

[perlmutter-login31:~/scratch/microsoft]$ podman-hpc run --rm -it tylern/csharp:sdk
root@894cd5c223cb:/#

lastephey commented 1 year ago

Should hopefully be addressed by #85