NERSC / podman-hpc


Problems pulling a docker image on a Perlmutter Compute node #80

Closed: asnaylor closed this issue 10 months ago

asnaylor commented 1 year ago

I was trying to pull nvcr.io/nvidia/tritonserver:22.02-py3 on a Perlmutter compute node, but the squash step failed. I can pull this image fine on a login node.

$ podman-hpc pull nvcr.io/nvidia/tritonserver:22.02-py3
WARN[0000] "/" is not a shared mount, this could cause issues or missing mounts with rootless containers
ERRO[0000] Image nvcr.io/nvidia/tritonserver:22.02-py3 exists in local storage but may be corrupted (remove the image to resolve the issue): layer not known
Trying to pull nvcr.io/nvidia/tritonserver:22.02-py3...
Getting image source signatures
Copying blob 0627d99a5c7e done
Copying blob 08c01a0ec47e done
Copying blob 2c5671971cd1 done
Copying blob 0bd41856d1bc done
Copying blob 5eedfe82c26b done
Copying blob 2514f64065e8 done
Copying blob ec1508216c18 done
Copying blob 749074f382f7 done
Copying blob a7e6794ed569 done
Copying blob 4f1ad9e2a154 done
Copying blob 09cd2bfa6cba done
Copying blob 9da140dedb4f done
Copying blob 2935286a91f6 done
Copying blob 2b0599a695e2 skipped: already exists
Copying blob 628e96fc9140 skipped: already exists
Copying blob d2eb54715c6f skipped: already exists
Copying blob a6d158916ac9 skipped: already exists
Copying blob 5caa897e63ad skipped: already exists
Copying blob 164c096fb0d9 skipped: already exists
Copying blob ff1be69fcc70 skipped: already exists
Copying blob 38972466c8c5 skipped: already exists
Copying blob d260d4926e86 skipped: already exists
Copying blob 665a51e67490 skipped: already exists
Copying blob 4df3be64118e skipped: already exists
Copying blob 6072fed44aff skipped: already exists
Copying blob 878d8a1d0950 skipped: already exists
Copying blob aaaa03356373 skipped: already exists
Copying blob ec889413fc1a skipped: already exists
Copying blob 1d0881aa8a74 skipped: already exists
Copying blob 30efd5ebf026 skipped: already exists
Copying blob d3ae827fe332 skipped: already exists
2023/07/07 14:37:05 bolt.Close(): funlock error: errno 524
2023/07/07 14:37:05 bolt.Close(): funlock error: errno 524
Copying config d52ac03519 done
2023/07/07 14:37:05 bolt.Close(): funlock error: errno 524
2023/07/07 14:37:05 bolt.Close(): funlock error: errno 524
2023/07/07 14:37:05 bolt.Close(): funlock error: errno 524
Writing manifest to image destination
Storing signatures
d52ac03519ab5da3e240477c27defe10687fbcc2484cf6e62d38006638a0bae7
WARN[0068] Failed to add pause process to systemd sandbox cgroup: dial unix /run/user/75235/bus: connect: no such file or directory
INFO: Migrating image to /pscratch/sd/a/asnaylor/storage
ERROR:root:Squash Failed
ERROR:root:
ERROR:root:time="2023-07-07T14:37:05-07:00" level=warning msg="The cgroupv2 manager is set to systemd but there is no systemd user session available"
time="2023-07-07T14:37:05-07:00" level=warning msg="For using systemd, you may need to login using an user session"
time="2023-07-07T14:37:05-07:00" level=warning msg="Alternatively, you can enable lingering with: `loginctl enable-linger 75235` (possibly as root)"
time="2023-07-07T14:37:05-07:00" level=warning msg="Falling back to --cgroup-manager=cgroupfs"
time="2023-07-07T14:37:05-07:00" level=warning msg="The cgroupv2 manager is set to systemd but there is no systemd user session available"
time="2023-07-07T14:37:05-07:00" level=warning msg="For using systemd, you may need to login using an user session"
time="2023-07-07T14:37:05-07:00" level=warning msg="Alternatively, you can enable lingering with: `loginctl enable-linger 75235` (possibly as root)"
time="2023-07-07T14:37:05-07:00" level=warning msg="Falling back to --cgroup-manager=cgroupfs"
time="2023-07-07T14:37:05-07:00" level=warning msg="Can't read link \"/tmp/75235_hpc/storage/overlay/l/JEZ2HQREEFDRMUHHFJ3NCKBF27\" because it does not exist. A storage corruption might have occurred, attempting to recreate the missing symlinks. It might be best wipe the storage to avoid further errors due to storage corruption."
Error: readlink /tmp/75235_hpc/storage/overlay/l/JEZ2HQREEFDRMUHHFJ3NCKBF27: no such file or directory

As a temporary fix, @lastephey suggested running:

podman-hpc pull --storage-opt mount_program=/usr/bin/fuse-overlayfs-wrap nvcr.io/nvidia/tritonserver:22.02-py3

lastephey commented 1 year ago

Thanks @asnaylor. I think the easiest fix is to add this to default_pull_args.

The only question is whether using fuse-overlayfs rather than the native overlay driver adds a lot of slowdown on login nodes. If that's the case, we could add some logic for podman-hpc to determine whether it's on a compute node or a login node, although that would make things a bit messy. I'll do some testing.
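For illustration only, here is a minimal Python sketch of the kind of check described above. It is not podman-hpc's actual code: the names (FUSE_OVERLAYFS_ARGS, on_compute_node, default_pull_args) are hypothetical, and it assumes Perlmutter compute nodes can be recognized by a nid-style hostname or a Slurm job environment variable.

import os
import socket

# Hypothetical sketch, not podman-hpc's actual implementation: decide whether
# to force the fuse-overlayfs mount program based on where the pull runs.
FUSE_OVERLAYFS_ARGS = [
    "--storage-opt", "mount_program=/usr/bin/fuse-overlayfs-wrap",
]

def on_compute_node():
    # Assumption: compute nodes report hostnames like nidXXXXXX and batch or
    # interactive jobs set SLURM_JOB_ID; login nodes typically do neither.
    hostname = socket.gethostname()
    return hostname.startswith("nid") or "SLURM_JOB_ID" in os.environ

def default_pull_args():
    # The simplest fix is to return FUSE_OVERLAYFS_ARGS unconditionally, i.e.
    # always add the flag to the default pull arguments. The conditional
    # version keeps the native overlay driver on login nodes in case
    # fuse-overlayfs turns out to be noticeably slower there.
    return FUSE_OVERLAYFS_ARGS if on_compute_node() else []

Either way, a pull on a compute node would then behave like the explicit --storage-opt workaround shown earlier in this thread.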

lastephey commented 10 months ago

Should be addressed by https://github.com/NERSC/podman-hpc/pull/82