NERSC / podman-hpc

`podman-hpc pull` fail on a specific login node #86

Closed asnaylor closed 1 year ago

asnaylor commented 1 year ago

Pulling hrzhao076/custom_backend:2.0 on perlmutter:login06 failed with a KeyError, but the same pull succeeded on login07.

perlmutter:login06 | ~ $ podman-hpc pull hrzhao076/custom_backend:2.0
WARN[0000] "/" is not a shared mount, this could cause issues or missing mounts with rootless containers
ERRO[0000] Refreshing volume 4f6b6ad49882dfcc4fe3a4c935209406c8ac3e51c8fd6669978b0ff9e88d041c: acquiring lock 2 for volume 4f6b6ad49882dfcc4fe3a4c935209406c8ac3e51c8fd6669978b0ff9e88d041c: file exists
✔ docker.io/hrzhao076/custom_backend:2.0
Trying to pull docker.io/hrzhao076/custom_backend:2.0...
Getting image source signatures
Copying blob 438769c28cd4 done
........
Copying config e4f68a21c4 done
Writing manifest to image destination
Storing signatures
e4f68a21c4ba743a94cfbcabdb39599a38f6501795d4b704238c772177dad768
INFO: Migrating image to /pscratch/sd/a/asnaylor/storage
Traceback (most recent call last):
  File "/usr/bin/podman-hpc", line 11, in <module>
    load_entry_point('podman-hpc==1.0.2', 'console_scripts', 'podman-hpc')()
  File "/usr/lib/python3.6/site-packages/podman_hpc/podman_hpc.py", line 388, in main
    podhpc(prog_name="podman-hpc")
  File "/usr/lib/python3.6/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3.6/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/click/decorators.py", line 64, in new_func
    return ctx.invoke(f, obj, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/podman_hpc/podman_hpc.py", line 177, in pull
    mu.migrate_image(image)
  File "/usr/lib/python3.6/site-packages/podman_hpc/migrate2scratch.py", line 435, in migrate_image
    rld = self._get_img_layers(self.src, img_id)
  File "/usr/lib/python3.6/site-packages/podman_hpc/migrate2scratch.py", line 321, in _get_img_layers
    ld = by_digest[layer["digest"]]
KeyError: 'sha256:a68ad9aa133df8e5dc93cc27c4452e2875d590e0b6762b5c8dfda4c3b2a03949'

lastephey commented 1 year ago

Could this be related to the ongoing Lustre issues?

hrzhao76 commented 1 year ago

Hi, I don't think it's tied to a specific login node.
Today I pushed hrzhao076/custom_backend:2.1 from login09 and then pulled it on the same node, and the pull failed again.
Switching to another login node and pulling again works. Maybe the problem is simply that one cannot push and pull an image on the same node?

Copying blob d1c5a39be588 done
Copying blob b2eb8f42dffa done
Copying blob 1a2a288b4b59 skipped: already exists
Copying config bddaf9cb14 done
Writing manifest to image destination
Storing signatures
bddaf9cb1464e14dc3b35f22e6e2e75ad4eec59ed98f16aef599ef7d4c0f41e6
INFO: Migrating image to /pscratch/sd/h/hrzhao/storage
Traceback (most recent call last):
  File "/usr/bin/podman-hpc", line 11, in <module>
    load_entry_point('podman-hpc==1.0.2', 'console_scripts', 'podman-hpc')()
  File "/usr/lib/python3.6/site-packages/podman_hpc/podman_hpc.py", line 388, in main
    podhpc(prog_name="podman-hpc")
  File "/usr/lib/python3.6/site-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3.6/site-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/click/decorators.py", line 64, in new_func
    return ctx.invoke(f, obj, *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/click/decorators.py", line 17, in new_func
    return f(get_current_context(), *args, **kwargs)
  File "/usr/lib/python3.6/site-packages/podman_hpc/podman_hpc.py", line 177, in pull
    mu.migrate_image(image)
  File "/usr/lib/python3.6/site-packages/podman_hpc/migrate2scratch.py", line 435, in migrate_image
    rld = self._get_img_layers(self.src, img_id)
  File "/usr/lib/python3.6/site-packages/podman_hpc/migrate2scratch.py", line 321, in _get_img_layers
    ld = by_digest[layer["digest"]]
KeyError: 'sha256:1a2a288b4b593904fe90ec4335d78ae7b1026a979ff43a02aa88374a63dae5dc'
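
Note that the digest in this KeyError (sha256:1a2a288b4b59...) belongs to the blob the log above reports as "skipped: already exists". The failing statement in the traceback, `by_digest[layer["digest"]]`, is a plain dictionary lookup, so it raises as soon as one manifest layer is absent from the index. The following is a minimal sketch of that failure mode and of a check that reports every missing digest at once. It is not the podman-hpc source: the function names, the record layout, and the placeholder digests are all assumptions made for illustration; only the `layer["digest"]` lookup pattern comes from the traceback.

```python
# Hypothetical sketch, NOT the podman-hpc implementation.
# Assumption: both manifest layers and local layer records are dicts
# carrying a "digest" key; every name here is made up for illustration.


def index_layers_by_digest(layer_records):
    """Build a digest -> record index from a list of local layer records."""
    return {rec["digest"]: rec for rec in layer_records}


def find_missing_digests(manifest_layers, layer_records):
    """Return manifest digests that have no entry in the local layer index."""
    by_digest = index_layers_by_digest(layer_records)
    return [
        layer["digest"]
        for layer in manifest_layers
        if layer["digest"] not in by_digest
    ]


def get_img_layers(manifest_layers, layer_records):
    """Resolve manifest layers, raising a readable error if any are absent."""
    missing = find_missing_digests(manifest_layers, layer_records)
    if missing:
        raise LookupError(
            "layers missing from local store index: " + ", ".join(missing)
        )
    by_digest = index_layers_by_digest(layer_records)
    return [by_digest[layer["digest"]] for layer in manifest_layers]


# Placeholder digests, not taken from the logs in this thread.
manifest = [{"digest": "sha256:aaa"}, {"digest": "sha256:bbb"}]
records = [{"digest": "sha256:aaa", "id": "layer-a"}]
print(find_missing_digests(manifest, records))
```

A check like this would not fix the underlying inconsistency, but it would turn the bare KeyError into a message that names the layers the local store does not know about.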
lastephey commented 1 year ago

Is this still happening @asnaylor and @hrzhao76?

asnaylor commented 1 year ago

Hmm, I have a different error now:

asnaylor@perlmutter:login02 | ~ $ podman-hpc pull hrzhao076/custom_backend:2.1
....
Copying blob 4713e6baa1cc done
Copying blob ecac3aeb0b12 done
Copying blob d1c5a39be588 done
Copying blob b2eb8f42dffa done
Copying blob 1a2a288b4b59 [==================================>---] 8.7GiB / 9.4GiB
Error: writing blob: storing blob to file "/tmp/storage2742201414/47": happened during read: (heuristic tuning data: last retry 9290086454, current offset 9290086454; 361049.516 ms total, 63015.394 ms since progress): unexpected EOF
Pull failed.
asnaylor commented 1 year ago

It was fine when I ran it again.
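
Since the "unexpected EOF" failure went away on a second attempt, a small retry wrapper may be enough for transient download errors like this one. The sketch below is only an illustration: `pull_with_retry` and its parameters are hypothetical, and it assumes that `podman-hpc pull` exits with a non-zero status when a pull fails, which this thread does not confirm.

```python
import subprocess
import time


def pull_with_retry(image, attempts=3, delay_s=5.0,
                    run=subprocess.run, sleep=time.sleep):
    """Run `podman-hpc pull IMAGE`, retrying while the command exits non-zero.

    Returns the 1-based number of the attempt that succeeded and raises
    RuntimeError if every attempt fails. `run` and `sleep` are injectable
    so the logic can be exercised without invoking podman-hpc.
    Assumption: a failed pull is signalled by a non-zero exit status.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        result = run(["podman-hpc", "pull", image])
        if result.returncode == 0:
            return attempt
        if attempt < attempts:
            # Wait before the next try; no sleep after the final failure.
            sleep(delay_s)
    raise RuntimeError(f"pull of {image} failed after {attempts} attempts")
```

For example, `pull_with_retry("hrzhao076/custom_backend:2.1")` would try the pull up to three times, five seconds apart. Retrying only helps with transient errors; it would not address the KeyError raised during image migration earlier in this thread.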

lastephey commented 1 year ago

Thanks for checking. I'll go ahead and close, but please re-open if you see it again.