Closed: rhpvorderman closed this issue 4 years ago.
Yeah, I can reproduce something like what you describe. Is this similar to your error?
$ singularity exec docker://ubuntu ls & sleep 2.0s && singularity cache clean --all
[1] 32257
INFO: Converting OCI blobs to SIF format
INFO: Starting build...
Getting image source signatures
Copying blob sha256:6abc03819f3e00a67ed5adc1132cfec041d5f7ec3c29d5416ba0433877547b6f
27.52 MiB / 27.52 MiB [====================================================] 3s
Copying blob sha256:05731e63f21105725a5c062a725b33a54ad8c697f9c810870c6aa3e3cd9fb6a2
844 B / 844 B [============================================================] 0s
Copying blob sha256:0bd67c50d6beeb55108476f72bea3b4b29a9f48832d6e045ec66b7ac4bf712a0
164 B / 164 B [============================================================] 0s
Copying config sha256:68eb5e93296fbcd70feb84182a3121664ec2613435bd82f2e1205136352ae031
2.36 KiB / 2.36 KiB [======================================================] 0s
Writing manifest to image destination
Storing signatures
INFO: Creating SIF file...
FATAL: Unable to handle docker://ubuntu uri: unable to build: While creating SIF: while creating container: container file creation failed: open /home/westleyk/.singularity/cache/oci-tmp/f08638ec7ddc90065187e7eabdfac3c96e5ff0f6b2f1762cf31a4f49b53000a5/ubuntu_latest.sif: no such file or directory
[1]+ Exit 255 singularity exec docker://ubuntu ls
After some messing around, I seem to have corrupted my cache. Is this more like the error message you got?
$ singularity pull library://alpine:latest
INFO: Downloading library image
2.59 MiB / 2.59 MiB [=======================================================] 100.00% 3.60 MiB/s 0s
FATAL: While pulling library image: while opening cached image: open : no such file or directory
EDIT: this bug is not related to this issue; I was not on the master branch :man_facepalming:
Btw, my singularity version is:
3.2.0-513.g3c02d0904
I'm running two pulls in parallel using this version:
$ singularity --version
singularity version 3.2.1-1.el7
Parallel Pull:
$ rm -Rf ~/.singularity/cache/ ; rm -f *.img ; strace -ff -o /tmp/singularity/ubuntu1810.strace singularity pull --name ubuntu1810.img docker://ubuntu:18.10 & strace -ff -o /tmp/singularity/ubuntu1804.strace singularity pull --name ubuntu1804.img docker://ubuntu:18.04
[1] 262982
INFO: Starting build...
INFO: Starting build...
Getting image source signatures
Copying blob sha256:89074f19944ee6c68e5da6dea5004e1339e4e8e9c54ea39641ad6e0bc0e4223b
Getting image source signatures
Copying blob sha256:6abc03819f3e00a67ed5adc1132cfec041d5f7ec3c29d5416ba0433877547b6f
27.52 MiB / 27.52 MiB [====================================================] 1s
Copying blob sha256:05731e63f21105725a5c062a725b33a54ad8c697f9c810870c6aa3e3cd9fb6a2
27.89 MiB / 27.89 MiB [====================================================] 2s
Copying blob sha256:6cd3a42e50dfbbe2b8a505f7d3203c07e72aa23ce1bdc94c67221f7e72f9af6c
844 B / 844 B [============================================================] 0s
Copying blob sha256:0bd67c50d6beeb55108476f72bea3b4b29a9f48832d6e045ec66b7ac4bf712a0
865 B / 865 B [============================================================] 0s
Copying blob sha256:26b902a7bf04aa8d7c02fd742898dab4b6c791b8e363fddc06298191167d5fac
162 B / 162 B [============================================================] 0s
164 B / 164 B [============================================================] 0s
Copying config sha256:7c8c583f970820a51dab6e0613761c4f99077d9a22b373a59f47ee2afb247e72
0 B / 2.36 KiB [--------------------------------------------------------------]Copying config sha256:68eb5e93296fbcd70feb84182a3121664ec2613435bd82f2e1205136352ae031
2.36 KiB / 2.36 KiB [======================================================] 0s
Writing manifest to image destination
2.36 KiB / 2.36 KiB [======================================================] 0s
Writing manifest to image destination
Storing signatures
Storing signatures
FATAL: Unable to pull docker://ubuntu:18.10: conveyor failed to get: Error initializing source oci:/home/sigim/.singularity/cache/oci:50c1dc36867d3caf13f3c07456b40c57b3e6a4dcda20d05feac2c15e357353d4: no descriptor found for reference "50c1dc36867d3caf13f3c07456b40c57b3e6a4dcda20d05feac2c15e357353d4"
INFO: Creating SIF file...
INFO: Build complete: ubuntu1804.img
[1]+ Exit 255 strace -ff -o /tmp/singularity/ubuntu1810.strace singularity pull --name ubuntu1810.img docker://ubuntu:18.10
We sometimes get this error when we have cache corruption:
FATAL: container creation failed: mount error: can't remount /run/shm: no such file or directory
But maybe that is caused by something else. EDIT: Never mind, this was not related.
I am glad you were able to reproduce the race conditions! Thanks!
After some messing around, I seem to have corrupted my cache. Is this more like the error message you got?
$ singularity pull library://alpine:latest
INFO: Downloading library image
2.59 MiB / 2.59 MiB [=======================================================] 100.00% 3.60 MiB/s 0s
FATAL: While pulling library image: while opening cached image: open : no such file or directory
Never mind this ^^^ problem; I was on a dev branch (not master) :man_facepalming:. That issue has nothing to do with cache corruption.
But there is still a bug if you clean the cache while building a container, which may or may not actually be a bug...
@WestleyK Nextflow in particular uses parallel pulls prior to workflow execution.
Is there any chance of a bug fix?
I'm also getting issues like this in Toil workflows trying to use Singularity:
Unable to handle docker://devorbitus/ubuntu-bash-jq-curl uri: unable to build: conveyor failed to get: no descriptor found for reference "7f5e6bce78bb52d74e6a0881ec91806d11978cedfd4caa43a6fb71c55350254a"
It seems difficult in practice to prevent other software running as the same user from using Singularity to run the same image you are trying to run. The only workaround I can come up with is always setting your own SINGULARITY_CACHEDIR, at which point you lose the benefit of caching between tasks.
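A minimal sketch of that SINGULARITY_CACHEDIR workaround, assuming a POSIX shell; the temporary-directory layout and the image pulled are only illustrative:
# Hypothetical per-task wrapper: give every task its own cache so parallel
# pulls never touch the same files, at the cost of re-downloading layers.
CACHE_DIR="$(mktemp -d "${TMPDIR:-/tmp}/singularity-cache.XXXXXX")"
export SINGULARITY_CACHEDIR="$CACHE_DIR"
singularity pull --name ubuntu1810.img docker://ubuntu:18.10
rm -rf "$CACHE_DIR"   # throw the per-task cache away afterwards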
I've upgraded to the latest version and still end up with the conveyor error:
$ singularity --version
singularity version 3.4.2-1.el7
Caused by:
Failed to pull singularity image
command: singularity pull --name ubuntu-18.10.img docker://ubuntu:18.10 > /dev/null
status : 255
message:
INFO: Converting OCI blobs to SIF format
INFO: Starting build...
Getting image source signatures
Copying blob sha256:8a532469799e09ef8e1b56ebe39b87c8b9630c53e86380c13fbf46a09e51170e
0 B / 25.82 MiB [-------------------------------------------------------------]
8.88 MiB / 25.82 MiB [===================>------------------------------------]
15.61 MiB / 25.82 MiB [=================================>---------------------]
21.16 MiB / 25.82 MiB [=============================================>---------]
25.82 MiB / 25.82 MiB [====================================================] 0s
Copying blob sha256:32f4dcec3531395ca50469cbb6cba0d2d4fed1b8b2166c83b25b2f5171c7db62
0 B / 34.32 KiB [-------------------------------------------------------------]
34.32 KiB / 34.32 KiB [====================================================] 0s
Copying blob sha256:230f0701585eb7153c6ba1a9b08f4cfbf6a25d026d7e3b78a47c0965e4c6d60a
0 B / 868 B [-----------------------------------------------------------------]
868 B / 868 B [============================================================] 0s
Copying blob sha256:e01f70622967c0cca68d6a771ae7ff141c59ab979ac98b5184db665a4ace6415
0 B / 164 B [-----------------------------------------------------------------]
164 B / 164 B [============================================================] 0s
Copying config sha256:e4186b579c943dcced1341ccc4b62ee0617614cafc5459733e2f2f7ef708f224
0 B / 2.42 KiB [--------------------------------------------------------------]
2.42 KiB / 2.42 KiB [======================================================] 0s
Writing manifest to image destination
Storing signatures
FATAL: While making image from oci registry: while building SIF from layers: conveyor failed to get: no descriptor found for reference "7d657275047118bb77b052c4c0ae43e8a289ca2879ebfa78a703c93aa8fd686c"
As a response to https://github.com/sylabs/singularity/issues/4555#issuecomment-570612570: it would be extremely useful for my use case to have some synchronization inside Singularity that relies on atomic, globally consistent rename support, or even on file-lock support, on the backing filesystem. AFAIK the result would be no worse where Singularity runs on multiple machines against a filesystem without such support (i.e. you would still get uncontrolled races and apparently arbitrary failures), but on a single machine with an ext4 home directory (which covers e.g. most cloud VMs) you would get actually reliable behaviour.
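To make the suggestion concrete, here is a rough sketch of the file-lock idea done outside Singularity, assuming a single machine and a filesystem with working flock; the lock file path is made up, and this is not something Singularity itself does today:
# Serialize pulls on one host with an advisory lock held for the whole pull.
# On filesystems without lock support this degrades to today's racy behaviour.
LOCKFILE="${SINGULARITY_CACHEDIR:-$HOME/.singularity/cache}/.pull.lock"
mkdir -p "$(dirname "$LOCKFILE")"
(
  flock -x 9   # block until no other pull on this host holds the lock
  singularity pull --name ubuntu1804.img docker://ubuntu:18.04
) 9>"$LOCKFILE"
Locking inside Singularity itself, keyed on the individual cache entry rather than the whole cache, would obviously be finer grained, but the idea is the same.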
@adamnovak - understood. There have been some changes to the caching code since earlier 3.x versions that I'm not entirely familiar with yet, but I believe we have fewer issues now. We can try to establish exactly where problems remain and look at improvements for the fairly constrained case you describe in the next release cycle. I just don't want to promise that we can solve things simply for people who want to share cache directories between multiple users on arbitrary cluster filesystems.
We still recommend that you singularity pull into a SIF file in a single location (a single script, etc.) before any concurrent execution, and run against that immutable SIF.
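A minimal illustration of that recommendation; the image and the commands run against it are placeholders:
# Pull exactly once, up front, into an immutable SIF ...
singularity pull --name ubuntu-18.04.sif docker://ubuntu:18.04
# ... then every concurrent job runs against the local file, not the cache.
singularity exec ubuntu-18.04.sif ls / &
singularity exec ubuntu-18.04.sif cat /etc/os-release &
wait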
This has surfaced again in #5020 - I'm going to close this issue and we'll pick it up there. We have a plan to move forward on this on that issue.
Version of Singularity:
3.1.0
Expected behavior
When two Singularity processes pull the same image, some measure should be taken so that they do not write to the cache at the same time.
Actual behavior
Two Singularity processes will write to the cache at the same time. Oddly enough, this works fine in most cases. However, we sometimes get cache corruption on our cluster. This happens when we start multiple jobs that require the same image simultaneously.
Steps to reproduce behavior
singularity cache clean --all
singularity shell docker://python:3.7
simultaneously in two different terminals.
EDIT: I realize it is very hard to reproduce behaviour that happens "sometimes". I could not find a similar issue, so I hope that other people with the same problem manage to find this one.
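For what it is worth, a rough scripted version of the reproduction steps above; exact behaviour may differ between Singularity versions (e.g. newer releases may ask for confirmation on cache clean):
# Race two pulls of the same image against a cold cache; stdin is closed so
# each interactive shell exits as soon as the container is ready.
singularity cache clean --all
singularity shell docker://python:3.7 < /dev/null &
singularity shell docker://python:3.7 < /dev/null &
wait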