apptainer / singularity

Singularity has been renamed to Apptainer as part of the project's move to the Linux Foundation. This repo has been preserved as a snapshot taken right before the changes.
https://github.com/apptainer/apptainer

Race conditions when caching images sometimes cause cache corruption. #3634

Closed: rhpvorderman closed this issue 4 years ago

rhpvorderman commented 5 years ago

Version of Singularity:

3.1.0

Expected behavior

When two singularity processes pull the same image, some measure is taken to ensure that they do not write to the cache at the same time.

Actual behavior

Two singularity processes will write to the cache at the same time. Oddly enough, this works fine in most cases. However, we sometimes get cache corruption on our cluster. This happens when we start multiple jobs that require the same image simultaneously.

Steps to reproduce behavior

  1. singularity cache clean --all
  2. run singularity shell docker://python:3.7 simultaneously in two different terminals.

EDIT: I realize it is very hard to reproduce behaviour that happens 'sometimes'. I could not find a similar issue so I hope that other people with the same problem manage to find this one.
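
For reference, a minimal shell sketch of the reproduction (a guess at a scriptable variant: it uses exec ... true instead of an interactive shell, and since the corruption is intermittent it may need several attempts):

singularity cache clean --all
# Start two pulls of the same image at (almost) the same time, so both
# processes try to populate the shared cache concurrently.
singularity exec docker://python:3.7 true &
singularity exec docker://python:3.7 true &
wait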

WestleyK commented 5 years ago

Yeah, I can reproduce something like what you describe. Is this similar to your error?

$ singularity exec docker://ubuntu ls & sleep 2.0s && singularity cache clean --all
[1] 32257
INFO:    Converting OCI blobs to SIF format
INFO:    Starting build...
Getting image source signatures
Copying blob sha256:6abc03819f3e00a67ed5adc1132cfec041d5f7ec3c29d5416ba0433877547b6f
 27.52 MiB / 27.52 MiB [====================================================] 3s
Copying blob sha256:05731e63f21105725a5c062a725b33a54ad8c697f9c810870c6aa3e3cd9fb6a2
 844 B / 844 B [============================================================] 0s
Copying blob sha256:0bd67c50d6beeb55108476f72bea3b4b29a9f48832d6e045ec66b7ac4bf712a0
 164 B / 164 B [============================================================] 0s
Copying config sha256:68eb5e93296fbcd70feb84182a3121664ec2613435bd82f2e1205136352ae031
 2.36 KiB / 2.36 KiB [======================================================] 0s
Writing manifest to image destination
Storing signatures
INFO:    Creating SIF file...
FATAL:   Unable to handle docker://ubuntu uri: unable to build: While creating SIF: while creating container: container file creation failed: open /home/westleyk/.singularity/cache/oci-tmp/f08638ec7ddc90065187e7eabdfac3c96e5ff0f6b2f1762cf31a4f49b53000a5/ubuntu_latest.sif: no such file or directory

[1]+  Exit 255                singularity exec docker://ubuntu ls
WestleyK commented 5 years ago

After some messing around, I seem to have corrupted my cache. Is this more like the error message you got?

$ singularity pull library://alpine:latest
INFO:    Downloading library image
 2.59 MiB / 2.59 MiB [=======================================================] 100.00% 3.60 MiB/s 0s
FATAL:   While pulling library image: while opening cached image: open : no such file or directory


EDIT: this bug is not related to this issue; I was not on the master branch :man_facepalming:

WestleyK commented 5 years ago

Btw, my singularity version is:

3.2.0-513.g3c02d0904
tbugfinder commented 5 years ago

I'm running two pulls in parallel using:

$ singularity --version
singularity version 3.2.1-1.el7

Parallel Pull:

$ rm -Rf ~/.singularity/cache/ ;  rm -f *.img ; strace -ff -o /tmp/singularity/ubuntu1810.strace singularity pull --name ubuntu1810.img docker://ubuntu:18.10  & strace -ff -o /tmp/singularity/ubuntu1804.strace singularity pull --name ubuntu1804.img docker://ubuntu:18.04
[1] 262982
INFO:    Starting build...
INFO:    Starting build...
Getting image source signatures
Copying blob sha256:89074f19944ee6c68e5da6dea5004e1339e4e8e9c54ea39641ad6e0bc0e4223b
Getting image source signatures
Copying blob sha256:6abc03819f3e00a67ed5adc1132cfec041d5f7ec3c29d5416ba0433877547b6f
 27.52 MiB / 27.52 MiB [====================================================] 1s
Copying blob sha256:05731e63f21105725a5c062a725b33a54ad8c697f9c810870c6aa3e3cd9fb6a2
 27.89 MiB / 27.89 MiB [====================================================] 2s
Copying blob sha256:6cd3a42e50dfbbe2b8a505f7d3203c07e72aa23ce1bdc94c67221f7e72f9af6c
 844 B / 844 B [============================================================] 0s
Copying blob sha256:0bd67c50d6beeb55108476f72bea3b4b29a9f48832d6e045ec66b7ac4bf712a0
 865 B / 865 B [============================================================] 0s
Copying blob sha256:26b902a7bf04aa8d7c02fd742898dab4b6c791b8e363fddc06298191167d5fac
 162 B / 162 B [============================================================] 0s
 164 B / 164 B [============================================================] 0s
Copying config sha256:7c8c583f970820a51dab6e0613761c4f99077d9a22b373a59f47ee2afb247e72
 0 B / 2.36 KiB [--------------------------------------------------------------]Copying config sha256:68eb5e93296fbcd70feb84182a3121664ec2613435bd82f2e1205136352ae031
 2.36 KiB / 2.36 KiB [======================================================] 0s
Writing manifest to image destination
 2.36 KiB / 2.36 KiB [======================================================] 0s
Writing manifest to image destination
Storing signatures
Storing signatures
FATAL:   Unable to pull docker://ubuntu:18.10: conveyor failed to get: Error initializing source oci:/home/sigim/.singularity/cache/oci:50c1dc36867d3caf13f3c07456b40c57b3e6a4dcda20d05feac2c15e357353d4: no descriptor found for reference "50c1dc36867d3caf13f3c07456b40c57b3e6a4dcda20d05feac2c15e357353d4"
INFO:    Creating SIF file...
INFO:    Build complete: ubuntu1804.img
[1]+  Exit 255                strace -ff -o /tmp/singularity/ubuntu1810.strace singularity pull --name ubuntu1810.img docker://ubuntu:18.10
rhpvorderman commented 5 years ago

We sometimes get this error when we have cache corruption: FATAL: container creation failed: mount error: can't remount /run/shm: no such file or directory. But maybe that is caused by something else. Never mind, this was not related.

I am glad you were able to reproduce the race-conditions! Thanks!

WestleyK commented 5 years ago

After some messing around, I seem to have corrupted my cache. Is this more like the error message you got?

$ singularity pull library://alpine:latest
INFO:    Downloading library image
 2.59 MiB / 2.59 MiB [=======================================================] 100.00% 3.60 MiB/s 0s
FATAL:   While pulling library image: while opening cached image: open : no such file or directory

Never mind this ^^^ problem; I was on a dev branch (not master) :man_facepalming:. That issue has nothing to do with cache corruption.

But there is still a bug if you clean the cache while building a container, which may not be a bug...

tbugfinder commented 5 years ago

@WestleyK Nextflow in particular uses parallel pulls prior to workflow execution.

tbugfinder commented 5 years ago

Is there any chance for a bug fix?

adamnovak commented 5 years ago

I'm also getting issues like this in Toil workflows trying to use Singularity:

Unable to handle docker://devorbitus/ubuntu-bash-jq-curl uri: unable to build: conveyor failed to get: no descriptor found for reference "7f5e6bce78bb52d74e6a0881ec91806d11978cedfd4caa43a6fb71c55350254a"

In practice it seems difficult to prevent other software running as the same user from using Singularity to run the same image you are trying to run. The only workaround I can come up with is always setting your own SINGULARITY_CACHEDIR, at which point you lose the benefit of caching between tasks.
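
For illustration, a sketch of that workaround (the temporary-directory path is arbitrary, and the image is the one from this thread); each task gets a private cache, so concurrent pulls cannot collide, at the cost of re-downloading layers for every task:

# Give this task its own private cache so concurrent pulls never share files.
export SINGULARITY_CACHEDIR="$(mktemp -d /tmp/singularity-cache.XXXXXX)"
singularity exec docker://devorbitus/ubuntu-bash-jq-curl true
# Remove the per-task cache when the task is done.
rm -rf "$SINGULARITY_CACHEDIR"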

tbugfinder commented 5 years ago

I've upgraded to the latest version and still end up with the conveyor error:

$ singularity --version
singularity version 3.4.2-1.el7

Caused by:
  Failed to pull singularity image
  command: singularity pull  --name ubuntu-18.10.img docker://ubuntu:18.10 > /dev/null
  status : 255
  message:
    INFO:    Converting OCI blobs to SIF format
    INFO:    Starting build...
    Getting image source signatures
    Copying blob sha256:8a532469799e09ef8e1b56ebe39b87c8b9630c53e86380c13fbf46a09e51170e

     0 B / 25.82 MiB [-------------------------------------------------------------]
     8.88 MiB / 25.82 MiB [===================>------------------------------------]
     15.61 MiB / 25.82 MiB [=================================>---------------------]
     21.16 MiB / 25.82 MiB [=============================================>---------]
     25.82 MiB / 25.82 MiB [====================================================] 0s
    Copying blob sha256:32f4dcec3531395ca50469cbb6cba0d2d4fed1b8b2166c83b25b2f5171c7db62

     0 B / 34.32 KiB [-------------------------------------------------------------]
     34.32 KiB / 34.32 KiB [====================================================] 0s
    Copying blob sha256:230f0701585eb7153c6ba1a9b08f4cfbf6a25d026d7e3b78a47c0965e4c6d60a

     0 B / 868 B [-----------------------------------------------------------------]
     868 B / 868 B [============================================================] 0s
    Copying blob sha256:e01f70622967c0cca68d6a771ae7ff141c59ab979ac98b5184db665a4ace6415

     0 B / 164 B [-----------------------------------------------------------------]
     164 B / 164 B [============================================================] 0s
    Copying config sha256:e4186b579c943dcced1341ccc4b62ee0617614cafc5459733e2f2f7ef708f224

     0 B / 2.42 KiB [--------------------------------------------------------------]
     2.42 KiB / 2.42 KiB [======================================================] 0s
    Writing manifest to image destination
    Storing signatures
    FATAL:   While making image from oci registry: while building SIF from layers: conveyor failed to get: no descriptor found for reference "7d657275047118bb77b052c4c0ae43e8a289ca2879ebfa78a703c93aa8fd686c"
adamnovak commented 4 years ago

In response to https://github.com/sylabs/singularity/issues/4555#issuecomment-570612570: it would be extremely useful for my use case to have some synchronization inside Singularity that depends on atomic, globally-consistent rename support, or even on file-lock support, in the backing filesystem. The result would, AFAIK, be no worse when Singularity runs on multiple machines against a filesystem without these primitives (i.e. you'd still get uncontrolled races and apparently arbitrary failures), but on a single machine with an ext4 home directory (which covers e.g. most cloud VMs) you would get actually reliable behavior.
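
A user-side approximation of that idea (not something Singularity itself does) is to serialize cache-populating pulls with flock(1); the lock-file path here is arbitrary, and this only helps where the lock file lives on a filesystem with working POSIX lock support, such as a local ext4 home directory:

# Hold an advisory lock so only one process writes to the shared cache at a time.
mkdir -p "$HOME/.singularity"
flock "$HOME/.singularity/pull.lock" \
    singularity pull --name ubuntu1810.img docker://ubuntu:18.10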

dtrudg commented 4 years ago

@adamnovak - understood. There have been some caching code changes since earlier 3.x versions that I'm not entirely familiar with yet, but I believe we have fewer issues now. We can try to establish exactly where problems remain, and look at improvements for the fairly constrained case you describe in the next release cycle. I just don't want to promise that we can solve things simply for people who want to share cache directories between multiple users on arbitrary cluster filesystems.

We still recommend that you singularity pull into a SIF file in a single location (e.g. from a single script) before any concurrent execution, and run against that immutable SIF.
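
As a sketch of that recommendation (image name and job count are placeholders): pull once, up front, and point every concurrent job at the resulting immutable SIF instead of the cache:

# Pull once into an immutable SIF before any concurrent work starts.
singularity pull python-3.7.sif docker://python:3.7
# All concurrent jobs then run against the SIF file, not the cache.
for i in 1 2 3; do
    singularity exec python-3.7.sif python --version &
done
wait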

dtrudg commented 4 years ago

This has surfaced again in #5020 - I'm going to close this issue and we'll pick it up there. There is a plan on that issue for moving forward.