Singularity jobs fail on first run waiting for image pull

neoformit commented 1 year ago

Describe the bug

I'm sure this has been noticed before, but I could not find an issue for it.

When singularity-enabled tools run for the first time they fail waiting for the image pull from biocontainers, with an obscure error:

FATAL:   Unable to handle docker://quay.io/biocontainers/gtdbtk:2.2.2--pyhdfd78af_0 uri: unable to create tmp file: open /mnt/singularity_data/tmp/sbuild-tmp-cache-171554111: no such file or directory

Galaxy Version and/or server at which you observed the bug

Galaxy Version: usegalaxy.org.au
Commit: 22.05

To Reproduce

Steps to reproduce the behavior:

1. Install a new tool to use the singularity runner (or delete the image for an existing one)
2. Try to run the tool for the first time

Expected behavior

The tool should work first time following install or update. I couldn't locate the code that throws "unable to create tmp file" but at this point I would expect the code to wait until the image pull has completed.

bernt-matthias commented 1 year ago

How are your container resolvers configured?

bernt-matthias commented 1 year ago

And, do you have a bit more context from the logs?

cat-bro commented 1 year ago

Hi @bernt-matthias, I'm pasting our container resolvers configuration. This happens for both mulled and explicit singularity containers.

    container_resolvers:
      - type: explicit
      - type: cached_explicit_singularity
        cache_directory: "{{ galaxy_tools_indices_dir }}/cache/singularity"
      - type: explicit_singularity
        cache_directory: "{{ galaxy_tools_indices_dir }}/cache/singularity"
      - type: cached_mulled_singularity
        cache_directory: "{{ galaxy_tools_indices_dir }}/cache/singularity"
      - type: mulled_singularity
        cache_directory: "{{ galaxy_tools_indices_dir }}/cache/singularity"
      - type: build_mulled_singularity
        cache_directory: "{{ galaxy_tools_indices_dir }}/cache/singularity"
        auto_install: false

bernt-matthias commented 1 year ago

At which point does this FATAL: .../ error occur? During the pull (which happens during job preparation), or during the actual job? Do you have the log messages from the container resolvers (maybe for the 1st and the 2nd run)?

My first thought was that auto_install: False might help for mulled_singularity. Then the pulled image should be used already in the first iteration (in the second iteration cached_mulled_singularity will find it).
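A minimal sketch of that idea, assuming the resolver list posted above. Only the auto_install line under mulled_singularity is new; the remaining resolvers would stay as posted, and whether this avoids the first-run failure is not confirmed here.

```yaml
container_resolvers:
  # ... explicit resolvers as posted above ...
  - type: cached_mulled_singularity
    cache_directory: "{{ galaxy_tools_indices_dir }}/cache/singularity"
  - type: mulled_singularity
    cache_directory: "{{ galaxy_tools_indices_dir }}/cache/singularity"
    # Assumption under discussion: with auto_install disabled, the resolver
    # returns the cached image path instead of the docker:// URI on the 1st run.
    auto_install: false
  # ... build_mulled_singularity as posted above ...
```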

But the problem might be different, since you wrote that it also happens for the explicit resolvers.

A few more comments:

- Since you are using singularity you might want to remove explicit (but it costs nearly nothing to keep it).
- Wondering if having both cached_explicit_singularity and explicit_singularity is necessary?

pcm32 commented 1 year ago

I remember sorting lots of issues with singularity by making sure that I was using the latest version (I found that the default version that was installed by HPC admins was quite old).

I also had some initial issues when the first tool invoking the container was part of a dataset collection (so multiple nodes making the pull at the same time for the execution of their element of the collection). This led me to have all containers pre-downloaded at a specific file system location (a lot of space used though).
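One way to express such a pre-downloaded setup in the resolver configuration, as a sketch only: keep just the cached_* resolver types from the list posted earlier, so that jobs never trigger a pull themselves. The cache path is a placeholder, and fetching the images out of band before a tool is enabled is an assumption rather than something described in this thread.

```yaml
container_resolvers:
  # Only resolvers that look up images already present on disk.
  # /path/to/prepulled/singularity is a hypothetical location.
  - type: cached_explicit_singularity
    cache_directory: "/path/to/prepulled/singularity"
  - type: cached_mulled_singularity
    cache_directory: "/path/to/prepulled/singularity"
```

With a list like this, a missing image would surface as an unresolved container instead of several nodes racing to pull the same image.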

bernt-matthias commented 1 year ago

I guess this is a duplicate of https://github.com/galaxyproject/galaxy/issues/15673 (see also the linked discussion; the error message looks the same). Is this happening in production?

I should have a fix in https://github.com/galaxyproject/galaxy/pull/15614/commits, i.e. fffb6f80be3c962cd0f3d4568ae207c8e852c0c3, 5637129c5eb9abf07328b08765375342436a7b03, and b37085b50a1e2e5f7c6270000ceaab3d9dbc6977

neoformit commented 1 year ago

Yep this is on usegalaxy.org.au. Great :crossed_fingers:

bernt-matthias commented 1 year ago

What this does not explain is why it fails only on the first run.

This could possibly be fixed/changed by adding auto_install: True to mulled_singularity. Note: this does not trigger auto installation as the name suggests; it only means that the cached image will be used on the 1st run (otherwise it's the docker:// URI). See also my (still developing) notes on this in https://github.com/galaxyproject/galaxy/pull/15614

cat-bro commented 1 year ago

I've assumed that the first job triggers auto install but does not wait long enough before running the job bash script. Maybe it will wait for "singularity pull" to complete but not for the sif file to be present.

cat-bro commented 1 year ago

@bernt-matthias is auto_install not true by default?

bernt-matthias commented 1 year ago

> @bernt-matthias is auto_install not true by default?

Indeed. I mixed this up.

This line is then responsible for returning the cached container description.

> Maybe it will wait for "singularity pull" to complete but not for the sif file to be present.

The pull happens before this (and should definitely be finished when the job is starting). It's just that for auto_install=True the resolver does not return the cached description. For the second run the cached_mulled_singularity kicks in and resolves to the sif file.

galaxyproject / galaxy #15641