neoformit opened 1 year ago
How are your container resolvers configured?
And, do you have a bit more context from the logs?
Hi @bernt-matthias, I'm pasting our container resolvers configuration. This happens for both mulled and explicit singularity containers.
```yaml
container_resolvers:
  - type: explicit
  - type: cached_explicit_singularity
    cache_directory: "{{ galaxy_tools_indices_dir }}/cache/singularity"
  - type: explicit_singularity
    cache_directory: "{{ galaxy_tools_indices_dir }}/cache/singularity"
  - type: cached_mulled_singularity
    cache_directory: "{{ galaxy_tools_indices_dir }}/cache/singularity"
  - type: mulled_singularity
    cache_directory: "{{ galaxy_tools_indices_dir }}/cache/singularity"
  - type: build_mulled_singularity
    cache_directory: "{{ galaxy_tools_indices_dir }}/cache/singularity"
    auto_install: false
```
At which point is this `FATAL: .../` error occurring? While pulling during job preparation, or during the actual job? Do you have the log messages from the container resolvers (maybe for the 1st and the 2nd run)?
My first thought was that `auto_install: False` might help for `mulled_singularity`. Then the pulled image should be used already in the first iteration (in the second iteration, `cached_mulled_singularity` will find it).
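For illustration, that change would look something like this in the config above (just a sketch of the one entry; the rest stays as is):

```yaml
  - type: mulled_singularity
    cache_directory: "{{ galaxy_tools_indices_dir }}/cache/singularity"
    auto_install: false  # use the already-pulled image on the 1st run
```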
But the problem might be different, since you wrote that it also happens for the explicit resolvers.
A few more comments:

- You probably don't need `explicit` (but it costs nearly nothing to keep it).
- Are you sure that having both `cached_explicit_singularity` and `explicit_singularity` is necessary?

I remember sorting out lots of issues with singularity by making sure that I was using the latest version (I found that the default version installed by HPC admins was quite old).
I also had some initial issues when the first tool invoking the container was part of a dataset collection (so multiple nodes making the pull at the same time for the execution of their element of the collection). This led me to pre-download all containers to a specific file system location (a lot of space used, though).
I guess this is a duplicate of https://github.com/galaxyproject/galaxy/issues/15673 .. see also the linked discussion (the error message looks the same). Is this happening in production?
I should have a fix in https://github.com/galaxyproject/galaxy/pull/15614/commits, i.e. fffb6f80be3c962cd0f3d4568ae207c8e852c0c3, 5637129c5eb9abf07328b08765375342436a7b03, and b37085b50a1e2e5f7c6270000ceaab3d9dbc6977
Yep this is on usegalaxy.org.au. Great :crossed_fingers:
What this does not explain is why it fails only on the first run.
This could possibly be fixed/changed by adding `auto_install: True` to `mulled_singularity`. Note: this does not trigger auto installation as the name suggests; it only means that the cached image will be used on the 1st run (otherwise it's the `docker://` URI). See also my (still developing) notes on this in https://github.com/galaxyproject/galaxy/pull/15614
I've assumed that the first job triggers auto install but does not wait long enough before running the job bash script. Maybe it will wait for `singularity pull` to complete but not for the .sif file to be present.
@bernt-matthias is auto_install not true by default?
> @bernt-matthias is auto_install not true by default?
Indeed, I mixed this up. This line is then responsible for returning the cached container description.
> Maybe it will wait for `singularity pull` to complete but not for the .sif file to be present.
The pull happens before this (and should definitely be finished when the job is starting). It's just that for `auto_install=True` the resolver does not return the cached description. For the second run, `cached_mulled_singularity` kicks in and resolves to the .sif file.
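To put the two runs together, here is an annotated sketch of the relevant resolver order from the config above (the comments are my reading of this thread, not verified Galaxy internals):

```yaml
container_resolvers:
  # 2nd run: finds the pulled .sif in the cache and resolves to it
  - type: cached_mulled_singularity
    cache_directory: "{{ galaxy_tools_indices_dir }}/cache/singularity"
  # 1st run: pulls the image, but with auto_install: true (the default)
  # it resolves to the docker:// URI instead of the cached description;
  # auto_install: false should make the 1st run use the cached .sif
  - type: mulled_singularity
    cache_directory: "{{ galaxy_tools_indices_dir }}/cache/singularity"
    auto_install: false
```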
**Describe the bug**
I'm sure this has been noticed before, but I could not find an issue for it.
When singularity-enabled tools run for the first time, they fail waiting for the image pull from biocontainers, with an obscure error:
**Galaxy Version and/or server at which you observed the bug**
Galaxy Version: usegalaxy.org.au
Commit: 22.05
**To Reproduce**
Steps to reproduce the behavior:
**Expected behavior**
The tool should work the first time following install or update. I couldn't locate the code that throws `unable to create tmp file`, but at this point I would expect the code to wait until the image pull has completed.