emo-bon / MetaGOflow

MGnify oriented implementation for the Marine Genomic Observatories oriented pipeline, developed in the framework of an EOSC-Life funded project
https://metagoflow.readthedocs.io
Apache License 2.0

FATAL: ... error fetching image to cache: failed to get checksum for docker://docker:// #51

Open cymon opened 1 month ago

cymon commented 1 month ago

I have some serious weirdness that I thought I should document.

I have a small HPC: a head node and 6 compute nodes. The compute nodes mount /share/apps and /home from the head node. I have metaGOflow installed in /share/apps, so all nodes see the same code. All nodes run an up-to-date AlmaLinux release 8.10. All nodes have apptainer.x86_64 1.3.2-1.el8 from the EPEL repo installed, plus podman-docker.noarch and its dependencies. No other conflicting container software is installed.

The problem: some nodes (3 out of 6) fail to run metaGOflow, giving the following error:

FATAL:   While making image from oci registry: error fetching image to cache: failed to get checksum for docker://docker://cymon/eggnog-2.1.12:0.2: unable to parse image name docker://docker://cymon/eggnog-"

The call being attempted is:

 subprocess.CalledProcessError: Command '['singularity', 'pull', '--force', '--name', '/home/cymon/src/metaGOflow-cymon.git/sif_images/cymon_eggnog-2.1.12:0.2.sif', 'docker://docker://cymon/eggnog-2.1.12:0.2

It looks like the docker://docker:// string is malformed and should contain only one docker:// prefix.
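To illustrate what a well-formed reference would look like, here is a minimal Python sketch. The helper `collapse_docker_scheme` is hypothetical and written only for this comment; it is not code from cwltool or metaGOflow, and it does not explain where the doubled prefix comes from.

```python
SCHEME = "docker://"


def collapse_docker_scheme(image_ref: str) -> str:
    """Collapse repeated leading 'docker://' prefixes into a single one.

    Hypothetical helper for illustration only; not part of cwltool.
    """
    stripped = image_ref
    count = 0
    # Peel off every leading 'docker://' and count how many there were.
    while stripped.startswith(SCHEME):
        stripped = stripped[len(SCHEME):]
        count += 1
    # References that never had the prefix are returned unchanged.
    return SCHEME + stripped if count else image_ref


if __name__ == "__main__":
    print(collapse_docker_scheme("docker://docker://cymon/eggnog-2.1.12:0.2"))
```

With the reference from the error above, this yields `docker://cymon/eggnog-2.1.12:0.2`, which is the form I would expect `singularity pull` to receive.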

This happens for other images as well, but only on certain nodes:

FATAL:   While making image from oci registry: error fetching image to cache: failed to get checksum for docker://docker://microbiomeinformatics/pipeline-v5.bash-scripts:v1.3

Google does not help at all. No idea what is going on...

I'm assuming this is not a problem with the workflow but with the configuration of some of my nodes... but I thought I'd put it here for posterity anyway...

hariszaf commented 1 month ago

I am not sure I am following; is this when you are running the get_singularity_images.sh script?

Where is this docker://docker:// pattern you mention?

I guess you can try to get the images some other way and then use them from the cache.

cymon commented 1 month ago

> I am not sure I am following; is this when you are running the get_singularity_images.sh script?

No, this is when running metaGOflow itself.

> Where is this docker://docker:// pattern you mention?

This is in the Python traceback and the error message:

FATAL:   While making image from oci registry: error fetching image to cache: failed to get checksum for docker://docker://cymon/eggnog-2.1.12:0.2: unable to parse image name docker://docker://cymon/eggnog-
ERROR Got workflow error: Singularity is not available for this tool, try --no-container to disable Singularity, or install a user space Docker replacement like uDocker with --user-space-
Traceback (most recent call last):
  File "/share/apps/lib/python3.7/site-packages/cwltool/job.py", line 809, in run
    runtimeContext.tmp_outdir_prefix,
  File "/share/apps/lib/python3.7/site-packages/cwltool/singularity.py", line 307, in get_from_requirements
    if not self.get_image(cast(Dict[str, str], r), pull_image, force_pull):
  File "/share/apps/lib/python3.7/site-packages/cwltool/singularity.py", line 249, in get_image
    check_call(cmd, stdout=sys.stderr)  # nosec
  File "/share/apps/lib/python3.7/subprocess.py", line 347, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['singularity', 'pull', '--force', '--name', '/home/cymon/src/metaGOflow-cymon.git/sif_images/cymon_eggnog-2.1.12:0.2.sif', 'docker://docker://cymon/eggnog-2.1.12:0.2

> I guess you can try to get the images in any possible way and then use them from cache.

I have the images in the sif_images folder, and they work when running the exact same code on different nodes; only these 3 nodes give this error, and I can't see why...

hariszaf commented 1 month ago

I have not seen this error before, but based on the error message, is there any chance Singularity is not installed correctly on those nodes?
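One way to check this might be to run the same small script on a working node and on a failing node and compare the output. The sketch below is only a suggestion: the function name `describe_singularity` is made up for this example, and it assumes that `singularity --version` is the right way to query the installed version on your setup.

```python
import os
import shutil
import socket
import subprocess


def describe_singularity() -> dict:
    """Report which 'singularity' executable this node resolves, and its version.

    Hypothetical diagnostic helper; not part of metaGOflow or cwltool.
    """
    path = shutil.which("singularity")
    info = {
        "host": socket.gethostname(),
        "path": path,            # first 'singularity' found on PATH, or None
        "resolved": None,        # real file behind any symlinks
        "version": None,
        "returncode": None,
    }
    if path is None:
        return info
    info["resolved"] = os.path.realpath(path)
    # Written to stay compatible with Python 3.7, as seen in the traceback.
    result = subprocess.run(
        [path, "--version"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        universal_newlines=True,
    )
    info["version"] = result.stdout.strip()
    info["returncode"] = result.returncode
    return info


if __name__ == "__main__":
    for key, value in describe_singularity().items():
        print(f"{key}: {value}")
```

If the `path`, `resolved`, or `version` values differ between the nodes that work and the nodes that fail, that would point to an installation or PATH difference rather than to the workflow.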