ENCODE-DCC / chip-seq-pipeline2

ENCODE ChIP-seq pipeline

pipeline fails trying to connect to auth.docker.io #256

Open GodloveD opened 2 years ago

GodloveD commented 2 years ago

Description of bug

I've installed caper and the pipeline in a conda environment and I'm trying to run it with example config/data from the repo as follows:

$ git clone https://github.com/encode-dcc/chip-seq-pipeline2
$ cd chip-seq-pipeline2
$ git checkout v2.0.1
$ cd ..
$ rm -rf ~/.caper/
$ caper init local
$ caper run ./chip-seq-pipeline2/chip.wdl -i ./chip-seq-pipeline2/dev/example_input_json/caper/ENCSR000DYI_subsampled_chr19_only_caper.json --conda
$ less /lscratch/29930830/cromwell.out

The pipeline does stuff and produces informational messages for about 15 minutes and then produces an error:

2022-01-10 13:57:25,383|caper.nb_subproc_thread|ERROR| Cromwell failed. returncode=-7
2022-01-10 13:57:25,384|caper.cli|ERROR| Check stdout in /lscratch/123456/cromwell.out

Inspecting cromwell.out turns up the following error (which appears to be the relevant problem?):

2022-01-10 13:25:17,159  INFO  - Request threw an exception on attempt #1. Retrying after 883 milliseconds
org.http4s.client.ConnectionFailure: Error connecting to https://auth.docker.io using address auth.docker.io:443 (unresolved: true)

along with a Java stack trace, yadda yadda.

Why is this trying to contact docker.io when I'm using the --conda directive? Is there a workaround? Thanks!

OS/Platform

Caper configuration file

```
backend=local

# Hashing strategy for call-caching (3 choices)
# This parameter is for local (local/slurm/sge/pbs/lsf) backend only.
# This is important for call-caching,
# which means re-using outputs from previous/failed workflows.
# The cache will miss if a different strategy is used.
# "file" was the default for all old versions of Caper (<1.0).
# "path+modtime" is the new default for Caper>=1.0.
#   file: use md5sum hash (slow).
#   path: use path only.
#   path+modtime: use path and modification time.
local-hash-strat=path+modtime

# Metadata DB for call-caching (reusing previous outputs):
# Cromwell supports restarting workflows based on a metadata DB
# DB is in-memory by default
#db=in-memory

# If you use 'caper server' then you can use one unified '--file-db'
# for all submitted workflows. In that case, uncomment the following two lines
# and define file-db as an absolute path to store metadata of all workflows.
#db=file
#file-db=

# If you use 'caper run' and want to use call-caching:
# Make sure to define a different 'caper run ... --db file --file-db DB_PATH'
# for each pipeline run.
# But if you want to restart, define the same '--db file --file-db DB_PATH'
# and Caper will collect/re-use previous outputs without running the same tasks again.
# Previous outputs will simply be hard/soft-linked.
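# For example (a hypothetical sketch; the WDL/JSON/DB paths are placeholders):
#   first run: caper run chip.wdl -i input.json --db file --file-db /data/caper_db/run1
#   restart:   caper run chip.wdl -i input.json --db file --file-db /data/caper_db/run1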

# Local directory for localized files and Cromwell's intermediate files.
# If not defined, Caper will make .caper_tmp/ under local-out-dir or the CWD.
# /tmp is not recommended here since Caper stores all localized data files
# in this directory (e.g. input FASTQs defined as URLs in the input JSON).
local-loc-dir=

cromwell=/home/user/.caper/cromwell_jar/cromwell-65.jar
womtool=/home/user/.caper/womtool_jar/womtool-65.jar
```

Input JSON file
```json
{
    "chip.pipeline_type" : "tf",
    "chip.genome_tsv" : "https://storage.googleapis.com/encode-pipeline-genome-data/genome_tsv/v1/hg38_chr19_chrM_caper.tsv",
    "chip.fastqs_rep1_R1" : ["https://storage.googleapis.com/encode-pipeline-test-samples/encode-chip-seq-pipeline/ENCSR000DYI/fastq_subsampled/rep1.subsampled.25.fastq.gz"
    ],
    "chip.fastqs_rep2_R1" : ["https://storage.googleapis.com/encode-pipeline-test-samples/encode-chip-seq-pipeline/ENCSR000DYI/fastq_subsampled/rep2.subsampled.20.fastq.gz"
    ],
    "chip.ctl_fastqs_rep1_R1" : ["https://storage.googleapis.com/encode-pipeline-test-samples/encode-chip-seq-pipeline/ENCSR000DYI/fastq_subsampled/ctl1.subsampled.25.fastq.gz"
    ],
    "chip.ctl_fastqs_rep2_R1" : ["https://storage.googleapis.com/encode-pipeline-test-samples/encode-chip-seq-pipeline/ENCSR000DYI/fastq_subsampled/ctl2.subsampled.25.fastq.gz"
    ],
    "chip.paired_end" : false,
    "chip.title" : "ENCSR000DYI (subsampled 1/25, chr19_chrM only)",
    "chip.description" : "CEBPB ChIP-seq on human A549 produced by the Snyder lab"
}
```

Troubleshooting result

If you ran caper run without a Caper server, Caper automatically runs a troubleshooter for failed workflows. Find the troubleshooting result at the bottom of Caper's screen log.

If you ran caper submit with a running Caper server, first find your workflow ID (1st column) with caper list, then run caper debug [WORKFLOW_ID].
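For example:

$ caper list                  # the workflow ID is in the 1st column
$ caper debug [WORKFLOW_ID]   # prints the troubleshooting result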

Paste troubleshooting result.

2022-01-10 13:57:25,383|caper.nb_subproc_thread|ERROR| Cromwell failed. returncode=-7
2022-01-10 13:57:25,384|caper.cli|ERROR| Check stdout in /lscratch/123456/cromwell.out.1
leepc12 commented 2 years ago

Does your system have an internet connection? Please try this on your login node:

$ docker pull encodedcc/chip-seq-pipeline:v2.1.3
GodloveD commented 2 years ago

Thank you for your reply.

The docker command is not going to work on the login node since we don't have Docker installed. The Singularity equivalent will work on both the login and compute nodes. The compute nodes are behind a firewall, and some types of traffic (like HTTPS) are allowed through a proxy. For Java to respect the proxy, we need to set the following:

export JAVA_TOOL_OPTIONS="-Djava.net.useSystemProxies=true"

But in this case the env var does not seem to help.
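For reference, the explicit per-protocol form would look like this (the host and port below are placeholders, not our real values):

export JAVA_TOOL_OPTIONS="-Dhttps.proxyHost=proxy.example.org -Dhttps.proxyPort=3128 -Dhttp.proxyHost=proxy.example.org -Dhttp.proxyPort=3128"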

I'm trying to run this with a locally installed Conda environment and I've set the --conda option, so I don't know why the pipeline is trying to contact Docker Hub anyway. Thank you.
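(For reference, the Singularity equivalent of the connectivity check suggested above would be something like:

$ singularity pull docker://encodedcc/chip-seq-pipeline:v2.1.3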

leepc12 commented 2 years ago

Okay, then let's localize all data on the login node and run with the localized file. First, localize the input JSON:

$ cd YOUR_WORK_DIR

# recursively localize the input JSON (pip install autouri if it doesn't exist)
$ autouri loc https://storage.googleapis.com/encode-pipeline-test-samples/encode-chip-seq-pipeline/ENCSR000DYI_subsampled_chr19_only.json . --recursive

$ find . -name "*.json"
./15b4ff439de58d859f4c3ee4482c3bd2/ENCSR000DYI_subsampled_chr19_only.local.json

Use that .local.json input JSON file to run the pipeline.
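For example (paths taken from the commands above; adjust to your working directory):

$ caper run ./chip-seq-pipeline2/chip.wdl -i ./15b4ff439de58d859f4c3ee4482c3bd2/ENCSR000DYI_subsampled_chr19_only.local.json --conda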

The docker error looks very weird. Please let me know if that occurs again.

GodloveD commented 2 years ago

Thanks again for your response. I've localized the data as you suggested and tried running it again. I get a lot of errors on standard out, and the cromwell.out file still seems to be trying to access Docker. I've attached a copy and paste of the stdout and the cromwell.out file.

leepc12 commented 2 years ago

Are you sure that your caper is 2.1.2 (latest)? Please check the caper version both inside and outside of the Conda environment (base or the pipeline's env).

$ caper -v
2.1.2
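For example (the environment name below is an assumption; use whatever name your install created):

$ caper -v                                  # outside Conda
$ conda activate encode-chip-seq-pipeline   # env name assumed
$ caper -v                                  # inside the pipeline's env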

Also, is this scratch directory /lscratch/29930830 dynamically allocated (when a job is submitted)? If so, please don't run pipelines on such dynamic scratch directories. Caper/Cromwell will lose track of important intermediate files.

Found this in your cromwell.out.

2022-01-14 10:01:15,867  INFO  - Request method=GET uri=https://auth.docker.io/token?service=registry.docker.io&scope=repository%3Aencodedcc/chip-seq-pipeline%3Apull headers= threw an exception on attempt #4. Giving up.
org.http4s.client.ConnectionFailure: Error connecting to https://auth.docker.io using address auth.docker.io:443 (unresolved: true)

So Caper is still trying to download the chip-seq-pipeline Docker image from Docker Hub. That's weird; I still don't know why this happens.

GodloveD commented 2 years ago

Thank you for your response, and sorry for the delay. I can confirm I'm running Caper version 2.1.2. I reran the pipeline from a network storage location as you suggested and received the same Docker error. My Caper config backend is local, so I don't think using local vs. network storage should make a difference anyway. It seems to me that the workflow is trying to access Docker and Docker Hub even though it is being instructed to run through Conda. Maybe this error was not caught because testing is carried out on a system that has access to Docker and Docker Hub? Unsure.

lixin4306ren commented 3 months ago

Got the same problem when running atac-seq-pipeline locally. @GodloveD Have you solved this problem?

2024-03-16 12:04:24,823 cromwell-system-akka.dispatchers.engine-dispatcher-132 INFO  - Not triggering log of restart checking token queue status. Effective log interval = None
2024-03-16 12:04:24,862 cromwell-system-akka.dispatchers.engine-dispatcher-117 INFO  - Not triggering log of execution token queue status. Effective log interval = None
2024-03-16 12:04:27,256 cromwell-system-akka.dispatchers.engine-dispatcher-117 INFO  - WorkflowExecutionActor-f20624b6-5892-494f-af65-616cd41fd1ff [UUID(f20624b6)]: Starting atac.read_genome_tsv
2024-03-16 12:04:27,877 cromwell-system-akka.dispatchers.engine-dispatcher-117 INFO  - Assigned new job execution tokens to the following groups: f20624b6: 1
2024-03-16 12:04:48,936  INFO  - Request threw an exception on attempt #1. Retrying after 649 milliseconds
org.http4s.client.ConnectionFailure: Error connecting to https://auth.docker.io using address auth.docker.io:443 (unresolved: true)