DataBiosphere / dsub

Open-source command-line tool to run batch computing tasks and workflows on backend services such as Google Cloud.
Apache License 2.0

Job failing before starting with docker error #240

Closed by nithinjoshy 2 years ago

nithinjoshy commented 2 years ago

I have a recurring issue where jobs that I start fail immediately with an error that looks related to Docker. The first lines of the log file are below.

time="2022-04-20T17:43:09Z" level=error msg="error waiting for container: context canceled"
Error response from daemon: driver failed programming external connectivity on endpoint ssh (4b09db99808becfa8dbe7f000240cf72d29b6535f0b006e93c97391bc82904cf): Error starting userland proxy: listen tcp4 0.0.0.0:22: bind: address already in use

A few statements follow, but they contain information specific to my project, so I would rather not copy them here. They are similar to the statements below and seem unrelated to any error.

2022-04-20 17:43:10 INFO: gsutil -h Content-Type:text/plain  -mq cp /tmp/continuous_logging_action/...
2022-04-20 17:43:10 INFO: mkdir -m 777 -p /mnt/data/input/...
2022-04-20 17:43:10 INFO: mkdir -m 777 -p /mnt/data/output/....

Below is the code I use to start the job.

dsub \
    --provider google-cls-v2 \
    --project ${PROJECT} \
    --logging gs://${DSUB_BUCKET}/logs \
    --input-recursive INPUT_PATH=gs://${OUTPUT_BUCKET}/${name}/ \
    --output-recursive OUTPUT_PATH=gs://${OUTPUT_BUCKET}/${name}/ \
    --image ${AGGREGATE_IMAGE} \
    --script aggregate.sh \
    --disk-size 1000 \
    --name "aggregate" \
    --machine-type n1-standard-16 \
    --ssh \
    --boot-disk-size 30

I am posting here because this issue started occurring without any change to my code. Furthermore, as far as I can tell, the error is unrelated to my own code and is instead an issue with how dsub deploys my Docker container to the VM. Does anyone have any ideas about what may be happening? Please excuse me if it turns out to be a trivial error in my code.

wnojopra commented 2 years ago

Hi @nj3252! The error you're seeing seems related to the SSH port. If you're not actively using the ssh feature, could you remove the --ssh flag from your command and try again?
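
For reference, the log above shows an endpoint named ssh failing to bind 0.0.0.0:22, which is consistent with the --ssh flag being the trigger. One way to keep the flag from lingering in a script is to pass it only on request. The sketch below wraps the original command in a function; run_aggregate and ENABLE_SSH are hypothetical names introduced for this example and are not dsub settings, while every other variable comes from the original post.

```shell
# Sketch only: run_aggregate and ENABLE_SSH are hypothetical names, not part of dsub.
# PROJECT, DSUB_BUCKET, OUTPUT_BUCKET, name and AGGREGATE_IMAGE are assumed to be
# set as in the original command.
run_aggregate() {
    # Pass --ssh only when it is explicitly requested via ENABLE_SSH=1.
    ssh_flag=""
    if [ "${ENABLE_SSH:-0}" = "1" ]; then
        ssh_flag="--ssh"
    fi

    # ${ssh_flag} is intentionally unquoted so an empty value adds no argument.
    dsub \
        --provider google-cls-v2 \
        --project "${PROJECT}" \
        --logging "gs://${DSUB_BUCKET}/logs" \
        --input-recursive "INPUT_PATH=gs://${OUTPUT_BUCKET}/${name}/" \
        --output-recursive "OUTPUT_PATH=gs://${OUTPUT_BUCKET}/${name}/" \
        --image "${AGGREGATE_IMAGE}" \
        --script aggregate.sh \
        --disk-size 1000 \
        --name "aggregate" \
        --machine-type n1-standard-16 \
        --boot-disk-size 30 \
        ${ssh_flag}
}
```

Calling run_aggregate submits the job without --ssh; ENABLE_SSH=1 run_aggregate adds the flag back for a debugging session.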

FWIW, we are seeing a few other issues related to SSH in https://github.com/DataBiosphere/dsub/issues/238 and https://github.com/DataBiosphere/dsub/issues/233

nithinjoshy commented 2 years ago

Thanks, that has fixed it. I was using the ssh flag previously and never removed it. I really appreciate the quick response.