DataBiosphere / dsub

Open-source command-line tool to run batch computing tasks and workflows on backend services such as Google Cloud.
Apache License 2.0
263 stars 44 forks source link

No stderr logging when docker image not found #271

Open nick-youngblut opened 1 year ago

nick-youngblut commented 1 year ago

If I run:

dsub \
  --provider local \
  --project ${PROJECT_ID} \
  --regions ${REGION} \
  --logging dsub-debug.log \
  --image ubuntu:20.04x \
  --output OUT=gs://${BUCKET_NAME}/output/out.txt \
  --command 'echo "Hello World" > "${OUT}"' \
  --wait

I get the following written to dsub-debug.log:

Unable to find image 'ubuntu:20.04x' locally
docker: Error response from daemon: manifest for ubuntu:20.04x not found: manifest unknown: manifest unknown.
See 'docker run --help'.

However, if I run the job on google-batch instead of local, the ${jobid}.stderr file set via the --logging parameter is empty.

The lack of an error messages in the *.stderr logging files created by google-batch makes it very hard to troubleshoot failed jobs.

I'm using dsub 0.4.9

wnojopra commented 1 year ago

Hi @nick-youngblut

Thanks for reporting. There was an update in the Batch API that changed the way log files were written to disk. We're in the process of making the dsub changes and pushing a new release. In the meantime, the same command should work with the google-cls-v2 provider. [EDIT]: I misunderstood the issue. See https://github.com/DataBiosphere/dsub/issues/271#issuecomment-1726425079

nick-youngblut commented 1 year ago

Thanks @wnojopra for the quick feedback! If I use google-cls-v2, then no logging files are generated at gs://${BUCKET_NAME}/logging/:

dsub \
  --provider google-cls-v2 \
  --project ${PROJECT_ID} \
  --regions ${REGION} \
  --logging gs://${BUCKET_NAME}/logging/ \
  --image ubuntu:20.04x \
  --output OUT=gs://${BUCKET_NAME}/output/out.txt \
  --command 'echo "Hello World" > "${OUT}"' \
  --wait

It appears that if the --provider google-cls-v2 job fails (at least for a non-existent docker image), no --logging files are created. However, if the --provider google-cls-v2 job succeeds, the --logging files are created.

wnojopra commented 1 year ago

Thanks @nick-youngblut . I misunderstood the issue. The confusion comes from how the API pulls the docker image. From https://cloud.google.com/life-sciences/docs/reference/rest/v2beta/projects.locations.pipelines/run#Action in the imageUri section it mentions "Note that all images referenced by actions in the pipeline are pulled before the first action runs."

Logging is done in one of the actions of a dsub job. It starts relatively early, but as noted the API will attempt to pull the image before the logging action starts. This is why you have no --logging files.

I do notice that dsub itself outputs a helpful error message in this case:

['Error in echo--willyn--230919-201554-73 - code 9: Execution failed: generic::failed_precondition: pulling image: docker pull: running ["docker" "pull" "ubuntu:20.04x"]: exit status 1 (standard error: "Error response from daemon: manifest for ubuntu:20.04x not found: manifest unknown: manifest unknown\\n")']
JobExecutionError: One or more jobs finished with status FAILURE or CANCELED during wait.

So in this case, the error couldn't make it to the logs, but it is output by dsub itself.

nick-youngblut commented 1 year ago

Thanks @wnojopra for all of the information!

as noted the API will attempt to pull the image before the logging action starts

Would it be possible to continue the dsub job (try-except) to include the logging, regardless of whether the image pull was successful? Inconsistent logging behavior can be harder to troubleshoot (e.g., "in this case, check the logs... but in this other case there are no logs, so you have to check...").

I do notice that dsub itself outputs a helpful error message in this case:

Thanks for pointing that out! Is there a way to get a "pretty" print of the logs directly in dsub? Each line of the logs is so long that reading them (mentally parsing the logs) can be a challenge.