dask / dask-yarn

Deploy dask on YARN clusters
http://yarn.dask.org
BSD 3-Clause "New" or "Revised" License

Application Failure When Submitting Dask-Yarn Model Inferencing Job Remotely #152

Closed: rileyhun closed this issue 2 years ago

rileyhun commented 2 years ago

What happened: I've been following the documentation here to submit my application with dask-yarn. The job keeps failing when I set --deploy-mode remote, although it seems to work with --deploy-mode local. In addition, the worker count and worker vcores that YARN reports do not match the values I passed to dask-yarn submit. I looked through the YARN application logs, but they were not much help. The logs just say

21/11/28 10:47:18 INFO skein.ApplicationMaster: Shutting down: Exception in submitted dask application, see logs for more details

...but don't point me to where to look for this exception.
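For reference, the aggregated container logs for a finished YARN application can usually be pulled with the YARN CLI command yarn logs -applicationId. The sketch below wraps that command and keeps only lines that look like errors. It assumes the yarn CLI is on PATH and that log aggregation is enabled on the cluster. The helper names and the marker strings are illustrative choices, not part of dask-yarn or skein, and the application ID has to be supplied by the caller.

```python
import subprocess


def build_yarn_logs_command(application_id):
    """Return the YARN CLI argument list that fetches aggregated logs."""
    return ["yarn", "logs", "-applicationId", application_id]


def find_error_lines(log_text, markers=("Traceback", "Error", "Exception")):
    """Return the log lines that contain any of the marker strings.

    The markers are a hypothetical starting point; adjust them to the
    failure being investigated.
    """
    return [
        line
        for line in log_text.splitlines()
        if any(marker in line for marker in markers)
    ]


def fetch_error_lines(application_id):
    """Run `yarn logs` for one application and filter its output.

    Assumes the `yarn` CLI is installed and log aggregation is enabled.
    """
    result = subprocess.run(
        build_yarn_logs_command(application_id),
        capture_output=True,
        text=True,
        check=True,
    )
    return find_error_lines(result.stdout)
```

Calling fetch_error_lines with the ID of the failed application should surface any Python traceback written by the submitted script, if the container logs captured one.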

What you expected to happen:

I expected the application to run to completion, but the final status was FAILED.

Minimal Complete Verifiable Example:

dask-yarn submit \
  --name uq_component_batch_inference \
  --environment s3://ch-ml-data/uq_component_count/dask_environment/uq_component_dask.tar.gz \
  --deploy-mode remote \
  --worker-count 30 \
  --worker-vcores 2 \
  --worker-memory 8GiB \
  myscript.py

Anything else we need to know?: Relevant files are attached here: Archive.zip

Environment: YARN shows only 26 containers and 26 vcores, even though I specified 30 workers with 2 vcores each:

[Screenshot: Screen Shot 2021-11-28 at 3.09.41 AM]
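To make the mismatch concrete, here is a small sketch of the arithmetic. It compares the totals requested through the submit flags with the 26 containers and 26 vcores reported above. It assumes each worker runs in its own container and counts only worker resources, so any extra container (for example the application master) is not included. The function name is illustrative.

```python
def requested_worker_resources(worker_count, worker_vcores, worker_memory_gib):
    """Total resources requested for workers, assuming one container per worker."""
    return {
        "containers": worker_count,
        "vcores": worker_count * worker_vcores,
        "memory_gib": worker_count * worker_memory_gib,
    }


# Values from the dask-yarn submit command in this report.
requested = requested_worker_resources(
    worker_count=30, worker_vcores=2, worker_memory_gib=8
)

# Values reported by YARN in this report.
observed = {"containers": 26, "vcores": 26}

# Positive numbers mean YARN reported less than was requested.
shortfall = {key: requested[key] - observed[key] for key in observed}

print(requested)
print(shortfall)
```

Under these assumptions the request amounts to 30 worker containers and 60 vcores, so the reported figures fall short by at least 4 containers and 34 vcores.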

Application failed

[Screenshot: Screen Shot 2021-11-28 at 3.10.56 AM]