dask / dask-yarn

Deploy dask on YARN clusters
http://yarn.dask.org
BSD 3-Clause "New" or "Revised" License

Jupyter Notebook Cell Hangs after submitting job to remote EMR cluster #154

Open bkahloon opened 2 years ago

bkahloon commented 2 years ago

What happened: Connecting to a remote EMR cluster from a Jupyter notebook (using YarnCluster to create the Dask cluster) causes the notebook cell to hang. The YarnCluster client successfully submits the job to YARN on EMR, and the application is listed under the running applications tab, but on the notebook client side the cell just hangs. The application on YARN seemingly continues to run as well and has to be killed manually (nothing in the YARN application logs seems to indicate an error).

What you expected to happen: After the job is submitted, the notebook cell should not hang, and the user should be able to submit further Dask transformation code to the Dask cluster created on EMR (the YARN app).

Minimal Complete Verifiable Example:

The notebook cell hangs after running the following code. No errors are reported (there is just a little asterisk beside the cell).

from dask.distributed import Client
from dask_yarn import YarnCluster

cluster = YarnCluster.from_specification('spec.yaml')

client = Client(cluster)

spec.yaml

name: test-dask
queue: default

services:
  dask.scheduler:
    # Restrict scheduler to 2 GiB and 1 core
    resources:
      memory: 2 GiB
      vcores: 1
    script: |
      dask-yarn services scheduler
  dask.worker:
    # Don't start any workers initially
    instances: 0
    # Workers can restart an infinite number of times
    max_restarts: -1
    depends:
      - dask.scheduler
    # Restrict workers to 4 GiB and 2 cores each
    resources:
      memory: 4 GiB
      vcores: 2
    # Distribute this python environment to every worker node
    files:
      environment: /notebooks_deps_pkg.tar.gz
    # The bash script to start the worker
    # Here we activate the environment, then start the worker
    script: |
      virtualenv env
      source env/bin/activate
      dask-yarn services worker

Anything else we need to know?: After adding a print statement to skein's core.py file (a print(req) just before the return), I see the following in the logs:

22/03/04 21:08:19 INFO conf.Configuration: resource-types.xml not found
22/03/04 21:08:19 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
22/03/04 21:08:19 INFO skein.Driver: Uploading application resources to hdfs://cluster.ip:8020/user/hadoop/.skein/application_1646182918041_0074
22/03/04 21:08:43 INFO skein.Driver: Submitting application...
22/03/04 21:08:43 INFO impl.YarnClientImpl: Submitted application application_1646182918041_0074
id: "application_1646182918041_0074"

<generator object KeyValueStore._input_iter at 0x7f20908370a0>

Then it just hangs in the notebook cell.
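While debugging a hang like this, it can help to keep the blocking call from freezing the notebook indefinitely. The following is only a rough sketch, not part of dask-yarn or skein: run_with_timeout is a hypothetical helper that runs any blocking callable in a daemon thread and gives up after a timeout. It does not fix the underlying problem, and whether YarnCluster.from_specification behaves well when called from a background thread is an assumption that has not been verified here.

```python
import threading


def run_with_timeout(fn, timeout, *args, **kwargs):
    """Run fn(*args, **kwargs) in a daemon thread (hypothetical helper).

    Returns the result if fn finishes within `timeout` seconds,
    re-raises any exception fn raised, and raises TimeoutError otherwise.
    """
    outcome = {}

    def target():
        # Capture either the return value or the exception for the caller.
        try:
            outcome["value"] = fn(*args, **kwargs)
        except BaseException as exc:
            outcome["error"] = exc

    worker = threading.Thread(target=target, daemon=True)
    worker.start()
    worker.join(timeout)

    if worker.is_alive():
        # The call is still blocked; the thread is abandoned, not cancelled.
        raise TimeoutError(f"call did not finish within {timeout} s")
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]


# Possible use in the notebook (assumes dask_yarn is installed and that
# 120 seconds is a reasonable upper bound for cluster startup):
#
#   from dask_yarn import YarnCluster
#   cluster = run_with_timeout(YarnCluster.from_specification, 120, "spec.yaml")
```

If the timeout fires, the cell gets control back with a TimeoutError, but the abandoned thread keeps waiting and the application submitted to YARN keeps running, so it still has to be killed manually as described above.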

Environment: