dask / dask-yarn

Deploy dask on YARN clusters
http://yarn.dask.org
BSD 3-Clause "New" or "Revised" License

Unable to connect with ImportError #111

Closed cnachteg closed 4 years ago

cnachteg commented 4 years ago

Hello,

When I try to create a YarnCluster object in a Jupyter notebook, the application always fails with a ConnectionError. Judging from the YARN logs, this seems to be caused by the module dask_yarn.cli not being found.

I should add that I launch the Jupyter notebook from the same environment that I deploy to the YarnCluster.

from dask_yarn import YarnCluster
cluster = YarnCluster(environment='../../environment_dask.tar.gz',
                      worker_vcores=1,
                      worker_memory='1GB',
                      n_workers=1)

Traceback:

---------------------------------------------------------------------------
ConnectionError                           Traceback (most recent call last)
~/.local/lib/python3.7/site-packages/dask_yarn/core.py in submit_and_handle_failures(skein_client, spec)
     80         try:
---> 81             yield skein_client.connect(app_id, security=spec.master.security)
     82         except BaseException:

~/.local/lib/python3.7/site-packages/dask_yarn/core.py in _start_cluster(self)
    563             with submit_and_handle_failures(skein_client, self.spec) as app:
--> 564                 scheduler_address = app.kv.wait("dask.scheduler").decode()
    565                 dashboard_address = app.kv.get("dask.dashboard")

~/.local/lib/python3.7/site-packages/skein/kv.py in wait(self, key, return_owner)
    648 
--> 649             event = event_queue.get()
    650 

~/.local/lib/python3.7/site-packages/skein/kv.py in get(self, block, timeout)
    274             self._exception = out
--> 275             raise out
    276         return out

ConnectionError: Unable to connect to application

During handling of the above exception, another exception occurred:

DaskYarnError                             Traceback (most recent call last)
<ipython-input-3-22fd856af1d8> in <module>
      2                       worker_vcores=1,
      3                       worker_memory='1GB',
----> 4                       n_workers=1)

~/.local/lib/python3.7/site-packages/dask_yarn/core.py in __init__(self, environment, n_workers, worker_vcores, worker_memory, worker_restarts, worker_env, scheduler_vcores, scheduler_memory, deploy_mode, name, queue, tags, user, host, port, dashboard_address, skein_client, asynchronous, loop)
    390             asynchronous=asynchronous,
    391             loop=loop,
--> 392             skein_client=skein_client,
    393         )
    394 

~/.local/lib/python3.7/site-packages/dask_yarn/core.py in _init_common(self, spec, application_client, host, port, dashboard_address, asynchronous, loop, skein_client)
    533 
    534         if not self.asynchronous:
--> 535             self._sync(self._start_internal())
    536 
    537     def _start_cluster(self):

~/.local/lib/python3.7/site-packages/dask_yarn/core.py in _sync(self, task)
    701         future = asyncio.run_coroutine_threadsafe(task, self.loop.asyncio_loop)
    702         try:
--> 703             return future.result()
    704         except BaseException:
    705             future.cancel()

~/.conda/envs/dask_distrib/lib/python3.7/concurrent/futures/_base.py in result(self, timeout)
    430                 raise CancelledError()
    431             elif self._state == FINISHED:
--> 432                 return self.__get_result()
    433             else:
    434                 raise TimeoutError()

~/.conda/envs/dask_distrib/lib/python3.7/concurrent/futures/_base.py in __get_result(self)
    382     def __get_result(self):
    383         if self._exception:
--> 384             raise self._exception
    385         else:
    386             return self._result

~/.local/lib/python3.7/site-packages/dask_yarn/core.py in _start_internal(self)
    592             self._start_task = asyncio.ensure_future(self._start_async())
    593         try:
--> 594             await self._start_task
    595         except BaseException:
    596             # On exception, cleanup

~/.local/lib/python3.7/site-packages/dask_yarn/core.py in _start_async(self)
    607             else:
    608                 self._scheduler = None
--> 609             await self.loop.run_in_executor(None, self._start_cluster)
    610         else:
    611             # Connect to an existing cluster

~/.conda/envs/dask_distrib/lib/python3.7/concurrent/futures/thread.py in run(self)
     55 
     56         try:
---> 57             result = self.fn(*self.args, **self.kwargs)
     58         except BaseException as exc:
     59             self.future.set_exception(exc)

~/.local/lib/python3.7/site-packages/dask_yarn/core.py in _start_cluster(self)
    565                 dashboard_address = app.kv.get("dask.dashboard")
    566                 if dashboard_address is not None:
--> 567                     dashboard_address = dashboard_address.decode()
    568 
    569         # Ensure application gets cleaned up

~/.conda/envs/dask_distrib/lib/python3.7/contextlib.py in __exit__(self, type, value, traceback)
    128                 value = type()
    129             try:
--> 130                 self.gen.throw(type, value, traceback)
    131             except StopIteration as exc:
    132                 # Suppress StopIteration *unless* it's the same exception that

~/.local/lib/python3.7/site-packages/dask_yarn/core.py in submit_and_handle_failures(skein_client, spec)
     91                 "See the application logs for more information:\n\n"
     92                 "$ yarn logs -applicationId {app_id}"
---> 93             ).format(app_id=app_id)
     94         )
     95 

DaskYarnError: Failed to start dask-yarn application

Yarn logs :

WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS.
19/11/18 20:50:24 INFO client.RMProxy: Connecting to ResourceManager at server/192.168.201.10:8032
Container: container_1571297750618_0128_01_000003
LogAggregationType: AGGREGATED
==============================================================================
LogType:dask.worker.log
LogLastModifiedTime:Mon Nov 18 20:32:58 +0100 2019
LogLength:285
LogContents:
Traceback (most recent call last):
  File "/data03/yarn/nm/usercache/user/appcache/application_1571297750618_0128/container_1571297750618_0128_01_000003/environment/bin/dask-yarn", line 7, in <module>
    from dask_yarn.cli import main
ImportError: No module named dask_yarn.cli

End of LogType:dask.worker.log
********************************************************************************

End of LogType:prelaunch.err
******************************************************************************

Container: container_1571297750618_0128_01_000003
LogAggregationType: AGGREGATED
==============================================================================
LogType:prelaunch.out
LogLastModifiedTime:Mon Nov 18 20:32:58 +0100 2019
LogLength:70
LogContents:
Setting up env variables
Setting up job resources
Launching container

End of LogType:prelaunch.out
******************************************************************************

Container: container_1571297750618_0128_01_000002
LogAggregationType: AGGREGATED
==============================================================================
LogType:dask.scheduler.log
LogLastModifiedTime:Mon Nov 18 20:32:58 +0100 2019
LogLength:285
LogContents:
Traceback (most recent call last):
  File "/data04/yarn/nm/usercache/user/appcache/application_1571297750618_0128/container_1571297750618_0128_01_000002/environment/bin/dask-yarn", line 7, in <module>
    from dask_yarn.cli import main
ImportError: No module named dask_yarn.cli

End of LogType:dask.scheduler.log
***********************************************************************************

End of LogType:prelaunch.err
******************************************************************************

Container: container_1571297750618_0128_01_000002
LogAggregationType: AGGREGATED
==============================================================================
LogType:prelaunch.out
LogLastModifiedTime:Mon Nov 18 20:32:58 +0100 2019
LogLength:70
LogContents:
Setting up env variables
Setting up job resources
Launching container

End of LogType:prelaunch.out
******************************************************************************

Container: container_1571297750618_0128_01_000001
LogAggregationType: AGGREGATED
==============================================================================
LogType:application.master.log
LogLastModifiedTime:Mon Nov 18 20:32:59 +0100 2019
LogLength:2973
LogContents:
19/11/18 20:33:59 INFO skein.ApplicationMaster: Starting Skein version 0.8.0
19/11/18 20:34:00 INFO skein.ApplicationMaster: Running as user hpda000013
19/11/18 20:34:00 INFO conf.Configuration: resource-types.xml not found
19/11/18 20:34:00 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
19/11/18 20:34:00 INFO skein.ApplicationMaster: Application specification successfully loaded
19/11/18 20:34:01 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/11/18 20:34:01 INFO client.RMProxy: Connecting to ResourceManager at server/192.168.201.10:8030
19/11/18 20:34:02 INFO skein.ApplicationMaster: gRPC server started at server:40289
19/11/18 20:34:03 INFO skein.ApplicationMaster: WebUI server started at server:44177
19/11/18 20:34:03 INFO skein.ApplicationMaster: Registering application with resource manager
19/11/18 20:34:03 INFO client.RMProxy: Connecting to ResourceManager at server/192.168.201.10:8032
19/11/18 20:34:03 INFO skein.ApplicationMaster: Initializing service 'dask.worker'.
19/11/18 20:34:03 INFO skein.ApplicationMaster: WAITING: dask.worker_0
19/11/18 20:34:03 INFO skein.ApplicationMaster: Initializing service 'dask.scheduler'.
19/11/18 20:34:03 INFO skein.ApplicationMaster: REQUESTED: dask.scheduler_0
19/11/18 20:34:04 INFO skein.ApplicationMaster: Starting container_1571297750618_0128_01_000002...
19/11/18 20:34:04 INFO skein.ApplicationMaster: RUNNING: dask.scheduler_0 on container_1571297750618_0128_01_000002
19/11/18 20:34:04 INFO skein.ApplicationMaster: REQUESTED: dask.worker_0
19/11/18 20:34:06 INFO skein.ApplicationMaster: Starting container_1571297750618_0128_01_000003...
19/11/18 20:34:06 INFO skein.ApplicationMaster: RUNNING: dask.worker_0 on container_1571297750618_0128_01_000003
19/11/18 20:34:16 WARN skein.ApplicationMaster: FAILED: dask.worker_0 - Container failed during execution, see logs for more information.
19/11/18 20:34:16 INFO skein.ApplicationMaster: RESTARTING: adding new container to replace dask.worker_0.
19/11/18 20:34:16 INFO skein.ApplicationMaster: REQUESTED: dask.worker_1
19/11/18 20:34:16 WARN skein.ApplicationMaster: FAILED: dask.scheduler_0 - Container failed during execution, see logs for more information.
19/11/18 20:34:16 INFO skein.ApplicationMaster: Shutting down: Failure in service dask.scheduler, see logs for more information.
19/11/18 20:34:16 INFO skein.ApplicationMaster: Unregistering application with status FAILED
19/11/18 20:34:16 INFO impl.AMRMClientImpl: Waiting for application to be successfully unregistered.
19/11/18 20:34:16 INFO skein.ApplicationMaster: Deleted application directory user/.skein/application_1571297750618_0128
19/11/18 20:34:16 INFO skein.ApplicationMaster: WebUI server shut down
19/11/18 20:34:16 INFO skein.ApplicationMaster: gRPC server shut down

End of LogType:application.master.log
***************************************************************************************

End of LogType:prelaunch.err
******************************************************************************

Container: container_1571297750618_0128_01_000001
LogAggregationType: AGGREGATED
==============================================================================
LogType:prelaunch.out
LogLastModifiedTime:Mon Nov 18 20:32:59 +0100 2019
LogLength:70
LogContents:
Setting up env variables
Setting up job resources
Launching container

End of LogType:prelaunch.out
******************************************************************************

Thank you for your attention!

jcrist commented 4 years ago

Hmmm, it looks like you don't have dask-yarn installed in your environment.tar.gz file even though the dask-yarn cli wrapper appears to be present. How was environment.tar.gz created (conda-pack, venv-pack, etc...)?
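
One way to check this without unpacking the archive is to list its members from Python. The following is a minimal sketch, not part of dask-yarn; the function name `find_members` and the default suffix are illustrative assumptions, and only the standard-library `tarfile` module is used.

```python
import tarfile


def find_members(archive_path, suffix="dask_yarn/cli.py"):
    """Return the names of archive members whose path ends with `suffix`.

    An empty list means the module file is not packed in the archive.
    `suffix` is a hypothetical default chosen for this issue.
    """
    # "r:*" lets tarfile detect the compression (gzip, bz2, ...) itself.
    with tarfile.open(archive_path, "r:*") as tar:
        return [name for name in tar.getnames() if name.endswith(suffix)]
```

Calling `find_members("environment.tar.gz")` on the packed environment should return at least one path under `site-packages` if the module is really there.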

cnachteg commented 4 years ago

conda-pack, from inside the environment with : conda pack -n dask_distrib -o environment_dask.tar.gz

jcrist commented 4 years ago

Hmmm, ok. I suspect something wrong with your tar file. What does this output:

tar tvf environment_dask.tar.gz | grep dask_yarn

cnachteg commented 4 years ago

-rw-rw-r-- hpda000013/hpda000013        5 2019-08-05 17:31 lib/python3.7/site-packages/dask_yarn-0.8.0.dist-info/INSTALLER
-rw-rw-r-- hpda000013/hpda000013     1483 2019-08-05 17:31 lib/python3.7/site-packages/dask_yarn-0.8.0.dist-info/LICENSE.txt
-rw-rw-r-- hpda000013/hpda000013     1326 2019-08-05 17:31 lib/python3.7/site-packages/dask_yarn-0.8.0.dist-info/METADATA
-rw-rw-r-- hpda000013/hpda000013     1926 2019-08-05 17:31 lib/python3.7/site-packages/dask_yarn-0.8.0.dist-info/RECORD
-rw-rw-r-- hpda000013/hpda000013       93 2019-08-05 17:31 lib/python3.7/site-packages/dask_yarn-0.8.0.dist-info/WHEEL
-rw-rw-r-- hpda000013/hpda000013       70 2019-08-05 17:31 lib/python3.7/site-packages/dask_yarn-0.8.0.dist-info/entry_points.txt
-rw-rw-r-- hpda000013/hpda000013       10 2019-08-05 17:31 lib/python3.7/site-packages/dask_yarn-0.8.0.dist-info/top_level.txt
-rw-rw-r-- hpda000013/hpda000013      177 2019-08-05 17:31 lib/python3.7/site-packages/dask_yarn/__init__.py
-rw-rw-r-- hpda000013/hpda000013      294 2019-08-05 17:31 lib/python3.7/site-packages/dask_yarn/__pycache__/__init__.cpython-37.pyc
-rw-rw-r-- hpda000013/hpda000013      456 2019-08-05 17:31 lib/python3.7/site-packages/dask_yarn/__pycache__/_version.cpython-37.pyc
-rw-rw-r-- hpda000013/hpda000013    12676 2019-08-05 17:31 lib/python3.7/site-packages/dask_yarn/__pycache__/cli.cpython-37.pyc
-rw-rw-r-- hpda000013/hpda000013      392 2019-08-05 17:31 lib/python3.7/site-packages/dask_yarn/__pycache__/config.cpython-37.pyc
-rw-rw-r-- hpda000013/hpda000013    19837 2019-08-05 17:31 lib/python3.7/site-packages/dask_yarn/__pycache__/core.cpython-37.pyc
-rw-rw-r-- hpda000013/hpda000013      497 2019-08-05 17:31 lib/python3.7/site-packages/dask_yarn/_version.py
-rw-rw-r-- hpda000013/hpda000013    14825 2019-08-05 17:31 lib/python3.7/site-packages/dask_yarn/cli.py
-rw-rw-r-- hpda000013/hpda000013      223 2019-08-05 17:31 lib/python3.7/site-packages/dask_yarn/config.py
-rw-rw-r-- hpda000013/hpda000013    23603 2019-08-05 17:31 lib/python3.7/site-packages/dask_yarn/core.py
-rw-rw-r-- hpda000013/hpda000013        0 2019-08-05 17:31 lib/python3.7/site-packages/dask_yarn/tests/__init__.py
-rw-rw-r-- hpda000013/hpda000013      142 2019-08-05 17:31 lib/python3.7/site-packages/dask_yarn/tests/__pycache__/__init__.cpython-37.pyc
-rw-rw-r-- hpda000013/hpda000013     2439 2019-08-05 17:31 lib/python3.7/site-packages/dask_yarn/tests/__pycache__/conftest.cpython-37.pyc
-rw-rw-r-- hpda000013/hpda000013     6438 2019-08-05 17:31 lib/python3.7/site-packages/dask_yarn/tests/__pycache__/test_cli.cpython-37.pyc
-rw-rw-r-- hpda000013/hpda000013    10030 2019-08-05 17:31 lib/python3.7/site-packages/dask_yarn/tests/__pycache__/test_core.cpython-37.pyc
-rw-rw-r-- hpda000013/hpda000013     2560 2019-08-05 17:31 lib/python3.7/site-packages/dask_yarn/tests/conftest.py
-rw-rw-r-- hpda000013/hpda000013     8807 2019-08-05 17:31 lib/python3.7/site-packages/dask_yarn/tests/test_cli.py
-rw-rw-r-- hpda000013/hpda000013    14185 2019-08-05 17:31 lib/python3.7/site-packages/dask_yarn/tests/test_core.py
-rw-rw-r-- hpda000013/hpda000013     1230 2019-08-05 17:31 lib/python3.7/site-packages/dask_yarn/yarn.yaml

jcrist commented 4 years ago

Hmmm, that is odd. Ok, let's test the conda-pack'ed archive locally. Can you do the following:

conda deactivate
mkdir temp
tar -xf environment_dask.tar.gz -C temp/
source temp/bin/activate
python -m site
dask-yarn --version

cnachteg commented 4 years ago

sys.path = [
    '/home/user',
    '/home/user/temp/lib/python37.zip',
    '/home/user/temp/lib/python3.7',
    '/home/user/temp/lib/python3.7/lib-dynload',
    '/home/user/.local/lib/python3.7/site-packages',
    '/home/user/jupyter_contrib_nbextensions/src',
    '/home/user/temp/lib/python3.7/site-packages',
]
USER_BASE: '/home/user/.local' (exists)
USER_SITE: '/home/user/.local/lib/python3.7/site-packages' (exists)
ENABLE_USER_SITE: True

And I have dask-yarn 0.8.0 appearing without any problem.
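
The `python -m site` output above lists the user site directory (`~/.local/...`) ahead of the unpacked environment's own `site-packages`, so it can be useful to see which copy of a package the interpreter would actually pick. Here is a small sketch using only the standard library; the helper name `describe_module` is made up for illustration, and whether the user site directory plays any role in this issue is not established by the thread.

```python
import importlib.util
import site
import sys


def describe_module(name):
    """Report where the top-level module `name` would be imported from.

    Pass a top-level name such as "dask_yarn"; for dotted names,
    find_spec imports the parent package first.
    """
    spec = importlib.util.find_spec(name)
    origin = spec.origin if spec is not None else None
    user_site = site.getusersitepackages()
    return {
        "module": name,
        "origin": origin,  # file the module would load from, or None
        "from_user_site": bool(origin) and origin.startswith(user_site),
        "user_site_enabled": bool(site.ENABLE_USER_SITE),
        "prefix": sys.prefix,  # root of the running interpreter's environment
    }
```

Running `describe_module("dask_yarn")` inside the activated `temp` environment shows whether the import resolves inside `temp/lib/python3.7/site-packages` or in the user site directory.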

EDIT: I tested with an environment built with venv-pack and had no problem, so I guess the original error comes from how the environment is packed or built with conda.

jcrist commented 4 years ago

Apologies for the delayed response here. The tar file seems ok locally (although you say things worked fine with venv-pack, so maybe there's something off with it).

The next thing I'd try is the same thing, but remotely on one of the worker nodes. To do this you can write a specification for skein (the library underlying dask-yarn) that does the same thing you did above. Please run the following (untested, but I think there are no typos) and report back with the application logs:

import skein
import time

# An application specification.
# Note that `environment_dask.tar.gz` should be the relative path to your archive file
spec = skein.ApplicationSpec.from_yaml("""
name: test-run
master:
  script: |
    set -xe
    source environment/bin/activate
    ls
    which python
    python -m site
    which dask-yarn
    dask-yarn --version
  files:
    environment: environment_dask.tar.gz
""")

# Submit the application and wait for it to complete
with skein.Client() as client:
    app_id = client.submit(spec)
    print("Application id: %s" % app_id)
    # Wait for application to finish
    while client.application_report(app_id).state not in ("FINISHED", "FAILED", "KILLED"):
        time.sleep(1)

print("Run `yarn logs -applicationId %s` and report back with the results" % app_id)

jcrist commented 4 years ago

One other thing to look at would be the shebang (the line beginning with #!) at the top of the environment/bin/dask-yarn file in your archive. This should be rewritten to something path independent, but perhaps there's a bug in conda-pack.
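
The shebang can be read straight out of the archive without extracting it. This is only a sketch with the standard-library `tarfile` module; the helper name `read_shebang` is hypothetical, and the member path `bin/dask-yarn` is an assumption based on the archive layout shown in the listing earlier in this thread.

```python
import tarfile


def read_shebang(archive_path, member="bin/dask-yarn"):
    """Return the first line of `member` inside the archive, or None if absent."""
    with tarfile.open(archive_path, "r:*") as tar:
        try:
            fileobj = tar.extractfile(member)
        except KeyError:
            # The archive has no member with that name.
            return None
        if fileobj is None:
            # The member exists but is not a regular file or link.
            return None
        with fileobj:
            first = fileobj.readline()
    return first.decode("utf-8", errors="replace").rstrip("\n")
```

For example, `read_shebang("environment_dask.tar.gz")` returns the interpreter line that the worker containers will use when they run the wrapper script.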

rodfloripa commented 4 years ago

I have the same problem. I am now trying to analyze the environment/bin/dask-yarn file.

rodfloripa commented 4 years ago

(env1) [opc@brora1vudap003 rodney]$ yarn logs -applicationId application_1582571628698_2826
20/02/28 16:23:31 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm15

Container: container_e531_1582571628698_2826_01_000001 on lpclbv0303.us6.oraclecloud.com_8041

LogType:application.driver.log
Log Upload Time:Fri Feb 28 16:22:54 -0300 2020
LogLength:2905
Log Contents:

LogType:application.master.log
Log Upload Time:Fri Feb 28 16:22:54 -0300 2020
LogLength:1728
Log Contents:
20/02/28 16:22:37 INFO skein.ApplicationMaster: Starting Skein version 0.8.0
20/02/28 16:22:40 INFO skein.ApplicationMaster: Running as user work_deep
20/02/28 16:22:40 INFO skein.ApplicationMaster: Application specification successfully loaded
20/02/28 16:22:42 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/02/28 16:22:42 WARN shortcircuit.DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
20/02/28 16:22:45 INFO skein.ApplicationMaster: gRPC server started at lpclbv0303.us6.oraclecloud.com:11307
20/02/28 16:22:46 INFO skein.ApplicationMaster: WebUI server started at lpclbv0303.us6.oraclecloud.com:46578
20/02/28 16:22:46 INFO skein.ApplicationMaster: Registering application with resource manager
20/02/28 16:22:46 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm15
20/02/28 16:22:49 INFO client.ConfiguredRMFailoverProxyProvider: Failing over to rm15
20/02/28 16:22:49 INFO skein.ApplicationMaster: Starting application driver
20/02/28 16:22:50 INFO skein.ApplicationMaster: Shutting down: Application driver failed with exit code 127, see logs for more information.
20/02/28 16:22:50 INFO skein.ApplicationMaster: Unregistering application with status FAILED
20/02/28 16:22:50 INFO impl.AMRMClientImpl: Waiting for application to be successfully unregistered.
20/02/28 16:22:51 INFO skein.ApplicationMaster: Deleted application directory hdfs://BigDataNextel-ns/user/work_deep/.skein/application_1582571628698_2826
20/02/28 16:22:51 INFO skein.ApplicationMaster: WebUI server shut down
20/02/28 16:22:51 INFO skein.ApplicationMaster: gRPC server shut down

rodfloripa commented 4 years ago

When trying to create the YarnCluster() I got the following error:

Container: container_e531_1582571628698_5919_01_000002 on lpclbv0315.us6.oraclecloud.com_8041

LogType:dask.scheduler.log
Log Upload Time:Tue Mar 03 09:06:46 -0300 2020
LogLength:49
Log Contents:
/usr/bin/env: python3: No such file or directory

I installed dask-yarn under Python 3.6.5, but the nodes use Python 2.6.6. How can I fix this error?
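
The message `/usr/bin/env: python3: No such file or directory` means that no executable named `python3` was found on the PATH of the process that ran the script. The same lookup can be reproduced from Python with `shutil.which`. The sketch below is only a diagnostic aid and does not fix anything; the helper name `find_interpreter` is hypothetical, and it needs to run on a worker node (not the edge node) to say anything about the containers.

```python
import shutil


def find_interpreter(name="python3", path=None):
    """Return the full path of `name` as found on PATH, or None if missing.

    `path` optionally overrides the search path (same format as $PATH);
    by default the current process's PATH is used.
    """
    return shutil.which(name, path=path)


def missing_interpreters(names=("python3", "python"), path=None):
    """Return the interpreter names from `names` that cannot be found."""
    return [n for n in names if find_interpreter(n, path=path) is None]
```

If `find_interpreter("python3")` returns None on a node, a script whose shebang is `#!/usr/bin/env python3` cannot start there.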

IrfanWahyudin commented 4 years ago

One other thing to look at would the shebang (line beginning with #!) at the top of the environment/bin/dask-yarn file in your archive. This should be rewritten to something path independent, but perhaps there's a bug in conda-pack.

I got exactly the same "Unable to connect with ImportError" message. I tried this suggestion by changing the topmost line to "#!/home/irfan/anaconda3/python"

My environment:

rodfloripa commented 4 years ago

In my case I have Python 3 installed on the edge node and Python 2 on the nodes. I think this error is caused by that incompatibility. Which Python version do you have installed on the nodes?

IrfanWahyudin commented 4 years ago

In my case I have Python 3 installed on the edge node and Python 2 on the nodes. I think this error is caused by that incompatibility. Which Python version do you have installed on the nodes?

Hi, thanks for the reply. I have tried your suggestion, and that seems to be the problem. I have Python 3.7 on the edge node and Python 2.7 on the Hadoop node. I have already configured the Hadoop node to use the same Python version, but now I get this:

20/03/09 08:48:17 INFO impl.YarnClientImpl: Killed application application_1583717965322_0003
Traceback (most recent call last):

  File "/home/irfan/anaconda3/lib/python3.7/site-packages/dask_yarn/core.py", line 81, in submit_and_handle_failures
    yield skein_client.connect(app_id, security=spec.master.security)
  File "/home/irfan/anaconda3/lib/python3.7/site-packages/dask_yarn/core.py", line 564, in _start_cluster
    scheduler_address = app.kv.wait("dask.scheduler").decode()
  File "/home/irfan/anaconda3/lib/python3.7/site-packages/skein/kv.py", line 649, in wait
    event = event_queue.get()
  File "/home/irfan/anaconda3/lib/python3.7/site-packages/skein/kv.py", line 275, in get
    raise out
skein.exceptions.ConnectionError: Unable to connect to application

rodfloripa commented 4 years ago

Please run the code posted here by jcrist (on 26 Nov 2019) and paste the results like I did. Maybe you have different versions of the libraries on the edge node and the other nodes.

rodfloripa commented 4 years ago

You received the same errors on the edge node as I did, on exactly the same lines: 81, 564, 649, 275.

20/03/09 09:22:34 INFO impl.YarnClientImpl: Submitted application application_1582571628698_10945
20/03/09 09:22:51 INFO impl.YarnClientImpl: Killed application application_1582571628698_10945
Traceback (most recent call last):
  File "/home/opc/rodney/env1/lib/python3.6/site-packages/dask_yarn/core.py", line 81, in submit_and_handle_failures
    yield skein_client.connect(app_id, security=spec.master.security)
  File "/home/opc/rodney/env1/lib/python3.6/site-packages/dask_yarn/core.py", line 564, in _start_cluster
    scheduler_address = app.kv.wait("dask.scheduler").decode()
  File "/home/opc/rodney/env1/lib/python3.6/site-packages/skein/kv.py", line 649, in wait
    event = event_queue.get()
  File "/home/opc/rodney/env1/lib/python3.6/site-packages/skein/kv.py", line 275, in get
    raise out
skein.exceptions.ConnectionError: Unable to connect to application

jcrist commented 4 years ago

This has been fixed in #115, released as version 0.8.1. It's available on PyPI now, should be up on conda-forge later today once the build finishes.
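
To confirm that an environment already carries the fixed release before packing it, the installed version can be compared against 0.8.1. The sketch below assumes Python 3.8 or newer (for `importlib.metadata`) and plain numeric version strings; the helper names are illustrative and not part of dask-yarn.

```python
from importlib.metadata import PackageNotFoundError, version


def parse_release(text):
    """Turn "0.8.1" into (0, 8, 1); stops at the first non-numeric part."""
    parts = []
    for piece in text.split("."):
        if not piece.isdigit():
            break
        parts.append(int(piece))
    return tuple(parts)


def has_fix(installed, minimum="0.8.1"):
    """True if `installed` is at least `minimum` (the release named above)."""
    return parse_release(installed) >= parse_release(minimum)


def installed_dask_yarn_version():
    """Return the installed dask-yarn version string, or None if not installed."""
    try:
        return version("dask-yarn")
    except PackageNotFoundError:
        return None
```

Run inside the environment that will be packed, `installed_dask_yarn_version()` gives the string to pass to `has_fix`.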

RahulJangir2003 commented 2 years ago

Hmmm, it looks like you don't have dask-yarn installed in your environment.tar.gz file even though the dask-yarn cli wrapper appears to be present. How was environment.tar.gz created (conda-pack, venv-pack, etc...)?

I have exactly the same error. I have tried everything mentioned above but I am still getting it. I am using Python 3.8 and venv-pack for packaging.

RahulJangir2003 commented 2 years ago

Hello,

When trying to create a YarnCluster object in a jupyter notebook, the application always failed with a ConnectionError. After looking through the yarn logs, this seems to be linked to the absence of the module dask_yarn.cli.

I must specify that I launch the jupyter notebook in the same environment that I deploy in the YarnCluster.

  • Steps to reproduce
from dask_yarn import YarnCluster
cluster = YarnCluster(environment='../../environment_dask.tar.gz',
                      worker_vcores=1,
                      worker_memory='1GB',
                      n_workers=1)
  • Relevant logs/tracebacks

Traceback:

---------------------------------------------------------------------------
ConnectionError                           Traceback (most recent call last)
~/.local/lib/python3.7/site-packages/dask_yarn/core.py in submit_and_handle_failures(skein_client, spec)
     80         try:
---> 81             yield skein_client.connect(app_id, security=spec.master.security)
     82         except BaseException:

~/.local/lib/python3.7/site-packages/dask_yarn/core.py in _start_cluster(self)
    563             with submit_and_handle_failures(skein_client, self.spec) as app:
--> 564                 scheduler_address = app.kv.wait("dask.scheduler").decode()
    565                 dashboard_address = app.kv.get("dask.dashboard")

~/.local/lib/python3.7/site-packages/skein/kv.py in wait(self, key, return_owner)
    648 
--> 649             event = event_queue.get()
    650 

~/.local/lib/python3.7/site-packages/skein/kv.py in get(self, block, timeout)
    274             self._exception = out
--> 275             raise out
    276         return out

ConnectionError: Unable to connect to application

During handling of the above exception, another exception occurred:

DaskYarnError                             Traceback (most recent call last)
<ipython-input-3-22fd856af1d8> in <module>
      2                       worker_vcores=1,
      3                       worker_memory='1GB',
----> 4                       n_workers=1)

~/.local/lib/python3.7/site-packages/dask_yarn/core.py in __init__(self, environment, n_workers, worker_vcores, worker_memory, worker_restarts, worker_env, scheduler_vcores, scheduler_memory, deploy_mode, name, queue, tags, user, host, port, dashboard_address, skein_client, asynchronous, loop)
    390             asynchronous=asynchronous,
    391             loop=loop,
--> 392             skein_client=skein_client,
    393         )
    394 

~/.local/lib/python3.7/site-packages/dask_yarn/core.py in _init_common(self, spec, application_client, host, port, dashboard_address, asynchronous, loop, skein_client)
    533 
    534         if not self.asynchronous:
--> 535             self._sync(self._start_internal())
    536 
    537     def _start_cluster(self):

~/.local/lib/python3.7/site-packages/dask_yarn/core.py in _sync(self, task)
    701         future = asyncio.run_coroutine_threadsafe(task, self.loop.asyncio_loop)
    702         try:
--> 703             return future.result()
    704         except BaseException:
    705             future.cancel()

~/.conda/envs/dask_distrib/lib/python3.7/concurrent/futures/_base.py in result(self, timeout)
    430                 raise CancelledError()
    431             elif self._state == FINISHED:
--> 432                 return self.__get_result()
    433             else:
    434                 raise TimeoutError()

~/.conda/envs/dask_distrib/lib/python3.7/concurrent/futures/_base.py in __get_result(self)
    382     def __get_result(self):
    383         if self._exception:
--> 384             raise self._exception
    385         else:
    386             return self._result

~/.local/lib/python3.7/site-packages/dask_yarn/core.py in _start_internal(self)
    592             self._start_task = asyncio.ensure_future(self._start_async())
    593         try:
--> 594             await self._start_task
    595         except BaseException:
    596             # On exception, cleanup

~/.local/lib/python3.7/site-packages/dask_yarn/core.py in _start_async(self)
    607             else:
    608                 self._scheduler = None
--> 609             await self.loop.run_in_executor(None, self._start_cluster)
    610         else:
    611             # Connect to an existing cluster

~/.conda/envs/dask_distrib/lib/python3.7/concurrent/futures/thread.py in run(self)
     55 
     56         try:
---> 57             result = self.fn(*self.args, **self.kwargs)
     58         except BaseException as exc:
     59             self.future.set_exception(exc)

~/.local/lib/python3.7/site-packages/dask_yarn/core.py in _start_cluster(self)
    565                 dashboard_address = app.kv.get("dask.dashboard")
    566                 if dashboard_address is not None:
--> 567                     dashboard_address = dashboard_address.decode()
    568 
    569         # Ensure application gets cleaned up

~/.conda/envs/dask_distrib/lib/python3.7/contextlib.py in __exit__(self, type, value, traceback)
    128                 value = type()
    129             try:
--> 130                 self.gen.throw(type, value, traceback)
    131             except StopIteration as exc:
    132                 # Suppress StopIteration *unless* it's the same exception that

~/.local/lib/python3.7/site-packages/dask_yarn/core.py in submit_and_handle_failures(skein_client, spec)
     91                 "See the application logs for more information:\n\n"
     92                 "$ yarn logs -applicationId {app_id}"
---> 93             ).format(app_id=app_id)
     94         )
     95 

DaskYarnError: Failed to start dask-yarn application

YARN logs:

WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS.
19/11/18 20:50:24 INFO client.RMProxy: Connecting to ResourceManager at server/192.168.201.10:8032
Container: container_1571297750618_0128_01_000003
LogAggregationType: AGGREGATED
==============================================================================
LogType:dask.worker.log
LogLastModifiedTime:Mon Nov 18 20:32:58 +0100 2019
LogLength:285
LogContents:
Traceback (most recent call last):
  File "/data03/yarn/nm/usercache/user/appcache/application_1571297750618_0128/container_1571297750618_0128_01_000003/environment/bin/dask-yarn", line 7, in <module>
    from dask_yarn.cli import main
ImportError: No module named dask_yarn.cli

End of LogType:dask.worker.log
********************************************************************************

End of LogType:prelaunch.err
******************************************************************************

Container: container_1571297750618_0128_01_000003
LogAggregationType: AGGREGATED
==============================================================================
LogType:prelaunch.out
LogLastModifiedTime:Mon Nov 18 20:32:58 +0100 2019
LogLength:70
LogContents:
Setting up env variables
Setting up job resources
Launching container

End of LogType:prelaunch.out
******************************************************************************

Container: container_1571297750618_0128_01_000002
LogAggregationType: AGGREGATED
==============================================================================
LogType:dask.scheduler.log
LogLastModifiedTime:Mon Nov 18 20:32:58 +0100 2019
LogLength:285
LogContents:
Traceback (most recent call last):
  File "/data04/yarn/nm/usercache/user/appcache/application_1571297750618_0128/container_1571297750618_0128_01_000002/environment/bin/dask-yarn", line 7, in <module>
    from dask_yarn.cli import main
ImportError: No module named dask_yarn.cli

End of LogType:dask.scheduler.log
***********************************************************************************

End of LogType:prelaunch.err
******************************************************************************

Container: container_1571297750618_0128_01_000002
LogAggregationType: AGGREGATED
==============================================================================
LogType:prelaunch.out
LogLastModifiedTime:Mon Nov 18 20:32:58 +0100 2019
LogLength:70
LogContents:
Setting up env variables
Setting up job resources
Launching container

End of LogType:prelaunch.out
******************************************************************************

Container: container_1571297750618_0128_01_000001
LogAggregationType: AGGREGATED
==============================================================================
LogType:application.master.log
LogLastModifiedTime:Mon Nov 18 20:32:59 +0100 2019
LogLength:2973
LogContents:
19/11/18 20:33:59 INFO skein.ApplicationMaster: Starting Skein version 0.8.0
19/11/18 20:34:00 INFO skein.ApplicationMaster: Running as user hpda000013
19/11/18 20:34:00 INFO conf.Configuration: resource-types.xml not found
19/11/18 20:34:00 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
19/11/18 20:34:00 INFO skein.ApplicationMaster: Application specification successfully loaded
19/11/18 20:34:01 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
19/11/18 20:34:01 INFO client.RMProxy: Connecting to ResourceManager at server/192.168.201.10:8030
19/11/18 20:34:02 INFO skein.ApplicationMaster: gRPC server started at server:40289
19/11/18 20:34:03 INFO skein.ApplicationMaster: WebUI server started at server:44177
19/11/18 20:34:03 INFO skein.ApplicationMaster: Registering application with resource manager
19/11/18 20:34:03 INFO client.RMProxy: Connecting to ResourceManager at server/192.168.201.10:8032
19/11/18 20:34:03 INFO skein.ApplicationMaster: Initializing service 'dask.worker'.
19/11/18 20:34:03 INFO skein.ApplicationMaster: WAITING: dask.worker_0
19/11/18 20:34:03 INFO skein.ApplicationMaster: Initializing service 'dask.scheduler'.
19/11/18 20:34:03 INFO skein.ApplicationMaster: REQUESTED: dask.scheduler_0
19/11/18 20:34:04 INFO skein.ApplicationMaster: Starting container_1571297750618_0128_01_000002...
19/11/18 20:34:04 INFO skein.ApplicationMaster: RUNNING: dask.scheduler_0 on container_1571297750618_0128_01_000002
19/11/18 20:34:04 INFO skein.ApplicationMaster: REQUESTED: dask.worker_0
19/11/18 20:34:06 INFO skein.ApplicationMaster: Starting container_1571297750618_0128_01_000003...
19/11/18 20:34:06 INFO skein.ApplicationMaster: RUNNING: dask.worker_0 on container_1571297750618_0128_01_000003
19/11/18 20:34:16 WARN skein.ApplicationMaster: FAILED: dask.worker_0 - Container failed during execution, see logs for more information.
19/11/18 20:34:16 INFO skein.ApplicationMaster: RESTARTING: adding new container to replace dask.worker_0.
19/11/18 20:34:16 INFO skein.ApplicationMaster: REQUESTED: dask.worker_1
19/11/18 20:34:16 WARN skein.ApplicationMaster: FAILED: dask.scheduler_0 - Container failed during execution, see logs for more information.
19/11/18 20:34:16 INFO skein.ApplicationMaster: Shutting down: Failure in service dask.scheduler, see logs for more information.
19/11/18 20:34:16 INFO skein.ApplicationMaster: Unregistering application with status FAILED
19/11/18 20:34:16 INFO impl.AMRMClientImpl: Waiting for application to be successfully unregistered.
19/11/18 20:34:16 INFO skein.ApplicationMaster: Deleted application directory user/.skein/application_1571297750618_0128
19/11/18 20:34:16 INFO skein.ApplicationMaster: WebUI server shut down
19/11/18 20:34:16 INFO skein.ApplicationMaster: gRPC server shut down

End of LogType:application.master.log
***************************************************************************************

End of LogType:prelaunch.err
******************************************************************************

Container: container_1571297750618_0128_01_000001
LogAggregationType: AGGREGATED
==============================================================================
LogType:prelaunch.out
LogLastModifiedTime:Mon Nov 18 20:32:59 +0100 2019
LogLength:70
LogContents:
Setting up env variables
Setting up job resources
Launching container

End of LogType:prelaunch.out
******************************************************************************
  • Version information

    • Python 3.7.3 (anaconda)
    • Dask-Yarn version 0.8.0
    • Hadoop version 3.0.0, CDH 6.2.0

Thank you for your attention!

I have exactly the same error. I have tried everything mentioned above, but I am still getting it. I am using Python 3.8 and venv-pack for packaging. Do you know how I can solve it?
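Two details in the logs above may help narrow this down. The container error reads `ImportError: No module named dask_yarn.cli`, without quotes around the module name, which is how Python 2 words this error; Python 3.7 would report `ModuleNotFoundError: No module named 'dask_yarn.cli'`. The notebook traceback also imports `dask_yarn` from `~/.local/lib/python3.7/site-packages`, not from the conda environment. Both hint, without proving it, that the packed archive may be missing `dask_yarn` or that its `bin/dask-yarn` script is being run by a different interpreter. Below is a minimal sketch for checking the archive before submitting it. The helper name `inspect_packed_env` and the assumed archive layout (`bin/dask-yarn` at the top level, modules under `site-packages/`) are illustrative assumptions, not part of dask-yarn.

```python
import tarfile


def inspect_packed_env(archive_path):
    """Report what a packed environment archive contains.

    Hypothetical helper (not part of dask-yarn). Assumes the layout that
    conda-pack / venv-pack typically produce: bin/ at the archive root.
    """
    cli_modules = []
    shebang = None
    # "r:*" lets tarfile detect the compression (gz, bz2, xz, or none).
    with tarfile.open(archive_path, "r:*") as tar:
        for member in tar.getmembers():
            # Normalize an optional leading "./" in member names.
            name = member.name[2:] if member.name.startswith("./") else member.name
            if name.endswith("site-packages/dask_yarn/cli.py"):
                cli_modules.append(name)
            elif name == "bin/dask-yarn" and member.isfile():
                # The first line of the entry point names its interpreter.
                first_line = tar.extractfile(member).readline()
                shebang = first_line.decode(errors="replace").strip()
    return {"cli_modules": cli_modules, "shebang": shebang}


if __name__ == "__main__":
    import sys

    # Usage: python inspect_env.py path/to/environment.tar.gz
    if len(sys.argv) > 1:
        print(inspect_packed_env(sys.argv[1]))
```

If `cli_modules` comes back empty, the archive does not contain `dask_yarn` at all, and installing it into the environment itself (not into `~/.local`) before repacking would be the first thing to try. If the module is present, the `shebang` value shows which interpreter the container will try to use to run `dask-yarn`.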