dask / dask-yarn

Deploy dask on YARN clusters
http://yarn.dask.org
BSD 3-Clause "New" or "Revised" License

'ZMQIOLoop' object has no attribute 'asyncio_loop' #114

Closed: DanRunfola closed this issue 4 years ago

DanRunfola commented 4 years ago

Hi everyone,

First and foremost, many thanks for your efforts on this project - I'm very excited about its potential, and have learned a lot just trying to debug this error!

Issue I'm Seeing

I am facing the error below when trying to initialize a YarnCluster with dask-yarn. It happens right off the bat:

from dask_yarn import YarnCluster
from dask.distributed import Client

cluster = YarnCluster(environment='./py3.tar.gz',
                      worker_vcores=2,
                      worker_memory="8GiB")

Results in: AttributeError: 'ZMQIOLoop' object has no attribute 'asyncio_loop'

Environment

From the command line, I can submit jobs to my YARN cluster with skein. I.e., this works:

skein driver start
skein application submit test.yaml

After running that, I can log into my YARN application log and see it happily chugging along (you can also see it via skein application ls on the command line, as expected). While I don't think it will be helpful, I've included the YARN log from this successful job below.
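For reference, test.yaml is nothing elaborate. A rough, illustrative equivalent (not the literal file contents), sketched against skein's Python API instead of the YAML file, would be something like:

import skein

# Illustrative sketch only -- a minimal spec whose driver prints the
# "Hello World!" seen in the application.driver.log below.
spec = skein.ApplicationSpec.from_yaml("""
name: test
master:
  script: |
    echo "Hello World!"
""")

with skein.Client() as client:
    app_id = client.submit(spec)
    print("Submitted %s" % app_id)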

Relevant logs/tracebacks

Python

AttributeError                            Traceback (most recent call last)
in engine
      1 cluster = YarnCluster(environment='./py3.tar.gz',
      2                       worker_vcores=2,
----> 3                       worker_memory="8GiB")

/home/cdsw/.local/lib/python3.6/site-packages/dask_yarn/core.py in __init__(self, environment, n_workers, worker_vcores, worker_memory, worker_restarts, worker_env, scheduler_vcores, scheduler_memory, deploy_mode, name, queue, tags, user, host, port, dashboard_address, skein_client, asynchronous, loop)
    390             asynchronous=asynchronous,
    391             loop=loop,
--> 392             skein_client=skein_client,
    393         )
    394 

/home/cdsw/.local/lib/python3.6/site-packages/dask_yarn/core.py in _init_common(self, spec, application_client, host, port, dashboard_address, asynchronous, loop, skein_client)
    533 
    534         if not self.asynchronous:
--> 535             self._sync(self._start_internal())
    536 
    537     def _start_cluster(self):

/home/cdsw/.local/lib/python3.6/site-packages/dask_yarn/core.py in _sync(self, task)
    699         if self.asynchronous:
    700             return task
--> 701         future = asyncio.run_coroutine_threadsafe(task, self.loop.asyncio_loop)
    702         try:
    703             return future.result()

AttributeError: 'ZMQIOLoop' object has no attribute 'asyncio_loop'

Yarn Test with Skein

Container: container_e43_1579106210787_0194_01_000001 on **.wm.edu_8041
LogAggregationType: AGGREGATED
***
LogType:container-localizer-syslog
LogLastModifiedTime:Fri Mar 13 15:12:26 +0000 2020
LogLength:1058
LogContents:
2020-03-13 11:11:21,726 INFO [main] org.apache.hadoop.security.authentication.util.KerberosName: Non-simple name ***@campus.wm.edu after auth_to_local rule RULE:[1:$1@$0]/L
2020-03-13 11:11:21,794 INFO [main] org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer: Disk Validator: yarn.nodemanager.disk-validator is loaded.
2020-03-13 11:11:21,892 INFO [main] org.apache.hadoop.security.authentication.util.KerberosName: Non-simple name ***@campus.wm.edu after auth_to_local rule RULE:[1:$1@$0]/L
2020-03-13 11:11:21,978 INFO [main] org.apache.hadoop.security.authentication.util.KerberosName: Non-simple name ***@campus.wm.edu after auth_to_local rule RULE:[1:$1@$0]/L
2020-03-13 11:11:22,578 WARN [ContainerLocalizer Downloader] org.apache.hadoop.ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error

End of LogType:container-localizer-syslog
*******************************************************************************************

Container: container_e43_1579106210787_0194_01_000001 on **.wm.edu_8041
LogAggregationType: AGGREGATED
***
LogType:application.driver.log
LogLastModifiedTime:Fri Mar 13 15:12:26 +0000 2020
LogLength:13
LogContents:
Hello World!
End of LogType:application.driver.log
***************************************************************************************
Container: container_e43_1579106210787_0194_01_000001 on **.wm.edu_8041
LogAggregationType: AGGREGATED
***
LogType:application.master.log
LogLastModifiedTime:Fri Mar 13 15:12:26 +0000 2020
LogLength:1950
LogContents:
20/03/13 11:11:23 INFO skein.ApplicationMaster: Starting Skein version 0.8.0
20/03/13 11:11:23 INFO util.KerberosName: Non-simple name ***@campus.wm.edu after auth_to_local rule RULE:[1:$1@$
0]/L
20/03/13 11:11:23 INFO skein.ApplicationMaster: Running as user ***@campus.wm.edu
20/03/13 11:11:23 INFO conf.Configuration: resource-types.xml not found
20/03/13 11:11:23 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
20/03/13 11:11:23 INFO skein.ApplicationMaster: Application specification successfully loaded
20/03/13 11:11:23 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java c
lasses where applicable
20/03/13 11:11:24 INFO skein.ApplicationMaster: gRPC server started at **.wm.edu:35450
20/03/13 11:11:24 INFO skein.ApplicationMaster: WebUI server started at **.wm.edu:36029
20/03/13 11:11:24 INFO skein.ApplicationMaster: Registering application with resource manager
20/03/13 11:11:24 INFO skein.ApplicationMaster: Starting application driver
20/03/13 11:12:24 INFO skein.ApplicationMaster: Shutting down: Application driver completed successfully.
20/03/13 11:12:24 INFO skein.ApplicationMaster: Unregistering application with status SUCCEEDED
20/03/13 11:12:24 INFO impl.AMRMClientImpl: Waiting for application to be successfully unregistered.
20/03/13 11:12:24 WARN ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteExcept
ion(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apac
he.org/sbnn-error
20/03/13 11:12:25 INFO skein.ApplicationMaster: Deleted application directory hdfs://nameservice1/user/***@campus
.wm.edu/.skein/application_1579106210787_0194
20/03/13 11:12:25 INFO skein.ApplicationMaster: WebUI server shut down
20/03/13 11:12:25 INFO skein.ApplicationMaster: gRPC server shut down
End of LogType:application.master.log
***************************************************************************************
End of LogType:prelaunch.err
******************************************************************************
Container: container_e43_1579106210787_0194_01_000001 on **.wm.edu_8041
LogAggregationType: AGGREGATED
***
LogType:prelaunch.out
LogLastModifiedTime:Fri Mar 13 15:12:26 +0000 2020
LogLength:70
LogContents:
Setting up env variables
Setting up job resources
Launching container
End of LogType:prelaunch.out
******************************************************************************
jcrist commented 4 years ago

Hmmm that's odd, I wonder why you're ending up with a ZMQIOLoop. A few questions:

DanRunfola commented 4 years ago

(1) This is from a CDSW "Workbench" (part of CDH now, I think). That said, prompted by your question I tried running the exact same script through a Jupyter notebook (on the same node) and got the traceback below. If I run the script straight from the console (python example.py), I get the third error log below, which is similar to the Jupyter case.

(2) Versions:

- dask-yarn: 0.8.0
- distributed: 2.12.0
- tornado: 6.0.3
- pyzmq: 19.0.0 (note: I just updated this as part of my own debugging; 18 had the same behavior)

Jupyter

=====================================
ConnectionError                           Traceback (most recent call last)
/home/cdsw/.local/lib/python3.6/site-packages/dask_yarn/core.py in submit_and_handle_failures(skein_client, spec)
     80         try:
---> 81             yield skein_client.connect(app_id, security=spec.master.security)
     82         except BaseException:

/home/cdsw/.local/lib/python3.6/site-packages/dask_yarn/core.py in _start_cluster(self)
    563             with submit_and_handle_failures(skein_client, self.spec) as app:
--> 564                 scheduler_address = app.kv.wait("dask.scheduler").decode()
    565                 dashboard_address = app.kv.get("dask.dashboard")

/home/cdsw/.local/lib/python3.6/site-packages/skein/kv.py in wait(self, key, return_owner)
    648 
--> 649             event = event_queue.get()
    650 

/home/cdsw/.local/lib/python3.6/site-packages/skein/kv.py in get(self, block, timeout)
    274             self._exception = out
--> 275             raise out
    276         return out

ConnectionError: Unable to connect to application

During handling of the above exception, another exception occurred:

DaskYarnError                             Traceback (most recent call last)
<ipython-input-1-22765fb9a46d> in <module>()
     23 cluster = YarnCluster(environment='./py3.tar.gz',
     24                       worker_vcores=2,
---> 25                       worker_memory="8GiB")
     26 # Scale out to ten such workers
     27 cluster.scale(10)

/home/cdsw/.local/lib/python3.6/site-packages/dask_yarn/core.py in __init__(self, environment, n_workers, worker_vcores, worker_memory, worker_restarts, worker_env, scheduler_vcores, scheduler_memory, deploy_mode, name, queue, tags, user, host, port, dashboard_address, skein_client, asynchronous, loop)
    390             asynchronous=asynchronous,
    391             loop=loop,
--> 392             skein_client=skein_client,
    393         )
    394 

/home/cdsw/.local/lib/python3.6/site-packages/dask_yarn/core.py in _init_common(self, spec, application_client, host, port, dashboard_address, asynchronous, loop, skein_client)
    533 
    534         if not self.asynchronous:
--> 535             self._sync(self._start_internal())
    536 
    537     def _start_cluster(self):

/home/cdsw/.local/lib/python3.6/site-packages/dask_yarn/core.py in _sync(self, task)
    701         future = asyncio.run_coroutine_threadsafe(task, self.loop.asyncio_loop)
    702         try:
--> 703             return future.result()
    704         except BaseException:
    705             future.cancel()

/usr/lib/python3.6/concurrent/futures/_base.py in result(self, timeout)
    430                 raise CancelledError()
    431             elif self._state == FINISHED:
--> 432                 return self.__get_result()
    433             else:
    434                 raise TimeoutError()

/usr/lib/python3.6/concurrent/futures/_base.py in __get_result(self)
    382     def __get_result(self):
    383         if self._exception:
--> 384             raise self._exception
    385         else:
    386             return self._result

/home/cdsw/.local/lib/python3.6/site-packages/dask_yarn/core.py in _start_internal(self)
    592             self._start_task = asyncio.ensure_future(self._start_async())
    593         try:
--> 594             await self._start_task
    595         except BaseException:
    596             # On exception, cleanup

/home/cdsw/.local/lib/python3.6/site-packages/dask_yarn/core.py in _start_async(self)
    607             else:
    608                 self._scheduler = None
--> 609             await self.loop.run_in_executor(None, self._start_cluster)
    610         else:
    611             # Connect to an existing cluster

/usr/lib/python3.6/concurrent/futures/thread.py in run(self)
     54 
     55         try:
---> 56             result = self.fn(*self.args, **self.kwargs)
     57         except BaseException as exc:
     58             self.future.set_exception(exc)

/home/cdsw/.local/lib/python3.6/site-packages/dask_yarn/core.py in _start_cluster(self)
    565                 dashboard_address = app.kv.get("dask.dashboard")
    566                 if dashboard_address is not None:
--> 567                     dashboard_address = dashboard_address.decode()
    568 
    569         # Ensure application gets cleaned up

/usr/lib/python3.6/contextlib.py in __exit__(self, type, value, traceback)
     97                 value = type()
     98             try:
---> 99                 self.gen.throw(type, value, traceback)
    100             except StopIteration as exc:
    101                 # Suppress StopIteration *unless* it's the same exception that

/home/cdsw/.local/lib/python3.6/site-packages/dask_yarn/core.py in submit_and_handle_failures(skein_client, spec)
     91                 "See the application logs for more information:\n\n"
     92                 "$ yarn logs -applicationId {app_id}"
---> 93             ).format(app_id=app_id)
     94         )
     95 

DaskYarnError: Failed to start dask-yarn application_1579106210787_0195
See the application logs for more information:

$ yarn logs -applicationId application_1579106210787_0195

Yarn

=====================================
WARNING: log4j.properties is not found. HADOOP_CONF_DIR may be incomplete.
WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS.
2020-03-13 15:53:09,282 INFO  [main] util.KerberosName (KerberosName.java:apply(327)) - No
n-simple name **@CAMPUS.WM.EDU after auth_to_local rule RULE:[1:$1@$0]/L
Container: container_e43_1579106210787_0195_01_000001 on **.wm.edu_8041
LogAggregationType: AGGREGATED
**
LogType:application.master.log
LogLastModifiedTime:Fri Mar 13 15:51:41 +0000 2020
LogLength:2505
LogContents:
20/03/13 11:51:32 INFO skein.ApplicationMaster: Starting Skein version 0.8.0
20/03/13 11:51:32 INFO util.KerberosName: Non-simple name **@campus.wm.edu aft
er auth_to_local rule RULE:[1:$1@$0]/L
20/03/13 11:51:32 INFO skein.ApplicationMaster: Running as user **@campus.wm.e
du
20/03/13 11:51:32 INFO conf.Configuration: resource-types.xml not found
20/03/13 11:51:32 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
20/03/13 11:51:32 INFO skein.ApplicationMaster: Application specification successfully loa
ded
20/03/13 11:51:33 WARN util.NativeCodeLoader: Unable to load native-hadoop library for you
r platform... using builtin-java classes where applicable
20/03/13 11:51:33 INFO skein.ApplicationMaster: gRPC server started at **wm
.edu:46694
20/03/13 11:51:33 INFO skein.ApplicationMaster: WebUI server started at **.w
m.edu:44206
20/03/13 11:51:33 INFO skein.ApplicationMaster: Registering application with resource mana
ger
20/03/13 11:51:34 INFO skein.ApplicationMaster: Initializing service 'dask.worker'.
20/03/13 11:51:34 INFO skein.ApplicationMaster: Initializing service 'dask.scheduler'.
20/03/13 11:51:34 INFO skein.ApplicationMaster: REQUESTED: dask.scheduler_0
20/03/13 11:51:35 INFO skein.ApplicationMaster: Starting container_e43_1579106210787_0195_
01_000002...
20/03/13 11:51:35 INFO skein.ApplicationMaster: RUNNING: dask.scheduler_0 on container_e43
_1579106210787_0195_01_000002
20/03/13 11:51:40 WARN skein.ApplicationMaster: FAILED: dask.scheduler_0 - Container faile
d during execution, see logs for more information.
20/03/13 11:51:40 INFO skein.ApplicationMaster: Shutting down: Failure in service dask.sch
eduler, see logs for more information.
20/03/13 11:51:40 INFO skein.ApplicationMaster: Unregistering application with status FAILED
20/03/13 11:51:40 INFO impl.AMRMClientImpl: Waiting for application to be successfully unregistered.
20/03/13 11:51:40 WARN ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
20/03/13 11:51:40 INFO skein.ApplicationMaster: Deleted application directory hdfs://nameservice1/user/**@campus.wm.edu/.skein/application_1579106210787_0195
20/03/13 11:51:40 INFO skein.ApplicationMaster: WebUI server shut down
20/03/13 11:51:40 INFO skein.ApplicationMaster: gRPC server shut down

End of LogType:application.master.log
***************************************************************************************

End of LogType:prelaunch.err
******************************************************************************

Container: container_e43_1579106210787_0195_01_000001 on **.wm.edu_8041
LogAggregationType: AGGREGATED
**
LogType:prelaunch.out
LogLastModifiedTime:Fri Mar 13 15:51:41 +0000 2020
LogLength:70
LogContents:
Setting up env variables
Setting up job resources
Launching container

End of LogType:prelaunch.out
******************************************************************************

Container: container_e43_1579106210787_0195_01_000001 on **.wm.edu_8041
LogAggregationType: AGGREGATED
**
LogType:container-localizer-syslog
LogLastModifiedTime:Fri Mar 13 15:51:41 +0000 2020
LogLength:1058
LogContents:
2020-03-13 11:51:31,117 INFO [main] org.apache.hadoop.security.authentication.util.KerberosName: Non-simple name **@campus.wm.edu after auth_to_local rule RULE:[1:$1@$0]/L
2020-03-13 11:51:31,188 INFO [main] org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer: Disk Validator: yarn.nodemanager.disk-validator is loaded.
2020-03-13 11:51:31,221 INFO [main] org.apache.hadoop.security.authentication.util.KerberosName: Non-simple name **@campus.wm.edu after auth_to_local rule RULE:[1:$1@$0]/L
2020-03-13 11:51:31,288 INFO [main] org.apache.hadoop.security.authentication.util.KerberosName: Non-simple name **@campus.wm.edu after auth_to_local rule RULE:[1:$1@$0]/L
2020-03-13 11:51:31,869 WARN [ContainerLocalizer Downloader] org.apache.hadoop.ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error

End of LogType:container-localizer-syslog
*******************************************************************************************

Container: container_e43_1579106210787_0195_01_000002 on **.wm.edu_8041
LogAggregationType: AGGREGATED
**
LogType:dask.scheduler.log
LogLastModifiedTime:Fri Mar 13 15:51:41 +0000 2020
LogLength:308
LogContents:
Traceback (most recent call last):
  File "/data/11/yarn/nm/usercache/**@campus.wm.edu/appcache/application_1579106210787_0195/container_e43_1579106210787_0195_01_000002/environment/bin/dask-yarn", line 7, in <module>
    from dask_yarn.cli import main
ImportError: No module named dask_yarn.cli

End of LogType:dask.scheduler.log
***********************************************************************************

End of LogType:prelaunch.err
******************************************************************************

Container: container_e43_1579106210787_0195_01_000002 on **.wm.edu_8041
LogAggregationType: AGGREGATED
**
LogType:prelaunch.out
LogLastModifiedTime:Fri Mar 13 15:51:41 +0000 2020
LogLength:70
LogContents:
Setting up env variables
Setting up job resources
Launching container

End of LogType:prelaunch.out
**

Container: container_e43_1579106210787_0195_01_000002 on **.wm.edu_8041
LogAggregationType: AGGREGATED
**
LogType:container-localizer-syslog
LogLastModifiedTime:Fri Mar 13 15:51:41 +0000 2020
LogLength:1058
LogContents:
2020-03-13 11:51:35,731 INFO [main] org.apache.hadoop.security.authentication.util.KerberosName: Non-simple name **@campus.wm.edu after auth_to_local rule RULE:[1:$1@$0]/L
2020-03-13 11:51:35,802 INFO [main] org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer: Disk Validator: yarn.nodemanager.disk-validator is loaded.
2020-03-13 11:51:35,823 INFO [main] org.apache.hadoop.security.authentication.util.KerberosName: Non-simple name **@campus.wm.edu after auth_to_local rule RULE:[1:$1@$0]/L
2020-03-13 11:51:35,889 INFO [main] org.apache.hadoop.security.authentication.util.KerberosName: Non-simple name **@campus.wm.edu after auth_to_local rule RULE:[1:$1@$0]/L
2020-03-13 11:51:36,483 WARN [ContainerLocalizer Downloader] org.apache.hadoop.ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error

End of LogType:container-localizer-syslog
**
*
Python3 Terminal Submission
=====================================
cdsw@vumjlctektot60jh:~$ python3 example.py
WARNING: log4j.properties is not found. HADOOP_CONF_DIR may be incomplete.
WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS.
20/03/13 15:55:53 INFO util.KerberosName: Non-simple name @CAMPUS.WM.EDU aft
er auth_to_local rule RULE:[1:$1@$0]/L
20/03/13 15:55:54 INFO skein.Driver: Driver started, listening on 46488
E0313 15:55:54.415178961     493 uri_parser.cc:46]           bad uri.scheme: ''
E0313 15:55:54.415244895     493 uri_parser.cc:52]                            ^ here
E0313 15:55:54.415257782     493 http_proxy.cc:63]           cannot parse value of 'http_p
roxy' env var
20/03/13 15:55:54 INFO conf.Configuration: resource-types.xml not found
20/03/13 15:55:54 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
20/03/13 15:55:55 INFO util.KerberosName: Non-simple name @CAMPUS.WM.EDU after auth_to_local rule RULE:[1:$1@$0]/L
20/03/13 15:55:55 INFO hdfs.DFSClient: Created token for @campus.wm.edu: HDFS_DELEGATION_TOKEN owner=@CAMPUS.WM.EDU, renewer=yarn, realUser=, issueDate=1584114954985, maxDate=1584719754985, sequenceNumber=1858, masterKeyId=254 on ha-hdfs:nameservice1
20/03/13 15:55:55 INFO security.TokenCache: Got dt for hdfs://nameservice1; Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:nameservice1, Ident: (token for @campus.wm.edu: HDFS_DELEGATION_TOKEN owner=@CAMPUS.WM.EDU, renewer=yarn, realUser=, issueDate=1584114954985, maxDate=1584719754985, sequenceNumber=1858, masterKeyId=254)
20/03/13 15:55:55 INFO skein.Driver: Uploading application resources to hdfs://nameservice1/user/@campus.wm.edu/.skein/application_1579106210787_0196
20/03/13 15:55:55 INFO skein.Driver: Submitting application...
20/03/13 15:55:56 INFO impl.YarnClientImpl: Submitted application application_1579106210787_0196
E0313 15:56:01.258570075     493 uri_parser.cc:46]           bad uri.scheme: ''
E0313 15:56:01.258592119     493 uri_parser.cc:52]                            ^ here
E0313 15:56:01.258598897     493 http_proxy.cc:63]           cannot parse value of 'http_proxy' env var
20/03/13 15:56:06 INFO impl.YarnClientImpl: Killed application application_1579106210787_0196
Traceback (most recent call last):
  File "/home/cdsw/.local/lib/python3.6/site-packages/dask_yarn/core.py", line 81, in submit_and_handle_failures
    yield skein_client.connect(app_id, security=spec.master.security)
  File "/home/cdsw/.local/lib/python3.6/site-packages/dask_yarn/core.py", line 564, in _start_cluster
    scheduler_address = app.kv.wait("dask.scheduler").decode()
  File "/home/cdsw/.local/lib/python3.6/site-packages/skein/kv.py", line 649, in wait
    event = event_queue.get()
  File "/home/cdsw/.local/lib/python3.6/site-packages/skein/kv.py", line 275, in get
    raise out
skein.exceptions.ConnectionError: Unable to connect to application

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "example.py", line 25, in <module>
    worker_memory="8GiB")
  File "/home/cdsw/.local/lib/python3.6/site-packages/dask_yarn/core.py", line 392, in __init__
    skein_client=skein_client,
  File "/home/cdsw/.local/lib/python3.6/site-packages/dask_yarn/core.py", line 535, in _init_common
    self._sync(self._start_internal())
  File "/home/cdsw/.local/lib/python3.6/site-packages/dask_yarn/core.py", line 703, in _sync
    return future.result()
  File "/usr/lib/python3.6/concurrent/futures/_base.py", line 432, in result
    return self.__get_result()
  File "/usr/lib/python3.6/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/home/cdsw/.local/lib/python3.6/site-packages/dask_yarn/core.py", line 594, in _start_internal
    await self._start_task
  File "/home/cdsw/.local/lib/python3.6/site-packages/dask_yarn/core.py", line 609, in _start_async
    await self.loop.run_in_executor(None, self._start_cluster)
  File "/usr/lib/python3.6/concurrent/futures/thread.py", line 56, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/cdsw/.local/lib/python3.6/site-packages/dask_yarn/core.py", line 567, in _start_cluster
    dashboard_address = dashboard_address.decode()
  File "/usr/lib/python3.6/contextlib.py", line 99, in __exit__
    self.gen.throw(type, value, traceback)
  File "/home/cdsw/.local/lib/python3.6/site-packages/dask_yarn/core.py", line 93, in submit_and_handle_failures
    ).format(app_id=app_id)
dask_yarn.core.DaskYarnError: Failed to start dask-yarn application_1579106210787_0196
See the application logs for more information:

$ yarn logs -applicationId application_1579106210787_0196
Exception ignored in: <object repr() failed>
Traceback (most recent call last):
  File "/home/cdsw/.local/lib/python3.6/site-packages/dask_yarn/core.py", line 752, in __del__
  File "/home/cdsw/.local/lib/python3.6/site-packages/dask_yarn/core.py", line 744, in close
  File "/home/cdsw/.local/lib/python3.6/site-packages/dask_yarn/core.py", line 735, in shutdown
  File "/home/cdsw/.local/lib/python3.6/site-packages/distributed/utils.py", line 462, in stop
  File "/home/cdsw/.local/lib/python3.6/site-packages/distributed/utils.py", line 477, in _stop_unlocked
  File "/home/cdsw/.local/lib/python3.6/site-packages/distributed/utils.py", line 486, in _real_stop
  File "/usr/local/lib/python3.6/dist-packages/tornado/platform/asyncio.py", line 265, in close
  File "/usr/local/lib/python3.6/dist-packages/tornado/platform/asyncio.py", line 89, in close
  File "/usr/lib/python3.6/asyncio/unix_events.py", line 63, in close
  File "/usr/lib/python3.6/asyncio/selector_events.py", line 96, in close
RuntimeError: Cannot close a running event loop
jcrist commented 4 years ago

Ah, that makes sense. ZMQIOLoop is deprecated (and has been for a while) - CDSW must be using pretty old code and manually starting up the ZMQIOLoop. We could probably work around this in dask-yarn, but I'm hesitant to support a deprecated system. It may be an easy fix though.
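(If it helps to confirm where that loop is coming from, a quick check along these lines, run from the same workbench session, should show which IOLoop class is in play:)

from tornado.ioloop import IOLoop

# Minimal diagnostic: on a stock tornado 6 install this prints an
# asyncio-backed IOLoop class; in the CDSW workbench it presumably
# reports ZMQIOLoop, matching the attribute error above.
print(type(IOLoop.current()))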

The other errors look like a problem with your packaged environment (it can't import dask_yarn inside the YARN container). Others have run into this situation before, and I'm not sure what the issue is, as I'm unable to reproduce myself. How are you packaging your environment (i.e. are you using conda-pack, venv-pack, local files, etc...)?
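For conda-pack, the workflow I'd expect is roughly the following (the environment name here is just a placeholder):

import conda_pack

# Rough sketch of packing a conda environment into a relocatable archive.
# Equivalent CLI: conda pack -n py3 -o py3.tar.gz
conda_pack.pack(name="py3", output="py3.tar.gz")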


Note: I've reformatted your tracebacks by putting triple backticks around them - this makes them easier to read. In the future, when posting tracebacks and logs, please do the same. For reference on GitHub's markdown support see https://guides.github.com/features/mastering-markdown/

DanRunfola commented 4 years ago

Thanks for the assist on markdown. I'm using conda-pack, but I can debug solo on that path for a little while and open a separate issue if it's something legitimate and not just my poor code!

It sounds like the core issue here is that CDSW itself has something, somewhere, that is quite old, and thus throwing everything for a loop. I can also open a ticket with Cloudera and ask them to look into this issue and potentially update the underlying code, but I am well out of my depth at this stage. Other than referencing this issue discussion, is there anything you might suggest I include re: which libraries are causing the issue?

jcrist commented 4 years ago

It sounds like the core issue here is that CDSW itself has something, somewhere that is quite old, and thus throwing everything for a loop. I can also open a ticket with Cloudera and ask them to look into this issue and potentially update the underlying code, but I am well out of my depth at this stage. Other than reference this ticket, is there anything you might suggest I include re: what libraries are causing issue?

That's just the issue with the asyncio_loop attribute error. While I'd like to not have to support ZMQIOLoop, I also recognize that Cloudera is a massive corporation and updating this may not feel worth it to them. We may be able to work around this. Is CDSW your preferred environment over Jupyter?

I'm using conda-pack, but I can debug solo on that path for a little while and open a separate issue if it's something legitimate and not just my poor code!

Others have reported the same issue when using conda-pack, but I'm unable to reproduce. Could you run the following for me and report back with the yarn logs (note that you may need to change environment.tar.gz to point to your archive)?

import skein
import time

# An application specification.
# *Note that `environment.tar.gz` should be the relative path to your archive file*
spec = skein.ApplicationSpec.from_yaml("""
name: test-run
master:
  script: |
    which python
    source environment/bin/activate
    ls
    which python
    python -m site
    which dask-yarn
    dask-yarn --version
  files:
    environment: environment.tar.gz
""")

# Submit the application and wait for it to complete
with skein.Client() as client:
    app_id = client.submit(spec)
    print("Application id: %s" % app_id)
    # Wait for application to finish
    while client.application_report(app_id).state not in ("FINISHED", "FAILED", "KILLED"):
        time.sleep(1)

print("Run `yarn logs -applicationId %s and report back with the results" % app_id)
DanRunfola commented 4 years ago

On your first question, yes - CDSW is the primary point of access into our cluster environment for nearly all of my students. That said, it is trivial for us to spin Jupyter Notebook interfaces up within CDSW (CDSW gives two options - the Workbench or a Jupyter Notebook - for interaction). So, the workaround could simply be to use Jupyter Notebooks for dask jobs interacting with YARN.

Regarding the script, here are the results:

Python (via the workbench on CDSW)

WARNING: log4j.properties is not found. HADOOP_CONF_DIR may be incomplete.
WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS.
20/03/13 16:41:41 INFO util.KerberosName: Non-simple name **@CAMPUS.WM.EDU after auth_to_local rule RULE:[1:$1@$0]/L
20/03/13 16:41:43 INFO skein.Driver: Driver started, listening on 37198
E0313 16:41:43.216683601      89 uri_parser.cc:46]           bad uri.scheme: ''
E0313 16:41:43.216725482      89 uri_parser.cc:52]                            ^ here
E0313 16:41:43.216731964      89 http_proxy.cc:63]           cannot parse value of 'http_proxy' env var
20/03/13 16:41:43 INFO conf.Configuration: resource-types.xml not found
20/03/13 16:41:43 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
20/03/13 16:41:43 INFO util.KerberosName: Non-simple name **@CAMPUS.WM.EDU after auth_to_local rule RULE:[1:$1@$0]/L
20/03/13 16:41:43 INFO hdfs.DFSClient: Created token for **@campus.wm.edu: HDFS_DELEGATION_TOKEN owner=**@CAMPUS.WM.EDU, renewer=yarn, realUser=, issueDate=1584117703796, maxDate=1584722503796, sequenceNumber=1859, masterKeyId=254 on ha-hdfs:nameservice1
20/03/13 16:41:43 INFO security.TokenCache: Got dt for hdfs://nameservice1; Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:nameservice1, Ident: (token for **@campus.wm.edu: HDFS_DELEGATION_TOKEN owner=**@CAMPUS.WM.EDU, renewer=yarn, realUser=, issueDate=1584117703796, maxDate=1584722503796, sequenceNumber=1859, masterKeyId=254)
20/03/13 16:41:43 INFO skein.Driver: Uploading application resources to hdfs://nameservice1/user/**@campus.wm.edu/.skein/application_1579106210787_0197
20/03/13 16:41:44 INFO skein.Driver: Submitting application...
20/03/13 16:41:44 INFO impl.YarnClientImpl: Submitted application application_1579106210787_0197
Application id: application_1579106210787_0197
yarn logs -applicationId application_1579106210787_0197
WARNING: log4j.properties is not found. HADOOP_CONF_DIR may be incomplete.
WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS.
2020-03-13 16:46:08,138 INFO  [main] util.KerberosName (KerberosName.java:apply(327)) - No
n-simple name **@CAMPUS.WM.EDU after auth_to_local rule RULE:[1:$1@$0]/L
Container: container_e43_1579106210787_0197_01_000001 on **.wm.edu_8041
LogAggregationType: AGGREGATED
**
LogType:container-localizer-syslog
LogLastModifiedTime:Fri Mar 13 16:41:53 +0000 2020
LogLength:1058
LogContents:
2020-03-13 12:41:46,071 INFO [main] org.apache.hadoop.security.authentication.util.Kerbero
sName: Non-simple name **@campus.wm.edu after auth_to_local rule RULE:[1:$1@$0
]/L
2020-03-13 12:41:46,140 INFO [main] org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer: Disk Validator: yarn.nodemanager.disk-validator is loaded.
2020-03-13 12:41:46,172 INFO [main] org.apache.hadoop.security.authentication.util.KerberosName: Non-simple name **@campus.wm.edu after auth_to_local rule RULE:[1:$1@$0]/L
2020-03-13 12:41:46,237 INFO [main] org.apache.hadoop.security.authentication.util.KerberosName: Non-simple name **@campus.wm.edu after auth_to_local rule RULE:[1:$1@$0]/L
2020-03-13 12:41:46,835 WARN [ContainerLocalizer Downloader] org.apache.hadoop.ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error

End of LogType:container-localizer-syslog
*******************************************************************************************

Container: container_e43_1579106210787_0197_01_000001 on **.wm.edu_8041
LogAggregationType: AGGREGATED
**
LogType:application.driver.log
LogLastModifiedTime:Fri Mar 13 16:41:53 +0000 2020
LogLength:865
LogContents:
which: no python in ((null))
container_tokens
environment
launch_container.sh
tmp
which: no python in ((null))
Could not find platform independent libraries <prefix>
Could not find platform dependent libraries <exec_prefix>
Consider setting $PYTHONHOME to <prefix>[:<exec_prefix>]
Fatal Python error: Py_Initialize: Unable to get the locale encoding
ModuleNotFoundError: No module named 'encodings'

Current thread 0x00007f5a8ba07740 (most recent call first):
.skein.sh: line 5: 383477 Aborted                 python -m site
which: no dask-yarn in ((null))
Traceback (most recent call last):
  File "/data/10/yarn/nm/usercache/**@campus.wm.edu/appcache/application_1579106210787_0197/container_e43_1579106210787_0197_01_000001/environment/bin/dask-yarn", line 7, in <module>
    from dask_yarn.cli import main
ImportError: No module named dask_yarn.cli

End of LogType:application.driver.log
***************************************************************************************

Container: container_e43_1579106210787_0197_01_000001 on **.wm.edu_8041
LogAggregationType: AGGREGATED
**
LogType:application.master.log
LogLastModifiedTime:Fri Mar 13 16:41:53 +0000 2020
LogLength:1979
LogContents:
20/03/13 12:41:50 INFO skein.ApplicationMaster: Starting Skein version 0.8.0
20/03/13 12:41:50 INFO util.KerberosName: Non-simple name **@campus.wm.edu after auth_to_local rule RULE:[1:$1@$0]/L
20/03/13 12:41:50 INFO skein.ApplicationMaster: Running as user **@campus.wm.edu
20/03/13 12:41:50 INFO conf.Configuration: resource-types.xml not found
20/03/13 12:41:50 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
20/03/13 12:41:50 INFO skein.ApplicationMaster: Application specification successfully loaded
20/03/13 12:41:51 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/03/13 12:41:51 INFO skein.ApplicationMaster: gRPC server started at **.wm.edu:41092
20/03/13 12:41:51 INFO skein.ApplicationMaster: WebUI server started at **.wm.edu:39826
20/03/13 12:41:51 INFO skein.ApplicationMaster: Registering application with resource manager
20/03/13 12:41:52 INFO skein.ApplicationMaster: Starting application driver
20/03/13 12:41:52 INFO skein.ApplicationMaster: Shutting down: Application driver failed with exit code 1, see logs for more information.
20/03/13 12:41:52 INFO skein.ApplicationMaster: Unregistering application with status FAILED
20/03/13 12:41:52 INFO impl.AMRMClientImpl: Waiting for application to be successfully unregistered.
20/03/13 12:41:52 WARN ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
20/03/13 12:41:52 INFO skein.ApplicationMaster: Deleted application directory hdfs://nameservice1/user/**@campus.wm.edu/.skein/application_1579106210787_0197
20/03/13 12:41:52 INFO skein.ApplicationMaster: WebUI server shut down
20/03/13 12:41:52 INFO skein.ApplicationMaster: gRPC server shut down

End of LogType:application.master.log
***************************************************************************************

End of LogType:prelaunch.err
******************************************************************************

Container: container_e43_1579106210787_0197_01_000001 on**.wm.edu_8041
LogAggregationType: AGGREGATED
**
LogType:prelaunch.out
LogLastModifiedTime:Fri Mar 13 16:41:53 +0000 2020
LogLength:70
LogContents:
Setting up env variables
Setting up job resources
Launching container

End of LogType:prelaunch.out
******************************************************************************
jcrist commented 4 years ago

Hmmm, ok. How about this one:

import skein
import time

# An application specification.
# *Note that `environment.tar.gz` should be the relative path to your archive file*
spec = skein.ApplicationSpec.from_yaml("""
name: test-run
master:
  script: |
    set -x
    which python
    python --version
    python -m site
    source environment/bin/activate
    which python
    python --version
    python -m site
    which dask-yarn
    dask-yarn --version
    ./environment/bin/python --version
    ./environment/bin/python -m dask_yarn.cli --version
  files:
    environment: environment.tar.gz
""")

# Submit the application and wait for it to complete
with skein.Client() as client:
    app_id = client.submit(spec)
    print("Application id: %s" % app_id)
    # Wait for application to finish
    while client.application_report(app_id).state not in ("FINISHED", "FAILED", "KILLED"):
        time.sleep(1)

print("Run `yarn logs -applicationId %s and report back with the results" % app_id)

I only need the logs from YARN, no need to post the output from the Python process.

DanRunfola commented 4 years ago

Here you go:

yarn logs -applicationId application_1579106210787_0199
WARNING: log4j.properties is not found. HADOOP_CONF_DIR may be incomplete.
WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS.
2020-03-13 16:59:04,840 INFO  [main] util.KerberosName (KerberosName.java:apply(327)) - No
n-simple name *@CAMPUS.WM.EDU after auth_to_local rule RULE:[1:$1@$0]/L
Container: container_e43_1579106210787_0199_01_000001 on **.wm.edu_8041
LogAggregationType: AGGREGATED
=====================================================================================
LogType:application.driver.log
LogLastModifiedTime:Fri Mar 13 16:57:31 +0000 2020
LogLength:3945
LogContents:
+ which python
which: no python in ((null))
+ python --version
Python 2.7.5
+ python -m site
sys.path = [
    '/data/02/yarn/nm/usercache/*@campus.wm.edu/appcache/application_1579106210787_0199/container_e43_1579106210787_0199_01_000001',
    '/usr/lib64/python27.zip',
    '/usr/lib64/python2.7',
    '/usr/lib64/python2.7/plat-linux2',
    '/usr/lib64/python2.7/lib-tk',
    '/usr/lib64/python2.7/lib-old',
    '/usr/lib64/python2.7/lib-dynload',
    '/usr/lib64/python2.7/site-packages',
    '/usr/lib/python2.7/site-packages',
]
USER_BASE: '/home/.local' (doesn't exist)
USER_SITE: '/home/.local/lib/python2.7/site-packages' (doesn't exist)
ENABLE_USER_SITE: True
+ source environment/bin/activate
++ _conda_pack_activate
++ local _CONDA_SHELL_FLAVOR
++ '[' -n x ']'
++ _CONDA_SHELL_FLAVOR=bash
++ local script_dir
++ case "$_CONDA_SHELL_FLAVOR" in
+++ dirname environment/bin/activate
++ script_dir=environment/bin
+++ cd environment/bin
+++ pwd
++ local full_path_script_dir=/data/02/yarn/nm/usercache/*@campus.wm.edu/appcache/application_1579106210787_0199/container_e43_1579106210787_0199_01_000001/environment/bin
+++ dirname /data/02/yarn/nm/usercache/*@campus.wm.edu/appcache/application_1579106210787_0199/container_e43_1579106210787_0199_01_000001/environment/bin
++ local full_path_env=/data/02/yarn/nm/usercache/*@campus.wm.edu/appcache/application_1579106210787_0199/container_e43_1579106210787_0199_01_000001/environment
+++ basename /data/02/yarn/nm/usercache/*@campus.wm.edu/appcache/application_1579106210787_0199/container_e43_1579106210787_0199_01_000001/environment
++ local env_name=environment
++ '[' -n '' ']'
++ export CONDA_PREFIX=/data/02/yarn/nm/usercache/*@campus.wm.edu/appcache/application_1579106210787_0199/container_e43_1579106210787_0199_01_000001/environment
++ CONDA_PREFIX=/data/02/yarn/nm/usercache/*@campus.wm.edu/appcache/application_1579106210787_0199/container_e43_1579106210787_0199_01_000001/environment
++ export _CONDA_PACK_OLD_PS1=
++ _CONDA_PACK_OLD_PS1=
++ PATH=/data/02/yarn/nm/usercache/*@campus.wm.edu/appcache/application_1579106210787_0199/container_e43_1579106210787_0199_01_000001/environment/bin:/usr/local/bin:/usr/bin
++ PS1='(environment) '
++ case "$_CONDA_SHELL_FLAVOR" in
++ hash -r
++ local _script_dir=/data/02/yarn/nm/usercache/*@campus.wm.edu/appcache/application_1579106210787_0199/container_e43_1579106210787_0199_01_000001/environment/etc/conda/activate.d
++ '[' -d /data/02/yarn/nm/usercache/*@campus.wm.edu/appcache/application_1579106210787_0199/container_e43_1579106210787_0199_01_000001/environment/etc/conda/activate.d ']'
+ which python
which: no python in ((null))
+ python --version
Could not find platform independent libraries <prefix>
Could not find platform dependent libraries <exec_prefix>
Consider setting $PYTHONHOME to <prefix>[:<exec_prefix>]
Python 3.6.8 :: Anaconda, Inc.
+ python -m site
Could not find platform independent libraries <prefix>
Could not find platform dependent libraries <exec_prefix>
Consider setting $PYTHONHOME to <prefix>[:<exec_prefix>]
Fatal Python error: Py_Initialize: Unable to get the locale encoding
ModuleNotFoundError: No module named 'encodings'

Current thread 0x00007f1e0edf9740 (most recent call first):
.skein.sh: line 8: 210984 Aborted                 python -m site
+ which dask-yarn
which: no dask-yarn in ((null))
+ dask-yarn --version
Traceback (most recent call last):
  File "/data/02/yarn/nm/usercache/*@campus.wm.edu/appcache/application_1579106210787_0199/container_e43_1579106210787_0199_01_000001/environment/bin/dask-yarn", line 7, in <module>
    from dask_yarn.cli import main
ImportError: No module named dask_yarn.cli
+ ./environment/bin/python --version
Python 3.6.8 :: Anaconda, Inc.
+ ./environment/bin/python -m dask_yarn.cli --version
dask-yarn 0.8.0

End of LogType:application.driver.log
***************************************************************************************

Container: container_e43_1579106210787_0199_01_000001 on **.wm.edu_8041
LogAggregationType: AGGREGATED
=====================================================================================
LogType:application.master.log
LogLastModifiedTime:Fri Mar 13 16:57:31 +0000 2020
LogLength:1950
LogContents:
20/03/13 12:57:27 INFO skein.ApplicationMaster: Starting Skein version 0.8.0
20/03/13 12:57:27 INFO util.KerberosName: Non-simple name *@campus.wm.edu after auth_to_local rule RULE:[1:$1@$0]/L
20/03/13 12:57:27 INFO skein.ApplicationMaster: Running as user *@campus.wm.edu
20/03/13 12:57:27 INFO conf.Configuration: resource-types.xml not found
20/03/13 12:57:27 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
20/03/13 12:57:27 INFO skein.ApplicationMaster: Application specification successfully loaded
20/03/13 12:57:28 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/03/13 12:57:28 INFO skein.ApplicationMaster: gRPC server started at **.wm.edu:39812
20/03/13 12:57:28 INFO skein.ApplicationMaster: WebUI server started at **.wm.edu:38617
20/03/13 12:57:28 INFO skein.ApplicationMaster: Registering application with resource manager
20/03/13 12:57:29 INFO skein.ApplicationMaster: Starting application driver
20/03/13 12:57:29 INFO skein.ApplicationMaster: Shutting down: Application driver completed successfully.
20/03/13 12:57:29 INFO skein.ApplicationMaster: Unregistering application with status SUCCEEDED
20/03/13 12:57:29 INFO impl.AMRMClientImpl: Waiting for application to be successfully unregistered.
20/03/13 12:57:29 WARN ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
20/03/13 12:57:29 INFO skein.ApplicationMaster: Deleted application directory hdfs://nameservice1/user/*@campus.wm.edu/.skein/application_1579106210787_0199
20/03/13 12:57:29 INFO skein.ApplicationMaster: WebUI server shut down
20/03/13 12:57:29 INFO skein.ApplicationMaster: gRPC server shut down

End of LogType:application.master.log
***************************************************************************************

End of LogType:prelaunch.err
******************************************************************************

Container: container_e43_1579106210787_0199_01_000001 on**.wm.edu_8041
LogAggregationType: AGGREGATED
=====================================================================================
LogType:prelaunch.out
LogLastModifiedTime:Fri Mar 13 16:57:31 +0000 2020
LogLength:70
LogContents:
Setting up env variables
Setting up job resources
Launching container

End of LogType:prelaunch.out
******************************************************************************

Container: container_e43_1579106210787_0199_01_000001 on **.wm.edu_8041
LogAggregationType: AGGREGATED
=====================================================================================
LogType:container-localizer-syslog
LogLastModifiedTime:Fri Mar 13 16:57:31 +0000 2020
LogLength:1058
LogContents:
2020-03-13 12:57:22,923 INFO [main] org.apache.hadoop.security.authentication.util.KerberosName: Non-simple name *@campus.wm.edu after auth_to_local rule RULE:[1:$1@$0]/L
2020-03-13 12:57:22,996 INFO [main] org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer: Disk Validator: yarn.nodemanager.disk-validator is loaded.
2020-03-13 12:57:23,041 INFO [main] org.apache.hadoop.security.authentication.util.KerberosName: Non-simple name *@campus.wm.edu after auth_to_local rule RULE:[1:$1@$0]/L
2020-03-13 12:57:23,111 INFO [main] org.apache.hadoop.security.authentication.util.KerberosName: Non-simple name *@campus.wm.edu after auth_to_local rule RULE:[1:$1@$0]/L
2020-03-13 12:57:23,697 WARN [ContainerLocalizer Downloader] org.apache.hadoop.ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error

End of LogType:container-localizer-syslog
******************************************************************************************
*
jcrist commented 4 years ago

Ok, so everything is there; our activate script just isn't working properly. One more:

import skein
import time

# An application specification.
# *Note that `environment.tar.gz` should be the relative path to your archive file*
spec = skein.ApplicationSpec.from_yaml("""
name: test-run
master:
  script: |
    set -x
    which python
    python --version
    python -m site
    env
    unset PYTHONHOME
    unset PYTHONPATH
    source environment/bin/activate
    which python
    python --version
    python -m site
    which dask-yarn
    dask-yarn --version
    ./environment/bin/python --version
    ./environment/bin/python -m site
    ./environment/bin/python -m dask_yarn.cli --version
  files:
    environment: environment.tar.gz
""")

# Submit the application and wait for it to complete
with skein.Client() as client:
    app_id = client.submit(spec)
    print("Application id: %s" % app_id)
    # Wait for application to finish
    while client.application_report(app_id).state not in ("FINISHED", "FAILED", "KILLED"):
        time.sleep(1)

print("Run `yarn logs -applicationId %s and report back with the results" % app_id)
DanRunfola commented 4 years ago

Here it is:

 yarn logs -applicationId application_1579106210787_0201
WARNING: log4j.properties is not found. HADOOP_CONF_DIR may be incomplete.
WARNING: YARN_OPTS has been replaced by HADOOP_OPTS. Using value of YARN_OPTS.
2020-03-13 17:35:24,008 INFO  [main] util.KerberosName (KerberosName.java:apply(327)) - Non-simple name @@CAMPUS.WM.EDU after auth_to_local rule RULE:[1:$1@$0]/L
Container: container_e43_1579106210787_0201_01_000001 on w01.@.wm.edu_8041
LogAggregationType: AGGREGATED
=====================================================================================
LogType:container-localizer-syslog
LogLastModifiedTime:Fri Mar 13 17:34:54 +0000 2020
LogLength:1058
LogContents:
2020-03-13 13:34:45,646 INFO [main] org.apache.hadoop.security.authentication.util.KerberosName: Non-simple name @@campus.wm.edu after auth_to_local rule RULE:[1:$1@$0]/L
2020-03-13 13:34:45,717 INFO [main] org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ContainerLocalizer: Disk Validator: yarn.nodemanager.disk-validator is loaded.
2020-03-13 13:34:45,770 INFO [main] org.apache.hadoop.security.authentication.util.KerberosName: Non-simple name @@campus.wm.edu after auth_to_local rule RULE:[1:$1@$0]/L
2020-03-13 13:34:45,840 INFO [main] org.apache.hadoop.security.authentication.util.KerberosName: Non-simple name @@campus.wm.edu after auth_to_local rule RULE:[1:$1@$0]/L
2020-03-13 13:34:46,431 WARN [ContainerLocalizer Downloader] org.apache.hadoop.ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error

End of LogType:container-localizer-syslog
*******************************************************************************************

Container: container_e43_1579106210787_0201_01_000001 on w01.@.wm.edu_8041
LogAggregationType: AGGREGATED
=====================================================================================
LogType:application.driver.log
LogLastModifiedTime:Fri Mar 13 17:34:54 +0000 2020
LogLength:9984
LogContents:
+ which python
which: no python in ((null))
+ python --version
Python 2.7.5
+ python -m site
sys.path = [
    '/data/06/yarn/nm/usercache/@@campus.wm.edu/appcache/application_1579106210787_0201/container_e43_1579106210787_0201_01_000001',
    '/usr/lib64/python27.zip',
    '/usr/lib64/python2.7',
    '/usr/lib64/python2.7/plat-linux2',
    '/usr/lib64/python2.7/lib-tk',
    '/usr/lib64/python2.7/lib-old',
    '/usr/lib64/python2.7/lib-dynload',
    '/usr/lib64/python2.7/site-packages',
    '/usr/lib/python2.7/site-packages',
]
USER_BASE: '/home/.local' (doesn't exist)
USER_SITE: '/home/.local/lib/python2.7/site-packages' (doesn't exist)
ENABLE_USER_SITE: True
+ env
SKEIN_RESOURCE_VCORES=1
NM_HOST=w01.@.wm.edu
NM_AUX_SERVICE_mapreduce_shuffle=AAA0+gAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA=
NM_HTTP_PORT=8044
APPLICATION_WEB_PROXY_BASE=/proxy/application_1579106210787_0201
LOCAL_DIRS=/data/00/yarn/nm/usercache/@@campus.wm.edu/appcache/application_1579106210787_0201,/data/01/yarn/nm/usercache/@@campus.wm.edu/appcache/application_1579106210787_0201,/data/02/yarn/nm/usercache/@@campus.wm.edu/appcache/application_1579106210787_0201,/data/03/yarn/nm/usercache/@@campus.wm.edu/appcache/application_1579106210787_0201,/data/04/yarn/nm/usercache/@@campus.wm.edu/appcache/application_1579106210787_0201,/data/05/yarn/nm/usercache/@@campus.wm.edu/appcache/application_1579106210787_0201,/data/06/yarn/nm/usercache/@@campus.wm.edu/appcache/application_1579106210787_0201,/data/07/yarn/nm/usercache/@@campus.wm.edu/appcache/application_1579106210787_0201,/data/08/yarn/nm/usercache/@@campus.wm.edu/appcache/application_1579106210787_0201,/data/09/yarn/nm/usercache/@@campus.wm.edu/appcache/application_1579106210787_0201,/data/10/yarn/nm/usercache/@@campus.wm.edu/appcache/application_1579106210787_0201,/data/11/yarn/nm/usercache/@@campus.wm.edu/appcache/application_1579106210787_0201
USER=@@campus.wm.edu
PRELAUNCH_OUT=/data/11/yarn/container-logs/application_1579106210787_0201/container_e43_1579106210787_0201_01_000001/prelaunch.out
LOCAL_USER_DIRS=/data/00/yarn/nm/usercache/@@campus.wm.edu/,/data/01/yarn/nm/usercache/@@campus.wm.edu/,/data/02/yarn/nm/usercache/@@campus.wm.edu/,/data/03/yarn/nm/usercache/@@campus.wm.edu/,/data/04/yarn/nm/usercache/@@campus.wm.edu/,/data/05/yarn/nm/usercache/@@campus.wm.edu/,/data/06/yarn/nm/usercache/@@campus.wm.edu/,/data/07/yarn/nm/usercache/@@campus.wm.edu/,/data/08/yarn/nm/usercache/@@campus.wm.edu/,/data/09/yarn/nm/usercache/@@campus.wm.edu/,/data/10/yarn/nm/usercache/@@campus.wm.edu/,/data/11/yarn/nm/usercache/@@campus.wm.edu/
HADOOP_TOKEN_FILE_LOCATION=/data/06/yarn/nm/usercache/@@campus.wm.edu/appcache/application_1579106210787_0201/container_e43_1579106210787_0201_01_000001/container_tokens
SKEIN_RESOURCE_MEMORY=1024
LOG_DIRS=/data/00/yarn/container-logs/application_1579106210787_0201/container_e43_1579106210787_0201_01_000001,/data/01/yarn/container-logs/application_1579106210787_0201/container_e43_1579106210787_0201_01_000001,/data/02/yarn/container-logs/application_1579106210787_0201/container_e43_1579106210787_0201_01_000001,/data/03/yarn/container-logs/application_1579106210787_0201/container_e43_1579106210787_0201_01_000001,/data/04/yarn/container-logs/application_1579106210787_0201/container_e43_1579106210787_0201_01_000001,/data/05/yarn/container-logs/application_1579106210787_0201/container_e43_1579106210787_0201_01_000001,/data/06/yarn/container-logs/application_1579106210787_0201/container_e43_1579106210787_0201_01_000001,/data/07/yarn/container-logs/application_1579106210787_0201/container_e43_1579106210787_0201_01_000001,/data/08/yarn/container-logs/application_1579106210787_0201/container_e43_1579106210787_0201_01_000001,/data/09/yarn/container-logs/application_1579106210787_0201/container_e43_1579106210787_0201_01_000001,/data/10/yarn/container-logs/application_1579106210787_0201/container_e43_1579106210787_0201_01_000001,/data/11/yarn/container-logs/application_1579106210787_0201/container_e43_1579106210787_0201_01_000001,/opt/yarn/container-logs/application_1579106210787_0201/container_e43_1579106210787_0201_01_000001
MALLOC_ARENA_MAX=4
HADOOP_HDFS_HOME=/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/hadoop-hdfs
SKEIN_APPMASTER_ADDRESS=w01.@.wm.edu:43156
HADOOP_COMMON_HOME=/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/hadoop
PWD=/data/06/yarn/nm/usercache/@@campus.wm.edu/appcache/application_1579106210787_0201/container_e43_1579106210787_0201_01_000001
HADOOP_YARN_HOME=/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/hadoop-yarn
JAVA_HOME=/usr/java/jdk1.8.0_181-amd64
LANG=C.UTF-8
HADOOP_CONF_DIR=/var/run/cloudera-scm-agent/process/4062-yarn-NODEMANAGER
SKEIN_APPLICATION_ID=application_1579106210787_0201
HADOOP_CLIENT_CONF_DIR=/etc/hadoop/conf.cloudera.yarn
SHLVL=2
HOME=/home/
JVM_PID=252425
HADOOP_MAPRED_HOME=/opt/cloudera/parcels/CDH-6.2.0-1.cdh6.2.0.p0.967373/lib/hadoop-mapreduce
NM_PORT=8041
LOGNAME=@@campus.wm.edu
NM_AUX_SERVICE_spark_shuffle=
APP_SUBMIT_TIME_ENV=1584120880963
CONTAINER_ID=container_e43_1579106210787_0201_01_000001
PRELAUNCH_ERR=/data/11/yarn/container-logs/application_1579106210787_0201/container_e43_1579106210787_0201_01_000001/prelaunch.err
_=/usr/bin/env
+ unset PYTHONHOME
+ unset PYTHONPATH
+ source environment/bin/activate
++ _conda_pack_activate
++ local _CONDA_SHELL_FLAVOR
++ '[' -n x ']'
++ _CONDA_SHELL_FLAVOR=bash
++ local script_dir
++ case "$_CONDA_SHELL_FLAVOR" in
+++ dirname environment/bin/activate
++ script_dir=environment/bin
+++ cd environment/bin
+++ pwd
++ local full_path_script_dir=/data/06/yarn/nm/usercache/@@campus.wm.edu/appcache/application_1579106210787_0201/container_e43_1579106210787_0201_01_000001/environment/bin
+++ dirname /data/06/yarn/nm/usercache/@@campus.wm.edu/appcache/application_1579106210787_0201/container_e43_1579106210787_0201_01_000001/environment/bin
++ local full_path_env=/data/06/yarn/nm/usercache/@@campus.wm.edu/appcache/application_1579106210787_0201/container_e43_1579106210787_0201_01_000001/environment
+++ basename /data/06/yarn/nm/usercache/@@campus.wm.edu/appcache/application_1579106210787_0201/container_e43_1579106210787_0201_01_000001/environment
++ local env_name=environment
++ '[' -n '' ']'
++ export CONDA_PREFIX=/data/06/yarn/nm/usercache/@@campus.wm.edu/appcache/application_1579106210787_0201/container_e43_1579106210787_0201_01_000001/environment
++ CONDA_PREFIX=/data/06/yarn/nm/usercache/@@campus.wm.edu/appcache/application_1579106210787_0201/container_e43_1579106210787_0201_01_000001/environment
++ export _CONDA_PACK_OLD_PS1=
++ _CONDA_PACK_OLD_PS1=
++ PATH=/data/06/yarn/nm/usercache/@@campus.wm.edu/appcache/application_1579106210787_0201/container_e43_1579106210787_0201_01_000001/environment/bin:/usr/local/bin:/usr/bin
++ PS1='(environment) '
++ case "$_CONDA_SHELL_FLAVOR" in
++ hash -r
++ local _script_dir=/data/06/yarn/nm/usercache/@@campus.wm.edu/appcache/application_1579106210787_0201/container_e43_1579106210787_0201_01_000001/environment/etc/conda/activate.d
++ '[' -d /data/06/yarn/nm/usercache/@@campus.wm.edu/appcache/application_1579106210787_0201/container_e43_1579106210787_0201_01_000001/environment/etc/conda/activate.d ']'
+ which python
which: no python in ((null))
+ python --version
Could not find platform independent libraries <prefix>
Could not find platform dependent libraries <exec_prefix>
Consider setting $PYTHONHOME to <prefix>[:<exec_prefix>]
Python 3.6.8 :: Anaconda, Inc.
+ python -m site
Could not find platform independent libraries <prefix>
Could not find platform dependent libraries <exec_prefix>
Consider setting $PYTHONHOME to <prefix>[:<exec_prefix>]
Fatal Python error: Py_Initialize: Unable to get the locale encoding
ModuleNotFoundError: No module named 'encodings'

Current thread 0x00007fa1b9828740 (most recent call first):
.skein.sh: line 11: 252574 Aborted                 python -m site
+ which dask-yarn
which: no dask-yarn in ((null))
+ dask-yarn --version
Traceback (most recent call last):
  File "/data/06/yarn/nm/usercache/@@campus.wm.edu/appcache/application_1579106210787_0201/container_e43_1579106210787_0201_01_000001/environment/bin/dask-yarn", line 7, in <module>
    from dask_yarn.cli import main
ImportError: No module named dask_yarn.cli
+ ./environment/bin/python --version
Python 3.6.8 :: Anaconda, Inc.
+ ./environment/bin/python -m site
sys.path = [
    '/data/06/yarn/nm/usercache/@@campus.wm.edu/appcache/application_1579106210787_0201/container_e43_1579106210787_0201_01_000001',
    '/data/06/yarn/nm/usercache/@@campus.wm.edu/appcache/application_1579106210787_0201/container_e43_1579106210787_0201_01_000001/environment/lib/python36.zip',
    '/data/06/yarn/nm/usercache/@@campus.wm.edu/appcache/application_1579106210787_0201/container_e43_1579106210787_0201_01_000001/environment/lib/python3.6',
    '/data/06/yarn/nm/usercache/@@campus.wm.edu/appcache/application_1579106210787_0201/container_e43_1579106210787_0201_01_000001/environment/lib/python3.6/lib-dynload',
    '/data/06/yarn/nm/usercache/@@campus.wm.edu/appcache/application_1579106210787_0201/container_e43_1579106210787_0201_01_000001/environment/lib/python3.6/site-packages',
]
USER_BASE: '/home/.local' (doesn't exist)
USER_SITE: '/home/.local/lib/python3.6/site-packages' (doesn't exist)
ENABLE_USER_SITE: True
+ ./environment/bin/python -m dask_yarn.cli --version
dask-yarn 0.8.0

End of LogType:application.driver.log
***************************************************************************************

Container: container_e43_1579106210787_0201_01_000001 on w01.@.wm.edu_8041
LogAggregationType: AGGREGATED
=====================================================================================
LogType:application.master.log
LogLastModifiedTime:Fri Mar 13 17:34:54 +0000 2020
LogLength:1950
LogContents:
20/03/13 13:34:50 INFO skein.ApplicationMaster: Starting Skein version 0.8.0
20/03/13 13:34:50 INFO util.KerberosName: Non-simple name @@campus.wm.edu after auth_to_local rule RULE:[1:$1@$0]/L
20/03/13 13:34:50 INFO skein.ApplicationMaster: Running as user @@campus.wm.edu
20/03/13 13:34:50 INFO conf.Configuration: resource-types.xml not found
20/03/13 13:34:50 INFO resource.ResourceUtils: Unable to find 'resource-types.xml'.
20/03/13 13:34:50 INFO skein.ApplicationMaster: Application specification successfully loaded
20/03/13 13:34:51 WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
20/03/13 13:34:51 INFO skein.ApplicationMaster: gRPC server started at w01.@.wm.edu:43156
20/03/13 13:34:51 INFO skein.ApplicationMaster: WebUI server started at w01.@.wm.edu:44537
20/03/13 13:34:51 INFO skein.ApplicationMaster: Registering application with resource manager
20/03/13 13:34:52 INFO skein.ApplicationMaster: Starting application driver
20/03/13 13:34:52 INFO skein.ApplicationMaster: Shutting down: Application driver completed successfully.
20/03/13 13:34:52 INFO skein.ApplicationMaster: Unregistering application with status SUCCEEDED
20/03/13 13:34:52 INFO impl.AMRMClientImpl: Waiting for application to be successfully unregistered.
20/03/13 13:34:52 WARN ipc.Client: Exception encountered while connecting to the server : org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.StandbyException): Operation category READ is not supported in state standby. Visit https://s.apache.org/sbnn-error
20/03/13 13:34:52 INFO skein.ApplicationMaster: Deleted application directory hdfs://nameservice1/user/@@campus.wm.edu/.skein/application_1579106210787_0201
20/03/13 13:34:52 INFO skein.ApplicationMaster: WebUI server shut down
20/03/13 13:34:52 INFO skein.ApplicationMaster: gRPC server shut down
End of LogType:application.master.log
***************************************************************************************
End of LogType:prelaunch.err
******************************************************************************
Container: container_e43_1579106210787_0201_01_000001 on w01.@.wm.edu_8041
LogAggregationType: AGGREGATED
=====================================================================================
LogType:prelaunch.out
LogLastModifiedTime:Fri Mar 13 17:34:54 +0000 2020
LogLength:70
LogContents:
Setting up env variables
Setting up job resources
Launching container
End of LogType:prelaunch.out
******************************************************************************
jcrist commented 4 years ago

Thanks, this has been really helpful. I think I have a patch that should fix this issue. For now, the following should work for you (untested, there may be a typo):

import skein
import dask_yarn

spec = skein.ApplicationSpec.from_yaml("""
name: dask
queue: myqueue

services:
  dask.scheduler:
    # Restrict scheduler to 2 GiB and 1 core
    resources:
      memory: 2 GiB
      vcores: 1
    files:
      environment: environment.tar.gz
    script: |
      source environment/bin/activate
      ./environment/bin/python -m dask_yarn.cli services scheduler

  dask.worker:
    # Don't start any workers initially
    instances: 0
    # Workers can restart infinite number of times
    max_restarts: -1
    # Workers should only be started after the scheduler starts
    depends:
      - dask.scheduler
    # Restrict workers to 4 GiB and 2 cores each
    resources:
      memory: 4 GiB
      vcores: 2
    files:
      environment: environment.tar.gz
    script: |
      source environment/bin/activate
      ./environment/bin/python -m dask_yarn.cli services worker
""")

cluster = dask_yarn.YarnCluster.from_specification(spec)
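
Once the cluster comes up you should be able to attach a client and scale workers as usual (the spec above deliberately starts with zero worker instances). A minimal, untested sketch continuing from the `cluster` created above:

from dask.distributed import Client

# Attach a client to the manually-specified cluster and request two workers
client = Client(cluster)
cluster.scale(2)

# Quick sanity check that tasks actually run on the YARN workers
print(client.submit(lambda x: x + 1, 1).result())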
jcrist commented 4 years ago

Oops, there was a typo in the above, now fixed. If you've already tried it, try again.

DanRunfola commented 4 years ago

Always happy to help, even when it's far over my head!

The code you posted does work in the Jupyter environment on CDSW; as expected (I think), the async error still comes up in the normal "Workbench" on CDSW.

On the original concern (the async error in the very specific Workbench environment), is this something you think dask-yarn may eventually handle with a workaround, or is this something y'all believe is better handled on the Cloudera side? I am more than happy to run this up that chain to get an eventual fix.

jcrist commented 4 years ago

The code you posted does work in the Jupyter environment on CDSW; as expected (I think)

Yay! Glad to hear it. Still not sure why it happens on your system and not mine, but I have a fix for this at least. I'll post a patch later today.

On the original concern (the async error in the very specific Workbench environment), is this something that you think dask-yarn may eventually handle with a workaround, or is this something that ya'll believe is better handled on the Cloudera side?

I'll see if there's an easy patch, but if you wouldn't mind pinging Cloudera to see if they can update things, that'd be swell.

DanRunfola commented 4 years ago

Great, ticket is in to Cloudera. I'll loop back with any updates here, though I filed it as low priority, so don't expect a quick response.

jcrist commented 4 years ago

Only one of the two bugs was fixed in #115; this was closed by mistake. Reopening.

jcrist commented 4 years ago

@DanRunfola, I've pushed a patch that should (hopefully) let things run fine on the ZMQIOLoop. If you have time, would you mind testing #116 on your system? You can install it using pip:

$ pip install git+https://github.com/jcrist/dask-yarn.git@support-legacy-tornado-loops
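
To confirm the branch actually got picked up (rather than a cached wheel), a quick check like the following should do, since dask_yarn exposes a version string and the module path shows where it was installed from:

$ python -c "import dask_yarn; print(dask_yarn.__version__, dask_yarn.__file__)"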
DanRunfola commented 4 years ago

Done! FYI, I changed the pip install you recommended to the command below - hopefully that's what you were after:

!pip3 install git+https://github.com/dask/dask-yarn.git@support-legacy-tornado-loops

Still get an error from the stock code - here's the trace:

from dask_yarn import YarnCluster
from dask.distributed import Client

cluster = YarnCluster(environment='./py3.tar.gz',
                      worker_vcores=2,
                      worker_memory="8GiB")

results in:

RuntimeError: There is no current event loop in thread 'IO loop'.
RuntimeError                              Traceback (most recent call last)
in engine
      1 cluster = YarnCluster(environment='./py3.tar.gz',
      2 worker_vcores=2,
----> 3 worker_memory="8GiB")

/home/cdsw/.local/lib/python3.6/site-packages/dask_yarn/core.py in __init__(self, environment, n_workers, worker_vcores, worker_memory, worker_restarts, worker_env, scheduler_vcores, scheduler_memory, deploy_mode, name, queue, tags, user, host, port, dashboard_address, skein_client, asynchronous, loop)
    388             asynchronous=asynchronous,
    389             loop=loop,
--> 390             skein_client=skein_client,
    391         )
    392 

/home/cdsw/.local/lib/python3.6/site-packages/dask_yarn/core.py in _init_common(self, spec, application_client, host, port, dashboard_address, asynchronous, loop, skein_client)
    531 
    532         if not self.asynchronous:
--> 533             self._sync(self._start_internal())
    534 
    535     def _start_cluster(self):

/home/cdsw/.local/lib/python3.6/site-packages/dask_yarn/core.py in _sync(self, task)
    714                 return await task
    715 
--> 716             return sync(self.loop, f)
    717 
    718     @cached_property

/home/cdsw/.local/lib/python3.6/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
    346     if error[0]:
    347         typ, exc, tb = error[0]
--> 348         raise exc.with_traceback(tb)
    349     else:
    350         return result[0]

/home/cdsw/.local/lib/python3.6/site-packages/distributed/utils.py in f()
    330             if callback_timeout is not None:
    331                 future = asyncio.wait_for(future, callback_timeout)
--> 332             result[0] = yield future
    333         except Exception as exc:
    334             error[0] = sys.exc_info()

/var/lib/cdsw/python3-engine-deps/lib/python3.6/site-packages/tornado/gen.py in run(self)
   1053 
   1054                     try:
-> 1055                         value = future.result()
   1056                     except Exception:
   1057                         self.had_exception = True

/var/lib/cdsw/python3-engine-deps/lib/python3.6/site-packages/tornado/concurrent.py in result(self, timeout)
    236         if self._exc_info is not None:
    237             try:
--> 238                 raise_exc_info(self._exc_info)
    239             finally:
    240                 self = None

/var/lib/cdsw/python3-engine-deps/lib/python3.6/site-packages/tornado/util.py in raise_exc_info(exc_info)

/var/lib/cdsw/python3-engine-deps/lib/python3.6/site-packages/tornado/gen.py in wrapper(*args, **kwargs)
    305                 try:
    306                     orig_stack_contexts = stack_context._state.contexts
--> 307                     yielded = next(result)
    308                     if stack_context._state.contexts is not orig_stack_contexts:
    309                         yielded = TracebackFuture()

/var/lib/cdsw/python3-engine-deps/lib/python3.6/site-packages/tornado/gen.py in _wrap_awaitable(x)

/home/cdsw/.local/lib/python3.6/site-packages/dask_yarn/core.py in f()
    712 
    713             async def f():
--> 714                 return await task
    715 
    716             return sync(self.loop, f)

/home/cdsw/.local/lib/python3.6/site-packages/dask_yarn/core.py in _start_internal(self)
    588     async def _start_internal(self):
    589         if self._start_task is None:
--> 590             self._start_task = asyncio.ensure_future(self._start_async())
    591         try:
    592             await self._start_task

/usr/lib/python3.6/asyncio/tasks.py in ensure_future(coro_or_future, loop)
    516     elif coroutines.iscoroutine(coro_or_future):
    517         if loop is None:
--> 518             loop = events.get_event_loop()
    519         task = loop.create_task(coro_or_future)
    520         if task._source_traceback:

/usr/lib/python3.6/asyncio/events.py in get_event_loop()
    692     if current_loop is not None:
    693         return current_loop
--> 694     return get_event_loop_policy().get_event_loop()
    695 
    696 

/usr/lib/python3.6/asyncio/events.py in get_event_loop(self)
    600         if self._local._loop is None:
    601             raise RuntimeError('There is no current event loop in thread %r.'
--> 602                                % threading.current_thread().name)
    603         return self._local._loop
    604 

RuntimeError: There is no current event loop in thread 'IO loop'.
jcrist commented 4 years ago

Hmmm, interesting. Does the following work for you?

from dask.distributed import Client

client = Client()
client.submit(lambda x: x + 1, 1).result()
DanRunfola commented 4 years ago

No dice -


from dask.distributed import Client
client = Client()
RuntimeError: There is no current event loop in thread 'IO loop'.
RuntimeError                              Traceback (most recent call last)
in engine
----> 1 client = Client()

/home/cdsw/.local/lib/python3.6/site-packages/distributed/client.py in __init__(self, address, loop, timeout, set_as_default, scheduler_file, security, asynchronous, name, heartbeat_interval, serializers, deserializers, extensions, direct_to_workers, **kwargs)
    721             ext(self)
    722 
--> 723         self.start(timeout=timeout)
    724         Client._instances.add(self)
    725 

/home/cdsw/.local/lib/python3.6/site-packages/distributed/client.py in start(self, **kwargs)
    894             self._started = asyncio.ensure_future(self._start(**kwargs))
    895         else:
--> 896             sync(self.loop, self._start, **kwargs)
    897 
    898     def __await__(self):

/home/cdsw/.local/lib/python3.6/site-packages/distributed/utils.py in sync(loop, func, callback_timeout, *args, **kwargs)
    346     if error[0]:
    347         typ, exc, tb = error[0]
--> 348         raise exc.with_traceback(tb)
    349     else:
    350         return result[0]

/home/cdsw/.local/lib/python3.6/site-packages/distributed/utils.py in f()
    330             if callback_timeout is not None:
    331                 future = asyncio.wait_for(future, callback_timeout)
--> 332             result[0] = yield future
    333         except Exception as exc:
    334             error[0] = sys.exc_info()

/var/lib/cdsw/python3-engine-deps/lib/python3.6/site-packages/tornado/gen.py in run(self)
   1053 
   1054                     try:
-> 1055                         value = future.result()
   1056                     except Exception:
   1057                         self.had_exception = True

/var/lib/cdsw/python3-engine-deps/lib/python3.6/site-packages/tornado/concurrent.py in result(self, timeout)
    236         if self._exc_info is not None:
    237             try:
--> 238                 raise_exc_info(self._exc_info)
    239             finally:
    240                 self = None

/var/lib/cdsw/python3-engine-deps/lib/python3.6/site-packages/tornado/util.py in raise_exc_info(exc_info)

/var/lib/cdsw/python3-engine-deps/lib/python3.6/site-packages/tornado/gen.py in wrapper(*args, **kwargs)
    305                 try:
    306                     orig_stack_contexts = stack_context._state.contexts
--> 307                     yielded = next(result)
    308                     if stack_context._state.contexts is not orig_stack_contexts:
    309                         yielded = TracebackFuture()

/var/lib/cdsw/python3-engine-deps/lib/python3.6/site-packages/tornado/gen.py in _wrap_awaitable(x)

/home/cdsw/.local/lib/python3.6/site-packages/distributed/client.py in _start(self, timeout, **kwargs)
    960                     loop=self.loop,
    961                     asynchronous=self._asynchronous,
--> 962                     **self._startup_kwargs
    963                 )
    964             except (OSError, socket.error) as e:

/home/cdsw/.local/lib/python3.6/site-packages/distributed/deploy/spec.py in _()
    362         async def _():
    363             if self.status == "created":
--> 364                 await self._start()
    365             await self.scheduler
    366             await self._correct_state()

/home/cdsw/.local/lib/python3.6/site-packages/distributed/deploy/spec.py in _start(self)
    265             raise ValueError("Cluster is closed")
    266 
--> 267         self._lock = asyncio.Lock()
    268 
    269         if self.scheduler_spec is None:

/usr/lib/python3.6/asyncio/locks.py in __init__(self, loop)
    147             self._loop = loop
    148         else:
--> 149             self._loop = events.get_event_loop()
    150 
    151     def __repr__(self):

/usr/lib/python3.6/asyncio/events.py in get_event_loop()
    692     if current_loop is not None:
    693         return current_loop
--> 694     return get_event_loop_policy().get_event_loop()
    695 
    696 

/usr/lib/python3.6/asyncio/events.py in get_event_loop(self)
    600         if self._local._loop is None:
    601             raise RuntimeError('There is no current event loop in thread %r.'
--> 602                                % threading.current_thread().name)
    603         return self._local._loop
    604 

RuntimeError: There is no current event loop in thread 'IO loop'.
jcrist commented 4 years ago

Cool, so it isn't dask-yarn specific; this has to do with how all of dask uses event loops. I'm going to punt on this for now and say that we don't support event-loop implementations this old. ZMQIOLoop has been deprecated since 2017, and I don't currently feel that the effort it would take for us to track down all the issues and make things work is worth it (apologies). If someone else made a PR to fix things I'd happily merge it, but I don't plan to continue work on this myself.

With #115 though, you should be able to get things working fully on your JupyterHub setup, so hopefully that's sufficient for your needs.
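
For anyone hitting this later, a quick diagnostic sketch (untested, assuming tornado is importable in the session) to check whether an environment is running on one of these legacy loops:

from tornado.ioloop import IOLoop

loop = IOLoop.current()
print(type(loop))                     # expected to show ZMQIOLoop in the affected Workbench sessions
print(hasattr(loop, "asyncio_loop"))  # False on legacy (pre-tornado-5 / ZMQ) loops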

jcrist commented 4 years ago

I just released version 0.8.1 with the fix from #115. It's up on PyPI now and should be on conda-forge later today.
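
Upgrading in a session is then just a normal install; for example, pinning the released version via either channel:

$ pip install --upgrade dask-yarn==0.8.1
$ conda install -c conda-forge dask-yarn=0.8.1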