Ah, poor error reporting indeed! I updated the quickstart docs to show that you must call .start()
to actually assign workers - instantiation only starts the scheduler.
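For reference, the intended pattern is roughly the following (a minimal sketch based on this thread; exact signatures may differ between knit versions):
from knit.dask_yarn import DaskYARNCluster

cluster = DaskYARNCluster()   # instantiation only sets up the local scheduler
cluster.start(4)              # .start(n_workers) is what actually requests YARN containers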
Ah, thanks (and thanks for making the answer a commit for future users :))
Next problem:
In [4]: cluster.start(4)
2017-09-12 10:31:30,725 - knit.env - INFO - Creating new env dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58
2017-09-12 10:31:30,725 - knit.env - INFO - /home/mrocklin/anaconda/bin/conda create -p /home/mrocklin/workspace/knit/knit/tmp_conda/envs/dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58 -y -q dask>=0.14 distributed>=1.16
Error: Unknown option --callbackHost
Error: Unknown argument '127.0.0.1'
Error: Unknown option --callbackPort
Error: Unknown argument '57663'
Try --help for more information.
Is this installed with conda or pip? If using the source, you need to first build the jar that runs in Java:
> python setup.py install mvn
mrocklin@workstation:~/workspace/knit$ python setup.py install mvn
-------------------------------------------------------
T E S T S
-------------------------------------------------------
Results :
Tests run: 0, Failures: 0, Errors: 0, Skipped: 0
running install
running bdist_egg
running egg_info
writing knit.egg-info/PKG-INFO
writing dependency_links to knit.egg-info/dependency_links.txt
writing requirements to knit.egg-info/requires.txt
writing top-level names to knit.egg-info/top_level.txt
reading manifest file 'knit.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
writing manifest file 'knit.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
copying knit/compatibility.py -> build/lib/knit
copying knit/dask_yarn.py -> build/lib/knit
copying knit/exceptions.py -> build/lib/knit
copying knit/__init__.py -> build/lib/knit
copying knit/env.py -> build/lib/knit
copying knit/core.py -> build/lib/knit
copying knit/yarn_api.py -> build/lib/knit
copying knit/utils.py -> build/lib/knit
copying knit/conf.py -> build/lib/knit
copying knit/java_libs/knit-1.0-SNAPSHOT.jar -> build/lib/knit/java_libs
creating build/bdist.linux-x86_64/egg
creating build/bdist.linux-x86_64/egg/knit
copying build/lib/knit/compatibility.py -> build/bdist.linux-x86_64/egg/knit
copying build/lib/knit/dask_yarn.py -> build/bdist.linux-x86_64/egg/knit
copying build/lib/knit/exceptions.py -> build/bdist.linux-x86_64/egg/knit
copying build/lib/knit/_version.py -> build/bdist.linux-x86_64/egg/knit
copying build/lib/knit/__init__.py -> build/bdist.linux-x86_64/egg/knit
copying build/lib/knit/env.py -> build/bdist.linux-x86_64/egg/knit
creating build/bdist.linux-x86_64/egg/knit/java_libs
copying build/lib/knit/java_libs/knit-1.0-SNAPSHOT.jar -> build/bdist.linux-x86_64/egg/knit/java_libs
copying build/lib/knit/core.py -> build/bdist.linux-x86_64/egg/knit
copying build/lib/knit/yarn_api.py -> build/bdist.linux-x86_64/egg/knit
copying build/lib/knit/utils.py -> build/bdist.linux-x86_64/egg/knit
copying build/lib/knit/conf.py -> build/bdist.linux-x86_64/egg/knit
byte-compiling build/bdist.linux-x86_64/egg/knit/compatibility.py to compatibility.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/knit/dask_yarn.py to dask_yarn.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/knit/exceptions.py to exceptions.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/knit/_version.py to _version.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/knit/__init__.py to __init__.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/knit/env.py to env.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/knit/core.py to core.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/knit/yarn_api.py to yarn_api.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/knit/utils.py to utils.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/knit/conf.py to conf.cpython-36.pyc
creating build/bdist.linux-x86_64/egg/EGG-INFO
copying knit.egg-info/PKG-INFO -> build/bdist.linux-x86_64/egg/EGG-INFO
copying knit.egg-info/SOURCES.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying knit.egg-info/dependency_links.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying knit.egg-info/not-zip-safe -> build/bdist.linux-x86_64/egg/EGG-INFO
copying knit.egg-info/requires.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying knit.egg-info/top_level.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
creating 'dist/knit-0.2.2-py3.6.egg' and adding 'build/bdist.linux-x86_64/egg' to it
removing 'build/bdist.linux-x86_64/egg' (and everything under it)
Processing knit-0.2.2-py3.6.egg
creating /home/mrocklin/anaconda/lib/python3.6/site-packages/knit-0.2.2-py3.6.egg
Extracting knit-0.2.2-py3.6.egg to /home/mrocklin/anaconda/lib/python3.6/site-packages
Adding knit 0.2.2 to easy-install.pth file
Installed /home/mrocklin/anaconda/lib/python3.6/site-packages/knit-0.2.2-py3.6.egg
Processing dependencies for knit==0.2.2
Searching for py4j==0.10.4
Best match: py4j 0.10.4
Adding py4j 0.10.4 to easy-install.pth file
Using /home/mrocklin/anaconda/lib/python3.6/site-packages
Searching for requests==2.12.4
Best match: requests 2.12.4
Adding requests 2.12.4 to easy-install.pth file
Using /home/mrocklin/anaconda/lib/python3.6/site-packages
Searching for lxml==3.7.2
Best match: lxml 3.7.2
Adding lxml 3.7.2 to easy-install.pth file
Using /home/mrocklin/anaconda/lib/python3.6/site-packages
Finished processing dependencies for knit==0.2.2
mrocklin@workstation:~/workspace/knit$ ipython
Python 3.6.0 |Anaconda custom (64-bit)| (default, Dec 23 2016, 12:22:00)
Type "copyright", "credits" or "license" for more information.
IPython 5.1.0 -- An enhanced Interactive Python.
? -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help -> Python's own help system.
object? -> Details about 'object', use 'object??' for extra details.
In [1]: from knit.dask_yarn import DaskYARNCluster
In [2]: cluster = DaskYARNCluster()
In [3]: cluster.start(4)
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/yarn/client/api/YarnClient
at java.lang.Class.getDeclaredMethods0(Native Method)
at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
at java.lang.Class.privateGetMethodRecursive(Class.java:3048)
at java.lang.Class.getMethod0(Class.java:3018)
at java.lang.Class.getMethod(Class.java:1784)
at org.apache.hadoop.util.RunJar.main(RunJar.java:202)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.yarn.client.api.YarnClient
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 6 more
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
<ipython-input-3-610bf2513091> in <module>()
----> 1 cluster.start(4)
/home/mrocklin/workspace/knit/knit/dask_yarn.py in start(self, n_workers, cpus, memory)
123 app_id = self.knit.start(command, env=self.env,
124 num_containers=n_workers,
--> 125 virtual_cores=cpus, memory=memory)
126 self.app_id = app_id
127 return app_id
/home/mrocklin/workspace/knit/knit/core.py in start(self, cmd, num_containers, virtual_cores, memory, env, files, app_name, queue)
208
209 if gateway_port is None:
--> 210 raise Exception("Java gateway process exited before sending the"
211 " driver its port number")
212
Exception: Java gateway process exited before sending the driver its port number
To be sure, is this within the mdurant/hadoop container? It should have JAVA_HOME and the classpath set; I've never seen anything like this!
Yarn is running within that container, but my ipython process is not.
Until Java is kicked out of the client at some distant future point, you need to run on the "edge" node, with access to the YARN classes. It would be nice to be able to run as a remote client, but not yet.
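If it helps, a rough pre-flight check along these lines (hypothetical, not something knit provides; it just asks the hadoop CLI for its classpath) could confirm you are on a suitable node:
import subprocess

def yarn_classes_reachable():
    # If the hadoop CLI is missing, or its classpath contains no yarn jars,
    # knit's Java gateway cannot start on this node.
    try:
        cp = subprocess.check_output(["hadoop", "classpath"]).decode()
    except (OSError, subprocess.CalledProcessError):
        return False
    return "yarn" in cp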
Suggestions of where to put such information (once you have things working) would be greatly appreciated.
Now running in the container
In [1]: from knit.dask_yarn import DaskYARNCluster
In [2]: cluster = DaskYARNCluster()
In [3]: cluster.start()
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-3-91c70aa40c2d> in <module>()
----> 1 cluster.start()
TypeError: start() missing 1 required positional argument: 'n_workers'
In [4]: cluster.start(4)
2017-09-13 12:21:09,118 - knit.env - INFO - Creating new env dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58
2017-09-13 12:21:09,118 - knit.env - INFO - /opt/conda/bin/conda create -p /home/knit/knit/tmp_conda/envs/dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58 -y -q dask>=0.14 distributed>=1.16
17/09/13 12:21:50 INFO knit.Client$: Starting Application Master
17/09/13 12:21:51 INFO hdfs.DFSClient: Cannot get delegation token from yarn
17/09/13 12:21:51 INFO knit.Utils$: Setting Replication Factor to: 3
17/09/13 12:21:51 INFO knit.Utils$: Attemping upload of /home/knit/knit/java_libs/knit-1.0-SNAPSHOT.jar to /user/root/.knitDeps
17/09/13 12:21:52 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
17/09/13 12:21:52 INFO knit.Utils$: Attemping upload of /home/knit/knit/tmp_conda/envs/dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58.zip to hdfs://0.0.0.0:8020/user/root/.knitDeps
17/09/13 12:22:08 INFO knit.Client$: Submitting application application_1505225287100_0003
17/09/13 12:22:08 INFO impl.YarnClientImpl: Submitted application application_1505225287100_0003
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
<ipython-input-4-610bf2513091> in <module>()
----> 1 cluster.start(4)
/home/knit/knit/dask_yarn.py in start(self, n_workers, cpus, memory)
123 app_id = self.knit.start(command, env=self.env,
124 num_containers=n_workers,
--> 125 virtual_cores=cpus, memory=memory)
126 self.app_id = app_id
127 return app_id
/home/knit/knit/core.py in start(self, cmd, num_containers, virtual_cores, memory, env, files, app_name, queue)
228
229 if master_rpcport == -1:
--> 230 raise Exception("YARN master container did not report back")
231 master_rpchost = self.client.masterRPCHost()
232
Exception: YARN master container did not report back
In [5]:
In [5]: cluster
Out[5]: <knit.dask_yarn.DaskYARNCluster at 0x7f537274f9e8>
In [6]: cluster.wo17/09/13 12:23:56 INFO knit.Client$: Getting containers for application_attempt_id { application_id { id: 3 cluster_timestamp: 1505225287100 } attemptId: 1 } host: "N/A" rpc_port: -1 tracking_url: "http://73187ceb633b:8088/proxy/application_1505225287100_0003/" diagnostics: "[Wed Sep 13 12:22:08 +0000 2017] Application is added to the scheduler and is not yet activated. Skipping AM assignment as cluster resource is empty. Details : AM Partition = <DEFAULT_PARTITION>; AM Resource Request = <memory:1024, vCores:1>; Queue Resource Limit for AM = <memory:0, vCores:0>; User AM Resource Limit of the queue = <memory:0, vCores:0>; Queue AM Resource Usage = <memory:0, vCores:0>; " yarn_application_attempt_state: APP_ATTEMPT_SCHEDULED original_tracking_url: "N/A" startTime: 1505305328938 finishTime: 0
17/09/13 12:23:56 INFO knit.Client$: Container ID:
17/09/13 12:23:56 INFO knit.Client$: Getting containers for application_attempt_id { application_id { id: 3 cluster_timestamp: 1505225287100 } attemptId: 1 } host: "N/A" rpc_port: -1 tracking_url: "http://73187ceb633b:8088/proxy/application_1505225287100_0003/" diagnostics: "[Wed Sep 13 12:22:08 +0000 2017] Application is added to the scheduler and is not yet activated. Skipping AM assignment as cluster resource is empty. Details : AM Partition = <DEFAULT_PARTITION>; AM Resource Request = <memory:1024, vCores:1>; Queue Resource Limit for AM = <memory:0, vCores:0>; User AM Resource Limit of the queue = <memory:0, vCores:0>; Queue AM Resource Usage = <memory:0, vCores:0>; " yarn_application_attempt_state: APP_ATTEMPT_SCHEDULED original_tracking_url: "N/A" startTime: 1505305328938 finishTime: 0
17/09/13 12:23:56 INFO knit.Client$: Container ID:
In [6]: cluster.workers
17/09/13 12:23:58 INFO knit.Client$: Getting containers for application_attempt_id { application_id { id: 3 cluster_timestamp: 1505225287100 } attemptId: 1 } host: "N/A" rpc_port: -1 tracking_url: "http://73187ceb633b:8088/proxy/application_1505225287100_0003/" diagnostics: "[Wed Sep 13 12:22:08 +0000 2017] Application is added to the scheduler and is not yet activated. Skipping AM assignment as cluster resource is empty. Details : AM Partition = <DEFAULT_PARTITION>; AM Resource Request = <memory:1024, vCores:1>; Queue Resource Limit for AM = <memory:0, vCores:0>; User AM Resource Limit of the queue = <memory:0, vCores:0>; Queue AM Resource Usage = <memory:0, vCores:0>; " yarn_application_attempt_state: APP_ATTEMPT_SCHEDULED original_tracking_url: "N/A" startTime: 1505305328938 finishTime: 0
17/09/13 12:23:58 INFO knit.Client$: Container ID:
Out[6]: []
How I test:
docker run -d -v /Users/mdurant/code/:/code mdurant/hadoop
docker exec -it 5bcfd2e5d9fd bash
cd /code/knit
conda install -y -q dask distributed
conda install -y -q -c conda-forge lxml py4j
python setup.py install mvn
ipython
In [1]: from knit import DaskYARNCluster
In [2]: cluster = DaskYARNCluster()
In [3]: cluster.start(2, memory=256, cpus=1) # guaranteed not to exhaust my docker VM
2017-09-13 13:22:46,672 - knit.env - INFO - Creating new env dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58
2017-09-13 13:22:46,672 - knit.env - INFO - /opt/conda/bin/conda create -p /code/knit/knit/tmp_conda/envs/dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58 -y -q dask>=0.14 distributed>=1.16
17/09/13 13:24:57 INFO knit.Client$: Starting Application Master
17/09/13 13:24:59 INFO hdfs.DFSClient: Cannot get delegation token from yarn
17/09/13 13:24:59 INFO knit.Utils$: Setting Replication Factor to: 3
17/09/13 13:24:59 INFO knit.Utils$: Attemping upload of /code/knit/knit/java_libs/knit-1.0-SNAPSHOT.jar to /user/root/.knitDeps
17/09/13 13:25:00 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
17/09/13 13:25:00 INFO knit.Utils$: Attemping upload of /code/knit/knit/tmp_conda/envs/dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58.zip to hdfs://0.0.0.0:8020/user/root/.knitDeps
17/09/13 13:25:07 INFO knit.Client$: Submitting application application_1505308493094_0001
17/09/13 13:25:08 INFO impl.YarnClientImpl: Submitted application application_1505308493094_0001
Out[3]: 'application_1505308493094_0001'
In [4]: from dask.distributed import Client
In [5]: c = Client(cluster)
In [6]: c
Out[6]: <Client: scheduler='tcp://172.17.0.2:43117' processes=2 cores=2>
(`conda install -y -q -c conda-forge lxml py4j; python setup.py install mvn` can be replaced with `conda install knit -c conda-forge`, and then there's no need to mount the source in the docker container)
mrocklin@workstation:~$ docker run -p 8020:8020 -p 8088:8088 mdurant/hadoop
 * Restarting OpenBSD Secure Shell server sshd
...done.
Generating public/private dsa key pair.
Created directory '/root/.ssh'.
Your identification has been saved in /root/.ssh/id_dsa.
Your public key has been saved in /root/.ssh/id_dsa.pub.
...
# localhost SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8
# localhost SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8
# 0.0.0.0 SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8
# 0.0.0.0 SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8
Starting namenodes on [0.0.0.0]
0.0.0.0: starting namenode, logging to /opt/hadoop/logs/hadoop-root-namenode-abb365839374.out
localhost: starting datanode, logging to /opt/hadoop/logs/hadoop-root-datanode-abb365839374.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /opt/hadoop/logs/hadoop-root-secondarynamenode-abb365839374.out
starting yarn daemons
starting resourcemanager, logging to /opt/hadoop/logs/yarn--resourcemanager-abb365839374.out
localhost: starting nodemanager, logging to /opt/hadoop/logs/yarn-root-nodemanager-abb365839374.out
mrocklin@workstation:~$ docker exec -it elegant_franklin bash
root@abb365839374:/# cd /code/knit
bash: cd: /code/knit: No such file or directory
root@abb365839374:/# ls
bin boot dev etc home lib lib64 media mnt opt proc root run sbin srv sys tmp usr var
root@abb365839374:/# cd home/
root@abb365839374:/home# ls
root@abb365839374:/home# git clone https://github.com/dask/knit.git
Cloning into 'knit'...
cd kremote: Counting objects: 1446, done.
remote: Compressing objects: 100% (90/90), done.
remote: Total 1446 (delta 58), reused 88 (delta 32), pack-reused 1318
Receiving objects: 100% (1446/1446), 267.88 KiB | 0 bytes/s, done.
Resolving deltas: 100% (745/745), done.
Checking connectivity... done.
root@abb365839374:/home# cd knit
root@abb365839374:/home/knit# conda install -y -q dask distributed
Package plan for installation in environment /opt/conda:
The following NEW packages will be INSTALLED:
bkcharts: 0.2-py36_0
bokeh: 0.12.7-py36_0
click: 6.7-py36_0
cloudpickle: 0.4.0-py36_0
dask: 0.15.2-py36_0
distributed: 1.18.1-py36_0
heapdict: 1.0.0-py36_1
locket: 0.2.0-py36_1
mkl: 2017.0.3-0
msgpack-python: 0.4.8-py36_0
numpy: 1.13.1-py36_0
pandas: 0.20.3-py36_0
partd: 0.3.8-py36_0
psutil: 5.2.2-py36_0
sortedcontainers: 1.5.7-py36_0
tblib: 1.3.2-py36_0
toolz: 0.8.2-py36_0
tornado: 4.5.2-py36_0
zict: 0.1.2-py36_0
The following packages will be UPDATED:
conda: 4.3.23-py36_0 conda-forge --> 4.3.25-py36_0
The following packages will be SUPERSEDED by a higher-priority channel:
conda-env: 2.6.0-0 conda-forge --> 2.6.0-0
root@abb365839374:/home/knit# conda install -y -q -c conda-forge lxml py4j
Package plan for installation in environment /opt/conda:
The following NEW packages will be INSTALLED:
libxslt: 1.1.29-5 conda-forge
lxml: 3.8.0-py36_0 conda-forge
py4j: 0.10.6-py36_1 conda-forge
The following packages will be SUPERSEDED by a higher-priority channel:
conda: 4.3.25-py36_0 --> 4.3.23-py36_0 conda-forge
conda-env: 2.6.0-0 --> 2.6.0-0 conda-forge
root@abb365839374:/home/knit# python setup.py install mvn
running install
running bdist_egg
running egg_info
creating knit.egg-info
writing knit.egg-info/PKG-INFO
writing dependency_links to knit.egg-info/dependency_links.txt
writing requirements to knit.egg-info/requires.txt
writing top-level names to knit.egg-info/top_level.txt
writing manifest file 'knit.egg-info/SOURCES.txt'
reading manifest file 'knit.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
no previously-included directories found matching 'knit/tmp_conda'
writing manifest file 'knit.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
creating build
creating build/lib
creating build/lib/knit
copying knit/compatibility.py -> build/lib/knit
copying knit/dask_yarn.py -> build/lib/knit
copying knit/exceptions.py -> build/lib/knit
copying knit/__init__.py -> build/lib/knit
copying knit/env.py -> build/lib/knit
copying knit/core.py -> build/lib/knit
copying knit/yarn_api.py -> build/lib/knit
copying knit/utils.py -> build/lib/knit
copying knit/conf.py -> build/lib/knit
creating build/lib/knit/java_libs
copying knit/java_libs/knit-1.0-SNAPSHOT.jar -> build/lib/knit/java_libs
creating build/bdist.linux-x86_64
creating build/bdist.linux-x86_64/egg
creating build/bdist.linux-x86_64/egg/knit
copying build/lib/knit/compatibility.py -> build/bdist.linux-x86_64/egg/knit
copying build/lib/knit/dask_yarn.py -> build/bdist.linux-x86_64/egg/knit
copying build/lib/knit/exceptions.py -> build/bdist.linux-x86_64/egg/knit
copying build/lib/knit/__init__.py -> build/bdist.linux-x86_64/egg/knit
copying build/lib/knit/env.py -> build/bdist.linux-x86_64/egg/knit
creating build/bdist.linux-x86_64/egg/knit/java_libs
copying build/lib/knit/java_libs/knit-1.0-SNAPSHOT.jar -> build/bdist.linux-x86_64/egg/knit/java_libs
copying build/lib/knit/core.py -> build/bdist.linux-x86_64/egg/knit
copying build/lib/knit/yarn_api.py -> build/bdist.linux-x86_64/egg/knit
copying build/lib/knit/utils.py -> build/bdist.linux-x86_64/egg/knit
copying build/lib/knit/conf.py -> build/bdist.linux-x86_64/egg/knit
byte-compiling build/bdist.linux-x86_64/egg/knit/compatibility.py to compatibility.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/knit/dask_yarn.py to dask_yarn.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/knit/exceptions.py to exceptions.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/knit/__init__.py to __init__.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/knit/env.py to env.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/knit/core.py to core.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/knit/yarn_api.py to yarn_api.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/knit/utils.py to utils.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/knit/conf.py to conf.cpython-36.pyc
creating build/bdist.linux-x86_64/egg/EGG-INFO
copying knit.egg-info/PKG-INFO -> build/bdist.linux-x86_64/egg/EGG-INFO
copying knit.egg-info/SOURCES.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying knit.egg-info/dependency_links.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying knit.egg-info/not-zip-safe -> build/bdist.linux-x86_64/egg/EGG-INFO
copying knit.egg-info/requires.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying knit.egg-info/top_level.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
creating dist
creating 'dist/knit-0.2.2-py3.6.egg' and adding 'build/bdist.linux-x86_64/egg' to it
removing 'build/bdist.linux-x86_64/egg' (and everything under it)
Processing knit-0.2.2-py3.6.egg
creating /opt/conda/lib/python3.6/site-packages/knit-0.2.2-py3.6.egg
Extracting knit-0.2.2-py3.6.egg to /opt/conda/lib/python3.6/site-packages
Adding knit 0.2.2 to easy-install.pth file
Installed /opt/conda/lib/python3.6/site-packages/knit-0.2.2-py3.6.egg
Processing dependencies for knit==0.2.2
Searching for py4j==0.10.6
Best match: py4j 0.10.6
Adding py4j 0.10.6 to easy-install.pth file
Using /opt/conda/lib/python3.6/site-packages
Searching for requests==2.14.2
Best match: requests 2.14.2
Adding requests 2.14.2 to easy-install.pth file
Using /opt/conda/lib/python3.6/site-packages
Searching for lxml==3.8.0
Best match: lxml 3.8.0
Adding lxml 3.8.0 to easy-install.pth file
Using /opt/conda/lib/python3.6/site-packages
Finished processing dependencies for knit==0.2.2
root@abb365839374:/home/knit# ipython
Python 3.6.1 |Continuum Analytics, Inc.| (default, May 11 2017, 13:09:58)
Type 'copyright', 'credits' or 'license' for more information
IPython 6.1.0 -- An enhanced Interactive Python. Type '?' for help.
In [1]: from knit import DaskYARNCluster
In [2]: cluster = DaskYARNCluster()
In [3]: cluster.start(2, memory=256, cpus=1)
2017-09-13 16:38:02,388 - knit.env - INFO - Creating new env dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58
2017-09-13 16:38:02,388 - knit.env - INFO - /opt/conda/bin/conda create -p /home/knit/knit/tmp_conda/envs/dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58 -y -q dask>=0.14 distributed>=1.16
17/09/13 16:39:24 INFO knit.Client$: Starting Application Master
17/09/13 16:39:25 INFO hdfs.DFSClient: Cannot get delegation token from yarn
17/09/13 16:39:25 INFO knit.Utils$: Setting Replication Factor to: 3
17/09/13 16:39:25 INFO knit.Utils$: Attemping upload of /home/knit/knit/java_libs/knit-1.0-SNAPSHOT.jar to /user/root/.knitDeps
17/09/13 16:39:26 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
17/09/13 16:39:26 INFO knit.Utils$: Attemping upload of /home/knit/knit/tmp_conda/envs/dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58.zip to hdfs://0.0.0.0:8020/user/root/.knitDeps
17/09/13 16:40:12 INFO knit.Client$: Submitting application application_1505318878467_0001
17/09/13 16:40:13 INFO impl.YarnClientImpl: Submitted application application_1505318878467_0001
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
<ipython-input-3-31269a53dae8> in <module>()
----> 1 cluster.start(2, memory=256, cpus=1)
/home/knit/knit/dask_yarn.py in start(self, n_workers, cpus, memory)
123 app_id = self.knit.start(command, env=self.env,
124 num_containers=n_workers,
--> 125 virtual_cores=cpus, memory=memory)
126 self.app_id = app_id
127 return app_id
/home/knit/knit/core.py in start(self, cmd, num_containers, virtual_cores, memory, env, files, app_name, queue)
228
229 if master_rpcport == -1:
--> 230 raise Exception("YARN master container did not report back")
231 master_rpchost = self.client.masterRPCHost()
232
Exception: YARN master container did not report back
In [4]:
I run the exact same code, and succeed. My VM is set to 4 CPUs and 3.8GB, and after starting dask as above, free reports 1594MB available. Unfortunately, YARN's introspection of the system (available as cluster.knit.yarn_api.cluster_metrics()) seems to misreport the amount of available RAM; for me it gives 8GB.
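For anyone following along, the quickest way to see what YARN thinks it has is the call mentioned above (the exact contents of the returned metrics may vary by version):
metrics = cluster.knit.yarn_api.cluster_metrics()
print(metrics)   # YARN's (configured, not measured) view of memory and vcores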
I am surprised that you seem to get stuck before even starting the AM. That means there are not even any logs we could get. Can you check memory with free, and disk with df -h and df -h -i?
I'm not sure what I'm looking for here:
root@73d7b672700a:/# free -h
total used free shared buffers cached
Mem: 15G 15G 322M 818M 136M 2.2G
-/+ buffers/cache: 12G 2.6G
Swap: 15G 414M 15G
root@73d7b672700a:/# df -h
Filesystem Size Used Avail Use% Mounted on
none 453G 404G 27G 94% /
tmpfs 7.8G 0 7.8G 0% /dev
tmpfs 7.8G 0 7.8G 0% /sys/fs/cgroup
/dev/dm-1 453G 404G 27G 94% /etc/hosts
shm 64M 0 64M 0% /dev/shm
tmpfs 7.8G 0 7.8G 0% /sys/firmware
root@73d7b672700a:/# df -h -i
Filesystem Inodes IUsed IFree IUse% Mounted on
none 29M 3.6M 26M 13% /
tmpfs 2.0M 16 2.0M 1% /dev
tmpfs 2.0M 11 2.0M 1% /sys/fs/cgroup
/dev/dm-1 29M 3.6M 26M 13% /etc/hosts
shm 2.0M 1 2.0M 1% /dev/shm
tmpfs 2.0M 1 2.0M 1% /sys/firmware
Can you look for errors in
/opt/hadoop/logs/yarn--resourcemanager-*.out
/opt/hadoop/logs/yarn--resourcemanager-*.log
/opt/hadoop/logs/yarn-root-nodemanager-*.out
/opt/hadoop/logs/yarn-root-nodemanager-*.log
This one has an error message:
Happy to provide the others as well if desired.
used space above threshold of 90.0%
!! The node manager failed to start because it assumed the disk would soon become full.
This suggests I should add more diagnostics to the YARN API; the information about the state of the nodemanagers is readily available.
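As one possible diagnostic, the standard YARN ResourceManager REST API exposes per-node health reports; a sketch using requests directly (not something knit does today, and the RM address is assumed to be localhost:8088 as in this docker setup):
import requests

resp = requests.get("http://localhost:8088/ws/v1/cluster/nodes")
for node in (resp.json().get("nodes") or {}).get("node", []):
    # healthReport carries the node health message, e.g. the disk-usage
    # threshold warning quoted above
    print(node["id"], node["state"], node.get("healthReport", ""))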
Apparently the following yarn config would solve the issue, after a yarn restart (or rebuilding the docker image):
<property>
<name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
<value>99.5</value>
</property>
My machine was genuinely low on disk space, which warrants handling in its own right.
The config change would have saved me from Travis woes too, so I'll probably put it into the image at some point.
I am writing some "pre-flight checks" and diagnostics, but I notice that YARN doesn't actually know how much memory it has available, only what is specified in the config (8GB, 8 CPUs by default), and container allocations are set to a 1GB minimum by default (of which potentially only a small amount is used, and python is not actually restricted from going beyond it). There is (potentially) information about YARN's guess at physical memory usage on the worker node machines, but the log files that were useful above are not in general available to the user, as they are scattered across various machines and may need privileged access. The same goes for general available disk space.
So I'm wondering how much I can check versus trying to give comprehensive troubleshooting guidelines on failure. Certainly "master didn't report back" is totally opaque.
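A sketch of what such a check might look like (the "availableMB" key follows the YARN REST clusterMetrics schema and is an assumption about what cluster_metrics() returns here):
def preflight(cluster, n_workers, memory_mb):
    # Compare what we are about to request against what YARN claims to have.
    metrics = cluster.knit.yarn_api.cluster_metrics()
    available = metrics.get("availableMB", 0)
    needed = n_workers * memory_mb
    if available < needed:
        print("Requesting %d MB but YARN reports only %s MB available"
              % (needed, available))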
My case of not having enough disk space is probably odd enough not to worry about.
I've started up Yarn (I think) on my local machine using @martindurant's docker setup. Then, on the same machine (but not in the docker container), I try to connect using knit.dask_yarn. I probably did something wrong, but from this error I don't know what it is.