dask / knit

Deprecated, please use https://github.com/jcrist/skein or https://github.com/dask/dask-yarn instead
http://knit.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Uninformative errors on naive connection #80

Closed mrocklin closed 7 years ago

mrocklin commented 7 years ago

I've started up YARN (I think) on my local machine using @martindurant's Docker setup:

$ docker run -p 8020:8020 -p 8088:8088 mdurant/hadoop
...
# localhost SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8
# localhost SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8
# 0.0.0.0 SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8
# 0.0.0.0 SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8
Starting namenodes on [0.0.0.0]
0.0.0.0: starting namenode, logging to /opt/hadoop/logs/hadoop-root-namenode-73187ceb633b.out
localhost: starting datanode, logging to /opt/hadoop/logs/hadoop-root-datanode-73187ceb633b.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /opt/hadoop/logs/hadoop-root-secondarynamenode-73187ceb633b.out
starting yarn daemons
starting resourcemanager, logging to /opt/hadoop/logs/yarn--resourcemanager-73187ceb633b.out
localhost: starting nodemanager, logging to /opt/hadoop/logs/yarn-root-nodemanager-73187ceb633b.out

Then, on the same machine (but not in the Docker container), I try to connect using knit.dask_yarn:

In [1]: from knit.dask_yarn import DaskYARNCluster

In [2]: cluster = DaskYARNCluster?

In [3]: cluster = DaskYARNCluster()

In [4]: cluster
Out[4]: <knit.dask_yarn.DaskYARNCluster at 0x7fb9b5936e10>

In [5]: cluster.workers
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-5-7d7af4d87d9e> in <module>()
----> 1 cluster.workers

/home/mrocklin/workspace/knit/knit/dask_yarn.py in workers(self)
    150         # should not be remove or counted as a worker
    151 
--> 152         containers = self.knit.get_containers()
    153         containers.sort()
    154         self.application_master_container = containers.pop(0)

/home/mrocklin/workspace/knit/knit/core.py in get_containers(self)
    272 
    273         """
--> 274         return self.client.getContainers().split(',')
    275 
    276     def get_container_statuses(self):

AttributeError: 'NoneType' object has no attribute 'getContainers'

I probably did something wrong, but from this error I don't know what it is.
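(For future readers: the traceback shows that `self.client` is still `None` before `.start()` has been called. A guard along the following lines would turn the cryptic AttributeError into an actionable message; this is only a sketch of a possible fix, not knit's actual code.)

```
# Hypothetical guard for knit/core.py (sketch only): self.client stays
# None until start() launches the Java gateway, so fail with a message
# that says what to do rather than an AttributeError.
def get_containers(self):
    if self.client is None:
        raise RuntimeError("No application is running; "
                           "call .start() before querying containers")
    return self.client.getContainers().split(',')
```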

martindurant commented 7 years ago

Ah, poor error reporting indeed! I updated the quickstart docs to show that you must call .start() to actually assign workers - instantiation only starts the scheduler.
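For anyone landing here, the intended flow is roughly the following (a minimal sketch based on the comment above; the worker count of 4 is arbitrary):

```
from knit.dask_yarn import DaskYARNCluster

cluster = DaskYARNCluster()  # instantiation only starts the scheduler
cluster.start(4)             # .start() actually requests worker containers
print(cluster.workers)       # only meaningful after start()
```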

mrocklin commented 7 years ago

Ah, thanks (and thanks for making the answer a commit for future users :))

Next problem:

In [4]: cluster.start(4)
2017-09-12 10:31:30,725 - knit.env - INFO - Creating new env dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58
2017-09-12 10:31:30,725 - knit.env - INFO - /home/mrocklin/anaconda/bin/conda create -p /home/mrocklin/workspace/knit/knit/tmp_conda/envs/dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58 -y -q dask>=0.14 distributed>=1.16

Error: Unknown option --callbackHost
Error: Unknown argument '127.0.0.1'
Error: Unknown option --callbackPort
Error: Unknown argument '57663'
Try --help for more information.

martindurant commented 7 years ago

Is this conda- or pip-installed? If using the source, you need to first build the jar that executes in Java:

> python setup.py install mvn

mrocklin commented 7 years ago

mrocklin@workstation:~/workspace/knit$ python setup.py install mvn

-------------------------------------------------------
 T E S T S
-------------------------------------------------------

Results :

Tests run: 0, Failures: 0, Errors: 0, Skipped: 0

running install
running bdist_egg
running egg_info
writing knit.egg-info/PKG-INFO
writing dependency_links to knit.egg-info/dependency_links.txt
writing requirements to knit.egg-info/requires.txt
writing top-level names to knit.egg-info/top_level.txt
reading manifest file 'knit.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
writing manifest file 'knit.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
copying knit/compatibility.py -> build/lib/knit
copying knit/dask_yarn.py -> build/lib/knit
copying knit/exceptions.py -> build/lib/knit
copying knit/__init__.py -> build/lib/knit
copying knit/env.py -> build/lib/knit
copying knit/core.py -> build/lib/knit
copying knit/yarn_api.py -> build/lib/knit
copying knit/utils.py -> build/lib/knit
copying knit/conf.py -> build/lib/knit
copying knit/java_libs/knit-1.0-SNAPSHOT.jar -> build/lib/knit/java_libs
creating build/bdist.linux-x86_64/egg
creating build/bdist.linux-x86_64/egg/knit
copying build/lib/knit/compatibility.py -> build/bdist.linux-x86_64/egg/knit
copying build/lib/knit/dask_yarn.py -> build/bdist.linux-x86_64/egg/knit
copying build/lib/knit/exceptions.py -> build/bdist.linux-x86_64/egg/knit
copying build/lib/knit/_version.py -> build/bdist.linux-x86_64/egg/knit
copying build/lib/knit/__init__.py -> build/bdist.linux-x86_64/egg/knit
copying build/lib/knit/env.py -> build/bdist.linux-x86_64/egg/knit
creating build/bdist.linux-x86_64/egg/knit/java_libs
copying build/lib/knit/java_libs/knit-1.0-SNAPSHOT.jar -> build/bdist.linux-x86_64/egg/knit/java_libs
copying build/lib/knit/core.py -> build/bdist.linux-x86_64/egg/knit
copying build/lib/knit/yarn_api.py -> build/bdist.linux-x86_64/egg/knit
copying build/lib/knit/utils.py -> build/bdist.linux-x86_64/egg/knit
copying build/lib/knit/conf.py -> build/bdist.linux-x86_64/egg/knit
byte-compiling build/bdist.linux-x86_64/egg/knit/compatibility.py to compatibility.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/knit/dask_yarn.py to dask_yarn.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/knit/exceptions.py to exceptions.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/knit/_version.py to _version.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/knit/__init__.py to __init__.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/knit/env.py to env.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/knit/core.py to core.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/knit/yarn_api.py to yarn_api.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/knit/utils.py to utils.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/knit/conf.py to conf.cpython-36.pyc
creating build/bdist.linux-x86_64/egg/EGG-INFO
copying knit.egg-info/PKG-INFO -> build/bdist.linux-x86_64/egg/EGG-INFO
copying knit.egg-info/SOURCES.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying knit.egg-info/dependency_links.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying knit.egg-info/not-zip-safe -> build/bdist.linux-x86_64/egg/EGG-INFO
copying knit.egg-info/requires.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying knit.egg-info/top_level.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
creating 'dist/knit-0.2.2-py3.6.egg' and adding 'build/bdist.linux-x86_64/egg' to it
removing 'build/bdist.linux-x86_64/egg' (and everything under it)
Processing knit-0.2.2-py3.6.egg
creating /home/mrocklin/anaconda/lib/python3.6/site-packages/knit-0.2.2-py3.6.egg
Extracting knit-0.2.2-py3.6.egg to /home/mrocklin/anaconda/lib/python3.6/site-packages
Adding knit 0.2.2 to easy-install.pth file

Installed /home/mrocklin/anaconda/lib/python3.6/site-packages/knit-0.2.2-py3.6.egg
Processing dependencies for knit==0.2.2
Searching for py4j==0.10.4
Best match: py4j 0.10.4
Adding py4j 0.10.4 to easy-install.pth file

Using /home/mrocklin/anaconda/lib/python3.6/site-packages
Searching for requests==2.12.4
Best match: requests 2.12.4
Adding requests 2.12.4 to easy-install.pth file

Using /home/mrocklin/anaconda/lib/python3.6/site-packages
Searching for lxml==3.7.2
Best match: lxml 3.7.2
Adding lxml 3.7.2 to easy-install.pth file

Using /home/mrocklin/anaconda/lib/python3.6/site-packages
Finished processing dependencies for knit==0.2.2
mrocklin@workstation:~/workspace/knit$ ipython
Python 3.6.0 |Anaconda custom (64-bit)| (default, Dec 23 2016, 12:22:00) 
Type "copyright", "credits" or "license" for more information.

IPython 5.1.0 -- An enhanced Interactive Python.
?         -> Introduction and overview of IPython's features.
%quickref -> Quick reference.
help      -> Python's own help system.
object?   -> Details about 'object', use 'object??' for extra details.

In [1]: from knit.dask_yarn import DaskYARNCluster

In [2]: cluster = DaskYARNCluster()

In [3]: cluster.start(4)
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/yarn/client/api/YarnClient
    at java.lang.Class.getDeclaredMethods0(Native Method)
    at java.lang.Class.privateGetDeclaredMethods(Class.java:2701)
    at java.lang.Class.privateGetMethodRecursive(Class.java:3048)
    at java.lang.Class.getMethod0(Class.java:3018)
    at java.lang.Class.getMethod(Class.java:1784)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:202)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.yarn.client.api.YarnClient
    at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    ... 6 more
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-3-610bf2513091> in <module>()
----> 1 cluster.start(4)

/home/mrocklin/workspace/knit/knit/dask_yarn.py in start(self, n_workers, cpus, memory)
    123         app_id = self.knit.start(command, env=self.env,
    124                                  num_containers=n_workers,
--> 125                                  virtual_cores=cpus, memory=memory)
    126         self.app_id = app_id
    127         return app_id

/home/mrocklin/workspace/knit/knit/core.py in start(self, cmd, num_containers, virtual_cores, memory, env, files, app_name, queue)
    208 
    209         if gateway_port is None:
--> 210             raise Exception("Java gateway process exited before sending the"
    211                             " driver its port number")
    212 

Exception: Java gateway process exited before sending the driver its port number

martindurant commented 7 years ago

To be sure, this is within the mdurant/hadoop container? It should have JAVA_HOME and the classpath set; I've never seen anything like this!

mrocklin commented 7 years ago

YARN is running within that container, but my IPython process is not.

martindurant commented 7 years ago

Until Java is kicked out of the client (in the distant future), you need to run on the "edge" node, with access to the YARN classes. It would be nice to be able to run as a remote client, but that isn't possible yet.
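A quick way to confirm you are on a suitable node is to check that the Hadoop/YARN jars resolve there (a sketch; assumes the hadoop CLI is on PATH):

```
import subprocess

# knit launches a JVM that needs the YARN client classes; `hadoop classpath`
# prints the jars that JVM would see, and fails if Hadoop isn't installed.
print(subprocess.check_output(["hadoop", "classpath"]).decode())
```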

martindurant commented 7 years ago

Suggestions of where to put such information (once you have things working) would be greatly appreciated.

mrocklin commented 7 years ago

Now running in the container:

In [1]: from knit.dask_yarn import DaskYARNCluster

In [2]: cluster = DaskYARNCluster()

In [3]: cluster.start()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-3-91c70aa40c2d> in <module>()
----> 1 cluster.start()

TypeError: start() missing 1 required positional argument: 'n_workers'

In [4]: cluster.start(4)
2017-09-13 12:21:09,118 - knit.env - INFO - Creating new env dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58
2017-09-13 12:21:09,118 - knit.env - INFO - /opt/conda/bin/conda create -p /home/knit/knit/tmp_conda/envs/dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58 -y -q dask>=0.14 distributed>=1.16
17/09/13 12:21:50 INFO knit.Client$: Starting Application Master
17/09/13 12:21:51 INFO hdfs.DFSClient: Cannot get delegation token from yarn
17/09/13 12:21:51 INFO knit.Utils$: Setting Replication Factor to: 3
17/09/13 12:21:51 INFO knit.Utils$: Attemping upload of /home/knit/knit/java_libs/knit-1.0-SNAPSHOT.jar to /user/root/.knitDeps
17/09/13 12:21:52 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
17/09/13 12:21:52 INFO knit.Utils$: Attemping upload of /home/knit/knit/tmp_conda/envs/dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58.zip to hdfs://0.0.0.0:8020/user/root/.knitDeps
17/09/13 12:22:08 INFO knit.Client$: Submitting application application_1505225287100_0003
17/09/13 12:22:08 INFO impl.YarnClientImpl: Submitted application application_1505225287100_0003

---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-4-610bf2513091> in <module>()
----> 1 cluster.start(4)

/home/knit/knit/dask_yarn.py in start(self, n_workers, cpus, memory)
    123         app_id = self.knit.start(command, env=self.env,
    124                                  num_containers=n_workers,
--> 125                                  virtual_cores=cpus, memory=memory)
    126         self.app_id = app_id
    127         return app_id

/home/knit/knit/core.py in start(self, cmd, num_containers, virtual_cores, memory, env, files, app_name, queue)
    228 
    229         if master_rpcport == -1:
--> 230             raise Exception("YARN master container did not report back")
    231         master_rpchost = self.client.masterRPCHost()
    232 

Exception: YARN master container did not report back

In [5]: 

In [5]: cluster
Out[5]: <knit.dask_yarn.DaskYARNCluster at 0x7f537274f9e8>

In [6]: cluster.wo17/09/13 12:23:56 INFO knit.Client$: Getting containers for application_attempt_id { application_id { id: 3 cluster_timestamp: 1505225287100 } attemptId: 1 } host: "N/A" rpc_port: -1 tracking_url: "http://73187ceb633b:8088/proxy/application_1505225287100_0003/" diagnostics: "[Wed Sep 13 12:22:08 +0000 2017] Application is added to the scheduler and is not yet activated. Skipping AM assignment as cluster resource is empty.  Details : AM Partition = <DEFAULT_PARTITION>; AM Resource Request = <memory:1024, vCores:1>; Queue Resource Limit for AM = <memory:0, vCores:0>; User AM Resource Limit of the queue = <memory:0, vCores:0>; Queue AM Resource Usage = <memory:0, vCores:0>; " yarn_application_attempt_state: APP_ATTEMPT_SCHEDULED original_tracking_url: "N/A" startTime: 1505305328938 finishTime: 0
17/09/13 12:23:56 INFO knit.Client$: Container ID: 
17/09/13 12:23:56 INFO knit.Client$: Getting containers for application_attempt_id { application_id { id: 3 cluster_timestamp: 1505225287100 } attemptId: 1 } host: "N/A" rpc_port: -1 tracking_url: "http://73187ceb633b:8088/proxy/application_1505225287100_0003/" diagnostics: "[Wed Sep 13 12:22:08 +0000 2017] Application is added to the scheduler and is not yet activated. Skipping AM assignment as cluster resource is empty.  Details : AM Partition = <DEFAULT_PARTITION>; AM Resource Request = <memory:1024, vCores:1>; Queue Resource Limit for AM = <memory:0, vCores:0>; User AM Resource Limit of the queue = <memory:0, vCores:0>; Queue AM Resource Usage = <memory:0, vCores:0>; " yarn_application_attempt_state: APP_ATTEMPT_SCHEDULED original_tracking_url: "N/A" startTime: 1505305328938 finishTime: 0
17/09/13 12:23:56 INFO knit.Client$: Container ID: 
In [6]: cluster.workers
17/09/13 12:23:58 INFO knit.Client$: Getting containers for application_attempt_id { application_id { id: 3 cluster_timestamp: 1505225287100 } attemptId: 1 } host: "N/A" rpc_port: -1 tracking_url: "http://73187ceb633b:8088/proxy/application_1505225287100_0003/" diagnostics: "[Wed Sep 13 12:22:08 +0000 2017] Application is added to the scheduler and is not yet activated. Skipping AM assignment as cluster resource is empty.  Details : AM Partition = <DEFAULT_PARTITION>; AM Resource Request = <memory:1024, vCores:1>; Queue Resource Limit for AM = <memory:0, vCores:0>; User AM Resource Limit of the queue = <memory:0, vCores:0>; Queue AM Resource Usage = <memory:0, vCores:0>; " yarn_application_attempt_state: APP_ATTEMPT_SCHEDULED original_tracking_url: "N/A" startTime: 1505305328938 finishTime: 0
17/09/13 12:23:58 INFO knit.Client$: Container ID: 
Out[6]: []
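(For reference, the diagnostics string in the logs above, "Application is added to the scheduler and is not yet activated. Skipping AM assignment as cluster resource is empty", can also be fetched directly from the ResourceManager REST API. A sketch, assuming the RM web port 8088 as used in this thread:)

```
import requests

# The RM reports per-application state and diagnostics; the application id
# is the one printed by knit above.
app_id = "application_1505225287100_0003"
app = requests.get(
    "http://localhost:8088/ws/v1/cluster/apps/%s" % app_id).json()["app"]
print(app["state"], app["diagnostics"])
```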

martindurant commented 7 years ago

How I test:


docker run -d -v /Users/mdurant/code/:/code mdurant/hadoop
docker exec -it 5bcfd2e5d9fd bash
cd /code/knit
conda install -y -q dask distributed
conda install -y -q -c conda-forge lxml py4j
python setup.py install mvn
ipython

In [1]: from knit import DaskYARNCluster
In [2]: cluster = DaskYARNCluster()
In [3]: cluster.start(2, memory=256, cpus=1)   # guaranteed not to exhaust my docker VM
2017-09-13 13:22:46,672 - knit.env - INFO - Creating new env dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58
2017-09-13 13:22:46,672 - knit.env - INFO - /opt/conda/bin/conda create -p /code/knit/knit/tmp_conda/envs/dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58 -y -q dask>=0.14 distributed>=1.16
17/09/13 13:24:57 INFO knit.Client$: Starting Application Master
17/09/13 13:24:59 INFO hdfs.DFSClient: Cannot get delegation token from yarn
17/09/13 13:24:59 INFO knit.Utils$: Setting Replication Factor to: 3
17/09/13 13:24:59 INFO knit.Utils$: Attemping upload of /code/knit/knit/java_libs/knit-1.0-SNAPSHOT.jar to /user/root/.knitDeps
17/09/13 13:25:00 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
17/09/13 13:25:00 INFO knit.Utils$: Attemping upload of /code/knit/knit/tmp_conda/envs/dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58.zip to hdfs://0.0.0.0:8020/user/root/.knitDeps
17/09/13 13:25:07 INFO knit.Client$: Submitting application application_1505308493094_0001
17/09/13 13:25:08 INFO impl.YarnClientImpl: Submitted application application_1505308493094_0001
Out[3]: 'application_1505308493094_0001'
In [4]: from dask.distributed import Client
In [5]: c = Client(cluster)
In [6]: c
Out[6]: <Client: scheduler='tcp://172.17.0.2:43117' processes=2 cores=2>

(`conda install -y -q -c conda-forge lxml py4j; python setup.py install mvn` can be replaced with `conda install knit -c conda-forge`, and then there's no need to mount the source in the docker container)

mrocklin commented 7 years ago

mrocklin@workstation:~$ docker run -p 8020:8020 -p 8088:8088 mdurant/hadoop
 * Restarting OpenBSD Secure Shell server sshd
   ...done.
Generating public/private dsa key pair.
Created directory '/root/.ssh'.
Your identification has been saved in /root/.ssh/id_dsa.
Your public key has been saved in /root/.ssh/id_dsa.pub.
...
# localhost SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8
# localhost SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8
# 0.0.0.0 SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8
# 0.0.0.0 SSH-2.0-OpenSSH_6.6.1p1 Ubuntu-2ubuntu2.8
Starting namenodes on [0.0.0.0]
0.0.0.0: starting namenode, logging to /opt/hadoop/logs/hadoop-root-namenode-abb365839374.out
localhost: starting datanode, logging to /opt/hadoop/logs/hadoop-root-datanode-abb365839374.out
Starting secondary namenodes [0.0.0.0]
0.0.0.0: starting secondarynamenode, logging to /opt/hadoop/logs/hadoop-root-secondarynamenode-abb365839374.out
starting yarn daemons
starting resourcemanager, logging to /opt/hadoop/logs/yarn--resourcemanager-abb365839374.out
localhost: starting nodemanager, logging to /opt/hadoop/logs/yarn-root-nodemanager-abb365839374.out
mrocklin@workstation:~$ docker exec -it elegant_franklin bash
root@abb365839374:/# cd /code/knit
bash: cd: /code/knit: No such file or directory
root@abb365839374:/# ls
bin  boot  dev  etc  home  lib  lib64  media  mnt  opt  proc  root  run  sbin  srv  sys  tmp  usr  var
root@abb365839374:/# cd home/
root@abb365839374:/home# ls
root@abb365839374:/home# git clone https://github.com/dask/knit.git
Cloning into 'knit'...
remote: Counting objects: 1446, done.
remote: Compressing objects: 100% (90/90), done.
remote: Total 1446 (delta 58), reused 88 (delta 32), pack-reused 1318
Receiving objects: 100% (1446/1446), 267.88 KiB | 0 bytes/s, done.
Resolving deltas: 100% (745/745), done.
Checking connectivity... done.
root@abb365839374:/home# cd knit
root@abb365839374:/home/knit# conda install -y -q dask distributed

Package plan for installation in environment /opt/conda:

The following NEW packages will be INSTALLED:

    bkcharts:         0.2-py36_0               
    bokeh:            0.12.7-py36_0            
    click:            6.7-py36_0               
    cloudpickle:      0.4.0-py36_0             
    dask:             0.15.2-py36_0            
    distributed:      1.18.1-py36_0            
    heapdict:         1.0.0-py36_1             
    locket:           0.2.0-py36_1             
    mkl:              2017.0.3-0               
    msgpack-python:   0.4.8-py36_0             
    numpy:            1.13.1-py36_0            
    pandas:           0.20.3-py36_0            
    partd:            0.3.8-py36_0             
    psutil:           5.2.2-py36_0             
    sortedcontainers: 1.5.7-py36_0             
    tblib:            1.3.2-py36_0             
    toolz:            0.8.2-py36_0             
    tornado:          4.5.2-py36_0             
    zict:             0.1.2-py36_0             

The following packages will be UPDATED:

    conda:            4.3.23-py36_0 conda-forge --> 4.3.25-py36_0

The following packages will be SUPERSEDED by a higher-priority channel:

    conda-env:        2.6.0-0       conda-forge --> 2.6.0-0      

root@abb365839374:/home/knit# conda install -y -q -c conda-forge lxml py4j

Package plan for installation in environment /opt/conda:

The following NEW packages will be INSTALLED:

    libxslt:   1.1.29-5      conda-forge
    lxml:      3.8.0-py36_0  conda-forge
    py4j:      0.10.6-py36_1 conda-forge

The following packages will be SUPERSEDED by a higher-priority channel:

    conda:     4.3.25-py36_0             --> 4.3.23-py36_0 conda-forge
    conda-env: 2.6.0-0                   --> 2.6.0-0       conda-forge

root@abb365839374:/home/knit# python setup.py install mvn
running install
running bdist_egg
running egg_info
creating knit.egg-info
writing knit.egg-info/PKG-INFO
writing dependency_links to knit.egg-info/dependency_links.txt
writing requirements to knit.egg-info/requires.txt
writing top-level names to knit.egg-info/top_level.txt
writing manifest file 'knit.egg-info/SOURCES.txt'
reading manifest file 'knit.egg-info/SOURCES.txt'
reading manifest template 'MANIFEST.in'
no previously-included directories found matching 'knit/tmp_conda'
writing manifest file 'knit.egg-info/SOURCES.txt'
installing library code to build/bdist.linux-x86_64/egg
running install_lib
running build_py
creating build
creating build/lib
creating build/lib/knit
copying knit/compatibility.py -> build/lib/knit
copying knit/dask_yarn.py -> build/lib/knit
copying knit/exceptions.py -> build/lib/knit
copying knit/__init__.py -> build/lib/knit
copying knit/env.py -> build/lib/knit
copying knit/core.py -> build/lib/knit
copying knit/yarn_api.py -> build/lib/knit
copying knit/utils.py -> build/lib/knit
copying knit/conf.py -> build/lib/knit
creating build/lib/knit/java_libs
copying knit/java_libs/knit-1.0-SNAPSHOT.jar -> build/lib/knit/java_libs
creating build/bdist.linux-x86_64
creating build/bdist.linux-x86_64/egg
creating build/bdist.linux-x86_64/egg/knit
copying build/lib/knit/compatibility.py -> build/bdist.linux-x86_64/egg/knit
copying build/lib/knit/dask_yarn.py -> build/bdist.linux-x86_64/egg/knit
copying build/lib/knit/exceptions.py -> build/bdist.linux-x86_64/egg/knit
copying build/lib/knit/__init__.py -> build/bdist.linux-x86_64/egg/knit
copying build/lib/knit/env.py -> build/bdist.linux-x86_64/egg/knit
creating build/bdist.linux-x86_64/egg/knit/java_libs
copying build/lib/knit/java_libs/knit-1.0-SNAPSHOT.jar -> build/bdist.linux-x86_64/egg/knit/java_libs
copying build/lib/knit/core.py -> build/bdist.linux-x86_64/egg/knit
copying build/lib/knit/yarn_api.py -> build/bdist.linux-x86_64/egg/knit
copying build/lib/knit/utils.py -> build/bdist.linux-x86_64/egg/knit
copying build/lib/knit/conf.py -> build/bdist.linux-x86_64/egg/knit
byte-compiling build/bdist.linux-x86_64/egg/knit/compatibility.py to compatibility.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/knit/dask_yarn.py to dask_yarn.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/knit/exceptions.py to exceptions.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/knit/__init__.py to __init__.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/knit/env.py to env.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/knit/core.py to core.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/knit/yarn_api.py to yarn_api.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/knit/utils.py to utils.cpython-36.pyc
byte-compiling build/bdist.linux-x86_64/egg/knit/conf.py to conf.cpython-36.pyc
creating build/bdist.linux-x86_64/egg/EGG-INFO
copying knit.egg-info/PKG-INFO -> build/bdist.linux-x86_64/egg/EGG-INFO
copying knit.egg-info/SOURCES.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying knit.egg-info/dependency_links.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying knit.egg-info/not-zip-safe -> build/bdist.linux-x86_64/egg/EGG-INFO
copying knit.egg-info/requires.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
copying knit.egg-info/top_level.txt -> build/bdist.linux-x86_64/egg/EGG-INFO
creating dist
creating 'dist/knit-0.2.2-py3.6.egg' and adding 'build/bdist.linux-x86_64/egg' to it
removing 'build/bdist.linux-x86_64/egg' (and everything under it)
Processing knit-0.2.2-py3.6.egg
creating /opt/conda/lib/python3.6/site-packages/knit-0.2.2-py3.6.egg
Extracting knit-0.2.2-py3.6.egg to /opt/conda/lib/python3.6/site-packages
Adding knit 0.2.2 to easy-install.pth file

Installed /opt/conda/lib/python3.6/site-packages/knit-0.2.2-py3.6.egg
Processing dependencies for knit==0.2.2
Searching for py4j==0.10.6
Best match: py4j 0.10.6
Adding py4j 0.10.6 to easy-install.pth file

Using /opt/conda/lib/python3.6/site-packages
Searching for requests==2.14.2
Best match: requests 2.14.2
Adding requests 2.14.2 to easy-install.pth file

Using /opt/conda/lib/python3.6/site-packages
Searching for lxml==3.8.0
Best match: lxml 3.8.0
Adding lxml 3.8.0 to easy-install.pth file

Using /opt/conda/lib/python3.6/site-packages
Finished processing dependencies for knit==0.2.2
root@abb365839374:/home/knit# ipython
Python 3.6.1 |Continuum Analytics, Inc.| (default, May 11 2017, 13:09:58) 
Type 'copyright', 'credits' or 'license' for more information
IPython 6.1.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: from knit import DaskYARNCluster

In [2]: cluster = DaskYARNCluster()

In [3]: cluster.start(2, memory=256, cpus=1)
2017-09-13 16:38:02,388 - knit.env - INFO - Creating new env dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58
2017-09-13 16:38:02,388 - knit.env - INFO - /opt/conda/bin/conda create -p /home/knit/knit/tmp_conda/envs/dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58 -y -q dask>=0.14 distributed>=1.16
17/09/13 16:39:24 INFO knit.Client$: Starting Application Master
17/09/13 16:39:25 INFO hdfs.DFSClient: Cannot get delegation token from yarn
17/09/13 16:39:25 INFO knit.Utils$: Setting Replication Factor to: 3
17/09/13 16:39:25 INFO knit.Utils$: Attemping upload of /home/knit/knit/java_libs/knit-1.0-SNAPSHOT.jar to /user/root/.knitDeps
17/09/13 16:39:26 INFO client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
17/09/13 16:39:26 INFO knit.Utils$: Attemping upload of /home/knit/knit/tmp_conda/envs/dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58.zip to hdfs://0.0.0.0:8020/user/root/.knitDeps
17/09/13 16:40:12 INFO knit.Client$: Submitting application application_1505318878467_0001
17/09/13 16:40:13 INFO impl.YarnClientImpl: Submitted application application_1505318878467_0001
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
<ipython-input-3-31269a53dae8> in <module>()
----> 1 cluster.start(2, memory=256, cpus=1)

/home/knit/knit/dask_yarn.py in start(self, n_workers, cpus, memory)
    123         app_id = self.knit.start(command, env=self.env,
    124                                  num_containers=n_workers,
--> 125                                  virtual_cores=cpus, memory=memory)
    126         self.app_id = app_id
    127         return app_id

/home/knit/knit/core.py in start(self, cmd, num_containers, virtual_cores, memory, env, files, app_name, queue)
    228 
    229         if master_rpcport == -1:
--> 230             raise Exception("YARN master container did not report back")
    231         master_rpchost = self.client.masterRPCHost()
    232 

Exception: YARN master container did not report back

In [4]: 

martindurant commented 7 years ago

I run the identical code, and succeed. My VM is set to 4 CPUs and 3.8GB, and after starting dask as above, free reports 1594MB available. Unfortunately, YARN's introspection of the system (available as cluster.knit.yarn_api.cluster_metrics()) seems to misidentify the amount of available RAM; for me it gives 8GB. I am surprised that you seem to get stuck before even starting the AM; that means there are not even any logs we could get. Can you check memory with free and disk with df -h and df -h -i?
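(For concreteness, a sketch of checking that figure yourself; cluster_metrics() is the method named above, and the exact shape of its return value may differ:)

```
metrics = cluster.knit.yarn_api.cluster_metrics()
print(metrics)  # compare the memory YARN reports against what `free` says
```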

mrocklin commented 7 years ago

I'm not sure what I'm looking for here:

root@73d7b672700a:/# free -h
             total       used       free     shared    buffers     cached
Mem:           15G        15G       322M       818M       136M       2.2G
-/+ buffers/cache:        12G       2.6G
Swap:          15G       414M        15G
root@73d7b672700a:/# df -h
Filesystem      Size  Used Avail Use% Mounted on
none            453G  404G   27G  94% /
tmpfs           7.8G     0  7.8G   0% /dev
tmpfs           7.8G     0  7.8G   0% /sys/fs/cgroup
/dev/dm-1       453G  404G   27G  94% /etc/hosts
shm              64M     0   64M   0% /dev/shm
tmpfs           7.8G     0  7.8G   0% /sys/firmware
root@73d7b672700a:/# df -h -i
Filesystem     Inodes IUsed IFree IUse% Mounted on
none              29M  3.6M   26M   13% /
tmpfs            2.0M    16  2.0M    1% /dev
tmpfs            2.0M    11  2.0M    1% /sys/fs/cgroup
/dev/dm-1         29M  3.6M   26M   13% /etc/hosts
shm              2.0M     1  2.0M    1% /dev/shm
tmpfs            2.0M     1  2.0M    1% /sys/firmware

martindurant commented 7 years ago

Can you look for errors in /opt/hadoop/logs/yarn--resourcemanager-*.out, /opt/hadoop/logs/yarn--resourcemanager-*.log, /opt/hadoop/logs/yarn-root-nodemanager-*.out, and /opt/hadoop/logs/yarn-root-nodemanager-*.log?

mrocklin commented 7 years ago

This one has an error message:

```
root@2bfb069c9357:/# cat /opt/hadoop/logs/yarn-root-nodemanager-*.log
2017-09-13 17:37:43,752 INFO org.apache.hadoop.yarn.server.nodemanager.NodeManager: STARTUP_MSG:
/************************************************************
STARTUP_MSG: Starting NodeManager
STARTUP_MSG: user = root
STARTUP_MSG: host = 2bfb069c9357/172.17.0.1
STARTUP_MSG: args = []
STARTUP_MSG: version = 2.8.1
STARTUP_MSG: classpath = /opt/hadoop/etc/hadoop:/opt/hadoop/etc/hadoop:/opt/hadoop/etc/hadoop:/opt/hadoop/share/hadoop/common/lib/zookeeper-3.4.6.jar:/opt/hadoop/share/hadoop/common/lib/jaxb-api-2.2.2.jar:/opt/hadoop/share/hadoop/common/lib/httpcore-4.4.4.jar:/opt/hadoop/share/hadoop/common/lib/commons-collections-3.2.2.jar:/opt/hadoop/share/hadoop/common/lib/mockito-all-1.8.5.jar:/opt/hadoop/share/hadoop/common/lib/jackson-core-asl-1.9.13.jar:/opt/hadoop/share/hadoop/common/lib/slf4j-api-1.7.10.jar:/opt/hadoop/share/hadoop/common/lib/paranamer-2.3.jar:/opt/hadoop/share/hadoop/common/lib/java-xmlbuilder-0.4.jar:/opt/hadoop/share/hadoop/common/lib/curator-recipes-2.7.1.jar:/opt/hadoop/share/hadoop/common/lib/xmlenc-0.52.jar:/opt/hadoop/share/hadoop/common/lib/jsch-0.1.51.jar:/opt/hadoop/share/hadoop/common/lib/curator-client-2.7.1.jar:/opt/hadoop/share/hadoop/common/lib/commons-math3-3.1.1.jar:/opt/hadoop/share/hadoop/common/lib/jsr305-3.0.0.jar:/opt/hadoop/share/hadoop/common/lib/json-smart-1.1.1.jar:/opt/hadoop/share/hadoop/common/lib/stax-api-1.0-2.jar:/opt/hadoop/share/hadoop/common/lib/junit-4.11.jar:/opt/hadoop/share/hadoop/common/lib/avro-1.7.4.jar:/opt/hadoop/share/hadoop/common/lib/commons-beanutils-core-1.8.0.jar:/opt/hadoop/share/hadoop/common/lib/jersey-server-1.9.jar:/opt/hadoop/share/hadoop/common/lib/commons-logging-1.1.3.jar:/opt/hadoop/share/hadoop/common/lib/jersey-json-1.9.jar:/opt/hadoop/share/hadoop/common/lib/guava-11.0.2.jar:/opt/hadoop/share/hadoop/common/lib/jackson-mapper-asl-1.9.13.jar:/opt/hadoop/share/hadoop/common/lib/commons-beanutils-1.7.0.jar:/opt/hadoop/share/hadoop/common/lib/jersey-core-1.9.jar:/opt/hadoop/share/hadoop/common/lib/commons-io-2.4.jar:/opt/hadoop/share/hadoop/common/lib/nimbus-jose-jwt-3.9.jar:/opt/hadoop/share/hadoop/common/lib/commons-digester-1.8.jar:/opt/hadoop/share/hadoop/common/lib/jetty-util-6.1.26.jar:/opt/hadoop/share/hadoop/common/lib/commons-compress-1.4.1.jar:/opt/hadoop/share/hadoop/common/lib/api-asn1-api-1.0.0-M20.jar:/opt/hadoop/share/hadoop/common/lib/jcip-annotations-1.0.jar:/opt/hadoop/share/hadoop/common/lib/slf4j-log4j12-1.7.10.jar:/opt/hadoop/share/hadoop/common/lib/apacheds-i18n-2.0.0-M15.jar:/opt/hadoop/share/hadoop/common/lib/log4j-1.2.17.jar:/opt/hadoop/share/hadoop/common/lib/gson-2.2.4.jar:/opt/hadoop/share/hadoop/common/lib/jets3t-0.9.0.jar:/opt/hadoop/share/hadoop/common/lib/netty-3.6.2.Final.jar:/opt/hadoop/share/hadoop/common/lib/snappy-java-1.0.4.1.jar:/opt/hadoop/share/hadoop/common/lib/jackson-jaxrs-1.9.13.jar:/opt/hadoop/share/hadoop/common/lib/hadoop-annotations-2.8.1.jar:/opt/hadoop/share/hadoop/common/lib/jsp-api-2.1.jar:/opt/hadoop/share/hadoop/common/lib/commons-configuration-1.6.jar:/opt/hadoop/share/hadoop/common/lib/activation-1.1.jar:/opt/hadoop/share/hadoop/common/lib/apacheds-kerberos-codec-2.0.0-M15.jar:/opt/hadoop/share/hadoop/common/lib/curator-framework-2.7.1.jar:/opt/hadoop/share/hadoop/common/lib/jackson-xc-1.9.13.jar:/opt/hadoop/share/hadoop/common/lib/jaxb-impl-2.2.3-1.jar:/opt/hadoop/share/hadoop/common/lib/servlet-api-2.5.jar:/opt/hadoop/share/hadoop/common/lib/jetty-sslengine-6.1.26.jar:/opt/hadoop/share
/hadoop/common/lib/jetty-6.1.26.jar:/opt/hadoop/share/hadoop/common/lib/commons-cli-1.2.jar:/opt/hadoop/share/hadoop/common/lib/jettison-1.1.jar:/opt/hadoop/share/hadoop/common/lib/commons-lang-2.6.jar:/opt/hadoop/share/hadoop/common/lib/protobuf-java-2.5.0.jar:/opt/hadoop/share/hadoop/common/lib/htrace-core4-4.0.1-incubating.jar:/opt/hadoop/share/hadoop/common/lib/hadoop-auth-2.8.1.jar:/opt/hadoop/share/hadoop/common/lib/xz-1.0.jar:/opt/hadoop/share/hadoop/common/lib/commons-codec-1.4.jar:/opt/hadoop/share/hadoop/common/lib/hamcrest-core-1.3.jar:/opt/hadoop/share/hadoop/common/lib/httpclient-4.5.2.jar:/opt/hadoop/share/hadoop/common/lib/asm-3.2.jar:/opt/hadoop/share/hadoop/common/lib/api-util-1.0.0-M20.jar:/opt/hadoop/share/hadoop/common/lib/commons-net-3.1.jar:/opt/hadoop/share/hadoop/common/hadoop-common-2.8.1-tests.jar:/opt/hadoop/share/hadoop/common/hadoop-nfs-2.8.1.jar:/opt/hadoop/share/hadoop/common/hadoop-common-2.8.1.jar:/opt/hadoop/share/hadoop/hdfs:/opt/hadoop/share/hadoop/hdfs/lib/jackson-core-asl-1.9.13.jar:/opt/hadoop/share/hadoop/hdfs/lib/xmlenc-0.52.jar:/opt/hadoop/share/hadoop/hdfs/lib/jsr305-3.0.0.jar:/opt/hadoop/share/hadoop/hdfs/lib/leveldbjni-all-1.8.jar:/opt/hadoop/share/hadoop/hdfs/lib/jersey-server-1.9.jar:/opt/hadoop/share/hadoop/hdfs/lib/commons-logging-1.1.3.jar:/opt/hadoop/share/hadoop/hdfs/lib/guava-11.0.2.jar:/opt/hadoop/share/hadoop/hdfs/lib/jackson-mapper-asl-1.9.13.jar:/opt/hadoop/share/hadoop/hdfs/lib/jersey-core-1.9.jar:/opt/hadoop/share/hadoop/hdfs/lib/commons-io-2.4.jar:/opt/hadoop/share/hadoop/hdfs/lib/jetty-util-6.1.26.jar:/opt/hadoop/share/hadoop/hdfs/lib/xml-apis-1.3.04.jar:/opt/hadoop/share/hadoop/hdfs/lib/log4j-1.2.17.jar:/opt/hadoop/share/hadoop/hdfs/lib/okio-1.4.0.jar:/opt/hadoop/share/hadoop/hdfs/lib/netty-3.6.2.Final.jar:/opt/hadoop/share/hadoop/hdfs/lib/xercesImpl-2.9.1.jar:/opt/hadoop/share/hadoop/hdfs/lib/servlet-api-2.5.jar:/opt/hadoop/share/hadoop/hdfs/lib/jetty-6.1.26.jar:/opt/hadoop/share/hadoop/hdfs/lib/commons-cli-1.2.jar:/opt/hadoop/share/hadoop/hdfs/lib/netty-all-4.0.23.Final.jar:/opt/hadoop/share/hadoop/hdfs/lib/commons-lang-2.6.jar:/opt/hadoop/share/hadoop/hdfs/lib/protobuf-java-2.5.0.jar:/opt/hadoop/share/hadoop/hdfs/lib/htrace-core4-4.0.1-incubating.jar:/opt/hadoop/share/hadoop/hdfs/lib/okhttp-2.4.0.jar:/opt/hadoop/share/hadoop/hdfs/lib/commons-codec-1.4.jar:/opt/hadoop/share/hadoop/hdfs/lib/hadoop-hdfs-client-2.8.1.jar:/opt/hadoop/share/hadoop/hdfs/lib/commons-daemon-1.0.13.jar:/opt/hadoop/share/hadoop/hdfs/lib/asm-3.2.jar:/opt/hadoop/share/hadoop/hdfs/hadoop-hdfs-2.8.1-tests.jar:/opt/hadoop/share/hadoop/hdfs/hadoop-hdfs-native-client-2.8.1.jar:/opt/hadoop/share/hadoop/hdfs/hadoop-hdfs-2.8.1.jar:/opt/hadoop/share/hadoop/hdfs/hadoop-hdfs-client-2.8.1-tests.jar:/opt/hadoop/share/hadoop/hdfs/hadoop-hdfs-native-client-2.8.1-tests.jar:/opt/hadoop/share/hadoop/hdfs/hadoop-hdfs-nfs-2.8.1.jar:/opt/hadoop/share/hadoop/hdfs/hadoop-hdfs-client-2.8.1.jar:/opt/hadoop/share/hadoop/yarn/lib/zookeeper-3.4.6.jar:/opt/hadoop/share/hadoop/yarn/lib/jaxb-api-2.2.2.jar:/opt/hadoop/share/hadoop/yarn/lib/commons-collections-3.2.2.jar:/opt/hadoop/share/hadoop/yarn/lib/jackson-core-asl-1.9.13.jar:/opt/hadoop/share/hadoop/yarn/lib/guice-servlet-3.0.jar:/opt/hadoop/share/hadoop/yarn/lib/curator-client-2.7.1.jar:/opt/hadoop/share/hadoop/yarn/lib/jsr305-3.0.0.jar:/opt/hadoop/share/hadoop/yarn/lib/stax-api-1.0-2.jar:/opt/hadoop/share/hadoop/yarn/lib/leveldbjni-all-1.8.jar:/opt/hadoop/share/hadoop/yarn/lib/jersey-server-1.9.jar:/opt/hadoop/share/hadoop/yarn/li
b/commons-logging-1.1.3.jar:/opt/hadoop/share/hadoop/yarn/lib/jersey-json-1.9.jar:/opt/hadoop/share/hadoop/yarn/lib/guava-11.0.2.jar:/opt/hadoop/share/hadoop/yarn/lib/objenesis-2.1.jar:/opt/hadoop/share/hadoop/yarn/lib/jackson-mapper-asl-1.9.13.jar:/opt/hadoop/share/hadoop/yarn/lib/jersey-core-1.9.jar:/opt/hadoop/share/hadoop/yarn/lib/commons-io-2.4.jar:/opt/hadoop/share/hadoop/yarn/lib/jetty-util-6.1.26.jar:/opt/hadoop/share/hadoop/yarn/lib/commons-compress-1.4.1.jar:/opt/hadoop/share/hadoop/yarn/lib/curator-test-2.7.1.jar:/opt/hadoop/share/hadoop/yarn/lib/log4j-1.2.17.jar:/opt/hadoop/share/hadoop/yarn/lib/netty-3.6.2.Final.jar:/opt/hadoop/share/hadoop/yarn/lib/jackson-jaxrs-1.9.13.jar:/opt/hadoop/share/hadoop/yarn/lib/activation-1.1.jar:/opt/hadoop/share/hadoop/yarn/lib/jackson-xc-1.9.13.jar:/opt/hadoop/share/hadoop/yarn/lib/jaxb-impl-2.2.3-1.jar:/opt/hadoop/share/hadoop/yarn/lib/servlet-api-2.5.jar:/opt/hadoop/share/hadoop/yarn/lib/jetty-6.1.26.jar:/opt/hadoop/share/hadoop/yarn/lib/commons-cli-1.2.jar:/opt/hadoop/share/hadoop/yarn/lib/jettison-1.1.jar:/opt/hadoop/share/hadoop/yarn/lib/javassist-3.18.1-GA.jar:/opt/hadoop/share/hadoop/yarn/lib/commons-math-2.2.jar:/opt/hadoop/share/hadoop/yarn/lib/javax.inject-1.jar:/opt/hadoop/share/hadoop/yarn/lib/jersey-guice-1.9.jar:/opt/hadoop/share/hadoop/yarn/lib/commons-lang-2.6.jar:/opt/hadoop/share/hadoop/yarn/lib/protobuf-java-2.5.0.jar:/opt/hadoop/share/hadoop/yarn/lib/zookeeper-3.4.6-tests.jar:/opt/hadoop/share/hadoop/yarn/lib/guice-3.0.jar:/opt/hadoop/share/hadoop/yarn/lib/xz-1.0.jar:/opt/hadoop/share/hadoop/yarn/lib/commons-codec-1.4.jar:/opt/hadoop/share/hadoop/yarn/lib/aopalliance-1.0.jar:/opt/hadoop/share/hadoop/yarn/lib/jersey-client-1.9.jar:/opt/hadoop/share/hadoop/yarn/lib/asm-3.2.jar:/opt/hadoop/share/hadoop/yarn/lib/fst-2.24.jar:/opt/hadoop/share/hadoop/yarn/hadoop-yarn-server-resourcemanager-2.8.1.jar:/opt/hadoop/share/hadoop/yarn/hadoop-yarn-api-2.8.1.jar:/opt/hadoop/share/hadoop/yarn/hadoop-yarn-server-common-2.8.1.jar:/opt/hadoop/share/hadoop/yarn/hadoop-yarn-server-timeline-pluginstorage-2.8.1.jar:/opt/hadoop/share/hadoop/yarn/hadoop-yarn-client-2.8.1.jar:/opt/hadoop/share/hadoop/yarn/hadoop-yarn-server-web-proxy-2.8.1.jar:/opt/hadoop/share/hadoop/yarn/hadoop-yarn-applications-unmanaged-am-launcher-2.8.1.jar:/opt/hadoop/share/hadoop/yarn/hadoop-yarn-server-nodemanager-2.8.1.jar:/opt/hadoop/share/hadoop/yarn/hadoop-yarn-server-tests-2.8.1.jar:/opt/hadoop/share/hadoop/yarn/hadoop-yarn-server-sharedcachemanager-2.8.1.jar:/opt/hadoop/share/hadoop/yarn/hadoop-yarn-registry-2.8.1.jar:/opt/hadoop/share/hadoop/yarn/hadoop-yarn-server-applicationhistoryservice-2.8.1.jar:/opt/hadoop/share/hadoop/yarn/hadoop-yarn-common-2.8.1.jar:/opt/hadoop/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.8.1.jar:/opt/hadoop/share/hadoop/mapreduce/lib/jackson-core-asl-1.9.13.jar:/opt/hadoop/share/hadoop/mapreduce/lib/guice-servlet-3.0.jar:/opt/hadoop/share/hadoop/mapreduce/lib/paranamer-2.3.jar:/opt/hadoop/share/hadoop/mapreduce/lib/junit-4.11.jar:/opt/hadoop/share/hadoop/mapreduce/lib/leveldbjni-all-1.8.jar:/opt/hadoop/share/hadoop/mapreduce/lib/avro-1.7.4.jar:/opt/hadoop/share/hadoop/mapreduce/lib/jersey-server-1.9.jar:/opt/hadoop/share/hadoop/mapreduce/lib/jackson-mapper-asl-1.9.13.jar:/opt/hadoop/share/hadoop/mapreduce/lib/jersey-core-1.9.jar:/opt/hadoop/share/hadoop/mapreduce/lib/commons-io-2.4.jar:/opt/hadoop/share/hadoop/mapreduce/lib/commons-compress-1.4.1.jar:/opt/hadoop/share/hadoop/mapreduce/lib/log4j-1.2.17.jar:/opt/hadoop/share/h
adoop/mapreduce/lib/netty-3.6.2.Final.jar:/opt/hadoop/share/hadoop/mapreduce/lib/snappy-java-1.0.4.1.jar:/opt/hadoop/share/hadoop/mapreduce/lib/hadoop-annotations-2.8.1.jar:/opt/hadoop/share/hadoop/mapreduce/lib/javax.inject-1.jar:/opt/hadoop/share/hadoop/mapreduce/lib/jersey-guice-1.9.jar:/opt/hadoop/share/hadoop/mapreduce/lib/protobuf-java-2.5.0.jar:/opt/hadoop/share/hadoop/mapreduce/lib/guice-3.0.jar:/opt/hadoop/share/hadoop/mapreduce/lib/xz-1.0.jar:/opt/hadoop/share/hadoop/mapreduce/lib/hamcrest-core-1.3.jar:/opt/hadoop/share/hadoop/mapreduce/lib/aopalliance-1.0.jar:/opt/hadoop/share/hadoop/mapreduce/lib/asm-3.2.jar:/opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.8.1.jar:/opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.8.1.jar:/opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-app-2.8.1.jar:/opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-common-2.8.1.jar:/opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-hs-2.8.1.jar:/opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.8.1-tests.jar:/opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-hs-plugins-2.8.1.jar:/opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-core-2.8.1.jar:/opt/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-shuffle-2.8.1.jar:/contrib/capacity-scheduler/*.jar:/contrib/capacity-scheduler/*.jar:/opt/hadoop/share/hadoop/yarn/hadoop-yarn-server-resourcemanager-2.8.1.jar:/opt/hadoop/share/hadoop/yarn/hadoop-yarn-api-2.8.1.jar:/opt/hadoop/share/hadoop/yarn/hadoop-yarn-server-common-2.8.1.jar:/opt/hadoop/share/hadoop/yarn/hadoop-yarn-server-timeline-pluginstorage-2.8.1.jar:/opt/hadoop/share/hadoop/yarn/hadoop-yarn-client-2.8.1.jar:/opt/hadoop/share/hadoop/yarn/hadoop-yarn-server-web-proxy-2.8.1.jar:/opt/hadoop/share/hadoop/yarn/hadoop-yarn-applications-unmanaged-am-launcher-2.8.1.jar:/opt/hadoop/share/hadoop/yarn/hadoop-yarn-server-nodemanager-2.8.1.jar:/opt/hadoop/share/hadoop/yarn/hadoop-yarn-server-tests-2.8.1.jar:/opt/hadoop/share/hadoop/yarn/hadoop-yarn-server-sharedcachemanager-2.8.1.jar:/opt/hadoop/share/hadoop/yarn/hadoop-yarn-registry-2.8.1.jar:/opt/hadoop/share/hadoop/yarn/hadoop-yarn-server-applicationhistoryservice-2.8.1.jar:/opt/hadoop/share/hadoop/yarn/hadoop-yarn-common-2.8.1.jar:/opt/hadoop/share/hadoop/yarn/hadoop-yarn-applications-distributedshell-2.8.1.jar:/opt/hadoop/share/hadoop/yarn/lib/zookeeper-3.4.6.jar:/opt/hadoop/share/hadoop/yarn/lib/jaxb-api-2.2.2.jar:/opt/hadoop/share/hadoop/yarn/lib/commons-collections-3.2.2.jar:/opt/hadoop/share/hadoop/yarn/lib/jackson-core-asl-1.9.13.jar:/opt/hadoop/share/hadoop/yarn/lib/guice-servlet-3.0.jar:/opt/hadoop/share/hadoop/yarn/lib/curator-client-2.7.1.jar:/opt/hadoop/share/hadoop/yarn/lib/jsr305-3.0.0.jar:/opt/hadoop/share/hadoop/yarn/lib/stax-api-1.0-2.jar:/opt/hadoop/share/hadoop/yarn/lib/leveldbjni-all-1.8.jar:/opt/hadoop/share/hadoop/yarn/lib/jersey-server-1.9.jar:/opt/hadoop/share/hadoop/yarn/lib/commons-logging-1.1.3.jar:/opt/hadoop/share/hadoop/yarn/lib/jersey-json-1.9.jar:/opt/hadoop/share/hadoop/yarn/lib/guava-11.0.2.jar:/opt/hadoop/share/hadoop/yarn/lib/objenesis-2.1.jar:/opt/hadoop/share/hadoop/yarn/lib/jackson-mapper-asl-1.9.13.jar:/opt/hadoop/share/hadoop/yarn/lib/jersey-core-1.9.jar:/opt/hadoop/share/hadoop/yarn/lib/commons-io-2.4.jar:/opt/hadoop/share/hadoop/yarn/lib/jetty-util-6.1.26.jar:/opt/hadoop/share/hadoop/yarn/lib/commons-compress-1.4.1.jar:/opt/hadoop/share/hadoop/yarn/lib/curator-test-2.7.1.jar:/opt/hadoop/share/hadoop/yarn/lib/log4j-1.2.17.jar
:/opt/hadoop/share/hadoop/yarn/lib/netty-3.6.2.Final.jar:/opt/hadoop/share/hadoop/yarn/lib/jackson-jaxrs-1.9.13.jar:/opt/hadoop/share/hadoop/yarn/lib/activation-1.1.jar:/opt/hadoop/share/hadoop/yarn/lib/jackson-xc-1.9.13.jar:/opt/hadoop/share/hadoop/yarn/lib/jaxb-impl-2.2.3-1.jar:/opt/hadoop/share/hadoop/yarn/lib/servlet-api-2.5.jar:/opt/hadoop/share/hadoop/yarn/lib/jetty-6.1.26.jar:/opt/hadoop/share/hadoop/yarn/lib/commons-cli-1.2.jar:/opt/hadoop/share/hadoop/yarn/lib/jettison-1.1.jar:/opt/hadoop/share/hadoop/yarn/lib/javassist-3.18.1-GA.jar:/opt/hadoop/share/hadoop/yarn/lib/commons-math-2.2.jar:/opt/hadoop/share/hadoop/yarn/lib/javax.inject-1.jar:/opt/hadoop/share/hadoop/yarn/lib/jersey-guice-1.9.jar:/opt/hadoop/share/hadoop/yarn/lib/commons-lang-2.6.jar:/opt/hadoop/share/hadoop/yarn/lib/protobuf-java-2.5.0.jar:/opt/hadoop/share/hadoop/yarn/lib/zookeeper-3.4.6-tests.jar:/opt/hadoop/share/hadoop/yarn/lib/guice-3.0.jar:/opt/hadoop/share/hadoop/yarn/lib/xz-1.0.jar:/opt/hadoop/share/hadoop/yarn/lib/commons-codec-1.4.jar:/opt/hadoop/share/hadoop/yarn/lib/aopalliance-1.0.jar:/opt/hadoop/share/hadoop/yarn/lib/jersey-client-1.9.jar:/opt/hadoop/share/hadoop/yarn/lib/asm-3.2.jar:/opt/hadoop/share/hadoop/yarn/lib/fst-2.24.jar:/opt/hadoop/etc/hadoop/nm-config/log4j.properties
STARTUP_MSG: build = https://git-wip-us.apache.org/repos/asf/hadoop.git -r 20fe5304904fc2f5a18053c389e43cd26f7a70fe; compiled by 'vinodkv' on 2017-06-02T06:14Z
STARTUP_MSG: java = 1.8.0_121
************************************************************/
2017-09-13 17:37:43,763 INFO org.apache.hadoop.yarn.server.nodemanager.NodeManager: registered UNIX signal handlers for [TERM, HUP, INT]
2017-09-13 17:37:44,922 INFO org.apache.hadoop.yarn.server.nodemanager.NodeManager: Node Manager health check script is not available or doesn't have execute permission, so not starting the node health script runner.
2017-09-13 17:37:45,023 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerEventType for class org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ContainerEventDispatcher
2017-09-13 17:37:45,024 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.nodemanager.containermanager.application.ApplicationEventType for class org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl$ApplicationEventDispatcher
2017-09-13 17:37:45,025 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.event.LocalizationEventType for class org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService
2017-09-13 17:37:45,025 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServicesEventType for class org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices
2017-09-13 17:37:45,026 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorEventType for class org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl
2017-09-13 17:37:45,027 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncherEventType for class org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainersLauncher
2017-09-13 17:37:45,048 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.nodemanager.ContainerManagerEventType for class org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl
2017-09-13 17:37:45,050 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.nodemanager.NodeManagerEventType for class org.apache.hadoop.yarn.server.nodemanager.NodeManager
2017-09-13 17:37:45,099 INFO org.apache.hadoop.metrics2.impl.MetricsConfig: loaded properties from hadoop-metrics2.properties
2017-09-13 17:37:45,192 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Scheduled Metric snapshot period at 10 second(s).
2017-09-13 17:37:45,192 INFO org.apache.hadoop.metrics2.impl.MetricsSystemImpl: NodeManager metrics system started
2017-09-13 17:37:45,259 WARN org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection: Directory /tmp/hadoop-root/nm-local-dir error, used space above threshold of 90.0%, removing from list of valid directories
2017-09-13 17:37:45,259 WARN org.apache.hadoop.yarn.server.nodemanager.DirectoryCollection: Directory /opt/hadoop/logs/userlogs error, used space above threshold of 90.0%, removing from list of valid directories
2017-09-13 17:37:45,259 INFO org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService: Disk(s) failed: 1/1 local-dirs are bad: /tmp/hadoop-root/nm-local-dir; 1/1 log-dirs are bad: /opt/hadoop/logs/userlogs
2017-09-13 17:37:45,259 ERROR org.apache.hadoop.yarn.server.nodemanager.LocalDirsHandlerService: Most of the disks failed.
1/1 local-dirs are bad: /tmp/hadoop-root/nm-local-dir; 1/1 log-dirs are bad: /opt/hadoop/logs/userlogs
2017-09-13 17:37:45,286 INFO org.apache.hadoop.yarn.server.nodemanager.NodeResourceMonitorImpl: Using ResourceCalculatorPlugin : org.apache.hadoop.yarn.util.ResourceCalculatorPlugin@17f62e33
2017-09-13 17:37:45,289 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.nodemanager.containermanager.loghandler.event.LogHandlerEventType for class org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService
2017-09-13 17:37:45,291 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.sharedcache.SharedCacheUploadEventType for class org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.sharedcache.SharedCacheUploadService
2017-09-13 17:37:45,291 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: AMRMProxyService is disabled
2017-09-13 17:37:45,291 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: per directory file limit = 8192
2017-09-13 17:37:45,306 INFO org.apache.hadoop.yarn.event.AsyncDispatcher: Registering class org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.event.LocalizerEventType for class org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService$LocalizerTracker
2017-09-13 17:37:45,327 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.AuxServices: Adding auxiliary service mapreduce_shuffle, "mapreduce_shuffle"
2017-09-13 17:37:45,352 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Using ResourceCalculatorPlugin : org.apache.hadoop.yarn.util.ResourceCalculatorPlugin@69c81773
2017-09-13 17:37:45,353 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Using ResourceCalculatorProcessTree : null
2017-09-13 17:37:45,354 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Physical memory check enabled: true
2017-09-13 17:37:45,354 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: Virtual memory check enabled: true
2017-09-13 17:37:45,357 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.monitor.ContainersMonitorImpl: ContainersMonitor enabled: true
2017-09-13 17:37:45,359 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Nodemanager resources: memory set to 8192MB.
2017-09-13 17:37:45,360 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Nodemanager resources: vcores set to 8.
2017-09-13 17:37:45,375 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Initialized nodemanager with : physical-memory=8192 virtual-memory=17204 virtual-cores=8
2017-09-13 17:37:45,377 INFO org.apache.hadoop.util.JvmPauseMonitor: Starting JVM pause monitor
2017-09-13 17:37:45,431 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue: class java.util.concurrent.LinkedBlockingQueue queueCapacity: 2000 scheduler: class org.apache.hadoop.ipc.DefaultRpcScheduler
2017-09-13 17:37:45,453 INFO org.apache.hadoop.ipc.Server: Starting Socket Reader #1 for port 58052
2017-09-13 17:37:45,638 INFO org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl: Adding protocol org.apache.hadoop.yarn.api.ContainerManagementProtocolPB to the server
2017-09-13 17:37:45,638 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: Blocking new container-requests as container manager rpc server is still starting.
2017-09-13 17:37:45,638 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting
2017-09-13 17:37:45,639 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 58052: starting
2017-09-13 17:37:45,647 INFO org.apache.hadoop.yarn.server.nodemanager.security.NMContainerTokenSecretManager: Updating node address : 2bfb069c9357:58052
2017-09-13 17:37:45,677 INFO org.apache.hadoop.ipc.CallQueueManager: Using callQueue: class java.util.concurrent.LinkedBlockingQueue queueCapacity: 500 scheduler: class org.apache.hadoop.ipc.DefaultRpcScheduler
2017-09-13 17:37:45,683 INFO org.apache.hadoop.ipc.Server: Starting Socket Reader #1 for port 8040
2017-09-13 17:37:45,689 INFO org.apache.hadoop.yarn.factories.impl.pb.RpcServerFactoryPBImpl: Adding protocol org.apache.hadoop.yarn.server.nodemanager.api.LocalizationProtocolPB to the server
2017-09-13 17:37:45,690 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting
2017-09-13 17:37:45,693 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Localizer started on port 8040
2017-09-13 17:37:45,695 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 8040: starting
2017-09-13 17:37:45,711 INFO org.apache.hadoop.mapred.IndexCache: IndexCache created with max memory = 10485760
2017-09-13 17:37:45,727 INFO org.apache.hadoop.mapred.ShuffleHandler: mapreduce_shuffle listening on port 13562
2017-09-13 17:37:45,732 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: ContainerManager started at 2bfb069c9357/172.17.0.1:58052
2017-09-13 17:37:45,733 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.ContainerManagerImpl: ContainerManager bound to 0.0.0.0/0.0.0.0:0
2017-09-13 17:37:45,736 INFO org.apache.hadoop.yarn.server.nodemanager.webapp.WebServer: Instantiating NMWebApp at 0.0.0.0:8042
2017-09-13 17:37:45,834 INFO org.mortbay.log: Logging to org.slf4j.impl.Log4jLoggerAdapter(org.mortbay.log) via org.mortbay.log.Slf4jLog
2017-09-13 17:37:45,862 INFO org.apache.hadoop.security.authentication.server.AuthenticationFilter: Unable to initialize FileSignerSecretProvider, falling back to use random secrets.
2017-09-13 17:37:45,876 INFO org.apache.hadoop.http.HttpRequestLog: Http request log for http.requests.nodemanager is not defined
2017-09-13 17:37:45,888 INFO org.apache.hadoop.http.HttpServer2: Added global filter 'safety' (class=org.apache.hadoop.http.HttpServer2$QuotingInputFilter)
2017-09-13 17:37:45,891 INFO org.apache.hadoop.http.HttpServer2: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context node
2017-09-13 17:37:45,891 INFO org.apache.hadoop.http.HttpServer2: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context logs
2017-09-13 17:37:45,891 INFO org.apache.hadoop.http.HttpServer2: Added filter static_user_filter (class=org.apache.hadoop.http.lib.StaticUserWebFilter$StaticUserFilter) to context static
2017-09-13 17:37:45,897 INFO org.apache.hadoop.http.HttpServer2: adding path spec: /node/*
2017-09-13 17:37:45,898 INFO org.apache.hadoop.http.HttpServer2: adding path spec: /ws/*
2017-09-13 17:37:46,472 INFO org.apache.hadoop.yarn.webapp.WebApps: Registered webapp guice modules
2017-09-13 17:37:46,475 INFO org.apache.hadoop.http.HttpServer2: Jetty bound to port 8042
2017-09-13 17:37:46,475 INFO org.mortbay.log: jetty-6.1.26
2017-09-13 17:37:46,502 INFO org.mortbay.log: Extract jar:file:/opt/hadoop/share/hadoop/yarn/hadoop-yarn-common-2.8.1.jar!/webapps/node to /tmp/Jetty_0_0_0_0_8042_node____19tj0x/webapp
2017-09-13 17:37:47,457 INFO org.mortbay.log: Started HttpServer2$SelectChannelConnectorWithSafeStartup@0.0.0.0:8042
2017-09-13 17:37:47,458 INFO org.apache.hadoop.yarn.webapp.WebApps: Web app node started at 8042
2017-09-13 17:37:47,458 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Node ID assigned is : 2bfb069c9357:58052
2017-09-13 17:37:47,463 INFO org.apache.hadoop.yarn.client.RMProxy: Connecting to ResourceManager at /0.0.0.0:8031
2017-09-13 17:37:47,491 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Sending out 0 NM container statuses: []
2017-09-13 17:37:47,496 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Registering with RM using containers :[]
2017-09-13 17:37:47,714 INFO org.apache.hadoop.yarn.server.nodemanager.security.NMContainerTokenSecretManager: Rolling master-key for container-tokens, got key with id -644576666
2017-09-13 17:37:47,718 INFO org.apache.hadoop.yarn.server.nodemanager.security.NMTokenSecretManagerInNM: Rolling master-key for container-tokens, got key with id -894620788
2017-09-13 17:37:47,719 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Registered with ResourceManager as 2bfb069c9357:58052 with total resource of
2017-09-13 17:37:47,719 INFO org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl: Notifying ContainerManager to unblock new container-requests
2017-09-13 17:47:45,657 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Cache Size Before Clean: 0, Total Deleted: 0, Public Deleted: 0, Private Deleted: 0
2017-09-13 17:57:45,652 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.ResourceLocalizationService: Cache Size Before Clean: 0, Total Deleted: 0, Public Deleted: 0, Private Deleted: 0
```

Happy to provide the others as well if desired

martindurant commented 7 years ago

"used space above threshold of 90.0%"!! The node manager failed to start because it assumed the disk would soon become full. This suggests I should add more diagnostics to the YARN API - the information about the state of nodemanagers is readily available.
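(Such a check is straightforward against the ResourceManager REST API; a sketch, assuming the RM web UI on localhost:8088 as elsewhere in this thread:)

```
import requests

# Each node reports its health; an unhealthy node would show the
# "1/1 local-dirs are bad" message here without digging through log files.
nodes = requests.get("http://localhost:8088/ws/v1/cluster/nodes").json()
for node in nodes["nodes"]["node"]:
    print(node["nodeHostName"], node["state"], node["healthReport"])
```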

martindurant commented 7 years ago

Apparently the following YARN config (in yarn-site.xml) would solve the issue, after a YARN restart (or a rebuild of the docker image):

<property>
        <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
        <value>99.5</value>
</property>

mrocklin commented 7 years ago

My machine was genuinely low on disk space, which warrants handling in its own right.

martindurant commented 7 years ago

The config change would have saved me from Travis woes too, so I'll probably put it into the image at some point.

martindurant commented 7 years ago

I am writing some "pre-flight checks" and diagnostics, but I notice that YARN doesn't actually know how much memory it has available, only what is specified in the config (8GB, 8 CPUs by default), and container allocations are set to a 1GB minimum by default (of which potentially only a small amount is used, and Python is not actually restricted from going beyond it). There is (potentially) information about YARN's guess at physical memory usage on worker node machines, but the log files that were useful above are not in general available to the user, as they are scattered across various machines and may need privileged access. The same goes for general available disk space.

So I'm wondering how much I can check versus trying to give comprehensive troubleshooting guidelines on failure. Certainly "master didn't report back" is totally opaque.
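As a concrete illustration, a pre-flight check might look something like this (a sketch only, under the assumptions that cluster_metrics() returns the standard RM metrics mapping and that it contains an availableMB field; as noted above, that figure reflects config rather than measured capacity):

```
def preflight_memory_check(knit_client, n_workers, memory_mb):
    """Warn when a request obviously exceeds what YARN *claims* to have."""
    metrics = knit_client.yarn_api.cluster_metrics()
    needed = n_workers * memory_mb
    available = metrics.get("availableMB", 0)
    if available < needed:
        raise RuntimeError(
            "YARN reports %dMB available but %dMB was requested; note "
            "this number comes from YARN's config, not a measurement of "
            "free RAM or disk on the nodes" % (available, needed))
```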

mrocklin commented 7 years ago

My case of not having enough disk space is probably odd enough not to worry about.
