dask / knit

Deprecated, please use https://github.com/jcrist/skein or https://github.com/dask/dask-yarn instead
http://knit.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

Worker restarted until killed #92

Closed: quartox closed this issue 7 years ago

quartox commented 7 years ago

I believe the key error is the excerpt below: the scheduler's response looks fine (status 'OK'), yet the distributed/worker.py module treats it as unexpected and raises.

"/hadoop02/yarn/nm/usercache/jlord/appcache/application_1506861552726_19299/container_e110_1506861552726_19299_01_000002/PYTHON_DIR/dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58/lib/python3.6/site-packages/distributed/worker.py", line 248, in _register_with_scheduler
    raise ValueError("Unexpected response from register: %r" % (resp,))
ValueError: Unexpected response from register: {'status': 'OK', 'time': 1507305239.205065}
distributed.nanny - WARNING - Restarting worker

Below is more of the container log. The worker keeps restarting until it is killed; the final error when it is killed is at the bottom.

Container: container_e110_1506861552726_19299_01_000002 on hostname.allstate.com_8041
======================================================================================
LogType:stderr
Log Upload Time:Fri Oct 06 10:54:03 -0500 2017
LogLength:22808
Log Contents:
/hadoop02/yarn/nm/usercache/jlord/appcache/application_1506861552726_19299/container_e110_1506861552726_19299_01_000002/PYTHON_DIR/dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58/lib/python3.6/site-packages/distributed/config.py:55: UserWarning: Could not write default config file to '/home/.dask/config.yaml'. Received error [Errno 13] Permission denied: '/home/.dask'
  UserWarning)
distributed.nanny - INFO -         Start Nanny at: 'tcp://10.195.102.32:45126'
/hadoop02/yarn/nm/usercache/jlord/appcache/application_1506861552726_19299/container_e110_1506861552726_19299_01_000002/PYTHON_DIR/dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58/lib/python3.6/site-packages/distributed/config.py:55: UserWarning: Could not write default config file to '/home/.dask/config.yaml'. Received error [Errno 13] Permission denied: '/home/.dask'
  UserWarning)
distributed.worker - INFO -       Start worker at:  tcp://10.195.102.32:37045
distributed.worker - INFO -          Listening to:  tcp://10.195.102.32:37045
distributed.worker - INFO -              nanny at:        10.195.102.32:45126
distributed.worker - INFO -               http at:        10.195.102.32:32817
distributed.worker - INFO -              bokeh at:         10.195.102.32:8789
distributed.worker - INFO - Waiting to connect to: tcp://10.195.208.190:40025
distributed.worker - INFO - -------------------------------------------------
distributed.worker - INFO -               Threads:                          1
distributed.worker - INFO -                Memory:                    0.50 GB
distributed.worker - INFO -       Local Directory:            worker-nfomhqoz
distributed.worker - INFO - -------------------------------------------------
distributed.nanny - ERROR - Failed to start worker
Traceback (most recent call last):
  File "/hadoop02/yarn/nm/usercache/jlord/appcache/application_1506861552726_19299/container_e110_1506861552726_19299_01_000002/PYTHON_DIR/dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58/lib/python3.6/site-packages/distributed/nanny.py", line 467, in run
    yield worker._start(*worker_start_args)
  File "/hadoop02/yarn/nm/usercache/jlord/appcache/application_1506861552726_19299/container_e110_1506861552726_19299_01_000002/PYTHON_DIR/dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/hadoop02/yarn/nm/usercache/jlord/appcache/application_1506861552726_19299/container_e110_1506861552726_19299_01_000002/PYTHON_DIR/dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/hadoop02/yarn/nm/usercache/jlord/appcache/application_1506861552726_19299/container_e110_1506861552726_19299_01_000002/PYTHON_DIR/dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58/lib/python3.6/site-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/hadoop02/yarn/nm/usercache/jlord/appcache/application_1506861552726_19299/container_e110_1506861552726_19299_01_000002/PYTHON_DIR/dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58/lib/python3.6/site-packages/distributed/worker.py", line 319, in _start
    yield self._register_with_scheduler()
  File "/hadoop02/yarn/nm/usercache/jlord/appcache/application_1506861552726_19299/container_e110_1506861552726_19299_01_000002/PYTHON_DIR/dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/hadoop02/yarn/nm/usercache/jlord/appcache/application_1506861552726_19299/container_e110_1506861552726_19299_01_000002/PYTHON_DIR/dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/hadoop02/yarn/nm/usercache/jlord/appcache/application_1506861552726_19299/container_e110_1506861552726_19299_01_000002/PYTHON_DIR/dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58/lib/python3.6/site-packages/tornado/gen.py", line 1069, in run
    yielded = self.gen.send(value)
  File "/hadoop02/yarn/nm/usercache/jlord/appcache/application_1506861552726_19299/container_e110_1506861552726_19299_01_000002/PYTHON_DIR/dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58/lib/python3.6/site-packages/distributed/worker.py", line 248, in _register_with_scheduler
    raise ValueError("Unexpected response from register: %r" % (resp,))
ValueError: Unexpected response from register: {'status': 'OK', 'time': 1507305239.205065}
distributed.nanny - WARNING - Restarting worker

Final error:

 tornado.application - ERROR - Exception in callback functools.partial(<function wrap.<locals>.null_wrapper at 0x7f96cdcfb510>, <tornado.concurrent.Future object at 0x7f96ce9c5978>)
Traceback (most recent call last):
  File "/hadoop02/yarn/nm/usercache/jlord/appcache/application_1506861552726_19299/container_e110_1506861552726_19299_01_000002/PYTHON_DIR/dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58/lib/python3.6/site-packages/tornado/ioloop.py", line 605, in _run_callback
    ret = callback()
  File "/hadoop02/yarn/nm/usercache/jlord/appcache/application_1506861552726_19299/container_e110_1506861552726_19299_01_000002/PYTHON_DIR/dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58/lib/python3.6/site-packages/tornado/stack_context.py", line 277, in null_wrapper
    return fn(*args, **kwargs)
  File "/hadoop02/yarn/nm/usercache/jlord/appcache/application_1506861552726_19299/container_e110_1506861552726_19299_01_000002/PYTHON_DIR/dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58/lib/python3.6/site-packages/tornado/ioloop.py", line 626, in _discard_future_result
    future.result()
  File "/hadoop02/yarn/nm/usercache/jlord/appcache/application_1506861552726_19299/container_e110_1506861552726_19299_01_000002/PYTHON_DIR/dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/hadoop02/yarn/nm/usercache/jlord/appcache/application_1506861552726_19299/container_e110_1506861552726_19299_01_000002/PYTHON_DIR/dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58/lib/python3.6/site-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/hadoop02/yarn/nm/usercache/jlord/appcache/application_1506861552726_19299/container_e110_1506861552726_19299_01_000002/PYTHON_DIR/dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58/lib/python3.6/site-packages/distributed/nanny.py", line 138, in _start
    response = yield self.instantiate()
  File "/hadoop02/yarn/nm/usercache/jlord/appcache/application_1506861552726_19299/container_e110_1506861552726_19299_01_000002/PYTHON_DIR/dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/hadoop02/yarn/nm/usercache/jlord/appcache/application_1506861552726_19299/container_e110_1506861552726_19299_01_000002/PYTHON_DIR/dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/hadoop02/yarn/nm/usercache/jlord/appcache/application_1506861552726_19299/container_e110_1506861552726_19299_01_000002/PYTHON_DIR/dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58/lib/python3.6/site-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/hadoop02/yarn/nm/usercache/jlord/appcache/application_1506861552726_19299/container_e110_1506861552726_19299_01_000002/PYTHON_DIR/dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58/lib/python3.6/site-packages/distributed/nanny.py", line 205, in instantiate
    yield self.process.start()
  File "/hadoop02/yarn/nm/usercache/jlord/appcache/application_1506861552726_19299/container_e110_1506861552726_19299_01_000002/PYTHON_DIR/dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/hadoop02/yarn/nm/usercache/jlord/appcache/application_1506861552726_19299/container_e110_1506861552726_19299_01_000002/PYTHON_DIR/dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/hadoop02/yarn/nm/usercache/jlord/appcache/application_1506861552726_19299/container_e110_1506861552726_19299_01_000002/PYTHON_DIR/dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58/lib/python3.6/site-packages/tornado/gen.py", line 1063, in run
    yielded = self.gen.throw(*exc_info)
  File "/hadoop02/yarn/nm/usercache/jlord/appcache/application_1506861552726_19299/container_e110_1506861552726_19299_01_000002/PYTHON_DIR/dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58/lib/python3.6/site-packages/distributed/nanny.py", line 311, in start
    yield self._wait_until_running()
  File "/hadoop02/yarn/nm/usercache/jlord/appcache/application_1506861552726_19299/container_e110_1506861552726_19299_01_000002/PYTHON_DIR/dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58/lib/python3.6/site-packages/tornado/gen.py", line 1055, in run
    value = future.result()
  File "/hadoop02/yarn/nm/usercache/jlord/appcache/application_1506861552726_19299/container_e110_1506861552726_19299_01_000002/PYTHON_DIR/dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58/lib/python3.6/site-packages/tornado/concurrent.py", line 238, in result
    raise_exc_info(self._exc_info)
  File "<string>", line 4, in raise_exc_info
  File "/hadoop02/yarn/nm/usercache/jlord/appcache/application_1506861552726_19299/container_e110_1506861552726_19299_01_000002/PYTHON_DIR/dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58/lib/python3.6/site-packages/tornado/gen.py", line 1069, in run
    yielded = self.gen.send(value)
  File "/hadoop02/yarn/nm/usercache/jlord/appcache/application_1506861552726_19299/container_e110_1506861552726_19299_01_000002/PYTHON_DIR/dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58/lib/python3.6/site-packages/distributed/nanny.py", line 397, in _wait_until_running
    raise ValueError("Worker not started")
ValueError: Worker not started
martindurant commented 7 years ago

My initial suspicion is that the versions of distributed are incompatible. I would remove the temporary dask environment (normally in the knit source directory, and also in .knitDeps on HDFS) and update dask/distributed in the environment from which you are launching knit.
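
A minimal cleanup sketch. The local path is taken from the conda create command quoted later in this thread, and the HDFS location /user/jlord/.knitDeps is an assumption about where .knitDeps sits; adjust both for your install.

import shutil
import subprocess

# Remove the conda environment knit cached locally
# (path from this thread; yours will differ).
shutil.rmtree(
    "/home/jlord/.conda/envs/dask/lib/python3.6/site-packages/"
    "knit-0.2.2-py3.6.egg/knit/tmp_conda",
    ignore_errors=True)

# Remove the copy uploaded to HDFS so it is rebuilt on the next launch
# (assumed location; check where .knitDeps lives for your user).
subprocess.run(["hdfs", "dfs", "-rm", "-r", "-f", "/user/jlord/.knitDeps"],
               check=False)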

mrocklin commented 7 years ago

What is distributed.__version__ on the worker nodes?
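
One way to check, assuming the local copy of the environment knit builds (the same one shown in the conda create command further down) is what ends up on the workers:

import subprocess

# Interpreter of the environment knit ships to the containers
# (path is from this thread; yours will differ).
env_python = ("/home/jlord/.conda/envs/dask/lib/python3.6/site-packages/"
              "knit-0.2.2-py3.6.egg/knit/tmp_conda/envs/"
              "dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58/bin/python")

# Ask that interpreter which distributed it carries.
out = subprocess.check_output(
    [env_python, "-c", "import distributed; print(distributed.__version__)"])
print(out.decode().strip())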

mrocklin commented 7 years ago

(I'm also unsure of what's happening here)

quartox commented 7 years ago

I have version 1.19.1 installed on the edge node. This is the create command that I am re-running right now:

/nas/isg_prodops_work/autowork/anaconda3/bin/conda create -p /home/jlord/.conda/envs/dask/lib/python3.6/site-packages/knit-0.2.2-py3.6.egg/knit/tmp_conda/envs/dask-35d2a1ee201208ae9fca6905fa88ea9e54557b58 -y -q dask>=0.14 distributed>=1.16

I can open it up to see the exact version when it finishes.

mrocklin commented 7 years ago

Yeah, that should be fine

quartox commented 7 years ago

Looks like 1.18.1 in that conda env. Same error after rebuilding.

mrocklin commented 7 years ago

OK, thanks for checking

mrocklin commented 7 years ago

Hrm, can you verify that your client and scheduler are running the same version?

client.get_versions(check=True)
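
A minimal sketch of that check; the scheduler address is taken from the log above and will differ in your deployment:

from distributed import Client

# Connect to the running scheduler.
client = Client("tcp://10.195.208.190:40025")

# With check=True this raises if the client, scheduler, and workers
# report mismatched package versions.
print(client.get_versions(check=True))
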
martindurant commented 7 years ago

Note that you can pass channels to conda via the programmatic interface, or, if you run from the command line as above, with -c conda-forge.

quartox commented 7 years ago

Should I pass the channel to the DaskYarnCluster? I installed everything using conda-forge, but knit is building the environment automatically from a different channel.

martindurant commented 7 years ago

Yes, DaskYarnCluster(channels=['conda-forge']). You should install from the same channels as far as possible.
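
A sketch of launching with matching channels. The channels keyword is as given above; the start() arguments are illustrative and their names should be verified against the knit docs for your version:

from knit.dask_yarn import DaskYARNCluster
from distributed import Client

# Build the shipped environment from conda-forge to match the edge node.
cluster = DaskYARNCluster(channels=['conda-forge'])
cluster.start(nworkers=2, memory=512, cpus=1)  # illustrative resource values
client = Client(cluster)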

quartox commented 7 years ago

That did it! I just needed to use the same channel.

martindurant commented 7 years ago

As another note: you can provide an absolute path to a conda environment, or give the name of a conda environment that already exists, which may be easier in such situations.

quartox commented 7 years ago

What is the argument name for that path?

martindurant commented 7 years ago

DaskYARNCluster(env='/my/conda/path') (where that directory contains /bin, /lib, etc.).

martindurant commented 7 years ago

That can be either the .zip or a directory, which will then be zipped for you (/my/conda/path => /my/conda/path.zip).
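
Putting those two comments together, a sketch of reusing a pre-built environment (the paths are placeholders from the comments above):

from knit.dask_yarn import DaskYARNCluster

# Point at an unpacked environment directory; knit zips it for you ...
cluster = DaskYARNCluster(env='/my/conda/path')

# ... or hand it an archive you have already built:
# cluster = DaskYARNCluster(env='/my/conda/path.zip')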

quartox commented 7 years ago

Excellent. I think we should definitely write up a troubleshooting guide at some point.