Open CMCDragonkai opened 4 years ago
My distributed version is:
distributed==2.3.0
Hmm, I haven't seen this error in some time. Would it be possible to upgrade to 2.9.1?
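When comparing versions across environments, it can help to print exactly what is installed before and after upgrading. The snippet below is a minimal sketch using only the standard library; `installed_version` is a hypothetical helper name, and `importlib.metadata` assumes Python 3.8 or newer (the traceback later in this thread comes from a Python 3.6 environment, where `pip freeze` would be the fallback).

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version string, or None if the package is absent."""
    try:
        return version(package)
    except PackageNotFoundError:
        return None


# Report the two packages discussed in this thread.
for name in ("dask", "distributed"):
    print(name, installed_version(name))
```

Running this on both the client and the worker images is a cheap way to rule out a version mismatch between them.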
@CMCDragonkai were you able to try with a more recent version of distributed?
I'm hitting ~this~ a very similar issue also on dask 2.9.1. I have this setup:
I can reproduce this systematically doing this:
This puts the scheduler in a corrupted state where:
- client.restart times out and outputs "closing dangling stream" warnings.
- dashboard endpoints fail with a KeyError (except for the /system endpoint, which works perfectly).

After that I can't recover the scheduler. I've tried scaling down the cluster, restarting the client, and creating a new client from the dask-kubernetes cluster object. Nothing has worked so far.
Here is an example stack-trace:
Traceback (most recent call last):
  File "/root/.cache/pypoetry/virtualenvs/acquisition-uD-vDiZT-py3.6/lib/python3.6/site-packages/tornado/web.py", line 1592, in _execute
    result = yield result
  File "/root/.cache/pypoetry/virtualenvs/acquisition-uD-vDiZT-py3.6/lib/python3.6/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
  File "/root/.cache/pypoetry/virtualenvs/acquisition-uD-vDiZT-py3.6/lib/python3.6/site-packages/tornado/gen.py", line 1141, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/local/lib/python3.6/types.py", line 184, in throw
    return self.__wrapped.throw(tp, *rest)
  File "/root/.cache/pypoetry/virtualenvs/acquisition-uD-vDiZT-py3.6/lib/python3.6/site-packages/bokeh/server/views/doc_handler.py", line 56, in get
    session = yield self.get_session()
  File "/root/.cache/pypoetry/virtualenvs/acquisition-uD-vDiZT-py3.6/lib/python3.6/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
  File "/root/.cache/pypoetry/virtualenvs/acquisition-uD-vDiZT-py3.6/lib/python3.6/site-packages/tornado/gen.py", line 1141, in run
    yielded = self.gen.throw(*exc_info)
  File "/usr/local/lib/python3.6/types.py", line 184, in throw
    return self.__wrapped.throw(tp, *rest)
  File "/root/.cache/pypoetry/virtualenvs/acquisition-uD-vDiZT-py3.6/lib/python3.6/site-packages/bokeh/server/views/session_handler.py", line 79, in get_session
    session = yield self.application_context.create_session_if_needed(session_id, self.request)
  File "/root/.cache/pypoetry/virtualenvs/acquisition-uD-vDiZT-py3.6/lib/python3.6/site-packages/tornado/gen.py", line 1133, in run
    value = future.result()
  File "/root/.cache/pypoetry/virtualenvs/acquisition-uD-vDiZT-py3.6/lib/python3.6/site-packages/tornado/gen.py", line 1147, in run
    yielded = self.gen.send(value)
  File "/root/.cache/pypoetry/virtualenvs/acquisition-uD-vDiZT-py3.6/lib/python3.6/site-packages/bokeh/server/contexts.py", line 222, in create_session_if_needed
    self._application.initialize_document(doc)
  File "/root/.cache/pypoetry/virtualenvs/acquisition-uD-vDiZT-py3.6/lib/python3.6/site-packages/bokeh/application/application.py", line 178, in initialize_document
    h.modify_document(doc)
  File "/root/.cache/pypoetry/virtualenvs/acquisition-uD-vDiZT-py3.6/lib/python3.6/site-packages/bokeh/application/handlers/function.py", line 133, in modify_document
    self._func(doc)
  File "/root/.cache/pypoetry/virtualenvs/acquisition-uD-vDiZT-py3.6/lib/python3.6/site-packages/distributed/dashboard/components/scheduler.py", line 1748, in status_doc
    current_load.update()
  File "/root/.cache/pypoetry/virtualenvs/acquisition-uD-vDiZT-py3.6/lib/python3.6/site-packages/bokeh/core/property/validation.py", line 97, in func
    return input_function(*args, **kwargs)
  File "/root/.cache/pypoetry/virtualenvs/acquisition-uD-vDiZT-py3.6/lib/python3.6/site-packages/distributed/dashboard/components/scheduler.py", line 646, in update
    cpu = [int(ws.metrics["cpu"]) for ws in workers]
  File "/root/.cache/pypoetry/virtualenvs/acquisition-uD-vDiZT-py3.6/lib/python3.6/site-packages/distributed/dashboard/components/scheduler.py", line 646, in <listcomp>
    cpu = [int(ws.metrics["cpu"]) for ws in workers]
KeyError: 'cpu'
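The last frame of the traceback indexes `ws.metrics["cpu"]` for every worker, so a single worker whose metrics dict has no "cpu" entry is enough to break the whole page. The sketch below reproduces that failure mode in isolation. `FakeWorkerState` is a hypothetical stand-in for the scheduler's per-worker state, and the tolerant variant is only an illustration of a defensive lookup, not the fix that was applied upstream.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FakeWorkerState:
    """Hypothetical stand-in for the scheduler's per-worker state object."""
    name: str
    metrics: Dict[str, float] = field(default_factory=dict)


def cpu_strict(workers: List[FakeWorkerState]) -> List[int]:
    # Same shape as the failing line in the traceback: raises KeyError
    # as soon as one worker has no "cpu" entry.
    return [int(ws.metrics["cpu"]) for ws in workers]


def cpu_tolerant(workers: List[FakeWorkerState]) -> List[int]:
    # Defensive variant (an assumption, not the upstream fix): treat a
    # missing "cpu" metric as 0 so one worker cannot break the page.
    return [int(ws.metrics.get("cpu", 0)) for ws in workers]


workers = [
    FakeWorkerState("healthy", {"cpu": 12.5}),
    FakeWorkerState("no-metrics-yet"),  # metrics dict is empty
]

try:
    cpu_strict(workers)
except KeyError as exc:
    print("strict lookup failed:", repr(exc))

print(cpu_tolerant(workers))
```

Whether a default of 0 is the right display value is a design choice; the point is only that an empty metrics dict on one worker is sufficient to trigger the error seen above.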
I just tried this with a newer version of Dask, 2021.09.0, and it looks like the minimal, reproducible example here needs to be updated slightly.
At the line:
./worker.py --scheduler-ip 127.0.0.1 --scheduler-port 3201 --name image-classifier-worker
...I get an error indicating the way we provide command line arguments has changed.
usage: worker.py [-h] [--host HOST] [--port PORT]
                 [--dashboard-address DASHBOARD_ADDRESS]
worker.py: error: unrecognized arguments: --scheduler-ip 127.0.0.1 --scheduler-port 3201 --name image-classifier-worker
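The message above is what argparse prints when it is handed flags it was never told about. The sketch below rebuilds a parser from the options listed in the usage line and shows which arguments it rejects. The parser definition is a hypothetical reconstruction (the real worker.py parser is not shown in this thread), and `parse_known_args` is used instead of `parse_args` so the leftovers can be inspected without the process exiting.

```python
import argparse

# Hypothetical parser mirroring the options shown in the usage message.
parser = argparse.ArgumentParser(prog="worker.py")
parser.add_argument("--host")
parser.add_argument("--port", type=int)
parser.add_argument("--dashboard-address")

# The arguments from the original reproduction steps.
old_style = [
    "--scheduler-ip", "127.0.0.1",
    "--scheduler-port", "3201",
    "--name", "image-classifier-worker",
]

# parse_known_args returns the unrecognized leftovers instead of exiting.
known, unknown = parser.parse_known_args(old_style)
print("unrecognized:", unknown)

# The flags the parser does accept parse cleanly.
new_style = parser.parse_args(["--host", "127.0.0.1", "--port", "3201"])
print(new_style.host, new_style.port)
```

Note that accepting `--host` and `--port` says nothing about what they mean: in this usage line they may describe the address the worker itself listens on rather than the scheduler's address.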
So I tried changing it to this:
./worker.py --host 127.0.0.1 --port 3201
and then this (I don't think this is right, but it felt worth checking all the combinations of options, even if I didn't think they'd work):
./worker.py --dashboard-address 127.0.0.1 --port 3201
... which gave me errors saying the address is already in use:
OSError: [Errno 98] Address already in use
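Errno 98 (EADDRINUSE on Linux) is raised when a process tries to bind a port that another socket already holds. One plausible reading, stated here as an assumption, is that `--port 3201` made the worker try to listen on the port the scheduler was already using. The standard-library sketch below reproduces the same error with two plain sockets; it uses an OS-assigned free port rather than 3201 so it does not depend on anything else running.

```python
import errno
import socket

# First socket: bind to a free port chosen by the OS and start listening.
first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
first.bind(("127.0.0.1", 0))
first.listen(1)
port = first.getsockname()[1]

# Second socket: binding the same address and port should now fail.
second = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    second.bind(("127.0.0.1", port))
    bind_error = None
except OSError as exc:
    bind_error = exc
    print("second bind failed:", exc)
finally:
    second.close()
    first.close()
```

If this is what happened here, the worker needs to be given the scheduler's address as something to connect to, not as its own listening address.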
I imagine it shouldn't be too difficult to update this example; I'm likely just missing something obvious. @quasiben do you have suggestions?
I see that this error has been reported in #3147 as well.
This is the error I'm seeing from running the scheduler.
This is how to reproduce this...
This is the scheduler.py.
This is the worker.py.
On the second invocation of the worker I get:
But on the scheduler it starts having a failure regarding the cpu key.