FabioLuporini opened this issue 4 years ago (status: Open)
Thank you for raising this issue @FabioLuporini.
At first glance I don't see any obvious cause. A minimal reproducer would be useful. Alternatively, given that this started affecting you only recently, you could also try git bisect on this repository to see if you can isolate the commit that caused this change. This might be hard though if you can only reproduce the issue in CI.
I could...but it's gonna take forever :grimacing:
so, after an afternoon of CI re-spins, I think we can say with some confidence that 2.19 doesn't exhibit this behaviour
I guess we're gonna enforce <2.20 for now
There are only 13 commits between 2.19 and 2.20. It might not be terrible.
OK, I thought it was worse than that. Do you suggest I do it on all commits or just on the merge commits (haven't checked the commit history, will do shortly)?
We squash-and-merge, so every commit in history is valid and clean
I'm also having this issue. I just wanted to add that we're experiencing this about 1 in 30 times.
Thanks @jrdzha . Do you have time to try the git-bisect process? I would expect this to take around 30-45 minutes.
@mrocklin I'm quite busy for the next few days, but I'll take a look after that.
@mrocklin sorry -- we did try it (OK, not really git bisect, for the reasons explained above, but something conceptually analogous). We then bumped into other issues, though, and I eventually forgot to update this thread. I'll let @rhodrin explain our findings 'cause I don't quite remember what the latest news on this front is.
@mrocklin Yes, sorry for the delayed response. So we set up an Actions matrix to run our tests many times for each commit between 2.19 and 2.20, and later for some earlier and later versions.
Firstly, failures are rare and not particularly consistent (possibly related to load on the GitHub-hosted runners). One day we may see a failure 1 in 10 times; the next day we could see ~100 runs with no failure. We managed to produce failures on, I think, all commits between 2.19 and 2.20 and even on some post-2.20 commits. We then tried to produce failures on some older versions, but that day we saw absolutely no failures on any version between 2.16 and 2.21 - I think we ran the tests ~80 times on each version. After that I took a break from this.
So in summary, this one is a real mystery. But my inclination is that it's something memory-related: we only managed to produce failures when running a large suite of tests; repeatedly running the failing test individually never produced any failures. Also, I've not managed to reproduce this on my local machine (which is pretty decent).
@rhodrin Yes, in our testing the failures seem to occur randomly. We run our tests using LocalCluster on a 'local' machine (not our Kubernetes cluster), though, and they are failing there.
OK, it seems like the next thing to do here then is for someone to produce a reproducible example.
Just wanted to chime in that we are consistently seeing the nanny NoneType errors when running LocalCudaCluster/LocalCluster jobs on cloud instances (code gets wrapped into a container/AWS SageMaker Estimator); however, the issue does not seem to occur when running with containers on a local machine. Interestingly, these errors are non-critical and the computations continue to completion, but the errors dominate/clobber the output.
https://github.com/rapidsai/cloud-ml-examples/blob/master/aws/notebooks/sagemaker_rapids_hpo.ipynb https://github.com/rapidsai/cloud-ml-examples/blob/master/aws/code/rapids_cloud_ml.py
cc @quasiben
@miroenev it looks like your issue is explained at the top of your log output:
FileExistsError: [Errno 17] File exists: '/opt/ml/output/dask-worker-space
There have been some changes to dask-worker directory creation. @FabioLuporini can you upgrade to the latest release (2.22.0) and try again?
@quasiben I think they are separate errors.
My hypothesis is that the dask_worker_space issue is due to the way SageMaker attaches the same dataset S3 volume to multiple instances when running HPO. I've since tried to set the temporary directory location to a place that SageMaker's documentation suggests will not overlap between instances, but it's not entirely clear whether this fixes the dask_worker_space 'error.' We are less concerned about this since it has minimal impact on the output logs (it only shows up once) and in practice does not seem to impact the performance of the upstream dask computations.
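For reference, a minimal sketch of that kind of workaround (the path here is illustrative, and it assumes each instance has a writable node-local directory): pointing dask's temporary-directory config at an instance-local path keeps each instance's dask-worker-space separate.
import dask
from dask.distributed import LocalCluster

# Illustrative path only; any instance-local, writable directory will do.
dask.config.set({"temporary-directory": "/tmp/dask-scratch"})
cluster = LocalCluster(n_workers=2)  # dask-worker-space is now created under /tmp/dask-scratch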
The nanny memory_monitor error however is quite annoying and dominates the output logs. I've tried to remove it from the logging configuration but it still persists.
@quasiben We're now using 2.22.0. We haven't seen this issue yet, but will report back if we do.
I just ran into this issue on my local machine, so it is alive and well.
from distributed import Client
client = Client(asynchronous=False)
Results in this being output over and over again:
tornado.application - ERROR - Exception in callback <bound method Nanny.memory_monitor of <Nanny: None, threads: 3>>
Traceback (most recent call last):
File "/home/arthur/.local/lib/python3.8/site-packages/tornado/ioloop.py", line 907, in _run
return self.callback()
File "/home/arthur/.local/lib/python3.8/site-packages/distributed/nanny.py", line 414, in memory_monitor
process = self.process.process
AttributeError: 'NoneType' object has no attribute 'process'
Some additional details:
- os.getcwd() being in a directory without write privileges triggers the error every time.
- help(distributed) gives version 2.30.0.
- Installed via pip3 install dask[complete] as root.
- free -h:
total used free shared buff/cache available
Mem: 15Gi 10Gi 318Mi 762Mi 4.8Gi 4.1Gi
Swap: 31Gi 1.0Gi 31Gi
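For what it's worth, a minimal sketch of that read-only working-directory hypothesis (the path is illustrative; any directory the current user cannot write to should behave the same):
import os
from distributed import Client

os.chdir("/usr")                     # a directory the current user cannot write to
client = Client(asynchronous=False)  # by default the worker-space directory is created relative to the cwd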
I recommend getting logs and then posting them here:
client = Client(asynchronous=False, silence_logs=False)
Huh, now those logs are interesting:
I suspect it has to do with how PyCharm configures everything. Especially since I confirmed I can start it via a shell/pipenv.
Edit: I have changed the "ProjectName", but everything else is exactly as the system output it. Including the "" directory.
import os
os.getcwd()
'/home/arthur/src/ProjectName'
from distributed import Client
client = Client(asynchronous=False, silence_logs=False)
distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO - Scheduler at: tcp://127.0.0.1:42219
distributed.scheduler - INFO - dashboard at: 127.0.0.1:8787
distributed.nanny - INFO - Start Nanny at: 'tcp://127.0.0.1:34001'
distributed.nanny - INFO - Start Nanny at: 'tcp://127.0.0.1:45005'
distributed.nanny - INFO - Start Nanny at: 'tcp://127.0.0.1:45505'
distributed.nanny - INFO - Start Nanny at: 'tcp://127.0.0.1:41067'
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/usr/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/usr/lib/python3.8/multiprocessing/spawn.py", line 125, in _main
prepare(preparation_data)
File "/usr/lib/python3.8/multiprocessing/spawn.py", line 236, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "/usr/lib/python3.8/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
main_content = runpy.run_path(main_path,
File "/usr/lib/python3.8/runpy.py", line 264, in run_path
code, fname = _get_code_from_file(run_name, path_name)
File "/usr/lib/python3.8/runpy.py", line 234, in _get_code_from_file
with io.open_code(decoded_path) as f:
FileNotFoundError: [Errno 2] No such file or directory: '/home/arthur/src/ProjectName/<input>'
distributed.nanny - INFO - Worker process 200663 exited with status 1
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/usr/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/usr/lib/python3.8/multiprocessing/spawn.py", line 125, in _main
prepare(preparation_data)
File "/usr/lib/python3.8/multiprocessing/spawn.py", line 236, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "/usr/lib/python3.8/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
main_content = runpy.run_path(main_path,
File "/usr/lib/python3.8/runpy.py", line 264, in run_path
code, fname = _get_code_from_file(run_name, path_name)
File "/usr/lib/python3.8/runpy.py", line 234, in _get_code_from_file
with io.open_code(decoded_path) as f:
FileNotFoundError: [Errno 2] No such file or directory: '/home/arthur/src/ProjectName/<input>'
distributed.nanny - INFO - Closing Nanny at 'tcp://127.0.0.1:45005'
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/usr/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/usr/lib/python3.8/multiprocessing/spawn.py", line 125, in _main
prepare(preparation_data)
File "/usr/lib/python3.8/multiprocessing/spawn.py", line 236, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "/usr/lib/python3.8/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
main_content = runpy.run_path(main_path,
File "/usr/lib/python3.8/runpy.py", line 264, in run_path
code, fname = _get_code_from_file(run_name, path_name)
File "/usr/lib/python3.8/runpy.py", line 234, in _get_code_from_file
with io.open_code(decoded_path) as f:
FileNotFoundError: [Errno 2] No such file or directory: '/home/arthur/src/ProjectName/<input>'
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/usr/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "/usr/lib/python3.8/multiprocessing/spawn.py", line 125, in _main
prepare(preparation_data)
File "/usr/lib/python3.8/multiprocessing/spawn.py", line 236, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "/usr/lib/python3.8/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
main_content = runpy.run_path(main_path,
File "/usr/lib/python3.8/runpy.py", line 264, in run_path
code, fname = _get_code_from_file(run_name, path_name)
File "/usr/lib/python3.8/runpy.py", line 234, in _get_code_from_file
with io.open_code(decoded_path) as f:
FileNotFoundError: [Errno 2] No such file or directory: '/home/arthur/src/ProjectName/<input>'
distributed.nanny - INFO - Worker process 200661 exited with status 1
distributed.nanny - INFO - Worker process 200666 exited with status 1
distributed.nanny - INFO - Worker process 200667 exited with status 1
distributed.nanny - INFO - Closing Nanny at 'tcp://127.0.0.1:34001'
distributed.nanny - INFO - Closing Nanny at 'tcp://127.0.0.1:41067'
distributed.nanny - INFO - Closing Nanny at 'tcp://127.0.0.1:45505'
distributed.scheduler - INFO - Receive client connection: Client-7ec868f1-0d58-11eb-8f4e-b7ca8d0cd9a4
distributed.core - INFO - Starting established connection
tornado.application - ERROR - Exception in callback <bound method Nanny.memory_monitor of <Nanny: None, threads: 3>>
Traceback (most recent call last):
File "/home/arthur/.local/share/virtualenvs/ProjectName-GY1GZPBf/lib/python3.8/site-packages/tornado/ioloop.py", line 907, in _run
return self.callback()
File "/home/arthur/.local/share/virtualenvs/ProjectName-GY1GZPBf/lib/python3.8/site-packages/distributed/nanny.py", line 414, in memory_monitor
process = self.process.process
AttributeError: 'NoneType' object has no attribute 'process'
With the last "tornado.application Error" repeating endlessly until I killed the console.
I've been able to turn this error on if I set silence_logs='error'.
This example doesn't work on my local machine:
import contextlib
from concurrent.futures._base import as_completed

import dask
import distributed
from distributed.cfexecutor import ClientExecutor


@contextlib.contextmanager
def executor():
    with dask.config.set({"distributed.worker.daemon": True}):
        with distributed.LocalCluster(
            n_workers=5,
            processes=True,
            silence_logs='error'
        ) as cluster:
            with distributed.Client(cluster) as client:
                yield ClientExecutor(client)


def add(z):
    return z + 1


if __name__ == '__main__':
    with executor() as pool:
        futures = [pool.submit(add, k) for k in range(10)]
        for f in as_completed(futures):
            print(f.result())
Failing with:
tornado.application - ERROR - Exception in callback <bound method Nanny.memory_monitor of <Nanny: None, threads: 3>>
Traceback (most recent call last):
File "/home/cjwright/mc/lib/python3.8/site-packages/tornado/ioloop.py", line 905, in _run
return self.callback()
File "/home/cjwright/mc/lib/python3.8/site-packages/distributed/nanny.py", line 414, in memory_monitor
process = self.process.process
AttributeError: 'NoneType' object has no attribute 'process'
This works though:
import contextlib
from concurrent.futures._base import as_completed

import dask
import distributed
from distributed.cfexecutor import ClientExecutor


@contextlib.contextmanager
def executor():
    with dask.config.set({"distributed.worker.daemon": True}):
        with distributed.LocalCluster(
            n_workers=5,
            processes=True,
            # silence_logs='error'
        ) as cluster:
            with distributed.Client(cluster) as client:
                yield ClientExecutor(client)


def add(z):
    return z + 1


if __name__ == '__main__':
    with executor() as pool:
        futures = [pool.submit(add, k) for k in range(10)]
        for f in as_completed(futures):
            print(f.result())
Does that work for a reproducible example?
(dask version 2020.12.0)
I'm hitting a very similar problem. I have many nodes using a single dask-worker-space on a shared file system. When a node wants to retire its own workers via cluster.adapt(), it accidentally kills the workers of other nodes. The problem went away when I set a local_directory for each node.
I am not sure whether sharing a single dask-worker-space among nodes is a pattern that should be avoided from the beginning, or whether this is a bug in how that situation is handled. If it's the first case, I would recommend supporting it, because when I want to use Dask quickly on a cluster my intuition won't tell me "set local_directory first, otherwise the program will crash". If it's the second case, I would be happy to help you reproduce this bug on your side.
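A minimal sketch of that workaround, assuming workers are started from Python (the path is illustrative): give each node its own local_directory instead of sharing one dask-worker-space over the network file system.
import socket
from dask.distributed import Worker

async def start_worker(scheduler_address):
    # One scratch directory per node, so retiring workers on one node
    # cannot touch the worker space belonging to another node.
    return await Worker(
        scheduler_address,
        local_directory=f"/tmp/dask-worker-space-{socket.gethostname()}",
    )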
@CJ-Wright when I run your first code snippet with the latest dask and distributed release (v2021.04.0), I see the AttributeError you posted; however, at the top of the traceback I see:
Process Dask Worker process (from Nanny):
Traceback (most recent call last):
File "/Users/james/miniforge3/envs/distributed/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/Users/james/miniforge3/envs/distributed/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/Users/james/projects/dask/distributed/distributed/process.py", line 192, in _run
target(*args, **kwargs)
File "/Users/james/projects/dask/distributed/distributed/nanny.py", line 726, in _run
logger.setLevel(silence_logs)
File "/Users/james/miniforge3/envs/distributed/lib/python3.8/logging/__init__.py", line 1409, in setLevel
self.level = _checkLevel(level)
File "/Users/james/miniforge3/envs/distributed/lib/python3.8/logging/__init__.py", line 194, in _checkLevel
raise ValueError("Unknown level: %r" % level)
ValueError: Unknown level: 'error'
However, setting silence_logs=logging.ERROR or silence_logs=False runs successfully. Is this also the case for you?
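For reference, a minimal variant of the snippet above with a valid level (only the silence_logs argument changes):
import logging
import distributed

if __name__ == '__main__':
    # silence_logs takes a logging level (or False), not the lowercase string 'error'
    with distributed.LocalCluster(n_workers=5, processes=True,
                                  silence_logs=logging.ERROR) as cluster:
        with distributed.Client(cluster) as client:
            print(client.submit(lambda x: x + 1, 1).result())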
I'm getting this error just by creating a cluster on an Azure VM with dask and distributed version 2021.04.01. These errors occur in a fresh Python 3.8 environment after installing these libraries with
pip install dask distributed pystac_client planetary_computer ipyleaflet geogif stackstac
import distributed
import dask
import pystac_client
import planetary_computer as pc
import ipyleaflet
import IPython.display as dsp
import geogif
import stackstac
cluster = distributed.LocalCluster(nworkers=2)
A cluster appears to be created, but with 0 workers and 0 memory, and these errors go on indefinitely unless I kill the Jupyter kernel. Sometimes the output from these repeating errors is large enough that the notebook freezes.
AttributeError: 'NoneType' object has no attribute 'process'
tornado.application - ERROR - Exception in callback <bound method Nanny.memory_monitor of <Nanny: None, threads: 4>>
Traceback (most recent call last):
File "/anaconda/envs/pybayts/lib/python3.8/site-packages/tornado/ioloop.py", line 905, in _run
return self.callback()
File "/anaconda/envs/pybayts/lib/python3.8/site-packages/distributed/nanny.py", line 417, in memory_monitor
process = self.process.process
AttributeError: 'NoneType' object has no attribute 'process'
tornado.application - ERROR - Exception in callback <bound method Nanny.memory_monitor of <Nanny: None, threads: 4>>
Traceback (most recent call last):
File "/anaconda/envs/pybayts/lib/python3.8/site-packages/tornado/ioloop.py", line 905, in _run
return self.callback()
File "/anaconda/envs/pybayts/lib/python3.8/site-packages/distributed/nanny.py", line 417, in memory_monitor
process = self.process.process
AttributeError: 'NoneType' object has no attribute 'process'
@rbavery the AttributeError: 'NoneType' object has no attribute 'process' error you're seeing usually means that the worker process wasn't able to start up properly. After https://github.com/dask/distributed/pull/4733, which was just merged, any exceptions which caused the worker to not start up will now be raised. Could you try again with the current main branch of distributed?
@jrbourbeau thanks for the quick tip. I installed from main and now the error in the notebook where I'm running distributed.LocalCluster(nworkers=2) looks like this (not sure if there is a difference?):
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Restarting worker
tornado.application - ERROR - Exception in callback <bound method Nanny.memory_monitor of <Nanny: None, threads: 4>>
Traceback (most recent call last):
File "/anaconda/envs/stackstac/lib/python3.8/site-packages/tornado/ioloop.py", line 905, in _run
return self.callback()
File "/anaconda/envs/stackstac/lib/python3.8/site-packages/distributed/nanny.py", line 417, in memory_monitor
process = self.process.process
AttributeError: 'NoneType' object has no attribute 'process'
tornado.application - ERROR - Exception in callback <bound method Nanny.memory_monitor of <Nanny: None, threads: 4>>
Traceback (most recent call last):
File "/anaconda/envs/stackstac/lib/python3.8/site-packages/tornado/ioloop.py", line 905, in _run
return self.callback()
File "/anaconda/envs/stackstac/lib/python3.8/site-packages/distributed/nanny.py", line 417, in memory_monitor
process = self.process.process
AttributeError: 'NoneType' object has no attribute 'process'
distributed.nanny - WARNING - Restarting worker
tornado.application - ERROR - Exception in callback <bound method Nanny.memory_monitor of <Nanny: None, threads: 4>>
However I just noticed these logs in the terminal where my notebook is running:
Process Dask Worker process (from Nanny):
Traceback (most recent call last):
File "/anaconda/envs/stackstac/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/anaconda/envs/stackstac/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/anaconda/envs/stackstac/lib/python3.8/site-packages/distributed/process.py", line 192, in _run
target(*args, **kwargs)
File "/anaconda/envs/stackstac/lib/python3.8/site-packages/distributed/nanny.py", line 736, in _run
worker = Worker(**worker_kwargs)
File "/anaconda/envs/stackstac/lib/python3.8/site-packages/distributed/worker.py", line 695, in __init__
super().__init__(
File "/anaconda/envs/stackstac/lib/python3.8/site-packages/distributed/core.py", line 241, in __init__
super().__init__(**kwargs)
TypeError: object.__init__() takes exactly one argument (the instance to initialize)
@rbavery note that the latter exception is caused by improper arguments to the workers. This was fixed in https://github.com/dask/distributed/pull/4733 and should now produce a better exception. I would encourage you to try again.
Edit: I re-read your comment and realized you already installed from the main branch. However, the line the traceback is referring to was deleted in the above-referenced PR. Can you please verify that the installation is correct? Sometimes, especially with conda environments, "from repo" installs are accidentally overwritten.
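One quick way to verify which installation is actually being imported (a generic check, nothing distributed-specific):
import distributed

print(distributed.__version__)  # a main-branch install typically shows a dev/post version suffix
print(distributed.__file__)     # should point into the environment (or repo checkout) you expect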
@fjetter thanks a bunch for the help. It looks like my conda environment displayed in the upper right of the notebook wasn't actually active for some reason. I ran the following in ipython instead with the environment actually active:
import distributed
cluster = distributed.LocalCluster(nworkers=2)
and was able to get a more informative traceback with the main branch installed:
TypeError: __init__() got an unexpected keyword argument 'nworkers'
Using the correct arg n_workers solved my problem.
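For anyone skimming, the corrected call from the earlier snippet:
import distributed

cluster = distributed.LocalCluster(n_workers=2)  # n_workers, not nworkers
client = distributed.Client(cluster)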
Hi everybody, since a few days we're seeing "random" failures in our CI due to distributed emitting tornado.application - ERROR - Exception in callback <bound method Nanny.memory_monitor ...>, ending in AttributeError: 'NoneType' object has no attribute 'process'.
This feels like a race condition in some situation, e.g. the Nanny closing while the periodic callbacks are still running but Nanny.process is already None. I think (still investigating) we've started seeing this only after 2.20 was released.
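To make the suspected race concrete, here is a simplified toy sketch (emphatically not the actual distributed source): a periodic callback that dereferences the nanny's process has to tolerate the window in which shutdown has already cleared it.
# Simplified toy, NOT the distributed implementation: a periodic callback
# can still fire after close() has set `process` to None.
class ToyWorkerProcess:
    process = "psutil.Process(...)"         # placeholder for the underlying OS process handle

class ToyNanny:
    def __init__(self):
        self.process = ToyWorkerProcess()   # stands in for Nanny.process

    def close(self):
        self.process = None                 # shutdown clears the handle first...

    def memory_monitor(self):
        if self.process is None:            # ...so the callback needs a guard like this
            return
        process = self.process.process      # the line that raises when the guard is missing
        print("monitoring", process)

nanny = ToyNanny()
nanny.memory_monitor()   # fine while the worker process exists
nanny.close()
nanny.memory_monitor()   # without the guard, self.process.process raises the AttributeError above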
I cannot attach an MFE yet simply because we don't have one :grimacing: -- we only experience this on CI. I'd welcome any sort of feedback. I can try to give some context though: the error is triggered in a Jupyter Notebook by a cell which calls scipy.minimize from a Dask worker (cell 14 here -- look for optimize.minimize). I doubt it's ever going to be useful, but here's an excerpt of the raw log (part of which I pasted above): note that it repeats over and over again for hundreds/thousands of lines... Thanks!