dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org
BSD 3-Clause "New" or "Revised" License

Potential race condition in Nanny #3955

Open FabioLuporini opened 4 years ago

FabioLuporini commented 4 years ago

Hi everybody, for the past few days we've been seeing "random" failures in our CI due to distributed emitting:

tornado.application - ERROR - Exception in callback <bound method Nanny.memory_monitor of <Nanny: None, threads: 1>>
Traceback (most recent call last):
  File "/opt/hostedtoolcache/Python/3.7.7/x64/lib/python3.7/site-packages/tornado/ioloop.py", line 907, in _run
    return self.callback()
  File "/opt/hostedtoolcache/Python/3.7.7/x64/lib/python3.7/site-packages/distributed/nanny.py", line 414, in memory_monitor
    process = self.process.process
AttributeError: 'NoneType' object has no attribute 'process'

This feels like a race condition in some situations, e.g. during closing of the Nanny the periodic callbacks are still running while Nanny.process is already None.
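
To illustrate the kind of race I mean, here is a minimal, self-contained sketch (FakeNanny and everything in it is made up for illustration; this is not the real Nanny code) where a tornado PeriodicCallback keeps firing after the state it reads has been cleared, producing the same AttributeError:

import asyncio
from types import SimpleNamespace

from tornado.ioloop import PeriodicCallback

class FakeNanny:
    """Stand-in for the real Nanny; it only mimics the shape of the race."""

    def __init__(self):
        # self.process stands in for the Nanny's worker-process wrapper
        self.process = SimpleNamespace(process="worker-process")
        self.periodic_callback = PeriodicCallback(self.memory_monitor, 100)

    def memory_monitor(self):
        # If close() has already cleared self.process, this line raises
        # AttributeError: 'NoneType' object has no attribute 'process'
        return self.process.process

    async def close(self):
        # Clearing the process without first stopping the periodic callback
        # leaves the callback firing against None.
        self.process = None

async def main():
    nanny = FakeNanny()
    nanny.periodic_callback.start()
    await asyncio.sleep(0.05)
    await nanny.close()
    await asyncio.sleep(0.5)  # tornado now logs the AttributeError repeatedly

asyncio.run(main())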

I think (still investigating) we've started seeing this only after 2.20 was released.

I cannot attach an MFE yet simply because we don't have one 😬 we only experience this on CI. I'd welcome any sort of feedback. I can try to give some context though: the error is triggered in a Jupyter Notebook by a cell which calls scipy.minimize from a Dask worker (cell 14 here -- look for optimize.minimize). I doubt it's ever going to be useful, but here's an excerpt of the raw log (part of which I pasted above); note that it repeats over and over again for hundreds/thousands of lines...
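
Roughly, the pattern in that cell boils down to something like this (a simplified sketch with a placeholder objective function, not our actual notebook code):

# Simplified sketch: scipy.optimize.minimize is executed on a Dask worker.
import numpy as np
from scipy import optimize
from dask.distributed import Client

def objective(x):
    return float(np.sum((x - 3.0) ** 2))

if __name__ == "__main__":
    client = Client(n_workers=1, threads_per_worker=1)
    future = client.submit(optimize.minimize, objective, np.zeros(4), method="L-BFGS-B")
    print(future.result().x)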

Thanks!

mrocklin commented 4 years ago

Thank you for raising this issue @FabioLuporini .

At first glance I don't see any obvious cause. A minimal reproducer would be useful. Alternatively, given that this started affecting you only recently, you could also try git bisect on this repository to see if you can isolate the commit that caused this change. This might be hard though if you can only reproduce the issue in CI.

FabioLuporini commented 4 years ago

I could...but it's gonna take forever :grimacing:

so, after an afternoon of CI re-spins, I think we're quite confident in saying that 2.19 doesn't exhibit this behaviour

I guess we're gonna enforce <2.20 for now

mrocklin commented 4 years ago

I could...but it's gonna take forever 😬

There are only 13 commits between 2.19 and 2.20. It might not be terrible.

FabioLuporini commented 4 years ago

OK, I thought it was worse than that. Do you suggest I do it on all commits or just on the merge commits? (Haven't checked the commit history, will do shortly.)

mrocklin commented 4 years ago

We squash-and-merge, so every commit in history is valid and clean

jrdzha commented 4 years ago

I'm also having this issue. I just wanted to add that we're experiencing this about 1 in 30 times.

mrocklin commented 4 years ago

Thanks @jrdzha . Do you have time to try the git-bisect process? I would expect this to take around 30-45 minutes.

jrdzha commented 4 years ago

@mrocklin I'm quite busy for the next few days, but I'll take a look after that.

FabioLuporini commented 4 years ago

@mrocklin sorry -- we did try it (OK, not really git bisect, for the reasons explained above, but a conceptually analogous thing). We bumped into other issues though, and I eventually forgot to update this thread. I'll let @rhodrin explain our findings 'cause I don't quite remember what the latest news from this front is

rhodrin commented 4 years ago

@mrocklin Yes, sorry for the delayed response. So we set up an Actions matrix to run our tests many times for each commit between 2.19 and 2.20, and later for some earlier and later versions as well.

Firstly, failures are rare and not particularly consistent (possibly related to load on the GitHub-hosted runners). One day we may see a failure 1 in 10 times; the next day we could see ~100 runs with no failure. We managed to produce failures on, I think, all commits between 2.19 and 2.20 and even on some post-2.20 commits. We then tried to produce failures on some older versions, but that day we saw absolutely no failures on any version between 2.16 and 2.21 - I think we ran the tests ~80 times on each version. After that I took a break from this.

So in summary, this one is a real mystery. But my inclination is that it's something memory related: we only managed to produce failures when running a large suite of tests; repeatedly running the failing test individually never produced any failures. Also, I've not managed to reproduce this on my local machine (which is pretty decent).

jrdzha commented 4 years ago

@rhodrin Yes, in our testing the failures seem to occur randomly too. We run our tests using LocalCluster on a 'local' machine (not our Kubernetes cluster), though, and they are failing there.

mrocklin commented 4 years ago

OK, it seems like the next thing to do here then is for someone to produce a reproducible example.

miroenev commented 4 years ago

Just wanted to chime in that we are consistently seeing the nanny NoneType errors when running LocalCudaCluster/LocalCluster jobs on cloud instances (the code gets wrapped into a container/AWS SageMaker Estimator); however, the issue seems not to occur when running with containers on a local machine. Interestingly, these errors are non-critical and the computations continue to completion, but the errors dominate/clobber the output.

nanny_issues_7_30_2020.txt

https://github.com/rapidsai/cloud-ml-examples/blob/master/aws/notebooks/sagemaker_rapids_hpo.ipynb https://github.com/rapidsai/cloud-ml-examples/blob/master/aws/code/rapids_cloud_ml.py

mrocklin commented 4 years ago

cc @quasiben

quasiben commented 4 years ago

@miroenev it looks like your issue is explained at the top of your log output:

FileExistsError: [Errno 17] File exists: '/opt/ml/output/dask-worker-space

There have been some changes to dask-worker directory creation. @FabioLuporini can you upgrade to the latest release (2.22.0) and try again?

miroenev commented 4 years ago

@quasiben I think they are separate errors.

My hypothesis is that the dask_worker_space issue is due to the way SageMaker attaches the same dataset S3 volume to multiple instances when running HPO. I've since tried to set the temporary directory location to a place that SageMaker's documentation suggests will not overlap between instances, but it's not entirely clear whether this fixes the dask_worker_space 'error.' We are less concerned about this since it has minimal impact on the output logs (it only shows up once) and in practice does not seem to impact the performance of the upstream Dask computations.

The nanny memory_monitor error, however, is quite annoying and dominates the output logs. I've tried to suppress it via the logging configuration, but it still persists.

https://github.com/rapidsai/cloud-ml-examples/blob/42d8ab333f8750beafd8d59a7f86991207c1488c/aws/code/rapids_cloud_ml.py#L430
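
For anyone else hitting this: tornado emits these callback errors via its "tornado.application" logger, so a blunt workaround (a sketch of a mitigation only, not a fix for the underlying race) is to raise that logger's level in the process running the Nanny:

import logging

# Hide the repeated "Exception in callback ... memory_monitor" messages.
# This only silences the log noise; it does not address the race itself.
logging.getLogger("tornado.application").setLevel(logging.CRITICAL)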

rhodrin commented 4 years ago

@quasiben We're now using 2.22.0. We haven't seen this issue yet, but will report back if we do.

EmperorArthur commented 4 years ago

I just ran into this issue on my local machine, so it is alive and well.

from distributed import Client
client = Client(asynchronous=False)

Results in this being output over and over again:

tornado.application - ERROR - Exception in callback <bound method Nanny.memory_monitor of <Nanny: None, threads: 3>>
Traceback (most recent call last):
  File "/home/arthur/.local/lib/python3.8/site-packages/tornado/ioloop.py", line 907, in _run
    return self.callback()
  File "/home/arthur/.local/lib/python3.8/site-packages/distributed/nanny.py", line 414, in memory_monitor
    process = self.process.process
AttributeError: 'NoneType' object has no attribute 'process'

mrocklin commented 4 years ago

I recommend getting logs and then posting them here

client = Client(asynchronous=False, silence_logs=False)

EmperorArthur commented 4 years ago

Huh, now those logs are interesting:

I suspect it has to do with how PyCharm configures everything, especially since I confirmed I can start it via a shell/pipenv.

Edit: I have changed the "ProjectName", but everything else is exactly as the system output it, including the "<input>" directory.

import os
os.getcwd()
'/home/arthur/src/ProjectName'
from distributed import Client
client = Client(asynchronous=False, silence_logs=False)
distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
distributed.scheduler - INFO - Clear task state
distributed.scheduler - INFO -   Scheduler at:     tcp://127.0.0.1:42219
distributed.scheduler - INFO -   dashboard at:            127.0.0.1:8787
distributed.nanny - INFO -         Start Nanny at: 'tcp://127.0.0.1:34001'
distributed.nanny - INFO -         Start Nanny at: 'tcp://127.0.0.1:45005'
distributed.nanny - INFO -         Start Nanny at: 'tcp://127.0.0.1:45505'
distributed.nanny - INFO -         Start Nanny at: 'tcp://127.0.0.1:41067'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/usr/lib/python3.8/multiprocessing/spawn.py", line 125, in _main
    prepare(preparation_data)
  File "/usr/lib/python3.8/multiprocessing/spawn.py", line 236, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "/usr/lib/python3.8/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
  File "/usr/lib/python3.8/runpy.py", line 264, in run_path
    code, fname = _get_code_from_file(run_name, path_name)
  File "/usr/lib/python3.8/runpy.py", line 234, in _get_code_from_file
    with io.open_code(decoded_path) as f:
FileNotFoundError: [Errno 2] No such file or directory: '/home/arthur/src/ProjectName/<input>'
distributed.nanny - INFO - Worker process 200663 exited with status 1
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/usr/lib/python3.8/multiprocessing/spawn.py", line 125, in _main
    prepare(preparation_data)
  File "/usr/lib/python3.8/multiprocessing/spawn.py", line 236, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "/usr/lib/python3.8/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
  File "/usr/lib/python3.8/runpy.py", line 264, in run_path
    code, fname = _get_code_from_file(run_name, path_name)
  File "/usr/lib/python3.8/runpy.py", line 234, in _get_code_from_file
    with io.open_code(decoded_path) as f:
FileNotFoundError: [Errno 2] No such file or directory: '/home/arthur/src/ProjectName/<input>'
distributed.nanny - INFO - Closing Nanny at 'tcp://127.0.0.1:45005'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/usr/lib/python3.8/multiprocessing/spawn.py", line 125, in _main
    prepare(preparation_data)
  File "/usr/lib/python3.8/multiprocessing/spawn.py", line 236, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "/usr/lib/python3.8/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
  File "/usr/lib/python3.8/runpy.py", line 264, in run_path
    code, fname = _get_code_from_file(run_name, path_name)
  File "/usr/lib/python3.8/runpy.py", line 234, in _get_code_from_file
    with io.open_code(decoded_path) as f:
FileNotFoundError: [Errno 2] No such file or directory: '/home/arthur/src/ProjectName/<input>'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/usr/lib/python3.8/multiprocessing/spawn.py", line 116, in spawn_main
    exitcode = _main(fd, parent_sentinel)
  File "/usr/lib/python3.8/multiprocessing/spawn.py", line 125, in _main
    prepare(preparation_data)
  File "/usr/lib/python3.8/multiprocessing/spawn.py", line 236, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "/usr/lib/python3.8/multiprocessing/spawn.py", line 287, in _fixup_main_from_path
    main_content = runpy.run_path(main_path,
  File "/usr/lib/python3.8/runpy.py", line 264, in run_path
    code, fname = _get_code_from_file(run_name, path_name)
  File "/usr/lib/python3.8/runpy.py", line 234, in _get_code_from_file
    with io.open_code(decoded_path) as f:
FileNotFoundError: [Errno 2] No such file or directory: '/home/arthur/src/ProjectName/<input>'
distributed.nanny - INFO - Worker process 200661 exited with status 1
distributed.nanny - INFO - Worker process 200666 exited with status 1
distributed.nanny - INFO - Worker process 200667 exited with status 1
distributed.nanny - INFO - Closing Nanny at 'tcp://127.0.0.1:34001'
distributed.nanny - INFO - Closing Nanny at 'tcp://127.0.0.1:41067'
distributed.nanny - INFO - Closing Nanny at 'tcp://127.0.0.1:45505'
distributed.scheduler - INFO - Receive client connection: Client-7ec868f1-0d58-11eb-8f4e-b7ca8d0cd9a4
distributed.core - INFO - Starting established connection
tornado.application - ERROR - Exception in callback <bound method Nanny.memory_monitor of <Nanny: None, threads: 3>>
Traceback (most recent call last):
  File "/home/arthur/.local/share/virtualenvs/ProjectName-GY1GZPBf/lib/python3.8/site-packages/tornado/ioloop.py", line 907, in _run
    return self.callback()
  File "/home/arthur/.local/share/virtualenvs/ProjectName-GY1GZPBf/lib/python3.8/site-packages/distributed/nanny.py", line 414, in memory_monitor
    process = self.process.process
AttributeError: 'NoneType' object has no attribute 'process'

With the last "tornado.application Error" repeating endlessly until I killed the console.

CJ-Wright commented 3 years ago

I've been able to trigger this error by setting silence_logs='error'.

This example doesn't work on my local machine:

import contextlib
from concurrent.futures._base import as_completed

import dask
import distributed
from distributed.cfexecutor import ClientExecutor

@contextlib.contextmanager
def executor():
    with dask.config.set({"distributed.worker.daemon": True}):
        with distributed.LocalCluster(
                n_workers=5,
                processes=True,
                silence_logs='error'
        ) as cluster:
            with distributed.Client(cluster, ) as client:
                yield ClientExecutor(client)

def add(z):
    return z + 1

if __name__ == '__main__':

    with executor() as pool:
        futures = [pool.submit(add, k) for k in range(10)]
        for f in as_completed(futures):
            print(f.result())

Failing with:

tornado.application - ERROR - Exception in callback <bound method Nanny.memory_monitor of <Nanny: None, threads: 3>>
Traceback (most recent call last):
  File "/home/cjwright/mc/lib/python3.8/site-packages/tornado/ioloop.py", line 905, in _run
    return self.callback()
  File "/home/cjwright/mc/lib/python3.8/site-packages/distributed/nanny.py", line 414, in memory_monitor
    process = self.process.process
AttributeError: 'NoneType' object has no attribute 'process'

This works though:

import contextlib
from concurrent.futures._base import as_completed

import dask
import distributed
from distributed.cfexecutor import ClientExecutor

@contextlib.contextmanager
def executor():
    with dask.config.set({"distributed.worker.daemon": True}):
        with distributed.LocalCluster(
                n_workers=5,
                processes=True,
                # silence_logs='error'
        ) as cluster:
            with distributed.Client(cluster, ) as client:
                yield ClientExecutor(client)

def add(z):
    return z + 1

if __name__ == '__main__':

    with executor() as pool:
        futures = [pool.submit(add, k) for k in range(10)]
        for f in as_completed(futures):
            print(f.result())

Does that work for a reproducible example?

(dask version 2020.12.0)

yuhengy commented 3 years ago

I'm hitting a very similar problem. I have many nodes using a single dask-worker-space on a shared file system. When a node wants to retire its own workers via cluster.adapt(), it accidentally kills the workers of other nodes. The problem goes away when I set a separate local_directory for each node.

I am not sure whether "using a single dask-worker-space among nodes" is bad practice that should be avoided from the beginning, or whether this is a bug in how that case is handled. If it's the first case, I would still recommend supporting it, because when I want to quickly use Dask on a cluster, my intuition won't tell me: "set the local_directory first, otherwise the program will crash". If it's the second case, I would be happy to help you reproduce this bug on your side.
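
For reference, the workaround that made the problem go away for me looks roughly like this (a sketch; the path and the LocalCluster call are just examples standing in for my actual cluster setup):

import os
import socket
from dask.distributed import Client, LocalCluster

# Derive a node-local scratch path from the hostname so workers on different
# nodes never share the same dask-worker-space on the shared file system.
scratch = os.path.join("/tmp", "dask-scratch", socket.gethostname())

if __name__ == "__main__":
    cluster = LocalCluster(n_workers=2, local_directory=scratch)
    client = Client(cluster)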

jrbourbeau commented 3 years ago

@CJ-Wright when I run your first code snippet with the latest dask and distributed release (2021.04.0), I see the AttributeError you posted; however, at the top of the traceback I see

Process Dask Worker process (from Nanny):
Traceback (most recent call last):
  File "/Users/james/miniforge3/envs/distributed/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/Users/james/miniforge3/envs/distributed/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/james/projects/dask/distributed/distributed/process.py", line 192, in _run
    target(*args, **kwargs)
  File "/Users/james/projects/dask/distributed/distributed/nanny.py", line 726, in _run
    logger.setLevel(silence_logs)
  File "/Users/james/miniforge3/envs/distributed/lib/python3.8/logging/__init__.py", line 1409, in setLevel
    self.level = _checkLevel(level)
  File "/Users/james/miniforge3/envs/distributed/lib/python3.8/logging/__init__.py", line 194, in _checkLevel
    raise ValueError("Unknown level: %r" % level)
ValueError: Unknown level: 'error'

However setting silence_logs=logging.ERROR or silence_logs=False runs successfully. Is this also the case for you?
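
Concretely, a trimmed-down variant along these lines (the inline lambda is just a placeholder task) runs successfully with silence_logs=logging.ERROR:

import logging

import dask
import distributed

if __name__ == "__main__":
    with dask.config.set({"distributed.worker.daemon": True}):
        # Pass a standard logging level instead of the string 'error'
        with distributed.LocalCluster(
            n_workers=5,
            processes=True,
            silence_logs=logging.ERROR,
        ) as cluster:
            with distributed.Client(cluster) as client:
                print(client.submit(lambda z: z + 1, 1).result())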

rbavery commented 3 years ago

I'm getting this error just by creating a cluster on an Azure VM with dask and distributed versions 2014.04.01. These errors occur in a fresh Python 3.8 environment after installing these libraries with pip install dask distributed pystac_client planetary_computer ipyleaflet geogif stackstac

import distributed
import dask
import pystac_client
import planetary_computer as pc
import ipyleaflet
import IPython.display as dsp
import geogif
import stackstac

cluster = distributed.LocalCluster(nworkers=2)

A cluster appears to be created, but with 0 workers and 0 memory, and these errors repeat indefinitely unless I kill the Jupyter kernel. Sometimes the output from these repeating errors is large enough that the notebook freezes.

AttributeError: 'NoneType' object has no attribute 'process'
tornado.application - ERROR - Exception in callback <bound method Nanny.memory_monitor of <Nanny: None, threads: 4>>
Traceback (most recent call last):
  File "/anaconda/envs/pybayts/lib/python3.8/site-packages/tornado/ioloop.py", line 905, in _run
    return self.callback()
  File "/anaconda/envs/pybayts/lib/python3.8/site-packages/distributed/nanny.py", line 417, in memory_monitor
    process = self.process.process
AttributeError: 'NoneType' object has no attribute 'process'
tornado.application - ERROR - Exception in callback <bound method Nanny.memory_monitor of <Nanny: None, threads: 4>>
Traceback (most recent call last):
  File "/anaconda/envs/pybayts/lib/python3.8/site-packages/tornado/ioloop.py", line 905, in _run
    return self.callback()
  File "/anaconda/envs/pybayts/lib/python3.8/site-packages/distributed/nanny.py", line 417, in memory_monitor
    process = self.process.process
AttributeError: 'NoneType' object has no attribute 'process'

jrbourbeau commented 3 years ago

@rbavery the AttributeError: 'NoneType' object has no attribute 'process' error you're seeing usually means that the worker process wasn't able to start up properly. After https://github.com/dask/distributed/pull/4733, which was just merged, any exception that caused the worker to not start up will now be raised. Could you try again with the current main branch of distributed?

rbavery commented 3 years ago

@jrbourbeau thanks for the quick tip. I installed from main, and now the error in the notebook where I'm running distributed.LocalCluster(nworkers=2) looks like this. Not sure if there is a difference:

distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Restarting worker
distributed.nanny - WARNING - Restarting worker
tornado.application - ERROR - Exception in callback <bound method Nanny.memory_monitor of <Nanny: None, threads: 4>>
Traceback (most recent call last):
  File "/anaconda/envs/stackstac/lib/python3.8/site-packages/tornado/ioloop.py", line 905, in _run
    return self.callback()
  File "/anaconda/envs/stackstac/lib/python3.8/site-packages/distributed/nanny.py", line 417, in memory_monitor
    process = self.process.process
AttributeError: 'NoneType' object has no attribute 'process'
tornado.application - ERROR - Exception in callback <bound method Nanny.memory_monitor of <Nanny: None, threads: 4>>
Traceback (most recent call last):
  File "/anaconda/envs/stackstac/lib/python3.8/site-packages/tornado/ioloop.py", line 905, in _run
    return self.callback()
  File "/anaconda/envs/stackstac/lib/python3.8/site-packages/distributed/nanny.py", line 417, in memory_monitor
    process = self.process.process
AttributeError: 'NoneType' object has no attribute 'process'
distributed.nanny - WARNING - Restarting worker
tornado.application - ERROR - Exception in callback <bound method Nanny.memory_monitor of <Nanny: None, threads: 4>>

However, I just noticed these logs in the terminal where my notebook is running:

Process Dask Worker process (from Nanny):
Traceback (most recent call last):
  File "/anaconda/envs/stackstac/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/anaconda/envs/stackstac/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/anaconda/envs/stackstac/lib/python3.8/site-packages/distributed/process.py", line 192, in _run
    target(*args, **kwargs)
  File "/anaconda/envs/stackstac/lib/python3.8/site-packages/distributed/nanny.py", line 736, in _run
    worker = Worker(**worker_kwargs)
  File "/anaconda/envs/stackstac/lib/python3.8/site-packages/distributed/worker.py", line 695, in __init__
    super().__init__(
  File "/anaconda/envs/stackstac/lib/python3.8/site-packages/distributed/core.py", line 241, in __init__
    super().__init__(**kwargs)
TypeError: object.__init__() takes exactly one argument (the instance to initialize)

fjetter commented 3 years ago

@rbavery note that the latter exception is caused by improper arguments being passed to the workers. This was fixed in https://github.com/dask/distributed/pull/4733 and should now produce a better exception. I would encourage you to try again.

Edit: I re-read your comment and realized you already installed from the main branch. However, the line the traceback refers to was deleted in the above-referenced PR. Can you please verify that the installation is correct? Sometimes, especially with conda environments, "from repo" installs are accidentally overwritten.
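
A quick sanity check (just a suggestion) is to print the version and the file path of the distributed module that actually gets imported:

import distributed

print(distributed.__version__)  # should reflect the main-branch development version
print(distributed.__file__)     # shows which installation is actually being imported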

rbavery commented 3 years ago

@fjetter thanks a bunch for the help. It looks like my conda environment displayed in the upper right of the notebook wasn't actually active for some reason. I ran the following in IPython instead with the environment actually active:

import distributed
cluster = distributed.LocalCluster(nworkers=2)

and was able to get a more informative traceback with the main branch installed:

TypeError: __init__() got an unexpected keyword argument 'nworkers'

Using the correct arg n_workers solved my problem.
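
For anyone skimming this thread later, the corrected call is simply:

import distributed

# LocalCluster takes n_workers (with an underscore), not nworkers.
cluster = distributed.LocalCluster(n_workers=2)
client = distributed.Client(cluster)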