dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org
BSD 3-Clause "New" or "Revised" License

LocalCluster not reaping workers with run_on_scheduler and multiprocessing-method: spawn #1943

Open adbull opened 6 years ago

adbull commented 6 years ago

Running the following:

from dask import distributed

def noop():
    pass

def noop_on_scheduler(i):
    distributed.get_worker().client.run_on_scheduler(noop)

if __name__ == '__main__':
    while True:
        with distributed.Client(n_workers=8, threads_per_worker=1) as client:
            distributed.wait(client.map(noop_on_scheduler, range(8)))

I get lots of worker processes being created but not destroyed, and when I hit Ctrl-C, I get lots of warning logs like:

distributed.process - WARNING - reaping stray process <SpawnProcess(SpawnProcess-18, started daemon)>

Versions:

  - dask=0.17.2=py36_0
  - dask-core=0.17.2=py36_0
  - distributed=1.21.6=py36_0
  - tornado=4.5.3=py36_0

mrocklin commented 6 years ago

First, let me apologize for the long delay in responding to this. Thank you for the well worded issue and the minimal example.

My first attempt to reproduce this failed. I checked for leaking processes by adding the following line to the while loop:

print(len(psutil.pids()))

This returned a roughly constant number during execution.

Can I ask, how are you checking for leaked processes?
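For readers following along, one way to make "roughly constant" versus "growing" concrete is to record a count on every loop iteration and compare the samples. The sketch below is only an illustration and is not from the thread: the helper names `count_live_children` and `looks_like_leak` are hypothetical, and it uses the standard library's `multiprocessing.active_children()` in place of `psutil.pids()`, on the assumption that the processes of interest are children started through `multiprocessing` in the current interpreter.

```python
import multiprocessing


def count_live_children():
    """Count child processes started via multiprocessing that are still alive.

    Note: active_children() also joins children that have already finished,
    so calling it repeatedly does not itself accumulate zombies.
    """
    return len(multiprocessing.active_children())


def looks_like_leak(samples, tolerance=0):
    """Return True if each sample exceeds the previous one by more than `tolerance`.

    `samples` is a list of process counts, one per loop iteration
    (hypothetical helper; the threshold is an assumption, not a dask rule).
    """
    if len(samples) < 2:
        return False
    return all(later - earlier > tolerance
               for earlier, later in zip(samples, samples[1:]))
```

In the loop from the original report, one could append either `count_live_children()` or `len(psutil.pids())` to a list after each `with` block exits and pass that list to `looks_like_leak`. Since `psutil.pids()` counts every process on the machine, unrelated activity adds noise to it, so a small positive `tolerance` may be useful there.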

mrocklin commented 6 years ago

Ah, actually, I'm guessing that you're on Python 2, where multiprocessing Queues seem to hang every once in a while. This may not be a leaking-process issue.

Can you test to verify if this is also a problem for you on Python 3?

adbull commented 6 years ago

Actually, this is with Python 3.6.5 on Linux. I diagnosed the leaked processes using Gnome System Monitor; it's pretty clear from the process list.

I can check that psutil agrees tomorrow, but I'd expect it to give the same result.

adbull commented 6 years ago

So in a clean environment I can't replicate it either, sorry about that! Must have been something strange in my set-up before. Anyway, thanks for taking the time to check this out.

adbull commented 6 years ago

Ah, figured it out: the issue only happens with DASK_MULTIPROCESSING_METHOD=spawn set.

In that case, the process count keeps growing, as measured by psutil.
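Because the behaviour depends on an environment variable, it can help to run the repro in a fresh interpreter with that variable set explicitly, so it cannot be confused with whatever is in the current shell. The following is a minimal sketch under stated assumptions: `env_with_spawn` and `run_repro` are hypothetical helpers, and `script_path` is assumed to point at a file containing the script from the original report.

```python
import os
import subprocess
import sys


def env_with_spawn(base=None):
    """Return a copy of an environment mapping with the spawn setting applied.

    The variable name and value come from the thread; the input mapping is
    copied, never modified.
    """
    env = dict(os.environ if base is None else base)
    env["DASK_MULTIPROCESSING_METHOD"] = "spawn"
    return env


def run_repro(script_path, timeout=60):
    """Run the repro script in a new interpreter with the spawn setting.

    The script in the report loops forever, so a timeout is used to stop it.
    Returns the exit code, or None if the timeout was reached.
    """
    try:
        result = subprocess.run(
            [sys.executable, script_path],
            env=env_with_spawn(),
            timeout=timeout,
        )
        return result.returncode
    except subprocess.TimeoutExpired:
        # subprocess.run kills the direct child on timeout; any stray
        # worker processes it left behind are not cleaned up here.
        return None
```

Running the same script once with `env_with_spawn()` and once with an unmodified environment gives a side-by-side comparison of the two multiprocessing methods.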