dask / distributed

A distributed task scheduler for Dask
https://distributed.dask.org
BSD 3-Clause "New" or "Revised" License

more routine garbage collection in distributed? #1516

Open jrmlhermitte opened 6 years ago

jrmlhermitte commented 6 years ago

I've noticed that memory usage seems to increase over time, so I was worried about memory leaks. I haven't found any (as I'm sure you were confident I'd say ;-) ).

However, what I have noticed is that sometimes, when a Python process is killed, the memory usage on the cluster doesn't go to zero right away. This can be problematic if the memory usage is quite large.

For example, say we have the following code in a file called test_distributed.py:

from distributed import Client
import numpy as np

client = Client("IP:PORT")  # put IP and PORT of the scheduler here

def foo(a):
    return a + 1

# two arrays of 1e8 float64 values, ~800 MB each
arr = np.ones(100_000_000)
arr2 = np.zeros(100_000_000)

ff = client.submit(foo, arr)
ff2 = client.submit(foo, arr2)

If I manually run it 5 times (python test_distributed.py), I see the memory usage shown in the attached "Memory profiles" screenshot.

The memory usage goes up, then comes down when the process terminates, but does not reach zero. When I run the process again, it goes up again but never exceeds the previous peak. So this suggests there is no memory leak.

I figured this might have something to do with Python's garbage collector, so I went one step further and ran the following script:

from distributed import Client

client = Client("IP:PORT")  # put IP and PORT of the scheduler here

def cleanup():
    # runs on whichever worker picks up the task
    import gc
    gc.collect()

client.submit(cleanup)

This brought the memory back down to zero (second "Memory profiles" screenshot).
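As a side note on the script above: client.submit(cleanup) schedules cleanup as a task, so it only runs on whichever single worker picks it up. To collect garbage on every worker, client.run executes a function directly on all connected workers. A minimal sketch (using an in-process cluster purely for demonstration; in the report above you would pass the scheduler address instead):

```python
import gc

from distributed import Client

# In-process cluster just for demonstration; a real deployment
# would pass the scheduler address, e.g. Client("IP:PORT").
client = Client(processes=False)

# client.run executes the function on every connected worker (not as a
# task), returning a dict mapping worker address -> return value.
results = client.run(gc.collect)
print(results)

client.close()
```

Each value in results is the count of unreachable objects that gc.collect() found on that worker.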

My feeling is that the Python garbage collector could sometimes be more aggressive about releasing memory. For a long-running application like distributed, I think it could be a good idea to force a garbage collection every once in a while.

What do you think? Is my guess correct, and would there be a way to resolve this on the distributed side? The other obvious solution is for the user to run a cron script that sends gc messages to the cluster. However, that is not very clean (and for large, intermittent loads a fixed schedule may fire at unhelpful times).
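A client-side alternative to cron, sketched with only the standard library: a background thread that triggers a collection on a fixed interval. The helper name and interval below are hypothetical, and the demo calls a local gc.collect(); against a real cluster the callback body would instead be client.run(gc.collect).

```python
import gc
import threading
import time

def start_periodic_gc(interval, stop_event, on_collect=None):
    """Run gc.collect() every `interval` seconds until stop_event is set."""
    def loop():
        # Event.wait doubles as an interruptible sleep: it returns True
        # (ending the loop) as soon as stop_event is set.
        while not stop_event.wait(interval):
            freed = gc.collect()
            if on_collect is not None:
                on_collect(freed)
    t = threading.Thread(target=loop, daemon=True)
    t.start()
    return t

# demo: collect every 50 ms for ~0.3 s
stop = threading.Event()
collections = []
t = start_periodic_gc(0.05, stop, collections.append)
time.sleep(0.3)
stop.set()
t.join()
print(len(collections))  # several collections ran
```

The daemon flag keeps the thread from blocking interpreter shutdown; the stop event gives a clean way to cancel it earlier.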

I looked around to see whether this had been mentioned before and didn't find anything. I apologize if this is a repost. Thanks!

mrocklin commented 6 years ago

Are you running master or the latest release? There has been quite a bit of activity on this recently.


jrmlhermitte commented 6 years ago

The version is '1.19.3+17.g74cebfb'. I can pull the latest and test this again right now.

jrmlhermitte commented 6 years ago

I pulled from master (making sure to delete the pip-installed distributed first; it was a pip install directly from GitHub about a week ago anyway). I also submitted a print(distributed.__file__) just to be sure the correct file was being used.

I see the same result.

In passing, is there a way to see the version of distributed in use from the Bokeh server? That could be a nice feature.

mrocklin commented 6 years ago

I recommend searching recent pull requests for the term GC. You'll find a few from the last few weeks.

This may interest @ogrisel and @bluenote10. Any desire to add an infrequent periodic self._throttled_gc call in Worker.memory_monitor?
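For readers following along, the idea behind a throttled GC call is simply to skip collections that arrive too soon after the previous one, so a periodic hook can be cheap. A rough sketch of that pattern (class and attribute names here are illustrative, not distributed's actual implementation):

```python
import gc
import time

class ThrottledGC:
    """Run gc.collect() at most once per min_interval seconds.

    Illustrative sketch only; the real logic lives in distributed's
    GC-related pull requests.
    """

    def __init__(self, min_interval=1.0):
        self.min_interval = min_interval
        self._last = float("-inf")  # ensure the first call always collects

    def collect(self):
        now = time.monotonic()
        if now - self._last >= self.min_interval:
            self._last = now
            return gc.collect()  # number of unreachable objects found
        return None  # skipped: previous collection was too recent

throttled = ThrottledGC(min_interval=60.0)
print(throttled.collect())  # first call runs and returns a count
print(throttled.collect())  # immediate second call is skipped: None
```

A memory monitor could then call throttled.collect() on every tick without paying the cost of a full collection each time.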

mrocklin commented 6 years ago

In passing, is there a way to see the version of distributed used on the bokeh server? That could be a nice feature.

If you're interested, this could be added easily to the new HTML routes available in the info tab. The templates live in distributed/bokeh/templates/. I recommend changing workers.html to index.html and including more information alongside the workers table already there.