dask / dask-tutorial

Dask tutorial
https://tutorial.dask.org
BSD 3-Clause "New" or "Revised" License

Fix Remove processes=False for Client #187

Closed: JimCircadian closed this pull request 4 years ago

JimCircadian commented 4 years ago

Fixes an issue with running the large weather dataset locally. According to the LocalCluster docs, processes=False results in a multithreaded implementation, which creates a lot of memory warnings because of the overhead against the single process's memory limit. Removing it lets the cluster chew through the dataset nicely with no warnings...
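For context, the change boils down to how the Client is created in the notebook. A minimal sketch, assuming the notebook constructs the Client directly from dask.distributed (the exact cell contents may differ):

```python
from dask.distributed import Client

# Before: processes=False runs every worker as a thread inside the single
# Jupyter kernel process, so all workers share one process-wide memory
# budget, which led to the memory warnings described above.
client = Client(processes=False)

# After (this PR): rely on the default, which starts a LocalCluster with
# separate worker processes, each with its own memory limit.
client = Client()
```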

I hope this suggestion helps! I suspect it would have saved a lot of the chatter about speed/warnings in the live tutorial :-)

review-notebook-app[bot] commented 4 years ago

Check out this pull request on ReviewNB


mrocklin commented 4 years ago

Thank you for raising this @JimCircadian .

Hrm, I'm surprised that switching to processes would reduce warnings about memory. Is this easy to reproduce on the binder deployment, or were you running this locally?

JimCircadian commented 4 years ago

Hi @mrocklin

I was running this locally. Indeed, I found that a little odd as well, but I suspected there was a maximum memory limit within the single Jupyter kernel process (not that I've ever heard of that before). If you like, I'm happy to delve into it and have a proper look. I'm also hoping to contribute and join the Dask sprint at the weekend (SciPy 2020), so maybe we can discuss whether there's additional feedback or changes to consider?

Was happily tinkering with xarray and dask during tutorials today, so it's only stoking my enthusiasm!!! :+1:

JimCircadian commented 4 years ago

Worked on some issues in the sprint today, will attend to giving more info on this tomorrow! :-)

JimCircadian commented 4 years ago

Alas, that tomorrow ended up being really sunny and I got out on the bike after a week of work and SciPy combined. I'm going to have another little burst through the issues I'm looking at this weekend ;-)

JimCircadian commented 4 years ago

[Attached screenshots: output with processes=True vs. processes=False]

I've attached screenshots from a recreated environment showing the difference between processes=False and processes=True. My assumption is that with processes=False the workers are assigned to the Jupyter kernel process as threads, rather than running as independent processes that don't inherit the kernel process's memory limits, which would explain the difference.
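One way to see what this comment describes (purely illustrative; not a cell from the tutorial) is to compare the worker info the scheduler reports under each setting:

```python
from dask.distributed import Client

def show_workers(client):
    # scheduler_info() reports one entry per worker; with processes=False
    # every entry points back into the same (Jupyter kernel) process.
    for addr, info in client.scheduler_info()["workers"].items():
        print(addr, "memory_limit:", info.get("memory_limit"))

# Threads only: workers share the kernel process and its memory limit.
client = Client(processes=False)
show_workers(client)
client.close()

# Default: one process per worker, each with its own memory limit.
client = Client()
show_workers(client)
client.close()
```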

Anyway, I'm happy to look into this a bit more if you want. I've also made a small amendment to the tutorial README to offer jupyter lab as well, since it's installed anyway...

martindurant commented 4 years ago

@mrocklin, did you have a chance to think about why the memory warnings might be happening? Could they really indicate leaks? I can see how, if that were the case, spreading the tasks among multiple processes might also spread out the leak and stop the warnings.

mrocklin commented 4 years ago

> @mrocklin, did you have a chance to think about why the memory warnings might be happening? Could they really indicate leaks?

No, I haven't given this any thought

JimCircadian commented 4 years ago

Hi @martindurant @mrocklin

I'll do some profiling on the processes and determine why this happens. Like I mentioned, I think it's more likely related to the Jupyter kernel process than to a memory leak. It sounds like it needs some investigation, and I'm happy to do that. :smile:

The main reason I picked up on it was that it has a detrimental effect on processing speed when people run it on their laptops during a tutorial. It wasn't just the warnings, and it made the chat go wild at SciPy 2020, so I reckon it's worth solving. :wink:
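As a sketch of the profiling mentioned above (illustrative only; psutil and client.run are my own choices, not prescribed by the thread), one could compare per-worker resident memory under the two configurations:

```python
import psutil
from dask.distributed import Client

def worker_rss():
    # Resident set size of the process hosting the worker that runs this.
    return psutil.Process().memory_info().rss

# With processes=False every worker reports the same RSS, because they
# are all threads inside the one Jupyter kernel process.
client = Client(processes=False)
print(client.run(worker_rss))
client.close()

# With the default multi-process cluster, each worker is its own process,
# so memory use (and memory limits) are tracked per worker.
client = Client()
print(client.run(worker_rss))
client.close()
```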

martindurant commented 4 years ago

Those sound like reasonable points, and I don't see any counterargument to this change (although we can't remember why it was written like this). So I will merge this, but it would be good to find out the root cause of the warnings and the slowness.