dask / scipy-tutorials-2018

Scikit-Learn tutorial #5

Open TomAugspurger opened 6 years ago

TomAugspurger commented 6 years ago

Splitting from https://github.com/dask/scipy-tutorials-2018/issues/3

cc @amueller

> I think the most interesting application might be parallelizing on a single machine; I'm not sure how well that works with your setup?

I think the distributed joblib is the best value in terms of additional teaching time / usefulness. Teaching-wise it's "We've seen n_jobs=-1 mean all the cores on a single machine. With this context manager, n_jobs=-1 now means all the cores on a cluster!"
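A rough sketch of that teaching point, not taken from the tutorial material: the scikit-learn call stays the same, and only the surrounding `joblib.parallel_backend` context decides where `n_jobs=-1` runs. The snippet uses the local `"threading"` backend so it runs without a cluster. The `"dask"` variant in the comments is an assumption about the setup (it needs `dask.distributed` installed and a reachable scheduler), and the dataset and estimator are illustrative choices.

```python
from joblib import parallel_backend
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Small synthetic dataset, purely illustrative.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
clf = LogisticRegression(max_iter=1000)

# n_jobs=-1 asks joblib for "all available workers".
# Which workers those are depends on the active backend.
with parallel_backend("threading"):
    scores = cross_val_score(clf, X, y, cv=5, n_jobs=-1)

# Assumed cluster variant (requires dask.distributed and a running cluster):
#   from dask.distributed import Client
#   client = Client()  # or Client("<scheduler-address>")
#   with parallel_backend("dask"):
#       scores = cross_val_score(clf, X, y, cv=5, n_jobs=-1)

print(scores.mean())
```

The only line a student would change between the laptop and the cluster version is the backend name passed to the context manager.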

> What's the status on the broadcasting of data for doing a random forest in parallel?

RandomForest (and some others) hardcode the joblib backend to use threading. After the next joblib release, I plan to open issues on scikit-learn to:

  1. Change the hardcoded backends to use the new preferred / requires API
  2. Document the default joblib backend used
  3. Write a glossary page on joblib backends
  4. Link to a distributed joblib example.
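To make the first item concrete, here is a minimal sketch of the difference between a soft hint and a hard constraint, assuming a joblib version whose `Parallel` accepts the `prefer` and `require` arguments (the names used in released joblib). The `square` helper is hypothetical and only there to give the workers something to do.

```python
from joblib import Parallel, delayed


def square(x):
    # Hypothetical toy task standing in for fitting one tree.
    return x * x


# Soft hint: threads are used by default, but a surrounding
# parallel_backend(...) context is allowed to override the choice.
hinted = Parallel(n_jobs=2, prefer="threads")(
    delayed(square)(i) for i in range(5)
)

# Hard constraint: the task needs shared memory, so joblib keeps a
# thread-based backend even if a process-based one is requested.
constrained = Parallel(n_jobs=2, require="sharedmem")(
    delayed(square)(i) for i in range(5)
)

print(hinted, constrained)
```

An estimator that only hints at threads can be moved to a cluster by the user, while one that requires shared memory cannot.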

A good example for now might be a large grid search. Something like a bigger version of the first example in https://mybinder.org/v2/gh/dask/dask-examples/master?filepath=machine-learning.ipynb
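A small stand-in for such a grid search might look like the following. It is a sketch, not the notebook's code: the estimator, parameter grid, and dataset sizes are assumptions chosen to run quickly, and the `"threading"` backend is used so that no cluster is needed. Each (parameter combination, fold) pair is an independent fit, which is what makes a bigger grid a natural fit for more workers.

```python
from joblib import parallel_backend
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic data, purely illustrative.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

# 3 values of C x 2 kernels = 6 candidates; with cv=3 that is 18 fits.
param_grid = {
    "C": [0.1, 1.0, 10.0],
    "kernel": ["linear", "rbf"],
}
search = GridSearchCV(SVC(), param_grid, cv=3, n_jobs=-1)

# Swap "threading" for "dask" (with a dask.distributed Client running)
# to send the individual fits to a cluster instead; that part is assumed.
with parallel_backend("threading"):
    search.fit(X, y)

print(search.best_params_)
```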

To start conversation we might consider the following questions:

  1. What sorts of applications might make for a good parallel computing exercise within this domain?
  2. If these applications are data-intensive then are there publicly available datasets that are inconveniently large?
  3. What software requirements are likely to be necessary to run these computations?
  4. What challenges should we expect when parallelizing these algorithms or accessing this data at scale?

Of course feel free to disregard these questions and engage in more direct conversation.

amueller commented 6 years ago

I was thinking more about searching over pipelines with dask-searchcv, because that's a drop-in replacement that is just more efficient. But we could also think about large grid searches over a cluster. That's definitely also useful. I'm just a bit worried about time. We generally have too much material already, but I haven't had time to actually go through the material again for this year (I have some more talks / tutorials before this one).

amueller commented 6 years ago

Just mentioning it as a joblib backend is definitely possible, though.

TomAugspurger commented 6 years ago

Yes, time is unfortunately tight, especially in introductory tutorials :/
