dask / dask-ml

Scalable Machine Learning with Dask
http://ml.dask.org
BSD 3-Clause "New" or "Revised" License

Hyperparameter optimization benchmarking #643

Open TomAugspurger opened 4 years ago

TomAugspurger commented 4 years ago

It'd be nice to have some benchmarks for how our different hyperparameter optimizers perform. There are a few comparisons that would be useful:

  1. dask_ml's drop-in replacements for GridSearchCV & RandomizedSearchCV. We're able to deconstruct Pipeline objects to avoid redundant fit calls. This benchmark would compare a GridSearchCV(Pipeline(...)) between dask_ml.model_selection.GridSearchCV and sklearn.model_selection.GridSearchCV. We'd expect Dask-ML to perform better the more CV splits there are and the more parameters are explored early in the pipeline (https://github.com/dask/dask-ml/issues/141 has some discussion; see the first sketch after this list).
  2. Scaling of Dask's joblib backend for large problems. Internally, scikit-learn uses joblib for parallel for loops. With
with joblib.parallel_backend("dask"):
    ...

the items in the for loop are executed on the Dask cluster. There are some issues with the backend (https://github.com/joblib/joblib/issues/1020, https://github.com/joblib/joblib/issues/1025). Fixing those isn't in scope for this work, but we'd like to have benchmarks to understand the current performance and measure the speedup from fixing them (see the second sketch at the end of this comment).

  3. General performance on large datasets with Incremental, Hyperband, etc. We can't really compare to scikit-learn here, since it doesn't handle larger-than-memory datasets. @stsievert may have some thoughts / benchmarks to share here.
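
For item 1, a rough sketch of what the comparison could look like (the dataset, pipeline, and parameter grid below are placeholders, not a prescribed benchmark):

import time

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

import dask_ml.model_selection
import sklearn.model_selection

# Placeholder data; the real benchmark would vary n_samples, the number of
# CV splits, and how many parameters sit early in the pipeline.
X, y = make_classification(n_samples=10_000, n_features=20, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("clf", LogisticRegression(max_iter=1000)),
])
# With a parameter early in the pipeline ("pca__n_components"), Dask-ML can
# share the scaler and each PCA fit across the downstream "clf__C" values
# instead of refitting them for every combination.
grid = {"pca__n_components": [5, 10, 15], "clf__C": [0.1, 1.0, 10.0]}

for module in (sklearn.model_selection, dask_ml.model_selection):
    search = module.GridSearchCV(pipe, grid, cv=5)
    start = time.perf_counter()
    search.fit(X, y)
    print(module.__name__, time.perf_counter() - start)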

cc @dankerrigan. This is more than enough work, I think. If you're able to make progress on any of these (or other things you think are important), it'd be great.
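
For item 2, a minimal sketch of driving an unmodified scikit-learn search through the Dask joblib backend (the cluster, estimator, and parameter space here are placeholders):

import joblib
from dask.distributed import Client
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

client = Client()  # local cluster here; point at a real scheduler to scale out

X, y = make_classification(n_samples=50_000, n_features=50, random_state=0)
search = RandomizedSearchCV(
    RandomForestClassifier(),
    {"n_estimators": [50, 100, 200], "max_depth": [None, 5, 10]},
    n_iter=8,
    n_jobs=-1,
)

# scikit-learn's internal joblib.Parallel calls are routed to the Dask cluster,
# so each fit/score task in the parallel for loop runs on a worker.
with joblib.parallel_backend("dask"):
    search.fit(X, y)

print(search.best_params_)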

mrocklin commented 4 years ago

Is there a dataset or workflow on which it makes sense to perform this benchmark?

I'm more than happy to walk through performance profile information with anyone doing this work.

mrocklin commented 4 years ago

Also, as an output of this, I'd love to see a blog post.

stsievert commented 4 years ago

It's good to see more work on Dask-ML's model selection! I have some ideas I'd like to see implemented, and would love to see benchmarks on modifications (e.g., to resolve #532).

> Is there a dataset or workflow on which it makes sense to perform this benchmark?

I have a benchmark at https://github.com/stsievert/dask-hyperband-comparison. This benchmark focuses on heavy computation, not large datasets.

mrocklin commented 4 years ago

I'm also curious if @dankerrigan has applications from within his workplace that would be both relevant and open (my guess is that this is hard, but it's worth asking :) )

mmccarty commented 4 years ago

#644 adds a classification dataset generator that could produce a representative dataset with something like:

from datetime import date

import dask_ml.datasets

df = dask_ml.datasets.make_classification_df(
    n_samples=1_000_000,
    n_features=1000,
    random_state=123,
    chunks=100,
    dates=(date(2019, 1, 1), date(2020, 1, 1)),
)

dankerrigan commented 4 years ago

@mrocklin I'll see what I can do!

stsievert commented 4 years ago

@dankerrigan I found cleanly separating the model fitting from the searching to be useful. This allowed for quick iteration: I could make a small change, then quickly see performance differences. The searches I ran simulated model fitting and recorded scores, so they didn't require any data. This meant I could re-run the simulations without the same CPU or memory requirements.

My process looked something like this:

  1. Run hyperparameter search with Dask on cluster.
  2. Collect the history from each model, then save the history to disk. This history included the scores and the number of partial_fit calls for each score.
  3. Replay the simulations locally with an implementation of ReplayModel
    • ReplayModel would read in one model's scores/number of partial_fit calls and simulate computation by sleeping for a certain amount of time.

The implementation of ReplayModel in Simulate-Run.ipynb is pretty vanilla and reads in a model history (a rough sketch is below). IIRC, I got the history from IncrementalSearchCV.model_history_.

The one exception is the amount of time to sleep in partial_fit and score to simulate the required computation. I carefully chose values of 1 second and 1.5 seconds, respectively. There are two facts behind these values:

  1. The data provided to each function. I called score with a dataset 3× larger than the dataset provided to partial_fit.
  2. A partial_fit call takes about 1.5–3× as long as a single score call with the same data. Good benchmarks for modern neural nets on GPUs are at https://github.com/soumith/convnet-benchmarks (also see "On automatic differentiation" by Andreas Griewank for a proof that flops(partial_fit) <= 5 flops(score)).
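
A minimal sketch of what such a ReplayModel could look like (the constructor arguments and the history format are assumptions based on the description above, not the exact code in Simulate-Run.ipynb):

import time


class ReplayModel:
    # Replays one model's recorded history, assumed to be a list of dicts with
    # "score" and "partial_fit_calls" keys (e.g. one entry of
    # IncrementalSearchCV.model_history_), sleeping to simulate computation.

    def __init__(self, history, fit_time=1.0, score_time=1.5):
        self.history = history
        self.fit_time = fit_time      # seconds to sleep per partial_fit call
        self.score_time = score_time  # seconds to sleep per score call
        self.partial_fit_calls_ = 0

    def partial_fit(self, X=None, y=None):
        time.sleep(self.fit_time)     # simulate the cost of one partial_fit
        self.partial_fit_calls_ += 1
        return self

    def score(self, X=None, y=None):
        time.sleep(self.score_time)   # simulate the (larger) cost of scoring
        # Return the recorded score whose partial_fit_calls count is closest
        # to the number of partial_fit calls made so far.
        record = min(
            self.history,
            key=lambda h: abs(h["partial_fit_calls"] - self.partial_fit_calls_),
        )
        return record["score"]

Replaying a search against objects like this exercises the search logic without the data, CPU, or memory cost of real model fitting, which is what makes the quick local iteration described above possible.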
pierreglaser commented 4 years ago

FYI, I'm working a lot on improving the joblib/Dask integration these days. Among other things, I'm building a benchmark suite for joblib with the dask backend across a variety of workloads and use cases, including things like scikit-learn cross-validation, GridSearchCV, etc. So I'm very interested in this.

mrocklin commented 4 years ago

I'm very glad to hear that you're interested. I'm also quite interested to see how I can help. Would it make sense for a few of us to get together for a quick call? I would enjoy learning more about what you're up to.


pierreglaser commented 4 years ago

> Would it make sense for a few of us to get together for a quick call?

I think that's a great idea. Given the current circumstances, I'm pretty much available whenever during UTC daytime.

mmccarty commented 4 years ago

Yes, I would enjoy a call as well!

mrocklin commented 4 years ago

I think that I'm the western-most person who would be interested in this. My day starts around 14:30 UTC (7:30 US Pacific, 10:30 US Eastern). I suggest that if people are interested they click the Heart icon on this comment. I'll then send out an e-mail with some scheduling options.

mrocklin commented 4 years ago

Invitations sent. I focused on 7-10am US Pacific, 14:00-17:00 UTC.

https://doodle.com/poll/6c3ityemymm8ncpr

JohnZed commented 4 years ago

I’ll aim to join too. Have been looking at a couple of different HPO approaches with a focus on GPU options. Thanks!

mrocklin commented 4 years ago

2020-04-28, 9am US Pacific, 4pm UTC (thank you Europeans for staying late)

@dankerrigan my apologies, but I chose a time that excluded you. It was that or exclude others who were the only person from their organization. Hopefully @mmccarty can represent your views a bit.

@andremoeller my apologies but I don't have your e-mail. Regardless, the link below should have the relevant information for you to join.

https://docs.google.com/document/d/1USYpqW-pq5kfDoumoVC5gdXhIeGyrDLJ8dVNie5gXdc/edit?usp=sharing