dask / dask-ml

Scalable Machine Learning with Dask
http://ml.dask.org
BSD 3-Clause "New" or "Revised" License

Hyperparameter optimization on larger-than-memory datasets with Dask linear models #555

Open Vslira opened 4 years ago

Vslira commented 4 years ago

Hello all,

I've been trying to combine dask-ml's tools in the most vanilla way that I can think of and they don't seem to fit together (no pun intended).

Specifically, I want to both train models and optimize hyperparams on larger-than-memory datasets.

Initially I supposed I could just stick a linear model into a CV class. From the documentation, however, it seems that GridSearchCV and RandomizedSearchCV (all classes here are from dask_ml.model_selection) both require that the CV splits fit in memory, while IncrementalSearchCV, HyperbandSearchCV, and SuccessiveHalvingSearchCV require that the estimator implement partial_fit. Since none of the linear models in the dask-ml API support partial_fit, I'm left wondering whether there's any way to use pure dask-ml for an ML workflow.

Something like:

import numpy as np
from dask.distributed import Client, LocalCluster

from dask_ml.model_selection import RandomizedSearchCV, train_test_split
from dask_ml.linear_model import LogisticRegression
from dask_ml.datasets import make_classification

# Ignore the specific values; this is just an example cluster.
cluster = LocalCluster(
    n_workers=4,
    threads_per_worker=2,
    memory_limit="1024MB",
    dashboard_address="0.0.0.0:1234",
)

client = Client(cluster)

# A synthetic dataset as a dask array in 5000-row chunks, standing in
# for larger-than-memory data.
X, y = make_classification(
    n_samples=1_000_000,
    n_features=12,
    n_informative=3,
    n_redundant=1,
    n_classes=2,
    chunks=5000,
)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

# 10 candidate values for C, log-spaced between 1e-5 and 1e2.
C_values = (10 ** np.linspace(-5, 2, num=10)).tolist()

rscv = RandomizedSearchCV(
    LogisticRegression(),
    param_distributions={"C": C_values},
    n_iter=10,  # no more than the 10 candidate values above
    scheduler=client,
    cache_cv=False,
)

rscv.fit(X_train, y_train)

rscv.score(X_test, y_test)

(In case anyone's wondering, the script above gives me an out-of-memory error and starts killing all the workers, and I can't figure out why.)
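
For what it's worth, the incremental route should run out-of-core if I swap in an estimator that already implements partial_fit, e.g. scikit-learn's SGDClassifier, but then I'm not using dask-ml's own linear models. A minimal sketch, with SGDClassifier standing in for LogisticRegression and placeholder parameter values:

import numpy as np
from sklearn.linear_model import SGDClassifier
from dask_ml.model_selection import IncrementalSearchCV

# IncrementalSearchCV only ever calls partial_fit on one chunk at a
# time, so the full training set never has to fit in memory.
inc = IncrementalSearchCV(
    SGDClassifier(loss="log"),  # stand-in for dask-ml's LogisticRegression
    parameters={"alpha": np.logspace(-5, 2, num=10).tolist()},
    n_initial_parameters=10,
)
inc.fit(X_train, y_train, classes=[0, 1])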

Thanks for any help! Cheers

TomAugspurger commented 4 years ago

I think your expectation that dask_ml.model_selection.GridSearchCV should work well with a model accepting a dask Array is reasonable. I vaguely recall looking into this a while back but I don't remember the outcome.

If you're interested in investigating, it'd be nice to know which operations are causing the workers to die. Otherwise, I should be able to look into it in a week or two.
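
One low-effort way to see that, assuming a reasonably recent version of distributed, is to wrap the failing call in a performance report:

from dask.distributed import performance_report

# report.html will contain the task stream and worker memory plots,
# which should point at the tasks that blow up the workers.
with performance_report(filename="report.html"):
    rscv.fit(X_train, y_train)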

I also think that Dask-ML's linear models should implement partial_fit, which would let them work well with e.g. Hyperband.
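
For a sense of what that unlocks, here's a rough sketch with scikit-learn's SGDClassifier (which already implements partial_fit) standing in; once the Dask-ML linear models grow partial_fit they could slot in directly:

import numpy as np
from sklearn.linear_model import SGDClassifier
from dask_ml.datasets import make_classification
from dask_ml.model_selection import HyperbandSearchCV

X, y = make_classification(n_samples=1_000_000, chunks=5_000)

# Hyperband trains many candidates via repeated partial_fit calls on
# individual chunks and stops the weak ones early, so the dataset
# stays out-of-core throughout.
search = HyperbandSearchCV(
    SGDClassifier(loss="log"),
    {"alpha": np.logspace(-5, 2, num=10).tolist()},
    max_iter=27,
)
search.fit(X, y, classes=[0, 1])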

stsievert commented 4 years ago

I also think that Dask-ML's linear models should implement partial_fit

:+1: I'm surprised they don't! I don't see an implementation in dask_glm/estimators.py or linear_model/glm.py.

TomAugspurger commented 4 years ago

It seems pretty easy to add, right? (they're being developed in linear_model/glm.py now).

Vslira commented 4 years ago

Thanks for the answers! I'm not sure I'll have the time this week either, but if I do, I'll try to put together a more useful report on why GridSearchCV is failing.

stsievert commented 4 years ago

It seems pretty easy to add, right?

Yeah, I'd imagine so, since most of the optimization routines are iterative. In practice, I'd imagine this boils down to an implementation of warm_start, or sending an initial estimate (or beta vector) to dask_glm/algorithms.py (e.g., gradient_descent(..., initial_model=previous_beta)).
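
Concretely, something like the sketch below, where initial_model is the hypothetical solver keyword from above (dask_glm's gradient_descent doesn't accept one today) and the class is a placeholder for the estimators in linear_model/glm.py:

from dask_glm import algorithms

class LogisticRegression:  # placeholder for the real estimator
    def partial_fit(self, X, y):
        # Warm-start from the coefficients of the previous call, if any.
        beta0 = getattr(self, "coef_", None)
        # initial_model is hypothetical; the solver would need to grow
        # this keyword and use beta0 as its starting iterate.
        self.coef_ = algorithms.gradient_descent(X, y, initial_model=beta0)
        return self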