Open Vslira opened 4 years ago
I think your expectation that dask_ml.model_selection.GridSearchCV
should work well with a model accepting a dask Array is reasonable. I vaguely recall looking into this a while back but I don't remember the outcome.
If you're interested in investigating, it'd be nice to know which operations are causing the workers to error. Otherwise I'll be able to look into it in a week or two.
I also think that Dask-ML's linear models should implement partial_fit
, which would let them work well with e.g. Hyperband.
I also think that Dask-ML's linear models should implement
partial_fit
:+1: I'm surprised they don't! I don't see an implementation in dask_glm/estimators.py or linear_model/glm.py.
It seems pretty easy to add, right? (they're being developed in linear_model/glm.py
now).
Thanks for the answers! I'm not sure I'll have the time this week, too, but if I do I'll try to get a more useful report about why GridSearchCV is failing.
It seems pretty easy to add, right?
Yeah, I'd imagine because most optimizations are iterative. In practice, I'd imagine this would boil down to an implementation of warm_start
or sending an initial estimate (or beta
vector) to dask_glm/algorithms.py (e.g., gradient_descent(..., initial_model=previous_beta)
).
Hello all,
I've been trying to combine dask-ml's tools in the most vanilla way that I can think of and they don't seem to fit together (no pun intended).
Specifically, I want to both train models and optimize hyperparams on larger-than-memory datasets.
Initially I supposed I could just stick a linear model in a cv class. From the documentation, however, it seems that (all classes from dask_ml.model_selection) GridSearchCV and RandomizedSearchCV both require that the CV splits fit in memory, while IncrementalSearchCV, HyperbandSearchCV and SuccessiveHalvingSearchCV require that the estimator implements partial_fit. Since none of the linear models in the dask-ml API support partial_fit, I'm left wondering if there's a way to use pure dask-ml for a ML workflow.
Something like:
(In case anyone's wondering, the script above gives me an out of memory error and starts killing all the workers and I can't find out why)
Thanks for any help! Cheers