dask / dask-searchcv

dask-searchcv is now part of dask-ml: https://github.com/dask/dask-ml
BSD 3-Clause "New" or "Revised" License

Rename this library? #27

Closed · jcrist closed this issue 7 years ago

jcrist commented 7 years ago

This library originally came out of experiments I did last summer trying various ways to make dask and scikit-learn play well together. Some things were nice (and useful), others were less so.

Recently, in an effort to clean things up, I've removed everything except the GridSearchCV and RandomizedSearchCV functionality. These implementations have been improved, and are now (almost) 100% compatible with their scikit-learn counterparts. There are a few unsupported parameters (e.g. verbose), and the output doesn't include the timings, but other than that these should be full drop-ins.
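For context, the drop-in claim looks like this from the user side. This sketch uses scikit-learn's own GridSearchCV (the same signature the dask version targets); the SVC/iris setup is just an illustrative placeholder:

```python
# Sketch of the drop-in usage described above. This uses scikit-learn's
# GridSearchCV; the dask implementation is meant to accept the same
# arguments, so swapping the import is (almost) all that changes.
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {"C": [0.1, 1.0, 10.0], "kernel": ["linear", "rbf"]}

# cv=3 keeps the example fast; the dask version exposes the same parameter
search = GridSearchCV(SVC(), param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```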

I like this limited scope, and would be slightly against expanding the scope of this library to include other things. Other machine-learning functionality should live in other libraries IMO (e.g. dask-glm). That said, the name "dask-learn" implies a larger scope than I think we can/should provide here. I'd like to rename this library to reflect the limited scope (just hyper-parameter searching).

A few ideas:

Import names could be one word (e.g. daskcv) or use an underscore (e.g. dask_crossval).

Naming things is hard. Ping @mrocklin, @amueller for thoughts/other name ideas.

mrocklin commented 7 years ago

Names of recent Dask micro-projects have included the following:

Each is importable as dask_foo. This is not particularly concise, but is predictable in a nice way.

This scheme would suggest dask-sklearn but obviously that goes against the objective above.

Model selection? I prefer gridsearch over the other two because I think it is understood by the broadest audience.

jcrist commented 7 years ago

I'd be fine with dask-gridsearch. I have a slight preference for dask-crossval, just because it's for doing cross-validation, and RandomizedSearchCV isn't really a grid search (and the "CV" stands for "cross-validation"). A counter argument would be that crossval is an abbreviation, which is possibly unclear. Idk, no strong opinions - I just don't want "dask-learn".

amueller commented 7 years ago

I'm ok with dask-crossval or dask-gridsearch but they are both not ideal. We recently moved all this stuff into a new module called "model_selection", maybe dask-model-selection would work? But it's very long.

The problem with shorter names is that they don't make the connection to machine learning clear. Dask is so general that something like "dask-search" would be much too general for someone to think it is related to sklearn.

aterrel commented 7 years ago

As a lurker without any contributions: why is the vision of the dask-learn library limited to just cross-validation? I would think ensemble methods and pipelines would also be great targets for dask to contribute to.

jcrist commented 7 years ago

That's a fair point. I would be ok if other dask accelerated implementations of scikit-learn classes were added here. I see this as swapping out joblib for dask, and benefiting from the increased flexibility, with a focus on in-memory data only. What I think is definitely out of scope is anything that implements data-parallel algorithms (e.g. the work in dask-glm).

I'm not sure what other scikit-learn classes would be useful to implement though. GridSearchCV/RandomizedSearchCV were the first requested features, and are something I think we can do well with dask. As you mentioned, ensemble methods may be useful to implement (VotingClassifier in particular looks like it would be quick to add support for).

I'm also unsure if there's a need for this parallelism in contexts outside of parameter searches. Things that are embarrassingly parallel can already benefit from joblib, so I wouldn't expect much of a speedup from dask here. Using the distributed joblib backend (http://distributed.readthedocs.io/en/latest/joblib.html) you can already run these on a cluster, so the only benefit of reimplementing in dask-learn would be the option to use remote data (meaning data that's already on a cluster somewhere). So while adding support to do a gridsearch over metaestimators like VotingClassifier using dask may be useful (and would be fairly quick to do), I'm unsure if reimplementing VotingClassifier using dask would provide a similar benefit. There is a cost to maintaining copies of scikit-learn stuff that makes me wary of reimplementing everything here.
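The "already possible with joblib" point can be sketched as follows. scikit-learn's n_jobs parallelism routes through joblib, and joblib backends are swapped with a context manager; the local "threading" backend is used here as a stand-in, since the registration details for a distributed backend depend on the distributed version and are not shown. The VotingClassifier/iris setup is illustrative:

```python
# Sketch of embarrassingly-parallel fitting via joblib, as mentioned above.
# On a cluster, one would select a distributed joblib backend in the
# parallel_backend() call instead of "threading" (setup not shown here).
from joblib import parallel_backend
from sklearn.datasets import load_iris
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
clf = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("dt", DecisionTreeClassifier())],
    n_jobs=2,  # joblib fits the sub-estimators in parallel
)
with parallel_backend("threading"):
    clf.fit(X, y)
print(clf.score(X, y))
```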

I think a minimal scope is good. If the scope is "things we can provide speedups on over joblib on data that fits in memory, while matching the scikit-learn api" then perhaps we'd want to stick with dask-learn/dask-sklearn/dask-scikit-learn. However, if in practice that really means just grid-search/cross-validation, then I'd push for renaming to indicate the smaller scope.

amueller commented 7 years ago

I agree that a small scope is good. Maybe the only things that can be sped up, without just using distributed, are what's already implemented here.

Other name idea: dask-ml-pipes. While it implements more than just pipelines, this is still mostly helpful with pipelines, and it's definitely part of the plumbing department.


mrocklin commented 7 years ago

For what it's worth, almost no Dask work requires dask.distributed. For example, the dask-glm library can run meaningfully on just a thread pool, without a distributed cluster.

jcrist commented 7 years ago

@amueller: I think the decision here hinges on whether there's a use case for implementing other classes to use dask internally besides GridSearchCV and RandomizedSearchCV. Note that this is separate from classes using the distributed joblib backend internally.

With the current code it's easy to add support for other meta-estimators to be fit in parallel when doing a *SearchCV fit (as we do for Pipeline and FeatureUnion). Adding support for dask capable versions of these classes on their own would require some rework. This was not a use case that I had in mind when writing this.
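From the user side, the Pipeline support mentioned above looks like a grid search over nested step parameters. This sketch uses scikit-learn's API (which the dask *SearchCV classes mirror); the PCA/SVC steps are illustrative:

```python
# Sketch of a grid search over a Pipeline, as discussed above. A graph-based
# scheduler can share repeated early steps (here, the PCA fits) across
# candidates instead of refitting them for every parameter combination.
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
pipe = Pipeline([("reduce", PCA()), ("clf", SVC())])
param_grid = {
    "reduce__n_components": [2, 3],  # step-name__param syntax
    "clf__C": [0.1, 1.0, 10.0],
}
search = GridSearchCV(pipe, param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```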

If there is a potential use case for those (e.g. a stand-alone dask accelerated version of VotingClassifier or RandomForestClassifier) then I'd keep the name general - probably dask-sklearn or dask-learn. If they're not, then I'd switch to a less general name.

I'd like to keep the number of duplicate classes here small, as there is a maintenance cost (and code duplication cost) for each one. Given no direction, I'd probably rename to dask-gridsearch and leave it at that. If needs change, a new dask-learn package could be created, this merged into it in a more general way, and this package deprecated. I think there is a benefit to being conservative here.

amueller commented 7 years ago

I have no strong opinions and no objection to dask-gridsearch; maybe dask-model-selection would be slightly better? @jnothman, @gvaroquaux, or @agramfort might have better name ideas.

jnothman commented 7 years ago

No strong opinion. dask-gridsearch or dask-searchcv or dask-model-selection will all work.

mrocklin commented 7 years ago

I'm somewhat against dask-model-selection because

  1. It's long
  2. The term "model selection" is vague outside of machine learning. sklearn.model_selection has pretty clear intent because we're already within sklearn. Because the term dask is more generic it's not obvious to an outsider that dask-model-selection is about machine learning.


amueller commented 7 years ago

is the same not true for grid search?

mrocklin commented 7 years ago

I think that lay-people associate grid-search more strongly with machine learning. I don't think that the same is true for the phrase "model selection".

This reasoning is purely anecdotal though. I'm working off of a sample size of one.


eriknw commented 7 years ago

+1 for dask-crossval.

To reveal my bias: dask-gridsearch evokes a more generic tool that can work on any function you want to minimize or explore in a brute-force (but scalable) manner. Optimization (brute force or otherwise) has been and remains the bread and butter of many analysts/engineers/statisticians/scientists who wouldn't have called their work machine learning 10 years ago.

agramfort commented 7 years ago

Is there any vision to go beyond cross-validation? If so, since we already have:

dask-xgboost
dask-tensorflow

I would go for:

dask-sklearn

If not, then dask-crossval is fine with me.