Closed jcrist closed 7 years ago
Names of recent Dask micro-projects have included the following:
Each is importable as dask_foo. This is not particularly concise, but it is predictable in a nice way.
This scheme would suggest dask-sklearn, but that obviously goes against the objective above.
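The hyphen-to-underscore convention described above can be sketched as follows (the package list here is illustrative, drawn from names mentioned in this thread, not an authoritative registry):

```python
# Dask micro-project convention: the distribution name "dask-foo" (with a
# hyphen, as used on PyPI) maps to the import name "dask_foo" (with an
# underscore, since hyphens are not valid in Python identifiers).
packages = ["dask-glm", "dask-xgboost", "dask-searchcv"]
import_names = [p.replace("-", "_") for p in packages]
print(import_names)  # ['dask_glm', 'dask_xgboost', 'dask_searchcv']
```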
Model selection? I prefer gridsearch over the other two because I think it is understood by the broadest audience.
I'd be fine with dask-gridsearch. I have a slight preference for dask-crossval, just because it's for doing cross-validation, and RandomizedSearchCV isn't really a grid search (and the "CV" stands for "cross-validation"). A counterargument would be that crossval is an abbreviation, which is possibly unclear. I don't know; no strong opinions. I just don't want "dask-learn".
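For context on the "CV" suffix being debated: cross-validation is just repeated train/test index splitting. A minimal stdlib sketch of k-fold splitting (this is an illustration of the concept, not dask-learn's or scikit-learn's actual implementation):

```python
# Split n samples into k folds; each fold serves as the test set once
# while the remaining indices form the training set.
def kfold_indices(n, k):
    # Distribute samples as evenly as possible across the k folds.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n))
        folds.append((train, test))
        start += size
    return folds

folds = kfold_indices(6, 3)
print(folds[0])  # ([2, 3, 4, 5], [0, 1])
```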
I'm ok with dask-crossval or dask-gridsearch, but neither is ideal. We recently moved all this stuff into a new module called "model_selection"; maybe dask-model-selection would work? But it's very long.
The problem with shorter names is that they don't make the connection to machine learning clear. Dask is so general that something like "dask-search" would be much too general for someone to think it is related to sklearn.
As a lurker without any contributions: why is the vision of the dask-learn library limited to just cross-validation? I would think ensemble methods and pipelines would be great targets for dask to contribute to as well.
That's a fair point. I would be ok if other dask-accelerated implementations of scikit-learn classes were added here. I see this as swapping out joblib for dask and benefiting from the increased flexibility, with a focus on in-memory data only. What I think is definitely out of scope is anything that implements data-parallel algorithms (e.g. the work in dask-glm).
I'm not sure what other scikit-learn classes would be useful to implement, though. GridSearchCV/RandomizedSearchCV were the first requested features, and are something I think we can do well with dask. As you mentioned, ensemble methods may be useful to implement (VotingClassifier in particular looks like it would be quick to add support for).
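For reference, the core of what GridSearchCV parallelizes is fitting and scoring the cross-product of a parameter grid. A minimal sketch of that enumeration (parameter names here are illustrative, not dask-learn internals):

```python
from itertools import product

# A parameter grid in the scikit-learn style: each key maps to the list
# of values to try; a grid search fits every combination.
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}

keys = sorted(param_grid)
candidates = [
    dict(zip(keys, vals))
    for vals in product(*(param_grid[k] for k in keys))
]
print(len(candidates))  # 6 parameter combinations to fit and score
```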
I'm also unsure if there's a need for this parallelism in contexts outside of parameter searches. Things that are embarrassingly parallel can already benefit from joblib; I wouldn't expect much of a speedup from dask here. Using the distributed joblib backend (http://distributed.readthedocs.io/en/latest/joblib.html) you can already run these on a cluster, so the only benefit of reimplementing in dask-learn would be the option to use remote data (meaning data that's already on a cluster somewhere). So while adding support to do a gridsearch over metaestimators like VotingClassifier using dask may be useful (and would be fairly quick to do), I'm unsure if reimplementing VotingClassifier itself using dask would provide a similar benefit. There is a cost to maintaining copies of scikit-learn functionality that makes me wary of reimplementing everything here.
I think a minimal scope is good. If the scope is "things we can provide speedups on over joblib on data that fits in memory, while matching the scikit-learn api", then perhaps we'd want to stick with dask-learn/dask-sklearn/dask-scikit-learn. However, if in practice that really means just grid-search/cross-validation, then I'd push for renaming to indicate the smaller scope.
I agree that a small scope is good. Maybe the only things that can be sped up without just using distributed are what's implemented here.
Other name idea: dask-ml-pipes. While it implements more than just pipelines, this is still mostly helpful with pipelines, and it's definitely part of the plumbing department.
For what it's worth, almost no Dask work requires dask.distributed. For example, the dask-glm library can run meaningfully on just a thread pool, without a distributed cluster.
@amueller: I think the decision here hinges on whether there's a use case for implementing other classes to use dask internally besides GridSearchCV and RandomizedSearchCV. Note that this is separate from classes using the distributed joblib backend internally.
With the current code it's easy to add support for other meta-estimators to be fit in parallel when doing a *SearchCV fit (as we do for Pipeline and FeatureUnion). Adding support for dask-capable versions of these classes on their own would require some rework. This was not a use case that I had in mind when writing this.
If there is a potential use case for those (e.g. a stand-alone dask-accelerated version of VotingClassifier or RandomForestClassifier), then I'd keep the name general, probably dask-sklearn or dask-learn. If not, then I'd switch to a less general name.
I'd like to keep the number of duplicate classes here small, as there is a maintenance cost (and code duplication cost) for each one. Given no direction, I'd probably rename to dask-gridsearch and leave it at that. If needs change, a new dask-learn package could be created, this one merged into it in a more general way, and this package deprecated. I think there is a benefit to being conservative here.
I have no strong opinions and no objection to dask-gridsearch; maybe dask-model-selection would be slightly better? @jnothman or @gvaroquaux or @agramfort might have better names?
No strong opinion. dask-gridsearch or dask-searchcv or dask-model-selection will all work.
I'm somewhat against dask-model-selection because
is the same not true for grid search?
I think that lay people associate grid search more strongly with machine learning. I don't think the same is true for the phrase "model selection".
This reasoning is purely anecdotal though. I'm working off of a sample size of one.
+1 for dask-crossval.
To reveal my bias, dask-gridsearch evokes a more generic tool that can work on any function you want to minimize or explore in a brute-force (but scalable) manner. Optimization (brute force or otherwise) has been and remains the bread and butter of many analysts/engineers/statisticians/scientists who wouldn't have called their work machine learning 10 years ago.
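To illustrate that reading of the name: a grid search really is just brute-force minimization over a parameter grid, with nothing machine-learning-specific about it. A hedged stdlib sketch (the function and grid here are toy examples, not anything from dask-learn):

```python
from itertools import product

def grid_minimize(fn, grid):
    """Evaluate fn at every point of a parameter grid; return the best point."""
    keys = sorted(grid)
    return min(
        (dict(zip(keys, vals)) for vals in product(*(grid[k] for k in keys))),
        key=lambda params: fn(**params),
    )

# Toy example: minimize (x - 2)^2 + (y + 1)^2 on a coarse grid.
best = grid_minimize(
    lambda x, y: (x - 2) ** 2 + (y + 1) ** 2,
    {"x": [0, 1, 2, 3], "y": [-2, -1, 0]},
)
print(best)  # {'x': 2, 'y': -1}
```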
Is there any vision to go beyond cross-validation? If so, since we already have dask-xgboost and dask-tensorflow, I would go for dask-sklearn. If not, then dask-crossval is fine with me.
This library originally came out of experiments I did last summer trying various ways to make dask and scikit-learn play well together. Some things were nice (and useful), others were less so.
Recently, in an effort to clean things up, I've removed everything except the GridSearchCV and RandomizedSearchCV functionality. These implementations have been improved, and are now (almost) 100% compatible with their scikit-learn counterparts. There are a few unsupported parameters (e.g. verbose), and the output doesn't include the timings, but other than that these should be full drop-ins.

I like this limited scope, and would be slightly against expanding the scope of this library to include other things. Other machine-learning functionality should live in other libraries IMO (e.g. dask-glm). That said, the name "dask-learn" implies a larger scope than I think we can/should provide here. I'd like to rename this library to reflect the limited scope (just hyper-parameter searching). A few ideas:
dask-cv
dask-crossval
dask-gridsearch
Import names could be one word (e.g. daskcv) or use an underscore (e.g. dask_crossval). Naming things is hard. Ping @mrocklin, @amueller for thoughts/other name ideas.