cognoma / machine-learning

Machine learning for Project Cognoma

Evaluate dask-searchcv to speed up GridSearchCV #94

Closed dhimmel closed 7 years ago

dhimmel commented 7 years ago

I'm excited about trying out dask-searchcv as a drop-in replacement for GridSearchCV. For info on dask-searchcv, see the blog post, the GitHub repo, the docs, and the video.

I'm hoping using dask-searchcv for GridSearchCV will help solve the following problems:

  1. High memory usage, e.g. https://github.com/cognoma/machine-learning/issues/70, caused by joblib overhead.
  2. Slow pipeline performance when cross-validation is implemented properly. See the discussion at https://github.com/scikit-learn/scikit-learn/issues/7536#issuecomment-250841108: the built-in GridSearchCV repeats the same transform steps for every parameter combination, making it brutally slow. (A minimal usage sketch follows this list.)
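To make the drop-in claim concrete, here's a rough sketch of the swap. The pipeline and parameter names below are illustrative placeholders rather than our exact configuration; only the GridSearchCV import changes:

from dask_searchcv import GridSearchCV  # instead of sklearn.model_selection.GridSearchCV
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# an illustrative stand-in for the pipelines in our notebooks
pipeline = Pipeline([
    ('scale', StandardScaler()),
    ('classify', SGDClassifier(random_state=0)),
])
param_grid = {'classify__alpha': [10 ** x for x in range(-4, 1)]}

# same constructor arguments as scikit-learn's GridSearchCV; dask-searchcv
# builds a task graph that shares identical transform steps across parameter
# combinations instead of recomputing them for every candidate
grid_search = GridSearchCV(pipeline, param_grid, cv=3)
# grid_search.fit(X, y)  # X, y: expression matrix and mutation-status labels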

I initially mentioned dask-searchcv in https://github.com/cognoma/machine-learning/pull/93#issuecomment-302851217, a PR by @patrick-miller. I thought this would be a good issue for @rdvelazquez to work on. @rdvelazquez are you interested?

We'll have to add some additional dependencies to our environment. It may also be a good time to update the versions of existing packages (especially pandas).

rdvelazquez commented 7 years ago

Yea, I'm definitely interested! I'll have a little time to look into this before next Tuesday's meet-up so I'll come ready with some questions and suggestions to discuss.

@patrick-miller : Did you look into dask-searchcv yet? I don't want to duplicate work.

patrick-miller commented 7 years ago

I only read the docs, so go for it!

It isn't directly relevant to dask-searchcv, but one thing we need to keep in mind when implementing PCA on our features is that we only want to run PCA on the expression features, not the covariates. I've created a new issue just for this: #96.
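For illustration, something along these lines could keep PCA restricted to the expression columns. This is just a rough sketch: the column names are placeholders, and scikit-learn's ColumnTransformer only arrived in version 0.20, so for now a FeatureUnion with column-selector transformers would be the equivalent:

from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

expression_columns = ['TP53', 'KRAS', 'EGFR']            # placeholder expression features
covariate_columns = ['log10_mutations', 'disease_LUAD']  # placeholder covariates

preprocess = ColumnTransformer([
    # scale and decompose only the expression features
    ('expression', Pipeline([
        ('scale', StandardScaler()),
        ('pca', PCA(n_components=30)),
    ]), expression_columns),
    # pass the covariates through unchanged
    ('covariates', 'passthrough', covariate_columns),
])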

rdvelazquez commented 7 years ago

I'm still looking into this and hope to post a WIP pull request early next week so you can see what I'm working on. So far dask-searchcv itself is very easy to implement, but it does not seem to resolve all the issues associated with running PCA in the pipeline... I'll have more soon.

rdvelazquez commented 7 years ago

Here's my WIP notebook: dask-searchCV

@dhimmel and @patrick-miller, some questions on where to go from here:

  1. What should I include in the pull request? I'm thinking: update environment.yml to include dask-searchcv, and replace scikit-learn's grid search with dask-searchcv in the notebooks that we want to keep up to date moving forward. (Which notebooks should these be, and should I wait until #93 is merged?)
  2. How important is speed? As you'll see in the WIP notebook, searching over a range of n_components can take a while, and will only take longer if we increase the number of CV splits and/or widen the range of alpha that we search across. If speed is an important issue, we could consider pre-processing (scale, PCA, scale) the whole training dataset; this would speed things up while still keeping training and testing data isolated... our cross-validation just wouldn't be totally accurate. (See the sketch after this list.)
  3. Any advice on selecting the range of n_components to search across? I'm tagging @htcai because he may have already looked at this. It seems like unbalanced genes/queries (which in our case will mostly be genes with few mutations) perform better with fewer components (in the 30 to 100 range), whereas balanced genes/queries (equal numbers of mutated and non-mutated samples) perform better with more components. This question may be better addressed as a separate issue/PR.
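To illustrate the shortcut in question 2, here's a rough sketch; the stand-in data is only there so the snippet runs, and the component/alpha values are placeholders rather than tuned choices:

import numpy as np
from dask_searchcv import GridSearchCV
from sklearn.decomposition import PCA
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# stand-in data; in practice this would be the expression matrix and labels
rng = np.random.RandomState(0)
X_train = rng.normal(size=(200, 500))
y_train = rng.randint(0, 2, size=200)

# fit the scale -> PCA -> scale transform once on the whole training set
preprocess = Pipeline([
    ('scale_pre', StandardScaler()),
    ('pca', PCA(n_components=50)),
    ('scale_post', StandardScaler()),
])
X_train_reduced = preprocess.fit_transform(X_train)

# only the classifier is cross-validated, so nothing is re-decomposed for each
# CV split or parameter combination; the folds share the PCA fit, which is why
# the CV estimate is slightly optimistic
param_grid = {'alpha': [10 ** x for x in range(-4, 1)]}
grid_search = GridSearchCV(SGDClassifier(random_state=0), param_grid, cv=5)
grid_search.fit(X_train_reduced, y_train)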

Thanks in advance for the input!

dhimmel commented 7 years ago

What should I include in the pull request?

I'd open an exploration PR. In other words, a PR with the notebook you link to above. Then after #93 by @patrick-miller is merged, you can update the main notebooks.

Any advice on selecting the range of n_components to search across?

For efficiency reasons, I'm hoping we can have presets for n_components based on the number of positives or negatives (whichever is smaller). This way we can avoid the computational burden of grid searching across a range of component numbers. This will only work if the optimal n_components is consistent across classifiers with similar numbers of positives. Let's save this research for another PR. I believe @htcai may also be working on it.
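Purely to illustrate what I mean by a preset, something shaped like the following; the cut-offs here are made up and would have to come out of that research:

def preset_n_components(n_positives, n_negatives):
    """Illustrative only: map the minority class size to a preset n_components."""
    minority = min(n_positives, n_negatives)
    if minority < 50:    # placeholder threshold
        return 30
    if minority < 500:   # placeholder threshold
        return 100
    return 200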

If speed is an important issue, we could consider pre-processing (scale, PCA, scale) the whole training dataset; this would speed things up while still keeping training and testing data isolated... our cross-validation just wouldn't be totally accurate.

Let's keep this in mind and decide later. But your reasoning is correct. In my experience here, the CV performance will not be majorly inflated due to this shortcut.

rdvelazquez commented 7 years ago

Thanks @dhimmel I'll open an exploration PR now.

For efficiency reasons, I'm hoping we can have presets for n_components based on the number of positives or negatives (whichever is smaller).

I looked into this a bit and didn't find a clear correlation between number of positives and ideal n_components... this will be easier to evaluate now that we can include PCA in the pipeline. @htcai let me know if this is something you are working on... if not I can look into it.

dhimmel commented 7 years ago

This is getting off topic, but regarding:

I looked into this a bit and didn't find a clear correlation between number of positives and ideal n_components... this will be easier to evaluate now that we can include PCA in the pipeline.

@rdvelazquez when working with small n_positives, you'll likely need to switch the CV assessment to use repeated cross-validation or a large number of StratifiedShuffleSplit iterations. See the discussion on https://github.com/cognoma/machine-learning/pull/71 by @htcai. We ended up going with:

from sklearn.model_selection import StratifiedShuffleSplit
sss = StratifiedShuffleSplit(n_splits=100, test_size=0.1, random_state=0)
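For context, a rough sketch of how the splitter plugs in; pipeline and param_grid below stand in for whatever the notebook defines, and any scikit-learn CV splitter can be passed as the cv argument:

from dask_searchcv import GridSearchCV

# each of the 100 stratified splits holds out 10% of the samples
grid_search = GridSearchCV(pipeline, param_grid, cv=sss)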

Let's open a new issue if we want to continue this discussion.