Yea, I'm definitely interested! I'll have a little time to look into this before next Tuesday's meet-up so I'll come ready with some questions and suggestions to discuss.
@patrick-miller : Did you look into dask-searchcv yet? I don't want to duplicate work.
I only read the docs, so go for it!
It isn't directly relevant to dask-searchcv, but one thing we need to keep in mind when implementing PCA on our features is that we want to run PCA only on the expression features and not the covariates. I've created a new issue just for this: #96.
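One way to restrict PCA to the expression columns while letting the covariates pass through untouched is scikit-learn's ColumnTransformer. A minimal sketch (the column lists and n_components are placeholders, and this assumes a scikit-learn version that ships sklearn.compose.ColumnTransformer):

```python
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Placeholder column lists -- in our data these would be the expression
# feature names and the covariate names, respectively.
expression_cols = ['gene_1', 'gene_2', 'gene_3']
covariate_cols = ['mutation_burden', 'disease_acronym']

# Scale + PCA applied to the expression columns only; covariates pass through.
expression_pca = Pipeline([
    ('scale', StandardScaler()),
    ('pca', PCA(n_components=50)),
])
preprocess = ColumnTransformer(
    transformers=[('expr', expression_pca, expression_cols)],
    remainder='passthrough',
)

pipeline = Pipeline([
    ('preprocess', preprocess),
    ('classify', SGDClassifier()),
])
```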
I'm still looking into this and hope to post a WIP pull request early next week so you can see what I'm working on. So far, dask-searchcv itself is very easy to implement, but it doesn't seem to resolve all the issues associated with running PCA in the pipeline... I'll have more soon.
Here's my WIP notebook: dask-searchCV
@dhimmel and @patrick-miller, some questions on where to go from here (thanks in advance for the input!):
What should I include in the pull request?
I'd open an exploration PR. In other words, a PR with the notebook you link to above. Then after #93 by @patrick-miller is merged, you can update the main notebooks.
Any advice on selecting the range of n_components to search across?
For efficiency reasons, I'm hoping we can have presets for n_components based on the number of positives or negatives (whichever is smaller). This way we can avoid the computational burden of grid searching across a range of component numbers. This will only work if the optimal n_components is consistent across classifiers with a similar number of positives. Let's save this research for another PR. I believe @htcai may also be working on it.
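If we do go the preset route, I imagine something as simple as a lookup keyed on the smaller class size. This is just an illustrative sketch with made-up thresholds and component counts, not a recommendation; the real presets would come from the follow-up research:

```python
def preset_n_components(n_positives, n_negatives):
    """Pick a preset n_components from the size of the smaller class.

    The thresholds and values below are placeholders for illustration only.
    """
    n_min = min(n_positives, n_negatives)
    if n_min < 50:
        return 30
    if n_min < 500:
        return 60
    return 100
```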
If speed is an important issue, we could consider pre-processing (scale, PCA, scale) the whole training dataset; this would speed things up while still keeping training and testing data isolated... our cross-validation just wouldn't be totally accurate.
Let's keep this in mind and decide later. But your reasoning is correct. In my experience here, the CV performance will not be majorly inflated due to this shortcut.
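If we do end up taking that pre-processing shortcut, it would look roughly like this (a sketch on synthetic data; the parameter values are placeholders): fit scale → PCA → scale once on the full training matrix, then grid search over the classifier alone.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training matrix.
X_train, y_train = make_classification(n_samples=200, n_features=100, random_state=0)

# Fit scale -> PCA -> scale once on the full training set (the shortcut:
# PCA sees every training sample, so CV scores will be slightly optimistic).
transform = Pipeline([
    ('scale_pre', StandardScaler()),
    ('pca', PCA(n_components=50)),
    ('scale_post', StandardScaler()),
])
X_transformed = transform.fit_transform(X_train)

# The grid search then only refits the (cheap) classifier in each fold.
param_grid = {'alpha': [1e-4, 1e-3, 1e-2]}
grid = GridSearchCV(SGDClassifier(random_state=0), param_grid, cv=5)
grid.fit(X_transformed, y_train)
```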
Thanks @dhimmel, I'll open an exploration PR now.
For efficiency reasons, I'm hoping we can have presets for n_components based on the number of positives or negatives (whichever is smaller).
I looked into this a bit and didn't find a clear correlation between number of positives and ideal n_components... this will be easier to evaluate now that we can include PCA in the pipeline. @htcai let me know if this is something you are working on... if not I can look into it.
This is getting off topic, but regarding:
I looked into this a bit and didn't find a clear correlation between number of positives and ideal n_components... this will be easier to evaluate now that we can include PCA in the pipeline.
@rdvelazquez when working with small n_positives, you'll likely need to switch the CV assessment to use repeated cross-validation or a large number of StratifiedShuffleSplits. See the discussion on https://github.com/cognoma/machine-learning/pull/71 by @htcai. We ended up going with:
sss = StratifiedShuffleSplit(n_splits=100, test_size=0.1, random_state=0)
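For reference, that splitter just gets passed as the cv argument. A small self-contained sketch (synthetic data and a placeholder classifier, not our actual setup):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score

# Synthetic stand-in for a task with few positives.
X, y = make_classification(n_samples=150, n_features=50, weights=[0.9, 0.1],
                           random_state=0)

# 100 stratified 90/10 splits smooth out the variance you'd see from a
# single k-fold split when positives are scarce.
sss = StratifiedShuffleSplit(n_splits=100, test_size=0.1, random_state=0)
scores = cross_val_score(SGDClassifier(), X, y, cv=sss, scoring='roc_auc')
print(scores.mean(), scores.std())
```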
Let's open a new issue if we want to continue this discussion.
I'm excited about trying out dask-searchcv as a drop-in replacement for GridSearchCV. For info on dask-searchcv, see the blog post, github, docs, and video. I'm hoping that using dask-searchcv in place of GridSearchCV will help solve the following problems:
I initially mentioned dask-searchcv in https://github.com/cognoma/machine-learning/pull/93#issuecomment-302851217, a PR by @patrick-miller. I thought this would be a good issue for @rdvelazquez to work on. @rdvelazquez are you interested?
We'll have to add some additional dependencies to our environment. It may be a good time to also update the package versions of existing packages (especially pandas).
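For anyone who hasn't looked at the docs yet, the swap itself is roughly this (a sketch based on the dask-searchcv docs; the data, pipeline, and grid here are placeholders, not our actual setup):

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
import dask_searchcv as dcv

# Synthetic stand-in data; our real pipeline has the covariate handling discussed above.
X, y = make_classification(n_samples=200, n_features=100, random_state=0)

pipeline = Pipeline([
    ('scale', StandardScaler()),
    ('pca', PCA()),
    ('classify', SGDClassifier()),
])
param_grid = {
    'pca__n_components': [30, 60, 90],
    'classify__alpha': [1e-4, 1e-3, 1e-2],
}

# dask-searchcv mirrors sklearn's GridSearchCV API but builds a task graph,
# so shared pipeline steps aren't refit for every parameter combination.
grid = dcv.GridSearchCV(pipeline, param_grid, cv=5)
grid.fit(X, y)
print(grid.best_params_)
```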