Closed patrick-miller closed 7 years ago
One little "cheat" that we are performing here is running PCA on the entire expression matrix instead of just on the train partition. In a separate pull request, I think we should migrate the PCA to being performed inside the pipeline. However, that is not straightforward.
WIll do a full review later. Just wanted to note dask-searchcv
which should make the PCA in a GridSearchCV pipeline more efficient. http://jcrist.github.io/introducing-dask-searchcv.html
Ok, so the two big things left here are:
updating the plot (I think just using the Vega spec makes sense, I will write a wrapper to dump the metrics to JSON)
IIRC, you should be able to pass the dataframe directly using ipyvega:
This looks great. It will be good to have a more up-to-date notebook in the main directory, especially for new people.
Are we concerned about including the total number of mutations as a feature? Seems like we are using the thing we are trying to predict.
I removed the selected_cols.append('n_mutations_log1p')
line and ran the notebook with a few different random states to see how much the total mutations feature adds... the testing AUROC seems to drop about 0.5%-1% for the full dataset and about 2% for the covariates only.
I had the same question about including number of mutations at the last meetup. @dhimmel has a good explanation for why it is kept.
Are we concerned about including the total number of mutations as a feature? Seems like we are using the thing we are trying to predict.
Only in a very minor way. There are ~20000 genes where mutation is possible. We are fitting a classifier on only a single one of those genes. So if you wanted to entirely remove any confounding effect you could subject a single mutation from n_mutation
for samples that were positives. However, I suspect the effect will be trivial. See https://github.com/cognoma/machine-learning/issues/8 for why including the mutation load covariate is important.
I agree that removing only the gene you are trying to predict from n_mutation
would have a trivial impact.
I was more getting at the fact that you would only be able to use the trained classifier on data sets where you already know the total mutation load. I assume if you know the total mutation load you would likely already know if the gene of interest was mutated (I don't know if this is true).
Thanks for the response!
I assume if you know the total mutation load you would likely already know if the gene of interest was mutated (I don't know if this is true).
In some cases this will be true. But in these cases, we'd have the user rerun the notebook and remove the mutation load covariate.
I changed the ROC plot to use a version of the Vega spec. We can play around with the Vega configurations to make the plot look a little better (color scheme, size, etc.).
Maybe we should separate out implementing new CV pipelines into a different PR #96?
Maybe we should separate out implementing new CV pipelines into a different PR
Agreed!
Ok, let's split out the remaining parts into two new pull requests:
@patrick-miller are you waiting on me / is this ready to merge on your end?
The covariates model part is good to go, unless you have any other issues. What needs to be updated are the visualization and PCA in the pipeline, both of which are a bit tangential.
Following from the great work done by @joshlevy89 in #67, I have created a notebook that takes the covariates and runs three models in the same vein as the main directory notebooks:
This notebook strays a little bit in that it uses PCA instead of the SelectKBest, but I think that is where we are moving anyway. Each of the models is fit on the same partitions, and metrics are calculated for each. In most of the places, I store the results for each model, but in a few spots where we show the results, I only access one model. I can change it either way if you would like, let me know what you would like @dhimmel.
One little "cheat" that we are performing here is running PCA on the entire expression matrix instead of just on the train partition. In a separate pull request, I think we should migrate the PCA to being performed inside the pipeline. However, that is not straightforward.