patrick-miller commented 7 years ago

Following from the great work done by @joshlevy89 in #67, I have created a notebook that takes the covariates and runs three models in the same vein as the main directory notebooks:

Full features -- gene expression matrix (after PCA) + covariates
Covariates -- only the covariates
Expressions -- only the gene expression matrix (after PCA)

This notebook strays a little bit in that it uses PCA instead of the SelectKBest, but I think that is where we are moving anyway. Each of the models is fit on the same partitions, and metrics are calculated for each. In most of the places, I store the results for each model, but in a few spots where we show the results, I only access one model. I can change it either way if you would like, let me know what you would like @dhimmel.

One little "cheat" that we are performing here is running PCA on the entire expression matrix instead of just on the train partition. In a separate pull request, I think we should migrate the PCA to being performed inside the pipeline. However, that is not straightforward.

dhimmel commented 7 years ago

One little "cheat" that we are performing here is running PCA on the entire expression matrix instead of just on the train partition. In a separate pull request, I think we should migrate the PCA to being performed inside the pipeline. However, that is not straightforward.

WIll do a full review later. Just wanted to note dask-searchcv which should make the PCA in a GridSearchCV pipeline more efficient. http://jcrist.github.io/introducing-dask-searchcv.html

patrick-miller commented 7 years ago

Ok, so the two big things left here are:

updating the plot (I think just using the Vega spec makes sense, I will write a wrapper to dump the metrics to JSON)
building the PCA into the CV pipeline: going to try this

dhimmel commented 7 years ago

updating the plot (I think just using the Vega spec makes sense, I will write a wrapper to dump the metrics to JSON)

IIRC, you should be able to pass the dataframe directly using ipyvega:

rdvelazquez commented 7 years ago

This looks great. It will be good to have a more up-to-date notebook in the main directory, especially for new people.

Are we concerned about including the total number of mutations as a feature? Seems like we are using the thing we are trying to predict.

I removed the selected_cols.append('n_mutations_log1p') line and ran the notebook with a few different random states to see how much the total mutations feature adds... the testing AUROC seems to drop about 0.5%-1% for the full dataset and about 2% for the covariates only.

patrick-miller commented 7 years ago

I had the same question about including number of mutations at the last meetup. @dhimmel has a good explanation for why it is kept.

dhimmel commented 7 years ago

Are we concerned about including the total number of mutations as a feature? Seems like we are using the thing we are trying to predict.

Only in a very minor way. There are ~20000 genes where mutation is possible. We are fitting a classifier on only a single one of those genes. So if you wanted to entirely remove any confounding effect you could subject a single mutation from n_mutation for samples that were positives. However, I suspect the effect will be trivial. See https://github.com/cognoma/machine-learning/issues/8 for why including the mutation load covariate is important.

rdvelazquez commented 7 years ago

I agree that removing only the gene you are trying to predict from n_mutation would have a trivial impact.

I was more getting at the fact that you would only be able to use the trained classifier on data sets where you already know the total mutation load. I assume if you know the total mutation load you would likely already know if the gene of interest was mutated (I don't know if this is true).

8 was a little over my head but it sounds like including the total mutation load is beneficial and not uncommon, it just may limit the use of the produced classifier (if my assumption isn't off-base).

Thanks for the response!

dhimmel commented 7 years ago

I assume if you know the total mutation load you would likely already know if the gene of interest was mutated (I don't know if this is true).

In some cases this will be true. But in these cases, we'd have the user rerun the notebook and remove the mutation load covariate.

patrick-miller commented 7 years ago

I changed the ROC plot to use a version of the Vega spec. We can play around with the Vega configurations to make the plot look a little better (color scheme, size, etc.).

Maybe we should separate out implementing new CV pipelines into a different PR #96?

dhimmel commented 7 years ago

Maybe we should separate out implementing new CV pipelines into a different PR

Agreed!

patrick-miller commented 7 years ago

Ok, let's split out the remaining parts into two new pull requests:

fixing visualization (keep interactivity, get rid of temporary files)
incorporate PCA just on the gene expressions into the pipeline

dhimmel commented 7 years ago

@patrick-miller are you waiting on me / is this ready to merge on your end?

patrick-miller commented 7 years ago

The covariates model part is good to go, unless you have any other issues. What needs to be updated are the visualization and PCA in the pipeline, both of which are a bit tangential.

cognoma / machine-learning

Add covariates-only model for comparison in the main notebook #93

8 was a little over my head but it sounds like including the total mutation load is beneficial and not uncommon, it just may limit the use of the produced classifier (if my assumption isn't off-base).