cognoma / machine-learning

Machine learning for Project Cognoma

Median absolute deviation feature selection #22

Open dhimmel opened 7 years ago

dhimmel commented 7 years ago

@gwaygenomics presented evidence that median absolute deviation (MAD) feature selection (selecting genes with the highest MADs) can eliminate most features without hurting performance: https://github.com/cognoma/machine-learning/pull/18#issuecomment-236265506. In fact, it appears that performance increased with the feature selection, which could make sense if the selection enriched for predictive features, increasing the signal-to-noise ratio.
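For concreteness, here is a minimal sketch of that selection step, assuming an (n_samples, n_genes) pandas DataFrame; the `expression_df` name and `mad` helper are hypothetical, not the notebook's actual code:

```python
import numpy as np

def mad(values):
    """Median absolute deviation of a 1-D array of expression values."""
    return np.median(np.abs(values - np.median(values)))

def select_top_mad_genes(expression_df, n_genes=500):
    """Keep only the n_genes columns (genes) with the highest MAD."""
    mad_per_gene = expression_df.apply(mad, axis=0)
    top_genes = mad_per_gene.nlargest(n_genes).index
    return expression_df.loc[:, top_genes]
```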

Therefore, I think we should investigate this method of feature selection further. Specifically, I'm curious whether:

I'm labeling this issue a task, so please investigate if you feel inclined.

dhimmel commented 7 years ago

In 34225ccdefa191287ca153fc14c73bb4eaa6706d -- an example for classifying TP53 mutation -- we did not apply MAD feature selection (notebook). In a8ae61147897aed4a3883853563b357644cbc5f3 (pull request #25), @yl565 selected the top 500 MAD genes (notebook).

Before MAD feature selection, training AUROC was 95.9% and testing AUROC was 93.5%. After MAD feature selection, training AUROC was 89.9% and testing AUROC was 87.9%. @yl565, did anything else change in your pull request that would negatively affect performance? If not, I think we may have an example of selecting 500 MAD genes being detrimental. See @gwaygenomics's analysis benchmarking RAS mutations: a 500-gene cutoff appears to be borderline dangerous.

yl565 commented 7 years ago

Since a pipeline is used, only X_train is used for feature selection and standardization. This decreases AUROC, but I think it better reflects reality: we want the classifier to predict whether the gene is mutated for a new patient, so in practice X_test is effectively a single sample. Using the entire dataset X for feature selection and standardization would cause overfitting through leakage from the test set. The figure below compares testing AUROC across varying numbers of features selected by MAD.

[figure: testing AUROC vs. number of MAD-selected features]
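A minimal sketch of such a pipeline, assuming scikit-learn; the `mad_scores` helper and the classifier settings are illustrative placeholders, not the exact notebook configuration:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

def mad_scores(X, y=None):
    """Score each feature by its median absolute deviation; y is ignored."""
    return np.median(np.abs(X - np.median(X, axis=0)), axis=0)

pipeline = Pipeline([
    # MAD scores are computed on X_train alone when fit() is called,
    # so the held-out data never informs the selection.
    ('select', SelectKBest(score_func=mad_scores, k=500)),
    ('standardize', StandardScaler()),
    # loss='log_loss' is spelled 'log' in older scikit-learn releases.
    ('classify', SGDClassifier(loss='log_loss', penalty='elasticnet', random_state=0)),
])

# pipeline.fit(X_train, y_train)             # selection + scaling fit on X_train only
# y_score = pipeline.decision_function(X_test)  # X_test is only transformed
```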

dhimmel commented 7 years ago

@yl565, really informative analysis. Can you share the source code? Check out GitHub gists if you want a quick way to host a single notebook. Also, I'd love to see the graph extended to all ~20,000 genes.

I'm having some trouble understanding why performance drops off when you feature select and scale on X_train only. I wouldn't expect our unsupervised selection and scaling to cause overfitting, and X_test is only 10% of the samples. Do you have any insight?

yl565 commented 7 years ago

Because there are differences in distribution between the training and testing sets. The figure below shows the genes with the largest differences between the training and testing data. I guess ~7,000 samples are not enough to represent the gene-expression variation of the population.

[figure: genes with the largest train/test distribution differences]
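One way to quantify such per-gene train/test shifts is a two-sample Kolmogorov-Smirnov statistic per gene; this sketch assumes NumPy arrays and SciPy, and the `train_test_divergence` helper is hypothetical rather than the gist's actual approach:

```python
import numpy as np
from scipy.stats import ks_2samp

def train_test_divergence(X_train, X_test):
    """Two-sample KS statistic for each gene (column) between splits."""
    return np.array([
        ks_2samp(X_train[:, j], X_test[:, j]).statistic
        for j in range(X_train.shape[1])
    ])

# Genes with the largest statistics are the most shifted between
# the training and testing data.
```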

Here is the code: https://gist.github.com/yl565/1a978e358a00dea573590e0456dfc1b2#file-1-tcga-mlexample-effectoffeaturenumbers-ipynb