PCA (and other pre-processing steps) on just the expression matrix in CV pipeline

cognoma / machine-learning

Machine learning for Project Cognoma

PCA (and other pre-processing steps) on just the expression matrix in CV pipeline #96

Closed patrick-miller closed 7 years ago

patrick-miller commented 7 years ago

We have two sources of features: the covariates and the gene expression matrix. When pre-processing this data, we generally want to perform dimensionality reduction only on the expression matrix.

This can prove cumbersome when trying to implement PCA in a pipeline. Scikit-learn does not provide this sort of functionality out of the box, but I believe it is possible to continue using pipelines. Here is an example.

dhimmel commented 7 years ago

In https://github.com/cognoma/machine-learning/pull/67, @joshlevy89 explored different ways of applying transformations to a subset of features (e.g. PCA on expression features only). See https://github.com/cognoma/machine-learning/pull/67#discussion_r86557384. It looks like we ended up removing all of those solutions from the PR. However, there were at least 2 options we discussed:

Using sklearn.pipeline.FeatureUnion
Using sklearn_pandas.DataFrameMapper
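
For reference, a minimal sketch of what the FeatureUnion option might look like. This is not the code from the PR: the toy data, the column layout (covariates first, expression after), and the helper names select_covariates and select_expression are all assumptions made for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.preprocessing import FunctionTransformer

# Hypothetical toy data: 3 covariate columns followed by 20 expression columns.
rng = np.random.RandomState(0)
n_samples, n_covariates, n_genes = 50, 3, 20
X = rng.normal(size=(n_samples, n_covariates + n_genes))


def select_covariates(X):
    """Return only the covariate columns (assumed to come first)."""
    return X[:, :n_covariates]


def select_expression(X):
    """Return only the expression columns (assumed to come last)."""
    return X[:, n_covariates:]


union = FeatureUnion([
    # Covariates are passed through unchanged.
    ("covariates", FunctionTransformer(select_covariates)),
    # Expression columns are selected and then reduced with PCA.
    ("expression", make_pipeline(
        FunctionTransformer(select_expression),
        PCA(n_components=5, random_state=0),
    )),
])

# Output columns: 3 untouched covariates, then 5 principal components.
X_reduced = union.fit_transform(X)
print(X_reduced.shape)
```

FeatureUnion concatenates the outputs of its branches column-wise, so each branch can select its own subset of features before transforming it.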

@joshlevy89 do you remember what the issues were / have any advice? Perhaps some of the problems we ran into have been fixed.

patrick-miller commented 7 years ago

Great, thanks for the pointer. I missed that comment on my first pass. I have opened a PR (#100) that uses FeatureUnion, and it seems to work correctly.
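
As a rough illustration of why it helps to keep this inside a pipeline (again, not the code from the PR): when the FeatureUnion is a pipeline step, a grid search refits PCA on the training portion of each fold and can tune the number of components. The toy data, the column layout, the step names, and the use of LogisticRegression as a stand-in classifier are assumptions.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import FeatureUnion, Pipeline, make_pipeline
from sklearn.preprocessing import FunctionTransformer

# Hypothetical toy data: 3 covariates, 20 expression columns, binary labels.
rng = np.random.RandomState(0)
n_covariates = 3
X = rng.normal(size=(60, n_covariates + 20))
y = rng.randint(0, 2, size=60)


def select_covariates(X):
    return X[:, :n_covariates]


def select_expression(X):
    return X[:, n_covariates:]


features = FeatureUnion([
    ("covariates", FunctionTransformer(select_covariates)),
    ("expression", make_pipeline(
        FunctionTransformer(select_expression),
        PCA(random_state=0),
    )),
])

pipeline = Pipeline([
    ("features", features),
    ("classify", LogisticRegression()),
])

# Nested parameter name: pipeline step -> union branch -> PCA step -> parameter.
param_grid = {"features__expression__pca__n_components": [2, 5]}

search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(X, y)
print(search.best_params_)
```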

joshlevy89 commented 7 years ago

It's been a while since I've worked with these modules, so I don't remember specifics beyond what I wrote in the PR. I think the key points from that were: 1) DataFrameMapper was more concise, but it was an extra dependency and relatively new (leading to some weird behaviors). 2) FeatureUnion worked in many situations, but note this from the PR: "It should be possible to apply Imputer to only covariates. But for some reason this is not working with FeatureUnion. It's as if FunctionTransformer were working for SelectKBest but not Imputer"

It's possible things have changed since then or that it works great for your use-case. Hope that helps.
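
One thing that has changed since then: scikit-learn 0.20 added sklearn.compose.ColumnTransformer, which applies a transformer to a chosen subset of columns, and SimpleImputer replaced the older Imputer. A minimal sketch of imputing only the covariates while running PCA only on expression, assuming a recent scikit-learn and hypothetical column names and toy data:

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer

# Hypothetical toy data: 2 covariates (with some missing values) and 10 genes.
rng = np.random.RandomState(0)
covariate_cols = ["age", "n_mutations"]
expression_cols = ["gene_{}".format(i) for i in range(10)]
df = pd.DataFrame(
    rng.normal(size=(40, 12)),
    columns=covariate_cols + expression_cols,
)
df.loc[[0, 5, 9], "age"] = np.nan  # missing covariate values

preprocess = ColumnTransformer([
    # Impute missing values in the covariates only.
    ("covariates", SimpleImputer(strategy="median"), covariate_cols),
    # Reduce the expression columns only.
    ("expression", PCA(n_components=3, random_state=0), expression_cols),
])

# Output columns: 2 imputed covariates, then 3 principal components.
X_processed = preprocess.fit_transform(df)
print(X_processed.shape)
```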