FitRegularizedClassificationGlm for finding Gene Sets

GWMcElfresh commented 11 months ago

Hi everyone,

This PR addressed a somewhat longstanding issue that we've been having. Namely, given some metadata feature that we would use FindMarkers() or PCA to try to isolate candidate genes for, how many of those genes is sufficient?

This uses glmnet's penalized regression framework to add a penalty for each parameter (gene) in the model such that you optimize a trade off between accuracy and number of genes used for prediction.

My primary use case here would be pseudobulking to isolate candidate "heatmap ready" gene sets for interpretation, but this does technically support single cell data (just please do not run it locally).

Usage:

FitRegularizedClassificationGlm(seuratObj,
                                metadataVariableForClassification = "Vaccine",
                                rescale = TRUE,
                                numberOfVariableFeatures = 3000,
                                assay = "RNA",
                                slot = "scale.data",
                                devianceCutoff = 0.8,
                                split = NULL, 
                                returnModelAndSplits = F)

devianceCutoff is your tunable accuracy parameter: 1.0 = 100% accuracy (e.g. give me every non-redundant + useful gene for classification).

If you want to run this iteratively to classify a set of candidate genes that isolate Vaccine and also Timepoint, you can set returnModelAndSplits = TRUE so that you can keep the same testing and training sets to prevent data leakage (PR supporting multiple regression for evaluating these gene sets jointly TBD).

Usage for this kind of iterative regression is:

vaccine_model_and_features <- FitRegularizedClassificationGlm(seuratObj, 
                                                              metadataVariableForClassification = 'Vaccine', 
                                                              returnModelAndSplits = T, 
                                                              rescale = T, 
                                                              split = NULL
                                                              )

timepoint_model_and_features <- FitRegularizedClassificationGlm(seuratObj, 
                                                                metadataVariableForClassification = 'Timepoint', 
                                                                returnModelAndSplits = T, 
                                                                rescale = F, 
                                                                split = vaccine_model_and_features$split
                                                                )

I do want to highlight that since this is a regression problem, so genes that have approximately equal predictive power as another gene will get dropped. Coupling this with a post-hoc correlation with the genes in the gene set is a good idea to "fill out" the rest of a candidate gene set.

GWMcElfresh commented 11 months ago

Some figures that help illustrate the point:

bbimber commented 11 months ago

@GWMcElfresh, i think this overall looks good. I did qualify "mlr3::as_task_classif", which may fix that test error.

GWMcElfresh commented 11 months ago

thanks, I think this is a meta-package problem. I'm importing mlr3verse so that the glmnet (not shipped with base mlr3) gets pulled in. It seems like mlr3 does all the function exports, but itself isn't exposed when importing mlr3verse.

I probably just need to import both of them in the roxygen docs, I'll test that in the morning.

bbimber commented 11 months ago

Perhaps. If you can avoid a blanket "@import ", that is probably preferable. In the case of what I added to pseudobulking, simply qualifying the method it was using with "mlr3::" should be sufficient, and it didnt need any roxygen changes. I did not look very specifically into the failures from last night yet though

bbimber commented 11 months ago

@GWMcElfresh: two more "mlr3::" just added. we'll see if that's enough

GWMcElfresh commented 11 months ago

Perhaps. If you can avoid a blanket "@import ", that is probably preferable. In the case of what I added to pseudobulking, simply qualifying the method it was using with "mlr3::" should be sufficient, and it didnt need any roxygen changes. I did not look very specifically into the failures from last night yet though

agreed, I'll try to find a better way to get glmnet's learner to play nicely without an @import mlr3verse. In the meantime, I imported just the mlr3 functions (and added it to the NAMESPACE) so that should fix the mlr3 import errors.

bimberlabinternal / CellMembrane

FitRegularizedClassificationGlm for finding Gene Sets #193