Closed GWMcElfresh closed 11 months ago
Some figures that help illustrate the point:
@GWMcElfresh, I think this looks good overall. I qualified "mlr3::as_task_classif", which may fix that test error.
Thanks, I think this is a meta-package problem. I'm importing mlr3verse so that the glmnet learner (not shipped with base mlr3) gets pulled in. It seems that mlr3 does all the function exports, but mlr3 itself isn't exposed when importing mlr3verse.
I probably just need to import both of them in the roxygen docs; I'll test that in the morning.
@GWMcElfresh: I just added two more "mlr3::" qualifiers. We'll see if that's enough.
Perhaps. If you can avoid a blanket "@import ", that is probably preferable. For what I added to pseudobulking, simply qualifying the method it was using with "mlr3::" should be sufficient, and it didn't need any roxygen changes. I have not looked closely at last night's failures yet, though.
Agreed, I'll try to find a better way to get glmnet's learner to play nicely without an @import mlr3verse. In the meantime, I imported just the mlr3 functions (and added them to the NAMESPACE), so that should fix the mlr3 import errors.
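For reference, the two approaches discussed in this thread can be sketched as follows. This is a minimal illustration, not the code in this PR: it assumes mlr3 is listed under Imports in the DESCRIPTION, and it uses the built-in `iris` data purely as a stand-in.

```r
# Sketch of the two options discussed above; assumes mlr3 is listed under
# Imports in the package DESCRIPTION.

# Option 1: import only the needed function via roxygen. After running
# roxygen2, this adds `importFrom(mlr3,as_task_classif)` to NAMESPACE.
#' @importFrom mlr3 as_task_classif
NULL

# Option 2: qualify the call explicitly, which needs no roxygen change.
# `iris` is a stand-in dataset used purely for illustration.
if (requireNamespace("mlr3", quietly = TRUE)) {
  task <- mlr3::as_task_classif(iris, target = "Species")
  print(task$target_names)
}
```

Either option avoids a blanket import of the meta-package; the qualified call keeps the dependency visible at the call site.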
Hi everyone,
This PR addresses a somewhat longstanding issue that we've been having. Namely, given some metadata feature for which we would use `FindMarkers()` or PCA to try to isolate candidate genes, how many of those genes are sufficient?
This uses glmnet's penalized regression framework to add a penalty for each parameter (gene) in the model, so that you optimize a trade-off between accuracy and the number of genes used for prediction.
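To make the trade-off concrete, here is a minimal sketch on simulated data. It calls glmnet directly rather than the function added in this PR, and the rule used to pick a model (the sparsest fit that reaches a chosen fraction of the best deviance ratio on the path) is only an assumed reading of a deviance cutoff, not necessarily how this PR implements it. All variable names are hypothetical.

```r
# Sketch only: simulated data, not the PR's implementation.
library(glmnet)

set.seed(1)
n_samples <- 200
n_genes <- 50

# Simulated expression matrix (samples x genes) with hypothetical gene names.
x <- matrix(rnorm(n_samples * n_genes), nrow = n_samples,
            dimnames = list(NULL, paste0("gene", seq_len(n_genes))))

# Only the first three genes carry signal for the binary label.
linear_predictor <- 2 * x[, 1] - 2 * x[, 2] + 1.5 * x[, 3]
y <- factor(rbinom(n_samples, size = 1, prob = plogis(linear_predictor)))

# alpha = 1 is the lasso penalty; glmnet fits a whole path of lambda values.
fit <- glmnet(x, y, family = "binomial", alpha = 1)

# df: number of non-zero coefficients; dev.ratio: fraction of null deviance
# explained. Both are reported once per lambda on the path.
path <- data.frame(lambda = fit$lambda, n_genes = fit$df,
                   dev_ratio = fit$dev.ratio)

# Assumed cutoff rule: take the first (sparsest) model on the path that
# reaches the requested fraction of the best deviance ratio.
deviance_cutoff <- 0.8
chosen <- which(path$dev_ratio >= deviance_cutoff * max(path$dev_ratio))[1]

# Extract the genes with non-zero coefficients at the chosen lambda.
coef_vec <- as.matrix(coef(fit, s = path$lambda[chosen]))[, 1]
selected_genes <- setdiff(names(coef_vec)[coef_vec != 0], "(Intercept)")
print(selected_genes)
```

Raising `deviance_cutoff` toward 1.0 moves the choice further along the path, which admits more genes in exchange for a better fit.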
My primary use case here would be pseudobulking to isolate candidate "heatmap ready" gene sets for interpretation, but this does technically support single cell data (just please do not run it locally).
Usage:
`devianceCutoff` is your tunable accuracy parameter: 1.0 = 100% accuracy (i.e. give me every non-redundant + useful gene for classification).
If you want to run this iteratively to classify a set of candidate genes that isolate `Vaccine` and also `Timepoint`, you can set `returnModelAndSplits = TRUE` so that you can keep the same testing and training sets to prevent data leakage (PR supporting multiple regression for evaluating these gene sets jointly TBD).
Usage for this kind of iterative regression is:
I do want to highlight that, since this is a penalized regression problem, genes that have approximately the same predictive power as another gene will get dropped. Coupling this with a post-hoc correlation against the genes in the gene set is a good idea to "fill out" the rest of a candidate gene set.
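One way to do that post-hoc step is sketched below in base R, on simulated data. The gene names, the Spearman method, and the 0.8 threshold are all assumptions made for illustration, not part of this PR.

```r
# Sketch only: simulated data with hypothetical gene names.
set.seed(7)
n_samples <- 60
gene_a <- rnorm(n_samples)

expr <- cbind(
  geneA = gene_a,
  geneB = gene_a + rnorm(n_samples, sd = 0.1),  # near-duplicate of geneA
  geneC = rnorm(n_samples),                     # unrelated to geneA
  geneD = rnorm(n_samples)                      # unrelated to geneA
)

# Pretend the penalized model kept geneA and dropped its near-duplicate.
selected_genes <- "geneA"
candidate_genes <- setdiff(colnames(expr), selected_genes)

# Correlate every dropped gene against every selected gene
# (rows: candidates, columns: selected genes).
cor_mat <- cor(expr[, candidate_genes, drop = FALSE],
               expr[, selected_genes, drop = FALSE],
               method = "spearman")

# Keep candidates whose strongest absolute correlation passes the threshold.
cor_threshold <- 0.8  # assumed value; tune for your data
max_abs_cor <- apply(abs(cor_mat), 1, max)
correlated_genes <- names(max_abs_cor)[max_abs_cor >= cor_threshold]

expanded_gene_set <- union(selected_genes, correlated_genes)
print(expanded_gene_set)
```

Here the near-duplicate gene is recovered because it tracks a selected gene closely, while the unrelated genes stay out of the expanded set.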