Two datasets for Random Forest

iacopogentile commented 7 months ago

Hi, wonderful project and tool.

I have not used it yet, but I have been reading through the available vignettes.

I was wondering if there is a way to train the random forest on a different dataset from the matrix used directly for clustering. That is, two datasets would be input: one used for clustering, and the other used to train the RF. I can provide more details if needed to explain why this would be necessary. Best, Iacopo

catpetersen commented 7 months ago

Hi! This will be a future feature. We're hoping to add an option to do "count-splitting" (similar to what is described here: https://doi.org/10.1093/biostatistics/kxac047).

iacopogentile commented 7 months ago

Thank you

catpetersen commented 6 months ago

Hi again! Just wanted to let you know that we've now enabled count splitting in the dev branch!

Here's a quick run-down from the updated vignette:

When parameter countsplit = TRUE, CHOIR accepts count split input matrices. One matrix will be used to calculate highly variable features, dimensionality reductions, nearest neighbor adjacency matrices, and the initial clustering tree. The other matrix will be exclusively used as input to the random forest classifiers, in order to decide when clusters should or should not be merged.
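For intuition, here is a minimal base R sketch of the idea behind count splitting, assuming Poisson-distributed counts: each count is divided at random between two matrices by binomial thinning. The toy matrix and the variable names are made up for illustration, and this is not the code that CHOIR or countsplit::countsplit actually runs.

```r
set.seed(1)

# Toy count matrix: 5 genes (rows) x 4 cells (columns); values are simulated.
counts <- matrix(rpois(20, lambda = 5), nrow = 5,
                 dimnames = list(paste0("gene", 1:5), paste0("cell", 1:4)))

# Binomial thinning: each count is divided at random between two matrices.
split_1 <- counts
split_1[] <- rbinom(length(counts), size = as.vector(counts), prob = 0.5)
split_2 <- counts - split_1

# The two halves always sum back to the original counts.
stopifnot(all(split_1 + split_2 == counts))
```

One of the two resulting matrices would then play the role of the clustering input and the other the role of the random forest input.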

You can run count splitting on your Seurat object using the following code. It extracts your existing count matrix, runs function countsplit::countsplit, and stores the resulting matrices back in your object with added suffixes (defaults are "_1" and "_2"). Use parameter normalization_method to apply normalization and store the normalized count split matrices.

seurat_object <- runCountSplit(seurat_object)

By default, this produces log-normalized count split matrices stored under slots "counts_log_1" and "counts_log_2", which can then be input to CHOIR:

seurat_object <- CHOIR(seurat_object, 
                       use_slot = "counts_log", 
                       countsplit = TRUE)

Alternatively, you can provide your own count split matrices. They must share a prefix, which is provided to parameter use_slot for Seurat objects or parameter use_assay for SingleCellExperiment objects. The unique suffixes should be provided to parameter countsplit_suffix as a character vector.
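As a rough sketch of how the prefix and suffixes fit together, assuming a Seurat object that already holds both matrices: the suffix values "_a" and "_b" below are hypothetical, chosen only to differ from the defaults. The CHOIR call is left commented out because it needs a real object.

```r
# Shared prefix and unique suffixes (hypothetical values, not CHOIR defaults).
prefix   <- "counts_log"
suffixes <- c("_a", "_b")

# Names under which the two count split matrices would need to be stored.
matrix_names <- paste0(prefix, suffixes)
print(matrix_names)

# Assumed call, using only the parameters described above:
# seurat_object <- CHOIR(seurat_object,
#                        use_slot = prefix,
#                        countsplit = TRUE,
#                        countsplit_suffix = suffixes)
```

For a SingleCellExperiment object, the prefix would go to use_assay instead of use_slot.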

corceslab / CHOIR

Two datasets for Random Forest #4