corceslab / CHOIR

CHOIR : Clustering Hierarchy Optimization by Iterative Random forests (www.CHOIRclustering.com)
MIT License

Starting with pruneTree and using harmony reduction, do I still need to correct batch effect in the pruneTree? #13

Closed YiweiNiu closed 6 months ago

YiweiNiu commented 6 months ago

Hi,

Thanks for developing this useful tool! I would like to start from the pruneTree function, providing a Harmony-corrected reduction and adjacency matrices. Specifically, I want to adapt the code from https://github.com/corceslab/CHOIR/issues/6#issuecomment-1926119318, like this:

# Load packages (Seurat for the object accessors, CHOIR for inferTree/pruneTree)
library(Seurat)
library(CHOIR)

# Import dataset & pre-process
# Object cluster_labels should be a vector of your cluster labels
cluster_labels <- seurat_object@meta.data$seurat_clusters
# Make this a named vector, where the names correspond to the cell IDs
names(cluster_labels) <- colnames(seurat_object) # Assumes that cluster labels are in the same order as cells in the object

# Extract the cell embeddings of your dimensionality reduction
dim_reduction <- seurat_object@reductions$harmony@cell.embeddings

# Create cluster_tree
cluster_tree <- inferTree(cluster_labels = cluster_labels,  reduction = dim_reduction)

# Extract variable features and input matrix for random forest classifiers
var_features <- VariableFeatures(seurat_object)
input_matrix <- CHOIR:::.getMatrix(seurat_object, 
                                   use_assay = "RNA", 
                                   use_slot = "data", 
                                   use_features = var_features, 
                                   verbose = TRUE)

# Extract adjacency matrices
nn_matrix <- seurat_object@graphs$RNA_nn
snn_matrix <- seurat_object@graphs$RNA_snn

# Run pruneTree
seurat_object <- pruneTree(seurat_object, 
                           cluster_tree = cluster_tree, 
                           input_matrix = input_matrix,
                           nn_matrix = nn_matrix,
                           snn_matrix = snn_matrix,
                           reduction = dim_reduction)

Given this setup, I was wondering whether it is still necessary to enable batch correction in pruneTree, considering that the cluster_tree and snn_matrix are both based on the Harmony-corrected dimensionality reduction.

Thanks!

catpetersen commented 6 months ago

Hi! Yes, I would still recommend enabling CHOIR's internal batch correction by setting parameter batch_correction_method to "Harmony" and parameter batch_labels to the name of the cell metadata column containing your batch info.

Since you're starting at the pruneTree step, this won't change your provided dimensionality reductions at all, but will ensure that the random forest classifiers are not biased by potential remaining batch effects in your cell x feature matrix.
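
As a rough sketch of what that might look like, the pruneTree call from the question can be wrapped with the two batch parameters added. The argument names are taken from the snippet above and from the parameters named in this reply; the wrapper itself and the metadata column name "batch" are hypothetical, and this has not been checked against a specific CHOIR version.

```r
# Sketch only: the pruneTree call from the question, with CHOIR's internal
# batch correction enabled. The CHOIR package is needed when the function is called.
prune_with_batch_correction <- function(seurat_object, cluster_tree, input_matrix,
                                        nn_matrix, snn_matrix, dim_reduction,
                                        batch_column = "batch") {  # hypothetical column name
  # Fail early if the batch column is missing from the cell metadata
  stopifnot(batch_column %in% colnames(seurat_object@meta.data))

  CHOIR::pruneTree(seurat_object,
                   cluster_tree = cluster_tree,
                   input_matrix = input_matrix,
                   nn_matrix = nn_matrix,
                   snn_matrix = snn_matrix,
                   reduction = dim_reduction,
                   batch_correction_method = "Harmony",  # enable internal batch correction
                   batch_labels = batch_column)          # metadata column holding batch info
}
```

Called with the objects built in the question, this would return the updated Seurat object, just as the original pruneTree call does.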

A bit of explanation: relative batch composition will naturally vary a bit across clusters, particularly at the bottom of the clustering tree, where you have more clusters. For example, say we have a hypothetical case where CHOIR is comparing cluster A (composed of 75% cells from batch 1 and 25% cells from batch 2) to cluster B (composed of 25% cells from batch 1 and 75% cells from batch 2). In this case, the random forest classifiers may learn, to some degree, to associate the signature of batch 1 with cluster A and the signature of batch 2 with cluster B. We don't want the decision to split/merge these clusters to be biased by the batch composition, because this may lead to overclustering. With CHOIR's batch correction enabled, cells from each batch are compared separately to determine whether the clusters should be merged, avoiding this potential bias.
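
To see how uneven the batch composition of your own clusters is, a quick base-R tabulation may help. The sketch below uses toy metadata that mirrors the hypothetical clusters A and B above; the data frame and its column names are made up for illustration, and this is only a diagnostic, not part of CHOIR.

```r
# Toy cell metadata: 100 cells per cluster, mirroring the hypothetical example
meta <- data.frame(
  cluster = rep(c("A", "B"), each = 100),
  batch   = c(rep(c("batch1", "batch2"), times = c(75, 25)),   # cluster A
              rep(c("batch1", "batch2"), times = c(25, 75)))   # cluster B
)

# Rows are clusters, columns are batches; each row sums to 1
batch_composition <- prop.table(table(meta$cluster, meta$batch), margin = 1)
print(batch_composition)
```

With a real object, the same table could be built from the cluster and batch columns of seurat_object@meta.data.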

YiweiNiu commented 6 months ago

Hi, thank you very much for your awesome explanation! I found that CHOIR currently only supports correcting for one covariate with Harmony (it gives errors when two covariates are passed to batch_labels). Is there any workaround for correcting more than one covariate?

catpetersen commented 6 months ago

CHOIR can only handle 1 batch variable, so I think the best solution is probably to create a new metadata column with the two batch covariates pasted together.
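
A minimal sketch of that workaround in base R, using a plain data frame as a stand-in for the cell metadata. The covariate names donor and chemistry and the new column name batch_combined are hypothetical.

```r
# Stand-in for seurat_object@meta.data with two batch covariates (hypothetical names)
meta <- data.frame(
  donor     = c("d1", "d1", "d2", "d2"),
  chemistry = c("v2", "v3", "v2", "v3"),
  row.names = paste0("cell", 1:4)
)

# Paste the two covariates together so that each combination is its own batch
meta$batch_combined <- paste(meta$donor, meta$chemistry, sep = "_")

# With a Seurat object, the equivalent assignment would be along the lines of:
# seurat_object$batch_combined <- paste(seurat_object$donor, seurat_object$chemistry, sep = "_")
```

The new column name can then be passed as batch_labels. Note that every distinct combination becomes a separate batch, so combinations with very few cells may be worth checking for first.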

YiweiNiu commented 6 months ago

Got it, thanks!