corceslab / CHOIR

CHOIR : Clustering Hierarchy Optimization by Iterative Random forests (www.CHOIRclustering.com)
MIT License

Unexpectedly large number of clusters #7

Open smorabit opened 8 months ago

smorabit commented 8 months ago

Hi Cathrine,

First of all, thanks for developing this very interesting method! I tried out CHOIR on one of my datasets to see how the results would compare with my previous clustering analysis. I understand that one of the main advantages of CHOIR is that, in principle, it should identify the "correct" number of clusters without over- or under-clustering.

I have a snRNA-seq dataset of postmortem human cortical tissue from 8 individuals, totaling ~66k nuclei. Based on my previous analysis and what we expect from other studies, this dataset has clusters corresponding to seven major cell lineages (oligodendrocytes, OPCs, microglia, astrocytes, vascular cells, excitatory neurons, and inhibitory neurons). We generally expect that each of these major cell types would have some subclusters as well. I was surprised that after running CHOIR, I ended up with 132 clusters in my dataset, many of which contained very few nuclei. I ran CHOIR using the Harmony dimensionality reduction that I had already computed. I am not sure what is going on here; do you have any advice for this scenario? Below I am including the code that I ran, as well as a UMAP plot comparing my previous clustering to the CHOIR clustering.

library(Seurat)
library(CHOIR)

# Run CHOIR using a pre-computed Harmony reduction and existing variable features
seurat_obj <- CHOIR(
    seurat_obj,
    reduction = seurat_obj@reductions$harmony@cell.embeddings,
    var_features = VariableFeatures(seurat_obj)
)
[UMAP plots attached: previous clustering annotations and CHOIR_clusters]
catpetersen commented 8 months ago

Hi Sam!

For datasets with multiple batches, it's best to use CHOIR's built-in batch correction method rather than providing a pre-generated dimensionality reduction. Try setting batch_correction_method to "Harmony" and batch_labels to the metadata column name containing your batch info, without providing any input to reduction or var_features.

# Let CHOIR generate its own reductions and run Harmony internally,
# using the "Batch" metadata column as the batch variable
seurat_obj <- CHOIR(
    seurat_obj,
    batch_correction_method = "Harmony",
    batch_labels = "Batch")

CHOIR will run Harmony on the dimensionality reductions generated, on which the initial hierarchical clustering tree is then based. But more importantly, the batch information will be used in the random forest classifier comparisons to ensure that batch effects do not bias the ultimate clusters that pass the significance and accuracy thresholds imposed. Essentially, within the permutation tests, only cells from the same batch are compared within a single iteration.
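To make the batch-aware idea concrete, here is a rough, illustrative sketch of what a single iteration of such a permutation comparison could look like. This is not CHOIR's actual implementation; the inputs (a cells-by-features matrix expr, per-cell vectors clusters and batches) and the use of the ranger random forest package are assumptions for illustration only.

library(ranger)

# Helper: sample up to n indices safely (avoids R's sample() scalar gotcha)
sample_within <- function(idx, n) idx[sample.int(length(idx), min(n, length(idx)))]

one_iteration <- function(expr, clusters, batches, c1, c2, n_per_group = 100) {
  # Only draw from batches that contain cells of both clusters, so every
  # comparison is made between cells from the same batch
  shared <- intersect(batches[clusters == c1], batches[clusters == c2])

  idx <- unlist(lapply(shared, function(b) {
    c(sample_within(which(clusters == c1 & batches == b), n_per_group),
      sample_within(which(clusters == c2 & batches == b), n_per_group))
  }))

  df <- data.frame(expr[idx, , drop = FALSE], label = factor(clusters[idx]))
  train <- sample.int(nrow(df), floor(nrow(df) / 2))

  # Accuracy of a random forest trained on the true cluster labels ...
  fit_true <- ranger(label ~ ., data = df[train, ])
  acc_true <- mean(predict(fit_true, df[-train, ])$predictions == df$label[-train])

  # ... versus the same setup trained on permuted labels (null comparison)
  df_perm <- df
  df_perm$label <- sample(df_perm$label)
  fit_perm <- ranger(label ~ ., data = df_perm[train, ])
  acc_perm <- mean(predict(fit_perm, df_perm[-train, ])$predictions == df_perm$label[-train])

  c(true = acc_true, permuted = acc_perm)
}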

Since most datasets these days have multiple batches, CHOIR's performance was largely optimized with this batch correction method enabled. Let me know how your results look after trying this!

smorabit commented 8 months ago

Thanks for the quick response! I ran the code as you suggested, and I ended up with 21 clusters (relevant figures attached at the bottom). However, they do seem somewhat strange to me compared to my previous analysis. For example, almost all of the clusters that I had previously annotated as excitatory neurons have been assigned to a single CHOIR cluster. The marker gene analysis of my previous excitatory neuron clusters did show unique markers that mapped to cortical layers, so I think there should be more than one cluster here. Any further advice?

I am also curious based on what you said:

... the batch information will be used in the random forest classifier comparisons ...

Since CHOIR's performance relies on this batch information, does this mean that users should not supply dimensionality reductions from other batch correction methods (scVI, LIGER, etc.)? It would be great to be able to use more than just Harmony here.

[Two screenshots attached: UMAP plots of the CHOIR clustering results]

catpetersen commented 8 months ago

I think a couple of things may be going on, related to CHOIR’s statistical power when working with multiple batches within a small-ish dataset.

First, it'd be useful to see the clusters plotted on a UMAP based on the dimensionality reduction that CHOIR generates. To easily do this, run:

# Compute a UMAP embedding from the reduction CHOIR generated,
# then plot the CHOIR clusters on it
seurat_obj <- runCHOIRumap(seurat_obj)
plotCHOIR(seurat_obj)

Are you considering each of your 8 individuals as a separate batch? It would also be helpful to see a UMAP colored by batch on CHOIR's UMAP embedding. Would you expect any cell types/states to be largely confined to a single batch or a small subset of batches? In instances where the two clusters being compared by CHOIR have few cells belonging to the same batch, these clusters are assumed to be batch-confounded instances of the same cell population and are merged. CHOIR's batch correction was designed to work best with batches that include more than one sample and that are not confounded by groups of interest (e.g., disease vs. control).
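As a quick way to check for that kind of batch confounding in your own results, a simple cross-tabulation of CHOIR cluster assignments against batch can help. The metadata column names CHOIR_clusters_0.05 and Batch below are assumptions, so substitute whatever your object actually uses.

# Fraction of each CHOIR cluster coming from each batch; clusters drawn
# almost entirely from a single batch are candidates for the merging
# behavior described above
tab <- table(seurat_obj$CHOIR_clusters_0.05, seurat_obj$Batch)
round(prop.table(tab, margin = 1), 2)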

If you upload the records from seurat_obj@misc$CHOIR$records here, that would also help me get a better sense for what may be going on, but these clusters may just reflect a slightly more “conservative” partition of your cells.
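If it's easier, a minimal way to export those records for upload would be something like the following (the file name is just an example):

# Extract CHOIR's records from the Seurat object and save them to an .rda file
CHOIR_records <- seurat_obj@misc$CHOIR$records
save(CHOIR_records, file = "CHOIR_records.rda")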

Regarding your question:

does this mean that users should not supply dim reductions from other batch correction methods (scVI, LIGER, etc)?

Yes, for now, CHOIR’s batch correction is limited to the built-in Harmony method. But I’d like to expand compatibility to more methods of batch correction in the future, particularly scVI.

Regardless of batch correction, for most users, I'd advise against providing a pre-generated dimensionality reduction. This is because CHOIR generates multiple dimensionality reductions as it builds out the hierarchical clustering tree. A "root" dimensionality reduction encompassing all cells is followed by multiple "subtree" dimensionality reductions that each encompass a subset of the total cells. These subsetted reductions, and their accompanying sets of highly variable features, allow CHOIR to pick up on more nuanced differences between cell subtypes/states. You can visualize the CHOIR clusters on these subsetted dimensionality reductions as well, by specifying the reduction to use in the code up top.
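For example, something along these lines, assuming the two functions above accept a reduction argument as implied; the name "P1_reduction" is only a placeholder guess at how the subtree reductions are stored in your object, so check the stored reduction names first:

# See which dimensionality reductions CHOIR has stored on the object
names(seurat_obj@reductions)

# Re-embed and plot using one of the subtree reductions instead of the root
# reduction ("P1_reduction" is a hypothetical name)
seurat_obj <- runCHOIRumap(seurat_obj, reduction = "P1_reduction")
plotCHOIR(seurat_obj, reduction = "P1_reduction")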

smorabit commented 8 months ago

Hi again,

Based on what you said:

CHOIR's batch correction was designed to work best with batches that include more than 1 sample and that are not confounded by groups of interest (e.g., disease vs. control).

My 8 samples actually do all belong to a single batch; previously, I was using the sample ID as the "batch" grouping. I re-ran CHOIR without any batch correction, since all the samples came from the same batch. Unfortunately, it looks like I still have the problem of a large number of clusters. Here's the code:

# Run CHOIR with default settings (no batch correction), then plot the results
seurat_obj <- CHOIR(seurat_obj)
seurat_obj <- runCHOIRumap(seurat_obj)
plotCHOIR(seurat_obj)

And the resulting plot: [screenshot attached]

If you are interested, here are the records (.rda file).

Regardless of batch correction, for most users, I’d advise against providing a pre-generated dimensionality reduction. This is because CHOIR generates multiple dimensionality reductions as it builds out the hierarchical clustering tree.

Okay, this makes sense! I was assuming that this was more akin to other clustering algorithms, where you just need to supply any dimensionality reduction. It also seems to me like the feature selection step would play a huge role in the downstream results, since some genes that are discriminatory between two groups of cells may not be present in the variable features.

catpetersen commented 8 months ago

Hi Sam, we're still looking into this—balancing the performance of CHOIR with and without batch correction has turned out to be a bit of a complicated issue. The fundamental issue seems to be related to the number of cells used for each iteration of the permutation test (so each train/test random forest classifier application), and we're hoping to have an update on this soon.

It also seems to me like the feature selection step would play a huge role in the downstream results, since some genes that are discriminatory between two groups of cells may not be present in the variable features.

I do think the feature selection step matters, and we're hoping to explore some of the newer feature selection methods in the future! CHOIR does incorporate a couple of ways to modify the features used—there's an option to use all features as input to the random forest classifiers, rather than just variable features (although this is quite a bit slower), and a parameter exclude_features that can be used to, for instance, prevent mitochondrial genes or other quality-related features from being used as input to the random forest classifiers.
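For example, excluding mitochondrial genes from the classifier input might look something like this (the "^MT-" pattern assumes human gene symbols):

# Collect mitochondrial genes to exclude from the random forest input
mito_genes <- grep("^MT-", rownames(seurat_obj), value = TRUE)

seurat_obj <- CHOIR(
    seurat_obj,
    exclude_features = mito_genes)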

We will update here when we have something tangible to report on the batch correction side of things!