Providing pre-generated clusters

corceslab / CHOIR

CHOIR : Clustering Hierarchy Optimization by Iterative Random forests (www.CHOIRclustering.com)

MIT License

18 stars 5 forks source link

Providing pre-generated clusters #6

Closed alifarhat40 closed 6 months ago

alifarhat40 commented 6 months ago

Hello,

I am trying to apply CHOIR on my precomputed clusters from a method I developed. I am working with the PBMC3k dataset in fact.

Preprocessing is exactly followed step by step from here: https://satijalab.org/seurat/articles/pbmc3k_tutorial.html

After I perform my clustering, I obtain a list of cluster labels for each of the 2638 cells in the order they appear in the seurat object. I do not have any information about their hierarchical trees, because my method does not use trees. My question is: how do I provide this cluster_tree dataframe of my precomputed cluster labels? All I have is my cluster labels and I want to use CHOIR to correctly merge them. In scSHC and Cytocipher I can just simply pass my list of cluster labels to their functions and pop out a final answer. But I am not sure how to do the same with CHOIR. I could not find a vignette or example. https://www.choirclustering.com/articles/CHOIR.html#providing-pre-generated-clusters

catpetersen commented 6 months ago

Hi! The cluster_tree can be generated in any way you'd like, but I've added a function to the dev branch to generate a simple hierarchical tree from provided input cluster labels based on cluster distances in a dimensionality reduction space. The output is a dataframe that can be used as input for parameter cluster_tree.

So install CHOIR from the dev branch, then use function inferTree(). Run it like this:

# Object cluster_IDs should be a vector of your cluster labels named with the corresponding cell ID
# Object dim_reduction should be cell embeddings from a dimensionality reduction you've previously generated
cluster_tree <- inferTree(cluster_labels = cluster_IDs, reduction = dim_reduction)

Alternately, without needing to install from the dev branch, you can apply two of CHOIR's hidden functions sequentially like this:

initial_tree <- data.frame(L1 = 1, L2 = cluster_IDs)
cluster_tree <- CHOIR:::.optimizeTree(cluster_tree = initial_tree, reduction = dim_reduction)
cluster_tree <- CHOIR:::.checkClusterLabels(cluster_tree)

Let me know if you run into issues with this or the subsequent steps, since this is not the primary use-case for CHOIR.

alifarhat40 commented 6 months ago

Thanks! I will try this and get back to you.

By any chance, do you think this clustering method would work for general data as well? Or is it tailored specifically for scRNAseq data.

catpetersen commented 6 months ago

Great!

And yes, CHOIR is compatible with any data that is in the shape of one or more cell x feature matrices. So any single-cell data (both single- and multi-omic), including single-cell RNA-seq, ATAC-seq, proteomics data, etc.

Outside of typical single-cell data, other data types that have this similar shape can also be used, though CHOIR has not been as thoroughly tested with these. For example, I've applied CHOIR to the MNIST dataset in the past (where each "cell" would be a digit sample, and the features would be the pixel values), with interesting results.

alifarhat40 commented 6 months ago

I installed the dev branch, but when running

dim_reduction = Embeddings(object = seurat_object[["pca"]]) cluster_tree <- inferTree(cluster_labels = cluster_IDs, reduction = dim_reduction)

I get the following error:

This is a snippet of my precoputed cluster IDs. It is a vector of integer labels for each of the 2638 pbmc3k cells.

Next I try your other way:

initial_tree <- data.frame(L1 = 1, L2 = cluster_IDs) cluster_tree <- CHOIR:::.optimizeTree(cluster_tree = initial_tree, reduction = dim_reduction) cluster_tree <- CHOIR:::.checkClusterLabels(cluster_tree) expression_data <- seurat_object[["RNA"]]$scale.data seurat_object <- pruneTree(seurat_object, cluster_tree = cluster_tree, n_cores = 1)

but I also get an error after running pruneTree:

Here is what cluster_tree looks like:

Am I messing up a step here? Thanks in advance.

catpetersen commented 6 months ago

Thanks for working through this! I've made some additional changes to the dev branch (so please re-install) to hopefully make starting with pre-generated clusters smoother.

First, for running inferTree, cluster_labels must be a named vector, where the names correspond to the cell IDs. This is so that the function can check that the user input cell IDs match the rownames of the cell embeddings provided for reduction. The inferTree function now adds these same cell IDs as rownames of the output cluster_tree, which should stop that error you're getting when running pruneTree.

But since other inputs and parameters will not have been calculated or provided by the buildTree step of CHOIR, you'll need to provide additional inputs to the pruneTree function. Here's some code that extracts all of those necessary inputs for a Seurat object that has been through a standard pipeline (also requested in issue #8):

# Example code for a Seurat object that has been through the following steps (modify as needed)
seurat_object <- seurat_object %>%
  NormalizeData() %>%
  FindVariableFeatures() %>%
  ScaleData() %>%
  RunPCA() %>%
  FindNeighbors() %>%
  FindClusters()

#  Object cluster_labels should be a vector of your cluster labels
cluster_labels <- seurat_object@meta.data$seurat_clusters
# Make this a named vector, where the names correspond to the cell IDs
names(cluster_labels) <- colnames(seurat_object) # Assumes that cluster IDs are in the same order as cells in the object

# Extract the cell embeddings of your dimensionality reduction
dim_reduction <- seurat_object@reductions$pca@cell.embeddings

# Create cluster_tree
cluster_tree <- inferTree(cluster_labels = cluster_labels,  reduction = reduction)

# Extract variable features and input matrix for random forest classifiers
var_features <- seurat_object@assays$RNA@var.features
input_matrix <- CHOIR:::.getMatrix(seurat_object, 
                                   use_assay = "RNA", 
                                   use_slot = "data", 
                                   use_features = var_features, 
                                   verbose = TRUE)

# Extract adjacency matrices
nn_matrix <- seurat_object@graphs$RNA_nn
snn_matrix <- seurat_object@graphs$RNA_snn

# Run pruneTree
seurat_object <- pruneTree(seurat_object, 
                           cluster_tree = cluster_tree, 
                           input_matrix = input_matrix,
                           nn_matrix = nn_matrix,
                           snn_matrix = snn_matrix, # New requirement not yet reflected in online documentation
                           reduction = reduction)

I've tested this with two datasets myself, but please let me know if any other errors pop up. I'd also be curious how your final results compare to just running the entire default CHOIR algorithm with function CHOIR().

alifarhat40 commented 6 months ago

Okay thanks! I reinstalled the new dev branch.

1) I do not run FindNeighbors() in my pipeline. Is that necessary?

2) should cluster_labelsbe a vector of characters or vector of integers?

3) I think var_features <- seurat_object@assays$RNA@var.features should instead be var_features <- VariableFeatures(seurat_object, assay = "RNA")

4) The following code gives me null for both nn_matrixand snn_matrix, because I have not run buildtree. Again, all I have is the cluster_labels from my own clsutering algorithm.

#Extract adjacency matrices nn_matrix <- seurat_object@graphs$RNA_nn snn_matrix <- seurat_object@graphs$RNA_snn So when I run the pruneTree function I get an error:

Error in pruneTree(seurat_object, cluster_tree = cluster_tree, input_matrix = input_matrix,  : 
  unused argument (snn_matrix = snn_matrix)

When I remove the variable snn_matrixfrom pruneTree: snn_matrix = snn_matrix, # New requirement not yet reflected in online documentation

I get another error same as before:

Error in .validInput(cluster_tree, "cluster_tree", object) : 
Input value for 'cluster_tree' must be a dataframe or matrix in which the row names and order are identical to the cell IDs in the provided object. Please supply valid input or set to NULL.

I am using the standard PBMC3k dataset from the seurat tutorial I linked above. I ran CHOIR alone on it and it gave me 15 clusters with ARI = 0.5529. Running scSHC directly on it gives 7 clusters with ARI = 0.5436. But when I run scSHC on my precomputed clusters the ARI goes up a little to 0.61749. I am hoping to see something better with CHOIR. My clustering algorithm does not have hyperparameters to tune like Leiden. It uncovers the natural clusters. However, sparsity is an issue as it gives me over 500 data points in their own cluster (i.e. 500 clusters with just one data point). So I am looking for ways to merge. I have a feeling CHOIR might be good.

alifarhat40 commented 6 months ago

After running: seurat_object <- FindNeighbors(seurat_object, dims = 1:10)

I can finally extract nn_matrixand snn_matrix, but then get another error:

And if I remove the snn_matrix variable then I get this error again:

Error in .validInput(cluster_tree, "cluster_tree", object) : 
Input value for 'cluster_tree' must be a dataframe or matrix in which the row names and order are identical to the cell IDs in the provided object. Please supply valid input or set to NULL.

catpetersen commented 6 months ago

I suspect the reinstallation might not have completed properly. Those errors should be resolved in the newest version of the dev branch. You do need the snn_matrix input. You can check whether you have the correct package version loaded by running head(pruneTree, 10), which should give the following output (where you can see the snn_matrix parameter):

To answer some of your questions:

I do not run FindNeighbors() in my pipeline. Is that necessary?

If you want to run CHOIR in this non-standard way, then you'll have to produce all of the necessary input matrices, including the nn and snn adjacency matrices. These are all things that CHOIR would usually generate for you in the course of running the buildTree function.

should cluster_labels be a vector of characters or vector of integers?

Either is fine.

Below is a reproducible example that works for me using a small dataset, see if this also errors out for you:

if (!requireNamespace("scRNAseq", quietly = TRUE)) BiocManager::install("scRNAseq")
library(scRNAseq)
library(Seurat)
library(CHOIR)

# Import dataset & pre-process
data_matrix <- LaMannoBrainData('mouse-adult')@assays@data$counts
colnames(data_matrix) <- seq(1:ncol(data_matrix))

seurat_object <- CreateSeuratObject(data_matrix, 
                                    min.features = 100,
                                    min.cells = 5)

seurat_object <- seurat_object %>%
  NormalizeData() %>%
  FindVariableFeatures() %>%
  ScaleData() %>%
  RunPCA() %>%
  FindNeighbors() %>%
  FindClusters()

#  Object cluster_labels should be a vector of your cluster labels
cluster_labels <- seurat_object@meta.data$seurat_clusters
# Make this a named vector, where the names correspond to the cell IDs
names(cluster_labels) <- colnames(seurat_object) # Assumes that cluster IDs are in the same order as cells in the object

# Extract the cell embeddings of your dimensionality reduction
dim_reduction <- seurat_object@reductions$pca@cell.embeddings

# Create cluster_tree
cluster_tree <- inferTree(cluster_labels = cluster_labels,  reduction = dim_reduction)

# Extract variable features and input matrix for random forest classifiers
var_features <- VariableFeatures(seurat_object)
input_matrix <- CHOIR:::.getMatrix(seurat_object, 
                                   use_assay = "RNA", 
                                   use_slot = "data", 
                                   use_features = var_features, 
                                   verbose = TRUE)

# Extract adjacency matrices
nn_matrix <- seurat_object@graphs$RNA_nn
snn_matrix <- seurat_object@graphs$RNA_snn

# Run pruneTree
seurat_object <- pruneTree(seurat_object, 
                           cluster_tree = cluster_tree, 
                           input_matrix = input_matrix,
                           nn_matrix = nn_matrix,
                           snn_matrix = snn_matrix,
                           reduction = dim_reduction)

As to benchmarking with ARI for real datasets, just as my personal opinion -- I don't always think that's the best approach, as it can be biased towards methods that are more similar to those used to generate the reference labels you are comparing to. But it can definitely be a starting point.

alifarhat40 commented 6 months ago

Yup. That worked. Thanks!

Running CHOIR on my precomputed clusters improved the ARI a tiny bit from 0.55 to 0.58. However, it did reduce the 15 clusters from CHOIR to 8. Here is an image of the "ground truth" from annotated pbmc3k.

"ground truth"

Here is the result after running CHOIR on my precomputed clusters from my newly developed clustering algorithm:

And here is scSHC on my precomputed clusters:

sc-SHC was actually able to separate the upper 2 clusters for me, because my algorithm does separate them as well. So it is strange to see why CHOIR would merge them.

1) Do you have any ideas of how I can change up some parameters in CHOIR to try and resolve this? I have a feeling CHOIR can do it, because it gave better separation from sc-SHC despite the lower ARI. CHOIR results look better.

2) You mentioned you do not prefer to benchmark on real datasets, do you have a recommendation for how to do that? In your paper you show simulated datasets. Should I use those as well?

3) I figured I'd let you know that the functions seurat_object <- runCHOIRumap(seurat_object, reduction = "P0_reduction") plotCHOIR(seurat_object)

do not work after running CHOIR on precomputed clusters. Instead I had to run UMAP using seurat and plot them using the CHOIR clusters. Just a heads up. This is the error:

Error in .validInput(reduction, "reduction", list("runCHOIRumap", object, : 
Input value 'P0_reduction' for 'reduction' was not found in the provided object, please supply valid input!
3. stop("Input value '", input, "' for '", name, "' was not found in the provided object, please supply valid input!")
2. .validInput(reduction, "reduction", list("runCHOIRumap", object, 
key))
1.runCHOIRumap(seurat_object, reduction = "P0_reduction")

It does not bother me. I just bring it up in case it concerns you.

catpetersen commented 6 months ago

Great! Glad that worked for you.

Do you have any ideas of how I can change up some parameters in CHOIR to try and resolve this?

The main parameter I'd try adjusting is the alpha value. Other parameters to look at are p_adjust and distance_awareness. If the dataset was larger, then the downsampling_rate could also have an effect, but this is only relevant above 5K cells.

When run in full, CHOIR generates multiple dimensionality reductions as it builds out the hierarchical clustering tree. A “root” dimensionality reduction encompassing all cells is followed by multiple “subtree” dimensionality reductions that each encompass a subset of the total cells. These subsetted reductions, and their accompanying sets of highly variable features allow CHOIR to pick up on more nuanced differences between cell subtypes/states. So the lack of these subsetted reductions in your case may cause CHOIR to be more conservative and merge more clusters.

You mentioned you do not prefer to benchmark on real datasets

No, benchmarking can absolutely be done with real datasets, I just wouldn't use ARI for real datasets. There are certainly other methods that can be used to benchmark with real datasets, but I feel that benchmarking metrics that require a "ground truth" set of cell type labels are not ideal for real world data.

You are welcome to try the simulated datasets from the paper, they're available here: https://files.corces.gladstone.org/Publications/2024_Petersen_CHOIR The Simulated_Datasets.tar.gz file contains the 100 simulated datasets and the SourceData1_SimulatedData.xlsx file contains detailed benchmarking results for CHOIR, sc-SHC, etc. including cluster labels for each cell across a variety of tested parameter settings.

And yea, for those UMAP and plotting functions, you'd have to specify a dimensionality reduction that exists in your Seurat object, instead of "P0_reduction" (which is the default name given to the dimensionality reduction calculated during the buildTree part of CHOIR).

I'm going to close this issue, but feel free to continue replying here if anything else comes up!

alifarhat40 commented 6 months ago

Great. Thanks for all your help @catpetersen ! Even those 2 new clusters in the left, I am sure they could be biologically meaningful since Cytocipher discovered a similar thing in their paper. I will have to double check.

1) could you explain a bit what distance_awareness is? I read the description in the preprint and function, but I am not sure how to go about selecting a value.

2) runCHOIRumap() does not work even with setting reduction = "pca". I might be interested in the accuracy plots. The only dimensionality reduction I perform is: seurat_object <- RunPCA(seurat_object, features = VariableFeatures(object = seurat_object)); in that case what should I set variable reduction equal to?

3) Does CHOIR keep track of which precomputed clusters merged into which final clusters? Or would I have to write manual code for that? It is not big a deal if not. I can easliy match cell names with their pre and final destinations. Cytocipher provides a slankey.

4) How do I apply CHOIR on general non scRNAseq data.

catpetersen commented 5 months ago

To keep these GitHub issue threads on topic, I'll confine my answers below to questions directly related to CHOIR. For your question related to benchmarking, I'll just point back to the pre-print, which contains a lot of my thought process around how to effectively benchmark clustering methods.

could you explain a bit what distance_awareness is? I read the description in the preprint and function, but I am not sure how to go about selecting a value.

distance_awareness is essentially a parameter used to make CHOIR more efficient by using a distance threshold to omit certain permutation test comparisons that we assume will just tell us that the two clusters in question should remain separate.

Here's an example with the default value distance_awareness = 2. First, if cluster A has is compared to cluster B, and we've determined that cluster A and B should not be merged, we record the distance between cluster A and B in our dimensionality reduction space, let's call this distance d. Then, if cluster A is compared to any cluster C where the distance between cluster A and C is greater than 2d, we just assume that these two clusters should also not be merged, and we keep them separate without running the permutation test. So, if you're fine with a somewhat longer runtime, you could increase this value, to make fewer of these assumptions and instead run more permutation tests.

runCHOIRumap() does not work even with setting reduction = "pca"

I'll take a look at this, but this function is probably best reserved for those who generate their dimensionality reductions using CHOIR, rather than providing pre-generated reductions.

Does CHOIR keep track of which precomputed clusters merged into which final clusters?

Yes! Check out the records stored under seurat_obj@misc$CHOIR. You can also see how the tree is pruned at each level using the stepwise_clusters_ dataframe under seurat_obj@misc$CHOIR$clusters.

How would I go about trying it on data such as the ones from scikit-learn? I do not have cell names or RNA assay in seurat object.

I'm not the best person to answer how to wrangle different types of data into a Seurat object. But you should really just need a cell x feature matrix with some sort of row and column names (e.g., 1, 2, 3..) to make this work. CHOIR is also compatible with SingleCellExperiment objects.

alifarhat40 commented 5 months ago

Much appreciated!

alifarhat40 commented 5 months ago

@catpetersen It appears CHOIR performs amazingly well on my precomputed clusters of PBMC3k dataset yet fails to keep the upper Memory vs Naive CD4 T cells separate despite the clear separation from my algorithm. My question now is: can I hard select specific clusters to NOT merge? I can fork the github repo and try implementing it myself, but would appreciate guidance on whether you think this is achievable. For example, if there are 3 clusters that I do not want them to mege with each other then I want to be able to give pruneTree() an additional parameter designating which precomputed clusters to specifically not merge. Other clusters can merge into them, but these 3 clusters specifically cannot merge with each other. Thanks for your time.

catpetersen commented 5 months ago

Glad to hear CHOIR is working well for you! No, I don't have any plans to add that feature.

I think the easiest solution would be to manually overwrite the cluster labels with those original 3 clusters after the fact. You can then use the compareClusters() function to run a permutation test comparison of any two specific clusters to see if any of the other clusters should be merged into your 3 set clusters. Grab the final adjusted alpha threshold from seurat_object@misc$CHOIR$parameters$pruneTree_parameters$adjusted_alpha if you want to keep the same approx. significance level used in the pruneTree run after correction for multiple comparisons.

alifarhat40 commented 4 months ago

Hi @catpetersen ,

I finally managed to get CHOIR working on all scRNAseq clustering on my precomputed clusters. Here is what finally made them work:

set n_iterations to 200 (default is 100)
set n_trees to 100 (default is 50)
use_variance = FALSE (default is TRUE) [sometimes needed not all the time, not sure why or what this parameter does]

Yes it takes much longer, but it correctly splits up the clusters as expected. Another observation I have is that when two clusters have equal number of data points, then it can split them using the default settings. However, when there is a class imbalance (e.g. cluster A has 600 points and another cluster B has 300 points that are neighbors and must be separate) then CHOIR fails to keep them separate under the default settings. Thats when the new settings work. I tried setting downsampling_rate to zero but that gave an error. Apparently the number needs to be between 0 and 1.

After benchmarking on a lot of precomputed clusters on scRNAseq and getting good results, I am moving forward with using CHOIR on pregenerated clusters of general data (not scRNAseq related). However, since CHOIR is made to work with Seurat scRNAseq, I am having trouble making it work like you suggested above.

I am starting off with a very simple datset, the blobs dataset. My new clustering algorithm finds 8 clusters, ad-hoc merging makes them 3. But I want to use CHOIR to help merge instead.

Everything is the same as scRNAseq, I have precomputed labels for each of the 1000 datapoints. blobs_data.txt ph_cluster_ids_blobs.txt

# Load package
suppressPackageStartupMessages({
  library(Seurat)
  library(CHOIR)
  library(data.table) # or library(readr)
})

data <- fread("blobs_data.txt")
data_matrix <- as.matrix(data)
# Name the columns based on the number of features (columns)
num_genes <- ncol(data)
colnames(data) <- paste0("feat", 1:num_genes)
# Transpose the data to meet Seurat's expectations (genes as rows, samples as columns)
data_t <- t(data)
# Create a Seurat object
seurat_object <- CreateSeuratObject(counts = data_t, project = "clustering")
# Print the Seurat object to verify
print(seurat_object)

Then I do dimensionality reduction using a different PCA package since the seurat one is tailored for scRNAseq.

# Perform PCA
pca_result <- prcomp(data_matrix, center = TRUE, scale. = TRUE)
# Extract PCA scores (cell embeddings)
pca_scores <- pca_result$x
# Make sure the row names are set correctly to the cell identifiers
rownames(pca_scores) <- colnames(seurat_object)
# Create a new DimReduc object
new_pca <- CreateDimReducObject(
  embeddings = pca_scores,
  key = "PC_",
  assay = DefaultAssay(seurat_object),
)
# Add the DimReduc object to the Seurat object
seurat_object[["pca"]] <- new_pca

Find neighbors for the nearest neighbor and shared nearest neighbors graphs requied for pruneTree()

seurat_object <- FindNeighbors(seurat_object, dims = 1:2, k.param = 5)
# Extract the cell embeddings of your dimensionality reduction
dim_reduction <- seurat_object@reductions$pca@cell.embeddings

Simple importing clusters as discussed a few weeks ago.

file_path <- "ph_cluster_ids_blobs.txt"
# Read the file
cluster_labels <- scan(file_path, what = integer(), quiet = TRUE)
names(cluster_labels) <- colnames(seurat_object)

Since there is no highly variable genes, I just resort to using the counts data. I only have 2 features, so I just use them instead of the genes.

# Create cluster_tree
cluster_tree <- inferTree(cluster_labels = cluster_labels,  reduction = dim_reduction)
# Extract variable features and input matrix for random forest classifiers
var_features <- rownames(seurat_object)
input_matrix <- CHOIR:::.getMatrix(seurat_object, 
                                   use_assay = "RNA", 
                                   use_slot = "counts", 
                                   use_features = var_features,
                                   verbose = TRUE)

# Extract adjacency matrices
nn_matrix <- seurat_object@graphs$RNA_nn
snn_matrix <- seurat_object@graphs$RNA_snn

Everything looks the same as that from scRNAseq, but pruneTree() sees bothered.

# Run pruneTree
seurat_object <- pruneTree(seurat_object, 
                           cluster_tree = cluster_tree,
                           input_matrix = input_matrix,
                           nn_matrix = nn_matrix,
                           snn_matrix = snn_matrix, # New requirement not yet reflected in online documentation
                           reduction = dim_reduction,
                           n_cores = 1, # be sure to change this
                           distance_awareness = 2,
                           alpha = 0.05,
                           p_adjust = 'fdr')

The error appears to be from the ranger function for the decision trees.

Input data:
 - Object type: Seurat
 - # of cells: 1000
 - # of modalities: 1
 - # of subtrees: 1
 - # of levels: 8
 - # of starting clusters: 8

....

2024-03-08 12:29:34 PM : (Step 2/2) Iterating through clustering tree..
2024-03-08 12:29:34 PM : 11% (1/8 levels) in 0 min. 7 clusters remaining.                                           
Error in ranger::ranger(x = input_matrix[c(cluster1_cells, cluster2_cells),  : 
  Error: No covariates found.

catpetersen commented 4 months ago

After benchmarking on a lot of precomputed clusters on scRNAseq and getting good results

Glad you're having good results using CHOIR!

I tried setting downsampling_rate to zero but that gave an error. Apparently the number needs to be between 0 and 1.

Set downsampling_rate to 1 to disable downsampling. See details in this section of the tutorial: https://www.choirclustering.com/articles/CHOIR.html#downsampling

I only have 2 features, so I just use them instead of the genes.

If you only have 2 features, I'm really not sure CHOIR is the best tool for this dataset. Random forest classifiers are typically used with a larger feature space.

The error appears to be from the ranger function for the decision trees.

Does your input matrix have row & column names for the features & cells? If not, that might be causing the ranger error (see https://github.com/imbs-hl/ranger/issues/597)

alifarhat40 commented 4 months ago

I think the problem is that this method needs many features (columns) in general data that is not RNAseq. I tried a dataset with 18 features, it runs to 33% and it gives an error. So I guess the problem is that CHOIR needs really large datasets making it ideal for scRNA seq.