countsplit with Seurat to find optimal number of clusters

Hi @anna-neufeld
thanks for this very interesting and very much needed package. I think the approach you take is brilliant and your tutorials are extensive. However, I really miss a tutorial that shows how to use countsplit with real single cell data to find the optimal number of clusters. However, for the end users this is much more important than simulated data.

So I tried to apply your tutorial https://anna-neufeld.github.io/countsplit.tutorials/articles/MSE_tutorial.html to Seurat Data.

First of all, I treid to optimize your cluster.sse function because this has to be run multiple times on pretty large SC datasets.

Cluster SSE function

``` # original function cluster.sse <- function(trainDat, testDat, clusters.est, eps.train) { totSS <- 0 for (lab in unique(clusters.est)) { if (sum(clusters.est==lab) > 1) { clustdat.test <- testDat[clusters.est==lab,] clustdat.train <- trainDat[clusters.est==lab,] colmeansTrain <- apply(clustdat.train, 2, mean) pred.means <- 1/eps.train*colmeansTrain ss <- apply(1/(1-eps.train)*clustdat.test, 1, function(u) sum((log(u+1)-log(pred.means+1))^2)) totSS <- totSS+sum(ss) } } return(totSS) } #refactored version: vectorize cluster_sse1 <- function(trainDat, testDat, clusters.est, eps.train) { stopifnot(nrow(trainDat) == length(clusters.est)) totSS <- 0 unique_clusters <- unique(clusters.est) inverse_eps_train <- 1 / (1 - eps.train) for (lab in unique_clusters) { cluster_indices <- which(clusters.est == lab) num_points <- length(cluster_indices) if (num_points > 1) { clustdat.test <- testDat[cluster_indices, ] clustdat.train <- trainDat[cluster_indices, ] colmeansTrain <- colMeans(clustdat.train) pred.means <- colmeansTrain * inverse_eps_train clustdat.test.1 <- 1 / (1 - eps.train) * clustdat.test ss <- rowSums((log1p(t(clustdat.test.1)) - log1p(pred.means))^2) totSS <- totSS + sum(ss) } } return(totSS) } #best version #use the transpose of the trainDat and testDat #sum is the same as sum(rowSums(matrix)) cluster_sse2 <- function(trainDat, testDat, clusters.est, eps.train) { stopifnot(ncol(trainDat) == length(clusters.est)) totSS <- 0 unique_clusters <- unique(clusters.est) inverse_eps_train <- 1 / (1 - eps.train) for (lab in unique_clusters) { cluster_indices <- which(clusters.est == lab) num_points <- length(cluster_indices) if (num_points > 1) { clustdat.test <- testDat[cluster_indices] clustdat.train <- trainDat[,cluster_indices] rowMeansTrain <- rowMeans(clustdat.train) pred.means <- rowMeansTrain * inverse_eps_train clustdat.test.1 <- 1 / (1 - eps.train) * clustdat.test ss <- sum((log1p(clustdat.test.1) - log1p(pred.means))^2) totSS <- totSS + ss } } return(totSS) } ```

This led to a nearly 5 times faster code in the pbmc.counts dataset and the same results.

  expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc
  <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>
1 orig          2.41s    2.41s     0.414    2.79GB     1.66     1     4
2 v1          675.3ms  675.3ms     1.48     1.13GB     2.96     1     2
3 v2         549.61ms 549.61ms     1.82     1.06GB     3.64     1     2

Of note, in the v2 I changed rows and columns because it further improved the performance and is also better compatible with the pbmc.counts matrix.

Here's the code that I used for the poisson version of countsplit with the pbmc.counts dataset using Seurat.

Code Poisson

``` #load libraries library(Seurat) library(countsplit) library(Matrix) library(tidyverse) library(patchwork) data(pbmc.counts, package="countsplit.tutorials") #replace underscores with dashes in rownames rownames(pbmc.counts) <- gsub(x = rownames(pbmc.counts), pattern = "_", replacement = "-") #split data into training and test sets with Poisson set.seed(1) split <- countsplit(pbmc.counts) Xtrain <- split[[1]] Xtest <- split[[2]] ## Set up the training set object and subset pbmcTrain <- CreateSeuratObject(counts = Xtrain, min.cells = 3, min.features = 200) pbmcTrain[["percent.mt"]] <- PercentageFeatureSet(pbmcTrain, pattern = "^MT-") pbmcTrain <- subset(pbmcTrain, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5) # same genes and cells in the train set as in the test set rows <- rownames(pbmcTrain) cols <- colnames(pbmcTrain) XtestSubset <- Xtest[rows,cols] XtrainSubset <- Xtrain[rows,cols] #run Seurat pipeline pbmcTrain <- pbmcTrain |> NormalizeData() |> FindVariableFeatures() |> ScaleData() |> RunPCA() |> FindNeighbors(dims = 1:10) |> RunUMAP(dims = 1:10) # define a range of resolutions resRange <- seq(0.2, 3, by = 0.2) resNames <- paste0("RNA_snn_res.", resRange) for (res in resRange) { pbmcTrain <- FindClusters(pbmcTrain, resolution = res) } # calculcate within-cluster sse for each resolution results <- list() for (k in resNames) { results[k] <- cluster_sse2( trainDat = XtrainSubset, testDat = XtestSubset, clusters.est = pbmcTrain@meta.data[[k]], eps.train = 0.5) } #prepare results for plotting results_df <- data.frame(cluster = names(results), sse = as.numeric(unlist(results))) |> dplyr::mutate(sse_norm = (sse - min(sse)) / (max(sse) - min(sse))) #plot results results_df |> ggplot(aes(x = cluster, y = sse_norm, group = 1)) + geom_point() + geom_line() + theme_bw() + ylab("Within-cluster SSE") + xlab("Number of cluster") + theme(axis.text.x = element_text(angle = 90, hjust = 1)) ```

Maybe you can check if the code is correct. I had some troubles with determining what is column and what is row. In your simulated dataset the clusters.est have the same length as nrow(trainDat) but in the Seurat Dataset rows are genes and columns are cells, so the cluster annotation have the same length as ncol(trainDat) and not nrow(trainDat). So to use your original function I had to use t(Xtrain) and t(Xtest) as input or in my refactored version2 Xtrain, Xtest. Correct?

The within-cluster SSE suggest as Leiden resolution of 1.2 or 1.4 i would say based on the results.

cluster_sse

The Seurat team choose a resolution of 0.5. If you look at the UMAP plots (I used the test dataset without further filtering to produce that plot, is that statistically correct?), I would also say that 1.2 and 1.4 is probably overclustered. But that fit's nicely to your simulated results.

umap

I then wanted to use negative binominal and used sctransform as you suggested. I stole some code from here: https://github.com/anna-neufeld/nbcs_paper_simulations/blob/main/Fig8/reproduce_full_analysis/code/countsplit.R

Code Negative binominal

## countsplit with negative binomial #first get overdispersion estimates sctransform_fit <- sctransform::vst(pbmc.counts, residual_type = "none", min_cells = 10) overdisp_estimates <- sctransform_fit$model_pars_fit[,"theta"] #subset to genes with overdispersion estimates pbmcCountsSubset <- pbmc.counts[names(overdisp_estimates),] split2 <- countsplit(t(pbmcCountsSubset), overdisp = overdisp_estimates) XtrainNB <- t(split2[[1]]) XtestNB <- t(split2[[2]]) ## Set up the training set object and subset pbmcTrainNB <- CreateSeuratObject(counts = XtrainNB, min.cells = 3, min.features = 200) pbmcTrainNB[["percent.mt"]] <- PercentageFeatureSet(pbmcTrainNB, pattern = "^MT-") pbmcTrainNB <- subset(pbmcTrainNB, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5) # same genes and cells in the train set as in the test set rowsNB <- rownames(pbmcTrainNB) colsNB <- colnames(pbmcTrainNB) XtestSubsetNB <- XtestNB[rowsNB,colsNB] XtrainSubsetNB <- XtrainNB[rowsNB,colsNB] #run Seurat pipeline pbmcTrainNB <- pbmcTrainNB |> NormalizeData() |> FindVariableFeatures() |> ScaleData() |> RunPCA() |> FindNeighbors(dims = 1:10) |> RunUMAP(dims = 1:10) # define a range of resolutions resRange <- seq(0.2, 3, by = 0.2) resNames <- paste0("RNA_snn_res.", resRange) for (res in resRange) { pbmcTrainNB <- FindClusters(pbmcTrainNB, resolution = res) } # calculcate within-cluster sse for each resolution resultsNB <- list() for (k in resNames) { resultsNB[k] <- cluster_sse2( trainDat = XtrainSubsetNB, testDat = XtestSubsetNB, clusters.est = pbmcTrainNB@meta.data[[k]], eps.train = 0.5) } #prepare results for plotting results_df_nb <- data.frame(cluster = names(resultsNB), sse = as.numeric(unlist(resultsNB))) |> dplyr::mutate(sse_norm = (sse - min(sse)) / (max(sse) - min(sse))) #plot results results_df_nb |> ggplot(aes(x = cluster, y = sse_norm, group = 1)) + geom_point() + geom_line() + theme_bw() + ylab("Within-cluster SSE") + xlab("Number of cluster") + theme(axis.text.x = element_text(angle = 90, hjust = 1)) ggsave(file.path("countsplit", "cluster_sse_nb.png"), width = 10, height = 5)

Maybe you can check if the code is correct, especially concerning the sctransform part. Of note, I struggled again with columns and rows. As you can see in the code i had to 1) subset the matrix based on genes with known overdispersion and 2) tranpose the counts Matrix so that countsplit accepts it.

The resulting plot was similar to your simulation data in the way that it more clearly identified an optimal resolution. And the resolution was also lower (1.0) which is more plausible I think.

cluster_sse_nb

These are the results based on the XestSubsetNB data. umap_nb

-> I actually think that both are biologically plausible so it's exciting that the algorithm even suggests more clusters than the Seurat team chose.

Finally, I wanted to use the multifold negative binominal approach. Here's the code

Code Multifold Negative binominal

``` #negative binominal with multifold sctransform_fit <- sctransform::vst(pbmc.counts, residual_type = "none", min_cells = 10) overdisp_estimates <- sctransform_fit$model_pars_fit[,"theta"] #subset to genes with overdispersion estimates pbmcCountsSubset <- pbmc.counts[names(overdisp_estimates),] split3 <- countsplit(t(pbmcCountsSubset), folds = 10, overdisp = overdisp_estimates) split3_t <- lapply(split3, t) calcSS <- function(test_data, matrix, resRange, folds) { train_data <- matrix - test_data #create seurat object seu_obj_train <- CreateSeuratObject(counts = train_data, min.cells = 3, min.features = 200) seu_obj_train[["percent.mt"]] <- PercentageFeatureSet(seu_obj_train, pattern = "^MT-") seu_obj_train <- subset(seu_obj_train, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5) # same genes and cells in the train set as in the test set rows <- rownames(seu_obj_train) cols <- colnames(seu_obj_train) train_data <- train_data[rows,cols] test_data <- test_data[rows,cols] # run seurat pipeline seu_obj_train <- seu_obj_train |> NormalizeData() |> FindVariableFeatures() |> ScaleData() |> RunPCA() |> FindNeighbors(dims = 1:10) |> RunUMAP(dims = 1:10) # assign names for each resolution resNames <- paste0("RNA_snn_res.", resRange) for (res in resRange) { seu_obj_train <- FindClusters(seu_obj_train, resolution = res) } # calculcate within-cluster sse for each resolution results <- list() for (k in resNames) { results[k] <- cluster_sse2( trainDat = train_data, testDat = test_data, clusters.est = seu_obj_train@meta.data[[k]], eps.train = 1 - 1/folds) } return(data.frame(results)) } results_multifold <- lapply(1:10, function(x) calcSS(test_data = split3_t[[x]], matrix = pbmcCountsSubset, resRange = seq(0.2, 3, by = 0.2), folds = 10)) results_df <- results_multifold |> bind_rows() |> mutate(fold = 1:10) |> mutate(fold = sprintf("fold_%02d", fold)) |> tidyr::pivot_longer(cols = -fold, names_to = "cluster", values_to = "sse") |> group_by(fold) |> mutate(sse_norm = (sse - min(sse)) / (max(sse) - min(sse))) |> ungroup() |> group_by(cluster) |> mutate(mean_sse = mean(sse_norm)) |> ungroup() results_df |> ggplot(aes(x = cluster, y = sse_norm, group = fold, color = fold)) + geom_point() + geom_line() + geom_line(aes(y = mean_sse), color = "black", size = 1) + theme_bw() + ylab("Within-cluster SSE") + xlab("Number of cluster") + theme(axis.text.x = element_text(angle = 90, hjust = 1)) ggsave(file.path("countsplit", "cluster_sse_multifold.png"), width = 10, height = 5) ```

The fat black line is the mean SSE of all 10 folds. This also suggested a resolution of 1.0 (probably 0.8 and 1.2 would also be fine) cluster_sse_multifold

@anna-neufeld I am eager to hear you feedback. Especially, if all implementations are correct, like the sctransform, the refactoring of the ss function and the transposition of the matrices.

Thanks!

Hi @mihem ,

Thank you so much for this extensive work! It is so exciting to see NBCS in action !

Here are a few comments / answers to your questions.

(1) Thank you so much for writing the faster version of the SSE function! The one included in the very basic tutorial was only intended to be used on the small toy dataset, so it's great to have this faster version on hand. I had been thinking about using an alternate loss function for the real data tutorial, but it looks like SSE worked pretty well for you !

(2) My tutorials have all used cell-by-gene matrices. Having the observations in the rows and the features in the columns is nice because it is compatible with existing R functions such as k-means (used in my tutorials). However, you are 100% correct that this is the opposite of the way that Seurat does it (they do gene-by-cell matrices). So, the fact that you took transposes was correct.

(3) I agree that the resolution parameter suggested by Poisson count splitting is too big. This is probably due to the fact that the data are not Poisson, they are negative binomial. Thus, in the Poisson version, there is dependence between the training and test sets, leading to overfitting.

(4) For using sctransform-- I am so glad that the paper GitHub repository was useful in finding this code, and so sorry that I haven't put this code in a tutorial yet! -- The only step I am not sure about is subsetting the data to only genes with overdispersion estimates. I ran into the issue that sctransform::vst() does not always return an overdispersion estimate for every single gene (e.g. it does not return an estimate for genes that were expressed in <5 cells, by default). I personally decided to retain these genes in the analysis and just set their overdisperson parameter = infinity (i.e. if the estimation fails, assume the genes are Poisson). -- It might not matter here since you are subsetting the genes/cells later anyway!

(5) I think it's really cool that NBCS ends up showing a sharper minimum at a certain (reasonable!) resolution parameter. This is probably because it did a better job making independent train/test sets than Poisson count splitting. The plots are beautiful, and it looks to me like you did the multifold version correctly! It is really nice to see that the minimum around resolution=1 was pretty consistent across folds for multifold splitting.

(5) I have been working on a Seurat negative binomial tutorial for the website using this same dataset and similar ideas, but had not finished it yet. I really like your analysis, as I think that this shows a clear comparison between NBCS and Poisson count splitting on real data, while also showing how to use multifold splitting, etc. If I use this analysis as the basis for a new tutorial, can I credit you on the website and in the package? If so, shoot me an email at aneufeld@fredhutch.org so that you can let me know how best to credit you!

Hi @anna-neufeld You are very welcome.

Thank for your swift response and clear explainations.

1) You are welcome. I actually spent some hours trying to conver this to a Rcpp function to improve speed, but in the end the function was slower than the vectorized R version (but that could be also due to my limited C++ skills, mostly done ChatGPT). But at least for this size of dataset it's sufficient. However I realized that I made a mistake when refactoring your version: In your function you have:

 pred.means <- 1/eps.train*colmeansTrain

but in my refactored version it's

 inverse_eps_train <- 1 / (1 - eps.train)
 pred.means <- colmeansTrain * inverse_eps_train

is this a bug in your code and I accidently corrected it, or was your inital code correct and I made a careless mistake?

When using your version (bust just using folds, which is easier i think)

cluster sse function

``` cluster_sse3 <- function(train_data, test_data, clusters_est, folds) { stopifnot(ncol(train_data) == length(clusters_est)) totSS <- 0 unique_clusters <- unique(clusters_est) for (lab in unique_clusters) { cluster_indices <- which(clusters_est == lab) num_points <- length(cluster_indices) if (num_points > 1) { clust_data_test <- test_data[,cluster_indices] clust_data_train <- train_data[,cluster_indices] rowMeans_train <- rowMeans(clust_data_train) pred_means <- rowMeans_train * 1/(1 - 1/folds) ss <- sum((log1p(folds * clust_data_test) - log1p(pred_means))^2) totSS <- totSS + ss } } return(totSS) } ```

When running multifold negative binominal with that on the pbmc data, the result is not tooo different, but the minimum at resolution at 1 is not that clear anymore cluster_sse_multifold2

regaring other metrics: I agree. SSE worked well here. But of course if you think other metrics like silhouette score (the integrated version has C++ parts https://svn.r-project.org/R-packages/trunk/cluster/R/silhouette.R) are more appropriate, that would be also interesting.

Cool perfect
Agree. Thanks for the explaination
No problem. Took me some time to find it, and just wanted to make sure to attribute you accordingly. Yes true with the default setting which I used here I also only got 11139 genes of the initial 32738 genes with your cutoff that you use in your github min_cells = 10. I wondered how it would look like with the default min_cells = 5 resulting in 12 572 genes (using my old, so potentially wrong cluster_sse2 function):

cluster_sse_multifold_sctransform_more_v1

And then I also tried sctransform v2 https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02584-9 by just setting vst.flavor = 2 (again with my initial cluster_sse2 version)

cluster_sse_multifold_sctransform_v2

Here is the initial version which i showed in my previous post (so sctransform with min_cells = 10, vst.flavor 1), however, now with the same splitting as in the two plots before (didn't think about set.seed before).

cluster_sse_multifold_same_splits

What do you think are the most appropriate settings, especially regarding the regularization of v2? Looks pretty similar to me, but maybe it's better to reduce min_cells to at least 5.

I also wondered what you think the most approriate number of folds is here.

Yeah sure, I'll send you an e-mail. Maybe it can also help your manuscript to show how your countsplit approach works on this very relevant problem.

There are two further general questions/concerns I have with that approach:

A. When you are happy with your results and decide to proceed with a resolution of 1: Which data should be taken for the further downstream analysis, e.g. cluster abundance between conditions, differential expression between conditions, cell communcation analysis.. Should that be the test split (probably the best way mathematically)? But one could be worried that this influences the downstream analysis (although you show nicely in your tutorial that this is not the case for e.g. variable genes https://anna-neufeld.github.io/countsplit.tutorials/articles/seurat_tutorial.html). Besides, using the test set for further analysis would mean that you have to run Normalization, doublet detection, PCA, UMAP, Integration etc. again on your dataset, which is okay for this small dataset, but quite expensive computationally with datasets containing 400k cells and 40 samples (which usually can be run on a laptop in a few hours in Seurat. So it would be more convenient to just continue with the train set, but is that mathematically okay?

B. It's computationally even more expensive to run 10 folds with a large 400k dataset with 40 samples. Additionally, I wondered whether it's okay to just use the same filter settings for a folds (nFeature_RNA, percent.mt)? In fact, I usually set filtering parameters (RNA, MT) for each sample manually, which can take some time for large datasets with 40 samples. But this would be impossible if this had to be done adjusted manually in each of the folds.

Thank you so much.

Hi! Thank you so much for the continued good points.

colmeansTrain stores an estimate of eps.train*E[X]. Therefore, to convert this to an estimate of E[X], we should divide by eps.train. So I don't believe that this was a mistake in my version. However, when you are using eps.train = 0.5 (which you were using in your two-fold example) either version of the code would be correct.

For multifold thinning, eps.train should be equal to (folds-1)/folds, which is equivalent to 1-1/folds. I couldn't quite tell if what you had in your code was equivalent to this, but I don't think that it is. So I think that there may be a mistake with this section of your SSE code for multifold thinning. Let me know if this makes sense.

Regarding "min_cells"-- I don't have a ton of expertise with this! From a purely count splitting perspective, I don't have a strong opinion on how you should set min_cells. My main advice would be: "don't exclude any genes just because sctransform did. Exclude genes if you think that it will help your overall analysis.

So, if you think that your overall clustering analysis will be less noisy if you only include genes that were expressed in at least 10 cells, then you can go ahead and filter to only include these genes, and it would be fine to do this during the sctransform step. It might be useful to consult documentation from Seurat or sctransform for recommendations on filtering in this way.

Regarding the number of folds-- I think that 10 folds seems to be working well for you! The danger with having lots of folds is that you aren't leaving as much information in the test set for SSE calculation. But it doesn't seem like 10 is way too many here. If you were to find that, with 10 holds, the held out fold was too sparse to really have any signal in it, then I would recommend reducing the number of folds. But really it's up to you! Computation time is also a consideration!

A. This is a great question! I I will break my answer into two pieces. It really depends on if there is any "double dipping" associated with your downstream analysis. Also, I haven't thought through all of these examples super carefully, so I may have a mistake in my logic somewhere. It is worth thinking more about!

A1) Suppose that you want to figure out how many clusters are in your data, and then you want to do differential expression testing between the clusters (i.e. the cell types). In this case, it is mathematically best if none of the data used to form the final clusters gets used for the testing.

So the most correct thing to do would be to use 3 folds of data. You could use the first fold to fit clusters for a variety of resolution parameters, and then use the second fold to choose the best resolution parameter. You could then re-fit the clusters using these two folds combined (although this is computationally intensive so it would also be ok to just use the training set ones), and then use the final, completely held out fold of data for differential expression testing.

A2) Suppose you want to see if there is differential abundance of some cluster across a known condition such as disease status. As long as disease status itself was not used to define the clusters, then you don't actually have a double dipping problem. So I think that, if computation is not a limiting issue, the following process would make the best use of the data.

Use the train/test split to pick the best resolution parameter.
Using that resolution parameter, re-fit the clusters using the entire dataset.
Test for association between the clusters and the condition. The train test split ensured that we didn't overfit when picking a good resolution parameter, but I don't think that we are at risk of overfitting in the final step since "condition" was not used in the clustering steps.
By the same "no double dipping" logic, it should be fine to use either just the training set or just the test set in the final step if you prefer. But this does not make the best use of all of your data and may lower your power.

I hope that this helps you think through this.

B) Yes, you can use the same filter settings across each fold! In some ways I think that this makes more sense than using differently-sized datasets in every fold due to different filtering.

Hi @anna-neufeld,

thanks for your clear explainations.

okay then my first take had a mistake (although it does not matter as you say for a fold of 2). I tested it and with the last correction this new function has the same output as your original function, but is much faster. To avoid any confusion here's the function again #best vestion with correct fold

cluster_sse3 <- function(train_data, test_data, clusters_est, folds) {
  stopifnot(ncol(train_data) == length(clusters_est))
  totSS <- 0
  unique_clusters <- unique(clusters_est)

  for (lab in unique_clusters) {
    cluster_indices <- which(clusters_est == lab)
    num_points <- length(cluster_indices)

    if (num_points > 1) {
      clust_data_test <- test_data[,cluster_indices]
      clust_data_train <- train_data[,cluster_indices]
      rowMeans_train <- rowMeans(clust_data_train)
      pred_means <- rowMeans_train * 1/(1 - 1/folds)
      ss <- sum((log1p(folds * clust_data_test) - log1p(pred_means))^2)
      totSS <- totSS + ss
    }
  }
  return(totSS)
}

Thanks I agree. The default value of sctransform::vst min_cells = 5 makes sense I think.

A. Thanks, this is super helpful and very important for the user I think, so maybe include that in the tutorial :), especially: "Use train/test to pick the best resolution parameter, use that resolution parameter to re-fit the clusters using the entire-dataset!"

I actually tried to apply countsplit to my 170 000 cell dataset (which is still growing). Sctransform took pretty long, but even more of a bottleck was countsplit, more specifically iterating over dirMulSample: https://github.com/anna-neufeld/countsplit/blob/85ac0c3789afe1c7ca37d38a7f16dfaeefade507/R/countsplit.R#L70C8-L70C8

So for my matrix with 30044 rows and 20251 columns it took 368 s on my machine. Since this is actually just my first sample out of 13, this would probably take 80min for all samples.

I spend some hours to convert this line (so the dirMulSample function and the mapply iteration into Rcpp code. The dirMulSample only resulted in a ~ 2x performance increase. But if you include the mapply iteration, this resulted in a ~ 10x performance increase, so 38s on my machine (so probably around 8min for the entire dataset).

I created a pull request https://github.com/anna-neufeld/countsplit/pull/9. Feel free to verify (I am not that experienced with Cpp, but the outcome was similar; getting the same output is not trivial because of the random seed) and edit (I removed epsilon which i thought made the function more complicated and was determined by folds).

I will continue to work on my large real dataset and let you know how the result will look like.

What do you think about downsampling? So e.g. randomly downsampling 170 000 cells to ~18 000 cells. Would this be sufficient to run countsplit to determine the best resolution (to reduce computational costs)? scSHC requires a certain number of cells. So in this specific case did not work with ~2000 cells but worked okay with ~18 000. https://github.com/igrabski/sc-SHC/issues/5

Another idea would be to use the sketched assay which is integrated in Seurat, and which is supposed to select representative cells from multiple datasets. https://satijalab.org/seurat/articles/parsebio_sketch_integration

Best, Mischko

I ran countsplit on my 170k dataset. even with my optimization it took a few hours, but I think that's okay for such a huge dataset (althoughl if downsampling was okay, that would be nice, see previous question). It was a little more complex because of multiple samples that had to be aligned, but i just used a train and test matrix for each sample and then ran my analysis (including doublet removal, integration) based on the train matrix.

However, the result suggested to pick a very large cluster resolution.

cluster_sse

So a resolution of 1.5 would look like this

While I think a resolution of approximately 0.4 would be appropriate:

Of note, scSHC suggests less clusters (although that could be partly do downsampling which was necessary because scSHC requires a lot of memory)

and Cytocipher is even more conservative in general although we are still actively working on that:

Any ideas @anna-neufeld ?

Hi!!

Thanks for the detailed points!!

(1) If the goal of using countsplit is to pick the optimal resolution parameter for downstream clustering, I don't see any issue with using only a subset of the cells. This seems like a good idea!

(2) It is interesting that countsplit is suggesting such a high resolution parameter, when you really feel like it should be smaller. Two questions for you. (a) are you using Poisson or negative binomial count splitting? The fact that your chosen resolution parameter is too high might suggest dependence between train/test, which would definitely occur if you are using Poisson. (b) could you consider a metric / loss function other than logged sum of squared errors to evaluate your functions? (c) If you do stick with the logged som of squared errors, it is possible that you are getting messed up by not incorporating size factors (or other types of normalization corrections) into the loss function. The example tutorial was on a toy dataset with no size factors, but I think on real data these should be incorporated into the loss function. As I work on the new Seurat tutorial, I will think carefully about this point and add a correct example, but in the meantime if you have questions feel free to ask over email!!

(3) Yay about the RCPP speed up! I will look over the pull request and incorporate the changes! This is good timing because I was about to submit this package to CRAN but will hold off until it is faster!

Hi Anna!

Thank you.

Okay great this would speed up the analysis of large datasets a lot. What do you think would be a reasonable downsampling from 170 000 cells? 50 000, 20 000, 5 000? Also would you think random downsampling is okay, or rather cluser-based downsampling as it's often done in Seurat (so each cluster with 1000 cells, which could then mean that you have to run the pipeline before and decide on some clusters), or using this leverage score?
a. negative binimoial count splitting and overdispersion estimates from the sctransform function. Just as I did in the end with the PBMC dataset (where it looked pretty good) b Sure I can try that. What metric would you suggest. Do you have a function which is not too expensive computationally? c Hm I am not sure I understand this fully. So i treated this dataset just like the PBMC dataset (see code above), with the only difference that I normalized per sample and also ran doublet detection and integration. But the sse function then uses the raw (so not noramlized) counts data, right? So XtrainSubset and XtestSubset. Only the cluster information is based on the normalized data I thought? Maybe you can also try to run the data (I can give you access) or you try it with the pancreas dataset which is also more complicated than the PBMC dataset.

#split data into training and test sets with Poisson
set.seed(1)
split <- countsplit(pbmc.counts)

Xtrain <- split[[1]]
Xtest <- split[[2]]

## Set up the training set object and subset
pbmcTrain <- CreateSeuratObject(counts = Xtrain, min.cells = 3, min.features = 200)
pbmcTrain[["percent.mt"]] <- PercentageFeatureSet(pbmcTrain, pattern = "^MT-")
pbmcTrain <- subset(pbmcTrain, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)

# same genes and cells in the train set as in the test set
rows <- rownames(pbmcTrain)
cols <- colnames(pbmcTrain)
XtestSubset <- Xtest[rows,cols]
XtrainSubset <- Xtrain[rows,cols]

#run Seurat pipeline
pbmcTrain <-
  pbmcTrain |>
  NormalizeData() |>
  FindVariableFeatures() |>
  ScaleData() |>
  RunPCA() |>
  FindNeighbors(dims = 1:10) |>
  RunUMAP(dims = 1:10)

# define a range of resolutions
resRange <- seq(0.2, 3, by = 0.2)
resNames <- paste0("RNA_snn_res.", resRange)

for (res in resRange) {
  pbmcTrain <- FindClusters(pbmcTrain, resolution = res)
}

# calculcate within-cluster sse for each resolution
results <- list()

for (k in resNames) {
  results[k] <- cluster_sse2(
    trainDat = XtrainSubset,
    testDat = XtestSubset,
    clusters.est = pbmcTrain@meta.data[[k]],
    eps.train = 0.5)
}

Cool. Happy to help. Will change the epilson parameter tomorrow!

Hi!

Sorry for the late response!

(1) I think that random downsampling is okay, but keeping as many cells as your computational cost will allow is a good idea!

(2) I am still working on a new tutorial for the website that will be about "evaluating clusters"- sorry for the delay on this! In the new tutorial, I plan to recommend two different classes of loss functions. You could play around with either type.

Class 1 = "reconstruction-based". MSE is an example of this. The idea is that, when we estimate "good" clusters, we should be able to explain a lot of the variation in the dataset using only the cluster means.

You seem to have run your clustering algorithm on Seurat-normalized data, which I believe is something like log(Xtrain/sizefactors + 0.1), and then applied your loss function to log(Xtest)+1.
I recommend that you try to apply an MSE-based loss function directly to the Seurat-normalized data. This might give more reasonable results.
I am working on this for the PBMC data, but had to resolve some bugs first so was not able to tell yet if this gave substantially different answers than your analysis.

Class 2 = "stability-based". The idea is that "good clusters" should be stable and should be reproducible on an independent dataset.

For a variety of resolution parameters, you could try clustering both the training set and the test set. Then, record a similarity metric (such as adjusted Rand Index) between these two clusterings.
With a "good" resolution parameter, this adjusted Rand Index should be very high.
The adjusted Rand Index may still be high if we have under-clustered. This would be the case if the true clusters are nested within one another, rather than totally separated.
A low adjusted Rand Index is evidence of over clustering.

As I type out this answer, I realize that a third loss function might make sense. When we cluster with k-means, our clustering algorithm is optimizing for within-cluster-SSE, and so this is a reasonable loss function to use. But Louvain clustering optimizes for a "modularity score". I may try to implement something where I plot the test-set modularity score as a function of the resolution parameter.

This is a work-in-progress as I need to read more about this metric and how it is computed.

Hi Anna,

thank you for your response and no problem regarding the delay. I was just surprised that you closed this issue. So thank you for repopening.

1) Understood, maybe add this in the tutorial because it's practically super relevant 2) Yes. I normalized via Seurat::NormalizeData to find the clusters. I then used the raw couns for the sse metric, as shown above. Sure i can try that on normalized data instead of the raw counts. Sorry do you mean the cluster.sse metric of your tutorial or some other metric? Yes this does sound familiar, although I've never tried those: e.g. https://github.com/crazyhottommy/scclusteval https://github.com/PYangLab/scCCESS https://www.nature.com/articles/s41598-020-66848-3 So the question is how to implement which of these metrics?

Yeah sure, that does sound great, although at least stability clusters have been used before on Louvain clustered scRNA-seq data.

For the SSE metric, I recommend something like this. The exact details will be in the new tutorial. After

## Set up a training set Seurat object, do your preprocessing
pbmcTrain <- CreateSeuratObject(counts = Xtrain, min.cells = 5, min.features = 200)
pbmcTrain[["percent.mt"]] <- PercentageFeatureSet(pbmcTrain, pattern = "^MT-")
pbmcTrain <- subset(pbmcTrain, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)
pbmcTrain <-
    pbmcTrain |>
    NormalizeData(verbose=F) |>
    FindVariableFeatures(verbose=F) |>
    ScaleData(verbose=F) |>
    RunPCA(verbose=F) |>
    FindNeighbors(dims = 1:10, verbose=F) |>
    RunUMAP(dims = 1:10, verbose=F)

scale.dat.train <- GetAssayData(pbmcTrain, layer="scale.data")

### Set up a Seurat test set object. Use the rows/columns selected for the training matrix, but scale/normalize this data.
 rows <- rownames(scale.dat.train)
 cols <- colnames(scale.dat.train )
 XtestSubset <- Xtest[rows,cols]
  pbmcTest <- CreateSeuratObject(counts=XtestSubset, min.cells=0, min.features=0)
 pbmcTest <- pbmcTest |>
    NormalizeData(verbose=F) |>
    ScaleData(verbose=F)

scale.dat.test <- GetAssayData(pbmcTest, layer="scale.data")

### Insert your code that estimates clusters for different resolution parameters here.

### Code that will be used to compute SSE for each set of clusters 
 for (k in 1:length(resNames)) {
    clusters.est <- pbmcTrain@meta.data[[resNames[k]]]
    totSS <- 0
    unique_clusters <- unique(clusters.est)
    for (lab in unique_clusters) {
       cluster_indices <- which(clusters.est == lab)
       if (length(cluster_indices) > 1) {
            ## importantly, you are using the scaled and logged data.
            ## Make sure that scale.dat.test and scale.dat.train have the same dimensions: that is crucial
            clustdat.test <- scale.dat.test[,cluster_indices]
            clustdat.train <- scale.dat.train[,cluster_indices]
            rowMeansTrain <- rowMeans(clustdat.train)

           pred.means <- rowMeansTrain/eps.train
           ss <- sum((clustdat.test/(1-eps.train) - pred.means)^2)
           totSS <- totSS + ss
 }

For the other metrics, I like the idea of this one https://github.com/crazyhottommy/scclusteval, but I was actually thinking that count splitting provides an alternative to this method, in that we shouldn't need to use subsampling of cells, and we can instead use count splitting. I'm also putting this into the new tutorial.

@anna-neufeld Ah thanks.

I tried it with your code and also faced an issue. Lots of NaN because scale data can be negative, so log1p is problematic.

However, it works if you use data, instead of scale.data. data (so the normalized, but not scaled counts) are used by default by Seurat when findings markers, although there is some debate about that.

https://github.com/satijalab/seurat/issues/62

What do you think is mathematically the better version?

I ran the PBMC data with 10 folds with the data instead, which looked really nice. Even better than before because high resolution with overclustering have high SSEs. But don't want to jump again into wrong conclusions, so would love to hear if you think data is approriate. If not then they would need to be scaled differently, or no log version?

cluster_sse_multifold_data

Thanks!

I actually don't work with Seurat very often, so I wasn't totally sure what was stored in "data" vs. "scale.data".

I just verified empirically that "data" holds log1p ( count / (total counts in cell)*10000).

I think that this is an appropriate scale on which to compute the error function.

But note that this data is already logged, so you don't need to log it again when computing the SSE. I think it's appropriate to look at the sum of squared differences between the "data" slot of the test object, and the cluster means computed from the "data" slot of the train object (remembering to scale by epsilon where needed).

Does this help?

Either way, taking into account the size factors (total counts in cell), which is done if you use "data", is definitely a good idea and I think it will lead to a more reasonable choice of resolution parameter.

Thank you.

Okay cool.

Yes that makes sense.

So I would just remove the log1p function, so the cluster_sse function would be:

cluster_sse4 <- function(train_data, test_data, clusters_est, folds) {
  stopifnot(ncol(train_data) == length(clusters_est))
  totSS <- 0
  unique_clusters <- unique(clusters_est)

  for (lab in unique_clusters) {
    cluster_indices <- which(clusters_est == lab)
    num_points <- length(cluster_indices)

    if (num_points > 1) {
      clust_data_test <- test_data[,cluster_indices]
      clust_data_train <- train_data[,cluster_indices]
      rowMeans_train <- rowMeans(clust_data_train)
      pred_means <- rowMeans_train * 1/(1 - 1/folds)
      ss <- sum((folds * clust_data_test - pred_means)^2)
      totSS <- totSS + ss
    }
  }
  return(totSS)
}

This looks not so plausible anymore unfortunately with the PBMC dataset.

cluster_sse_multifold_data

The log1p version looked much better, but is mathematically not correct? I mean sounds logically to not take the log twice, but somehow the result is more plausible then, at least in this dataset.

In your code above-- I think that clust_dat_test should get multiple by 1/folds, not folds. Does that change make it look more plausible?

Actually- my apologies! I would not expect the above code to work well unless you are in the two-fold setting where espilon_train = 0.5.

The reason is because we should be multiplying the training/test counts by their epsilon factors before taking the log, not after taking the log.

This gets messed up when we are using the "data" attribute directly from Seurat.

I will write some code that works for this example and get back to you.

I was able to recreate your error on the PBMC data, and it does seem to be due to incorrect scaling by the number of folds. To get around this, I did the following.

I first verified that I knew how to reproduce the contents of the "data" layer in the Seurat assay.

raw.counts.train <- GetAssayData(pbmcTrain, layer="counts")
sizeFacs.train <- pbmcTrain$nCount_RNA
default.dat.train <- GetAssayData(pbmcTrain, layer="data")
my.dat.train <- sapply(1:NCOL(raw.counts.train), 
                             function(u) log1p(raw.counts.train[,u]/sizeFacs.train[u]*10000))
all.equal(as.numeric(default.dat.train), as.numeric(my.dat.train))

Once I did this verification, I realized that what I want to do is scale the raw counts in the training set by a factor of 1/(eps.train) and scale the raw counts in the test set by a factor of 1/(eps.test), where eps.test = 1-eps.train. So, I made my own version of dat.train and dat.test as follows.

dat.train <- sapply(1:NCOL(raw.counts.train), 
                             function(u) log1p((raw.counts.train[,u]/eps.train)/sizeFacs[u]*10000))
raw.counts.test <- GetAssayData(pbmcTest, layer="counts")
sizeFacs.test <- pbmcTest$nCount_RNA
dat.test <- sapply(1:NCOL(raw.counts.test), 
                             function(u) log1p((raw.counts.test[,u]/(1-eps.train))/sizeFacs[u]*10000))

The objects dat.train and dat.test are now ready for input into the SSE function, and do not need to be logged again.

Does this help? Working through this has been super helpful and will definitely be appearing in the new tutorial.

Hi Anna,

I still used my rcpp version and realized that it did not work with poission distribution, when you set overdispersion to Inf. I spent some time fixing this problem using std::isinf(b) and skipping the rgamma function. But then I found out that you already changed that in your repo ... among many other things. Maybe in the future you could try to use the pull request then it's easier to keep track and automatically mentions everyone who contributed code. But thanks so much for acknowledging me so prominently in the readme. You added a lot of comments and additional Rcpp code. Looks great!!

Will have a look at the sse function later.

I used your function, but found that it was very slow (with the 10 folds even on this small dataset it took pretty long). Since there is already a very fast Normalization method in Seurat, i just used that, and modified it slightly (removing the progress bar and dividing by eps_train. Around 57x performance boost. I created a pull request. If possible, please accept, revise, or tell me to revise.

The code i used for the 10 fold negative binomial version:

calcSS <- function(test_data, matrix, resRange, folds) {
  train_data <- matrix - test_data

  #create seurat object
  seu_obj_train <- CreateSeuratObject(counts = train_data, min.cells = 3, min.features = 200)
  seu_obj_train[["percent.mt"]] <- PercentageFeatureSet(seu_obj_train, pattern = "^MT-")
  seu_obj_train <- subset(seu_obj_train, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)

  # run seurat pipeline
  seu_obj_train <-
    seu_obj_train |>
    NormalizeData() |>
    FindVariableFeatures() |>
    ScaleData() |>
    RunPCA() |>
    FindNeighbors(dims = 1:10) |>
    RunUMAP(dims = 1:10)

  # same genes and cells in the train set as in the test set
  rows <- rownames(seu_obj_train)
  cols <- colnames(seu_obj_train)
  train_data <- train_data[rows,cols]
  test_data <- test_data[rows,cols]

  seu_obj_test <- CreateSeuratObject(counts = test_data, min.cells = 0, min.features = 0)

  eps_train <- 1 - 1/folds
  train_data <- LogNorm(train_data, scale_factor = 10000, eps_train = eps_train)
  test_dat <- LogNorm(test_data, scale_factor = 10000, eps_train = eps_train)

  # assign names for each resolution
  resNames <- paste0("RNA_snn_res.", resRange)

  for (res in resRange) {
    seu_obj_train <- FindClusters(seu_obj_train, resolution = res)
  }

# calculcate within-cluster sse for each resolution
results <- list()

  for (k in resNames) {
    results[k] <- cluster_sse4(
      train_data = train_data,
      test_data = test_data,
      clusters_est = seu_obj_train@meta.data[[k]],
      folds = folds)
  }
  return(data.frame(results))
}

However, the result again did not look so plausible unfortunately:

cluster_sse_multifold_data

A source of error could still be the cluster_sse function. I removed the log1p part as you suggest

#remove log1p because input already log transformed
cluster_sse4 <- function(train_data, test_data, clusters_est, folds) {
  stopifnot(ncol(train_data) == length(clusters_est))
  totSS <- 0
  unique_clusters <- unique(clusters_est)

  for (lab in unique_clusters) {
    cluster_indices <- which(clusters_est == lab)
    num_points <- length(cluster_indices)

    if (num_points > 1) {
      clust_data_test <- test_data[,cluster_indices]
      clust_data_train <- train_data[,cluster_indices]
      rowMeans_train <- rowMeans(clust_data_train)
      pred_means <- rowMeans_train * 1/(1 - 1/folds)
      ss <- sum((folds * clust_data_test - pred_means)^2)
      totSS <- totSS + ss
    }
  }
  return(totSS)
}

But is it that sufficient? Is the multiplication by folds of clustdat_test and the multiplication of rowMeans_train by 1/(1-folds) still necessary?

Thanks!

Actually, the most recent normalization function that I suggested to you is not correct.

It is very tricky to figure out where the logs should go and where the normalization should go! It was much easier on toy datasets where there was no need to account for size factors.

I have tried several combinations of ways to normalize and take the log, and I think I am narrowing in on the "correct" version. I will send you a bare-bones tutorial outlining the procedure as soon as possible, followed by a full-length tutorial.

The main principle is that we always must be comparing "apples to apples" in our loss function. The cluster mean for the training set needs to be comparable to the cluster mean for the test set. So, on the raw count scale, we need to be sure that we have accounted for epsilon.train and epsilon.test. But if we are dividing by size factors, this is not be necessary, since both the counts and the size factors have been scaled by eps.train or eps.test, and these contributions cancel out.

I think that the use of the "cluster.sse" function in the toy-dataset tutorial was misleading, since this function was not intended for use on real scRNA-seq data (that's why it's not in the R package). As I work on the new non-toy-dataset tutorial, I will be sure to highlight that the choice of loss function is very context-dependent (i.e. do you have size factors, do you want to be working on the log scale, etc).

One other factor to consider is: do you want to compute the SSE over all genes, or only over those that were selected as highly variable genes in your test set? All of this stuff is really seeming to affect the results. This is one reason why I think that a stability-based metric could be preferable (will send code for this shortly!).

Thanks for the info.

Okay too bad, especially because the first results looked so promising.

Looking forward to the countsplit implementation of the stability metrics. https://crazyhottommy.github.io/EvaluateSingleCellClustering/5k_pbmc.html shows this pretty well without a test dataset on the 5k pbmc dataset.

https://www.biorxiv.org/content/10.1101/2023.07.07.548158v1.full actually uses your countsplit approach to perform self-supervised benchmark of clustering. So maybe you also get some methodological ideas there.

@anna-neufeld hey Anna, hope you could make some some progress with that? Anything I can test or help?

Sorry about the delay!

Here is a draft tutorial (not linked from the main site yet) with some thoughts about stability-based metrics.

I think that the visualizations are helpful, but there isn't yet a nice metric that actually has a minimum/maximum value for selecting the best # of clusters. But I think that this is more interpretable than MSE (in my opinion).

https://anna-neufeld.github.io/countsplit.tutorials/articles/Seurat_stability_demonstration.html

Thanks for the response.

Yes, interesting, but agree not super clear results. None of the established metrics looked better?

Nothing similarity-based achieved a clear max/min(Rand, Jacquard, etc). I tried using these on simulated datasets, and they achieve a maximum at the true number of clusters only if the true clusters are very well separated and not at all nested within each other. And I think on real datasets, clusters tend to be nested.

I think one thing that might be useful would be thinking of it as "deciding which clusters to merge", rather than "picking the best resolution parameter". Along the lines of this paper, https://www.science.org/doi/full/10.1126/science.aba7721, although this paper used a sample splitting procedure that is not actually valid (count splitting could provide a workaround).

Their procedure will still require a user-defined "threshold', though, to decide when clusters should be merged.

I agree. Choosing the right resolution might seem easier but merging clusters is probably the better approach I think this is what both scSHC https://github.com/igrabski/sc-SHC and cytocipher https://github.com/BradBalderson/Cytocipher are using

Could this be combined with the countsplit approach?

anna-neufeld / countsplit

countsplit with Seurat to find optimal number of clusters #8