Number of initial clusters influences number of significant merged clusters

@BradBalderson Congratulation to this package. I think this is a very much needed tool for the sc community. I can confirm that it scales much better to large dataset than scSHC. And the visualization you provide are nice and helpful.

I am a Seurat user but conversion from Seurat to AnnData via SeuratDisk https://mojaveazure.github.io/seurat-disk/articles/convert-anndata.html has worked fine for me here (at least with Seurat v4). Converting AnnData to Seurat was more problematic. This didn't work for me with SeuratDisk. sceasy https://github.com/cellgeni/sceasy worked, but lacked some data. anndata2ri https://github.com/theislab/anndata2ri worked for me previously but was problematic here because r2py didn't detect my R paths. So in the end I just saved the merged cluster as a .csv file and added this to my Seurat metadata, which is easy, fast and sufficient for the downstream analysis I think.

However, I had some strange results depending on the number of initital clusters. To reduce the computation speed, I downsampled my dataset of initially ~170k Downsampling via Seurat's internal subset function actually downsamples per cluster, so I downsamples to 1000 per cluster, which resulted in ~17k clusters with 19 clusters (some had less than 1000 cells so were not downsamples at all).

original dataset

   Adipo        B      EC1      EC2    EndoC     EpiC  Granulo      LEC 
    2647      189    15756     4873    14520    18447     1000     1331 
   Macro     Mast     mySC   NK_CD8    nmSC1    nmSC2   PeriC1   PeriC2 
    8413      613     3473     2568    23152     3670    32526     6055 
repairSC VSMC_PC1 VSMC_PC2 
    3147    19205    10464

downsampled dataset


   Adipo        B      EC1      EC2    EndoC     EpiC  Granulo      LEC 
    1000      189     1000     1000     1000     1000     1000     1000 
   Macro     Mast     mySC   NK_CD8    nmSC1    nmSC2   PeriC1   PeriC2 
    1000      613     1000     1000     1000     1000     1000     1000 
repairSC VSMC_PC1 VSMC_PC2 
    1000     1000     1000

These 19 cluster make sense visually and biologically, althoughone could argue that a few are overclustered.

The. resulting merged cluster with a p cutoff 0.05 looked reasonable in most cases (e.g. merge EC1 and EC2 ), but overly conservative in other cases (e.g. VSMC_PC1 and EpiC are very distinct cluters).

The volcano plot looked reasonable to me, so not sure it would make sense to increase the p value.

Based on your tutorial and manuscript jupyter notebooks, I thought that starting with overclustered dataset would make sense. I clustered the downsamples dataset with a resolution of 0.8 with Seurat's default Louvain algorithm. Seurat recommends 0.4-1.2, so actually it's not even that high. But it does look pretty overclustered in the UMAP to me.

However, this resulted in problems when using the default parameters

Exception: No marker genes for 11. Relax marker gene parameters in cc.tl.get_markers() or decrease Leiden resolution for inputted clusters.

So i relaxed the parameters:

cc.tl.get_markers(adata, "cluster", n_top = 6, padj_cutoff = 0.2, t_cutoff = 1)

This worked, but lead to another issue:

Full overlap of + and - gene sets detected for 20 and 16; suggested to increase number of marker genes for scoring.

So for merge i then used more markers

cc.tl.merge_clusters(adata, "cluster", n_cpus = 6, marker_padj_cutoff = 0.2, t_cutoff = 1, n_top_genes = 10)

This worked, but the resulting merged UMAP clusters looked very strange.

I then though that maybe the results would look better with my entire dataset of 170k cells. However, here with the same clustering as in the downsamples dataset (so not overclustered, resolution 0.4, so lower limit of what Seurat recommends), I encountered the No marker genes for .. issue.

So my two main questions are: 1) Is it recommended to make the cluster sizes similar? Based on my results this lead to more plausible results. 2) What is a good initial clustering resolution? Why did my rather moderate resolution of 0.8 lead to issues with the algorithm and especially to very strange merged clusters? 3) Is the method able to not only detect OVERclustering but also UNDERclustering? In your manuscript you nicely show an example, how you detect 2 CD8 T cell population. But in my first attempts I only found that it merged clusters, not separated them

Thank you!

Hey @mihem,

Thanks alot for being the first person to try this out! (That I know of).

And thanks alot for details with regards to getting it ported over to R. I will look into writing a tutorial based on utilising reticulate within R itself to run Cytocipher, so don't need to convert and save to .csv. It would still require transferring over some results from the AnnData object to your Seurat object however, but it's something which should not take me long to setup. I'll update this thread when I try in the next few days.

Downsampling approach makes complete sense, I did the same for Tabula Sapiens and got very comparable results to using the full dataset with MUCH faster run time.

I'm happy to see those diagnostics plots are working well; can see when it's doing the right versus wrong thing quite quickly!

There are some important recommendations on correct run usage in the Pancreas analysis tutorial, in section "Running Cytocipher to determine significantly different clusters". One point from here I'll highlight:

Practically, Leiden clustering at a resolution between 2 and 4 produces over-clusters.

More specifically to evaluate initial clusters:

In terms of making sure the data is over-clustered first, the code-score heatmaps are more quantitative. Should observe scores along each column split amongst multiple rows (indicating cells in separate clusters having similar scores). Example below (taken from Fig. 5C of paper):
Then once merged, the 'cross-scoring' should be resolved:

For the marker genes issue, I should update that warning statement, it is not a problem at all to have full marker gene overlap (in-fact, expected in-cases of overclusters!). Instead, could you try this after updating the initial clusters with the recommendations above?: cc.tl.merge_clusters(adata, "cluster", n_cpus = 6, marker_padj_cutoff = 0.2, t_cutoff = 1, n_top_genes = 6, squash_exception=True)

So what squash_exception is doing works in the cc.tl.code_score, from that documentation:

squash_exception: bool
            Whether to ignore the edge-case where there is complete overlap of
            marker genes between two clusters, thus these two clusters will
            score exactly the same having the same set of marker genes. By
            default is false, prompting the using to relax the marker gene
            parameters so additional genes may differentiate clusters.

These two bits of documentation need to be updated, the default should be squash_exception=True.

When I was developing it, I had it as True because I was interested in how it would perform for that edge case, and found it simply means that if there are no marker genes which differentiate two clusters, it then only effectively declares them different if there is a difference in magnitude of expression for those marker genes between the respective cluster pairs tested.

For your questions:

Is it recommended to make the cluster sizes similar? In the back-end, Cytocipher is already representing the score-distributions in a way that makes an equal number of observations per cluster while maintaining the cluster-score distributions (see Supp. Figure 1). Effectively what I do to ensure cluster-pair significance does NOT correlate with the number of cells in the cluster pair is by taking samples of cells across the distribution of code-scores, in a way that represents the original distribution but with an equal number of datapoints for each cluster (K-quantile sampling). Including a graphic of Supp. Fig. 1A to illustrate this, the right panel is the default running method in Cytocipher. Additional panels in Fig S1. indicate how this maintains the distributions correctly and the performance, while de-correlating the significance with the number of observed cells.
What is a good initial clustering resolution? Why did my rather moderate resolution of 0.8 lead to issues with the algorithm and especially to very strange merged clusters? I recommend between resolution 2-4 (as indicated above). The heatmap diagnostics are the best way to ensure this. I think the weird merging you are seeing is a result of the large number of marker genes combined with code-scoring. I recently observed this, and did a sensitivity analysis with respect to number of marker genes and performance with code-scoring. When there are too many marker genes, the negative gene-set filtering causes issues, because there are too many false-positive marker gene across the clusters, causing over-abundant 0-scores with the negative gene-set filtering. This is illustrated below with a sensitivity analysis with respect to the number of markers for different scoring methods using the same simulated data in the paper (this is a really new result, so not in the paper). The drop off in performance you see is for the methods with negative gene set subtraction. These will perform worse when that parameter is set too high (but, as in the paper, generally has better performance when set correctly, in the cases of the real data I tested on).
Is the method able to not only detect OVERclustering but also UNDERclustering? In your manuscript you nicely show an example, how you detect 2 CD8 T cell population. But in my first attempts I only found that it merged clusters, not separated them. In terms of the significance test, it works on over-clusters. For identifying under-clusters, I think instead just make sure each of the clusters is an over-cluster first (using the criteria I mentioned above). If it is not clearly over-clustered, it could be under-clustered. So I think it becomes a non-issue if make sure the clusters are over-clusters first.

Hopefully this helps, I am adding a few items to my TODO list:

Update marker gene recommendation printouts so is not misleading (sorry about this!)
Set squash_exception=True.
Write Cytocipher reticulate tutorial.

Let me know if this resolves the issue!

Best, -Brad

Hey @BradBalderson ,

thanks a lot for your extensive and very informative response.

Sure, it's great to be the first user apparently. I think this package has tons of potential users, especially if you make it easily accessible also for R (especially Seurat) users (as for example harmony did that https://portals.broadinstitute.org/harmony/SeuratV3.html). I am actually okay with exporting the cluster_merged column and importing it into Seurat. Very fast and robust. For Seurat users it would be of course nice if you don't have to leave R and can create and save your plots there, as you are used to.

In fact, in the last few days I've been working on using the countsplit approach to find the optimal number of clusters https://arxiv.org/abs/2307.12985 . This worked quite well with the pbmc 10x dataset https://github.com/anna-neufeld/countsplit/issues/8. The large advantage of Cytocipher is that it merges clusters and does not only rely on Leiden Clustering as you show very nicely in Figure 4.

But it is a very strong statistical idea I think, so maybe one could multifold negative binominal countsplit with the merged clusteres (based on Cytocipher) and see that this minimized the within cluster sum of squares error?

Regarding the questions from the previous post:

Okay that makes sense. But still: Do you then recommend subsetting in a way that cluster sizes are similar or just random downsampling independent of the clusters? Will this help your algorithm or skew the true results?
Thanks for the explaination. I tried a resolution as you suggested (3.0), but unfortunately the results were again very weid.
Understood. You are right. It's not a problem at all to OVERcluster on purpose so that Cytocipher can then merge the clusters. Unfortunately didn't work with my dataset.

I used the downsampled Seurat dataset with downsampling it to 1000. So the clusters are as shown previously:

    EpiC   PeriC1   PeriC2    EndoC      EC1      EC2    Adipo VSMC_PC1 
    1000     1000     1000     1000     1000     1000     1000     1000 
VSMC_PC2     mySC repairSC    nmSC1    nmSC2      LEC     Mast    Macro 
    1000     1000     1000     1000     1000     1000      613     1000 
 Granulo        B   NK_CD8 
    1000      189     1000

I then overclustered with a Leiden resolution of 3.0.

Then enriched heatmap looked like this. The overclustering doesn't look as impressive here as in your picture, but I think it's notheless sufficiently overclustered?

downsampled_enriched_hmap_before

Side note: I am unable to save the enriched heatmap. I use R 99% of my time, so it could be my incompetence. But e.g. the sc.pl.umap can be saved:

type(sc.pl.umap(adata, color="cluster", show = False))

>>> class 'matplotlib.axes._subplots.AxesSubplot'>

So the following works:

umap = sc.pl.umap(adata, color = "cluster", show = False)
umap.figure.savefig(os.path.join("results", "cytocipher", "umap_before.pdf"))

While:

type(cc.pl.enrich_heatmap(adata, "cluster", show = False))
>>> <class 'dict'>

So figure.savefig does not work. Is this a bug?

I merged the clustered with squashed_exception = True as you told me.

cc.tl.merge_clusters(adata,
                     "cluster",
                     n_cpus = 6,
                     p_cut = 0.05,
                     marker_padj_cutoff = 0.2,
                     t_cutoff = 1,
                     n_top_genes = 6,
                     squash_exception = True)

The enriched heatmap of the merged clusters looks like this: downsampled_enriched_hmap_merged

The umap looks unfortunately very weid again. Very unplausible merging, both visually and biologically. Do you have any idea what the problem could be?

I also tried merge_clusteres with enrich_method "coexpr". This looked very different, much better than the other one but pretty sure overclustered:

As a second approach, I used the randomly downsampled dataset, which has cluter sizes like this (so some very small one like B, but well defined based on marker genes and biologically of course different from adjacent macrophage cluster.

    EpiC   PeriC1   PeriC2    EndoC      EC1      EC2    Adipo VSMC_PC1 
    2130     3852      686     1725     1789      588      285     2176 
VSMC_PC2     mySC repairSC    nmSC1    nmSC2      LEC     Mast    Macro 
    1230      401      359     2712      417      144       61      980 
 Granulo        B   NK_CD8 
     118       23      324

When calling the merge_cluster function (with quash_exception True), this lead to the following error:

  File "/home/mischko/.local/lib/python3.10/site-packages/cytocipher/score_and_merge/cluster_score.py", line 593, in code_enrich
    str_dtype = f"<U{max([len(gene_name) for gene_name in all_genes])}"
ValueError: max() arg is an empty sequence

I had to increase marker_padj_cutoff and p_cut a lot, plus reducing t_cutoff a lot and then the function finally worked:

cc.tl.merge_clusters(adata,
                     "cluster",
                     n_cpus = 6,
                     p_cut = 0.5,
                     marker_padj_cutoff = 0.9,
                     t_cutoff = 0.001,
                     n_top_genes = 6,
                     squash_exception = True)

However, when I wanted to run further function like umap with the cluster_merge i got the following error:

Traceback (most recent call last):
  File "<string>", line 8, in __PYTHON_EL_eval
  File "/home/mischko/miniconda3/envs/cytocipher/lib/python3.10/ast.py", line 50, in parse
    return compile(source, filename, mode, flags,
  File "/home/mischko/project/cytocipher_sn_suralis_2023_02.py", line 2
    import cytocipher as cc

I will send you a link the Seurat data via mail and hope you find time to have a look at this yourself, which is probably easier.

Also I just realized: I do all the pre-processing in Seurat (using log normalization). This should not be a problem right?

Thanks a lot!

BradBalderson / Cytocipher

Number of initial clusters influences number of significant merged clusters #1