Could not calculate statistics for groups since they only contain one sample

alifarhat40 commented 9 months ago

@BradBalderson Thanks for creating this amazing package. I am running into an issue where my new clustering algorithm provides many small clusters. And a lot of these clusters only have one sample in them.

When running this like: cc.tl.get_markers(adata, 'cluster_labels', var_groups='highly_variable', n_top=6)

I get the following error:

ValueError: Could not calculate statistics for groups 956, 497, 1149, 559, 818, 490, 634, 1043, 554, 1121, 747, 584, 1088, 722, 793, 937, 1104, 945, 544, 808, 1175, 756, 708, 1168, 690, 617, 649, 841, 830, 745, 682, 595, 693 since they only contain one sample.

Is there any way I can fix this from within the package? My clustering method naturally overclusters, and I cannot add more samples to the clusters. A lot of my clusters only include one sample because of the sparse nature of the PBMC3k dataset, which is what I am applying this method to: I get 344 cluster labels for its 2638 cells. The package scSHC is able to find 8 of the 9 clusters, with ARI = 0.6174933708899379 (based on annotated ground-truth labels). But I am really interested in trying your method, since I believe a lot of these small clusters are due to the sparsity.

BradBalderson commented 9 months ago

Hi @alifarhat40,

That is really weird you get clusters with only one cell in them! I suppose scSHC can handle this because it tests gradually using a guide-tree, from root to tips, and therefore naturally aggregates the clusters before testing, so it makes sense.

I find it very weird there are so many clusters, I think something is going wrong with the upstream clustering.

Could you send a UMAP with the clusters annotated? And perhaps a count of how many cells of each type are in each cluster? Also, what resolution do you set for the Leiden clustering?

alifarhat40 commented 9 months ago

Thanks for the reply!

"Could you send a UMAP with the clusters annotated? And perhaps a count of how many cells of each type are in each cluster? Also, what resolution do you set for the Leiden clustering?"

To answer your questions:

1) There are 344 clusters uncovered by my algorithm (overclustering) due to the sparse nature of the cells, so a UMAP showing 100 dots and 244 small clusters would not be informative.
2) Counts range from 1 cell per cluster to many.
3) I do not use Leiden or a resolution parameter. I am using my own clustering algorithm that I recently developed.

Sorry, I know these are vague responses.

Actually, I developed my own clustering algorithm based on manifold theory and topology, so I do not apply any Leiden clustering. In fact, my algorithm does not have any hyperparameters whatsoever, such as "resolution", "k", "neighborhood size", etc. We naturally uncover the cells in high-dimensional PCA (or latent) space. My density-based clustering algorithm scales linearly with the data and performs really well against current clustering algorithms on toy datasets (benchmarks) of simple data. It is already state-of-the-art on general data, but I am trying to adapt it to scRNA-seq clustering for my Bioinformatics PhD thesis.

When applying scSHC alone to PBMC3k I get ARI = 0.54 (compared to annotated ground truth). But when applying scSHC on my precomputed clusters I get ARI = 0.617. However, scSHC is unable to correctly merge/classify the rare clusters (dendritic cells). I know for a fact the dendritic cells are uncovered by my clustering algorithm because when I manually looked at the 344 clusters (super tedious) I found the dendritic cells and was able to improve my ARI to 0.71.

From 2638 PBMC3k cells, I get 344 clusters, 100 of which contain only one cell. When researching how to merge my clusters I thought: "Well, how about I compare the differentially expressed genes, create a score, and decide whether to merge or not in my hierarchical clustering." That is when I came across Cytocipher and scSHC. The reason I obtain so many clusters is that my density-based clustering algorithm looks at the natural position of the data in high-dimensional space, and the sparsity of scRNA-seq makes this challenging. But I am confident that rare cells such as dendritic cells are in their own clusters, which I need to be able to merge accurately.

Ground truth (9 clusters): seurat_truth_umap_pbmc3k

scSHC on my precomputed clusters (8 clusters compared to 9 ground truth): scSHC_my_clustering

BradBalderson commented 9 months ago

That sounds very interesting. It is funny, I was thinking it would be a great improvement to Cytocipher/scSHC to have a prior clustering method that guarantees over-clustering, which it sounds like you are developing!

It is a little tricky with the stats tests I do if there are clusters with only one cell though, since there is no replication on which to calculate the t-stats, compare the scores, and merge. I did not think about this edge case.
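To illustrate why a single-cell cluster is a problem, here is a minimal sketch of a Welch-style t-statistic between two groups of per-cell scores. It is not Cytocipher's actual implementation; the function name `welch_t` and the score values are hypothetical and only show where the calculation needs replication.

```python
import math
import statistics


def welch_t(group_a, group_b):
    """Welch's t-statistic for two groups of per-cell scores (illustrative only)."""
    for name, group in (("a", group_a), ("b", group_b)):
        if len(group) < 2:
            # The sample variance divides by n - 1, so it is undefined for n = 1.
            raise ValueError(
                f"group {name} has {len(group)} sample(s); need at least 2"
            )
    var_a = statistics.variance(group_a)
    var_b = statistics.variance(group_b)
    # Standard error of the difference in means.
    se = math.sqrt(var_a / len(group_a) + var_b / len(group_b))
    return (statistics.fmean(group_a) - statistics.fmean(group_b)) / se


# Hypothetical scores for two clusters with several cells each.
scores_a = [2.1, 1.9, 2.4, 2.0]
scores_b = [0.9, 1.2, 1.0]
t = welch_t(scores_a, scores_b)
```

Calling `welch_t([1.0], scores_b)` raises a `ValueError`, which mirrors the situation in the error message above: with one cell there is no within-group spread to estimate.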

The only way I can think of handling it is using a guide-tree, which is what scSHC does, to make merges/statistical calls at each split-point.

alifarhat40 commented 9 months ago

No worries. I will try to wrangle the data to see if I can prepare it to be used with your method. I have several ideas I need to try. How many samples per cluster does your method need?

Yes, my new clustering method guarantees prior over-clustering without tuning any hyperparameters. Worst case I will compare DEGs between every pair of clusters as I grow my tree. My topology has a distance metric, and I combine nearby clusters post-hoc without any guidance. Perhaps I can manually compute differentially expressed genes at every merge event to determine whether to merge clusters based on a scoring criterion.

Feel free to close this issue if you want, and I appreciate your response.

BradBalderson commented 9 months ago

@alifarhat40 this looks really great.

I am thinking a minimum cluster size of 3-10; I think you cannot calculate basic things like the standard error without at least 3 samples.

Perhaps, based on your tree above, you could set a cut-level that results in a minimum cluster size of 3 cells, then try Cytocipher?

alifarhat40 commented 9 months ago

Yup haha. I was thinking the same thing. I can cut at 3 cells. It should not be an issue. It will take me some time. Perhaps by the end of the week I can have something. Thanks a lot for your timely responses. I'll keep you updated.

BradBalderson commented 9 months ago

Thanks @alifarhat40 and best of luck, think it is a very interesting idea and looking forward to hearing more about it.

P.S. will keep github issue open if you want to post updates, but if not just let me know and I will close it.

alifarhat40 commented 9 months ago

Fine by me. I will post updates here.

alifarhat40 commented 9 months ago

Hi @BradBalderson,

I have a really stupid question: If I duplicate the same data point (i.e. sample cell) three times within my clusters that only have one data point/sample, would that mess with your statistical tests?

I ask this because there is no way to cut my tree at a location that guarantees each cluster has 3 data points. In my manifold there are locations that are really far from everything and only merge into the tree at the end. So I am looking to cheat a little in my exploratory phase haha.

alifarhat40 commented 9 months ago

Actually, ^ that did not work. Instead, I just removed the 400 clusters that only have one cell in them and performed cytocipher on the remaining data points. My ARI went up to 0.89 on the PBMC3k dataset.
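The filtering step can be sketched roughly as follows, assuming one cluster label per cell. The helper `split_by_cluster_size` and the toy labels are hypothetical and not part of Cytocipher; the sketch only separates cells in single-cell clusters from the rest so the dropped cells can be revisited later.

```python
from collections import Counter


def split_by_cluster_size(labels, min_size=2):
    """Return (kept, dropped) cell positions given one cluster label per cell."""
    sizes = Counter(labels)  # number of cells in each cluster
    kept = [i for i, lab in enumerate(labels) if sizes[lab] >= min_size]
    dropped = [i for i, lab in enumerate(labels) if sizes[lab] < min_size]
    return kept, dropped


# Toy labels: clusters "1" and "3" contain a single cell each.
labels = ["0", "0", "0", "1", "2", "2", "3"]
kept, dropped = split_by_cluster_size(labels)
```

With an AnnData object, the `kept` positions could then be used to subset the cells before calling `cc.tl.get_markers`, for example with `adata[kept].copy()`; that last step is an assumption about the workflow rather than something shown in this thread.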

Now I have to figure out what to do with the remaining 400 points. I am thinking of applying my manifold clustering on them and forcing them to have more than 5 cells per cluster. Only about 64 clusters have more than 5 cells in them. The remaining 400+ clusters only contain one cell.

I will keep you updated. Again, my method does not have hyperparameters such as "k" or "nearest neighbors" or graphs. And cytocipher is better than scSHC so far.

BradBalderson commented 7 months ago

Hey @alifarhat40, sorry for the late reply:

"If I duplicate the same data point (i.e. sample cell) three times within my clusters that only have one data point/sample, would that mess with your statistical tests?"

Yes, it would: the duplicated values have zero variance, so the standard error is zero, and since the standard error is in the denominator of the t-statistic, the output would be infinite.
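A small sketch of that effect, using a hypothetical score value and a one-sample style comparison purely for illustration (this is not the exact test Cytocipher runs):

```python
import math
import statistics


def standard_error(values):
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    return statistics.stdev(values) / math.sqrt(len(values))


cell_score = 1.7                 # hypothetical score of the single cell
duplicated = [cell_score] * 3    # the same cell copied three times
se = standard_error(duplicated)  # identical values have no spread

try:
    t = (statistics.fmean(duplicated) - 1.0) / se
except ZeroDivisionError:
    # Plain Python raises here; NumPy float division would return inf
    # (with a warning) instead.
    t = math.inf
```

Duplicating the cell adds no information about variability, so the statistic is no longer meaningful however many copies are made.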

Perhaps, for the clusters that have only one cell, you could merge each of them with its 3 nearest neighbours by some distance metric?

Also, the second result looks neat. :)

alifarhat40 commented 7 months ago

@BradBalderson Thanks!

I cluster in PCA space, so I guess I can try a k-nearest-neighbor merge for the clusters with just a single cell in them. I believe each cluster needs at least 10 cells in it. Do you have a recommendation for which method I can use to perform this merging for the clusters with fewer than 10 cells?

BradBalderson commented 7 months ago

In scanpy, you can construct nearest neighbour graphs with:

sc.pp.neighbors(data, n_neighbors=3)

The neighbourhood information is stored somewhere in data.uns, I am not 100% sure where, but I suppose cells that are alone in their cluster could perhaps be merged with their neighbours?
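As far as I know, recent scanpy versions keep the graph itself in `data.obsp["distances"]` and `data.obsp["connectivities"]`, while `data.uns["neighbors"]` holds the parameters. The merging idea can also be sketched without scanpy. The example below is only an illustration under stated assumptions: `merge_singletons` is a hypothetical helper, the coordinates stand in for PCA embeddings, and a brute-force Euclidean search replaces the neighbour graph, so it is only practical for small inputs.

```python
import math
from collections import Counter


def merge_singletons(coords, labels, k=3):
    """Reassign each single-cell cluster to the majority label of its k
    nearest cells that belong to clusters with more than one cell."""
    sizes = Counter(labels)
    anchors = [i for i, lab in enumerate(labels) if sizes[lab] > 1]
    merged = list(labels)
    if not anchors:
        return merged  # nothing to merge into
    for i, lab in enumerate(labels):
        if sizes[lab] > 1:
            continue
        # Brute-force search over all cells in multi-cell clusters.
        nearest = sorted(
            anchors, key=lambda j: math.dist(coords[i], coords[j])
        )[:k]
        votes = Counter(labels[j] for j in nearest)
        merged[i] = votes.most_common(1)[0][0]
    return merged


# Toy 2-D coordinates standing in for PCA space.
coords = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),  # cluster "A"
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),  # cluster "B"
    (0.2, 0.2),                          # single cell near "A"
    (4.8, 4.9),                          # single cell near "B"
]
labels = ["A", "A", "A", "B", "B", "B", "s1", "s2"]
merged = merge_singletons(coords, labels)
```

This only handles clusters with exactly one cell. Clusters with a few cells would need a cluster-level rule, for example merging on centroid distance, which is not covered here.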

alifarhat40 commented 7 months ago

Okay I will give that a try.

alifarhat40 commented 7 months ago

Thanks for the suggestions. Will close this for now.

BradBalderson / Cytocipher, issue #3