CalculateClusterEnrichment

GWMcElfresh commented 6 months ago

Hi everyone,

I saw an approach of using non-parametric tests for proportions that I think strikes the right balance for "enrichment" that we normally mean colloquially. The crux of the idea is that SubjectId is sampled independently of clustering (mostly true, though subject specific recovery violates this a little), so by testing vaccine 1 proportions against vaccine 2 proportions, we can test the two vectors of proportions for "treatment/vaccine" differences.

~~I'll edit this post with some plots as I add them~~ edit: plots added in comment below.

Best regards, GW

GWMcElfresh commented 6 months ago

While I coded this up to support pairwise tests, I suspect this will mostly be used for non-parametric anova. This could stand to ~~have a companion function that does~~ do a post hoc test for specific pairs of treatmentField. I think the Conover-Iman test from conover.test is probably fine for this. It's is on CRAN and will compute all pairwise tests.

edit: I updated the function to iterate through each cluster that's significant by the Kruskal-Wallis test. It will print a tile plot (with columns & rows populated with treatmentField variables).

~~Right now, the color is determined by the p-value, but I think plotting the T statistic like the author does in conover.test() is more helpful~~ (updated below)

GWMcElfresh commented 6 months ago

This function now, if a there are multiple levels per cluster, will run a Kruskal-Wallis test followed by a Conover-Iman post hoc test if the Kruskal-Wallis was significant.

The Conover-Iman will populate T statistics and p values for each cluster and print a plot like the one below to help interpret which elements of treatmentField the cluster is enriched for. Here, it's (approximately) RhCMV-TB/9Ag < RhCMV-TB/6Ag = IV-BCG < Unvax < Intradermal BCGs = RhCMV/Gag.

You'll end up getting a lot of these plots (at least I did with cohorts of n = 16), but this adds a lot of oomph to "this cluster is important because of how biased the clusters on the UMAP looks"-based approaches.

Cluster enrichment: Clusters with any enrichment:

I think this implementation (printing cluster enrichment plots only when post-hoc test is required) strikes the right the balance between returning a seurat object versus showing the cluster enrichment plots (although, this can take awhile to print the highlighted DimPlot). The alternative is returning a non-seurat object though, which could nuke a user's Seurat object accidentally. You'd want something simple (p values for each cluster) if your Seurat object only contained two levels of treatmentField.

As long as the colorspace import works fine, I think this is good to go.

GWMcElfresh commented 6 months ago

I also just added a check that ensures the characters " - ", which are used to determine Group1 and Group2 in the post-hoc test, throw an error if they appear in treatmentField and ask the user to change it.

bimberlabinternal / CellMembrane

CalculateClusterEnrichment #239