bimberlabinternal / CellMembrane

An R package with wrappers and pipelines for single cell RNA-seq analysis
10 stars 3 forks source link

ADT Triage and Classification #263

Open GWMcElfresh opened 3 months ago

GWMcElfresh commented 3 months ago

Hi all,

This PR builds off of @gboggy2 's recent work with triaging ADTs and builds up a basic skeleton for how we might get at this.

The core of this function is using multimode:locmodes() to find modes and antimodes within ADTs, but as Greg has identified, it's more sensitive to mixture distributions than we'd like. It follows up with an autocorrelation-based approach, exploiting the higher level of symmetry of unimodal distributions as you lag them compared to multimodal distributions

This function doesn't yet do any classification, but it implements (most of) Greg's triaging. As is, it returns a dataframe with a boat-load of distribution summary statistics so we can get a sense of what's happening at scale.

I ran this on a mixed cell type RIRA, with what I think was an ADT filter applied (518154). There were a few failure cases with locmodes() where it would return a mode and some NAs. Greg, these failures broke the for loop where you were incrementing the mod0 parameter to find more and more modes, so I left that as a TODO for you to do battle with.

I did some basic infrastructure overhauling to make this run faster/parallelizable, but once I moved around some Seurat manipulation it was fast enough on full RIRA.

These are all using Greg's original thresholds on what I think was RIRA T_NK. The thresholds may need to react to the ADT they're working on (or not?)

TODO:

for the 2nd TODO: it may be that we enforce a hierarchy like: triageList = list("TNK" = c("CD4", "CD8"), "Macrophage" = c("CD68"), "Bcell" = c("CD20")) and then parse the tree:

$TNK
[1] "CD4" "CD8"

$Macrophage
[1] "CD68"

$Bcell
[1] "CD20"

Here are some plots that I thought might be helpful and accompanying ggplot code:

#usage
df <- triageADTsAndClassifyCells(seuratObj, 'CD4', plots = T)
df %>% 
  ggplot(aes(x = call, y = decay, color = call)) +
  geom_point() + 
  egg::theme_article() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)) + 
  ggtitle("Decay Parameter")

image

df %>% 
  pivot_longer(cols = c(peak1_location , peak2_location , antimode_location), names_to = "peak", values_to = "value") %>%
  mutate(peak = factor(peak, levels = c("peak1_location", "antimode_location", "peak2_location"))) %>% 
  ggplot(aes(x = peak, y = value, color = call)) +
  geom_point(position = position_jitter()) + 
    egg::theme_article() +
  ggtitle("Location of Modes and Antimodes") + 
    theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)) 

image

df %>% 
  pivot_longer(cols = c(peak1_density , peak2_density , antimode_density), names_to = "peak", values_to = "value") %>%
  mutate(peak = factor(peak, levels = c("peak1_density", "antimode_density", "peak2_density"))) %>% 
  ggplot(aes(x = peak, y = value, color = call)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1, vjust = 1)) +
  geom_point(position = position_jitter()) + 
    egg::theme_article() +
  ggtitle("Density of Modes and Antimodes")

image

GWMcElfresh commented 3 months ago

Recording this beyond the ID meeting:

Greg's current heurstic thresholds are based on a strategy of 1. get the counts slot -> 2. asinh() -> 3. CLR().

I think this solves some problems and causes others, but once we're pretty satisfied, this flag should probably default to TRUE.

https://github.com/bimberlabinternal/CellMembrane/blob/b44f32937229cf84556435815605ae104ecbbbd0/R/Utils.R#L315

gboggy2 commented 1 month ago

Recording this beyond the ID meeting:

Greg's current heurstic thresholds are based on a strategy of 1. get the counts slot -> 2. asinh() -> 3. CLR().

I think this solves some problems and causes others, but once we're pretty satisfied, this flag should probably default to TRUE.

https://github.com/bimberlabinternal/CellMembrane/blob/b44f32937229cf84556435815605ae104ecbbbd0/R/Utils.R#L315

Based on review of failure cases, my current preference is to not apply asinh() transformation to the data.

GWMcElfresh commented 1 month ago

yeah, I'm on board for the internal workings.

Upstream of this function, what are we thinking about cell typing prior to ADT quality checking (the first two boxes of the checklist above)?

We could either bake in some Seurat processing into this (e.g. provide an optional whitelist of cell types and subset) or not.

I can't precisely recall, but did the RIRA objects you were using to test include bulk cells or just T/NKs?

gboggy2 commented 1 month ago

yeah, I'm on board for the internal workings.

Upstream of this function, what are we thinking about cell typing prior to ADT quality checking (the first two boxes of the checklist above)?

We could either bake in some Seurat processing into this (e.g. provide an optional whitelist of cell types and subset) or not.

I can't precisely recall, but did the RIRA objects you were using to test include bulk cells or just T/NKs?

Just T/NKs.