adw96 / breakaway

Species richness with high diversity
68 stars 18 forks source link

Plug-in estimates and breakaway estimates are the same #174

Closed jorondo1 closed 2 years ago

jorondo1 commented 2 years ago

Hi @adw96 !

I thought I posted this already but I can't find my issue anymore, sorry in advance if I double-posted. The required command outputs are here.

I am running the STAMPS2022 breakaway tutorial on my data and the comparison between breakaway adjustment and plug-in richness yields identical plots, even though I have a significant (orders-of-magnitude) differences in sequencing depth between my two sample types (Saliva and Feces),

observed_phyloseq <- sample_richness(psObject)
data.frame("observed_richness" = (observed_phyloseq %>% summary)$estimate,
           "depth" = phyloseq::sample_sums(psObject),
           "type" = psObject %>% sample_data %>% get_variable("compart")) %>%
  ggplot(aes(x = depth, y = observed_richness, color = type)) + geom_point()

Rplot02

ba <- breakaway(psObject, cutoff=Inf) # omitting cutoff produces the same results
p1 <- plot(ba, psObject, color = "compart")
p2 <- plot(observed_phyloseq, psObject, color = "compart")
grid.arrange(p2, p1, nrow = 2)

Rplot03

I do get a warning after running the breakaway function, but from reading previously posted issues it doesn't seem to be cause for alarm:

There were 50 or more warnings (use warnings() to see the first 50)
> warnings()
Warning messages:
1: In cutoff_wrap(my_data, requested = cutoff) :
  ignoring requested cutoff; it's too low
2: In cutoff_wrap(my_data, requested = cutoff) :
  ignoring requested cutoff; it's too low

And the contents of the two objects look like this:

> ba
$ GQ56
Estimate of richness from method breakaway:
  Estimate is 389
 with standard error 0
  Confidence interval: (NaN, NaN)
  Cutoff:  Inf

> observed_phyloseq
$ GQ56
Estimate of richness from method Plug-in:
  Estimate is 389
 with standard error 0
  Confidence interval: (389, 389)

Can you help me figure this out ?

Thanks in advance

adw96 commented 2 years ago

Thanks so much @jorondo1 -- I have some suspicions about what's happening. Two questions:

  1. How are you producing your count matrix? Are you filtering singletons?
  2. Could you please provide some example frequency count tables?

I think I can diagnose given the above information.

jorondo1 commented 2 years ago

The counts matrix is a Sourmash output. Here is a subset of the phyloseq object I am using as input. I have not done any filtering (except here for this subset), the counts are as-is. It looks like this:


             GQ1   GQ2  GQ32  GQ33   GQ3  GQ34  GQ35   GQ4  GQ36  GQ37   GQ5   GQ6  GQ11  GQ12  GQ40
 1 [Eubac…   130     0     0     0     0     0     0     0     0     0     0     0     0     0     0
 2 [Eubac…  1115 10007     0     0  7426     0   109   251     0     0 12743  9898     0    80     0
 3 [Eubac…   169   111     0     0     0     0     0     0     0     0     0     0     0     0     0
 4 Actino…   287   214     0     0   198     0     0     0     0     0     0   211   137   857     0
 5 Actino…   323   182     0     0     0     0     0     0     0     0     0     0     0   304     0
 6 Actino…    75     0     0     0     0     0     0     0     0     0     0     0     0   359     0
 7 Actino…  2333 28694     0     0  8682     0     0  2403     0     0  7657  9792   141    64     0
 8 Actino…   562   171     0     0   196     0     0     0     0     0   213     0    70   215     0
 9 Actino…  3796  1403     0     0  3390     0     0   559     0     0   157   124  1215  1414     0
10 Actino…   860  2216     0     0  3446     0     0    73     0     0  1970   967  1106  1233     0

Thanks a lot for your time!

adw96 commented 2 years ago

Interesting -- it looks like you have very few biological units that are observed infrequently. What breakaway is (reasonably) inferring from this is that there are few unobserved biological units, which is why breakaway's estimates are the same as your plugin estimates. Basically if your data structure doesn't suggest that you have anything that's rare, it will predict that you have nothing missing! This is definitely the intended behavior of breakaway, so I'm going to close it as an issue. I hope that helps answer your question!

That said, given the extremely strong correlation between depth and observed richness, if you wanted to fit a model for richness adjusting for depth, you might consider fitting a model like lm(sample_richness ~ your_covariates + depth). I don't recommend this often but given that you don't have the ability to estimate the # of rare units (min hash sketches?), this might be the best you can do in this setting.

Next time I chat with Taylor and Titus I might ask them about whether sourmash can detect rare/low abundance min hash sketches, since I'm ignorant on this.