adw96 / breakaway

Species richness with high diversity

Multiple richness model types in betta #123

Closed msmcfarlin closed 2 years ago

msmcfarlin commented 3 years ago

Hello!

I have the same question as discussed in issue #98, though there wasn't a response to that post.

When running breakaway() on my data set, I get a mixture of models. 12 of 37 samples are fitted with Kemp models, 24/37 fitted with Negative Binomial, and one sample is fitted with a Poisson model. I am interested in richness among different groups and will be using betta() for these comparisons. Will the combination of models impact analysis with betta?

Thank you!

ailurophilia commented 3 years ago

Hi @msmcfarlin,

This is a fairly difficult question to answer because fundamentally there are no guarantees when it comes to species richness estimation (with any model – this isn't specific to breakaway()). Without making fairly strong assumptions about unobserved characteristics of a population, there's no way to upper bound the true species richness in that population.

Intuitively, we run into this problem because it is always at least theoretically possible that there are any number of species we have failed to observe in our sample because they account for some vanishingly small proportion of the total organisms in our population of interest. In other words, species richness is a diversity measure that gives equal weight to species accounting for one tenth, one ten-thousandth, one ten-millionth, etc. of the organisms in a target population, and so to estimate total species diversity, we have to make assumptions about how many rare species exist (and different assumptions give us different estimates).
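To put rough numbers on this, here is a minimal Python sketch (not part of breakaway, which is an R package). It assumes reads are drawn independently, so a species with relative abundance p is missed by all n reads with probability (1 - p)^n. The function name `prob_unobserved` and the read depth of 10,000 are hypothetical choices for illustration only.

```python
import math


def prob_unobserved(relative_abundance: float, n_reads: int) -> float:
    """Chance that a species contributes zero of n_reads reads,
    assuming every read is an independent draw from the population."""
    # log1p keeps precision when relative_abundance is very small
    return math.exp(n_reads * math.log1p(-relative_abundance))


n_reads = 10_000  # hypothetical sequencing depth, not taken from the thread
for p in (1e-1, 1e-4, 1e-7):
    print(f"abundance {p:.0e}: P(unobserved) = {prob_unobserved(p, n_reads):.4f}")
```

Under these assumptions, a species at one tenth of the population is essentially never missed, one at one ten-thousandth is missed roughly 37% of the time, and one at one ten-millionth is missed about 99.9% of the time, which is why the data alone say little about how many such species exist.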

With all that said, breakaway() was developed to work well in realistic settings, and you can peruse the breakaway paper for some conditions under which breakaway outperforms other common diversity estimation methods. However, because of the structure of the species richness estimation problem, it is not really possible to give general statistical guarantees for this or any other estimation method.

All of this is to say that differing diversity models being fit to your data could cause an issue if some of the models are wrong, but there are more general considerations to take into account here regarding statistical guarantees. However, if you're concerned about this, you could consider a sensitivity analysis: use some of the model-specific functions (e.g., kemp(), objective_bayes_poisson(), etc.) to fit diversity models to your data and see how much difference this makes in your downstream betta() analysis.
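The overall shape of such a sensitivity analysis can be sketched in Python. This is only an illustration under stated assumptions: every number below is made up, and `group_difference_test` is a hypothetical helper that runs a simple fixed-effect (inverse-variance weighted) two-group comparison. It is a stand-in for betta(), not a reimplementation of it; as far as I understand, betta() additionally models variation between samples. In a real analysis the estimates and standard errors would come from breakaway's model functions in R.

```python
import math


def group_difference_test(estimates, std_errors, groups):
    """Fixed-effect comparison of mean richness between two groups.
    Returns (difference, standard error, two-sided p-value)."""
    labels = sorted(set(groups))
    if len(labels) != 2:
        raise ValueError("expected exactly two groups")
    means, variances = [], []
    for label in labels:
        # Weight each sample by the inverse of its squared standard error
        pairs = [(est, 1.0 / se**2)
                 for est, se, g in zip(estimates, std_errors, groups)
                 if g == label]
        total_weight = sum(w for _, w in pairs)
        means.append(sum(est * w for est, w in pairs) / total_weight)
        variances.append(1.0 / total_weight)
    diff = means[1] - means[0]
    se = math.sqrt(variances[0] + variances[1])
    # Two-sided normal p-value: 2 * (1 - Phi(|z|)) == erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(diff) / se / math.sqrt(2))
    return diff, se, p_value


groups = ["A", "A", "A", "B", "B", "B"]
# Each scenario: (richness estimates, standard errors). All values are invented.
scenarios = {
    "default model selection": ([120, 135, 128, 180, 175, 190],
                                [10, 12, 11, 15, 14, 16]),
    "two samples re-fit with another model": ([120, 131, 128, 180, 169, 190],
                                              [10, 14, 11, 15, 17, 16]),
}

results = {}
for name, (chats, ses) in scenarios.items():
    results[name] = group_difference_test(chats, ses, groups)
    diff, se, p = results[name]
    print(f"{name}: difference = {diff:.1f} (SE {se:.1f}), p = {p:.2g}")
```

If the sign of the difference and the significance call agree across scenarios, the group-level conclusion is not sensitive to which model was fitted to which sample; if they disagree, the conclusion depends on the modelling choice and deserves more caution.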

I hope this helps! Also tagging @adw96 here because I am definitely not the expert on species richness estimation and I may certainly have missed something here!

Best, David

P.S. I also think it's worth thinking through whether / in what way you care about species richness – i.e., is it important to your scientific goals to understand how many species compose very small proportions (1e-6, 1e-8, etc.) of the biomass in your population? (The answer could of course be yes – but nonetheless still worth a thought imo.)

adw96 commented 3 years ago

Great answer, @ailurophilia! My short and slightly more optimistic version would be: yes, this combination of models may impact the results of your analysis, but I think that this approach is the best way forward out of available approaches -- I'd recommend against using only negative binomial models, and would trust a Kemp model estimate more if it exists.

Also, another sensitivity analysis I'd recommend is to consider adjusting cutoff, especially for the lower complexity models (Poisson and NB).

Thanks both!

msmcfarlin commented 3 years ago

Hi @ailurophilia and @adw96,

Sorry for the delayed response and thank you for all the information!

@ailurophilia, regarding your question about my goals...

...is it important to your scientific goals to understand how many species compose very small proportions (1e-6, 1e-8, etc.) of the biomass in your population?

It is important to me to know if there are differences between study groups. That being said, I am working with fecal samples and I do not expect there to be large differences in the number of species in small proportions of biomass. That being the case, would you recommend a different approach than breakaway? Or a different implementation from the one I am currently using?

@adw96, regarding a sensitivity analysis adjusting cutoff for Poisson and NB models, are there particular values you would suggest running for the cutoff? Would the mean or median cutoff values from all NB/Poisson models make any sense?

I tried a few different iterations where I kept the Kemp estimates from the initial breakaway run, then replaced the NB and Poisson estimates with new NB/Poisson estimates made at different cutoffs. After comparing study groups in betta, I see the same significance between groups that I did before, when each NB/Poisson model had its own cutoff value.

Thank you so much for your assistance!

ailurophilia commented 3 years ago

Hi @msmcfarlin,

I wouldn't suggest a different approach if you're set on richness as your outcome. In that case, I think a breakaway + betta analysis with the sensitivity analysis it sounds like you're already doing is probably the best choice.

Best, David