Closed boulund closed 1 year ago
I want to use the tree from GTDB as the reference tree for all diversity measurements that require phylogenetic information. This might require considering alternative databases for some tools, or producing remappings from their db sequences to the GTDB tree.
Maybe we could also consider using e.g. taxonkit, or maybe the taxonomy tools in BBTools?
I've prototyped all the required code to solve a basic implementation of this now, ignoring any phylogenetic measures:
- Alpha diversity (chao1, simpson, shannon, etc.).
- Beta diversity (braycurtis).

Started a branch, alpha-beta-diversity, to deal with this. Since we're going to need count information in order to use scikit-bio stuff for diversity calculations this entire issue is now entirely dependent on #49 before we can proceed.
@jwdebelius said:
For alpha diversity, do we want to do depth normalization (I'm not sure how with relative abundance), or should we look at something like Breakaway or do a rarefaction curve? Picking a normalization depth seems like something that's harder to do programmatically, so maybe something to discuss more?
We talked about running Chao1, Simpson, and Shannon diversity metrics as part of standard run QC on the Bracken abundances.
For beta-diversity, if we want that, we could consider Bray-Curtis or maybe Jensen-Shannon? Aitchison distance might be interesting, but I'm not sure if it's implemented in scikit-bio.
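To make the three candidate distances concrete, here is a minimal sketch in plain Python (standard library only, not scikit-bio). The function names and the `pseudocount` parameter are illustrative assumptions. Aitchison distance is shown as the Euclidean distance between CLR-transformed vectors, which needs a pseudocount because shotgun count tables contain zeros.

```python
import math


def bray_curtis(u, v):
    """Bray-Curtis dissimilarity between two count vectors of equal length."""
    num = sum(abs(a - b) for a, b in zip(u, v))
    den = sum(a + b for a, b in zip(u, v))
    return num / den if den else 0.0


def jensen_shannon(u, v):
    """Jensen-Shannon distance: square root of the JS divergence (log base 2)."""
    total_u, total_v = sum(u), sum(v)
    p = [a / total_u for a in u]
    q = [b / total_v for b in v]
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with x == 0 contribute nothing to the divergence.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    divergence = (kl(p, m) + kl(q, m)) / 2
    # Guard against tiny negative values caused by floating-point rounding.
    return math.sqrt(max(0.0, divergence))


def aitchison(u, v, pseudocount=1.0):
    """Euclidean distance between CLR-transformed vectors.

    The pseudocount (an assumption of this sketch) replaces zeros, which
    the log transform cannot handle.
    """
    def clr(x):
        logs = [math.log(a + pseudocount) for a in x]
        mean = sum(logs) / len(logs)
        return [value - mean for value in logs]

    return math.dist(clr(u), clr(v))
```

Bray-Curtis works directly on counts, Jensen-Shannon normalizes to proportions internally, and the Aitchison result depends on the chosen pseudocount, so that choice would need to be discussed before automating it.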
Hello,
We are also interested in running alpha diversity on Bracken estimated abundances.
Since the Chao1 metric considers singletons separately, the Bracken threshold should be set to 1, am I right? And are there any other precautions or pitfalls that should be accounted for when calculating Shannon, Simpson, or Chao1 on Bracken output?
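For context on why singletons matter here: the bias-corrected Chao1 estimator is S_obs + F1(F1 - 1) / (2(F2 + 1)), where F1 and F2 are the numbers of taxa seen exactly once and exactly twice. The sketch below is plain Python, and `apply_min_count` is a hypothetical stand-in for any minimum-read filter, not Bracken's actual thresholding logic. It shows that such a filter removes the very terms Chao1 relies on.

```python
def chao1(counts):
    """Bias-corrected Chao1: S_obs + F1 * (F1 - 1) / (2 * (F2 + 1))."""
    observed = sum(1 for c in counts if c > 0)
    singletons = sum(1 for c in counts if c == 1)
    doubletons = sum(1 for c in counts if c == 2)
    return observed + singletons * (singletons - 1) / (2 * (doubletons + 1))


def apply_min_count(counts, threshold):
    """Hypothetical filter: zero out taxa with fewer than `threshold` reads."""
    return [c if c >= threshold else 0 for c in counts]


# Made-up example counts for one sample (not real Bracken output).
counts = [120, 45, 9, 2, 2, 1, 1, 1]

print(chao1(counts))                       # uses 3 singletons, 2 doubletons
print(chao1(apply_min_count(counts, 10)))  # collapses to observed richness
```

With any filter above 2 reads, F1 and F2 are both zero and Chao1 reduces to plain observed richness.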
@jwdebelius or @luhugerth do you have any comments on this? I'm guessing you might have some experience computing alpha diversity measures based on Bracken output.
In general, I think computing diversity measures that assume the data represents proper organism counts on shotgun data quantified with fragmented sequencing data is tricky. That said, I think having functionality in StaG to automatically compute the simple alpha diversity measures for all samples would be good to have.
@kcivkulis If you make an implementation to compute diversity measures, please consider sharing!
I'm not a huge fan of Chao1 to begin with for microbiome data, so take my answer with a grain of salt. I think if you're going to do it, you should do straight richness - observed X.
When I've done it in the past, I've simply rarefied the data to an even depth and calculated from there. (But I'm still kind of a marker gene person at heart.) That depth issue with pure richness makes it hard for me to imagine how you'd do pure richness automatically in something like StaG; you almost need to be able to look at your data and make a judgement before you run the analysis. Shannon and Simpson are both more robust to both varying sequencing depth and annotation methods, so they might be easier to automate.
If you're looking for an implementation, I suspect vegan has a few (although I'm not an R person) and scikit-bio has a python version of most of the main metrics that could be shoehorned into something.
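Rarefying to an even depth, as described above, can be sketched in plain Python as subsampling reads without replacement. The function name `rarefy` and the `seed` parameter are illustrative assumptions. The `depth` argument is exactly the value that is hard to pick automatically.

```python
import random


def rarefy(counts, depth, seed=0):
    """Subsample `depth` reads without replacement from integer taxon counts."""
    total = sum(counts)
    if depth > total:
        raise ValueError(f"depth {depth} exceeds sample total {total}")
    # Expand the counts into one taxon index per read, then draw reads.
    # This is simple but memory-hungry for samples with millions of reads.
    reads = [i for i, c in enumerate(counts) for _ in range(c)]
    rng = random.Random(seed)
    rarefied = [0] * len(counts)
    for i in rng.sample(reads, depth):
        rarefied[i] += 1
    return rarefied
```

Samples whose total is below the chosen depth raise an error here; a pipeline would have to decide whether to drop such samples or lower the depth.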
Unlike Justine, I do like Chao1 for 16S... But I don't think it makes much sense for Bracken. Then we're definitely doing statistics on noise. So +1 on doing Shannon's, inverse Simpson's and observed species.
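The three measures suggested here are short enough to sketch in plain Python (standard library only; the function names are illustrative, and the natural log is an arbitrary choice for Shannon). Shannon and inverse Simpson normalize to proportions internally, so they give the same result on counts and on relative abundances.

```python
import math


def observed_taxa(counts):
    """Richness: number of taxa with a non-zero abundance."""
    return sum(1 for c in counts if c > 0)


def shannon(counts):
    """Shannon entropy H = -sum(p_i * ln(p_i)) over non-zero taxa."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def inverse_simpson(counts):
    """Inverse Simpson index 1 / sum(p_i ** 2)."""
    total = sum(counts)
    return 1 / sum((c / total) ** 2 for c in counts)
```

Observed richness, by contrast, depends directly on sequencing depth and on any abundance filtering applied upstream.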
KrakenTools now comes with scripts to compute Alpha and Beta diversity from kraken reports. The scripts are available in the workflow/scripts/KrakenTools subfolder if anyone wants to use them. I am not planning to let StaG compute these automatically at this time, unless there still is a strong interest.
We should consider adding a rule or two to automatically compute alpha diversity and maybe do some simple beta-diversity plots on all input samples, as that is often requested by people.
I'm thinking the best way to implement this is to use scikit-bio and write a few custom Python scripts.
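As a starting point for such a script, here is a hedged sketch of the input side: collecting per-sample Bracken tables into a samples-by-taxa count matrix. It assumes Bracken's tab-separated output with `name` and `new_est_reads` columns; the function names are hypothetical. As far as I know, a matrix with samples as rows is the layout that scikit-bio's diversity functions take, but that call is left out here.

```python
import csv
import io


def read_bracken_counts(handle, column="new_est_reads"):
    """Read one Bracken table (tab-separated, with header) into {taxon: count}."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["name"]: int(row[column]) for row in reader}


def build_count_matrix(samples):
    """Turn {sample_id: {taxon: count}} into (sample_ids, taxa, matrix).

    Rows are samples and columns are taxa; missing taxa are filled with 0.
    """
    sample_ids = sorted(samples)
    taxa = sorted({taxon for counts in samples.values() for taxon in counts})
    matrix = [[samples[s].get(t, 0) for t in taxa] for s in sample_ids]
    return sample_ids, taxa, matrix


# Made-up two-line table standing in for a real Bracken output file.
example = io.StringIO(
    "name\ttaxonomy_id\ttaxonomy_lvl\tkraken_assigned_reads\t"
    "added_reads\tnew_est_reads\tfraction_total_reads\n"
    "Escherichia coli\t562\tS\t90\t10\t100\t0.5\n"
    "Bacteroides fragilis\t817\tS\t95\t5\t100\t0.5\n"
)
print(read_bracken_counts(example))
```

In a workflow rule, `handle` would be an open file per sample, and the resulting matrix could be passed on to whichever diversity implementation is chosen.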