ctmrbio / stag-mwc

StaG Metagenomic Workflow Collaboration
https://stag-mwc.readthedocs.org
MIT License

Compute Alpha and Beta diversity #47

Closed boulund closed 1 year ago

boulund commented 6 years ago

We should consider adding a rule or two to automatically compute alpha diversity and maybe do some simple beta-diversity plots on all input samples, as that is often requested by people.

I'm thinking the best way to implement this is to use scikit-bio and write a few custom Python scripts.

boulund commented 6 years ago

I want to use the tree from GTDB as the reference tree for all diversity measurements that require phylogenetic information. This might require considering alternative databases for some tools, or producing remappings from their database sequences to the GTDB tree.

boulund commented 6 years ago

Maybe we could also consider using e.g. taxonkit, or maybe the taxonomy tools in BBTools?

boulund commented 6 years ago

I've prototyped all the required code to solve a basic implementation of this now, ignoring any phylogenetic measures:

boulund commented 5 years ago

Started a branch, alpha-beta-diversity, to deal with this. Since we're going to need count information in order to use scikit-bio for diversity calculations, this issue now depends entirely on #49 before we can proceed.

boulund commented 4 years ago

@jwdebelius said:

For alpha diversity, do we want to do depth normalization (I'm not sure how with relative abundance), or should we look at something like Breakaway or do a rarefaction curve? Picking a normalization depth seems harder to do programmatically, so maybe that's something to discuss more?

boulund commented 4 years ago

We talked about running Chao1, Simpson, and Shannon diversity metrics as part of standard run QC on the Bracken abundances.
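
To make the Shannon and Simpson part concrete, here is a minimal pure-Python sketch. It assumes the input is a list of per-taxon read counts for one sample; the `sample` values are made up, and the log base (2 here) is a choice, not something fixed by this thread. scikit-bio ships its own versions of these metrics, which would likely be the better choice inside the workflow.

```python
import math


def shannon(counts, base=2):
    """Shannon entropy of a count vector; zero counts are ignored."""
    total = sum(counts)
    props = [c / total for c in counts if c > 0]
    return -sum(p * math.log(p, base) for p in props)


def simpson(counts):
    """Simpson's index expressed as 1 - sum(p_i^2)."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)


# Hypothetical per-taxon read counts for a single sample
sample = [40, 30, 20, 10]
print(round(shannon(sample), 3))
print(round(simpson(sample), 3))
```

Both functions normalize to proportions first, so they work equally well on estimated read counts or on relative abundances.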

For beta diversity, if we want that, we could consider Bray-Curtis or maybe Jensen-Shannon? Aitchison distance might be interesting, but I'm not sure if it's implemented in scikit-bio.
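
For reference, both Bray-Curtis and Aitchison are simple enough to sketch without any library. The version below is only an illustration: it assumes two samples given as count vectors over the same ordered set of taxa, and the `pseudocount` used to handle zeros in the log-ratio transform is an assumption of this sketch, not an agreed-upon choice.

```python
import math


def bray_curtis(a, b):
    """Bray-Curtis dissimilarity: sum|a_i - b_i| / sum(a_i + b_i)."""
    num = sum(abs(x - y) for x, y in zip(a, b))
    den = sum(x + y for x, y in zip(a, b))
    return num / den


def clr(counts, pseudocount=1.0):
    """Centered log-ratio transform; the pseudocount avoids log(0)."""
    logs = [math.log(c + pseudocount) for c in counts]
    mean = sum(logs) / len(logs)
    return [v - mean for v in logs]


def aitchison(a, b, pseudocount=1.0):
    """Aitchison distance: Euclidean distance between CLR vectors."""
    ca, cb = clr(a, pseudocount), clr(b, pseudocount)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(ca, cb)))


# Hypothetical counts for two samples over the same three taxa
s1 = [10, 0, 5]
s2 = [5, 5, 5]
print(bray_curtis(s1, s2))
print(aitchison(s1, s2))
```

Bray-Curtis on raw counts is sensitive to sequencing depth, so in practice the vectors would probably be converted to relative abundances (or rarefied) first.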

kcivkulis commented 3 years ago

Hello,

We are also interested in running alpha diversity on Bracken estimated abundances.

Since the Chao1 metric considers singletons separately, the Bracken threshold should be set to 1, am I right? And are there any other precautions or pitfalls that should be accounted for when calculating Shannon, Simpson, or Chao1 on Bracken output?
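
To illustrate why the threshold matters for Chao1, here is a small sketch. It uses the bias-corrected form of the estimator, S_obs + F1(F1 - 1) / (2(F2 + 1)), where F1 and F2 are the numbers of singleton and doubleton taxa. The `apply_threshold` helper and its `min_reads` cutoff are hypothetical stand-ins for a read-count filter and do not call Bracken itself; the counts are made up.

```python
def chao1(counts):
    """Bias-corrected Chao1: S_obs + F1*(F1-1) / (2*(F2+1))."""
    observed = sum(1 for c in counts if c > 0)
    singletons = sum(1 for c in counts if c == 1)
    doubletons = sum(1 for c in counts if c == 2)
    return observed + singletons * (singletons - 1) / (2 * (doubletons + 1))


def apply_threshold(counts, min_reads):
    """Hypothetical filter: drop taxa with fewer than min_reads reads."""
    return [c for c in counts if c >= min_reads]


# Made-up per-taxon read counts with three singletons and two doubletons
counts = [10, 5, 2, 2, 1, 1, 1]
print(chao1(apply_threshold(counts, 1)))  # singletons kept
print(chao1(apply_threshold(counts, 2)))  # singletons removed
```

Once every singleton has been filtered out, the correction term is zero and Chao1 reduces to the observed richness of the filtered table, so any cutoff above 1 removes the information the estimator relies on.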

boulund commented 3 years ago

@jwdebelius or @luhugerth do you have any comments on this? I'm guessing you might have some experience computing alpha diversity measures based on Bracken output.

In general, I think it is tricky to compute diversity measures that assume the data represent proper organism counts when the input is shotgun data quantified from fragmented sequencing reads. That said, I think having functionality in StaG to automatically compute the simple alpha diversity measures for all samples would be good to have.

@kcivkulis If you make an implementation to compute diversity measures, please consider sharing!

jwdebelius commented 3 years ago

I'm not a huge fan of Chao1 to begin with for microbiome data, so take my answer with a grain of salt. I think if you're going to do it, you should do straight richness - observed X.

When I've done it in the past, I've simply rarefied the data to an even depth and calculated from there. (But I'm still kind of a marker gene person at heart.) Because of that depth issue, it's hard for me to imagine how you'd do pure richness automatically in something like StaG; you almost need to be able to look at your data and make a judgement before you run the analysis. Shannon and Simpson are both more robust to varying sequencing depth and annotation methods, so they might be easier to automate.
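
For anyone unfamiliar with the rarefy-then-count approach described above, a rough sketch follows. It assumes integer read counts per taxon; the `depth` and `seed` parameters and the example counts are hypothetical, and choosing `depth` is exactly the judgement call that is hard to automate.

```python
import random


def rarefy(counts, depth, seed=0):
    """Subsample `depth` reads without replacement; returns per-taxon counts."""
    if depth > sum(counts):
        raise ValueError("depth exceeds the sample's total read count")
    # Expand to one taxon index per read, then draw without replacement
    pool = [i for i, c in enumerate(counts) for _ in range(c)]
    rng = random.Random(seed)
    rarefied = [0] * len(counts)
    for i in rng.sample(pool, depth):
        rarefied[i] += 1
    return rarefied


def observed_richness(counts):
    """Number of taxa with at least one read."""
    return sum(1 for c in counts if c > 0)


# Made-up sample with 100 reads, rarefied to a hypothetical depth of 20
sample = [50, 30, 15, 4, 1]
subsample = rarefy(sample, depth=20)
print(subsample, observed_richness(subsample))
```

Expanding every read into a list is fine for a toy example but would be memory-hungry for real shotgun depths; a library routine that samples from the count vector directly would be preferable there.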

If you're looking for an implementation, I suspect vegan has a few (although I'm not an R person) and scikit-bio has a Python version of most of the main metrics that could be shoehorned into something.

luhugerth commented 3 years ago

Unlike Justine, I do like Chao1 for 16S... But I don't think it makes much sense for Bracken. Then we're definitely doing statistics on noise. So +1 on doing Shannon's, inverse Simpson's and observed species.
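
As a starting point for observed species and inverse Simpson, a sketch of reading a Bracken-style table could look like the following. It assumes a tab-separated table with `name` and `new_est_reads` columns, as in Bracken's species-level output; the example rows are made up, and the column names should be checked against the actual files before relying on this.

```python
import csv
import io


def read_bracken_counts(handle, column="new_est_reads"):
    """Map taxon name to estimated read count from a Bracken-style table."""
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["name"]: int(row[column]) for row in reader}


def observed_species(counts):
    """Number of taxa with at least one estimated read."""
    return sum(1 for c in counts if c > 0)


def inverse_simpson(counts):
    """Inverse Simpson index: 1 / sum(p_i^2)."""
    total = sum(counts)
    return 1.0 / sum((c / total) ** 2 for c in counts if c > 0)


# Made-up table; the header is assumed to match Bracken's output format
example = (
    "name\ttaxonomy_id\ttaxonomy_lvl\tkraken_assigned_reads\t"
    "added_reads\tnew_est_reads\tfraction_total_reads\n"
    "Species A\t1\tS\t40\t10\t50\t0.5\n"
    "Species B\t2\tS\t25\t5\t30\t0.3\n"
    "Species C\t3\tS\t18\t2\t20\t0.2\n"
)

counts = list(read_bracken_counts(io.StringIO(example)).values())
print(observed_species(counts), round(inverse_simpson(counts), 3))
```

In the workflow, `io.StringIO(example)` would be replaced by an open file handle for each sample's Bracken output.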

boulund commented 1 year ago

KrakenTools now comes with scripts to compute Alpha and Beta diversity from kraken reports. The scripts are available in the workflow/scripts/KrakenTools subfolder if anyone wants to use them. I am not planning to let StaG compute these automatically at this time, unless there still is a strong interest.