alpha/beta diversity measures

abremges commented 7 years ago

From David's email, some of which to be implemented:

I did find a nice summary of alpha/beta diversity metrics that are important to biologists (imported from the 16S realm): https://www.drive5.com/usearch/manual/alpha_diversity.html https://www.drive5.com/usearch/manual/alpha_metrics.html https://www.drive5.com/usearch/manual/beta_diversity.html https://www.drive5.com/usearch/manual/beta_metrics.html

As well as a very nice math paper that shows how lots of these metrics are related to each other (and can be calculated in a straightforward way in a unified framework): https://www.dropbox.com/s/0nts9ypq5cyorv2/Leinster%202014%20Measuring%20diversity%20the%20importance%20of%20species%20similarity.pdf?dl=0

Given another couple days or so, I'll be able to get a ranked list of important diversity metrics to include, and will look for implementations of them (unless one of you has already done this).

(As I found out the hard way, Bray Curtis is simply 1/2 our L1 Norm, thus we kind of have it already.)

dkoslicki commented 7 years ago

Just a note on this: apparently the Bray Curtis coefficient/index is the same as the Sorensen coefficient, and the Sorensen coefficient and Jaccard index are related by Jaccard = Sorensen / (2+ Sorensen) via the same reference (and also here). So we do indeed have the Bray Curtis index (since Jaccard is rather trivial), but I'm not quite sure how you are getting that BC = 1/2 L1 norm. In general, 1/2*L1 norm is referred to as the total variation (TV) index.

fernandomeyer commented 7 years ago

As discussed with Alice and Shinichi Sunagawa, we should implement Bray Curtis (beta diversity) and Shannon (alpha diversity).

abremges commented 7 years ago

but I'm not quite sure how you are getting that BC = 1/2 L1 norm

From my understanding, Bray-Curtis dissimilarity is applied to absolute counts. For me, it was not straightforward to apply relative counts, and then I stumbled upon Multivariate Analysis of Ecological Data by Michael Greenacre and Raul Primicerio. In particular, formula 5.3 in chapter 5 claims this exact relationship: http://84.89.132.1/~michael/stanford/maeb5.pdf

Please take a look, @dkoslicki, and let me know if they (and/or I) are wrong. Thanks!

abremges commented 7 years ago

P.S. In the CAMI profiling format, we only allow relative counts, i.e. abundances in percent.

fernandomeyer commented 7 years ago

I added the Shannon diversity and equitability and a plot (commit 7f062a2). They are computed as, for example, in https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4837688/ https://www.hindawi.com/journals/ijfr/2017/4549756/

sjanssen2 commented 7 years ago

may I point you to the scikit-bio library. It is a highly tested lib for microbial ecology, heavily used by the Knighlab and many others and the foundation of Qiime2: http://scikit-bio.org/docs/latest/generated/generated/skbio.diversity.alpha.shannon.html?highlight=shannon#skbio.diversity.alpha.shannon

We are using Bray Curtis from scipy: https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.spatial.distance.braycurtis.html

dkoslicki commented 7 years ago

Using an established library like scikit-bio is probably the way to go (if it's not too difficult to implement). @abremges the thing that concerns me about formula 5.3 in the paper you mentioned is that it doesn't correspond to formula 5.2 in the same reference. For example, for two samples with a single species of count 1 and 2 respectively, then according to formula 5.2, Bray-Curtis = |1-2|/3 which is quite from formula 5.3 when using the relative counts (1 and 1): |1-1|/2. That is, unless I'm gravely misunderstanding the Bray-Curtis metric (it's quite difficult to pin down what ecologists/biologists actually mean by this quantity).

alicemchardy commented 7 years ago

We will go with scikit-bio implementation, Define how it is implemented there.

To assess the value of beta-diversity estimates from a given tool (analysing a set of samples) 1) we compute Bray-curtis distance between (a) predictions of one tool across a set of samples, (b) gold standard across the same set of samples -> two matrices with beta-diversity estimates as output.->scatter plot output and L1-norm distance.

dkoslicki commented 7 years ago

After looking into this, it looks like:

Computing alpha_diversity(tool) and alpha_diversity(gold standard) and comparing these two gives an indication of how faithfully the tool recreates the actual diversity present in the sample (since higher shannon alpha diversity means a more even, diverse sample).
Computing beta diversity between(tool, gold standard) gives a more overall indication of similarity to species distribution. So if beta diversity is small, not only is the diversity similar between the tool and the gold standard, but the actual distribution of relative abundances between the tool and gold standard are similar.

So yes, I think also doing beta_diversity(tool, gold standard) for a single sample would add value.

So (incorporating what alice said), we should: a) compute x_i = beta_diversity(tool on sample i, gold standard on sample i) for each sample -> scatter plot/histogram {xi} -> lower is better b) compute x{i,j} = betadiversity(tool on sample i, tool on sample j) and y{i,j} = betadiversity(gold standard on sample i, gold standard on sample j)->scatter plot of { x{i,j}, y_{i,j} )->closer to y=x is better (for each i,j).

sjanssen2 commented 7 years ago

Here are my two cents: 1) I don't expect to see much in alpha diversity, because it is just a very rough summary of a sample, but it might serve well as a positive control. Profilers should at least be able to detect the same alpha div as in the gold standard. 2) we might want to be able to produce some eye candy figure like a PcoA or Emperor plot for all vs. all beta div matrix. I envision some figure like screenshot from 2017-11-08 08 56 49 where the large spheres could be the gold standards and the smaller spheres the predictions (this figure is from https://doi.org/10.1016/j.cell.2016.11.018 where large spheres are human donor samples and small are mice transplanted microbiomes) Readers are always impressed by those fancy plots even if it is hard to really understand what is going on.

fernandomeyer commented 7 years ago

I used scipy to compute Bray-Curtis from the predictions of each tool against the gold standard (CAMI lc) and played around with a 3D plot, plotting Bray-Curtis vs. L1 norm vs. rank. This is the result: braycurtis_l1norm

abremges commented 7 years ago

We agreed: No 3D plots.

alicemchardy commented 7 years ago

For alpha-diversity what do we have and what do we want to get: -Shannon at a given rank - estimate of skewedness of abundance distribution: To make a comparison of sample versus gold standard, compute L1-norm of predicted and true shannons, respectively.

Faiths PD (uses a tree):
Number of taxa (at a given rank), estimate of richness of a community: Number of taxa in prediction versus number of taxa in gold standard, display in a line graph, richness versus taxonomic rank for a particular sample.
rarefaction curves (?): estimates of diversity/richness - display line graph; lets add these
chao1 / chao2: estimate of diversity/richness - display in line graph (?)

Look at implentation of QUIIME of differtent metrics, maybe we can reuse some codes, if license is appropriate

dkoslicki commented 6 years ago

Way to visualize beta-diversity: scatter plot of (x,y) points. The x-values are the beta_diversity(tool k on sample i, tool k on sample j), y-values are the beta_diversity(gold standard sample i, gold standard sample j). Do this for all i and j. One scatter plot for each tool (indexed by k).

fernandomeyer commented 6 years ago

All metrics (but Faith's Phylogenetic Diversity, Chao1 / Chao2) have been implemented.

CAMI-challenge / OPAL

alpha/beta diversity measures #10