Closed abremges closed 6 years ago
Just a note on this: apparently the Bray Curtis coefficient/index is the same as the Sorensen coefficient, and the Sorensen coefficient and Jaccard index are related by Jaccard = Sorensen / (2+ Sorensen) via the same reference (and also here). So we do indeed have the Bray Curtis index (since Jaccard is rather trivial), but I'm not quite sure how you are getting that BC = 1/2 L1 norm. In general, 1/2*L1 norm is referred to as the total variation (TV) index.
As discussed with Alice and Shinichi Sunagawa, we should implement Bray Curtis (beta diversity) and Shannon (alpha diversity).
but I'm not quite sure how you are getting that BC = 1/2 L1 norm
From my understanding, Bray-Curtis dissimilarity is applied to absolute counts. For me, it was not straightforward to apply relative counts, and then I stumbled upon Multivariate Analysis of Ecological Data by Michael Greenacre and Raul Primicerio. In particular, formula 5.3 in chapter 5 claims this exact relationship: http://84.89.132.1/~michael/stanford/maeb5.pdf
Please take a look, @dkoslicki, and let me know if they (and/or I) are wrong. Thanks!
P.S. In the CAMI profiling format, we only allow relative counts, i.e. abundances in percent.
I added the Shannon diversity and equitability and a plot (commit 7f062a2). They are computed as, for example, in https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4837688/ https://www.hindawi.com/journals/ijfr/2017/4549756/
may I point you to the scikit-bio library. It is a highly tested lib for microbial ecology, heavily used by the Knighlab and many others and the foundation of Qiime2: http://scikit-bio.org/docs/latest/generated/generated/skbio.diversity.alpha.shannon.html?highlight=shannon#skbio.diversity.alpha.shannon
We are using Bray Curtis from scipy: https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.spatial.distance.braycurtis.html
Using an established library like scikit-bio is probably the way to go (if it's not too difficult to implement). @abremges the thing that concerns me about formula 5.3 in the paper you mentioned is that it doesn't correspond to formula 5.2 in the same reference. For example, for two samples with a single species of count 1 and 2 respectively, then according to formula 5.2, Bray-Curtis = |1-2|/3 which is quite from formula 5.3 when using the relative counts (1 and 1): |1-1|/2. That is, unless I'm gravely misunderstanding the Bray-Curtis metric (it's quite difficult to pin down what ecologists/biologists actually mean by this quantity).
We will go with scikit-bio implementation, Define how it is implemented there.
To assess the value of beta-diversity estimates from a given tool (analysing a set of samples) 1) we compute Bray-curtis distance between (a) predictions of one tool across a set of samples, (b) gold standard across the same set of samples -> two matrices with beta-diversity estimates as output.->scatter plot output and L1-norm distance.
After looking into this, it looks like:
Computing alpha_diversity(tool) and alpha_diversity(gold standard) and comparing these two gives an indication of how faithfully the tool recreates the actual diversity present in the sample (since higher shannon alpha diversity means a more even, diverse sample).
Computing beta diversity between(tool, gold standard) gives a more overall indication of similarity to species distribution. So if beta diversity is small, not only is the diversity similar between the tool and the gold standard, but the actual distribution of relative abundances between the tool and gold standard are similar.
So yes, I think also doing beta_diversity(tool, gold standard) for a single sample would add value.
So (incorporating what alice said), we should: a) compute x_i = beta_diversity(tool on sample i, gold standard on sample i) for each sample -> scatter plot/histogram {xi} -> lower is better b) compute x{i,j} = betadiversity(tool on sample i, tool on sample j) and y{i,j} = betadiversity(gold standard on sample i, gold standard on sample j)->scatter plot of { x{i,j}, y_{i,j} )->closer to y=x is better (for each i,j).
Here are my two cents: 1) I don't expect to see much in alpha diversity, because it is just a very rough summary of a sample, but it might serve well as a positive control. Profilers should at least be able to detect the same alpha div as in the gold standard. 2) we might want to be able to produce some eye candy figure like a PcoA or Emperor plot for all vs. all beta div matrix. I envision some figure like where the large spheres could be the gold standards and the smaller spheres the predictions (this figure is from https://doi.org/10.1016/j.cell.2016.11.018 where large spheres are human donor samples and small are mice transplanted microbiomes) Readers are always impressed by those fancy plots even if it is hard to really understand what is going on.
I used scipy to compute Bray-Curtis from the predictions of each tool against the gold standard (CAMI lc) and played around with a 3D plot, plotting Bray-Curtis vs. L1 norm vs. rank. This is the result:
We agreed: No 3D plots.
For alpha-diversity what do we have and what do we want to get: -Shannon at a given rank - estimate of skewedness of abundance distribution: To make a comparison of sample versus gold standard, compute L1-norm of predicted and true shannons, respectively.
Look at implentation of QUIIME of differtent metrics, maybe we can reuse some codes, if license is appropriate
Way to visualize beta-diversity: scatter plot of (x,y) points. The x-values are the beta_diversity(tool k on sample i, tool k on sample j), y-values are the beta_diversity(gold standard sample i, gold standard sample j). Do this for all i and j. One scatter plot for each tool (indexed by k).
All metrics (but Faith's Phylogenetic Diversity, Chao1 / Chao2) have been implemented.
From David's email, some of which to be implemented:
(As I found out the hard way, Bray Curtis is simply 1/2 our L1 Norm, thus we kind of have it already.)