biocore / q2-qemistree

Hierarchical orderings for mass spectrometry data. Canonically pronounced "chemis-tree".
BSD 2-Clause "Simplified" License

Weighted vs unweighted UniFrac strong difference #149

Open ArnaudGaudry opened 2 years ago

ArnaudGaudry commented 2 years ago

Hello qemistree developers!

I tried to reproduce analyses from the publication on the evaluation dataset: https://github.com/knightlab-analyses/qemistree-analyses/blob/master/Evaluation-Dataset-Analyses.ipynb

When I generate the plot using the metric unweighted_unifrac instead of weighted_normalized_unifrac, I get a very different plot, one that is actually quite similar to the plot generated using Bray-Curtis (a strong batch effect is visible). Is this inherent to the metric and expected? I thought you might have tested it during development!

Thanks and best regards, Arnaud

ElDeveloper commented 2 years ago

@ArnaudGaudry that's interesting. Mathematically speaking, Bray-Curtis is more similar to weighted UniFrac than it is to unweighted UniFrac. I don't remember seeing this plot, mainly because we knew from other experiments and tests we ran before that the abundance-based weighting inherent to the weighted variant of UniFrac would play an important role. @anupriyatripathi any thoughts on this?

anupriyatripathi commented 2 years ago

@ArnaudGaudry thanks for the question and your analysis! In line with what Yoshiki said, we used weighted UniFrac because abundances encode important information for metabolomics data analysis. Bray-Curtis also uses abundances, like weighted UniFrac, which is why we used it for our comparisons. It's interesting that you see unweighted UniFrac capturing batch-to-batch variation. This could be because the metric gives importance to very low-abundance signals as well, and these might differ between the batches (due to shifts in retention time).
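To make the point about low-abundance signals concrete, here is a minimal pure-Python sketch on a hypothetical toy tree. It is not the q2-qemistree or QIIME 2 implementation; the tree, the feature names (`f1` to `f4`), and the counts are all assumptions chosen for illustration. Two samples share one dominant feature and differ only in which trace-level feature they contain.

```python
# Hypothetical toy tree ((f1,f2),(f3,f4)). Each branch is stored as
# (branch length, set of tips below that branch). All values are made up.
BRANCHES = [
    (1.0, {"f1"}), (1.0, {"f2"}), (1.0, {"f3"}), (1.0, {"f4"}),
    (1.0, {"f1", "f2"}), (1.0, {"f3", "f4"}),
]


def proportions(sample):
    """Convert raw feature counts to relative abundances."""
    total = sum(sample.values())
    return {tip: count / total for tip, count in sample.items()}


def unweighted_unifrac(a, b):
    """Branch length unique to one sample / branch length observed in either."""
    unique = shared = 0.0
    for length, tips in BRANCHES:
        in_a = any(a.get(t, 0) > 0 for t in tips)
        in_b = any(b.get(t, 0) > 0 for t in tips)
        if in_a != in_b:
            unique += length
        elif in_a and in_b:
            shared += length
    return unique / (unique + shared)


def weighted_normalized_unifrac(a, b):
    """Abundance-weighted branch differences, scaled by root-to-tip distances."""
    pa, pb = proportions(a), proportions(b)
    raw = 0.0
    depth = {}  # root-to-tip distance of every tip
    for length, tips in BRANCHES:
        below_a = sum(pa.get(t, 0.0) for t in tips)
        below_b = sum(pb.get(t, 0.0) for t in tips)
        raw += length * abs(below_a - below_b)
        for t in tips:
            depth[t] = depth.get(t, 0.0) + length
    norm = sum(d * (pa.get(t, 0.0) + pb.get(t, 0.0)) for t, d in depth.items())
    return raw / norm


def bray_curtis(a, b):
    """Tree-agnostic, abundance-based dissimilarity on raw counts."""
    tips = set(a) | set(b)
    num = sum(abs(a.get(t, 0) - b.get(t, 0)) for t in tips)
    den = sum(a.get(t, 0) + b.get(t, 0) for t in tips)
    return num / den


# Same dominant feature, different trace-level feature in each batch.
batch_1 = {"f1": 100, "f3": 1}
batch_2 = {"f1": 100, "f4": 1}

print("unweighted UniFrac:         ", unweighted_unifrac(batch_1, batch_2))
print("weighted normalized UniFrac:", weighted_normalized_unifrac(batch_1, batch_2))
print("Bray-Curtis:                ", bray_curtis(batch_1, batch_2))
```

In this toy case the two abundance-based values stay close to zero, while the unweighted value is much larger because a count of 1 contributes as much branch length as a count of 100. Real data will not behave this cleanly.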

I also expect that if you compare unweighted UniFrac to a comparable metric such as the tree-agnostic binary Jaccard distance (using the PERMANOVA test statistic), you might see that the UniFrac metric reduces the batch effect even if it doesn't reconcile the batches completely. We'd love to take a look at your plots/analysis if you'd like more input.
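The suggested comparison could be sketched as follows: compute the PERMANOVA pseudo-F statistic for the batch grouping under each distance metric and compare the values, with a larger pseudo-F meaning stronger batch separation. This is a minimal pure-Python sketch. The two distance matrices are hypothetical placeholders, not results from the evaluation dataset, and in practice a library routine such as scikit-bio's `permanova` would typically be used instead.

```python
import random
from itertools import combinations


def pseudo_f(dm, groups):
    """PERMANOVA pseudo-F from a square distance matrix and group labels."""
    n = len(groups)
    labels = sorted(set(groups))
    ss_total = sum(dm[i][j] ** 2 for i, j in combinations(range(n), 2)) / n
    ss_within = 0.0
    for label in labels:
        members = [i for i, g in enumerate(groups) if g == label]
        ss_within += sum(dm[i][j] ** 2
                         for i, j in combinations(members, 2)) / len(members)
    ss_among = ss_total - ss_within
    a = len(labels)
    return (ss_among / (a - 1)) / (ss_within / (n - a))


def permanova(dm, groups, permutations=999, seed=0):
    """Return (observed pseudo-F, permutation p-value)."""
    rng = random.Random(seed)
    observed = pseudo_f(dm, groups)
    shuffled = list(groups)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        if pseudo_f(dm, shuffled) >= observed:
            hits += 1
    return observed, (hits + 1) / (permutations + 1)


def block_matrix(groups, within, between):
    """Hypothetical distance matrix with fixed within/between-group distances."""
    n = len(groups)
    return [[0.0 if i == j
             else (within if groups[i] == groups[j] else between)
             for j in range(n)] for i in range(n)]


groups = ["batch_1"] * 3 + ["batch_2"] * 3

# Placeholder matrices standing in for two metrics on the same samples:
# one where batches separate strongly, one where they barely separate.
strong = block_matrix(groups, within=0.1, between=0.9)
weak = block_matrix(groups, within=0.4, between=0.5)

f_strong, p_strong = permanova(strong, groups)
f_weak, p_weak = permanova(weak, groups)
print("strong batch effect: pseudo-F =", f_strong, "p =", p_strong)
print("weak batch effect:   pseudo-F =", f_weak, "p =", p_weak)
```

With only six samples the permutation p-value cannot become very small, so the pseudo-F values themselves are the more informative quantity to compare across metrics here.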

Thanks again for the question - very interesting!

Anupriya Tripathi, PhD



ArnaudGaudry commented 2 years ago

@anupriyatripathi @ElDeveloper Thank you for your detailed answers!

It may indeed be due to the weight given to low-abundance metabolites. PERMANOVA is a really good idea for measuring group separation, and I'll give it a try. Since the idea is to use chemical relationships to mitigate the batch effect, I also compared Qemistree to CSCS (both weighted and unweighted). As you can see, unweighted CSCS still mitigates the batch effect, unlike unweighted UniFrac. Since the two methods are completely different, they are hard to compare, but I expected fairly similar results for the two unweighted versions (as is the case for the weighted versions). This is obviously not the case ^^

[Image: unifrac_vs_cscs]