Closed dkoslicki closed 2 years ago
Commit a17e6a0 replicates original results, but allows for incorporation of different edge weightings. See self.branch_len_func
Issues discovered:
Some submitted profiling results were incorrectly formatted. For example:
1643949 family 2|200918|188708|1643947 Bacteria|Thermotogae|Thermotogae|Petrotogales 0.00060
where the TaxID in column 0 is not the last taxID in the tax_path
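A check for this kind of malformed line could be sketched as follows. This is only an illustration, not code from OPAL: the function name `taxid_matches_path` is hypothetical, and it assumes tab-separated columns with the taxID in column 0 and the `|`-delimited tax path in column 2, as in the example line above.

```python
def taxid_matches_path(line: str) -> bool:
    """Return True if the taxID in column 0 is the last taxID in the tax path.

    Assumes tab-separated columns: taxID, rank, tax path, tax path names, abundance.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("expected at least 3 tab-separated columns")
    tax_id, tax_path = fields[0], fields[2]
    # The tax path lists ancestors first, so its last entry should be the taxID itself.
    return tax_path.split("|")[-1] == tax_id


# The malformed line from above: column 0 is 1643949, but the path ends in 1643947.
bad_line = (
    "1643949\tfamily\t2|200918|188708|1643947\t"
    "Bacteria|Thermotogae|Thermotogae|Petrotogales\t0.00060"
)
# The same line with column 0 changed to agree with the path (for illustration only).
consistent_line = (
    "1643947\tfamily\t2|200918|188708|1643947\t"
    "Bacteria|Thermotogae|Thermotogae|Petrotogales\t0.00060"
)

print(taxid_matches_path(bad_line))         # False
print(taxid_matches_path(consistent_line))  # True
```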
Mapping taxonomic trees to phylogenetic trees is hard. May use sequence similarity instead, but would need to agree on what the "representative sequence(s)" for a given taxID should be (non-trivial given the myriad genomes descending from a taxID such as "Bacteria").
Will focus on "investigate normalizing weighted and unweighted versions so they scale between [0,1]" next.
Also note: if normalization is successful, UniFrac can be shown alongside precision/recall in the absolute performance radar/spider plots
Not sure if it will work for CAMI datasets with many unknown genomes. I think the major issue is that the branch lengths of a taxonomy are non-informative. However, UniFrac is based on those branch length differences. On the other hand, obtaining phylogenetic trees is hard. How about phylogenetic placement? Can we place all genomes of the CAMI gold standard in a high quality reference tree like: https://www.nature.com/articles/s41467-019-13443-4 ? That would be heavy compute, but it needs to be done only once for benchmarking.
@sjanssen2 re: your comment, this is the nuanced issue with utilizing the UniFrac metric on a taxonomic tree: while UniFrac was originally defined on phylogenetic trees, it's perfectly well-defined on a taxonomic tree. However, as you point out, the branch lengths play an important role. While we previously used branch lengths of 1 for every branch in the taxonomic trees, it was observed during the CAMI2 evaluation meeting that this caused the metric to be basically useless.
Given that we want to use UniFrac here to measure "how different are these taxonomic profiles" and given that end users typically care more about changes in the taxonomic profile at higher ranks, the first idea was to scale the branch lengths by the depth in the taxonomic tree (so getting the wrong species is not penalized as much as getting the wrong phylum).
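To make the depth-scaling idea concrete, here is a minimal sketch of a weighted UniFrac-style distance on a taxonomic tree. It is not OPAL's implementation: the `1/depth` branch length, the function names, and the toy profiles are all assumptions chosen for illustration. Each profile maps a tax path (a tuple of taxon names, root first) to a relative abundance, and the distance sums, over every edge, the branch length times the absolute difference in abundance below that edge.

```python
from collections import defaultdict


def branch_len(depth: int) -> float:
    """Hypothetical weighting: edges near the root are longer than deep edges."""
    return 1.0 / depth


def subtree_mass(profile):
    """Accumulate each path's abundance into all of its ancestors (path prefixes)."""
    mass = defaultdict(float)
    for path, abundance in profile.items():
        for depth in range(1, len(path) + 1):
            mass[path[:depth]] += abundance
    return mass


def weighted_unifrac(p, q, length_func=branch_len):
    """Sum over edges of (branch length) * |abundance difference below the edge|.

    A node is identified by its path prefix; the edge above it sits at
    depth len(prefix), counting the top rank as depth 1.
    """
    mass_p, mass_q = subtree_mass(p), subtree_mass(q)
    nodes = set(mass_p) | set(mass_q)
    return sum(
        length_func(len(node)) * abs(mass_p.get(node, 0.0) - mass_q.get(node, 0.0))
        for node in nodes
    )


# Toy profiles (illustrative only): one taxon carrying all of the abundance.
gold = {("Bacteria", "Firmicutes", "Bacillus"): 1.0}
wrong_genus = {("Bacteria", "Firmicutes", "Clostridium"): 1.0}
wrong_phylum = {("Bacteria", "Proteobacteria", "Escherichia"): 1.0}

d_genus = weighted_unifrac(gold, wrong_genus)    # 2 edges at depth 3: 2 * (1/3)
d_phylum = weighted_unifrac(gold, wrong_phylum)  # adds 2 edges at depth 2: 2/3 + 1
print(d_genus, d_phylum)
```

With unit branch lengths (`length_func=lambda depth: 1.0`) the same two comparisons give 2 and 4; the depth-scaled version widens the relative penalty for an error at a higher rank, which is the behaviour described above.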
While it would be nice to harmonize the taxonomic tree with a/"the" phylogenetic tree, this (as you know) is quite difficult. Given that OPAL is intended to be a stand-alone tool, taking the phylogenetic placement route would require access to the underlying genomes used in the gold standard as well as those in every profile (since input profiles often predict taxa and branches not in the CAMI gold standard). Realistically, this will not happen. So unless there's some resource out there to get (even just a rough) estimate of phylogenetic distance between two taxID's, I don't think phylogenetic placement will be possible in this setting.
Note: as of de0908e, weighted UniFrac now scales between 0 and 1 and unweighted UniFrac scales between 0 and infinity. It will be impossible to get unweighted UniFrac to scale between 0 and 1, but now that unweighted UniFrac measures "relative unweighted UniFrac" (relative to the gold standard), the numbers are much more interpretable and informative.
Can close after merge to master pending coordination with @fernandomeyer after Friday meeting. Note that I suggest we default to the (now normalized) unweighted unifrac in plots:
As per the CAMI2 meeting presentation, we found that the UniFrac metric was relatively unhelpful in ranking tools.
This issue will explore one or more of the following:
- allowing edge lengths to be based on biological similarity
  - try mapping taxonomic tree to phylogenetic tree?
    - haven't found a good resource to do this
  - use sequence similarity metrics ala Mash?
    - would need underlying genomes (and they are not available from input profiles)
- normalize by given results in a particular run of OPAL?
  - not comparable across different runs of OPAL

Must have's include: