Modify EMDUnifrac so the results are more informative

dkoslicki commented 4 years ago

As per the CAMI2 meeting presentation, we found that the UniFrac metric was relatively unhelpful in ranking tools.

This issue will explore one or more of the following:

- [x] allowing edge lengths to be changed based on depth in the taxonomic tree (instead of all set to 1)
  - [x] experiment with different kinds of weightings (1/depth, 1/depth^pwr)
- [ ] ~~allowing edge lengths to be based on biological similarity~~
  - [ ] ~~try mapping taxonomic tree to phylogenetic tree?~~ haven't found a good resource to do this
  - [ ] ~~use sequence similarity metrics ala Mash?~~ would need underlying genomes (and they are not available from input profiles)
- [ ] investigate normalizing weighted and unweighted versions so they scale between [0,1]
  - [x] normalize by worst possible UniFrac weight?
  - [ ] ~~normalize by given results in a particular run of OPAL?~~ not comparable across different runs of OPAL

Must-haves include:

- [ ] more test cases to confirm that the choices above reflect the biological realism we are trying to capture (and to catch bugs)
- [x] assess on CAMI1 and CAMI2 data to see how results change

dkoslicki commented 4 years ago

Commit a17e6a0 replicates the original results but allows different edge weightings to be incorporated. See `self.branch_len_func`.

Issues discovered:

Incorrectly formatted profiling results were submitted. Examples include lines like:
```
1643949 family  2|200918|188708|1643947 Bacteria|Thermotogae|Thermotogae|Petrotogales   0.00060
```
where the TaxID in column 0 is not the last taxID in the tax_path
Mapping taxonomic trees to phylogenetic trees is hard. We may use sequence similarity instead, but would need to agree on what the "representative sequence(s)" for a given taxID should be (non-trivial given the myriad genomes among the descendants of, say, "Bacteria").
Will focus on "investigate normalizing weighted and unweighted versions so they scale between [0,1]" next.
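
The formatting problem noted above (the TaxID in column 0 not matching the last taxID in the tax path) can be screened for with a short check. This is only a sketch: it assumes tab-separated profile lines with the columns TAXID, RANK, TAXPATH, TAXPATHSN, PERCENTAGE, and `taxid_matches_path` is a hypothetical helper name, not a function in OPAL.

```python
def taxid_matches_path(line: str) -> bool:
    """Return True if column 0 (TAXID) equals the last entry of column 2 (TAXPATH).

    Assumes a tab-separated profile line; this helper is illustrative only.
    """
    fields = line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("expected at least TAXID, RANK and TAXPATH columns")
    taxid, tax_path = fields[0], fields[2]
    # The tax path is a '|'-separated list of taxIDs from the root downwards.
    return tax_path.split("|")[-1] == taxid


# The malformed line quoted above: 1643949 is not the last taxID in the path.
bad_line = "\t".join([
    "1643949", "family", "2|200918|188708|1643947",
    "Bacteria|Thermotogae|Thermotogae|Petrotogales", "0.00060",
])
# A line built for illustration whose path does end in its own taxID.
good_line = "\t".join([
    "1643947", "order", "2|200918|188708|1643947",
    "Bacteria|Thermotogae|Thermotogae|Petrotogales", "0.00060",
])

for line in (bad_line, good_line):
    print(line.split("\t")[0], "consistent:", taxid_matches_path(line))
```

Running a check like this over each submitted profile before computing UniFrac would flag such lines early instead of letting them distort the tree.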

dkoslicki commented 4 years ago

Also note: if normalization is successful, UniFrac can be shown alongside precision/recall in the absolute performance radar/spider plots.

sjanssen2 commented 4 years ago

Not sure if it will work for CAMI datasets with many unknown genomes. I think the major issue is that the branch lengths of a taxonomy are non-informative, yet UniFrac is based on those branch length differences. On the other hand, obtaining phylogenetic trees is hard. How about phylogenetic placement? Could we place all genomes of the CAMI gold standard in a high-quality reference tree, such as the one at https://www.nature.com/articles/s41467-019-13443-4 ? That would be heavy compute, but it needs to be done only once for benchmarking.

dkoslicki commented 4 years ago

@sjanssen2 re: your comment, this is the nuanced issue with utilizing the UniFrac metric on a taxonomic tree: while UniFrac was originally defined on phylogenetic trees, it's perfectly well-defined on a taxonomic tree. However, as you point out, the branch lengths play an important role. While we previously used branch lengths of 1 for every branch in the taxonomic trees, it was observed during the CAMI2 evaluation meeting that this caused the metric to be basically useless.

Given that we want UniFrac here to measure "how different are these taxonomic profiles" and given that end users typically care more about changes in the taxonomic profile at higher ranks, the first idea was to scale the branch lengths by the depth in the taxonomic tree (so getting the wrong species is not penalized as much as getting the wrong phylum).
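
As a rough illustration of the depth-based scaling described above, the weightings listed in the checklist (1/depth and 1/depth^pwr) can be written as a small function factory. The name `make_branch_len_func`, the convention that depth 1 is the edge directly below the root, and the rank list are all assumptions made for this sketch; the thread does not show the actual signature of `self.branch_len_func`.

```python
from typing import Callable

# Assumed rank order, used here only to label the printed depths.
RANKS = ["superkingdom", "phylum", "class", "order", "family", "genus", "species"]


def make_branch_len_func(power: float = 1.0) -> Callable[[int], float]:
    """Return a function mapping an edge's depth to the length 1 / depth**power.

    Depth 1 is taken to be the edge directly below the root (an assumption).
    power = 0 reproduces the old behaviour of every edge having length 1.
    """
    def branch_len(depth: int) -> float:
        if depth < 1:
            raise ValueError("depth must be >= 1")
        return 1.0 / depth ** power

    return branch_len


uniform = make_branch_len_func(0.0)      # all edges have length 1
inverse = make_branch_len_func(1.0)      # 1/depth
inverse_sq = make_branch_len_func(2.0)   # 1/depth^2

for depth, rank in enumerate(RANKS, start=1):
    print(f"{rank:>12}: uniform={uniform(depth):.3f} "
          f"1/depth={inverse(depth):.3f} 1/depth^2={inverse_sq(depth):.3f}")
```

With either decaying weighting, an edge near the leaves contributes less than an edge near the root, which is the intended effect: a wrong species costs less than a wrong phylum.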

While it would be nice to harmonize the taxonomic tree with a/"the" phylogenetic tree, this (as you know) is quite difficult. Given that OPAL is intended to be a stand-alone tool, taking the phylogenetic placement route would require access to the underlying genomes used in the gold standard as well as those in every profile (since input profiles often predict taxa and branches not in the CAMI gold standard). Realistically, this will not happen. So unless there's some resource out there to get (even just a rough) estimate of the phylogenetic distance between two taxIDs, I don't think phylogenetic placement will be possible in this setting.

dkoslicki commented 4 years ago

Note: as of de0908e, weighted UniFrac now scales between 0 and 1, and unweighted UniFrac scales between 0 and infinity. It will be impossible to get unweighted UniFrac to scale between 0 and 1, but now that unweighted UniFrac measures "relative unweighted UniFrac" (relative to the gold standard), the numbers are much more interpretable and informative.
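
The thread does not spell out the formulas used in de0908e, so the following is only one possible reading of the two normalizations, not the OPAL implementation. It assumes that each profile's abundances sum to at most 1, that weighted UniFrac is the sum over edges of edge length times the absolute difference in mass below that edge, that the "worst possible" value is twice the total root-to-leaf path length, and that "relative" unweighted UniFrac divides by the total branch length of the gold standard. All function names are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Tuple

Path = Tuple[str, ...]        # taxIDs from the root downwards
Profile = Dict[Path, float]   # abundance placed at the end of each path
BranchLen = Callable[[int], float]


def subtree_mass(profile: Profile) -> Dict[Path, float]:
    """Mass at or below each node; a node's path also identifies the edge above it."""
    mass: Dict[Path, float] = defaultdict(float)
    for path, abundance in profile.items():
        if abundance <= 0:
            continue  # taxa with no abundance do not contribute branches
        for depth in range(1, len(path) + 1):
            mass[path[:depth]] += abundance
    return dict(mass)


def weighted_unifrac_normalized(p: Profile, q: Profile,
                                branch_len: BranchLen, max_depth: int) -> float:
    """Weighted UniFrac divided by an upper bound, so the result lies in [0, 1]."""
    mp, mq = subtree_mass(p), subtree_mass(q)
    raw = sum(branch_len(len(edge)) * abs(mp.get(edge, 0.0) - mq.get(edge, 0.0))
              for edge in set(mp) | set(mq))
    # At each depth the mass differences sum to at most 2 (assumes totals <= 1).
    worst = 2.0 * sum(branch_len(d) for d in range(1, max_depth + 1))
    return raw / worst


def relative_unweighted_unifrac(pred: Profile, gold: Profile,
                                branch_len: BranchLen) -> float:
    """Length of branches present in only one profile, relative to the gold standard."""
    pred_edges, gold_edges = set(subtree_mass(pred)), set(subtree_mass(gold))
    differing = sum(branch_len(len(e)) for e in pred_edges ^ gold_edges)
    gold_total = sum(branch_len(len(e)) for e in gold_edges)
    return differing / gold_total


inverse_depth = lambda depth: 1.0 / depth

# Toy two-rank profiles, made up for illustration.
gold = {("A", "a1"): 0.5, ("A", "a2"): 0.5}
disjoint = {("B", "b1"): 1.0}

w_same = weighted_unifrac_normalized(gold, gold, inverse_depth, max_depth=2)
w_worst = weighted_unifrac_normalized(disjoint, gold, inverse_depth, max_depth=2)
u_rel = relative_unweighted_unifrac(disjoint, gold, inverse_depth)
print(w_same, w_worst, u_rel)
```

Under these assumptions the weighted value reaches its maximum when the two profiles share no branch below the root, while the unweighted ratio can exceed 1 because a prediction may contain arbitrarily many branches absent from the gold standard, which is consistent with the unbounded range noted above.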

dkoslicki commented 4 years ago

Can close after merge to master pending coordination with @fernandomeyer after Friday meeting. Note that I suggest we default to the (now normalized) unweighted unifrac in plots:

- it's more informative in that it varies more over CAMI submissions
- it's unaffected by normalizing of profiles or not
