biopython / biopython

Official git repository for Biopython (originally converted from CVS)
http://biopython.org/
Other
4.32k stars 1.74k forks source link

Phylo.Consensus.majority_consensus does not sort bitstrings by terminal name #3345

Open marcsingleton opened 3 years ago

marcsingleton commented 3 years ago

Setup

I am reporting a problem with Biopython version, Python version, and operating system as follows:

import sys; print(sys.version)
import platform; print(platform.python_implementation()); print(platform.platform())
import Bio; print(Bio.__version__)

3.7.6 (default, Jan 8 2020, 13:42:34) [Clang 4.0.1 (tags/RELEASE_401/final)] CPython Darwin-19.6.0-x86_64-i386-64bit 1.76

Expected behaviour

The majority_consensus function in the Phylo.Consensus module should yield a consistent consensus tree regardless of the order of the terminals in the input trees.

Actual behaviour

It actually yields incorrect trees because the encoding of clades into bitstrings depends on the order in which the terminals are returned. I've traced the issue to two lines in the Phylo module.

The first is line 612 within the _tree_to_bitstrs function:

term_names = [term.name for term in tree.find_clades(terminal=True)]

Sorting the terminal names should ensure the clades are consistently encoded from tree to tree. (This function is used a few other places in the module. I'm not sure if this change will affect their function.)

The second is in line 292 of the majority_consensus function itself:

terms = first_tree.get_terminals()

Again the terminals aren't sorted, so the order in which they appear in the first tree sets the order in which the bitstring is later decoded into a list of terminal clades in line 315. Sorting the terminals also should fix this issue.

Additional notes

  1. I've had only light exposure to the Phylo module, so it's possible I'm missing some kind of preprocessing step to ensure the terminals of the input trees are returned in a consistent order. If so, however, it should be documented more clearly in the majority_consensus function.

  2. The calculation of the terminals has some unexpected, if not necessarily incorrect, behavior as well. Currently the branch lengths are taken from the first tree processed whereas the other branch lengths are calculated as an average over all trees displaying that clade. This was likely intentional as a direct application of the same averaging rule to the terminals would average all trees regardless of their topology since all trees with the same taxa display the same terminal clades. While this ensures that trees whose topologies are incompatible with the consensus don't contribute to the terminal branch lengths, it also gives the first tree unilateral power to decide them. Conversely, averaging the terminal branch lengths over all trees is potentially more appropriate since it reflects the uncertainty of that branch regardless of the topology. A third approach is averaging the terminal branch lengths only over the clades that are direct parents of the terminals. This is probably the most intuitive, but it will introduce additional complexity since the terminals would be a special case.

peterjc commented 3 years ago

@lijax are you able to look at this please?