Cibiv / IQ-TREE

Efficient phylogenomic software by maximum likelihood
http://www.iqtree.org
GNU General Public License v2.0

identical sequences not getting brlen = 0 #163

Open tseemann opened 4 years ago

tseemann commented 4 years ago

If you give iqtree 2.1.0 identical sequences, they end up with a BL of 0.000001. I think this is a bug. This problem persists even when --polytomy is used. Also, reducing --blmin seems to reduce that BL too, but it can't be set to zero.

pvanheus commented 3 years ago

Following on from this, the logic for collapsing identical sequences means that if there are N identical sequences, N-1 are collapsed into a single entity (with branch length 0 between them) and 1 remains distinct. Given this very artificial alignment:

>outgroup
CCCCGTGAGCCCGGTAGGCCGTCGGATGCTTCCCGCCCGGCGCGCCGTCCGCCACTCGGT
CGCACGCCCGGCCGGCCCCTAATGTTCGGCCACACCGAGCGGGCGAGAGGGGTGACTCGG
>copy
CCCCGGGAGCCCGGTAGGCCGTCGGATGCGTCCCGCCCGGCGCGCCGTCCGCCACTCGGT
CGCACGCCCGGCCGGCCCCTAATGTTCGGCCACACCGAGCGGGCGAGAGGGGTGACTCGG
>copy2
CCCCGGGAGCCCGGTAGGCCGTCGGATGCGTCCCGCCCGGCGCGCCGTCCGCCACTCGGT
CGCACGCCCGGCCGGCCCCTAATGTTCGGCCACACCGAGCGGGCGAGAGGGGTGACTCGG
>copy3
CCCCGGGAGCCCGGTAGGCCGTCGGATGCGTCCCGCCCGGCGCGCCGTCCGCCACTCGGT
CGCACGCCCGGCCGGCCCCTAATGTTCGGCCACACCGAGCGGGCGAGAGGGGTGACTCGG

the following tree is produced:

(outgroup:0.0168863000,(copy:0.0000000000,copy3:0.0000000000):0.0000010000,copy2:0.0000010000);

(btw raxml-ng produces this tree:

((copy3:0.000001,copy:0.000001):0.000001,copy2:0.000001,outgroup:0.017495);

)

For both programs -blmin was kept at its default of 1e-6. This behaviour from iqtree can lead to some very surprising trees for very closely related sequences (as is common in these days of SARS-CoV-2).
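To see in advance which taxa will be affected by this collapsing, the identical sequences can be grouped before the tree search. This is a minimal sketch, assuming the alignment is available as FASTA text. The function names `parse_fasta` and `identical_groups` are hypothetical helpers, not part of iqtree, and the toy sequences are shortened stand-ins that reuse the names from the alignment above.

```python
from collections import defaultdict


def parse_fasta(text):
    """Return {name: sequence} from FASTA text, joining wrapped sequence lines."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header as the taxon name.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts).upper() for n, parts in records.items()}


def identical_groups(records):
    """Group taxon names that share exactly the same sequence (2+ members only)."""
    by_seq = defaultdict(list)
    for name, seq in records.items():
        by_seq[seq].append(name)
    return [names for names in by_seq.values() if len(names) > 1]


# Toy alignment: shortened, hypothetical sequences with the same names as above.
toy = """>outgroup
CCCCGTGAGC
>copy
CCCCGGGAGC
>copy2
CCCCGGGAGC
>copy3
CCCCGGGAGC
"""

print(identical_groups(parse_fasta(toy)))  # [['copy', 'copy2', 'copy3']]
```

Any group reported here is one where the branch lengths in the final tree are not informed by the data.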

pvanheus commented 3 years ago

btw to address the zero branch length issue I noted above (an independent question from the shape of the tree), insertTaxa() could be altered to use min_branch_length rather than 0.0. I don't know if that is the 'correct' way to deal with this, or if all identical sequences should cluster together with branch length 0. There's no actual data for the algorithm to work with here, so I think the word 'correct' might be a misnomer.
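For anyone who needs consistent branch lengths in the meantime, one workaround is to post-process the Newick output instead of changing the tree search. The sketch below rewrites every branch length at or below a threshold as zero. It is not what iqtree or raxml-ng do internally: `zero_short_branches` is a hypothetical helper, the threshold of 1e-6 is assumed to match the -blmin default mentioned above, and the regular expression assumes unquoted taxon names that do not contain a colon followed by digits.

```python
import re

# Matches ":<number>" branch lengths, including scientific notation such as 1e-06.
_BRLEN = re.compile(r":(\d+(?:\.\d*)?(?:[eE][-+]?\d+)?)")


def zero_short_branches(newick, threshold=1e-6):
    """Rewrite branch lengths <= threshold as 0.0 and leave longer ones untouched."""
    def repl(match):
        length = float(match.group(1))
        return ":0.0" if length <= threshold else match.group(0)
    return _BRLEN.sub(repl, newick)


# The raxml-ng tree quoted above.
tree = "((copy3:0.000001,copy:0.000001):0.000001,copy2:0.000001,outgroup:0.017495);"
print(zero_short_branches(tree))
# ((copy3:0.0,copy:0.0):0.0,copy2:0.0,outgroup:0.017495);
```

This only changes the reported lengths; the topology among the identical sequences stays arbitrary.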

Linking to some forum posts that discuss this issue: one, two and three. The last one discusses the --polytomy flag that Torsten mentions above.