OpenTreeOfLife / feedback

No code -- just an issue tracker for general feedback (sent here via GitHub's issues API)

Unable to parse Open Tree of Life Newick with ete3 (python) #545

Closed soungalo closed 1 month ago

soungalo commented 2 years ago

I am trying to parse the Newick file for Vertebrata downloaded from the Open Tree of Life server using the ete3 python package:

from ete3 import Tree
tree = Tree('Vertebrata.tre', format=1)

and getting the following error:

raise NewickError('Broken newick structure at: %s' %chunk)
ete3.parser.newick.NewickError: Broken newick structure at:
Malacothrix_typica_ott600700)'Malacothrix You may want to check other
newick loading flags like 'format' or 'quoted_node_names'.

I've seen this mentioned in this old GitHub issue, but the discussion there does not really resolve the problem.
Any idea why this is happening and how it could be resolved?

snacktavish commented 2 years ago

Ah tricky! It seems like it may be a problem with how ete3 is handling quoted internal node names. I'll open an issue over there for clarification.

In the meantime, a hack that works is to strip 'unusual' characters out of the internal node labels. Many characters, such as colons and parentheses, are legal in names in quoted Newick, but Newick parsers often don't handle them well. See:

You can pull a script to replace these characters with '_' from:

pip install opentree
pip install dendropy

python -i subtree-ottol-801601-Vertebrata.tre -o vertebrata_standardized.tre

This replaces the label Malacothrix (genus in Opisthokonta) ott600707 with Malacothrix _genus in Opisthokonta_ ott600707. The reason for this seemingly silly label is that there is a plant genus Malacothrix as well!

ete3 does read that output tree fine. Hope that helps!
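
For anyone who wants to see the idea behind the workaround without pulling the script, here is a minimal sketch of the same kind of replacement using only the Python standard library. It is not the opentree script itself: the function name sanitize_quoted_labels and the exact set of characters treated as unsafe are assumptions made for illustration, and it only touches labels wrapped in single quotes.

```python
import re

# A single-quoted Newick label; a doubled quote ('') inside is an escaped quote.
QUOTED_LABEL = re.compile(r"'((?:[^']|'')*)'")

# Assumed set of characters that have structural meaning in Newick and
# therefore tend to confuse parsers when they appear inside a label.
UNSAFE = re.compile(r"[()\[\]:;,']")


def sanitize_quoted_labels(newick: str) -> str:
    """Replace unsafe characters inside quoted labels with '_'.

    Hypothetical helper for illustration; the surrounding quotes are kept
    and unquoted parts of the tree string are left untouched.
    """
    def _clean(match: re.Match) -> str:
        return "'" + UNSAFE.sub("_", match.group(1)) + "'"

    return QUOTED_LABEL.sub(_clean, newick)


if __name__ == "__main__":
    sample = (
        "(Malacothrix_typica_ott600700)"
        "'Malacothrix (genus in Opisthokonta) ott600707';"
    )
    print(sanitize_quoted_labels(sample))
```

The cleaned string can then be written to a new file and passed to Tree(..., format=1) as in the original post. Whether ete3 accepts every label produced this way has not been checked here, so treat it as a starting point rather than a drop-in replacement for the script.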