OpenTreeOfLife / feedback

No code -- just an issue tracker for general feedback (sent here via GitHub's issues API)

Unable to parse Open Tree of Life Newick with ete3 (python) #545

Closed soungalo closed 1 month ago

soungalo commented 2 years ago

I am trying to parse the Newick file for Vertebrata, downloaded from the Open Tree of Life server, using the ete3 Python package:

from ete3 import Tree
tree = Tree('Vertebrata.tre', format=1)

and getting the following error:

raise NewickError('Broken newick structure at: %s' %chunk)
ete3.parser.newick.NewickError: Broken newick structure at: Malacothrix_typica_ott600700)'Malacothrix
You may want to check other newick loading flags like 'format' or 'quoted_node_names'.

I've seen this error mentioned in an old GitHub issue, but the discussion there does not really resolve the problem.
Any idea why this is happening and how it could be resolved?
Thanks!

snacktavish commented 2 years ago

Ah tricky! It seems like it may be a problem with how ete3 is handling quoted internal node names. I'll open an issue over there for clarification.

In the meantime, a hack that works is to strip 'unusual' characters out of the internal node labels - there are many characters, such as colons and parentheses, which are legal in names in quoted newick, but newick parsers often don't handle them well. See: https://github.com/OpenTreeOfLife/ot-base/issues/10
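As a rough illustration of that idea (this is not the script linked below, just a minimal sketch using only the standard library), the following rewrites every single-quoted Newick label, replacing each character outside letters, digits, `_`, `.` and `-` with `_` and dropping the quotes. The function name `sanitize_quoted_labels` and the set of characters kept are assumptions made for this example, and unlike the output described later in this comment, it also replaces spaces.

```python
import re

# Matches a single-quoted Newick label; inside quotes, '' is an escaped quote.
QUOTED_LABEL = re.compile(r"'((?:[^']|'')*)'")

# Characters kept as-is (an assumption for this sketch); all others become '_'.
UNSAFE_CHAR = re.compile(r"[^A-Za-z0-9_.\-]")


def sanitize_quoted_labels(newick: str) -> str:
    """Replace unusual characters in quoted labels with '_' and drop the quotes."""

    def _clean(match: re.Match) -> str:
        # Undo the '' escape first, then replace every unsafe character.
        label = match.group(1).replace("''", "'")
        return UNSAFE_CHAR.sub("_", label)

    return QUOTED_LABEL.sub(_clean, newick)


if __name__ == "__main__":
    # Toy tree with made-up tip labels; only the quoted internal label
    # is taken from this thread.
    toy = "((A_ott1,B_ott2)'Malacothrix (genus in Opisthokonta) ott600707',C_ott3);"
    print(sanitize_quoted_labels(toy))
    # prints ((A_ott1,B_ott2)Malacothrix__genus_in_Opisthokonta__ott600707,C_ott3);
```

For a file as large as the Vertebrata tree, reading the whole file into one string and writing the result back out should still be workable, but this has not been tested against the real download.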

You can pull a script to replace these characters with '_' from: https://github.com/OpenTreeOfLife/python-opentree/blob/itol_annot/examples/standardize_labels.py

pip install opentree
pip install dendropy

python standardize_labels.py -i subtree-ottol-801601-Vertebrata.tre -o vertebrata_standardized.tre

This replaces the label Malacothrix (genus in Opisthokonta) ott600707 with Malacothrix _genus in Opisthokonta_ ott600707. The reason for this seemingly silly label is that there is also a plant genus named Malacothrix! https://en.wikipedia.org/wiki/Malacothrix_(plant)

ete3 reads that output tree fine. Hope that helps!