etetoolkit / ete

Python package for building, comparing, annotating, manipulating and visualising trees. It provides a comprehensive API and a collection of command line tools, including utilities to work with the NCBI taxonomy tree.
http://etetoolkit.org
GNU General Public License v3.0

Out of memory when NCBITaxa.get_topology is called with a single argument #585

Open wbazant opened 2 years ago

wbazant commented 2 years ago

Discovered by accident. To reproduce:

from ete3 import NCBITaxa
ncbi=NCBITaxa("refdb/taxa.sqlite")
ncbi.get_topology([1])

I am using Python 3.6.9 [GCC 8.4.0] on Linux, ete3 version 3.1.2, and the taxonomy database that goes with https://github.com/allind/EukDetect, although I guess that doesn't matter.

I first ran into it with this call:

ncbi.get_topology(['2759', '2759'])

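This came up while looping through pairs of taxids to assign a common ancestor. Both entries here are the same taxid, so the call is effectively the single-argument case.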
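For the pairwise use case, one possible workaround is to skip get_topology and compare lineages instead. The following is a minimal sketch rather than anything ete3 provides: the helper names `last_common_taxid` and `common_ancestor` are hypothetical, and it assumes that `ncbi.get_lineage(taxid)` returns the list of ancestor taxids ordered from the root down to the taxid itself.

```python
def last_common_taxid(lineage_a, lineage_b):
    """Return the deepest taxid shared by two root-first lineages, or None."""
    common = None
    for x, y in zip(lineage_a, lineage_b):
        if x != y:
            break
        common = x
    return common


def common_ancestor(ncbi, taxid_a, taxid_b):
    """Hypothetical helper: common ancestor of two taxids via their lineages.

    `ncbi` is any object exposing get_lineage(taxid), e.g. an ete3 NCBITaxa.
    Assumes lineages are ordered root first.
    """
    a, b = int(taxid_a), int(taxid_b)
    if a == b:
        # Identical taxids: the taxid is its own common ancestor,
        # so no lookup is needed (the taxid is not validated here).
        return a
    return last_common_taxid(ncbi.get_lineage(a), ncbi.get_lineage(b))
```

With this, a pair such as `('2759', '2759')` is answered without building any tree, so the loop never reaches the single-taxid code path.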

It took me a while to understand what was going on. I actually thought my program was leaking memory, but the cause was a single bad value.

Suggested behaviour: maybe raise a ValueError when there aren't enough elements to build a tree?

Relatedly, there could also be a ValueError when the method is called with no taxids:

>>> ncbi.get_topology(taxids=[])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/wbazant/.local/lib/python3.6/site-packages/ete3/ncbi_taxonomy/ncbiquery.py", line 463, in get_topology
    root = elem2node[1]
KeyError: 1
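Until something like this exists in the library, the suggested checks can be approximated on the caller's side. This is a minimal sketch, not ete3's own behaviour: `distinct_taxids` and `checked_topology` are hypothetical names, and the guard only looks at the integer values passed in, so it will not notice two different taxids that the database treats as the same taxon.

```python
def distinct_taxids(taxids):
    """Return the distinct taxids as a sorted list of ints."""
    return sorted({int(t) for t in taxids})


def checked_topology(ncbi, taxids, min_taxids=2):
    """Hypothetical guard: call ncbi.get_topology only with enough taxids.

    `ncbi` is any object exposing get_topology(taxids), e.g. an ete3 NCBITaxa.
    Raises ValueError for an empty list, a single taxid, or repeated copies
    of one taxid, instead of passing them on to get_topology.
    """
    unique = distinct_taxids(taxids)
    if len(unique) < min_taxids:
        raise ValueError(
            "need at least %d distinct taxids to build a tree, got %d: %r"
            % (min_taxids, len(unique), unique)
        )
    return ncbi.get_topology(unique)
```

Calling `checked_topology(ncbi, [1])`, `checked_topology(ncbi, ['2759', '2759'])` or `checked_topology(ncbi, [])` then fails fast with a ValueError rather than exhausting memory or surfacing a KeyError.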