biocore / improved-octo-waddle

Balanced parentheses succinct data structure in Python
BSD 3-Clause "New" or "Revised" License
6 stars 7 forks source link

Calling parse_newick on Newick strings that don't end in a semicolon hangs indefinitely #26

Closed fedarko closed 2 years ago

fedarko commented 4 years ago

Recreating the problem

I had iow 0.1.2 installed, and this is using python 3.6.7.

import bp
# Finishes almost instantly
bp.parse_newick("((h:1, i:1, j:1, k:1, l: 1),(e:1,f:1),(n:1,o:1,p:1))a:1;")
# Hangs for what seems like forever; I need to manually kill ipython after doing this
# (even ctrl-C-ing doesn't work)
bp.parse_newick("((h:1, i:1, j:1, k:1, l: 1),(e:1,f:1),(n:1,o:1,p:1))a:1")

Not sure why exactly this is happening, but it was pretty confusing to experience :) This problem does impact Empress (QIIME 2 allows you to import the semicolon-less tree above as a Phylogeny[Rooted] artifact, and running Empress on this artifact also hangs indefinitely), so it'd be nice to address it here (either by allowing these sorts of Newick strings to be parsed, or by just failing with an error message explaining the problem).

References

Example tree used here was based on fourth example tree shown in https://github.com/igraph/igraph/issues/879#issuecomment-144861328.

wasade commented 4 years ago

Parsing behavior of malformed newick strings is undefined... that said, it would be nice if it errored out rather than hanging

In [1]: import skbio

In [2]: skbio.TreeNode.read(["(a,b);"])
Out[2]: <TreeNode, name: unnamed, internal node count: 0, tips count: 2>

In [3]: skbio.TreeNode.read(["(a,b)"])
---------------------------------------------------------------------------
NewickFormatError                         Traceback (most recent call last)
<ipython-input-3-15e871ec9967> in <module>
----> 1 skbio.TreeNode.read(["(a,b)"])

~/miniconda3/envs/qiime2-2019.10/lib/python3.6/site-packages/skbio/io/registry.py in read(cls, file, format, **kwargs)
    650         @classonlymethod
    651         def read(cls, file, format=None, **kwargs):
--> 652             return registry.read(file, into=cls, format=format, **kwargs)
    653
    654         imports = registry._import_paths(read_formats)

~/miniconda3/envs/qiime2-2019.10/lib/python3.6/site-packages/skbio/io/registry.py in read(self, file, format, into, verify, **kwargs)
    511                 return (x for x in [])
    512         else:
--> 513             return self._read_ret(file, format, into, verify, kwargs)
    514
    515     def _read_ret(self, file, fmt, into, verify, kwargs):

~/miniconda3/envs/qiime2-2019.10/lib/python3.6/site-packages/skbio/io/registry.py in _read_ret(self, file, fmt, into, verify, kwargs)
    518             reader, kwargs = self._init_reader(file, fmt, into, verify, kwargs,
    519                                                io_kwargs)
--> 520             return reader(file, **kwargs)
    521
    522     def _read_gen(self, file, fmt, into, verify, kwargs):

~/miniconda3/envs/qiime2-2019.10/lib/python3.6/site-packages/skbio/io/registry.py in wrapped_reader(file, encoding, newline, **kwargs)
    996                         # is cheaper than insert
    997                         kwargs.update(zip(file_keys, fhs[:-1]))
--> 998                         return reader_function(fhs[-1], **kwargs)
    999             else:
   1000

~/miniconda3/envs/qiime2-2019.10/lib/python3.6/site-packages/skbio/io/format/newick.py in _newick_to_tree_node(fh, convert_underscores)
    324         last_token = token
    325
--> 326     raise NewickFormatError("Could not parse file as newick."
    327                             " `(Parenthesis)`, `'single-quotes'`,"
    328                             " `[comments]` may be unbalanced, or tree may be"

NewickFormatError: Could not parse file as newick. `(Parenthesis)`, `'single-quotes'`, `[comments]` may be unbalanced, or tree may be missing its root.