etetoolkit / ete

Python package for building, comparing, annotating, manipulating and visualising trees. It provides a comprehensive API and a collection of command line tools, including utilities to work with the NCBI taxonomy tree.
http://etetoolkit.org
GNU General Public License v3.0
795 stars 214 forks source link

Reading nexus files from strings > 220bytes fails on windows with Python 3 #258

Open davidhwyllie opened 7 years ago

davidhwyllie commented 7 years ago

Hi

A minor issue; in the Newick parser read_newick function (parsers/newick.py, line 220) called when reading newick files, as described here: http://etetoolkit.org/docs/latest/tutorial/tutorial_trees.html#reading-newick-trees

does a kind of 'duck typing' approach to deciding whether the string passed is a file path or a newick string. On windows this fails if the string is >220bytes because os.stat returns an error from the operating system.

I guess there's various ways of dealing with this, & I've attached a modified module below which implements one such.

Best wishes David


def read_newick(newick, root_node=None, format=0):
    """ Reads a newick tree from either a string or a file, and returns
    an ETE tree structure.

    A previously existent node object can be passed as the root of the
    tree, which means that all its new children will belong to the same
    class as the root(This allows to work with custom TreeNode
    objects).

    You can also take advantage from this behaviour to concatenate
    several tree structures.
    """

    if root_node is None:
        from ..coretype.tree import TreeNode
        root_node = TreeNode()

    if isinstance(newick, six.string_types):
        # we duck-type the string newick.
        # if it is a path, as identified by stat, we assume it is a file
        # otherwise, we assume it is a string.
        # On windows, the maximal permitted path length is 220 and therefore
        # passing large strings raises a ValueError from stat,
        # which states 'String is too long for Windows'
        # as a workaround, one can test whether newick is too long to be a
        # windows string.
        if len(newick)<220:     # then it could be either a path or a tree
            if os.path.exists(newick):
                if newick.endswith('.gz'):
                    import gzip
                    nw = gzip.open(newick).read()
                else:
                    # note: this raises a deprecated error on  Win 7 with python 3.5.2
                    nw = open(newick, 'rU').read()
            else:
                nw = newick
        else:
            nw = newick

        matcher = compile_matchers(formatcode=format)
        nw = nw.strip()        
        if not nw.startswith('(') and nw.endswith(';'):
            return _read_node_data(nw[:-1], root_node, "single", matcher, format)

        elif not nw.startswith('(') or not nw.endswith(';'):
            raise NewickError('Unexisting tree file or Malformed newick tree structure.')
        else:
            return _read_newick_from_string(nw, root_node, matcher, format)

    else:
        raise NewickError("'newick' argument must be either a filename or a newick string.")
unode commented 7 years ago

@davidhwyllie Can you paste a traceback of the actual error? Depending on the content I'm tempted to report this as an upstream (i.e. python) problem.

unode commented 7 years ago

Also which version of Python 3?

davidhwyllie commented 7 years ago

Hi

at the moment I am having trouble producing a unit test to reproduce this. It isn't as simple as the 'strings longer than 220 chars' described above. When I have a reproducible, failing example I will write further.

In the meantime, I have attached a unit test which passes which loads small and large files from strings and files on windows using unpatched ete3.

It produces a deprecation warning related to reading files in 'rU' mode, but this is a python 3 issue not a windows one.

ete3_issue258.zip

jhcepas commented 7 years ago

Thanks David, can you implement those unitests on your github branch and the Pull Request? It is easy to review and comment that way.

On 4 March 2017 at 22:31, David Wyllie notifications@github.com wrote:

Hi

at the moment I am having trouble producing a unit test to reproduce this. It isn't as simple as the 'strings longer than 220 chars' described above. When I have a reproducible, failing example I will write further.

In the meantime, I have attached a unit test which passes which loads small and large files from strings and files on windows using unpatched ete3.

It produces a deprecation warning related to reading files in 'rU' mode, but this is a python 3 issue not a windows one.

ete3_issue258.zip https://github.com/etetoolkit/ete/files/818947/ete3_issue258.zip

— You are receiving this because you are subscribed to this thread. Reply to this email directly, view it on GitHub https://github.com/etetoolkit/ete/issues/258#issuecomment-284184154, or mute the thread https://github.com/notifications/unsubscribe-auth/AAJ_Sk7j3RuDd_S7T-v0QmLUyY4hmFRJks5ridgsgaJpZM4MEfEV .

davidhwyllie commented 6 years ago

Hi

Thanks. Sorry for the delay in replying.

This issue can be reproduced on the following system using the latest release of ete3.

System details OS: Windows-7-6.1.7601-SP1 Python version: 3.5.3 (v3.5.3:1880cb95a742, Jan 16 2017, 16:02:32) [MSC v.1900 64 bit (AMD64)] Ete3: 3.1.1

How to reproduce A self contained unit test is attached

E:\dwyllie\GitHub\SPIRALTREE\src>python issue258.py

Running testIssue258 
system: Windows-7-6.1.7601-SP1
python version: 3.5.3 (v3.5.3:1880cb95a742, Jan 16 2017,
 16:02:32) [MSC v.1900 64 bit (AMD64)]
ete3 version: 3.1.1
Loading small tree
There are 3 nodes
Loading large tree
E
======================================================================
ERROR: runTest (__main__.testIssue258)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "issue258.py", line 37, in runTest
    largeTree=Tree(largeTreeText)
  File "C:\Program Files\Python35\lib\site-packages\ete3_dev_git\ete3\coretype\t
ree.py", line 211, in __init__
    quoted_names=quoted_node_names)
  File "C:\Program Files\Python35\lib\site-packages\ete3_dev_git\ete3\parser\new
ick.py", line 234, in read_newick
    if os.path.exists(newick):
  File "C:\Program Files\Python35\lib\genericpath.py", line 19, in exists
    os.stat(path)
ValueError: stat: path too long for Windows

----------------------------------------------------------------------
Ran 1 test in 0.019s

FAILED (errors=1)

E:\dwyllie\GitHub\SPIRALTREE\src>

How to fix

Modify the read_newick function as shown

def read_newick(newick, root_node=None, format=0, quoted_names=False):
    """ Reads a newick tree from either a string or a file, and returns
    an ETE tree structure.

    A previously existent node object can be passed as the root of the
    tree, which means that all its new children will belong to the same
    class as the root(This allows to work with custom TreeNode
    objects).

    You can also take advantage from this behaviour to concatenate
    several tree structures.
    """

    if root_node is None:
        from ..coretype.tree import TreeNode
        root_node = TreeNode()

    if isinstance(newick, six.string_types):
        # we duck-type the string newick.
        # if it is a path, as identified by stat, we assume it is a file
        # otherwise, we assume it is a string.
        # On windows, the maximal permitted path length is 220 and therefore
        # passing large strings raises a ValueError from stat,
        # which states 'String is too long for Windows'
        # as a workaround, one can test whether newick is too long to be a
        # windows string.
        if len(newick)<220:     # then it could be either a path or a tree
            if os.path.exists(newick):
                if newick.endswith('.gz'):
                    import gzip
                    nw = gzip.open(newick).read()
                else:
                    nw = open(newick, 'rU').read()
            else:
                nw = newick
        else:
            nw = newick

        matcher = compile_matchers(formatcode=format)
        nw = nw.strip()        
        if not nw.startswith('(') and nw.endswith(';'):
            return _read_node_data(nw[:-1], root_node, "single", matcher, format)

        elif not nw.startswith('(') or not nw.endswith(';'):
            raise NewickError('Unexisting tree file or Malformed newick tree structure.')
        else:
            return _read_newick_from_string(nw, root_node, matcher, format, quoted_names)

    else:
        raise NewickError("'newick' argument must be either a filename or a newick string.")

Post fix:

E:\dwyllie\GitHub\SPIRALTREE\src>python issue258.py
60430
Reproducing issue #258
Platform: Windows-7-6.1.7601-SP1
python version: 3.5.3 (v3.5.3:1880cb95a742, Jan 16 2017, 16:02:32) [MSC v.1900 6
4 bit (AMD64)]
ete3 version: 3.1.1
Loading small tree
There are 3 nodes
Loading large tree
There are 1026 nodes
.
----------------------------------------------------------------------
Ran 1 test in 0.068s

OK

E:\dwyllie\GitHub\SPIRALTREE\src>

I don't think I have permission to create a PR (sorry, I've not done this before) but I have committed my local change to a copy of your repository [here] (https://github.com/davidhwyllie/ETE3_ISSUE258)

If you would like me to do something different, please let me know.

davidhwyllie commented 5 years ago

Thank you for your assistance. I will submit a pull request as advised. Issue reproduced and resolution tested on the following system: Windows-10-10.0.17134-SP0 3.6.6rc1 (v3.6.6rc1:1015e38be4, Jun 12 2018, 08:38:06) [MSC v.1900 64 bit (AMD64)]

Unit test and commit fixing issue https://github.com/davidhwyllie/ete/commit/1752154f361f57144ce4b09f3993f6b8374b2993