etetoolkit / ete

Python package for building, comparing, annotating, manipulating and visualising trees. It provides a comprehensive API and a collection of command line tools, including utilities to work with the NCBI taxonomy tree.
http://etetoolkit.org
GNU General Public License v3.0

'Write argument must be str, not bytes' error using phyloXML project.export() method #255

Open davidhwyllie opened 7 years ago

davidhwyllie commented 7 years ago

Hi

I have been trying a javascript tree viewer https://phyd3.bits.vib.be/view.php?id=91162629d258a876ee994e9233b2ad87&f=xml which accepts phyloXML as input. As part of this I have been trying to generate phyloXML from ete3, as our existing visualisations are static and are produced by ete3.

I have discovered that the ete3 phyloxml project.export() method can produce an unexpected error 'write argument must be str, not bytes' depending on the way that the tree is imported.
I imagine this may be python3 specific, but have not checked. Would you be able to look at this and offer advice as to how to prevent this? I have a workaround (documented below), so there is no great rush.

Code to reproduce this issue is below.

!/usr/bin/env python3

ete3 version is 3.0.0b35; running on Windows server 2012; Python 3.5.2.

import ete3

newickString="(s1:2.5e-08,582:2.18e-07,s2:2.5e-08,s3:0.000441758)14;"; # this is iqTree output;

read it into 'standard' ete3 tree

t=ete3.Tree(newickString) print(t) # no problems

read it using explicit format

t=ete3.Tree(newickString, format=1) print(t) # no problems

sometimes it might be nice to export to phyloxml.

-- from http://etetoolkit.org/support/ --

Building a PhyloXML document out of ETE tree instances is not covered in the documentation, but it possible anyways.

You were on the right track :) You just need to pass the tree structure as a newick string to the PhyloXML constructor.

The following line worked for me in your example:

phylo = phyloxml.PhyloxmlTree(newick=spec_tree.write())

read it to phyloxml

project=ete3.Phyloxml() # create project. phylo=ete3.phyloxml.PhyloxmlTree(newick=newickString) project.add_phylogeny(phylo) xmlString=project.export() # non explanatory error message: write argument must be str, not bytes.

project=ete3.Phyloxml() # create project. phylo=ete3.phyloxml.PhyloxmlTree(newick=newickString, format=1) project.add_phylogeny(phylo) xmlString=project.export() # succeeds

davidhwyllie commented 7 years ago

So sorry, the code got interpreted as markdown. Here it is, correctly formatted..

#!/usr/bin/env python3
# ete3 version is 3.0.0b35; running on Windows server 2012; Python 3.5.2.

import ete3

newickString="(s1:2.5e-08,582:2.18e-07,s2:2.5e-08,s3:0.000441758)14;";      # this is iqTree output;

# read it into 'standard' ete3 tree
t=ete3.Tree(newickString)
print(t)            # no problems

# read it using explicit format
t=ete3.Tree(newickString, format=1)
print(t)            # no problems

# sometimes it might be nice to export to phyloxml.

# -- from http://etetoolkit.org/support/ --
#Building a PhyloXML document out of ETE tree instances is not covered in the documentation, but it possible anyways. 
#You were on the right track :) You just need to pass the tree structure as a newick string to the PhyloXML constructor.
#The following line worked for me in your example:
#phylo = phyloxml.PhyloxmlTree(newick=spec_tree.write())

# read it to phyloxml 
project=ete3.Phyloxml()       # create project.
phylo=ete3.phyloxml.PhyloxmlTree(newick=newickString)
project.add_phylogeny(phylo)
xmlString=project.export()      # non explanatory error message: write argument must be str, not bytes.

project=ete3.Phyloxml()       # create project.
phylo=ete3.phyloxml.PhyloxmlTree(newick=newickString, format=1)
project.add_phylogeny(phylo)
xmlString=project.export()      # succeeds
davidhwyllie commented 7 years ago

Actually, even though .export() succeeds without error when the 'format=1' option is used, the output is not correct. The content of some of the tags (e.g. b'14') appears to be the result of printing a Python 3 'bytes' object, which is consistent with the error message generated on other occasions, e.g.

<phy:name>b's1'</phy:name>

as in


<phy:Phyloxml xmlns:phy="http://www.phyloxml.org/1.10/phyloxml.xsd">
    <phy:phylogeny>
        <phy:clade>
            <phy:name>b'14'</phy:name>
            <phy:branch_length>0.000000e+00</phy:branch_length>
            <phy:clade branch_length_attr=b'"2.5e-08"'>
                <phy:name>b's1'</phy:name>
                <phy:branch_length>2.500000e-08</phy:branch_length>
            </phy:clade>
            <phy:clade branch_length_attr=b'"2.18e-07"'>
                <phy:name>b'582'</phy:name>
                <phy:branch_length>2.180000e-07</phy:branch_length>
            </phy:clade>
            <phy:clade branch_length_attr=b'"2.5e-08"'>
                <phy:name>b's2'</phy:name>
                <phy:branch_length>2.500000e-08</phy:branch_length>
            </phy:clade>
            <phy:clade branch_length_attr=b'"0.000441758"'>
                <phy:name>b's3'</phy:name>
                <phy:branch_length>4.417580e-04</phy:branch_length>
            </phy:clade>
        </phy:clade>
    </phy:phylogeny>
</phy:Phyloxml>
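
The b'...' text in the tags matches what Python 3 produces when a bytes object is formatted into a string without being decoded first. A minimal sketch of that behaviour, independent of ete3 (the variable names are hypothetical, and it only assumes that the exporter is holding bytes at this point):

```python
# Standalone illustration (no ete3 involved); the names are hypothetical.
name = "s1".encode("utf-8")          # a bytes object, as the exporter appears to hold

# Formatting bytes into text embeds its repr, b'...' prefix included.
as_repr = "<phy:name>%s</phy:name>" % name
# Decoding first gives the intended text.
as_text = "<phy:name>%s</phy:name>" % name.decode("utf-8")

print(as_repr)   # <phy:name>b's1'</phy:name>
print(as_text)   # <phy:name>s1</phy:name>
```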
davidhwyllie commented 7 years ago

If we use the 'format=0' option, as in the below code.

#!/usr/bin/env python3
# ete3 version is 3.0.0b35; running on Windows server 2012; Python 3.5.2.

import ete3

newickString="(s1:2.5e-08,582:2.18e-07,s2:2.5e-08,s3:0.000441758)14;";      # this is iqTree output;

# read it into 'standard' ete3 tree
t=ete3.Tree(newickString)
print(t)            # no problems

# read it using explicit format
t=ete3.Tree(newickString, format=1)
print(t)            # no problems

# sometimes it might be nice to export to phyloxml.

# -- from http://etetoolkit.org/support/ --
#Building a PhyloXML document out of ETE tree instances is not covered in the documentation, but it possible anyways. 
#You were on the right track :) You just need to pass the tree structure as a newick string to the PhyloXML constructor.
#The following line worked for me in your example:
#phylo = phyloxml.PhyloxmlTree(newick=spec_tree.write())

# read it to phyloxml 
project=ete3.Phyloxml()       # create project.
#phylo=ete3.phyloxml.PhyloxmlTree(newick=newickString)
#project.add_phylogeny(phylo)
#xmlString=project.export()      # non explanatory error message: write argument must be str, not bytes.

project=ete3.Phyloxml()       # create project.
phylo=ete3.phyloxml.PhyloxmlTree(newick=newickString, format=0)
project.add_phylogeny(phylo)
xmlString=project.export()      # succeeds

then we get the traceback below (note that the export starts, but then raises an error):

<phy:Phyloxml xmlns:phy="http://www.phyloxml.org/1.10/phyloxml.xsd">
    <phy:phylogeny>
        <phy:clade>
            <phy:name>b''</phy:name>
            <phy:branch_length>0.000000e+00</phy:branch_length>
            <phy:confidence type=b'"branch_support"'>Traceback (most recent call last):
  File "ete3PhyloXMLOutputTest.py", line 34, in <module>
    xmlString=project.export()      # succeeds
  File "C:\python352\lib\site-packages\ete3\phyloxml\__init__.py", line 65, in export
    return super(Phyloxml, self).export(outfile=outfile, level=level, namespacedef_=namespace)
  File "C:\python352\lib\site-packages\ete3\phyloxml\_phyloxml.py", line 423, in export
    self.exportChildren(outfile, level + 1, namespace_, name_)
  File "C:\python352\lib\site-packages\ete3\phyloxml\_phyloxml.py", line 432, in exportChildren
    phylogeny_.export(outfile, level, namespace_, name_='phylogeny')
  File "C:\python352\lib\site-packages\ete3\phyloxml\_phyloxml_tree.py", line 148, in export
    self.phyloxml_phylogeny.export(outfile=outfile, level=level, name_=name_, namespacedef_=namespacedef_)
  File "C:\python352\lib\site-packages\ete3\phyloxml\_phyloxml.py", line 562, in export
    self.exportChildren(outfile, level + 1, namespace_, name_)
  File "C:\python352\lib\site-packages\ete3\phyloxml\_phyloxml.py", line 595, in exportChildren
    self.clade.export(outfile, level, namespace_, name_='clade')
  File "C:\python352\lib\site-packages\ete3\phyloxml\_phyloxml.py", line 901, in export
    self.exportChildren(outfile, level + 1, namespace_, name_)
  File "C:\python352\lib\site-packages\ete3\phyloxml\_phyloxml.py", line 921, in exportChildren
    confidence_.export(outfile, level, namespace_, name_='confidence')
  File "C:\python352\lib\site-packages\ete3\phyloxml\_phyloxml.py", line 3008, in export
    outfile.write(str(self.valueOf_).encode(ExternalEncoding))
TypeError: write() argument must be str, not bytes

This error can be prevented by replacing line 3008 with the below:

        # if type(self.valueOf_)==float then the line below raises an error:
        ##    outfile.write(str(self.valueOf_).encode(ExternalEncoding))
        if not type(self.valueOf_)==float:
            outfile.write(str(self.valueOf_).encode(ExternalEncoding))
        else:
            outfile.write(str(self.valueOf_))

However, the display of other elements still has the formatting issue.
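
The TypeError itself can be reproduced without ete3, because a text-mode stream in Python 3 accepts only str. The following is a minimal sketch, assuming the export writes to a text stream (as the traceback suggests); `write_text` is a hypothetical helper, not part of ete3.

```python
import io

def write_text(outfile, value, encoding="utf-8"):
    """Write value to a text stream, decoding bytes and stringifying the rest.

    Hypothetical helper, not part of ete3.
    """
    if isinstance(value, bytes):
        text = value.decode(encoding)
    else:
        text = str(value)
    outfile.write(text)

stream = io.StringIO()               # stands in for a text-mode outfile

try:
    stream.write(str(0.95).encode("utf-8"))   # same pattern as line 3008
    error = None
except TypeError as exc:
    error = str(exc)                 # a text stream rejects bytes

write_text(stream, 0.95)             # float, as in the confidence element
write_text(stream, b"s1")            # bytes, as in the name element
result = stream.getvalue()
```

If the outfile is a text stream, a float-only check still leaves the encoded write in place for every other value type, so a helper along these lines may be a more uniform replacement.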

jhcepas commented 7 years ago

Thanks for reporting this, @davidhwyllie. It definitely sounds like a compatibility problem with Py3. Could you run the export command using Python 2 as a workaround?

davidhwyllie commented 7 years ago

Thank you.

I think the issue is that outfile.write() is exporting to file the repr of the parameter passed, which in Python 2 is a str, but which I think in Python 3 is a bytes object.

# python 3
>>> x=b'ABC'
>>> print(x)
b'ABC'

## the fix is to substitute x.decode()
>>> print(x.decode())
ABC

By default, 2.7 generates a str object, which renders correctly without the .decode(), but 3.x generates a bytes object, for which the repr is b'thing', not thing.

As I understand it from studying your great code, each phyloxml element is mapped to a class, and each class has an .export() method which often calls relevant related methods e.g. .exportAttributes().

There are quite a lot of outfile.write() calls (> 100), but only a few export self.value, which I think is the issue. Is it possible that just appending a .decode() call to the relevant parameters would make this work in both 2.7 and 3.x?

# tested on python 2.7.6
>>> x=b'ABC'
>>> print(x.decode())
ABC
>>> x='ABC'
>>> print(x.decode())
ABC
>>> x=u'ABC'
>>> print(x.decode())
ABC

What is your advice? I would really like to get this going on 3.x.
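
One caveat with appending .decode() everywhere: in Python 3, str has no decode method, so calling it on a value that is already text would raise AttributeError. A minimal sketch of a helper that should behave the same on 2.7 and 3.x (`to_text` is a hypothetical name, not an ete3 function):

```python
def to_text(value, encoding="utf-8"):
    """Decode bytes to text; return anything else unchanged (hypothetical helper)."""
    if isinstance(value, bytes):     # bytes on 3.x; plain str on 2.7
        return value.decode(encoding)
    return value                     # already text, or not a string at all

decoded = to_text(b"ABC")
unchanged = to_text(u"ABC")
has_decode = hasattr("ABC", "decode")   # False on Python 3

print(decoded, unchanged, has_decode)
```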

davidhwyllie commented 7 years ago

Is there an existing test suite for the phyloxml module? If so, could you tell me how to run it? I will make a fork and attempt to fix this. If not, if I write tests using unittest is this OK, or do you want a different framework?
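
As a rough idea of what a unittest-based regression check could look like, here is a sketch that does not call ete3 at all. `find_bytes_reprs` is a hypothetical helper that scans exported XML text for leaked bytes reprs, and the sample strings are taken from the output earlier in this thread; in a real test the text would come from the export itself.

```python
import re
import unittest

# Matches a Python bytes repr such as b's1' or b'"2.5e-08"' (hypothetical check).
BYTES_REPR = re.compile(r"\bb'[^']*'")

def find_bytes_reprs(xml_text):
    """Return every bytes repr found in the exported XML text."""
    return BYTES_REPR.findall(xml_text)

class BytesReprTest(unittest.TestCase):
    def test_detects_leaked_repr(self):
        broken = "<phy:name>b's1'</phy:name>"
        self.assertEqual(find_bytes_reprs(broken), ["b's1'"])

    def test_clean_output_passes(self):
        clean = "<phy:name>s1</phy:name>"
        self.assertEqual(find_bytes_reprs(clean), [])
```

Saved as a module, this can be run with `python -m unittest <module name>`.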

jhcepas commented 7 years ago

Hi @davidhwyllie, thanks for your interest! The phyloXML tests are really basic: https://github.com/etetoolkit/ete/blob/master/ete3/test/test_xml_parsers.py#L10

The main reason is that the PhyloXML parser itself is not well supported. The parsing code was automatically generated using generateDS, based on the phyloXML schema. I did not have enough experience with phyloXML (and XML in general), so I could not create a proper parser providing good integration with ETE's Tree instances. Any improvement on that front is more than welcome.