chapmanb / bcbb

Incubator for useful bioinformatics code, primarily in Python and R
http://bcbio.wordpress.com
603 stars 243 forks source link

Python 3 compatibility issue + fix #125

Closed DSLituiev closed 5 years ago

DSLituiev commented 5 years ago

error something like :

startwith accepts bytes, not strings:
            if line.startswith("##FASTA"):

adding following fixes on line 856 in BCBio/GFF/GFFParser.py:

line = line.decode()
chapmanb commented 5 years ago

Thanks for the report and sorry about the issue. Could you provide a bit more detail about how you're using the GFF parser? I was hoping we might be able to resolve this generally without needing to call decode on every line during parsing. With a little more understanding of the full traceback we can hopefully reproduce. Thanks again.

tipabu commented 5 years ago

It sounds like the file was opened in binary mode (as opposed to the default text mode); was this intentional? I can repro this easily enough with one of the test files:

>>> import BCBio.GFF
>>> e = BCBio.GFF.GFFExaminer()
>>> e.parent_child_map(open('Tests/GFF/c_elegans_WS199_shortened_gff.txt'))
{('Coding_transcript', 'gene'): [('Coding_transcript', 'mRNA')],
 ('Coding_transcript', 'mRNA'): [('Coding_transcript', 'CDS'),
                                 ('Coding_transcript', 'exon'),
                                 ('Coding_transcript', 'five_prime_UTR'),
                                 ('Coding_transcript', 'intron'),
                                 ('Coding_transcript', 'three_prime_UTR')]}
>>> e.parent_child_map(open('Tests/GFF/c_elegans_WS199_shortened_gff.txt', 'rb'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../bcbb/gff/BCBio/GFF/GFFParser.py", line 777, in _file_or_handle_inside
    out = fn(*args, **kwargs)
  File ".../bcbb/gff/BCBio/GFF/GFFParser.py", line 858, in parent_child_map
    if line.startswith("##FASTA"):
TypeError: startswith first arg must be bytes or a tuple of bytes, not str

But... I'm not sure that we should necessarily be in the business of determining the correct encoding to use. I guess UTF-8 (which is the default you get with bytes.decode()) might be a safe-enough bet... maybe we could do something with io.TextIOWrapper up in @_file_or_handle:

diff --git a/gff/BCBio/GFF/GFFParser.py b/gff/BCBio/GFF/GFFParser.py
index a8ea059..c5e2da8 100644
--- a/gff/BCBio/GFF/GFFParser.py
+++ b/gff/BCBio/GFF/GFFParser.py
@@ -19,8 +19,10 @@ import os
 import copy
 import re
 import collections
+import io
 import itertools
 import warnings
+import six
 from six.moves import urllib
 # Make defaultdict compatible with versions of python older than 2.4
 try:
@@ -770,6 +772,8 @@ def _file_or_handle(fn):
         if hasattr(in_file, "read"):
             need_close = False
             in_handle = in_file
+            if six.PY3 and not isinstance(in_handle, io.TextIOBase):
+                in_handle = io.TextIOWrapper(in_handle, encoding="UTF-8")
         else:
             need_close = True
             in_handle = open(in_file)

... but I'm still not entirely sure that we should. Could at least raise a TypeError instead, though:

            if six.PY3 and not isinstance(in_handle, io.TextIOBase):
                raise TypeError('input handle must be opened in text mode')
chapmanb commented 5 years ago

Tim; Thanks for the suggestion, the approach of identifying and raising a useful error is a great idea. Would you be able to submit a PR for that so you can get proper credit in the git history? I appreciate you thinking about this.