Closed DSLituiev closed 5 years ago
Thanks for the report and sorry about the issue. Could you provide a bit more detail about how you're using the GFF parser? I was hoping we might be able to resolve this generally without needing to call decode
on every line during parsing. With a little more understanding of the full traceback we can hopefully reproduce. Thanks again.
It sounds like the file was opened in binary mode (as opposed to the default text mode); was this intentional? I can repro this easily enough with one of the test files:
>>> import BCBio.GFF
>>> e = BCBio.GFF.GFFExaminer()
>>> e.parent_child_map(open('Tests/GFF/c_elegans_WS199_shortened_gff.txt'))
{('Coding_transcript', 'gene'): [('Coding_transcript', 'mRNA')],
('Coding_transcript', 'mRNA'): [('Coding_transcript', 'CDS'),
('Coding_transcript', 'exon'),
('Coding_transcript', 'five_prime_UTR'),
('Coding_transcript', 'intron'),
('Coding_transcript', 'three_prime_UTR')]}
>>> e.parent_child_map(open('Tests/GFF/c_elegans_WS199_shortened_gff.txt', 'rb'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File ".../bcbb/gff/BCBio/GFF/GFFParser.py", line 777, in _file_or_handle_inside
out = fn(*args, **kwargs)
File ".../bcbb/gff/BCBio/GFF/GFFParser.py", line 858, in parent_child_map
if line.startswith("##FASTA"):
TypeError: startswith first arg must be bytes or a tuple of bytes, not str
But... I'm not sure that we should necessarily be in the business of determining the correct encoding to use. I guess UTF-8 (which is the default you get with bytes.decode()
) might be a safe-enough bet... maybe we could do something with io.TextIOWrapper
up in @_file_or_handle
:
diff --git a/gff/BCBio/GFF/GFFParser.py b/gff/BCBio/GFF/GFFParser.py
index a8ea059..c5e2da8 100644
--- a/gff/BCBio/GFF/GFFParser.py
+++ b/gff/BCBio/GFF/GFFParser.py
@@ -19,8 +19,10 @@ import os
import copy
import re
import collections
+import io
import itertools
import warnings
+import six
from six.moves import urllib
# Make defaultdict compatible with versions of python older than 2.4
try:
@@ -770,6 +772,8 @@ def _file_or_handle(fn):
if hasattr(in_file, "read"):
need_close = False
in_handle = in_file
+ if six.PY3 and not isinstance(in_handle, io.TextIOBase):
+ in_handle = io.TextIOWrapper(in_handle, encoding="UTF-8")
else:
need_close = True
in_handle = open(in_file)
... but I'm still not entirely sure that we should. Could at least raise a TypeError
instead, though:
if six.PY3 and not isinstance(in_handle, io.TextIOBase):
raise TypeError('input handle must be opened in text mode')
Tim; Thanks for the suggestion, the approach of identifying and raising a useful error is a great idea. Would you be able to submit a PR for that so you can get proper credit in the git history? I appreciate you thinking about this.
error something like :
adding following fixes on line 856 in
BCBio/GFF/GFFParser.py
: