berkerpeksag / astor

Python AST read/write
https://pypi.org/project/astor/
BSD 3-Clause "New" or "Revised" License

astor.parsefile does not handle non-UTF8 files properly #26

Closed by pmaupin 7 years ago

pmaupin commented 9 years ago

Here is an example in the 3.5 std library. Perhaps we can find/reuse the std library code for handling the file encoding descriptor?

>>> import astor
>>> astor.parsefile('/usr/local/lib/python3.5/sqlite3/test/dbapi.py')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/pat/projects/opensource/astor/astor/misc.py", line 166, in parsefile
    fstr = f.read()
  File "/usr/local/lib/python3.5/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 120: invalid continuation byte

berkerpeksag commented 9 years ago

Yes, we should use io.open(..., encoding='utf-8') instead of https://github.com/berkerpeksag/astor/blob/master/astor/misc.py#L165

pmaupin commented 9 years ago

I tried that -- it doesn't actually help. The problem is that the files in question are encoded in iso-8859-1, not UTF-8. They are properly marked according to https://www.python.org/dev/peps/pep-0263/ with a first line of

#-*- coding: iso-8859-1 -*-

PEP 263 gives the instructions for decoding this -- read the first two lines, look for a match against its regexp, etc.

1) We can do that -- it's not rocket science -- but I was wondering if there was an accessible thing in the library to do it already (I think the implementation of PEP263 does it down in the compiler.)

2) Then we have the round-trip issue. Obviously, we don't need to go back to any random encoding, but I think the default encoding for < Python 3.0 is ASCII, so we probably haven't thought out how to generate non-ASCII characters in the output dump -- we probably need to automatically add the UTF-8 encoding at the top of the file.
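A minimal sketch of the round-trip fix in (2), assuming a hypothetical `dump_source` helper (not astor's actual API): write the regenerated source as UTF-8, and prepend a PEP 263 cookie whenever the text contains non-ASCII characters, since the default source encoding before Python 3.0 is ASCII.

```python
def dump_source(source, filename):
    # Hypothetical helper, not part of astor: write regenerated source
    # as UTF-8, adding a PEP 263 coding cookie when the text is not pure
    # ASCII so Python 2 (default source encoding: ASCII) can still load it.
    try:
        source.encode('ascii')
    except UnicodeEncodeError:
        source = '# -*- coding: utf-8 -*-\n' + source
    with open(filename, 'w', encoding='utf-8') as f:
        f.write(source)
```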

berkerpeksag commented 9 years ago

> 1) We can do that -- it's not rocket science -- but I was wondering if there was an accessible thing in the library to do it already (I think the implementation of PEP263 does it down in the compiler.)

https://docs.python.org/3/library/tokenize.html#tokenize.detect_encoding would probably do the trick. We can backport it for Python 2.
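As a sketch of what that could look like (`parse_with_cookie` is a hypothetical name, not astor's API):

```python
import ast
import tokenize

def parse_with_cookie(path):
    # Sketch using tokenize.detect_encoding: it reads up to two lines
    # from the readline callable, looks for a PEP 263 coding cookie
    # (or a BOM), and returns the encoding name plus the lines it read.
    with open(path, 'rb') as f:
        encoding, _ = tokenize.detect_encoding(f.readline)
        f.seek(0)
        text = f.read().decode(encoding)
    return ast.parse(text, filename=path)
```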

pmaupin commented 9 years ago

Great! I'll work on that as part of merging anti8.

ztane commented 9 years ago

Why not use tokenize.open?

berkerpeksag commented 9 years ago

tokenize.open doesn't exist on Python 2 either. Since detect_encoding only accepts a file object's readline method, we'd need to duplicate what tokenize.open does for Python 2 anyway. So I'm +1 for using tokenize.open.
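On Python 3 the whole of parsefile could then shrink to something like this (a sketch, not the patch that eventually landed):

```python
import ast
import tokenize

def parsefile(filename):
    # tokenize.open (Python 3.2+) detects the encoding from the PEP 263
    # cookie or BOM and returns an already-decoded text stream.
    with tokenize.open(filename) as f:
        return ast.parse(f.read(), filename=filename)
```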

pmaupin commented 9 years ago

It turns out the only remaining failures I had were either on 3.4.0 or in broken pretty-print string code. I added some goofy stuff to the pretty-print string code, and everything now seems to work. If nobody has a counter-example, I will close.

pmaupin commented 7 years ago

So, for some reason, Python 3.6 works on my home machine but not my work machine. Both are Ubuntu machines using the same PPA.

pmaupin commented 7 years ago

I'm pretty sure this works on Python 3 now.

Closing until someone who cares generates any sort of failing test.