hayd opened 9 years ago
@hayd Thanks for the PR!
How hard would it be to test whether ensureUtf should call encode or decode? If you can convince me that your change is correct, I'll be comfortable merging your changes.
Here are some examples of using str/encode/decode. As you can see, the behaviour is very different. Python 2 always returns a str in the cases below; Python 3 returns a str (utf8) only via the decode path.
Python 2:
>>> u"abc".encode("utf-8", "ignore")
'abc'
>>> b"abc".encode("utf-8", "ignore")
'abc'
>>> str(b"abc")
'abc'
Python 3:
>>> u"abc".encode("utf-8", "ignore")
b'abc'
>>> b"abc".encode("utf-8", "ignore")
AttributeError
>>> str(b"abc")
"b'abc'"
>>> b"abc".decode("utf-8", "ignore")
'abc'
>>> u"abc".decode("utf-8", "ignore")
AttributeError
>>> str(u"abc")
'abc'
Note: unicode encoded becomes bytes (not utf), thus not "ensuring utf".
Which is to say, in Python 2 calling encode on unicode did nothing visible, but semantically it was the wrong operation (as we can see when Python 3 tries to do "the right thing"). decode is what it ought to have been doing: that gives the same results as before but now works in both Python 2 and 3.
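For reference, a minimal sketch of the decode-based version (matching the try/except shape discussed in this thread; `ensure_utf` is just an illustrative name):

```python
def ensure_utf(s):
    # bytes (or a Python 2 str) decode cleanly to text
    try:
        return s.decode('utf8', 'ignore')
    except AttributeError:
        # Python 3 str has no .decode; it is already text
        return str(s)

print(ensure_utf(b'abc'))  # 'abc'
print(ensure_utf('abc'))   # 'abc'
```

On Python 3 the bytes input decodes and the str input falls through to the except clause; on Python 2 both str and unicode take the decode branch for ascii input.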
@hayd I think the coder's intention was to do something like the following with python 2 in mind:
#!/usr/bin/env python
# -*- coding: iso-8859-1 -*-
# define text
s_iso88591 = "ÄÖÜ"
# convert text to unicode
s_unicode = s_iso88591.decode("iso-8859-1")
# encode to utf-8
s_utf8 = s_unicode.encode("utf-8", "ignore")
print(type(s_iso88591))
print(type(s_unicode))
print(type(s_utf8))
try:
    print(s_iso88591)
except:
    print("not presentable")
try:
    print(s_unicode)
except:
    print("not presentable")
try:
    print(s_utf8)
except:
    print("not presentable")
Console output:
<type 'str'>
<type 'unicode'>
<type 'str'>
��
ÄÖÜ
ÄÖÜ
If you define the text as follows, the decode step isn't necessary because it is already Unicode, but I wanted to show the iso-8859-1 effect.
s_iso88591 = u"ÄÖÜ"
So if you want to change the encoding you have to do it with Unicode as intermediate step.
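The same intermediate-step conversion in Python 3, as a sketch (bytes stand in for the Python 2 str; this is not LightTable code):

```python
# start from the iso-8859-1 byte representation of "ÄÖÜ"
latin1_bytes = "ÄÖÜ".encode("iso-8859-1")   # b'\xc4\xd6\xdc'
# step 1: decode to a unicode str (the intermediate step)
text = latin1_bytes.decode("iso-8859-1")
# step 2: encode to utf-8
utf8_bytes = text.encode("utf-8", "ignore")
print(utf8_bytes)  # b'\xc3\x84\xc3\x96\xc3\x9c'
```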
In addition I want to quote from the python documentation:
compile(source, filename, mode, flags=0, dont_inherit=False, optimize=-1)
Compile the source into a code or AST object. Code objects can be executed by exec() or eval(). source can either be a normal string, a byte string, or an AST object.
The source parameter needs a normal string or a byte string; the ensureUtf() function is called there, so it should return one of these types.
All in all, I'm not sure what to do with this function. I didn't notice any difference in output in LightTable with Python 3, neither with encode nor with decode, as long as they are surrounded by a try-except block.
@UnknownProgrammer Running your code above in python3 gives
AttributeError: 'str' object has no attribute 'decode'
(That is, if it's already unicode it'll hit the except in python 3. It's a no-op in python 2.)
whilst I agree that it was doing something different before (not always returning a unicode/utf type), I think that was a bug rather than the programmer's intention for the "ensure utf" function.
The source parameter needs a normal string or a byte string, the ensureUft() function is called there
That's great, that's precisely my example above.
I didn’t notice any difference on output in lighttable
This is the important thing IMO.
@hayd I'm inclined to agree that your change is correct:
Here's the commit where Chris added the ensureUtf function.
@UnknownProgrammer Would you test these changes and confirm whether they work for you with Python 3 code?
@UnknownProgrammer Upon reflection, your example above is quite interesting - thanks! Although I can't seem to get latin-1 coding to work on Python 3 (I can see this on Python 2): strings seem to always be unicode (so "ÄÖÜ" is read as 'Ã\x84Ã\x96Ã\x9c', which prints as "ÃÃÃ"). Note that this would have to do an encode/decode round trip to do this properly...
>>> "\xc3\x84\xc3\x96\xc3\x9c"
'Ã\x84Ã\x96Ã\x9c'
>>> "\xc3\x84\xc3\x96\xc3\x9c".encode("iso-8859-1")
b'\xc3\x84\xc3\x96\xc3\x9c'
>>> "\xc3\x84\xc3\x96\xc3\x9c".encode("iso-8859-1").decode("utf-8")
'ÄÖÜ'
So the good news is that previously your example would have actually raised (in python 2):
print(s_iso88591.encode("utf-8", "ignore"))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
whereas decode parses it to valid unicode u'\xc4\xd6\xdc' (i.e. it works/is fixed with this PR).
Which is to say, this PR also fixes #10.
@hayd the example is Python 2 only code: we have two text types, str and unicode (the latter being equivalent to the Python 3 str).
• a Unicode string gets encoded to a Python 2.x string (actually, a sequence of bytes)
• a Python 2.x string gets decoded to a Unicode string
That's a quote from the accepted answer of this post [http://stackoverflow.com/questions/368805/python-unicodedecodeerror-am-i-misunderstanding-encode#370199]
def ensureUtf(s):
    if type(s) == unicode:
        return s.encode('utf8', 'ignore')
    else:
        return str(s)
I checked the type of the input objects s in LightTable with Python 2 and all tested objects were of type Unicode. The return value for the Unicode objects is a byte string (str) encoded with utf8.
Unicode -> str(utf8)
Calling decode for Unicode objects makes no sense, because the objects are already Unicode.
def ensureUtf(s):
    try:
        return s.decode('utf8', 'ignore')
    except AttributeError:
        return str(s)
If the input object is of type str and you want to return Unicode objects, decode makes sense, but you have to know the encoding of the byte string to succeed.
str(?) -> unicode
In Python 2 the default encoding is ascii; in Python 3 it's utf-8.
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
The preferred encoding on Windows is cp1252:
>>> import locale
>>> locale.getpreferredencoding()
'cp1252'
On Linux it's utf-8. Many encodings to take care of.
Python 3 removed support for non-Unicode text data (the Python 2 str class) and replaced it with the byte type bytes. Therefore all strings are sequences of unicode characters and you can't do the same as in Python 2.
>>> s_unicodeutf8 = "ÄÖÜ"
>>> type(s_unicodeutf8)
<class 'str'>
>>> s_unicodeutf8.encode("iso-8859-1")
b'\xc4\xd6\xdc'
>>> type(b'\xc4\xd6\xdc')
<class 'bytes'>
s_unicodeutf8 is a unicode str so it gets encoded to bytes (here with iso-8859-1 encoding)
>>> b'\xc4\xd6\xdc'.decode("iso-8859-1", "ignore")
'ÄÖÜ'
Bytes decoded to unicode str
>>> s_utf8bytes = "ÄÖÜ".encode("utf8")
>>> s_utf8bytes
b'\xc3\x84\xc3\x96\xc3\x9c'
>>> b'\xc3\x84\xc3\x96\xc3\x9c'.decode("utf8", "ignore")
'ÄÖÜ'
>>> b'\xc4\xd6\xdc'.decode("utf8", "ignore")
''
>>> b'\xc4\xd6\xdc'.decode("utf8")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 0: invalid
continuation byte
If you try to decode with the wrong encoding you get nothing or an error.
And that is what could happen in the second ensureUtf function above, but in LightTable with Python 3 all tested input values of this function were of type str. And you can't call decode on str.
With encode, all str objects are encoded to bytes in utf8, which compile() can handle.
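As a quick check of that claim (plain Python 3, not the plugin's actual call site): compile() accepts a utf8-encoded bytes source and decodes it itself.

```python
# compile() takes either str or bytes source in Python 3
source = 's = "äüö"\ns_upper = s.upper()'
code = compile(source.encode('utf8'), '<string>', 'exec')
ns = {}
exec(code, ns)
print(ns['s_upper'])  # ÄÜÖ
```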
At the moment I can't find a case where decode is called, but encode is called every time and converts from unicode to str (Python 2) and from str to bytes (Python 3).
>>> "\xc3\x84\xc3\x96\xc3\x9c"
'Ã\x84Ã\x96Ã\x9c'
>>> "\xc3\x84\xc3\x96\xc3\x9c".encode("iso-8859-1")
b'\xc3\x84\xc3\x96\xc3\x9c'
>>> "\xc3\x84\xc3\x96\xc3\x9c".encode("iso-8859-1").decode("utf-8")
'ÄÖÜ'
>>> type("\xc3\x84\xc3\x96\xc3\x9c")
<class 'str'>
>>> type(b"\xc3\x84\xc3\x96\xc3\x9c")
<class 'bytes'>
You used the wrong object type; without the small b it is only a unicode string. The decode function exists only on bytes.
>>> b"\xc3\x84\xc3\x96\xc3\x9c".decode("utf8")
'ÄÖÜ'
The "iso-8859-1" representation of "ÄÖÜ" looks like this:
>>> 'ÄÖÜ'.encode("iso-8859-1")
b'\xc4\xd6\xdc'
>>> b'\xc4\xd6\xdc'.decode("iso-8859-1")
'ÄÖÜ'
Conversion path from iso-8859-1 to utf8 and back: iso-8859-1 -> decode -> unicode string -> encode -> utf8
utf8 -> decode -> unicode string ->encode -> iso-8859-1
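Both paths, sketched in Python 3:

```python
iso_bytes = b'\xc4\xd6\xdc'  # "ÄÖÜ" in iso-8859-1
# iso-8859-1 -> decode -> unicode string -> encode -> utf8
utf8_bytes = iso_bytes.decode("iso-8859-1").encode("utf-8")
print(utf8_bytes)  # b'\xc3\x84\xc3\x96\xc3\x9c'
# utf8 -> decode -> unicode string -> encode -> iso-8859-1
assert utf8_bytes.decode("utf-8").encode("iso-8859-1") == iso_bytes
```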
@hayd @UnknownProgrammer Thanks for looking at this. I'll gladly merge a PR with which you're both happy.
@UnknownProgrammer I don't follow what you're advocating here going forward (and whether it fixes #10)?
but you have to know the encoding of the byte string to have success.
http://stackoverflow.com/a/436299/1240268 :s
Or do you look up the # -*- coding line if it exists, and otherwise use getdefaultencoding (or getpreferredencoding??). Does this mean you're suggesting:
try:
    return s.decode(ENCODING_ABOVE, 'ignore')
Like I say, in python 3 I can't get your example with latin-1 and the coding line to work at all (the initial string is read as broken unicode, whereas in python 2 it's a str).
@hayd The picture in #10 shows me that the latest python plugin isn't installed. The last time I checked it was updated separately and nobody mentioned the version of the plugin. With the latest plugin it is fixed in Python 2, and my PR will make it able to run with Python 3, too.
Using encode is correct.
The addition of:
# -*- coding: utf-8 -*-
is necessary if you run LT with Python 2 and want to use utf8-encoded files, because the default encoding of Python 2 is ascii. LT calls __import__() on these files; if you don't specify another encoding, it tries to import with ascii, which will cause an error on non-ascii-encoded files.
Like I say, in python 3 I can't get your example with latin-1 and the coding line to work at all (the initial string is read as broken unicode, whereas in python 2 it's a str).
As I explained in the first line of my previous post, it is Python 2 only, because non-Unicode text strings no longer exist in Python 3. All strings are implicitly decoded to unicode strings, and in Python 3 you can only encode unicode to bytes and decode bytes to unicode. After the quote of your post (the quote with the 4 bars) I showed you that you did something wrong.
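A quick Python 3 check of that asymmetry:

```python
text = "ÄÖÜ"
data = text.encode("utf-8")          # str -> bytes: the only direction encode goes
assert data.decode("utf-8") == text  # bytes -> str: the only direction decode goes
# the Python 2 counterparts are gone:
assert not hasattr(text, "decode")   # str has no .decode
assert not hasattr(data, "encode")   # bytes has no .encode
```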
@hayd @kenny-evitt This PR creates an error in LT with Python 2 and the following code:
# -*- coding: utf-8 -*-
s="äüö"
print(s)
Error:
Traceback (most recent call last):
File "/home/user/.config/LightTable/plugins/Python/py-src/ltmain.py", line 214, in handleEval
ensureUtf(code), ensureUtf(data[2]["name"]), 'exec')
File "/home/user/.config/LightTable/plugins/Python/py-src/ltmain.py", line 55, in ensureUtf
return s.decode('utf8', 'ignore')
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 5-7: ordinal not in range(128)
If you're using non-unicode in a utf-coded document, isn't that a bug in the python file? This is the bit I don't quite understand: why is non-valid unicode ending up in the (python3) str?
Perhaps we could forgive non-valid unicode and simply catch the UnicodeEncodeError...? And just return s?
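Something like this, say (a hypothetical variant, not tested against LightTable; `ensure_utf` is an illustrative name):

```python
def ensure_utf(s):
    try:
        return s.decode('utf8', 'ignore')
    except AttributeError:
        return str(s)  # Python 3 str: already unicode
    except UnicodeEncodeError:
        # Python 2 only: decoding a unicode object re-encodes via ascii first,
        # which blows up on non-ascii input; forgive it and return s unchanged
        return s
```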
The bit with 4 bars was intentional (str rather than bytes), as this seems the only way a round trip can be done in python 3, and it is the way your original example is interpreted in python 3 even with the latin-1 coding header.
@hayd You want me to try this:
# -*- coding: utf-8 -*-
s=unicode("äüö")
print(s)
Or that:
# -*- coding: utf-8 -*-
s=unicode("äüö","utf8")
print(s)
Same error, both times.
The bit with 4 bars was intentional (str rather than bytes), as this seems the only way a round trip can be done in python 3, and it is the way your original example is interpreted in python 3 even with the latin-1 coding header.
It is not meant to be used with Python 3 because it takes advantage of the Python 2 str class, which doesn't exist in Python 3 in the same way. The only thing you had was the utf-8 encoded bytes representation:
b"\xc3\x84\xc3\x96\xc3\x9c"
You never had the "iso-8859-1" representation:
b'\xc4\xd6\xdc'
The only thing you did was convert utf8 bytes to a unicode str in a strange way.
Maybe it is clearer this way (Python 2 only!):
#!/usr/bin/env python
# -*- coding: iso-8859-1 -*-
# define text
s_iso88591 = "ÄÖÜ"
# convert text to unicode
s_unicode = s_iso88591.decode("iso-8859-1")
# encode to utf-8
s_utf8 = s_unicode.encode("utf-8", "ignore")
print(type(s_iso88591))
print(type(s_unicode))
print(type(s_utf8))
try:
    print(s_iso88591)
    print(repr(s_iso88591))
except:
    print("not presentable")
try:
    print(s_unicode)
    print(repr(s_unicode))
except:
    print("not presentable")
try:
    print(s_utf8)
    print(repr(s_utf8))
except:
    print("not presentable")
Output:
<type 'str'>
<type 'unicode'>
<type 'str'>
��
'\xc4\xd6\xdc'
ÄÖÜ
u'\xc4\xd6\xdc'
ÄÖÜ
'\xc3\x84\xc3\x96\xc3\x9c'
@hayd @UnknownProgrammer What should we do with this PR?
Hi, I just found this PR which is pretty much the same as mine I think (I have no time right now to read it thoroughly). Might be worth cross-ref' it : #47
The first commit uses autopep8 to clean up the code a little. Fixes #15.
The second commit makes a couple of changes for python 3 compatibility. #24 #28
Note: a smaller diff can be seen with whitespace ignored (though obviously whitespace is meaningful in python)...