hayd opened 9 years ago
@hayd Thanks for the PR!
How hard would it be to test whether ensureUtf should call encode or decode? If you can convince me that your change is correct, I'll be comfortable merging your changes.
Here are some examples of using str/encode/decode. As you can see, the behaviour is very different. Python 2 always returns a str in the cases below; Python 3 returns a str (utf8) only via the decode path.
Python 2:
>>> u"abc".encode("utf-8", "ignore")
'abc'
>>> b"abc".encode("utf-8", "ignore")
'abc'
>>> str(b"abc")
'abc'
Python 3:
>>> u"abc".encode("utf-8", "ignore")
b'abc'
>>> b"abc".encode("utf-8", "ignore")
AttributeError
>>> str(b"abc")
"b'abc'"
>>> b"abc".decode("utf-8", "ignore")
'abc'
>>> u"abc".decode("utf-8", "ignore")
AttributeError
>>> str(u"abc")
'abc'
Note: unicode encoded becomes bytes (not utf), thus not "ensuring utf".
Which is to say, in Python 2 calling encode on unicode did nothing visible, but semantically it was the wrong operation (as we can see when Python 3 tries to do "the right thing"). decode is what it ought to have been doing: that gives the same results as before but now works in both Python 2 and 3.
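For reference, a minimal sketch of the decode-based version (matching the try/except shape discussed in this thread; `ensure_utf` is just an illustrative name):

```python
def ensure_utf(s):
    # bytes (or a Python 2 str) decode cleanly to text
    try:
        return s.decode('utf8', 'ignore')
    except AttributeError:
        # Python 3 str has no .decode; it is already text
        return str(s)

print(ensure_utf(b'abc'))  # 'abc'
print(ensure_utf('abc'))   # 'abc'
```

On Python 3 the bytes input decodes and the str input falls through to the except clause; on Python 2 both str and unicode take the decode branch for ascii input.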
@hayd I think the coder's intention was to do something like the following with python 2 in mind:
#!/usr/bin/env python
# -*- coding: iso-8859-1 -*-
# define text
s_iso88591 = "ÄÖÜ"
# convert text to unicode
s_unicode = s_iso88591.decode("iso-8859-1")
# encode to utf-8
s_utf8 = s_unicode.encode("utf-8", "ignore")
print(type(s_iso88591))
print(type(s_unicode))
print(type(s_utf8))
try:
    print(s_iso88591)
except:
    print("not presentable")
try:
    print(s_unicode)
except:
    print("not presentable")
try:
    print(s_utf8)
except:
    print("not presentable")
Console output:
<type 'str'>
<type 'unicode'>
<type 'str'>
��
ÄÖÜ
ÄÖÜ
If you define the text as follows, the decode step isn't necessary because it is already Unicode, but I wanted to show the iso-8859-1 effect.
s_iso88591 = u"ÄÖÜ"
So if you want to change the encoding you have to do it with Unicode as intermediate step.
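The same intermediate-step conversion in Python 3, as a sketch (bytes stand in for the Python 2 str; this is not LightTable code):

```python
# start from the iso-8859-1 byte representation of "ÄÖÜ"
latin1_bytes = "ÄÖÜ".encode("iso-8859-1")   # b'\xc4\xd6\xdc'
# step 1: decode to a unicode str (the intermediate step)
text = latin1_bytes.decode("iso-8859-1")
# step 2: encode to utf-8
utf8_bytes = text.encode("utf-8", "ignore")
print(utf8_bytes)  # b'\xc3\x84\xc3\x96\xc3\x9c'
```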
In addition I want to quote from the python documentation:
compile(source, filename, mode, flags=0, dont_inherit=False, optimize=-1)
Compile the source into a code or AST object. Code objects can be executed by exec() or eval(). source can either be a normal string, a byte string, or an AST object.
The source parameter needs a normal string or a byte string; the ensureUtf() function is called there, so it should return one of these types.
All in all, I'm not sure what to do with this function. I didn't notice any difference in output in LightTable with Python 3, neither with encode nor with decode, as long as they are surrounded by a try-except block.
@UnknownProgrammer Running your code above in python3 gives
AttributeError: 'str' object has no attribute 'decode'
(That is, if it's already unicode it'll hit the except in python 3. It's a no-op in python 2.)
whilst I agree that it was doing something different before (not always returning a unicode/utf type), I think that was a bug rather than the programmer's intention for the "ensure utf" function.
The source parameter needs a normal string or a byte string, the ensureUft() function is called there
That's great, that's precisely my example above.
I didn’t notice any difference on output in lighttable
This is the important thing IMO.
@hayd I'm inclined to agree that your change is correct:
Here's the commit where Chris added the ensureUtf function.
@UnknownProgrammer Would you test these changes and confirm whether they work for you with Python 3 code?
@UnknownProgrammer Upon reflection, your example above is quite interesting - thanks! Although I can't seem to get latin-1 coding to work on Python 3 (I can see this on Python 2): strings seem to always be unicode (so "ÄÖÜ" is read as 'Ã\x84Ã\x96Ã\x9c', which prints as "ÃÃÃ"). Note that this would have to do an encode/decode round trip to do this properly...
>>> "\xc3\x84\xc3\x96\xc3\x9c"
'Ã\x84Ã\x96Ã\x9c'
>>> "\xc3\x84\xc3\x96\xc3\x9c".encode("iso-8859-1")
b'\xc3\x84\xc3\x96\xc3\x9c'
>>> "\xc3\x84\xc3\x96\xc3\x9c".encode("iso-8859-1").decode("utf-8")
'ÄÖÜ'
So the good news is that previously your example would have actually raised (in python 2):
print(s_iso88591.encode("utf-8", "ignore"))
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
whereas decode parses it to valid unicode u'\xc4\xd6\xdc' (i.e. it works/is fixed with this PR).
Which is to say, this PR also fixes #10.
@hayd the example is Python 2 only code: we have two text types, str and unicode (the latter being equivalent to the Python 3 str).
• a Unicode string gets encoded to a Python 2.x string (actually, a sequence of bytes)
• a Python 2.x string gets decoded to a Unicode string
That's a quote from the accepted answer of this post [http://stackoverflow.com/questions/368805/python-unicodedecodeerror-am-i-misunderstanding-encode#370199]
def ensureUtf(s):
    if type(s) == unicode:
        return s.encode('utf8', 'ignore')
    else:
        return str(s)
I checked the type of the input objects s in LightTable with Python 2 and all tested objects were of type Unicode. The return value for the Unicode objects is a byte string (str) encoded with utf8.
Unicode -> str(utf8)
Calling decode for Unicode objects makes no sense, because the objects are already Unicode.
def ensureUtf(s):
    try:
        return s.decode('utf8', 'ignore')
    except AttributeError:
        return str(s)
If the input object is of type str and you want to return Unicode objects, decode makes sense, but you have to know the encoding of the byte string to succeed.
str(?) -> unicode
In Python 2 the default encoding is ascii; in Python 3 it's utf-8.
>>> import sys
>>> sys.getdefaultencoding()
'ascii'
The preferred encoding on Windows is cp1252:
>>> import locale
>>> locale.getpreferredencoding()
'cp1252'
On Linux it's utf-8. Many encodings to take care of.
Python 3 removed support for non-Unicode text data (the Python 2 str class) and replaced it with the byte type bytes. Therefore all strings are sequences of unicode characters and you can't do the same as in Python 2.
>>> s_unicodeutf8 = "ÄÖÜ"
>>> type(s_unicodeutf8)
<class 'str'>
>>> s_unicodeutf8.encode("iso-8859-1")
b'\xc4\xd6\xdc'
>>> type(b'\xc4\xd6\xdc')
<class 'bytes'>
s_unicodeutf8 is a unicode str so it gets encoded to bytes (here with iso-8859-1 encoding)
>>> b'\xc4\xd6\xdc'.decode("iso-8859-1", "ignore")
'ÄÖÜ'
Bytes decoded to unicode str
>>> s_utf8bytes = "ÄÖÜ".encode("utf8")
>>> s_utf8bytes
b'\xc3\x84\xc3\x96\xc3\x9c'
>>> b'\xc3\x84\xc3\x96\xc3\x9c'.decode("utf8", "ignore")
'ÄÖÜ'
>>> b'\xc4\xd6\xdc'.decode("utf8", "ignore")
''
>>> b'\xc4\xd6\xdc'.decode("utf8")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc4 in position 0: invalid
continuation byte
If you try to decode with the wrong encoding you get nothing or an error.
And that is what could happen in the second ensureUtf function above, but in LightTable with Python 3 all tested input values of this function were of type str. And you can't call decode on str.
With encode, all str objects are encoded to bytes in utf8, which compile() can handle.
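As a quick check of that claim (plain Python 3, not the plugin's actual call site): compile() accepts a utf8-encoded bytes source and decodes it itself.

```python
# compile() takes either str or bytes source in Python 3
source = 's = "äüö"\ns_upper = s.upper()'
code = compile(source.encode('utf8'), '<string>', 'exec')
ns = {}
exec(code, ns)
print(ns['s_upper'])  # ÄÜÖ
```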
At the moment I can't find a case where decode is called, but encode is called every time and converts from unicode to str (Python 2) and from str to bytes (Python 3).
>>> "\xc3\x84\xc3\x96\xc3\x9c"
'Ã\x84Ã\x96Ã\x9c'
>>> "\xc3\x84\xc3\x96\xc3\x9c".encode("iso-8859-1")
b'\xc3\x84\xc3\x96\xc3\x9c'
>>> "\xc3\x84\xc3\x96\xc3\x9c".encode("iso-8859-1").decode("utf-8")
'ÄÖÜ'
>>> type("\xc3\x84\xc3\x96\xc3\x9c")
<class 'str'>
>>> type(b"\xc3\x84\xc3\x96\xc3\x9c")
<class 'bytes'>
You used the wrong object type; without the small b it is only a unicode string. The decode function exists only on bytes.
>>> b"\xc3\x84\xc3\x96\xc3\x9c".decode("utf8")
'ÄÖÜ'
The "iso-8859-1" representation of "ÄÖÜ" looks like this:
>>> 'ÄÖÜ'.encode("iso-8859-1")
b'\xc4\xd6\xdc'
>>> b'\xc4\xd6\xdc'.decode("iso-8859-1")
'ÄÖÜ'
Conversion path from iso-8859-1 to utf8 and back: iso-8859-1 -> decode -> unicode string -> encode -> utf8
utf8 -> decode -> unicode string ->encode -> iso-8859-1
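Both paths, sketched in Python 3:

```python
iso_bytes = b'\xc4\xd6\xdc'  # "ÄÖÜ" in iso-8859-1
# iso-8859-1 -> decode -> unicode string -> encode -> utf8
utf8_bytes = iso_bytes.decode("iso-8859-1").encode("utf-8")
print(utf8_bytes)  # b'\xc3\x84\xc3\x96\xc3\x9c'
# utf8 -> decode -> unicode string -> encode -> iso-8859-1
assert utf8_bytes.decode("utf-8").encode("iso-8859-1") == iso_bytes
```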
@hayd @UnknownProgrammer Thanks for looking at this. I'll gladly merge a PR with which you're both happy.
@UnknownProgrammer I don't follow what you're advocating here going forward (and whether it fixes #10)?
but you have to know the encoding of the byte string to have success.
http://stackoverflow.com/a/436299/1240268 :s
Or do you look up the # -*- coding line if it exists, and otherwise use getdefaultencoding (or getpreferredencoding??). Does this mean you're suggesting:
try:
    return s.decode(ENCODING_ABOVE, 'ignore')
Like I say, in python 3 I can't get your example with latin-1 and the coding line to work at all (the initial string is read as broken unicode, whereas in python 2 it's a str).
@hayd The picture in #10 shows me that the latest python plugin isn't installed. The last time I checked it was updated separately and nobody mentioned the version of the plugin. With the latest plugin it is fixed in Python 2, and my PR will make it able to run with Python 3, too.
Using encode is correct.
The addition of:
# -*- coding: utf-8 -*-
is necessary if you run LT with Python 2 and want to use utf8-encoded files, because the default encoding of Python 2 is ascii. LT calls __import__() on these files; if you don't specify another encoding, it tries to import with ascii, which will cause an error on non-ascii-encoded files.
Like I say, in python 3 I can't get your example with latin-1 and the coding line to work at all (the initial string is read as broken unicode, whereas in python 2 it's a str).
As I explained in the first line of my previous post, it is Python 2 only, because non-Unicode text strings no longer exist in Python 3. All strings are implicitly decoded to unicode strings, and in Python 3 you can only encode unicode to bytes and decode bytes to unicode. After the quote of your post (the quote with the 4 bars) I showed you that you did something wrong.
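A quick Python 3 check of that asymmetry:

```python
text = "ÄÖÜ"
data = text.encode("utf-8")          # str -> bytes: the only direction encode goes
assert data.decode("utf-8") == text  # bytes -> str: the only direction decode goes
# the Python 2 counterparts are gone:
assert not hasattr(text, "decode")   # str has no .decode
assert not hasattr(data, "encode")   # bytes has no .encode
```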
@hayd @kenny-evitt This PR creates an error in LT with Python 2 and the following code:
# -*- coding: utf-8 -*-
s="äüö"
print(s)
Error:
Traceback (most recent call last):
File "/home/user/.config/LightTable/plugins/Python/py-src/ltmain.py", line 214, in handleEval
ensureUtf(code), ensureUtf(data[2]["name"]), 'exec')
File "/home/user/.config/LightTable/plugins/Python/py-src/ltmain.py", line 55, in ensureUtf
return s.decode('utf8', 'ignore')
File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
return codecs.utf_8_decode(input, errors, True)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 5-7: ordinal not in range(128)
If you're using non-unicode in a utf-coded document, isn't that a bug in the python file? This is the bit I don't quite understand: why is non-valid unicode ending up in the (python3) str?
Perhaps we could forgive non-valid unicode and simply catch the UnicodeEncodeError...? And just return s?
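Something like this, say (a hypothetical variant, not tested against LightTable; `ensure_utf` is an illustrative name):

```python
def ensure_utf(s):
    try:
        return s.decode('utf8', 'ignore')
    except AttributeError:
        return str(s)  # Python 3 str: already unicode
    except UnicodeEncodeError:
        # Python 2 only: decoding a unicode object re-encodes via ascii first,
        # which blows up on non-ascii input; forgive it and return s unchanged
        return s
```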
The bit with 4 bars was intentional (str rather than bytes), as this seems the only way a round trip can be done in python 3, and it is the way your original example is interpreted in python 3 even with the latin-1 coding header.
@hayd You want me to try this:
# -*- coding: utf-8 -*-
s=unicode("äüö")
print(s)
Or that:
# -*- coding: utf-8 -*-
s=unicode("äüö","utf8")
print(s)
Same error, both times.
The bit with 4 bars was intentional (str rather than bytes), as this seems the only way a round trip can be done in python 3, and it is the way your original example is interpreted in python 3 even with the latin-1 coding header.
It is not meant to be used with Python 3 because it takes advantage of the Python 2 str class, which doesn't exist in Python 3 in the same way. The only thing you had was the utf-8 encoded bytes representation:
b"\xc3\x84\xc3\x96\xc3\x9c"
You never had the "iso-8859-1" representation:
b'\xc4\xd6\xdc'
The only thing you did was convert utf8 bytes to a unicode str in a strange way.
Maybe it is clearer this way (Python 2 only!):
#!/usr/bin/env python
# -*- coding: iso-8859-1 -*-
# define text
s_iso88591 = "ÄÖÜ"
# convert text to unicode
s_unicode = s_iso88591.decode("iso-8859-1")
# encode to utf-8
s_utf8 = s_unicode.encode("utf-8", "ignore")
print(type(s_iso88591))
print(type(s_unicode))
print(type(s_utf8))
try:
    print(s_iso88591)
    print(repr(s_iso88591))
except:
    print("not presentable")
try:
    print(s_unicode)
    print(repr(s_unicode))
except:
    print("not presentable")
try:
    print(s_utf8)
    print(repr(s_utf8))
except:
    print("not presentable")
Output:
<type 'str'>
<type 'unicode'>
<type 'str'>
��
'\xc4\xd6\xdc'
ÄÖÜ
u'\xc4\xd6\xdc'
ÄÖÜ
'\xc3\x84\xc3\x96\xc3\x9c'
@hayd @UnknownProgrammer What should we do with this PR?
Hi, I just found this PR which is pretty much the same as mine I think (I have no time right now to read it thoroughly). Might be worth cross-ref' it : #47
The first commit uses autopep8 to clean up the code a little. Fixes #15.
The second commit makes a couple of changes for python 3 compatibility. #24 #28
Note: a smaller diff can be seen with whitespace ignored (though obviously whitespace is meaningful in python)...