adriank / ObjectPath

The agile query language for semi-structured data
http://objectpath.org
MIT License
380 stars 93 forks source link

Python 3-compatible checking for unicode #14

Closed sseemayer closed 9 years ago

sseemayer commented 10 years ago

This patch adds support for Python 3 to the ObjectPath shell by detecting what string type the current Python version uses.

adriank commented 10 years ago

Can you tell me what is the use case where shell doesn't work as expected? May be a bigger issue with unicode vs strings somewhere in the code.

sseemayer commented 9 years ago

Sorry for the super-brief 'bug report', I'll try to be more detailed now:

I'm testing this on ArchLinux x86_64 with the most recent stable Python 3.4.1 and Python 2.7.8 - the default Python interpreter living at /usr/bin/env python being Python 3.

I'm getting the following behavior:

objectpath

After removing the shell error handling so I can find out where the exception is thrown, I find out that the problem is that in Python 3, the unicode type does not exist anymore because all strings are unicode by default and have type str. Explicit byte strings have the bytes type. After I apply the patch included in this pull request, all features of ObjectPath that I've tried so far (shell, objectpath.execute) seem to work fine both in Python 2 and 3.

I've also searched recursively for other occurrences of unicode using grep -nIR "unicode" . to see if unicode might potentially cause problems in areas that I did not test:

Hope this helps!

sseemayer commented 9 years ago

Another remark, since you seem to require a lot of checking between Python 2 and 3, it might be a good idea to build a central compatibility layer or use six to clean up the code.

adriank commented 9 years ago

Indeed this bug is there. I wonder how it got to the current version, because I'm checking for Py3 support before committing anything to GH. I'll investigate the problem and get back to you soon!

adriank commented 9 years ago

OK, I see the issue. Your implementation does solve the issue, but I am strong believer in clean solutions. try-except clause around those two lines also solves the problem and is cleaner in my opinion.

It's hard to indicate what exactly the try-except clause does in your code without reading this pull request. If you are writing such code it would be a good idea to put a comment about it, because it is not a common thing for people to see in code.

I'm pushing this:

#python 3 raises error here - unicode is not a proper type there
try:
    if type(r) is unicode:
        r=r.encode("utf8")
except NameError:
    pass

r=r.encode("utf8") is a line that is relevant only to Python2, because Py2 has enormous problems with str and unicode handling - you can't be sure what type is going to be returned from different libraries etc. It's a nightmare there! On the other hand Py3 doesn't need this conversion because strings are always unicode (unless you are dealing with UWSGI or other low level libraries which require str type inputs). That's why above try-except always results in skipping the block of code in Py3.

I think this is called "duck typing" and it's the preferred way of dealing with types in Python.

P.S. FYI: isinstance() is slower than type(). In case of checking simple types it is way faster.

sseemayer commented 9 years ago

In my opinion the cleanest solution would be to have a central utility library for enforcing utf8-ness that can be used over multiple places in the code. It would be a good way to reduce the code re-use and make the actual logic in your implementation more readable.

Your example is not duck typing because you are performing type checks instead of simply treating r as if it was of the desired type. Here is a duck-typing way of enforcing UTF-8 that works in Python 2 and 3:

def ensure_utf8(x):
    try:
        return x.encode("utf8")
    except AttributeError:
        return x

bytes_string = b'asdf'
unicode_string = u'asdf'

print("type(bytes_string) = {0}".format(type(bytes_string)))
print("type(unicode_string) = {0}".format(type(unicode_string)))

enc_bytes_string = ensure_utf8(bytes_string)
enc_unicode_string = ensure_utf8(unicode_string)

print("type(enc_bytes_string) = {0}".format(type(enc_bytes_string)))
print("type(enc_unicode_string) = {0}".format(type(enc_unicode_string)))

Demo:

$ python2 ensure_utf8.py 
type(bytes_string) = <type 'str'>
type(unicode_string) = <type 'unicode'>
type(enc_bytes_string) = <type 'str'>
type(enc_unicode_string) = <type 'str'>
$ python3 ensure_utf8.py 
type(bytes_string) = <class 'bytes'>
type(unicode_string) = <class 'str'>
type(enc_bytes_string) = <class 'bytes'>
type(enc_unicode_string) = <class 'bytes'>

P.S.: While type is faster than isinstance, it will not behave correctly once someone starts subclassing one of the string types (oh god). I guess it's debatable how likely it is that somebody will do that in practice :) For my original use case of handling input in the shell, using isinstance is plenty fast enough, though and both strategies should work just as well.

adriank commented 9 years ago

I wonder if doing py2 & py3 compatible code base is a good idea at all. I'm trying this, but it seems that it has more problems than I expected. Especially when I want OP to be compatible also with PyPy + have plans to use Cython to make the execution faster.

BTW:

adrian@localhost:~$ python3
Python 3.2.3 (default, Feb 27 2014, 21:31:18) 
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> "sss".encode("utf8")
b'sss'

b'' would probably break code in other places:/.