newbytes not fully compatible with bytes: newbytes(newstr(...), '<encoding>') looks like it produces something similar to (but not quite the same as) newbytes(repr(newstr(...)), '<encoding>')

PythonCharmers / python-future

Easy, clean, reliable Python 2/3 compatibility

http://python-future.org

MIT License


Issue #171

Open posita opened 9 years ago

posita commented 9 years ago

On Python 3.4:

>>> from __future__ import print_function, unicode_literals ; from builtins import *
>>> bytes
<class 'bytes'>
>>> str
<class 'str'>
>>> b1 = str(u'abc \u0123 do re mi').encode(u'utf_8') # this works
>>> b1
b'abc \xc4\xa3 do re mi'
>>> b2 = bytes(u'abc \u0123 do re mi', u'utf_8') # so does this
>>> b2
b'abc \xc4\xa3 do re mi'
>>> b1 == b2
True
>>> b3 = bytes(str(u'abc \u0123 do re mi'), u'utf_8') # this works too (unsurprisingly)
>>> b3
b'abc \xc4\xa3 do re mi'
>>> b1 == b3
True

On Python 2.7:

>>> from __future__ import print_function, unicode_literals ; from builtins import *
>>> bytes
<class 'future.types.newbytes.newbytes'>
>>> str
<class 'future.types.newstr.newstr'>
>>> b1 = str(u'abc \u0123 do re mi').encode(u'utf_8') # this works
>>> b1
b'abc \xc4\xa3 do re mi'
>>> type(b1)
<class 'future.types.newbytes.newbytes'>
>>> b2 = bytes(u'abc \u0123 do re mi', u'utf_8') # so does this (argument is native unicode object)
>>> b2
b'abc \xc4\xa3 do re mi'
>>> b1 == b2
True
>>> b3 = bytes(str(u'abc \u0123 do re mi'), u'utf_8') # but this looks like it's encoding the repr() of the newstr
>>> b3
b"b'abc \xc4\xa3 do re mi'"
>>> b1 == b3
False
>>> # I can't figure out what it's actually doing though; these aren't quite the same
>>> bytes(repr(str(u'abc \u0123 do re mi')).encode(u'utf_8'))
b"'abc \\u0123 do re mi'"
>>> bytes(repr(str(u'abc \u0123 do re mi').encode(u'utf_8')), 'utf_8')
b"b'abc \\xc4\\xa3 do re mi'"

edschofield commented 9 years ago

Thanks, Matt! I'll look into this ASAP.

posita commented 9 years ago

The good news is that there's an easy workaround (modified from above):

from builtins import *
s = str(...) # make a newstr on Python 2
# Instead of bytes(s, u'utf_8'), do:
s.encode(u'utf_8') # will return newbytes object on Python 2

So this issue is really about interface compatibility rather than about available functionality. My guess is that most people use str(...).encode(<encoding>) rather than bytes(..., <encoding>), which may be why this hasn't been discovered yet.
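
The workaround can be wrapped in a small helper so that call sites never use the two-argument bytes(...) form. This is only a minimal sketch: encode_text is a hypothetical name, not part of python-future, and it assumes the argument is a text string (a newstr or native unicode on Python 2, a str on Python 3).

```python
def encode_text(text, encoding="utf_8"):
    """Encode a text string without going through bytes(text, encoding).

    Hypothetical helper: it relies only on the string's own .encode()
    method, which the session above shows returning the expected bytes.
    """
    return text.encode(encoding)


encoded = encode_text(u"abc \u0123 do re mi")
```

On Python 2 with `from builtins import *`, passing a newstr should give back a newbytes, as in the b1 example above.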

posita commented 9 years ago

I think I see what is happening here. (The following code snippets are all from Python 2, in case that wasn't obvious.) Consider:

>>> from future.types.newstr import newstr
>>> isinstance(newstr(u'asdf'), unicode)
True

From future/types/newbytes.py at 93:

class newbytes(with_metaclass(BaseNewBytes, _builtin_bytes)):
    ...
    def __new__(cls, *args, **kwargs):
        ... # gets `encoding` from *args, **kwargs
        elif isinstance(args[0], unicode):
            ...
            newargs = [encoding]
            ...
            value = args[0].encode(*newargs)
            ...
        return super(newbytes, cls).__new__(cls, value)

Which, in the case of newbytes(newstr(u'asdf'), u'utf_8') is basically:

value = <newstr>.encode(u'utf_8') # returns instance of <class 'future.types.newbytes.newbytes'>
value = super(<newbytes>, cls).__new__(cls, value) # but see below

So newbytes's parent constructor (i.e., from builtin bytes) is being called with a newbytes instance as its argument. We'll see why this is a problem below, but first consider:

>>> nativebytes = bytes ; nativestr = str ; from builtins import *
>>> bytes(bytes(b'asdf'))
b'asdf'
>>> bytes(nativebytes(b'asdf'))
b'asdf'
>>> nativebytes(bytes(b'asdf'))
"b'asdf'" # whoops!

From future/types/newbytes.py at 120:

def __str__(self):
    return 'b' + "'{0}'".format(super(newbytes, self).__str__())

This behavior mirrors Python 3, so it's correct. However, the native bytes constructor doesn't know how to deal with a newbytes argument, so it calls the argument's __str__() method to figure out how to populate itself. What's really happening is something like this:

<newbytes> = <newstr>.encode(u'utf_8') # returns instance of <class 'future.types.newbytes.newbytes'>
value = super(<newbytes>, cls).__new__(cls, <newbytes>.__str__())

I'm not quite sure what the correct fix is if one wants to safely allow for the ability to derive subclasses from both newbytes and newstr.
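
The mechanism described above can be modelled without Python 2 or python-future. The sketch below runs on Python 3 and uses a str subclass as a stand-in for newbytes; ReprLikeBytes, naive_construct, and unwrapping_construct are hypothetical names. The "unwrapping" variant is one possible direction for a fix, offered as an assumption and not as the project's actual patch.

```python
class ReprLikeBytes(str):
    """Stand-in for newbytes: subclasses a native type and overrides __str__."""

    def __str__(self):
        # Same shape as the newbytes.__str__ quoted above.
        return "b" + "'{0}'".format(super().__str__())


def naive_construct(value):
    # Analogue of super(newbytes, cls).__new__(cls, value): the native
    # constructor converts a non-native argument by calling its __str__().
    return str.__new__(ReprLikeBytes, value)


def unwrapping_construct(value):
    # Possible fix (an assumption): reduce the argument to its plain
    # native content before delegating to the native constructor.
    if isinstance(value, ReprLikeBytes):
        value = str.__str__(value)
    return str.__new__(ReprLikeBytes, value)


inner = ReprLikeBytes("asdf")
broken = naive_construct(inner)      # content picks up the b'...' wrapper
fixed = unwrapping_construct(inner)  # content stays as the raw characters
```

The point of the model is that the wrapper text ends up stored as the new object's content only when the overridden __str__() is consulted during construction.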

edschofield commented 8 years ago

Matt, thanks for filing this issue and your pull request!

The tests that this issue mentions actually seem to be passing on the v0.15.x branch for me. Could you please confirm whether this is true in your testing too? I'm wondering whether I still need to merge in your PR #173.

posita commented 8 years ago

@edschofield, my apologies for the delay. After some investigation, it looks like #193 is a duplicate of this issue. In response to your question, I no longer see the behavior addressed by #173 after abf19bbe002cdf24e42a6c9a2aab0e64fee9fd22. 👍

I'll close #173, but please bear in mind that some of the errant (or at least confusing) behavior still exists. From https://github.com/PythonCharmers/python-future/pull/173#issue-109235947:

[W]ithout monkey-patching Python 2's native str's constructor, I do not know how to handle this case:

Python 2.7.10 (default, Sep 24 2015, 10:13:45)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> nativebytes = bytes ; nativestr = str ; from builtins import *
>>> nativebytes(bytes(b'asdf'))
"b'asdf'" # Whoops!
>>> # This means you can't pass newbytes in many contexts, such as:
>>> from urllib import urlencode
>>> urlencode({ bytes(b'a'): 1, bytes(b'b'): 2 })
'b%27a%27=1&b%27b%27=2'
>>> # :o(

That behavior remains and is not addressed by either #173 or abf19bbe002cdf24e42a6c9a2aab0e64fee9fd22. I'll leave it to you as to whether you want to close this issue and track the above via #193.
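
The urlencode output above is consistent with each key being converted with str() before it is percent-quoted. The sketch below reproduces the same text on Python 3, where native bytes has the str() behavior that newbytes mirrors. It assumes, from my reading of Python 2's urllib, that urlencode applies str() to keys; it is an illustration, not a trace of the actual call.

```python
from urllib.parse import quote_plus  # Python 3 location of quote_plus

key = b"a"
as_text = str(key)            # bytes render in repr form, wrapper included
quoted = quote_plus(as_text)  # the apostrophes are percent-encoded as %27
```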

depau commented 7 years ago

This is still an issue.

Running str(bytes(b"hello")) results in "b'hello'".

posita commented 7 years ago

Hmmm. I'm not sure this is broken. Or at least if it is, it might be broken semi-consistently with Python 3.x:

$ python -c 'import sys ; print(sys.version) ; c = "type(bytes)" ; print("{}: {}".format(c, eval(c))) ; c = "type(str)" ; print("{}: {}".format(c, eval(c))) ; c = "str(b\"asdf\")" ; print("{}: {}".format(c, eval(c))) ; nativestr = str ; nativebytes = bytes ; from builtins import * ; c = "type(bytes)" ; print("{}: {}".format(c, eval(c))) ; c = "type(str)" ; print("{}: {}".format(c, eval(c))) ; c = "str(b\"asdf\")" ; print("{}: {}".format(c, eval(c))) ; c = "str(nativebytes(b\"asdf\"))" ; print("{}: {}".format(c, eval(c))) ; c = "nativestr(bytes(b\"asdf\"))" ; print("{}: {}".format(c, eval(c))) ; c = "str(bytes(b\"asdf\"))" ; print("{}: {}".format(c, eval(c)))'
2.7.13 (default, Dec 18 2016, 17:56:59)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]
type(bytes): <type 'type'>
type(str): <type 'type'>
str(b"asdf"): asdf
type(bytes): <class 'future.types.newbytes.BaseNewBytes'>
type(str): <class 'future.types.newstr.BaseNewStr'>
str(b"asdf"): asdf
str(nativebytes(b"asdf")): asdf
nativestr(bytes(b"asdf")): b'asdf'
str(bytes(b"asdf")): b'asdf'
$ python3.5 -c 'import sys ; print(sys.version) ; c = "type(bytes)" ; print("{}: {}".format(c, eval(c))) ; c = "type(str)" ; print("{}: {}".format(c, eval(c))) ; c = "str(b\"asdf\")" ; print("{}: {}".format(c, eval(c))) ; c = "str(bytes(b\"asdf\"))" ; print("{}: {}".format(c, eval(c)))'
3.5.3 (default, Feb  1 2017, 17:52:10)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]
type(bytes): <class 'type'>
type(str): <class 'type'>
str(b"asdf"): b'asdf'
str(bytes(b"asdf")): b'asdf'

So newstr(newbytes(b'asdf')) mirrors the Python 3 behavior, as does nativestr(newbytes(b'asdf')). newstr(nativebytes(b'asdf')) does not, however. EDIT: In fairness, I don't think it should. In Python 2, newstr(nativebytes(…)) is equivalent to newstr(nativestr(…)), which is probably not something that should end up as b'…'.

@Depaulicious, also note that the original issue was about passing a Unicode value to newbytes vs passing it to Python 3's native bytes. Yours is an inverted case.

rectalogic commented 5 years ago

I'm hitting this in Python 2 (passing an encoded string into a library that uses native Python 2 urllib):

import urllib

from future import standard_library
standard_library.install_aliases()
from builtins import *

d = {"k": str('a@b').encode("utf-8")}
urllib.urlencode(d, doseq=True)

Traceback (most recent call last):
  File "/tmp/f.py", line 10, in <module>
    urllib.urlencode(d, doseq=True)
  File "/usr/lib/python2.7/urllib.py", line 1348, in urlencode
    v = quote_plus(v)
  File "/usr/lib/python2.7/urllib.py", line 1305, in quote_plus
    return quote(s, safe)
  File "/usr/lib/python2.7/urllib.py", line 1298, in quote
    return ''.join(map(quoter, s))
KeyError: 97

Is there a workaround for this?

rectalogic commented 5 years ago

So urllib.quote maps over the string (https://github.com/python/cpython/blob/2.7/Lib/urllib.py#L1298), but iterating over a newbytes yields integers rather than one-character strings as a Python 2 str does. urllib.quote looks each item up in a map keyed by characters, and none of the integer ordinals are present, so it raises KeyError.

>>> from future import standard_library
>>> standard_library.install_aliases()
>>> from builtins import *
>>> [c for c in 'a@b']
['a', '@', 'b']
>>> [c for c in str('a@b').encode("utf-8")]
[97, 64, 98]
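
The KeyError can be modelled with a character-keyed lookup table. This sketch runs on Python 3, where native bytes iterate as integers in the same way; safe_map and quote_like are hypothetical stand-ins loosely modelled on what urllib.quote does, not its real table or code.

```python
# Lookup table keyed by one-character strings (the real table is larger).
safe_map = {ch: ch for ch in "a@b"}


def quote_like(chars):
    # Look up every item produced by iterating over the argument.
    return "".join(map(safe_map.__getitem__, chars))


text_result = quote_like("a@b")  # items are characters, so lookups succeed

missing_key = None
try:
    quote_like(b"a@b")           # items are integers, so the first lookup fails
except KeyError as exc:
    missing_key = exc.args[0]
```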

So if I encode a future str, is the resulting newbytes unusable with third-party Python libraries that may use the unpatched standard library?
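
One possible workaround is to convert the value to a native type just before handing it to code that uses the unpatched standard library. The sketch below uses future.utils.native, which on Python 2 returns the native counterpart of a backported object and on Python 3 returns its argument unchanged. The fallback definition and the two-way urlencode import are assumptions added so the snippet also runs where python-future is not installed.

```python
try:
    from future.utils import native  # backported type -> native type on Python 2
except ImportError:
    def native(obj):
        # Stand-in used only when python-future is not installed.
        return obj

try:
    from urllib import urlencode        # Python 2
except ImportError:
    from urllib.parse import urlencode  # Python 3

# With `from builtins import *` on Python 2 this value would be a newbytes.
value = u"a@b".encode("utf-8")
query = urlencode([("k", native(value))], doseq=True)
```

Converting at the boundary keeps newbytes semantics inside your own code while giving the library the type it was written for.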