Open posita opened 9 years ago
Thanks, Matt! I'll look into this ASAP.
The good news is that there's an easy workaround (modified from above):
from builtins import *
s = str(...) # make a newstr on Python 2
# Instead of bytes(s, u'utf_8'), do:
s.encode(u'utf_8') # will return newbytes object on Python 2
So this issue is really about interface compatibility rather than about available functionality. My guess is that most people use the str(...).encode(<encoding>) method as opposed to the bytes(..., <encoding>) method, which may be why this hasn't been discovered yet?
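For reference, on Python 3 the two spellings are interchangeable, and that is the interface `from builtins import *` is expected to reproduce on Python 2. A minimal sketch using only Python 3 built-ins (the sample string is arbitrary, not from the report):

```python
# Python 3 only: both spellings produce the same bytes object.
s = "na\u00efve"  # arbitrary sample text containing one non-ASCII character

via_constructor = bytes(s, "utf_8")  # the spelling this issue is about
via_method = s.encode("utf_8")       # the workaround spelling

print(via_constructor == via_method)  # True
print(type(via_method).__name__)      # bytes
```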
I think I see what is happening here. (The following code snippets are all from Python 2, in case that wasn't obvious.) Consider:
>>> from future.types.newstr import newstr
>>> isinstance(newstr(u'asdf'), unicode)
True
From future/types/newbytes.py at 93:
class newbytes(with_metaclass(BaseNewBytes, _builtin_bytes)):
    ...
    def __new__(cls, *args, **kwargs):
        ...  # gets `encoding` from *args, **kwargs
        elif isinstance(args[0], unicode):
            ...
            newargs = [encoding]
            ...
            value = args[0].encode(*newargs)
        ...
        return super(newbytes, cls).__new__(cls, value)
Which, in the case of newbytes(newstr(u'asdf'), u'utf_8'), is basically:
value = <newstr>.encode(u'utf_8') # returns instance of <class 'future.types.newbytes.newbytes'>
value = super(<newbytes>, cls).__new__(cls, value) # but see below
So newbytes's parent constructor (i.e., from builtin bytes) is being called with a newbytes instance as its argument. We'll see why this is a problem below, but first consider:
>>> nativebytes = bytes ; nativestr = str ; from builtins import *
>>> bytes(bytes(b'asdf'))
b'asdf'
>>> bytes(nativebytes(b'asdf'))
b'asdf'
>>> nativebytes(bytes(b'asdf'))
"b'asdf'" # whoops!
From future/types/newbytes.py at 120:
def __str__(self):
    return 'b' + "'{0}'".format(super(newbytes, self).__str__())
This behavior mirrors Python 3, so it's correct. However, the native bytes constructor doesn't know how to deal with a newbytes argument, so it calls the argument's __str__() method to figure out how to populate itself. What's really happening is something like this:
<newbytes> = <newstr>.encode(u'utf_8') # returns instance of <class 'future.types.newbytes.newbytes'>
value = super(<newbytes>, cls).__new__(cls, <newbytes>.__str__())
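The same mechanism can be reproduced without python-future. The sketch below runs on Python 3 and uses a hypothetical `TaggedText` class (a `str` subclass, not anything from the library) as a stand-in for newbytes: the parent constructor consults the overridden `__str__()`, so the repr-style prefix leaks into the data.

```python
class TaggedText(str):
    """Hypothetical stand-in for newbytes: __str__ returns a repr-like form."""

    def __str__(self):
        # Same shape as newbytes.__str__ in the excerpt above.
        return "b" + "'{0}'".format(super().__str__())


value = TaggedText("asdf")

# The parent constructor asks the argument for its __str__(),
# so the "b'...'" wrapper ends up inside the new object.
leaked = str(value)

# Calling the parent's __str__ directly bypasses the override
# and recovers the raw payload.
raw = str.__str__(value)

print(repr(leaked))  # "b'asdf'"
print(repr(raw))     # 'asdf'
```

This only illustrates the dispatch; it says nothing about how the library itself should fix it.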
I'm not quite sure what the correct fix is if one wants to safely allow for the ability to derive subclasses from both newbytes and newstr.
Matt, thanks for filing this issue and your pull request!
The tests that this issue mentions actually seem to be passing on the v0.15.x branch for me. Could you please confirm whether this is true in your testing too? I'm wondering whether I still need to merge in your PR #173.
@edschofield, my apologies for the delay. After some investigation, it looks like #193 is a duplicate of this issue. In response to your question, I no longer see the behavior addressed by #173 after abf19bbe002cdf24e42a6c9a2aab0e64fee9fd22. 👍
I'll close #173, but please bear in mind that some of the errant (or at least confusing) behavior still exists. From https://github.com/PythonCharmers/python-future/pull/173#issue-109235947:
[W]ithout monkey-patching Python 2's native str's constructor, I do not know how to handle this case:
Python 2.7.10 (default, Sep 24 2015, 10:13:45)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> nativebytes = bytes ; nativestr = str ; from builtins import *
>>> nativebytes(bytes(b'asdf'))
"b'asdf'" # Whoops!
>>> # This means you can't pass newbytes in many contexts, such as:
>>> from urllib import urlencode
>>> urlencode({ bytes(b'a'): 1, bytes(b'b'): 2 })
'b%27a%27=1&b%27b%27=2'
>>> # :o(
That behavior remains and is not addressed by either #173 or abf19bbe002cdf24e42a6c9a2aab0e64fee9fd22. I'll leave it to you as to whether you want to close this issue and track the above via #193.
This is still an issue.
Running str(bytes(b"hello")) results in "b'hello'".
Hmmm. I'm not sure this is broken. Or at least if it is, it might be broken semi-consistently with Python 3.x:
$ python -c 'import sys ; print(sys.version) ; c = "type(bytes)" ; print("{}: {}".format(c, eval(c))) ; c = "type(str)" ; print("{}: {}".format(c, eval(c))) ; c = "str(b\"asdf\")" ; print("{}: {}".format(c, eval(c))) ; nativestr = str ; nativebytes = bytes ; from builtins import * ; c = "type(bytes)" ; print("{}: {}".format(c, eval(c))) ; c = "type(str)" ; print("{}: {}".format(c, eval(c))) ; c = "str(b\"asdf\")" ; print("{}: {}".format(c, eval(c))) ; c = "str(nativebytes(b\"asdf\"))" ; print("{}: {}".format(c, eval(c))) ; c = "nativestr(bytes(b\"asdf\"))" ; print("{}: {}".format(c, eval(c))) ; c = "str(bytes(b\"asdf\"))" ; print("{}: {}".format(c, eval(c)))'
2.7.13 (default, Dec 18 2016, 17:56:59)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]
type(bytes): <type 'type'>
type(str): <type 'type'>
str(b"asdf"): asdf
type(bytes): <class 'future.types.newbytes.BaseNewBytes'>
type(str): <class 'future.types.newstr.BaseNewStr'>
str(b"asdf"): asdf
str(nativebytes(b"asdf")): asdf
nativestr(bytes(b"asdf")): b'asdf'
str(bytes(b"asdf")): b'asdf'
$ python3.5 -c 'import sys ; print(sys.version) ; c = "type(bytes)" ; print("{}: {}".format(c, eval(c))) ; c = "type(str)" ; print("{}: {}".format(c, eval(c))) ; c = "str(b\"asdf\")" ; print("{}: {}".format(c, eval(c))) ; c = "str(bytes(b\"asdf\"))" ; print("{}: {}".format(c, eval(c)))'
3.5.3 (default, Feb 1 2017, 17:52:10)
[GCC 4.2.1 Compatible Apple LLVM 6.0 (clang-600.0.57)]
type(bytes): <class 'type'>
type(str): <class 'type'>
str(b"asdf"): b'asdf'
str(bytes(b"asdf")): b'asdf'
So newstr(newbytes(b'asdf')) mirrors the Python 3 behavior, as does nativestr(newbytes(b'asdf')). newstr(nativebytes(b'asdf')) does not, however. EDIT: In fairness, I don't think it should. In Python 2, newstr(nativebytes(…)) is equivalent to newstr(nativestr(…)), which is probably not something that should end up as b'…'.
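For anyone who wants the text rather than the repr-style form, Python 3 itself requires an explicit decode. A minimal Python 3 sketch (the ASCII encoding is an assumption that fits this sample data):

```python
# Python 3 only.
data = b"asdf"

as_repr = str(data)               # no encoding given: repr-style text
as_text = str(data, "ascii")      # encoding given: the bytes are decoded
also_text = data.decode("ascii")  # equivalent, and usually clearer

print(as_repr)    # b'asdf'
print(as_text)    # asdf
print(also_text)  # asdf
```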
@Depaulicious, also note that the original issue was about passing a Unicode value to newbytes vs passing it to Python 3's native bytes. Yours is an inverted case.
I'm hitting this in Python 2 (passing an encoded string into a library that uses native Python 2 urllib):
import urllib
from future import standard_library
standard_library.install_aliases()
from builtins import *
d = {"k": str('a@b').encode("utf-8")}
urllib.urlencode(d, doseq=True)
Traceback (most recent call last):
  File "/tmp/f.py", line 10, in <module>
    urllib.urlencode(d, doseq=True)
  File "/usr/lib/python2.7/urllib.py", line 1348, in urlencode
    v = quote_plus(v)
  File "/usr/lib/python2.7/urllib.py", line 1305, in quote_plus
    return quote(s, safe)
  File "/usr/lib/python2.7/urllib.py", line 1298, in quote
    return ''.join(map(quoter, s))
KeyError: 97
Is there a workaround for this?
So urllib.quote wants to map over the string (https://github.com/python/cpython/blob/2.7/Lib/urllib.py#L1298), but iterating over a newbytes yields integers instead of one-character strings the way a Python 2 str does. urllib.quote looks each element up in a map keyed by characters, and none of the integer ordinals are keys in it, so it raises KeyError.
>>> from future import standard_library
>>> standard_library.install_aliases()
>>> from builtins import *
>>> [c for c in 'a@b']
['a', '@', 'b']
>>> [c for c in str('a@b').encode("utf-8")]
[97, 64, 98]
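The failure can be reproduced on Python 3 without urllib, since Python 3 bytes iterate the same way newbytes does. In the sketch below, `char_map` and `quote_like` are hypothetical, simplified stand-ins for the lookup table and the join in Python 2's urllib.quote, not the real implementation:

```python
# Simplified stand-in for the lookup table used by Python 2's urllib.quote():
# the keys are one-character strings, not integers.
char_map = {chr(i): chr(i) for i in range(128)}


def quote_like(s):
    """Map each element of `s` through the table and join the results."""
    return "".join(map(char_map.__getitem__, s))


print(quote_like("a@b"))  # a@b

try:
    quote_like(b"a@b")  # iterating bytes yields 97, 64, 98
except KeyError as exc:
    missing_key = exc.args[0]
    print(missing_key)  # 97
```

The first element looked up is the integer 97, which matches the `KeyError: 97` in the traceback above.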
So if I encode a future str, the resulting newbytes is not usable with third-party Python libraries that may use the unpatched standard library?
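One possible workaround, sketched under assumptions: convert values back to native types just before handing them to code that uses the unpatched standard library. python-future provides `future.utils.native` for this. The `to_native` helper below is a hypothetical, dependency-free stand-in that assumes backported objects expose a `__native__()` method, and it is a no-op for objects that don't. As written it runs on Python 3, where it only shows the shape of the call; on Python 2 the value would be a newbytes.

```python
try:
    from urllib.parse import urlencode  # Python 3
except ImportError:
    from urllib import urlencode        # Python 2


def to_native(value):
    """Return the native equivalent of `value` if it advertises one.

    Hypothetical helper: assumes backported types expose __native__().
    """
    hook = getattr(value, "__native__", None)
    return hook() if hook is not None else value


params = {"k": "a@b".encode("utf-8")}

# Convert keys and values at the boundary, right before the library call.
native_params = {to_native(k): to_native(v) for k, v in params.items()}

query = urlencode(native_params, doseq=True)
print(query)  # k=a%40b
```

Converting at the boundary keeps the rest of the code on the Python 3 style types; whether that is acceptable depends on how many such call sites there are.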
On Python 3.4:
On Python 2.7: