Decoder returns python str (not unicode) for JSON string (new in 2.0.7)

GoogleCodeExporter commented 9 years ago

Expected (and found in 2.0.6):
>>> type(simplejson.loads('"foo"'))
<type 'unicode'>
>>> type(simplejson.loads(u'"foo"'))
<type 'unicode'>
>>> type(simplejson.loads(simplejson.dumps(u'foo')))
<type 'unicode'>
>>> type(simplejson.loads(simplejson.dumps(u'\xfffoo')))
<type 'unicode'>

Actual (2.0.7):
>>> type(simplejson.loads('"foo"'))
<type 'str'>
>>> type(simplejson.loads(u'"foo"'))
<type 'unicode'>
>>> type(simplejson.loads(simplejson.dumps(u'foo')))
<type 'str'>
>>> type(simplejson.loads(simplejson.dumps(u'\xfffoo')))
<type 'unicode'>

Since JSON output is encoded Unicode, the parsed result should be a
unicode object.

Original issue reported on code.google.com by Stelmina...@gmail.com on 12 Feb 2009 at 5:08

GoogleCodeExporter commented 9 years ago

This is an optimization. If given a str object as input, the decoder returns
str strings if and only if the string is ASCII-only. ASCII-only str strings are
interchangeable with unicode. If you give it unicode input, you'll get unicode
output strings regardless. This optimization is not new in 2.0.7.

>>> simplejson.loads('"foo"')
'foo'
>>> simplejson.loads(u'"foo"')
u'foo'

dumps always returns an ASCII-only string by default, so that's why
loads(dumps(unistr)) can give you ASCII strings. You'd want to do
loads(unicode(dumps(unistr))) if you want to get unicode strings back out.
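
A minimal sketch of that round trip, assuming only the standard json module (simplejson exposes the same dumps/loads names). The helper name roundtrip_as_text is hypothetical. It decodes the ASCII output of dumps to text before parsing, so the decoder receives unicode input on Python 2; on Python 3, dumps already returns text and the check does nothing.

```python
import json


def roundtrip_as_text(obj):
    """Serialize obj, then parse it back from text rather than from bytes."""
    encoded = json.dumps(obj)  # ASCII-only by default (ensure_ascii=True)
    # On Python 2, str is a byte string (bytes is an alias for it),
    # so decode it to unicode before handing it to the decoder.
    if isinstance(encoded, bytes):
        encoded = encoded.decode("ascii")
    return json.loads(encoded)
```

Because the input to loads is always text here, the result should not depend on whether a given string happens to be ASCII-only.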

Original comment by bob.ippo...@gmail.com on 12 Feb 2009 at 5:19

Changed state: WontFix

GoogleCodeExporter commented 9 years ago

Bob, I know you've refused to fix this in several situations now (such as:
http://www.nabble.com/simplejson-2.0.0-released,-much-faster.-td19705153.html),
and I can actually name a place where I think it causes issues.

In Sqlalchemy, the "Unicode" type
(http://www.sqlalchemy.org/docs/05/reference/sqlalchemy/types.html#sqlalchemy.types.Unicode)
warns when you insert str() objects.

My workflow: create some complicated thing and serialize it to JSON, which gets
used by many other workflow processes. When I read it back in, I'd really like
every string in the thing to come back as unicode type, if possible.
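
One way to approximate that on the reading side, sketched under assumptions: post-process the parsed structure and decode any byte strings it contains. The helper name to_text is hypothetical, and the default "ascii" encoding assumes the str values come from the ASCII-only optimization described earlier in this thread.

```python
def to_text(value, encoding="ascii"):
    """Recursively decode byte strings inside a parsed JSON structure."""
    if isinstance(value, bytes):  # Python 2 str, or Python 3 bytes
        return value.decode(encoding)
    if isinstance(value, dict):
        return dict(
            (to_text(key, encoding), to_text(item, encoding))
            for key, item in value.items()
        )
    if isinstance(value, list):
        return [to_text(item, encoding) for item in value]
    return value  # numbers, booleans, None, and text pass through unchanged
```

This costs an extra pass over the data, so decoding the input document to unicode before calling loads is likely the cheaper option when you control the call site.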

Thanks!

Original comment by gregg.l...@gmail.com on 19 May 2009 at 4:19

GoogleCodeExporter commented 9 years ago

Oh, I see that in issue 28, someone mentioned this exact issue, and you bdfl'd
it there too!  I guess I'll deal with it on my own then!

Original comment by gregg.l...@gmail.com on 19 May 2009 at 4:21

GoogleCodeExporter commented 9 years ago

If you want unicode strings, use a unicode input document.

Original comment by bob.ippo...@gmail.com on 19 May 2009 at 4:46

GoogleCodeExporter commented 9 years ago

I have personally wasted hours on this.  I can't afford to track down subtle
bugs that depend on what version of simplejson someone has installed and
whether the speedups are present, so nowadays I only use it through the
following wrapper module.

Eliminating the need for this wrapper is one of the benefits I have hoped to
reap by dropping support for Python 2.5 someday.  I just hope the issue doesn't
recur in Python 2.x's built-in json module.

try:
    import json                 # Python 2.6
except ImportError:
    import simplejson as json   # Python 2.5

dumps = json.dumps

def loads(s, *args, **kwargs):
    # When its argument is of type str, loads() decodes strings as
    # either str or unicode depending on whether simplejson's speedups
    # are installed (at least this is true in simplejson 2.0.7).  It
    # always decodes strings as unicode when the argument to loads()
    # is of type unicode.
    return json.loads(unicode(s), *args, **kwargs)
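
A hedged variant of the wrapper above, as a sketch rather than a drop-in replacement. With the default ASCII codec, unicode(s) raises UnicodeDecodeError for a byte string that contains non-ASCII UTF-8, which can happen when the document was not produced by dumps with its defaults. The name loads_text and its encoding parameter are assumptions, not part of simplejson or json, and the sketch assumes Python 2.6 or later, where bytes is defined.

```python
import json


def loads_text(s, encoding="utf-8", **kwargs):
    """Parse JSON after decoding byte input, so the decoder always sees text."""
    # On Python 2.6+, bytes is an alias for str, so this branch handles
    # ordinary byte-string documents there; text input is passed through.
    if isinstance(s, bytes):
        s = s.decode(encoding)
    return json.loads(s, **kwargs)
```

Decoding explicitly keeps the choice of encoding visible at the call site instead of relying on the interpreter's default.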

Original comment by ken.ri...@gmail.com on 19 May 2009 at 8:01

GoogleCodeExporter commented 9 years ago

It is the same in Python 2.7 trunk. If you want unicode even for ASCII strings, 
use unicode input.

Original comment by bob.ippo...@gmail.com on 19 May 2009 at 8:07

GoogleCodeExporter commented 9 years ago

This cost me several hours as well.  Decoding external input into unicode seems
like something that should happen at a program's data boundaries, which is
where I suspect the simplejson/json module is frequently used.  As such, the
principle of least astonishment suggests to me that I should be getting unicode
back.  I don't know about other users, but the speed optimization isn't that
valuable to me at the moment. Maybe some kind of 'output_ascii' keyword for
loads, for people who need the speed enhancement, would be a better solution?

Original comment by markhuet...@gmail.com on 24 May 2009 at 2:04

GoogleCodeExporter commented 9 years ago

Believe it or not, some applications still require ascii and don't play well 
with unicode. For an application I have to work with every day, this is a 
feature, not a bug. I'm voting in order to be notified if this ever gets 
"fixed"...

Original comment by bradalle...@gmail.com on 28 Mar 2012 at 8:34

GoogleCodeExporter commented 9 years ago

The issue tracker for simplejson is here: 
https://github.com/simplejson/simplejson/issues

Original comment by b...@launchcommander.com on 28 Mar 2012 at 9:23

GoogleCodeExporter commented 9 years ago

This is crazy - a full day of 2 developers down the drain!

>>> import simplejson as json
>>> dump = json.dumps((u"$123", u"₪123"))
>>> [type(object) for object in json.loads(dump)] 
[<type 'str'>, <type 'unicode'>] # This is bad!

vs.

>>> import json
>>> dump = json.dumps((u"$123", u"₪123"))
>>> [type(object) for object in json.loads(dump)] 
[<type 'unicode'>, <type 'unicode'>] # This is good!

Original comment by major....@gmail.com on 21 Apr 2013 at 10:48

GoogleCodeExporter commented 9 years ago

The pure Python version of simplejson gives a different type than the C
speedups version. I ran into this when installing in a virtualenv without
python-dev.  You can demonstrate the problem on an installation with speedups
by using _toggle_speedups to switch back to the pure Python version.

>>> import simplejson as json
>>> json.loads('"foo"')
'foo'
>>> json._toggle_speedups(False)
>>> json.loads('"foo"')
u'foo'

This needs to be fixed one way or the other.

Original comment by tom2...@gmail.com on 18 Oct 2013 at 1:36

GoogleCodeExporter commented 9 years ago

Hm, for me, both libraries do it 'wrong'-ish:
json returns <type 'unicode'> even for "$123", withOUT the 'u' that renders it
unicode.
simplejson returns <type 'str'> when the input is u"$123"? What's the reason
for this inconsistency?

Original comment by kmichael...@gmail.com on 8 Sep 2014 at 5:38

Decoder returns python str (not unicode) for JSON string (new in 2.0.7) #40