Averroes / simplejson

MIT License

simplejson spits "Invalid control character" for vertical tab character \x0b #89

Closed GoogleCodeExporter closed 9 years ago

GoogleCodeExporter commented 9 years ago
Hi, I think this is a bug in simplejson, since u'\x0b' (u'\u000b') is a valid 
Unicode character.

In [15]: import simplejson

In [16]: simplejson.__version__
Out[16]: '2.1.3'

In [17]: simplejson.loads(u'''"\u003Cp\u003EPeopleBrowsr is a data mining, 
analytics and brand engagement service provider for enterprise brand managers, 
social media strategists, hedge fund managers, advertising agencies and IT 
developers.\n\nFounded in 2006 by..."''')
---------------------------------------------------------------------------
JSONDecodeError                           Traceback (most recent call last)

/home/yang/<ipython console> in <module>()

/home/yang/work/pod/env/lib/python2.6/site-packages/simplejson/__init__.pyc in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, use_decimal, **kw)
    383             parse_constant is None and object_pairs_hook is None
    384             and not use_decimal and not kw):
--> 385         return _default_decoder.decode(s)
    386     if cls is None:
    387         cls = JSONDecoder

/home/yang/work/pod/env/lib/python2.6/site-packages/simplejson/decoder.pyc in decode(self, s, _w)
    400 
    401         """
--> 402         obj, end = self.raw_decode(s, idx=_w(s, 0).end())
    403         end = _w(s, end).end()
    404         if end != len(s):

/home/yang/work/pod/env/lib/python2.6/site-packages/simplejson/decoder.pyc in raw_decode(self, s, idx)
    416         """
    417         try:
--> 418             obj, end = self.scan_once(s, idx)
    419         except StopIteration:
    420             raise JSONDecodeError("No JSON object could be decoded", s, idx)

JSONDecodeError: Invalid control character at: line 1 column 210 (char 210)

Original issue reported on code.google.com by yanghate...@gmail.com on 24 Feb 2011 at 11:43

GoogleCodeExporter commented 9 years ago
Not a bug: the string is not valid JSON if it contains this character unescaped. JSON requires control characters (U+0000 through U+001F) inside a string to be escaped.

>>> simplejson.loads(u'"\\u000b"')
u'\x0b'
>>> simplejson.dumps(u'\x0b')
'"\\u000b"'

Original comment by bob.ippo...@gmail.com on 24 Feb 2011 at 11:52
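
Note for later readers: the distinction above can be reproduced with a short script. This is a minimal sketch, not part of the original comment. It assumes that falling back to the standard-library json module is acceptable when simplejson is not installed (both reject raw control characters by default), and the names raw, escaped, and try_load are illustrative only.

```python
try:
    import simplejson as json  # the library discussed in this thread
except ImportError:
    import json  # standard-library module with the same loads/dumps interface

# JSON text containing a literal (unescaped) vertical tab inside the string.
raw = '"a\x0bb"'
# The same JSON string with the vertical tab written as a \u escape.
escaped = '"a\\u000bb"'


def try_load(text):
    """Return the decoded value, or None if the parser rejects the text."""
    try:
        return json.loads(text)
    except ValueError:  # JSONDecodeError subclasses ValueError in both libraries
        return None


raw_result = try_load(raw)          # rejected: raw control character
escaped_result = try_load(escaped)  # accepted: escaped control character
round_trip = json.dumps('a\x0bb')   # dumps escapes the character on output
```

The decoder only objects to the raw byte in the JSON text. The decoded Python string is free to contain the vertical tab.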

GoogleCodeExporter commented 9 years ago
OK. Do you have a workaround for this? I just ran into another instance of this 
(with \xe2). What are all the characters that don't work and what should I 
replace them with?

Original comment by yanghate...@gmail.com on 25 Feb 2011 at 12:50

GoogleCodeExporter commented 9 years ago
I think you are confused about how JSON and/or Unicode works. I'm not sure 
which, and I don't know exactly how to help you.

>>> simplejson.loads(u'"\\u00e2"')
u'\xe2'
>>> simplejson.dumps(u'\xe2')
'"\\u00e2"'

Original comment by bob.ippo...@gmail.com on 25 Feb 2011 at 1:03

GoogleCodeExporter commented 9 years ago
Bob, you're right in that I'm confused, and I think it's about how JSON works.

First, I think something went wrong when I tried pasting the original string in 
my first post, since it's not even showing the \x0b. That should have been:

In [7]: open('aoeu').read()
Out[7]: '"\\u003Cp\\u003EPeopleBrowsr is a data mining, analytics and brand 
engagement service provider for enterprise brand managers, social media 
strategists, hedge fund managers, advertising agencies and IT 
developers.\\n\x0b\\nFounded in 2006 by..."\n'

In [8]: Out[7].decode('utf8')
Out[8]: u'"\\u003Cp\\u003EPeopleBrowsr is a data mining, analytics and brand 
engagement service provider for enterprise brand managers, social media 
strategists, hedge fund managers, advertising agencies and IT 
developers.\\n\x0b\\nFounded in 2006 by..."\n'

In [9]: simplejson.loads(Out[7])
[...error...]

I'm dealing with a data source that is giving me strings like this one, whether 
I like it or not. So I'm really just asking how I should munge that string into 
a form that simplejson won't choke on. I thought it might be helpful to ask 
here in case others who come by here have the same question.

(Also, please disregard my comment about \xe2 - that was actually something 
else.)

Original comment by yanghate...@gmail.com on 25 Feb 2011 at 1:22
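
Note for later readers: one possible way to munge such input by hand is to rewrite each raw control character as a \u escape before parsing. This is a rough sketch under stated assumptions, not something proposed in the thread. It uses the standard-library json and re modules, the helper name escape_control_chars is hypothetical, and it deliberately leaves tab, newline, and carriage return alone because those are legal whitespace between JSON tokens. It therefore does not repair those three characters when they appear raw inside a string.

```python
import json
import re

# Control characters other than tab (\x09), newline (\x0a) and carriage
# return (\x0d), which may legitimately appear as whitespace between tokens.
_CONTROL = re.compile(r'[\x00-\x08\x0b\x0c\x0e-\x1f]')


def escape_control_chars(text):
    """Replace each matched control character with its \\uXXXX escape."""
    return _CONTROL.sub(lambda m: '\\u%04x' % ord(m.group()), text)


# Shortened version of the input from the report: a raw vertical tab sits
# between two escaped newlines, and a newline follows the closing quote.
s = '"developers.\\n\x0b\\nFounded in 2006 by..."\n'

cleaned = escape_control_chars(s)
value = json.loads(cleaned)  # parses under the default strict mode
```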

GoogleCodeExporter commented 9 years ago
Okay, so the JSON you have is actually not valid JSON. You can parse it with 
strict=False.

>>> import simplejson
>>> s = ('"\\u003Cp\\u003EPeopleBrowsr is a data mining, analytics and brand '
...      'engagement service provider for enterprise brand managers, social media '
...      'strategists, hedge fund managers, advertising agencies and IT '
...      'developers.\\n\x0b\\nFounded in 2006 by..."\n')
>>> simplejson.loads(s, strict=False)
'<p>PeopleBrowsr is a data mining, analytics and brand engagement service 
provider for enterprise brand managers, social media strategists, hedge fund 
managers, advertising agencies and IT developers.\n\x0b\nFounded in 2006 by...'

Original comment by bob.ippo...@gmail.com on 25 Feb 2011 at 1:35
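
Note for later readers: the sketch below contrasts the default behaviour with strict=False. It is not part of the original comment. It assumes the standard-library json module is an acceptable stand-in when simplejson is unavailable (its decoder accepts the same strict keyword), and loads_or_none is an illustrative helper name.

```python
try:
    import simplejson as json  # the library discussed in this thread
except ImportError:
    import json  # the stdlib decoder accepts the same strict keyword


def loads_or_none(text, **kwargs):
    """Return the decoded value, or None if the parser rejects the text."""
    try:
        return json.loads(text, **kwargs)
    except ValueError:  # JSONDecodeError subclasses ValueError in both libraries
        return None


with_vtab = '"line one\x0bline two"'  # raw vertical tab inside a JSON string
single_quoted = "{'a': 1}"            # invalid for an unrelated reason

strict_result = loads_or_none(with_vtab)                    # default: rejected
lenient_result = loads_or_none(with_vtab, strict=False)     # accepted
still_invalid = loads_or_none(single_quoted, strict=False)  # still rejected
```

strict=False only relaxes the rule about control characters inside strings. Other kinds of malformed input, such as the single-quoted keys above, are still rejected.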

GoogleCodeExporter commented 9 years ago
Thank you. I wasn't aware of that flag, and it made all my error-avoidance code 
go away.

Original comment by yanghate...@gmail.com on 25 Feb 2011 at 1:53

GoogleCodeExporter commented 9 years ago
Thanks, strict=False saved my day ^^

Original comment by adesanto...@gmail.com on 13 Dec 2012 at 10:29