Nykakin / chompjs

Parsing JavaScript objects into Python data structures
MIT License

Investigate the possibility of parsing json5 format #38

Closed Nykakin closed 1 year ago

Nykakin commented 2 years ago

JSON5 is an extension of the JSON format that allows things such as hexadecimal numbers and trailing commas.

This library could theoretically be used to parse this format as well. It currently fails on the sample provided on the linked page:

>>> data = """
... {
...   // comments
...   unquoted: 'and you can quote me on that',
...   singleQuotes: 'I can use "double quotes" here',
...   lineBreaks: "Look, Mom! \
... No \\n's!",
...   hexadecimal: 0xdecaf,
...   leadingDecimalPoint: .8675309, andTrailing: 8675309.,
...   positiveSign: +1,
...   trailingComma: 'in objects', andIn: ['arrays',],
...   "backwardsCompatible": "with JSON",
... }
... """
>>> import chompjs
>>> chompjs.parse_js_object(data)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/mariusz/Documents/Praca/venv/lib/python3.9/site-packages/chompjs-1.1.8-py3.9-linux-x86_64.egg/chompjs/chompjs.py", line 25, in parse_js_object
    return json.loads(parsed_data, **json_params)
  File "/usr/lib/python3.9/json/__init__.py", line 346, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python3.9/json/decoder.py", line 337, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python3.9/json/decoder.py", line 353, in raw_decode
    obj, end = self.scan_once(s, idx)
json.decoder.JSONDecodeError: Expecting ',' delimiter: line 1 column 206 (char 205)

It should be possible to extend this library to parse this input.

If that turns out to work, benchmarks could compare its speed against other Python JSON5 libraries such as https://github.com/dpranke/pyjson5.
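A rough sketch of what such a benchmark might look like, using only `timeit` from the standard library. The `benchmark` helper and the sample `document` are hypothetical names made up for illustration. The sketch also assumes that the dpranke/pyjson5 project is importable as `json5`; both third-party imports are guarded, so parsers that are not installed are simply skipped.

```python
import json
import timeit


def benchmark(parsers, document, repeat=5, number=100):
    """Return {name: best observed seconds per call} for each parser callable."""
    results = {}
    for name, parse in parsers.items():
        # timeit.repeat returns one total time per round; keep the fastest round.
        timings = timeit.repeat(lambda: parse(document), repeat=repeat, number=number)
        results[name] = min(timings) / number
    return results


# json.loads serves as a baseline that is always available.
parsers = {"json.loads": json.loads}

try:
    import chompjs
    parsers["chompjs.parse_js_object"] = chompjs.parse_js_object
except ImportError:
    pass  # chompjs not installed, skip it

try:
    import json5  # assumption: the package built from dpranke/pyjson5
    parsers["json5.loads"] = json5.loads
except ImportError:
    pass  # json5 not installed, skip it

# Plain JSON so that every parser, including the baseline, accepts it.
document = '{"backwardsCompatible": "with JSON", "andIn": ["arrays"], "n": 8675309}'

if __name__ == "__main__":
    for name, seconds in sorted(benchmark(parsers, document).items(), key=lambda kv: kv[1]):
        print(f"{name}: {seconds * 1e6:.1f} us per call")
```

A single small document says little about real workloads, so a meaningful comparison would need larger inputs that also use JSON5-only syntax (with the `json.loads` baseline dropped for those).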

Nykakin commented 1 year ago

With https://github.com/Nykakin/chompjs/pull/39 in place, it's possible to parse the example provided on the JSON5 page:

>>> data = """{
...   // comments
...   unquoted: 'and you can quote me on that',
...   singleQuotes: 'I can use "double quotes" here',
...   lineBreaks: "Look, Mom! \
... No \\n's!",
...   hexadecimal: 0xdecaf,
...   leadingDecimalPoint: .8675309, andTrailing: 8675309.,
...   positiveSign: +1,
...   trailingComma: 'in objects', andIn: ['arrays',],
...   "backwardsCompatible": "with JSON",
... }
... """
>>> import chompjs
>>> chompjs.parse_js_object(data)
{'unquoted': 'and you can quote me on that', 'singleQuotes': 'I can use "double quotes" here', 'lineBreaks': "Look, Mom! No \n's!", 'hexadecimal': '0xdecaf', 'leadingDecimalPoint': 0.8675309, 'andTrailing': 8675309.0, 'positiveSign': '+1', 'trailingComma': 'in objects', 'andIn': ['arrays'], 'backwardsCompatible': 'with JSON'}

The remaining problem is that hexadecimal constants are parsed as strings, not numbers.
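Until the parser handles this itself, one possible workaround is to post-process the parsed result. The sketch below is not part of chompjs; `coerce_hex_strings` is a hypothetical helper. It walks the result and converts strings that look like hexadecimal literals into integers. Note that it would also convert a genuine string value that happens to look like `0x...`, so it is only safe when such strings cannot occur in the data.

```python
def coerce_hex_strings(value):
    """Recursively convert strings such as '0xdecaf' into ints."""
    if isinstance(value, dict):
        return {key: coerce_hex_strings(item) for key, item in value.items()}
    if isinstance(value, list):
        return [coerce_hex_strings(item) for item in value]
    if isinstance(value, str) and value[:2].lower() == "0x":
        try:
            # int() with base 16 accepts an optional '0x' prefix.
            return int(value, 16)
        except ValueError:
            return value  # starts with '0x' but is not a valid hex number
    return value


# A fragment of the output shown above, used as sample input.
parsed = {
    "unquoted": "and you can quote me on that",
    "hexadecimal": "0xdecaf",
    "andIn": ["arrays"],
}
print(coerce_hex_strings(parsed))
```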

Nykakin commented 1 year ago

#40 implements parsing hex (and binary and octal) values as numbers instead of strings:

>>> data = """
... {
...     // comments
...     unquoted: 'and you can quote me on that',
...     singleQuotes: 'I can use "double quotes" here',
...     lineBreaks: "Look, Mom! \
...     No \\n's!",
...     hexadecimal: 0xdecaf,
...     leadingDecimalPoint: .8675309, andTrailing: 8675309.,
...     positiveSign: +1,
...     trailingComma: 'in objects', andIn: ['arrays',],
...     "backwardsCompatible": "with JSON",
... }
... """
>>> import chompjs
>>> chompjs.parse_js_object(data)
{'unquoted': 'and you can quote me on that', 'singleQuotes': 'I can use "double quotes" here', 'lineBreaks': "Look, Mom!     No \n's!", 'hexadecimal': 912559, 'leadingDecimalPoint': 0.8675309, 'andTrailing': 8675309.0, 'positiveSign': '+1', 'trailingComma': 'in objects', 'andIn': ['arrays'], 'backwardsCompatible': 'with JSON'}
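In the output above, `positiveSign` still comes back as the string `'+1'`. If a numeric value is wanted, a small conversion on the caller's side is one option. This is only a sketch: `coerce_signed_number` is a hypothetical helper, not a chompjs function, and it would also convert a genuine string that happens to look like a number.

```python
import re

# Optional sign, then digits with an optional fraction (or a bare fraction),
# then an optional exponent.
_SIGNED_NUMBER = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)([eE][+-]?\d+)?")


def coerce_signed_number(value):
    """Return an int or float for strings like '+1'; leave anything else alone."""
    if not isinstance(value, str) or not _SIGNED_NUMBER.fullmatch(value):
        return value
    try:
        return int(value)
    except ValueError:
        # Not an integer literal (fraction or exponent present), so use float.
        return float(value)


print(coerce_signed_number("+1"))
print(coerce_signed_number("with JSON"))
```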