Nykakin / chompjs

Parsing JavaScript objects into Python data structures
MIT License
197 stars 11 forks

JavaScript sample failing #4

Closed pawelmhm closed 4 years ago

pawelmhm commented 4 years ago

Hey @Nykakin, I found another sample that fails for unclear reasons. I pasted it here:

https://pastebin.com/2tZEm5EL

It fails with: "ValueError: Parser error: ... Lawn Sweeper, get (1) Agri-Fab"

I see this is actually invalid JavaScript with quotes that are not escaped. Do you think we should support something like this?

Nykakin commented 4 years ago

@pawelmhm this looks like the farmandfleet.com website, and from what I can see, its data parses just fine with json.loads alone. For example, for the page https://www.farmandfleet.com/lawn-aerators-and-rollers/:

>>> type(json.loads(response.css('script:contains(searchResult)').re_first('window.searchResult = (.*);')))
<class 'dict'>

Please provide the URL of the actual page and show how you obtained this invalid input from it.
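
For readers without Scrapy at hand, the extraction step in the session above can be approximated with the standard library. This is only a sketch: `script_text` is a made-up stand-in for the page's script body, not the real farmandfleet.com content, and `re.search` replaces Scrapy's `re_first`.

```python
import json
import re

# Hypothetical script body; the real page content is not reproduced here.
script_text = 'window.searchResult = {"products": [{"name": "Lawn Roller"}]};'

# Same pattern as in the session above, with the dot escaped.
# The greedy (.*) runs up to the last ';' on the line.
match = re.search(r'window\.searchResult = (.*);', script_text)

# Parse the captured object literal as JSON, or fall back to None if nothing matched.
data = json.loads(match.group(1)) if match else None
```

Note that `.` does not match newlines by default, so this only works when the assignment sits on a single line.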

pawelmhm commented 4 years ago

Yes, farmandfleet. This page has broken JavaScript: https://www.farmandfleet.com/lawn-mower-and-atv-attachments/ You can see that the inch mark collides with the string quotes: "description": "Buy (1) Agri-Fab 44" Lawn Sweeper.

I'm not sure we can do anything here. We could probably close the ticket and assume the input is just broken, but I'm leaving the decision to you. If you think we can parse it in chompjs, let me know.

Nykakin commented 4 years ago

As I said, I also need to see how you extracted the string from the website. Using Scrapy, the default re_first call fails, but I can parse the result just fine with json.loads if I pass replace_entities=False to re_first:

>>> script = response.css('script:contains(searchResult)')
>>> type(json.loads(script.re_first('window.searchResult = (.*);')))
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/usr/lib/python2.7/json/__init__.py", line 339, in loads
    return _default_decoder.decode(s)
  File "/usr/lib/python2.7/json/decoder.py", line 364, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/usr/lib/python2.7/json/decoder.py", line 380, in raw_decode
    obj, end = self.scan_once(s, idx)
ValueError: Expecting ',' delimiter: line 1 column 27105 (char 27104)
>>> type(json.loads(script.re_first('window.searchResult = (.*);', replace_entities=False)))
<type 'dict'>
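
The difference between the two calls can be reproduced without Scrapy. The sketch below assumes the raw page stores the inch mark as the HTML entity `&quot;`, which the thread does not show directly; `raw` is a hypothetical fragment modelled on the description quoted earlier, and `html.unescape` stands in for the entity replacement that `re_first` performs by default.

```python
import html
import json

# Hypothetical fragment; assumes the page encodes the inch mark as &quot;.
raw = '{"description": "Buy (1) Agri-Fab 44&quot; Lawn Sweeper"}'

# Parsing the raw text works: &quot; is just ordinary characters inside the string.
parsed = json.loads(raw)

# Entities can be decoded safely once the JSON structure has been parsed.
description = html.unescape(parsed["description"])

# Decoding entities first turns &quot; into a bare quote that ends the string early.
try:
    json.loads(html.unescape(raw))
    broken = False
except ValueError:  # json.JSONDecodeError is a subclass of ValueError
    broken = True
```

Under that assumption, the order of operations is what matters: parse first, decode entities afterwards.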

Nykakin commented 4 years ago

Closing due to inactivity.