GoogleChrome / lighthouse

Automated auditing, performance metrics, and best practices for the web.
https://developer.chrome.com/docs/lighthouse/overview/
Apache License 2.0

Adopt a streaming JSON parser #1798

Closed · paulirish closed this issue 7 years ago

paulirish commented 7 years ago

Details are here: https://github.com/GoogleChrome/lighthouse/issues/1685#issuecomment-

We can use the existing incremental JSON parser that's in DevTools (see WebInspector.TimelineLoader and WebInspector.TextUtils.BalancedJSONTokenizer).

Alternatively, we could pull in a new npm module for it.

This would fix #1685 and #1733
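
For concreteness, here's a rough sketch of the npm-module route, assuming the JSONStream package (just one candidate, not a settled choice; the chunk array below stands in for data arriving incrementally):

```js
// Hedged sketch: JSONStream is one possible npm module, not a settled choice.
const JSONStream = require('JSONStream');

const parser = JSONStream.parse('traceEvents.*'); // emits each trace event
const events = [];
parser.on('data', event => events.push(event));
parser.on('end', () => console.log(`parsed ${events.length} trace events`));

// Stand-in for chunks arriving incrementally; the parser takes plain
// .write() calls, so no real Node stream is needed on the producing side.
const chunks = ['{"traceEvents":[{"name":"a"}', ',{"name":"b"}]}'];
for (const chunk of chunks) parser.write(chunk);
parser.end();
```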

evenstensberg commented 7 years ago

Working on this pretty soon 👍

paulirish commented 7 years ago

Related work: https://codereview.chromium.org/2755943002

And here are the DevTools BalancedJSONTokenizer and _writeBalancedJSON I mentioned.

evenstensberg commented 7 years ago

Do you have any more tips that might help me get started? Also, thank you so much so far!

paulirish commented 7 years ago

Our existing JSON.parse() for the trace is in driver.js; that's the one we need to replace.

Presumably we'd parse each chunk as it comes in from the IO.read stream.

I think it's best for us to start with an npm module for the incremental JSON parsing. We don't have a real Node stream, so we'll need a library that's okay with that. We also want it to have minimal runtime dependencies.
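
A sketch of what that replacement could look like; `sendCommand` and `parser` here are illustrative stand-ins, not the actual driver API:

```js
// Illustrative only: drain the trace over the protocol's IO.read and feed
// each chunk to an incremental parser instead of one giant JSON.parse.
async function readTraceStream(sendCommand, streamHandle, parser) {
  let eof = false;
  while (!eof) {
    // IO.read returns {data, eof} for the next chunk of the stream.
    const response = await sendCommand('IO.read', {handle: streamHandle});
    parser.write(response.data); // parse incrementally, chunk by chunk
    eof = response.eof;
  }
  await sendCommand('IO.close', {handle: streamHandle});
  parser.end(); // assumed interface on the parser, as in the sketch above
}
```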

evenstensberg commented 7 years ago

Gotcha!

brendankenny commented 7 years ago

I spent some time today looking at the available options (I was working on a separate problem, but it intersected with streaming JSON parsing, so it was worth a look :)

Right now I think the BalancedJSONTokenizer and _writeBalancedJSON mentioned in https://github.com/GoogleChrome/lighthouse/issues/1798#issuecomment-301934352 are the way forward. The options out there are kind of overkill, and they don't seem particularly efficient for the relatively simple case we have here (we don't need to parse arbitrary JSON, just the timeline/trace format).

BalancedJSONTokenizer is nice in particular because a) it discards the chunk string after it's been parsed, and b) it just uses JSON.parse on each chunk rather than requiring custom character-by-character parsing.

It is optimized for a few particular formats and could struggle with others, but one of those particular formats is exactly what we want to parse :)

This was all fairly cursory, however, so I'd be interested in hearing about more options if others have looked into this.
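
To make the a)/b) behavior concrete, here's a toy version of the balanced-tokenizer idea (not the DevTools code; it assumes the outer `{"traceEvents":[` wrapper has already been skipped, so only event objects are fed in):

```js
// Toy illustration of the approach: find balanced {...} spans, JSON.parse
// each one, and drop consumed text so the buffer never grows unbounded.
// This is NOT the DevTools implementation, just the core idea.
class MiniBalancedTokenizer {
  constructor(onObject) {
    this.onObject = onObject; // called with each parsed trace event
    this.buffer = '';
  }

  write(chunk) {
    this.buffer += chunk;
    let depth = 0;
    let inString = false;
    let escaped = false;
    let start = -1;
    let consumed = 0;
    for (let i = 0; i < this.buffer.length; i++) {
      const c = this.buffer[i];
      if (inString) {
        if (escaped) escaped = false;
        else if (c === '\\') escaped = true;
        else if (c === '"') inString = false;
      } else if (c === '"') {
        inString = true;
      } else if (c === '{') {
        if (depth === 0) start = i;
        depth++;
      } else if (c === '}') {
        depth--;
        if (depth === 0 && start !== -1) {
          // (a) parse the balanced span with plain JSON.parse ...
          this.onObject(JSON.parse(this.buffer.slice(start, i + 1)));
          consumed = i + 1;
          start = -1;
        }
      }
    }
    // (b) ... then discard everything already handed off.
    this.buffer = this.buffer.slice(consumed);
  }
}

// Usage: a key split across chunks is simply handled on the next write.
const events = [];
const tok = new MiniBalancedTokenizer(e => events.push(e));
tok.write('{"pid":1,"name":"a"},{"pi');
tok.write('d":2,"name":"b"}');
// events now contains both objects.
```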

evenstensberg commented 7 years ago

I've been trying to reproduce the bug using lighthouse, the lighthouse CLI, and a clone of lighthouse against my clone of Chrome, but I don't get this error. Do you mind walking me through how to reproduce it? Assume I've got everything set up. A screenshot would be great. @brendankenny do you mind me working on this? Good First Contribution :)

brendankenny commented 7 years ago

@ev1stensberg unfortunately this is one I've only seen once or twice myself, and I haven't been able to reliably reproduce it. It's more a case of something we know is happening for some people and (we think) know why, so hopefully we can just fix it and watch the number of error reports drop :)

One option for testing/development is creating an artificial test case: take one of the test traces (e.g. this one) and increase the number of traceEvents in it until it breaks JSON.parse. The result can be used for developing the incremental parser, which we can verify parses the JSON correctly and works for very large inputs, and then cross our fingers that it starts fixing #1685 and #1733.
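
A sketch of how that artificial trace could be generated (the fixture path and repetition count are illustrative):

```js
// Clone a small fixture trace's events until the serialized JSON is too big
// for a single JSON.parse. Streaming the write keeps the generator itself
// from hitting the string-size limits we're trying to trigger.
const fs = require('fs');

const seed = JSON.parse(fs.readFileSync('fixture-trace.json', 'utf8'));
const eventsJson = seed.traceEvents.map(e => JSON.stringify(e)).join(',');

const out = fs.createWriteStream('huge-trace.json');
out.write('{"traceEvents":[');
const copies = 5000; // tune upward until JSON.parse(fs.readFileSync(...)) throws
for (let i = 0; i < copies; i++) {
  out.write((i === 0 ? '' : ',') + eventsJson);
}
out.write(']}');
out.end();
```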

> do you mind me working on this? Good First Contribution :)

It's great to work on this! I was just looking at a related problem today (OOM errors when saving traces to disk) and want to make sure I don't step on your toes. Can you share what you're thinking?

evenstensberg commented 7 years ago

I've really just hovered around the source code, but the suggestions from you & @paulirish are enough to convince me. Also, this is internal to DevTools, so we'd be avoiding extra dependencies, which is good. Stepping on my toes is alright.

> It's great to work on this! I was just looking at a related problem today (OOM errors when saving traces to disk) and want to make sure I don't step on your toes. Can you share what you're thinking?

Do you have a crbug on this that I could have a look at? Also, my guess is that our work won't collide much, since this isn't a very novel or code-extensive exercise?

evenstensberg commented 7 years ago

@brendankenny it seems people have reported this as resolved in v2. I still want to work on it, but we should try to get a test case up. Do you have time to exchange a few emails to get me started? I'm eager to help out, and this also ties into a project that is (hopefully) starting in June.

paulirish commented 7 years ago

heya @ev1stensberg.

Brendan's made some more headway here, so I think it makes sense for him to finish off the streaming JSON parser/writer. Shouldn't take him too long.

I'm looking at our issues, and #2320 and #2291 could work for your next issue. wdyt?

evenstensberg commented 7 years ago

Sounds good. Will start on this tomorrow. Is there anyone I can get some help from regarding testing the source code against canary/master? I've tried with no luck; it would be good to get some feedback on my walkthrough to see if there's anything I'm doing wrong :)