Improve VM's JSON decoder

mraleph commented 4 months ago

We sometimes hear complaints that our JSON decoder is slow compared to, for example, V8's. Usually we just shrug these complaints away saying that:

Dart's JSON decoder is likely fast enough
Dart's JSON decoder is written in Dart and V8's is written in C++ - so some degree of performance difference is expected.

Developers are then encouraged to use high performance native JSON decoders if JSON decoding is a bottleneck for them.

This does not square very well with my belief that Dart should be considered a general purpose high-level programming language. So I would like to take a closer look at JSON's decoder performance and see if it can be considerably improved.

I have taken JSON files from https://github.com/simdjson/simdjson-data/tree/master/jsonexamples and run them through JS and Dart benchmarks.

Dart benchmark is measuring performance of JSON decoder fused with UTF8 decoder:

  final decoder = Utf8Decoder().fuse(JsonDecoder());
  // ...
    for (var i = 0; i < N; i++) {
      decoder.convert(bytes);
    }

JS benchmark is measuring performance of decoding JSON from pre-decoded JS string as well as performance of coverting UTF8 bytes to a JS string:

function benchmark(baseDir, fileName) {
  let content = read(/* ... */);
  // ...
    for (var i = 0; i < N; i++) {
      JSON.parse(content);
    }
  // ...
}

function benchmarkWithUtf8Decoding(baseDir, fileName) {
  let content = readbuffer(/* ... */);
    // ...
    for (var i = 0; i < N; i++) {
      JSON.parse(decodeUtf8(content));
    }
    // ...
}

decodeUtf8 is a helper function which I added to d8 shell which effectively does the following: String::NewFromUtf8(isolate, buffer, NewStringType::kNormal, buffer_size).

This was done to measure the cost of utf8 -> utf16 convertion which will otherwise be hidden.

The baseline measurement for V8 on my desktop looks like this:

"apache_builds.json",string,0.30905068359375
"apache_builds.json",buffer,0.3241399414062499
"canada.json",string,14.897924999999987
"canada.json",buffer,16.510981250000007
"citm_catalog.json",string,3.759248437500003
"citm_catalog.json",buffer,6.9653749999999945
"github_events.json",string,0.14244892578124996
"github_events.json",buffer,0.18018886718749982
"google_maps_api_compact_response.json",string,0.07089770507812503
"google_maps_api_compact_response.json",buffer,0.07248544921875003
"google_maps_api_response.json",string,0.08099453125
"google_maps_api_response.json",buffer,0.08426477050781252
"gsoc-2018.json",string,4.6573312500000155
"gsoc-2018.json",buffer,6.232159375000014
"instruments.json",string,0.49713339843750076
"instruments.json",buffer,0.635085742187502
"marine_ik.json",string,13.519856249999975
"marine_ik.json",buffer,14.717656249999981
"mesh.json",string,2.9209296875000064
"mesh.json",buffer,3.354901562500004
"mesh.pretty.json",string,4.357474999999999
"mesh.pretty.json",buffer,5.1385874999999945
"numbers.json",string,0.5762896484374977
"numbers.json",buffer,0.6817013671874974
"random.json",string,2.822692968749993
"random.json",buffer,3.569248437499982
"repeat.json",string,0.025887060546875063
"repeat.json",buffer,0.051292358398437446
"semanticscholar-corpus.json",string,50.387549999999464
"semanticscholar-corpus.json",buffer,55.51836249999978
"tree-pretty.json",string,0.09902978515625023
"tree-pretty.json",buffer,0.10347050781250004
"twitter_api_compact_response.json",string,0.03364210205078137
"twitter_api_compact_response.json",buffer,0.04489652099609387
"twitter_api_response.json",string,0.04411997070312523
"twitter_api_response.json",buffer,0.06057795410156288
"twitterescaped.json",string,1.7861546875000158
"twitterescaped.json",buffer,2.1810593749999954
"twitter.json",string,1.7194124999999985
"twitter.json",buffer,3.385740625000017
"twitter_timeline.json",string,0.13089736328124957
"twitter_timeline.json",buffer,0.13675043945312382
"update-center.json",string,2.1542187499999956
"update-center.json",buffer,3.237673437500007

Dart in AOT mode yields:

"apache_builds.json",0.6998046875
"canada.json",31.8375
"citm_catalog.json",9.1125
"github_events.json",0.39033203125
"google_maps_api_compact_response.json",0.115283203125
"google_maps_api_response.json",0.162744140625
"gsoc-2018.json",20.49375
"instruments.json",1.45390625
"marine_ik.json",23.26875
"mesh.json",4.640625
"mesh.pretty.json",11.571875
"numbers.json",0.570703125
"random.json",4.7703125
"repeat.json",0.0804443359375
"semanticscholar-corpus.json",69.275
"tree-pretty.json",0.22900390625
"twitter_api_compact_response.json",0.0744384765625
"twitter_api_response.json",0.1037109375
"twitterescaped.json",4.553125
"twitter.json",4.6828125
"twitter_timeline.json",0.45234375
"update-center.json",4.2859375

With the exceptions of numbers.json Dart's decoder is 1.5-3x slower.

Good news is that it is not impossible to significantly improve its performance - I have done some experiments and got to the following preliminary numbers:

"apache_builds.json",0.3947265625
"canada.json",30.7375
"citm_catalog.json",5.0671875
"github_events.json",0.22177734375
"google_maps_api_compact_response.json",0.093896484375
"google_maps_api_response.json",0.101416015625
"gsoc-2018.json",8.790625
"instruments.json",0.956640625
"marine_ik.json",18.00625
"mesh.json",3.984375
"mesh.pretty.json",8.625
"numbers.json",0.49609375
"random.json",3.4328125
"repeat.json",0.04671630859375
"semanticscholar-corpus.json",36.4875
"tree-pretty.json",0.14794921875
"twitter_api_compact_response.json",0.056640625
"twitter_api_response.json",0.071142578125
"twitterescaped.json",3.8953125
"twitter.json",3.1515625
"twitter_timeline.json",0.37626953125
"update-center.json",2.89453125

Which are significantly better.

I am going to use this issue as an umbrella issue to land my changes.

modulovalue commented 4 months ago

@mraleph Do you see any potential performance improvements from merging the underlying regular UTF8 automaton with an automaton that lexes JSON and using that as the lexer for the JsonDecoder (to have a Utf8JsonDecoder) so that no fusing is needed?

lrhn commented 4 months ago

The fusing creates a UTF-8 specialized JSON decoder, _JsonUtf8Decoder.

modulovalue commented 4 months ago

Thank you for the hint @lrhn. It looks to me like the specialized UTF8-JSON decoder is using a table-driven UTF8 decoder and a non-table-driven JSON parser.

What I meant to say in this comment was, since the language of UTF8 appears to be regular, and the lexical structure of JSON seems to be regular too, it feels like there might be performance gains to be had by merging both automata into one and having the tokenization phase be based on a single automaton that does UTF8 decoding and JSON lexing at the same time.

lrhn commented 4 months ago

I think the UTF-8 is a red herring here. A table based tokenizer would work equally well for both UTF-8 and string based JSON parsing, since everything outside of string values is ASCII anyway, and what's inside strings need to be interpreted if it contains escapes -- or non-ASCII characters in a UTF-8 source.

The reason the current implementation doesn't use a table based tokenizer is that it doesn't need to tokenize at all. The moment it sees the first character of a value, it can start interpreting it. It tries to collect the value of a number and the contents of a string while scanning, in one pass, so it doesn't have to first tokenize, and then interpret the token afterwards.

Maybe that process can be made more efficient, but doing an extra tokenization step first doesn't sound like an improvement. (Unless it can be done very efficiently using SIMD operations or using parallelism, or something.)

rakudrama commented 1 month ago

The current decoder generates maps that take a lot of memory. Decoding

[{"x":1,"y":2},{"x":3,"y":4}]

creates fresh strings for the repeated "x" (and "y"), and these strings are inserted into similar Maps that have their own index.

There is a cost to using a fresh string as a key. Other than memory footprint, the hashCode needs to be computed for each instance. Perhaps this cost could be better used to identify a common index to be shared by similar JSON objects (marked as copy-on-write if used more than once).

lrhn commented 1 month ago

Maps and strings can probably be optimized. Large JSON inputs trendy to have repetitions of the same map structure. Canonicalized map keys, maybe even some short values, can avoid doing as many allocations, and if it's possible to reuse the map structure, then there might be a saving there too. The cost of that is an overhead while using, for recognizing those existing values and structures. I have tried it before, and the overhead was visible.

It's much more efficient to build a custom parser for the known structures using a more pull-based JSON interpreter, than to try to recognize possibly repeated structures at parse time in a generic parser.

(That is: if you care about performance, why are you building maps and lists at all?)

mraleph commented 1 month ago

The current decoder generates maps that take a lot of memory.

I have a prototype patch which canonicalizes keys when decoding JSON (this helps both with memory consumption and improves access speed as well, because accessing maps with literal keys can short-circuit string comparison later). You still need to compute hashCode though.

I entertained the idea of using "predictive" decoding, where decoding an array of JSON objects makes an assumption about repeated shape and uses that to short-circuit both key decoding and Map hashing and maybe even use specialized COW Map implementation which allows sharing of the "index" part. This is what V8 JSON decoder effectively does. But I am not sure I want to go there - I feel the need to keep JSON decoder complexity down. My thinking here is aligned with @lrhn's: if code is doing bytes -> JSON objects (Map's) -> proper Dart objects then it is better to actually do bytes -> proper Dart objects.

gnprice commented 1 month ago

A pull-based JSON interpreter for going straight from bytes to proper Dart objects sounds great. Previous issue threads for that:

kevmoo commented 1 month ago

See also https://github.com/kevmoo/json_alt – although this code is VERY old

schultek commented 4 weeks ago

feinstein commented 4 weeks ago

Maybe we should look also at how other languages have dealt with this? I feel that if we only look at how js and dart solutions have being built so far, we might miss some clever solutions that might exist for other languages.

dart-lang / sdk

Improve VM's JSON decoder #55522