arangodb / velocypack

A fast and compact format for serialization and storage

Why not use a state-of-the-art JSON parser? #107

Closed murphyatwork closed 2 years ago

murphyatwork commented 2 years ago

VPack is a great library for serializing and storing binary JSON, and it comes with handy iterator and builder interfaces.

However, the VPack parser looks like an ordinary recursive-descent parser, which cannot take advantage of the SIMD instructions of modern CPUs. As far as I know, the simdjson parser is several times faster than a conventional parser.

So, would you consider combining simdjson with the VPack builder? That could make VPack even better.

I'm considering doing this work myself, since JSON parsing speed is critical in our system. I could submit a PR if you think it would be useful to this project.

jsteemann commented 2 years ago

Well, the VPack parser has also used SIMD instructions for string parsing since 2015. Have you actually compared the performance of simdjson and the velocypack parser? I haven't, so I can't say whether there would be a benefit from using simdjson or how large it would be. @mofeiatwork : did you perform any benchmarks?

murphyatwork commented 2 years ago

Thank you for the reply.

I haven't run a benchmark to compare them yet, but I will do so in a few days. If I find anything interesting, I will post again.

As far as I know, simdjson not only uses SIMD to parse strings and numbers, but also applies a two-pass algorithm to make parsing the JSON structure more efficient. The first stage identifies the structural tokens, such as the brackets { } [ ], the comma, and the double quote, and it can use SIMD to process many characters at once. The second stage runs a state machine over those structural tokens to parse the structure. As a result, the first stage can be executed in parallel at instruction granularity, and the second stage is fairly lightweight. According to their paper, it is several times faster than RapidJSON.
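
The two-stage idea described above can be illustrated with a rough sketch. This is not simdjson's code: it is a scalar Python approximation in which stage 1 visits one byte at a time (simdjson classifies whole blocks of bytes with SIMD instructions), and stage 2 only checks bracket nesting instead of building a document. The function names `find_structurals` and `check_nesting` are made up for this example.

```python
# Scalar illustration of a two-stage JSON scan (hypothetical helper names).
STRUCTURAL = set(b"{}[]:,")  # byte values of the structural characters
QUOTE = 0x22       # '"'
BACKSLASH = 0x5C   # '\'


def find_structurals(data: bytes) -> list:
    """Stage 1: record offsets of structural characters and opening quotes.

    Bytes inside string literals are skipped, so a '}' or ',' that appears
    in a string value is not reported as structure.
    """
    positions = []
    in_string = False
    escaped = False
    for i, byte in enumerate(data):
        if in_string:
            if escaped:
                escaped = False          # this byte was escaped, ignore it
            elif byte == BACKSLASH:
                escaped = True           # next byte is escaped
            elif byte == QUOTE:
                in_string = False        # closing quote ends the string
        elif byte == QUOTE:
            in_string = True
            positions.append(i)          # remember where the string starts
        elif byte in STRUCTURAL:
            positions.append(i)
    return positions


def check_nesting(data: bytes, positions: list) -> int:
    """Stage 2 (simplified): look only at the recorded offsets.

    Verifies that brackets are balanced and returns the maximum depth.
    """
    closers = {ord("}"): ord("{"), ord("]"): ord("[")}
    stack = []
    max_depth = 0
    for pos in positions:
        byte = data[pos]
        if byte in (ord("{"), ord("[")):
            stack.append(byte)
            max_depth = max(max_depth, len(stack))
        elif byte in closers:
            if not stack or stack.pop() != closers[byte]:
                raise ValueError(f"unbalanced bracket at offset {pos}")
    if stack:
        raise ValueError("unclosed bracket")
    return max_depth


doc = b'{"a": [1, 2, {"b": "x]}"}]}'
offsets = find_structurals(doc)
print(offsets, check_nesting(doc, offsets))
```

The point of the split is that stage 2 never has to look at whitespace or string contents again; it jumps from one recorded offset to the next.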

@jsteemann

image

jsteemann commented 2 years ago

Looks good. Would be happy to try this, but may not be able to do so soon due to lack of time. But definitely looks interesting.

murphyatwork commented 2 years ago

Well, I can give it a try. Just give me a few days.

jsteemann commented 2 years ago

That would be super awesome! Thanks! :+1:

murphyatwork commented 2 years ago

Hi, Jan. I have run the benchmark for the JSON parsers. This PR explains the parameters, environment, and detailed results.

According to the benchmark results, the vpack JSON parser is quite competitive with rapidjson, but much slower than simdjson on many datasets. It may be worth porting the simdjson parser into this project.

| Dataset | Parser | Bytes/second | Documents/second |
|---|---|---|---|
| small.json | vpack | 188899782.86 | 2303655.89 |
| small.json | rapidjson | 127351861.73 | 1553071.48 |
| small.json | simdjson | 260920607.46 | 3181958.63 |
| sample.json | vpack | 904174249.95 | 1315.18 |
| sample.json | rapidjson | 1413211628.26 | 2055.61 |
| sample.json | simdjson | 4950599132.35 | 7200.97 |
| sampleNoWhite.json | vpack | 244545692.90 | 1419.63 |
| sampleNoWhite.json | rapidjson | 459349740.48 | 2666.61 |
| sampleNoWhite.json | simdjson | 4149985092.36 | 24091.40 |
| commits.json | vpack | 163986219.85 | 6503.26 |
| commits.json | rapidjson | 252490281.96 | 10013.10 |
| commits.json | simdjson | 4078092802.08 | 161726.40 |
| api-docs.json | vpack | 849059621.72 | 704.05 |
| api-docs.json | rapidjson | 653520761.57 | 541.91 |
| api-docs.json | simdjson | 6662158225.83 | 5524.34 |
| countries.json | vpack | 244453766.46 | 215.56 |
| countries.json | rapidjson | 271919886.47 | 239.78 |
| countries.json | simdjson | 2603110140.11 | 2295.45 |
| directory-tree.json | vpack | 184676956.61 | 620.36 |
| directory-tree.json | rapidjson | 223655990.89 | 751.29 |
| directory-tree.json | simdjson | 2903582799.94 | 9753.55 |
| doubles-small.json | vpack | 83976076.75 | 529.13 |
| doubles-small.json | rapidjson | 505519075.98 | 3185.25 |
| doubles-small.json | simdjson | 4748838987.22 | 29922.24 |
| doubles.json | vpack | 58083903.63 | 48.93 |
| doubles.json | rapidjson | 333404631.13 | 280.87 |
| doubles.json | simdjson | 4322472494.62 | 3641.32 |
| file-list.json | vpack | 329627271.95 | 2178.39 |
| file-list.json | rapidjson | 266579913.89 | 1761.73 |
| file-list.json | simdjson | 5316260708.28 | 35133.27 |
| object.json | vpack | 59581590.89 | 377.62 |
| object.json | rapidjson | 385906478.91 | 2445.84 |
| object.json | simdjson | 4601729866.12 | 29165.30 |
| pass1.json | vpack | 250051310.75 | 173526.24 |
| pass1.json | rapidjson | 485213399.57 | 336719.92 |
| pass1.json | simdjson | 2061878098.89 | 1430866.13 |
| pass2.json | vpack | 73247764.40 | 1408610.85 |
| pass2.json | rapidjson | 67899297.43 | 1305755.72 |
| pass2.json | simdjson | 138015174.90 | 2654137.98 |
| pass3.json | vpack | 629831861.52 | 4255620.69 |
| pass3.json | rapidjson | 227191647.91 | 1535078.70 |
| pass3.json | simdjson | 390059939.88 | 2635540.13 |
| random1.json | vpack | 459378717.52 | 47495.73 |
| random1.json | rapidjson | 716353526.80 | 74064.67 |
| random1.json | simdjson | 4812340388.09 | 497553.80 |
| random2.json | vpack | 454839652.06 | 55205.69 |
| random2.json | rapidjson | 735467504.66 | 89266.60 |
| random2.json | simdjson | 4596730574.84 | 557923.36 |
| random3.json | vpack | 437717347.45 | 5999.99 |
| random3.json | rapidjson | 423421851.34 | 5804.04 |
| random3.json | simdjson | 5465165253.42 | 74913.51 |

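
To make "competitive with rapidjson, but much slower than simdjson" easier to read off the numbers, here is a small sketch that turns a few rows of the table above into throughput ratios relative to vpack. The three datasets are an arbitrary selection, the bytes/second values are copied from the table, and the `speedup` helper is a name invented for this example.

```python
# Bytes/second figures copied from the benchmark table in this thread
# (three datasets only, chosen arbitrarily for illustration).
THROUGHPUT = {
    "small.json": {
        "vpack": 188899782.86,
        "rapidjson": 127351861.73,
        "simdjson": 260920607.46,
    },
    "commits.json": {
        "vpack": 163986219.85,
        "rapidjson": 252490281.96,
        "simdjson": 4078092802.08,
    },
    "pass3.json": {
        "vpack": 629831861.52,
        "rapidjson": 227191647.91,
        "simdjson": 390059939.88,
    },
}


def speedup(dataset: str, parser: str, baseline: str = "vpack") -> float:
    """Return how many times faster `parser` is than `baseline` on `dataset`.

    A value below 1.0 means the baseline was faster.
    """
    row = THROUGHPUT[dataset]
    return row[parser] / row[baseline]


for name in THROUGHPUT:
    print(
        f"{name}: simdjson {speedup(name, 'simdjson'):.2f}x vpack, "
        f"rapidjson {speedup(name, 'rapidjson'):.2f}x vpack"
    )
```

On these rows the gap ranges from roughly 25x in favor of simdjson on commits.json to a ratio below 1 on pass3.json, where vpack is the fastest of the three, so the picture depends heavily on the dataset.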
murphyatwork commented 2 years ago

I have added a benchmark for the parser, which shows that the performance of velocypack::Parser is good enough for realistic workloads. So this issue can be closed.

PR: https://github.com/arangodb/velocypack/pull/108