Why not use the state-of-art JSON parser?

murphyatwork commented 2 years ago

VPack is a great library for serializing and storing binary JSON, and it comes with handy iterator and builder interfaces.

But the VPack parser looks like a plain recursive-descent parser, which cannot take advantage of the SIMD instructions of modern CPUs. As far as I know, the simdjson parser is a few times faster than a conventional parser.

So, would you consider combining simdjson with the VPack builder? That could make VPack better.

I'm considering doing this work, since JSON parsing speed is critical in our system. I could submit a PR if you think it would be useful to this project.

jsteemann commented 2 years ago

Well, the VPack parser has also used SIMD instructions for string parsing since 2015. Have you actually compared the performance of simdjson and the velocypack parser? I haven't, so I can't say whether there would be a benefit from using simdjson or how large it would be. @mofeiatwork : did you perform any benchmarks?
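For readers unfamiliar with SIMD string scanning, here is a minimal sketch of the general idea: checking 16 bytes at a time for the characters that end or interrupt a plain JSON string run. This is not taken from velocypack; the function name `findQuoteOrBackslash` is hypothetical, and the code assumes SSE2 is available (it falls back to a scalar loop otherwise).

```cpp
#include <cstddef>
#include <string_view>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Hypothetical helper, not part of velocypack: returns the offset of the
// first '"' or '\\' in `s`, or s.size() if neither character occurs.
std::size_t findQuoteOrBackslash(std::string_view s) {
  std::size_t i = 0;
#if defined(__SSE2__)
  const __m128i quote = _mm_set1_epi8('"');
  const __m128i backslash = _mm_set1_epi8('\\');
  // Compare 16 bytes per iteration against both characters.
  for (; i + 16 <= s.size(); i += 16) {
    const __m128i chunk =
        _mm_loadu_si128(reinterpret_cast<const __m128i*>(s.data() + i));
    const __m128i hits = _mm_or_si128(_mm_cmpeq_epi8(chunk, quote),
                                      _mm_cmpeq_epi8(chunk, backslash));
    const int mask = _mm_movemask_epi8(hits);  // one bit per matching byte
    if (mask != 0) {
      std::size_t bit = 0;
      while (((mask >> bit) & 1) == 0) {
        ++bit;  // locate the lowest set bit, i.e. the first match
      }
      return i + bit;
    }
  }
#endif
  // Scalar tail (and full fallback when SSE2 is not available).
  for (; i < s.size(); ++i) {
    if (s[i] == '"' || s[i] == '\\') {
      return i;
    }
  }
  return s.size();
}
```

A string parser could copy everything before the returned offset in one go and only fall back to byte-by-byte handling at quotes and escape sequences.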

murphyatwork commented 2 years ago

Thank you for the reply.

I haven't run a benchmark to compare them yet, but I will do so in a few days. If I find anything interesting, I will report back.

As far as I know, simdjson not only uses SIMD to parse strings and numbers, but also applies a two-pass algorithm to make parsing the JSON structure more efficient. The first stage identifies the structural tokens such as { } [ ] , and ", and it can use SIMD to process many characters at once. The second stage runs a state machine over those structural tokens to parse the structure. As a result, the first stage can be executed in parallel at instruction granularity, and the second stage is fairly lightweight. According to their paper, it can be several times faster than RapidJSON.
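The two-stage idea described above can be sketched roughly as follows. This is a scalar stand-in under stated assumptions, not simdjson's implementation: simdjson builds the structural index with SIMD bitmask operations over blocks of input, while this sketch only mirrors the shape of the data flow. The names `findStructurals` and `maxDepth` are hypothetical, and stage 2 here only checks bracket nesting rather than building a document.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string_view>
#include <vector>

// Stage 1 (scalar stand-in): record the offset of every structural
// character that lies outside a string literal. An opening quote is
// recorded too, since it marks where a string value starts.
std::vector<std::size_t> findStructurals(std::string_view json) {
  std::vector<std::size_t> index;
  bool inString = false;
  bool escaped = false;
  for (std::size_t i = 0; i < json.size(); ++i) {
    const char c = json[i];
    if (inString) {
      if (escaped) {
        escaped = false;   // character after a backslash is taken literally
      } else if (c == '\\') {
        escaped = true;
      } else if (c == '"') {
        inString = false;  // unescaped quote closes the string
      }
      continue;
    }
    switch (c) {
      case '"':
        inString = true;
        index.push_back(i);
        break;
      case '{': case '}': case '[': case ']': case ':': case ',':
        index.push_back(i);
        break;
      default:
        break;
    }
  }
  return index;
}

// Stage 2 (simplified): visit only the indexed positions and track
// nesting. Returns the maximum depth, or -1 if brackets do not match.
int maxDepth(std::string_view json, const std::vector<std::size_t>& index) {
  std::vector<char> open;
  int deepest = 0;
  for (const std::size_t pos : index) {
    assert(pos < json.size());
    const char c = json[pos];
    if (c == '{' || c == '[') {
      open.push_back(c);
      deepest = std::max(deepest, static_cast<int>(open.size()));
    } else if (c == '}' || c == ']') {
      const char expected = (c == '}') ? '{' : '[';
      if (open.empty() || open.back() != expected) {
        return -1;
      }
      open.pop_back();
    }
  }
  return open.empty() ? deepest : -1;
}
```

The point of the split is that stage 2 never looks at bytes inside strings or between tokens; it jumps from one indexed position to the next.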

@jsteemann

jsteemann commented 2 years ago

Looks good. Would be happy to try this, but may not be able to do so soon due to lack of time. But definitely looks interesting.

murphyatwork commented 2 years ago

Well, I can give it a try. Give me a few days.

jsteemann commented 2 years ago

That would be super awesome! Thanks! :+1:

murphyatwork commented 2 years ago

Hi, Jan. I have run the benchmark for the JSON parsers. This PR explains the parameters, environment and detailed results.

According to the benchmark results, the vpack JSON parser is quite competitive with rapidjson, but much slower than simdjson on many datasets. It may be worth porting the simdjson parser into this project.

Dataset	Parser	Bytes/second	Documents/second
small.json	vpack	188899782.86	2303655.89
small.json	rapidjson	127351861.73	1553071.48
small.json	simdjson	260920607.46	3181958.63
sample.json	vpack	904174249.95	1315.18
sample.json	rapidjson	1413211628.26	2055.61
sample.json	simdjson	4950599132.35	7200.97
sampleNoWhite.json	vpack	244545692.90	1419.63
sampleNoWhite.json	rapidjson	459349740.48	2666.61
sampleNoWhite.json	simdjson	4149985092.36	24091.40
commits.json	vpack	163986219.85	6503.26
commits.json	rapidjson	252490281.96	10013.10
commits.json	simdjson	4078092802.08	161726.40
api-docs.json	vpack	849059621.72	704.05
api-docs.json	rapidjson	653520761.57	541.91
api-docs.json	simdjson	6662158225.83	5524.34
countries.json	vpack	244453766.46	215.56
countries.json	rapidjson	271919886.47	239.78
countries.json	simdjson	2603110140.11	2295.45
directory-tree.json	vpack	184676956.61	620.36
directory-tree.json	rapidjson	223655990.89	751.29
directory-tree.json	simdjson	2903582799.94	9753.55
doubles-small.json	vpack	83976076.75	529.13
doubles-small.json	rapidjson	505519075.98	3185.25
doubles-small.json	simdjson	4748838987.22	29922.24
doubles.json	vpack	58083903.63	48.93
doubles.json	rapidjson	333404631.13	280.87
doubles.json	simdjson	4322472494.62	3641.32
file-list.json	vpack	329627271.95	2178.39
file-list.json	rapidjson	266579913.89	1761.73
file-list.json	simdjson	5316260708.28	35133.27
object.json	vpack	59581590.89	377.62
object.json	rapidjson	385906478.91	2445.84
object.json	simdjson	4601729866.12	29165.30
pass1.json	vpack	250051310.75	173526.24
pass1.json	rapidjson	485213399.57	336719.92
pass1.json	simdjson	2061878098.89	1430866.13
pass2.json	vpack	73247764.40	1408610.85
pass2.json	rapidjson	67899297.43	1305755.72
pass2.json	simdjson	138015174.90	2654137.98
pass3.json	vpack	629831861.52	4255620.69
pass3.json	rapidjson	227191647.91	1535078.70
pass3.json	simdjson	390059939.88	2635540.13
random1.json	vpack	459378717.52	47495.73
random1.json	rapidjson	716353526.80	74064.67
random1.json	simdjson	4812340388.09	497553.80
random2.json	vpack	454839652.06	55205.69
random2.json	rapidjson	735467504.66	89266.60
random2.json	simdjson	4596730574.84	557923.36
random3.json	vpack	437717347.45	5999.99
random3.json	rapidjson	423421851.34	5804.04
random3.json	simdjson	5465165253.42	74913.51
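As a reading aid for the table, the figures can be turned into relative speedups, and dividing bytes/second by documents/second recovers the size of each test document. The sketch below uses four rows copied from the table; the type and function names (`Throughput`, `speedupOver`, `bytesPerDocument`) are made up for illustration and are not part of the benchmark in the PR.

```cpp
#include <cassert>

// One row of the table: throughput of a parser on a dataset.
struct Throughput {
  double bytesPerSecond;
  double documentsPerSecond;
};

// How many times faster `candidate` is than `baseline` (values below 1
// mean the candidate is slower).
double speedupOver(const Throughput& candidate, const Throughput& baseline) {
  assert(baseline.bytesPerSecond > 0.0);
  return candidate.bytesPerSecond / baseline.bytesPerSecond;
}

// Size of one document in bytes, implied by the two throughput columns.
double bytesPerDocument(const Throughput& t) {
  assert(t.documentsPerSecond > 0.0);
  return t.bytesPerSecond / t.documentsPerSecond;
}

// Rows taken verbatim from the table above.
const Throughput kDoublesVpack{58083903.63, 48.93};
const Throughput kDoublesSimdjson{4322472494.62, 3641.32};
const Throughput kPass3Vpack{629831861.52, 4255620.69};
const Throughput kPass3Simdjson{390059939.88, 2635540.13};
```

With these rows, `speedupOver(kDoublesSimdjson, kDoublesVpack)` comes out at roughly 74, while `speedupOver(kPass3Simdjson, kPass3Vpack)` is about 0.62, meaning vpack is the faster parser on pass3.json. `bytesPerDocument(kPass3Vpack)` is about 148, so the datasets where vpack does well in this table include very small documents.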

murphyatwork commented 2 years ago

I have put together a benchmark for the parser, which shows that the performance of velocypack::Parser is good enough for realistic workloads. So this issue can be closed.

PR: https://github.com/arangodb/velocypack/pull/108
