TeskaLabs / cysimdjson

Very fast Python JSON parsing library
Apache License 2.0

Performance comparison from readme seems a bit unfair #29

Closed PawelTroka closed 2 years ago

PawelTroka commented 2 years ago

Hi!

First of all thanks for this library! This sounds like a very good idea.

However, what I noticed is that since it evaluates fields lazily, comparing it directly to other JSON libraries that provide you with the full dictionary right away is a bit unfair.

Assuming you will use the whole dict anyway, IMHO a fairer comparison would include the .export() call.

>>> _json_string = '{"a fairly": "expensive", "json": "goes-in", "here": 121}'
>>> timeit(lambda: cysimdjson_parser.parse_string(_json_string).export(), number=100000)
3.6677306999990833
>>> timeit(lambda: orjson_parser.loads(_json_string), number=100000)
2.9754124999963096

With the .export() call included, however, it is slower than orjson.

It gets a lot faster if you do not use the whole dictionary.

>>> timeit(lambda: cysimdjson_parser.loads(_json_string)[7]['revisionNumber'], number=100000)
0.4328116000033333
>>> timeit(lambda: orjson_parser.loads(_json_string)[7]['revisionNumber'], number=100000)
3.0126906000004965

However, in my experience this is rarely the case.
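
A minimal sketch of the two measurements above, side by side: full materialization (with .export() for cysimdjson) versus reading a single field. The sample payload and the names `SAMPLE`, `build_cases` and `run` are assumptions made for illustration, since the document behind the `[7]['revisionNumber']` timing is not shown here. The cysimdjson calls (`JSONParser()`, `.loads()`, `.export()`) are the ones used in this thread and the README; orjson and cysimdjson are skipped if they are not installed.

```python
import json
from timeit import timeit

# Hypothetical payload: a list of small objects, so that [7]['revisionNumber'] is valid.
SAMPLE = json.dumps(
    [{"id": i, "revisionNumber": i * 10, "tags": ["a", "b"]} for i in range(20)]
)


def build_cases(text):
    """Return {name: callable} for every parser that can be imported."""
    cases = {
        "json full": lambda: json.loads(text),
        "json one field": lambda: json.loads(text)[7]["revisionNumber"],
    }
    try:
        import orjson

        cases["orjson full"] = lambda: orjson.loads(text)
        cases["orjson one field"] = lambda: orjson.loads(text)[7]["revisionNumber"]
    except ImportError:
        pass  # orjson is optional for this sketch
    try:
        import cysimdjson

        parser = cysimdjson.JSONParser()
        # .export() forces conversion to plain Python objects, which is what
        # json.loads and orjson.loads always do.
        cases["cysimdjson full"] = lambda: parser.loads(text).export()
        # Without .export(), only the requested element is converted.
        cases["cysimdjson one field"] = lambda: parser.loads(text)[7]["revisionNumber"]
    except ImportError:
        pass  # cysimdjson is optional for this sketch
    return cases


def run(number=10_000):
    """Time every available case and return {name: seconds}."""
    return {name: timeit(fn, number=number) for name, fn in build_cases(SAMPLE).items()}


if __name__ == "__main__":
    for name, seconds in run().items():
        print(f"{name:24s} {seconds:.4f} s")
```

Comparing the "full" rows against each other and the "one field" rows against each other keeps the eager and lazy cases separate, so neither library is measured on work the other one skips.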

ateska commented 2 years ago

Hi, yes, this is because the conversion to a Python dict (and other types) is the most expensive part of the whole JSON parsing. The idea behind this library is to harness the raw power of SIMDJSON in Python, not to race against orjson.

It is also, as you correctly point out, not a universal replacement: you need to make some trade-offs (the parsing output is read-only and is not a true Python dictionary). The message is that (1) these speeds are possible in Python, and (2) you need to adjust your design if you want to be in this performance range.

In our case, we parse rather big (10 kB) JSON documents at a very high frequency (>50000 per second), we access far from all of the attributes, and we don't need to modify the dictionary. For this, SIMDJSON is an ideal choice.

I'll try to highlight that in the README.

ateska commented 2 years ago

https://github.com/TeskaLabs/cysimdjson/blob/main/README.md#trade-offs