elastic/rally

Macrobenchmarking framework for Elasticsearch
Apache License 2.0

Evaluate whether pysimdjson could be used in Rally #1046

Open dliappis opened 4 years ago

dliappis commented 4 years ago

There are largely two areas where handling large chunks of JSON impacts performance in Rally:

  1. Parsing the JSON source
  2. Creating Python (dict) Objects from JSON

The simdjson project takes advantage of modern SIMD vector instructions to achieve much higher performance than other JSON libraries.

The pysimdjson project brings those benefits to Python via bindings, with prebuilt binary wheels for a lot of platforms. Additionally, it provides JSON pointers via at(), and proxies for objects and lists that reduce the creation of Python objects. We've been hitting these issues at various points, e.g. in https://github.com/elastic/rally/pull/941 and https://github.com/elastic/rally/pull/935 (especially after moving to an async-io based load generator).
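For illustration, here is a minimal sketch of that lazy-access pattern, assuming a recent pysimdjson API (method names have shifted across versions, e.g. older releases exposed at() where newer ones use at_pointer(); the sample document below is made up):

```python
import simdjson

# Reusing one Parser avoids re-allocating simdjson's internal buffers
# for every document.
parser = simdjson.Parser()

raw = b'{"took": 3, "errors": false, "items": [{"index": {"status": 201}}]}'

# parse() returns a lazy proxy over the parsed document rather than
# eagerly building Python dicts and lists for every value.
doc = parser.parse(raw)

# A JSON pointer extracts a single value without materialising the rest
# of the document as Python objects.
took = doc.at_pointer("/took")

# Only convert to a plain dict when the full structure is actually needed.
full = doc.as_dict()
```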

Given the benchmark results, this could be a very useful library. Both projects use the Apache 2.0 license.

TkTech commented 4 years ago

I'm watching this issue - if you find any missing functionality or issues in pysimdjson that would block this, let me know and they'll be resolved.

pquentin commented 2 years ago

The three main contenders for parsing JSON are the standard library json module, orjson, and pysimdjson.

Ease of use

Nothing beats the standard library here, but orjson and pysimdjson both provide wheels, so no compilation is needed in practice. orjson is more popular (3.4k stars vs. 0.5k for pysimdjson). orjson is also more actively maintained (which makes sense as pysimdjson is only a wrapper). But orjson had Python 3.10 wheels before pysimdjson. Neither currently has Python 3.11 wheels. Small note: orjson only serializes to/deserializes from bytes, which makes sense but is more restrictive than the standard library.
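As a small illustration of that last point, a comparison using only the public json and orjson APIs (the payload is a toy example):

```python
import json
import orjson

payload = {"name": "rally", "docs": 1000}

# The standard library serialises to str and round-trips str.
text = json.dumps(payload)            # -> str
assert json.loads(text) == payload

# orjson serialises to bytes; callers that need str must decode explicitly.
blob = orjson.dumps(payload)          # -> bytes
assert isinstance(blob, bytes)
assert orjson.loads(blob) == payload  # loads() accepts bytes directly
```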

Speed

pquentin commented 2 years ago

A good test bed for pysimdjson support for extracting specific keys is this parse() function that currently uses ijson and is crucial to avoid client-side bottlenecks: https://github.com/elastic/rally/blob/master/esrally/driver/runner.py#L736-L792
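For readers unfamiliar with ijson, streaming extraction of a few keys looks roughly like the sketch below; this is an illustrative stand-in with a made-up response body, not the actual Rally parse() implementation linked above:

```python
import io

import ijson

# A trimmed-down stand-in for an Elasticsearch response body.
body = io.BytesIO(b'{"took": 5, "errors": false, "items": [{"index": {"status": 201}}]}')

took = None
errors = None

# ijson.parse() yields (prefix, event, value) tuples as it streams through
# the document, so we can stop as soon as the keys we care about are seen,
# without building the whole document as Python objects.
for prefix, event, value in ijson.parse(body):
    if prefix == "took":
        took = value
    elif prefix == "errors":
        errors = value
    if took is not None and errors is not None:
        break

print(took, errors)  # 5 False
```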

TkTech commented 2 years ago

> Neither currently has Python 3.11 wheels.

Keep in mind 3.11 is not out yet, and you should never push wheels for beta tags to PyPI, as the ABI is not yet stable. When 3.11 is released and cibuildwheel is updated, pysimdjson (and orjson) will push 3.11 wheels.

berglh commented 2 years ago

While I don't have anything super useful to add here in terms of replacements, I would just like to throw my anecdotal hat into this ring with respect to the elastic/logs track, which I was trying to run against our new NVMe-backed hot data tier on on-prem hardware within an ECE cluster. Scaling from targeting 1 shard to 2 shards and beyond didn't improve the overall indexing throughput. I specifically increased the corpus size to around 60 days of data to ensure I had plenty of events to index. My goal was to understand the behaviour of the new cluster with respect to hot spotting and shard and replica counts. Unfortunately, Elastic Rally initially gave me the wrong idea.

It wasn't until I ran multiple copies of Elastic Rally with identical settings concurrently from the same host that I was able to start approaching any of the hardware limits in the cluster. In the end, I had to run 12x Elastic Rally instances on the elastic/logs track to bottleneck the CPU on the hot data tier. I executed all 12 instances from a single server (backed by NVMe, 128 GB of RAM, 32c/64t, 10 Gb network). This raised the actual indexing rate from 60-70,000 docs/s to 550-600,000 docs/s. The reality was that the server sending the logs wasn't the limiting factor, nor were the hot data tier nodes; the bottleneck was Elastic Rally's ability to provide documents fast enough to index.

My suspicion was that, similar to Go's standard library encoding/json, JSON handling is not particularly optimised in Python. This issue seems to validate that theory; I just wanted to provide a real-world example of where Elastic Rally's own performance produces results that could easily be misconstrued by naive users such as myself.

pquentin commented 2 years ago

@berglh Thanks for the report! It's true that you should always check that the client is not the bottleneck. Until we fix https://github.com/elastic/rally/issues/1399, would you mind running https://github.com/benfred/py-spy on one of the Rally processes? It will tell us exactly what is slow.

berglh commented 2 years ago

@pquentin I'm not sure if you were after the flame graph specifically or a different format; I can run it again with another output if required. I went ahead and cleared our cluster password from the SVG. I didn't see anything specifically JSON-related in the hotspots, but there's a lot going on, as I captured the parent and subprocesses of the elastic/logs track. esrally_profile (attached SVG) Edit: ~~Looks like GitHub munged the SVG :?~~

pquentin commented 2 years ago

I opened https://github.com/elastic/rally/issues/1566 so that this issue stays focused on pysimdjson.