
Replace existing json usages with `orjson` #68903

Open · anonrig opened 7 months ago

anonrig commented 7 months ago

We currently use rapidjson and the default json library to parse/stringify JSON. The alternative library orjson is 6 times faster than json and 2-3 times faster than rapidjson.

We (me and @fpacifici) validated that simply replacing rapidjson with orjson results in 25% improvement (from 12ms to 9ms) on a particular task.

For example, here's a graph visualizing the difference between orjson and rapidjson:

[Screenshot (2024-04-15): graph comparing orjson and rapidjson]
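For anyone who wants to reproduce the rough ratios locally, here is a minimal micro-benchmark sketch. The payload is a made-up, event-shaped dict rather than a real Sentry event, so the absolute numbers will differ, but the ordering should hold:

```python
# Minimal micro-benchmark sketch; the payload below is invented for illustration,
# not an actual Sentry event, and absolute timings depend heavily on payload shape.
import json
import timeit

import orjson
import rapidjson  # provided by the python-rapidjson package

payload = {
    "event_id": "a" * 32,
    "tags": [["browser", "Chrome"], ["level", "error"]] * 50,
    "extra": {f"key_{i}": i * 1.5 for i in range(200)},
}
encoded = json.dumps(payload)


def bench(name: str, fn, number: int = 10_000) -> None:
    elapsed = timeit.timeit(fn, number=number)
    print(f"{name:>10}: {elapsed:.3f}s for {number} iterations")


# Serialization (note that orjson returns bytes, not str)
bench("json", lambda: json.dumps(payload))
bench("rapidjson", lambda: rapidjson.dumps(payload))
bench("orjson", lambda: orjson.dumps(payload))

# Deserialization
bench("json", lambda: json.loads(encoded))
bench("rapidjson", lambda: rapidjson.loads(encoded))
bench("orjson", lambda: orjson.loads(encoded))
```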

The drill

The drill is always this:

  1. Pick a consumer, a component, or a class
  2. Change the JSON implementation, guarded by a feature flag (see the sketch after this list)
  3. Test with the feature flag enabled
  4. Roll out, i.e. use orjson without the feature flag
  5. Clean up by removing the feature flag
  6. Done
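A minimal sketch of what step 2 could look like; the flag argument and helper names below are illustrative placeholders, not the actual Sentry feature flag or utility module used in the PRs:

```python
# Illustrative sketch only: `use_orjson` stands in for whatever feature-flag check
# the real rollout uses; the helper names are hypothetical.
import json
from typing import Any, Union

import orjson


def dumps_with_flag(value: Any, use_orjson: bool) -> str:
    """Serialize `value`, switching implementations behind a feature flag."""
    if use_orjson:
        # orjson returns bytes; decode to preserve the existing str contract
        # until callers have been migrated to accept bytes.
        return orjson.dumps(value).decode("utf-8")
    return json.dumps(value)


def loads_with_flag(data: Union[bytes, str], use_orjson: bool) -> Any:
    """Deserialize `data` with either implementation behind the same flag."""
    if use_orjson:
        return orjson.loads(data)
    return json.loads(data)
```

Once the rollout reaches step 4, the flag parameter and the stdlib branch can be deleted, which is exactly the cleanup in step 5.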

Todos

Task list

| Component | Status | Author | Pull Request | Notes |
| --- | --- | --- | --- | --- |
| Relay | 4 | @anonrig | https://github.com/getsentry/sentry/pull/69613 | - |
| Indexer consumer | 6 | @anonrig | https://github.com/getsentry/sentry/pull/68802 | 20-25% overall reduction |
| Ingest processor | 6 | @anonrig | https://github.com/getsentry/sentry/pull/68811 | - |
| Snuba | 6 | @anonrig | https://github.com/getsentry/sentry/pull/68938 | - |
| Slack integration | 4 | @loewenheim | https://github.com/getsentry/sentry/pull/69409 | - |
| reconstruct messages | 4 | @ayirr7 | https://github.com/getsentry/sentry/pull/69378 | 50% reduction in JSON serialization and 25% reduction on average across all indexer payloads |
| outcomes_consumer | 2 | @vbro | https://github.com/getsentry/getsentry/pull/13779 | - |
| event_manager | 4 | @loewenheim | https://github.com/getsentry/sentry/pull/69653 | - |
| eventstore | 4 | @loewenheim | https://github.com/getsentry/sentry/pull/69653 | - |
| auth | 4 | @loewenheim | https://github.com/getsentry/sentry/pull/69653 | - |
| integrations | 4 | @loewenheim | https://github.com/getsentry/sentry/pull/69653 | - |
| backup | 4 | @loewenheim | https://github.com/getsentry/sentry/pull/69653 | - |

JoshFerge commented 7 months ago

Is JSON parsing a significant cause of latency in our consumers or endpoints? Also, I would check Sentry as you roll out; with Replays we noticed a bunch of errors associated with switching JSON implementations.

anonrig commented 7 months ago

> Is JSON parsing a significant cause of latency in our consumers or endpoints? Also, I would check Sentry as you roll out; with Replays we noticed a bunch of errors associated with switching JSON implementations.

@JoshFerge JSON parsing is a bottleneck in a couple of places, and I think it has more impact than we assume. Here's an example: https://sentry-st.sentry.io/performance/summary/?display=trend&end=2024-04-16T03%3A59%3A59&project=1513938&query=&referrer=performance-transaction-summary&start=2024-04-10T04%3A00%3A00&transaction=ingest_consumer.process_event&trendFunction=p95&unselectedSeries=p100%28%29%2Cavg%28%29

The only change behind this improvement was the switch from rapidjson to orjson.

[Screenshot (2024-04-15): p95 trend for ingest_consumer.process_event]
ayirr7 commented 6 months ago

Here's an example of how this improved the time spent in one of the processing phases of the indexer:

[Screenshot (2024-04-22): indexer processing-phase timing before and after the switch]

This is the phase where JSON deserialization is done.

loewenheim commented 6 months ago

One difference between json and orjson is that orjson dumps to bytes, not str. Decoding that back to a str might cost us some performance, let's see how it shakes out.

anonrig commented 6 months ago

> One difference between json and orjson is that orjson dumps to bytes, not str. Decoding that back to a str might cost us some performance, let's see how it shakes out.

@loewenheim Some functions, such as the Snuba ones, accept both bytes and str as parameters, so we might not need to encode/decode in some places.
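For illustration, a small sketch of the bytes/str difference under discussion (standard library json vs. orjson):

```python
import json

import orjson

payload = {"message": "hello", "level": "error"}

as_str = json.dumps(payload)      # str
as_bytes = orjson.dumps(payload)  # bytes

assert isinstance(as_str, str)
assert isinstance(as_bytes, bytes)

# If a caller still requires str, an extra decode is needed,
# which is the potential cost mentioned above.
decoded = as_bytes.decode("utf-8")

# Both loaders accept bytes and str, so consumers that already hold
# bytes (e.g. a Kafka message value) can skip the decode entirely.
assert orjson.loads(as_bytes) == json.loads(decoded) == payload
```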

TkTech commented 6 months ago

In all of the cases where this change is being made, is the parsed JSON fully utilized? If there are cases where the JSON is only parsed to extract, say, a routing key, that can typically be sped up by another order of magnitude.

There are quite a few differences between how rapidjson and orjson parse JSON. This doesn't mean either is wrong, just that they might differ and introduce behaviour changes. A good place to start is to compare the two.

Specifically, if you depend on very accurate float or integer parsing/serialization there will be quirks. For example, with rapidjson you would have been ingesting NaN and +/-Infinity as valid numbers, but with orjson you no longer do. Although yyjson is now the basis for the parser in orjson, orjson doesn't use its support for arbitrary integer parsing, so the maximum size of an integer will also differ from Python's json.
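Some of these differences are easy to check directly. The snippet below reflects the behaviour of recent orjson releases as I understand it, and is worth re-running against the pinned version before relying on it:

```python
# Sketch of behavioural differences; verify against the orjson version actually pinned.
import json
import math

import orjson

# NaN / Infinity: the stdlib accepts (and emits) them by default, orjson does not.
assert math.isnan(json.loads("NaN"))
assert json.dumps(float("nan")) == "NaN"
assert orjson.dumps(float("nan")) == b"null"  # serialized as null instead
try:
    orjson.loads("NaN")
except orjson.JSONDecodeError:
    print("orjson rejects NaN on input")

# Integers: the stdlib round-trips arbitrary precision, while orjson.dumps
# raises for integers outside the 64-bit range (loads is similarly limited).
big = 2**70
assert json.loads(json.dumps(big)) == big
try:
    orjson.dumps(big)
except TypeError:
    print("orjson refuses integers outside the 64-bit range")
```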

Narretz commented 6 months ago

Why is the title of this issue "replace orjson with existing json usages"? It's the other way around, isn't it?

In the description it's fine: "... simply replacing rapidjson with orjson results ..."

loewenheim commented 6 months ago

Changed the title.

anonrig commented 6 months ago

The max duration of reconstruct_messages.build_new_payload.json_step has been impacted by these changes as well.

[Screenshot (2024-05-07): max duration of reconstruct_messages.build_new_payload.json_step]

Quoting @ayirr7:

> This gave us a 50% reduction in time spent, on average, on JSON serialization in the indexer and about a 25% reduction in the time spent, on average, by the indexer in building the new payload it sends along to Snuba. Cool!

[Screenshot (2024-05-10): indexer JSON serialization and payload-build timings]
anonrig commented 6 months ago

After switching the middlewares and integrations to orjson, here are the changes:

[Screenshot (2024-05-09): middleware and integration timings after the switch]
simon-liebehenschel commented 3 months ago

@anonrig Could you please show the memory usage plot with/without orjson? I am not familiar with the orjson.dumps() internals, but I have heard from colleagues about some known problems with memory usage or memory leaks (if you want, you can scroll through the orjson issues closed by a Stale Bot). I am not saying that Sentry will definitely use more memory, because it depends on what is being dumped and other things, but if it does, it is better to know about it now than to debug OOMKilled errors on a server.
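Not a substitute for production numbers, but a rough way to compare resident memory is to run the same serialization workload once per library in separate processes and read the process high-water mark. The script and payload below are made up for illustration:

```python
# bench_mem.py -- rough, illustrative memory comparison; run each implementation
# in a fresh process, e.g. `python bench_mem.py json` then `python bench_mem.py orjson`,
# because ru_maxrss is a per-process high-water mark. The payload is invented.
import json
import resource
import sys

import orjson

IMPLS = {"json": json.dumps, "orjson": orjson.dumps}


def main() -> None:
    name = sys.argv[1]
    dumps = IMPLS[name]
    payload = {f"key_{i}": ["x" * 64, i, i * 1.5] for i in range(10_000)}
    # Keep the outputs alive so allocation pressure is comparable between runs.
    outputs = [dumps(payload) for _ in range(50)]
    # Note: ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    print(f"{name}: held {len(outputs)} dumps, peak RSS ~{peak}")


if __name__ == "__main__":
    main()
```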