StackStorm / community

Async conversation about ideas, planning, roadmap, issues, RFCs, etc around StackStorm
https://stackstorm.com/
Apache License 2.0

[RFC] Switching from json + ujson to orjson JSON serialization and de-serialization library #65

Closed. Kami closed this issue 3 years ago.

Kami commented 3 years ago

Right now the StackStorm code uses the native json library for JSON serialization / deserialization and ujson for the fast_deepcopy function.

Now that we don't support Python 2 anymore, we should evaluate using orjson - https://github.com/ijl/orjson. Based on micro benchmarks in another project, it offers substantial improvements, even over ujson.

In the past there were some compatibility issues with the native json library, but I believe most of them have been resolved.

Having said that, we should still do research specifically for this project (update the code, verify everything works + micro benchmarks).
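
As a starting point, a minimal micro-benchmark sketch along these lines could work (the payload below is made up and not representative of real StackStorm execution results; assumes `pip install ujson orjson`):

```python
# Hypothetical micro-benchmark sketch comparing json, ujson and orjson dumps/loads.
import json
import timeit

import orjson
import ujson

payload = {"result": [{"id": i, "stdout": "x" * 512, "succeeded": True} for i in range(1000)]}

def bench(name, fn, number=200):
    duration = timeit.timeit(fn, number=number)
    print(f"{name}: {duration:.3f}s for {number} iterations")

bench("json.dumps  ", lambda: json.dumps(payload))
bench("ujson.dumps ", lambda: ujson.dumps(payload))
bench("orjson.dumps", lambda: orjson.dumps(payload))  # NB: orjson returns bytes, not str

encoded = json.dumps(payload)
encoded_bytes = orjson.dumps(payload)

bench("json.loads  ", lambda: json.loads(encoded))
bench("ujson.loads ", lambda: ujson.loads(encoded))
bench("orjson.loads", lambda: orjson.loads(encoded_bytes))
```

An orjson-based fast_deepcopy would presumably just be `orjson.loads(orjson.dumps(value))`, with the caveat that orjson.dumps returns bytes rather than str.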

TODO

Kami commented 3 years ago

While quickly looking at it and trying to get tests to pass with orjson, I noticed some issues with unnecessary (and, in the case of large requests, slower) bytes -> unicode -> bytes conversions.

Some of those were likely needed when we still had to support both Python 2 and Python 3, but now that we only support Python 3, we should be able to get rid of some of them and work directly with bytes / unicode (depending on what the specific code expects).

This change also exposed more places where we don't correctly encode / decode incoming and outgoing data, so the request / response contains strings with a b'' prefix.
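
For illustration, a minimal sketch of that pitfall (assuming orjson; the snippet is hypothetical and not actual st2 code):

```python
import orjson

body = {"status": "succeeded", "stdout": "hello"}

# orjson.dumps() already returns bytes, which is what a raw response body wants.
raw = orjson.dumps(body)          # b'{"status":"succeeded","stdout":"hello"}'

# Pitfall: string-formatting the bytes smuggles the b'' prefix into the payload.
broken = "%s" % raw               # 'b\'{"status":"succeeded","stdout":"hello"}\'' -- wrong

# If a str really is needed (e.g. for logging), decode explicitly instead.
text = raw.decode("utf-8")

# Round-tripping through str and back to bytes is the redundant conversion to avoid:
redundant = text.encode("utf-8")  # bytes -> str -> bytes; just use `raw` directly
```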

On a related note - does anyone happen to have some executions with a large result field lying around (maybe some from a ci/cd server)? I could use them in the micro benchmarks.

arm4b commented 3 years ago

Awesome stuff, +100! Python 3 should definitely open some new doors for us with perf optimizations. Thanks @Kami for the research and new energy here :+1:

You can find a few building blocks here as a starting example: https://github.com/StackStorm/st2/issues/4798. Someone did a good job preparing it for us.

Kami commented 3 years ago

@armab That issue is actually a different one - it's related to bad mongoengine performance when working with large datasets.

Using a different json library would likely still speed up that action since the operation involves parsing json, but it wouldn't help with the mongoengine related issue (aka storing the result in the database).

For the mongoengine related issue, work would need to continue on those PRs - https://github.com/StackStorm/st2/pull/4838, https://github.com/StackStorm/st2/pull/4846.

Kami commented 3 years ago

Re mongoengine performance issues with large documents - one option we should also consider experimenting and benchmarking with is compression (this would likely require some changes to store the result as a string and handle that transparently inside the to_mongo and from_mongo functions).

Compression is not free (CPU cycles), but executions with large results usually contain textual data which compresses well, so it's possible that trading some CPU cycles for compression would still result in overall better throughput and a shorter duration for execution related DB operations, because we would be passing fewer bytes to mongoengine later.

We should start with benchmarking (perhaps using the zstandard algorithm) and then decide if it's worth pursuing those efforts.
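
As a rough sketch of what that benchmarking could look like (the result payload is made up, and this is not the actual to_mongo / from_mongo integration; assumes `pip install orjson zstandard`):

```python
# Rough sketch of the compression trade-off: serialize a (made-up) large result,
# compress it with zstandard, and compare sizes and round-trip time.
import time

import orjson
import zstandard as zstd

result = {"tasks": [{"name": f"task_{i}", "stdout": "some log line\n" * 200} for i in range(500)]}

serialized = orjson.dumps(result)

compressor = zstd.ZstdCompressor(level=3)
decompressor = zstd.ZstdDecompressor()

start = time.perf_counter()
compressed = compressor.compress(serialized)
restored = orjson.loads(decompressor.decompress(compressed))
elapsed = time.perf_counter() - start

print(f"raw: {len(serialized)} bytes, compressed: {len(compressed)} bytes")
print(f"compress + decompress round trip: {elapsed * 1000:.2f} ms")
assert restored == result
```

Whether handing the smaller compressed blob to mongoengine actually wins overall is exactly the part that would need to be measured against the extra CPU cost.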

arm4b commented 3 years ago

@Kami I just meant there was an example in that issue that may help with composing a workflow that would pass some larger data along.

arm4b commented 3 years ago

Implemented in https://github.com/StackStorm/st2/pull/4846/