Performance of JSON manipulation

Geal commented 2 years ago

A large part of the router's work is done in deserializing, filtering, merging and serializing JSON data, so any work done in improving its performance will have a large impact. As is often the case with perf work, it boils down to:

can we avoid doing a task
can we avoid allocating

I propose that we record here the perf issues we encounter, as low hanging fruit if someone has time to look into it:

[x] #144 fixed by https://github.com/apollographql/router/pull/172
[x] we always call serde_json::to_string_pretty on the response even if trace level log is deactivated (I've seen that take 7% of CPU time on large responses with the fix from #172): https://github.com/apollographql/router/blob/main/crates/apollo-router-core/src/federated.rs#L256-L258 - fixed in #206
[x] deep_merge takes a reference to a json Value then calls to_owned on it, could we instead pass an owned value, and let the caller decide if it wants to clone it? https://github.com/apollographql/router/blob/main/crates/apollo-router-core/src/json_ext.rs#L126-L132 (3.75% of CPU time with #172) testing in #132
[x] avoid creating an in-memory representation for objects and arrays that we will not modify: we do not need to look at a large part of the JSON value, so if we know (from the schema) an object will not be modified, we could keep a slice that corresponds to the object in the buffer (as done with #284)
[x] instead of allocating to a string then wrapping in a Bytes, maybe we could serialize directly to Bytes or a Write instance: https://github.com/apollographql/router/blob/main/crates/apollo-router/src/warp_http_server_factory.rs#L290-L304
[x] https://github.com/apollographql/router/issues/61
[x] avoid copying data coming from a subgraph in one big buffer that must reallocate: https://github.com/apollographql/router/blob/a961a201cadf59e87646718b61043f0d8d1e40a1/crates/apollo-router/src/http_subgraph.rs#L78-L90 we could instead have a kind of BufList type (cf linkerd's article on http request retries) that implements Read and we'd use serde_json::from_reader on it
[ ] streaming serialization: we serialize the entire response before sending it. For big responses, that means multiple reallocations and copies of the underlying buffer to get to the required size. We could instead serialize to small fixed-size buffers that we then send as a stream wrapped in hyper::Body. The default case could use a single buffer big enough for most responses, which would allocate only once. testing in #302
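
To make the BufList item above more concrete, here is a minimal std-only sketch of a chunk list that implements Read. The `BufList` name comes from the item itself, but the fields, the `push` method and the use of `Vec<u8>` chunks are assumptions for illustration (the router would more likely hold `Bytes` chunks coming from hyper). Any `Read` implementation can be passed to serde_json::from_reader, so the chunks never have to be copied into one contiguous buffer first.

```rust
use std::collections::VecDeque;
use std::io::{self, Read};

/// Hypothetical chunk list: keeps body chunks as they arrive instead of
/// appending them to one large buffer that has to reallocate as it grows.
#[derive(Default)]
struct BufList {
    chunks: VecDeque<Vec<u8>>,
    /// Read offset inside the front chunk.
    pos: usize,
}

impl BufList {
    /// Stores a chunk without copying its contents.
    fn push(&mut self, chunk: Vec<u8>) {
        if !chunk.is_empty() {
            self.chunks.push_back(chunk);
        }
    }
}

impl Read for BufList {
    fn read(&mut self, out: &mut [u8]) -> io::Result<usize> {
        let mut written = 0;
        while written < out.len() {
            let front = match self.chunks.front() {
                Some(front) => front,
                None => break, // no more data: report what was copied so far
            };
            let available = &front[self.pos..];
            let n = available.len().min(out.len() - written);
            out[written..written + n].copy_from_slice(&available[..n]);
            written += n;
            self.pos += n;
            // Drop the front chunk once it has been fully consumed.
            if self.pos == front.len() {
                self.chunks.pop_front();
                self.pos = 0;
            }
        }
        Ok(written)
    }
}
```

With this shape, the subgraph client would call `push` for each chunk received and then hand the list to the deserializer. The trade-off is that reading through `Read` goes byte range by byte range, so it should be measured against the single-buffer path before adopting it.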

Geal commented 2 years ago

zero copy deserialization

Another thing we should explore in the future is zero copy deserialization. Assuming we have to store the entire GraphQL response from a subgraph in memory (which is the common case, since most fields will be returned to the clients), instead of parsing it entirely to an owned JSON struct, with allocations for everything (hashmaps, strings, etc), we could parse it to a structure that references slices of the input.

This greatly improves string handling: currently when a field is a string, we parse it, then unescape it to a String instance, which will then be reserialized. We could instead parse it, keep a reference to the slice, then write the slice to the output stream directly. This is doable right now with a Cow<'a, str> field and the #[serde(borrow)] attribute, but it still allocates if some characters are escaped. For fields that we just need to transmit we could even avoid unescaping.
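
The borrow-or-allocate behaviour described above can be illustrated without serde. The following is a minimal sketch, not the serde implementation: `unescape` is a hypothetical helper that only handles a few simple escapes (no `\uXXXX`), and it takes the contents of a JSON string literal without the surrounding quotes. It returns a borrowed slice when nothing is escaped and only allocates when it has to rewrite the string.

```rust
use std::borrow::Cow;

/// Hypothetical helper: decodes the contents of a JSON string literal.
/// Returns None on an escape sequence this sketch does not support.
fn unescape(raw: &str) -> Option<Cow<'_, str>> {
    // Fast path: no backslash means the input bytes are already the value,
    // so we can hand back a reference into the original buffer.
    if !raw.contains('\\') {
        return Some(Cow::Borrowed(raw));
    }

    // Slow path: at least one escape, so a new String has to be built.
    let mut out = String::with_capacity(raw.len());
    let mut chars = raw.chars();
    while let Some(c) = chars.next() {
        if c != '\\' {
            out.push(c);
            continue;
        }
        out.push(match chars.next()? {
            '"' => '"',
            '\\' => '\\',
            '/' => '/',
            'n' => '\n',
            't' => '\t',
            'r' => '\r',
            _ => return None, // unsupported escape in this sketch
        });
    }
    Some(Cow::Owned(out))
}
```

For a field that is only forwarded to the client, even the slow path is unnecessary: the raw slice, escapes included, is already valid JSON string content and could be written to the output as is.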

references:

Geal commented 2 years ago

Investigate simd-json

Using SIMD can get us faster deserialization, and simd-json is compatible with serde: https://github.com/simd-lite/simd-json

Geal commented 2 years ago

I now have a good idea of the way to implement the zero copy deserialization, and I think we should leverage the Bytes struct instead of raw slices. We can have the Bytes hold the entire subgraph response, and aggregate data from there in the client response, without caring much about lifetimes (Bytes instances are refcounted). And they will provide a nice way to plug into caching: we can store in the cache an object that has been reserialized to a very small buffer in memory (so we do not keep large subgraph response buffers for a long time) or in an external service (redis, memcached) that we can query, and will return a Bytes that we can use in the same way.
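
As a rough illustration of why refcounted buffers remove the lifetime concerns, here is a std-only sketch. `SharedSlice` is a hypothetical stand-in for Bytes, built on `Arc<[u8]>` plus a byte range; the real Bytes type offers the same kind of cheap clone and sub-slicing, so this is only meant to show the ownership model, not the actual design.

```rust
use std::ops::Range;
use std::sync::Arc;

/// Hypothetical stand-in for `Bytes`: a refcounted buffer plus a byte range.
#[derive(Clone)]
struct SharedSlice {
    buf: Arc<[u8]>,
    range: Range<usize>,
}

impl SharedSlice {
    /// Takes ownership of a whole subgraph response.
    /// (Converting a Vec into an Arc<[u8]> copies the bytes once.)
    fn new(data: Vec<u8>) -> Self {
        let range = 0..data.len();
        SharedSlice { buf: data.into(), range }
    }

    /// Sub-slice relative to this slice: bumps the refcount, copies no bytes.
    fn slice(&self, range: Range<usize>) -> Self {
        assert!(range.start <= range.end && range.end <= self.range.len());
        let start = self.range.start + range.start;
        SharedSlice {
            buf: Arc::clone(&self.buf),
            range: start..start + range.len(),
        }
    }

    fn as_bytes(&self) -> &[u8] {
        &self.buf[self.range.clone()]
    }
}

/// Assembles a client response from slices of subgraph responses.
/// The only copy is the final write into `out`.
fn assemble(parts: &[SharedSlice], out: &mut Vec<u8>) {
    for part in parts {
        out.extend_from_slice(part.as_bytes());
    }
}
```

A slice stays valid after the handle it was cut from is dropped, because every slice keeps the underlying buffer alive. That is also the reason for reserializing before caching, as described above: a tiny cached slice would otherwise pin the whole subgraph response in memory.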

apollographql / router

Performance of JSON manipulation #173
