aptos-labs / aptos-core


[Rest API] Benchmark and improve performance #7082

Closed: JoshLind closed this issue 1 year ago

JoshLind commented 1 year ago

Summary

It would be good for us to better understand the performance of the REST API, so that we can improve throughput (i.e., QPS) and reduce latencies. I suspect we'll want to do this on a continuous basis (e.g., in CI/CD) as we develop and extend the API.

Preliminary insights

For reference, I spent a little time playing around with a hacky benchmark to see if there was any low-hanging fruit and/or quick performance optimizations:

  1. Benchmark script: I wrote a simple script that uses the rewrk tool to benchmark the API. Note: I only benchmarked some GET endpoints, so we'll probably want to extend it to cover all endpoints and operations (e.g., POST for transaction submission and simulation). Here is a quick summary of the benchmark script: https://gist.github.com/JoshLind/2c0760130333ee978fa9e5267010aad8
  2. Test setup and baseline results: I set up two machines: one for the REST API server (running a devnet fullnode) and another for the client (running the script above). Both machines were e2-standard-16 instances with 16 cores, 64 GB RAM, and a 1 TB SSD. In all experiments, the server was completely saturated in terms of CPU usage, while the client ran at low utilization. The baseline benchmark results can be seen here: https://gist.github.com/JoshLind/ee9991f51b7f954717585db85b7ee711.
  3. Tuning the runtime workers: The first thing I noticed was that the poem API server uses hyper under the hood and spawns a new tokio task to handle each request. The problem is that almost all of our REST API endpoints perform blocking, CPU/IO-heavy workloads, which means each request will likely not yield its worker thread when it should, leading to inefficiencies. To improve this, I made the number of runtime workers configurable, and found that setting it to double the number of machine cores produces more optimal results. On average, this: (i) increased throughput (QPS) by 1.5x-2x; and (ii) reduced latencies by 1.5x-2x. The results can be seen here: https://gist.github.com/JoshLind/6f6a76aeea7cab6148c9c78aad2bcf48. (A sketch of this runtime configuration appears after this list.)
  4. Storage caching: The next thing I noticed was that requests that fetch a lot of data from storage don't improve much with the tweak above (which makes sense). Moreover, we're hitting storage for every single request, even if storage hasn't changed since the last request. This is sub-optimal: we should be caching our storage reads. So, I hacked up a very simple cache that stores a number of storage results in memory (to avoid hitting storage on identical requests). On average, this: (i) increased throughput (QPS) by another 2x-3x; and (ii) reduced latencies by another 1.5x-2x. The results can be seen here: https://gist.github.com/JoshLind/02903b03d296512af34a74f601caed11. (A sketch of a cache like this, including the periodic clearing described in the notes below, appears after the notes.)
  5. API caching: The next thing I wondered was: if each request is trivially satisfied (i.e., we just serve static content), how many QPS can we achieve on the machine (i.e., what is the upper bound)? I modified the index response (i.e., /v1/) to also perform caching and simply return a cached response (avoiding serialization/deserialization, etc.). This enabled us to achieve around 165k QPS, with latencies between 8ms and 45ms. The results can be seen here: https://gist.github.com/JoshLind/f4f823bd9ebe1679ecf9e834560bd966. Overall, response caching seems worth exploring further, given that we are still not close to hitting this upper bound for each request. (A sketch of this response-caching idea appears after this list.)
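
As a concrete reference for the runtime tuning in (3), here's a minimal sketch of how the worker count could be made configurable via tokio's runtime builder. This is illustrative only; `build_api_runtime` and the `worker_multiplier` knob are hypothetical names, not the actual aptos-core configuration:

```rust
use std::thread;
use tokio::runtime::{Builder, Runtime};

/// Builds the API runtime with a configurable number of worker threads.
/// `worker_multiplier` is a hypothetical knob: 1 matches tokio's default
/// (one worker per core), while 2 doubles it, which is what performed
/// best in the experiments above.
fn build_api_runtime(worker_multiplier: usize) -> std::io::Result<Runtime> {
    let num_cores = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    Builder::new_multi_thread()
        .worker_threads(num_cores * worker_multiplier)
        .thread_name("api-runtime")
        .enable_all()
        .build()
}
```

An alternative worth measuring is wrapping the blocking storage reads in tokio::task::spawn_blocking, which moves them onto tokio's dedicated blocking pool instead of oversubscribing the async workers.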
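
Similarly, for the response caching in (5), here's a minimal sketch of the serialize-once idea, assuming the endpoint's output is identical across requests until the cache is cleared (all names are illustrative):

```rust
use std::sync::{Arc, RwLock};

/// Minimal sketch of response caching for a static endpoint (e.g., /v1/):
/// serialize the response once, then serve the cached bytes on every
/// subsequent hit, skipping serialization/deserialization entirely.
#[derive(Default)]
struct CachedResponse {
    bytes: RwLock<Option<Arc<[u8]>>>,
}

impl CachedResponse {
    /// Returns the cached bytes, building and caching them on first use.
    /// Two concurrent first requests may both build; that's harmless here,
    /// since both produce the same bytes.
    fn get_or_build(&self, build: impl FnOnce() -> Vec<u8>) -> Arc<[u8]> {
        if let Some(bytes) = self.bytes.read().unwrap().as_ref() {
            return Arc::clone(bytes);
        }
        let built: Arc<[u8]> = build().into();
        *self.bytes.write().unwrap() = Some(Arc::clone(&built));
        built
    }
}
```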

Notes/Caveats:

  1. We should further improve the benchmark tooling to test the different types of requests and endpoints that we serve.
  2. We should also analyze the existing API access patterns in our networks to better understand the real-world workloads of our APIs, so that we can make our benchmarks realistic. For example, the wrk tool supports Lua scripting for exactly this kind of custom workload.
  3. I'd be very interested to see how the storage caching approach performs in the real world:
    • On one hand, the results above are likely an underestimate of the performance gains, because devnet is a small network, which means storage calls are very cheap. In a network like testnet, storage calls are much more expensive, so I'd imagine storage caching would have a greater impact.
    • On the other hand, the cache hit rate will have strong implications for performance, so we'd want to understand real-world access patterns to optimize for this. The experiments above assume a near-perfect hit rate.
  4. I'd also very much like to further explore the API caching approach. It seems that by skipping serialization/deserialization, we get some impressive numbers. But I suspect this would take more work because it requires handling different edge cases, e.g., each request can have different parameters, so we'd need to cache at the granularity of each call. This is definitely doable, just more involved.
  5. Regarding cache freshness, both the storage and API caches can operate in a very simple manner, i.e., cache requests/responses and clear the cache periodically (e.g., every 100ms). This works because data on the fullnode (e.g., storage) is very unlikely to change more frequently than that. (A minimal sketch of this clearing strategy follows below.)
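
To make the freshness strategy in (5) concrete, here's a minimal sketch of a cache that drops all of its entries once a fixed clearing interval (e.g., 100ms) has elapsed; the same structure could back both the storage cache in (4) above and a per-call API cache. All names are hypothetical:

```rust
use std::collections::HashMap;
use std::hash::Hash;
use std::sync::Mutex;
use std::time::{Duration, Instant};

/// Minimal sketch of the clear-periodically strategy: entries live only
/// until the next clearing interval, so staleness is bounded by that
/// interval. Illustrative only, not the actual aptos-core implementation.
struct TimedCache<K, V> {
    inner: Mutex<(HashMap<K, V>, Instant)>, // (entries, time of last clear)
    clear_interval: Duration,
}

impl<K: Eq + Hash, V: Clone> TimedCache<K, V> {
    fn new(clear_interval: Duration) -> Self {
        Self {
            inner: Mutex::new((HashMap::new(), Instant::now())),
            clear_interval,
        }
    }

    /// Returns the cached value for `key`, computing and caching it on a
    /// miss. The entire map is dropped once the interval has elapsed, so
    /// no entry can outlive one clearing period.
    fn get_or_insert_with(&self, key: K, compute: impl FnOnce() -> V) -> V {
        let mut guard = self.inner.lock().unwrap();
        let (entries, last_cleared) = &mut *guard;
        if last_cleared.elapsed() >= self.clear_interval {
            entries.clear();
            *last_cleared = Instant::now();
        }
        entries.entry(key).or_insert_with(compute).clone()
    }
}
```

Note that a single global Mutex would itself become a contention point at the QPS levels above; sharding the map (or keeping one cache per endpoint) would be the natural next step.
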
github-actions[bot] commented 1 year ago

This issue is stale because it has been open 45 days with no activity. Remove the stale label or comment - otherwise this will be closed in 15 days.