Open glevco opened 1 month ago
Branch: `perf/sync-improvements`
Testbed: `ubuntu-22.04`
🚨 1 ALERT: Threshold Boundary Limit exceeded!
| Benchmark | Measure (units) | View | Benchmark Result (Result Δ%) | Lower Boundary (Limit %) | Upper Boundary (Limit %) |
| --- | --- | --- | --- | --- | --- |
| sync-v2 (up to 20000 blocks) | Latency (ns) | 📈 plot, 🚨 alert, 🚷 threshold | 48,856,385,056.48 (-50.97%) | 89,674,168,696.37 (183.55%) | 109,601,761,740.01 (44.58%) |
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 84.37%. Comparing base (`7f53032`) to head (`5008eb5`).
:umbrella: View full report in Codecov by Sentry.
Putting this on hold for now.
Motivation
This PR contains two changes that together reduce sync time by 60% for a sync of up to 100,000 blocks. Syncing the whole DAG will probably yield smaller gains, though.
I benchmarked each change independently, generating two charts: one for syncing up to 20,000 blocks and one for 100,000 blocks. Each chart shows the sync time in seconds for different values of the stream limit, for both the current `master` commit (4f80786) and the window-cache change.

The charts clearly show an increase in performance that plateaus at a streaming limit of 10,000 blocks. That change by itself yields a 40% decrease in sync time. Then, adding the sliding block window cache reduces it even more, reaching a 60% decrease with the 10,000 streaming limit for 100,000 synced blocks. Interestingly, the second change seems to affect performance only when combined with the first one.
Each value in the chart was generated by running each experiment twice and taking the mean. No significant variations were observed. The complete table of values can be found below.
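As a quick sanity check on the headline percentages, they can be recomputed from the 100,000-block timings in the data tables (values in seconds; the helper below is mine, not code from this PR):

```python
# Timings from the 100,000-block table (seconds).
master_1000 = 509.678   # current streaming limit (1,000), master
master_10000 = 292.656  # streaming limit raised to 10,000, master
cache_10000 = 206.793   # streaming limit 10,000 + window cache

def decrease_pct(before: float, after: float) -> float:
    """Relative decrease, in percent."""
    return (before - after) / before * 100

print(f"limit only:    {decrease_pct(master_1000, master_10000):.1f}%")  # ~42.6%
print(f"limit + cache: {decrease_pct(master_1000, cache_10000):.1f}%")   # ~59.4%
```

So the "40%" and "60%" figures quoted above are slightly rounded but consistent with the raw data.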
Complete data tables
For 20,000 blocks (sync time in seconds):

Stream limit | Master | Window cache
--- | --- | ---
1000 (current) | 101.682 | 102.02
2000 | 52.408 | 51.641
10000 | 36.588 | 21.739
20000 | 31.609 | 21.427

For 100,000 blocks (sync time in seconds):

Stream limit | Master | Window cache
--- | --- | ---
1000 (current) | 509.678 | 502.582
2000 | 379.551 | 307.11
10000 | 292.656 | 206.793
50000 | 277.269 | 194.31
100000 | 264.143 | 199.613

All output files generated by `hyperfine` can be downloaded here: [benches.zip](https://github.com/user-attachments/files/16820341/benches.zip). All benchmarks ran locally on my Mac mini 2023, with an Apple M2 and 24GB of RAM, on macOS Sonoma 14.3.

Increase streaming limit
The current streaming limit is arbitrarily low, at 1,000 blocks. There's considerable overhead caused by resuming the stream between batches, so increasing the limit significantly speeds up the sync. We didn't observe significant gains past the 10,000 mark, so that's the new value I'm setting.
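To see why larger batches help, here's a toy cost model (all constants are hypothetical, not measured from the project): each batch pays a fixed stream-resume round-trip on top of the per-block cost, so fewer, larger batches amortize the overhead, and the gain flattens once the resume cost becomes negligible:

```python
import math

RESUME_OVERHEAD_S = 0.5  # hypothetical cost of resuming the stream, per batch
PER_BLOCK_S = 0.002      # hypothetical per-block processing cost

def sync_time(n_blocks: int, stream_limit: int) -> float:
    """Total sync time under the toy model: fixed cost per batch + cost per block."""
    batches = math.ceil(n_blocks / stream_limit)
    return batches * RESUME_OVERHEAD_S + n_blocks * PER_BLOCK_S

for limit in (1_000, 2_000, 10_000, 20_000):
    print(limit, round(sync_time(100_000, limit), 1))
# 1000 -> 250.0, 2000 -> 225.0, 10000 -> 205.0, 20000 -> 202.5:
# the improvement shrinks quickly past the 10,000 mark.
```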
Sliding block window cache
When calculating the weight for a block during verification, a window of ~200 ancestor blocks is used. Currently, for each block being verified, we iterate using `get_block_parent`. This adds unnecessary overhead to the calculation, because during sync blocks are received sequentially, so the window of ancestor blocks is mostly the same as the one from the previously verified block, except for its first and last elements.

We can leverage this by creating a cache that is a sliding block window. Whenever we calculate the weight for a block, we check that the cached window is usable by it, and if so we simply pop its head and append to its tail to update the window, instead of re-fetching all ~200 ancestors.
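A minimal sketch of such a cache (names like `SlidingBlockWindow` and `parent_hash` are illustrative, not the project's actual API; real verification code would also need to handle reorgs and non-sequential blocks):

```python
from collections import deque

WINDOW_SIZE = 200  # approximate ancestor-window size used for weight calculation

class SlidingBlockWindow:
    def __init__(self) -> None:
        self._window: deque = deque()
        self._head = None  # the block whose window was built most recently

    def get_window(self, block, get_block_parent):
        """Return the ancestor window for `block`, sliding the cache when possible."""
        if self._head is not None and block.parent_hash == self._head.hash:
            # Sequential case: `block` is a child of the previously verified
            # block, so drop the oldest ancestor and append the previous head.
            if len(self._window) == WINDOW_SIZE:
                self._window.popleft()
            self._window.append(self._head)
        else:
            # Fallback: rebuild the window by walking parents, as before.
            ancestors = []
            parent = get_block_parent(block)
            while parent is not None and len(ancestors) < WINDOW_SIZE:
                ancestors.append(parent)
                parent = get_block_parent(parent)
            self._window = deque(reversed(ancestors))
        self._head = block
        return list(self._window)
```

On the sequential path the update is O(1) instead of ~200 parent lookups, which is where the speedup comes from.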
Here's a tracing graph for the processing of 3 blocks in `master`:

And here's the same graph for this branch:
We can observe that the proportion of time spent in verification, relative to the total time spent processing a new vertex, decreases significantly.
Future improvements
We can further optimize the weight calculation by caching the solvetimes and weights instead of the blocks themselves, as they're the only values derived from the window. I didn't do it in this PR because that change is a bit more convoluted and the remaining overhead is not that significant.
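As a sketch of that future direction (hypothetical names, not code from this PR), the cache would hold only the two derived values per ancestor, so sliding the window moves a pair of floats instead of a block object:

```python
from collections import deque

class DerivedWindowCache:
    """Keeps only the per-ancestor values the weight formula consumes."""

    def __init__(self, size: int = 200) -> None:
        # deques with maxlen drop the oldest entry automatically on append.
        self.solvetimes: deque = deque(maxlen=size)
        self.weights: deque = deque(maxlen=size)

    def slide(self, new_solvetime: float, new_weight: float) -> None:
        self.solvetimes.append(new_solvetime)
        self.weights.append(new_weight)
```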
Acceptance Criteria

- Implement the sliding block window cache so we no longer call `get_parent_block` ~200 times per verification.
- Increase `DEFAULT_STREAMING_LIMIT` from 1,000 to 10,000.

Checklist

- If this PR is to be merged into `master`, confirm this code is production-ready and can be included in future releases as soon as it gets merged