Open glevco opened 1 month ago
Branch: `perf/sync-improvements`
Testbed: `ubuntu-22.04`
🚨 1 ALERT: Threshold Boundary Limit exceeded!
| Benchmark | Measure (units) | View | Benchmark Result (Result Δ%) | Lower Boundary (Limit %) | Upper Boundary (Limit %) |
| --- | --- | --- | --- | --- | --- |
| sync-v2 (up to 20000 blocks) | Latency (ns) | 📈 plot, 🚨 alert, 🚷 threshold | 48,856,385,056.48 (-50.97%) | 89,674,168,696.37 (183.55%) | 109,601,761,740.01 (44.58%) |
All modified and coverable lines are covered by tests :white_check_mark:
Project coverage is 84.37%. Comparing base (`7f53032`) to head (`5008eb5`).
:umbrella: View full report in Codecov by Sentry.
Putting this on hold for now.
Motivation
This PR contains two changes that together reduce sync time by 60% for a sync of up to 100,000 blocks. Syncing the whole DAG will probably yield smaller gains, though.
I benchmarked each change independently, generating two charts: one for syncing up to 20,000 blocks and one for 100,000 blocks. Each chart shows the sync time in seconds for different values of the stream limit, for both the current `master` commit (4f80786) and the window-cache change.

The charts clearly show an increase in performance that plateaus at a streaming limit of 10,000 blocks. That change by itself yields a 40% decrease in sync time. Then, adding the sliding block window cache reduces it even more, reaching a 60% decrease with the 10,000 streaming limit for 100,000 synced blocks. Interestingly, the second change seems to affect performance only when combined with the first one.
Each value in the chart was generated by running each experiment twice and taking the mean. No significant variations were observed. The complete table of values can be found below.
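As a quick sanity check on the headline percentages, they can be recomputed from the 100,000-block timings in the data tables (values in seconds; the helper below is mine, not code from this PR):

```python
# Timings from the 100,000-block table (seconds).
master_1000 = 509.678   # current streaming limit (1,000), master
master_10000 = 292.656  # streaming limit raised to 10,000, master
cache_10000 = 206.793   # streaming limit 10,000 + window cache

def decrease_pct(before: float, after: float) -> float:
    """Relative decrease, in percent."""
    return (before - after) / before * 100

print(f"limit only:    {decrease_pct(master_1000, master_10000):.1f}%")  # ~42.6%
print(f"limit + cache: {decrease_pct(master_1000, cache_10000):.1f}%")   # ~59.4%
```

So the "40%" and "60%" figures quoted above are slightly rounded but consistent with the raw data.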
Complete data tables
For 20,000 blocks (sync time in seconds):

Stream limit | Master | Window cache
--- | --- | ---
1000 (current) | 101.682 | 102.02
2000 | 52.408 | 51.641
10000 | 36.588 | 21.739
20000 | 31.609 | 21.427

For 100,000 blocks (sync time in seconds):

Stream limit | Master | Window cache
--- | --- | ---
1000 (current) | 509.678 | 502.582
2000 | 379.551 | 307.11
10000 | 292.656 | 206.793
50000 | 277.269 | 194.31
100000 | 264.143 | 199.613

All output files generated by `hyperfine` can be downloaded here: [benches.zip](https://github.com/user-attachments/files/16820341/benches.zip). All benchmarks ran locally on my Mac mini 2023, with an Apple M2 and 24GB of RAM, on macOS Sonoma 14.3.

Increase streaming limit
The current streaming limit is arbitrarily low, at 1,000 blocks. There's considerable overhead caused by resuming the stream between batches, so increasing the limit significantly speeds up the sync. We didn't observe significant gains past the 10,000 mark, so that's the new value I'm setting.
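To see why larger batches help, here's a toy cost model (all constants are hypothetical, not measured from the project): each batch pays a fixed stream-resume round-trip on top of the per-block cost, so fewer, larger batches amortize the overhead, and the gain flattens once the resume cost becomes negligible:

```python
import math

RESUME_OVERHEAD_S = 0.5  # hypothetical cost of resuming the stream, per batch
PER_BLOCK_S = 0.002      # hypothetical per-block processing cost

def sync_time(n_blocks: int, stream_limit: int) -> float:
    """Total sync time under the toy model: fixed cost per batch + cost per block."""
    batches = math.ceil(n_blocks / stream_limit)
    return batches * RESUME_OVERHEAD_S + n_blocks * PER_BLOCK_S

for limit in (1_000, 2_000, 10_000, 20_000):
    print(limit, round(sync_time(100_000, limit), 1))
# 1000 -> 250.0, 2000 -> 225.0, 10000 -> 205.0, 20000 -> 202.5:
# the improvement shrinks quickly past the 10,000 mark.
```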
Sliding block window cache
When calculating the weight for a block during verification, a window of ~200 ancestor blocks is used. Currently, for each block being verified, we iterate using `get_block_parent`. This adds unnecessary overhead to the calculation, because during sync blocks are received sequentially, so the window of ancestor blocks is mostly the same as the one from the previously verified block, except for its first and last elements.

We can leverage this by creating a cache that is a sliding block window. Whenever we calculate the weight for a block, we check that the cached window is usable by it, and if so we simply pop its head and append to its tail to update the window, instead of re-fetching all ~200 ancestors.
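A minimal sketch of such a cache (names like `SlidingBlockWindow` and `parent_hash` are illustrative, not the project's actual API; real verification code would also need to handle reorgs and non-sequential blocks):

```python
from collections import deque

WINDOW_SIZE = 200  # approximate ancestor-window size used for weight calculation

class SlidingBlockWindow:
    def __init__(self) -> None:
        self._window: deque = deque()
        self._head = None  # the block whose window was built most recently

    def get_window(self, block, get_block_parent):
        """Return the ancestor window for `block`, sliding the cache when possible."""
        if self._head is not None and block.parent_hash == self._head.hash:
            # Sequential case: `block` is a child of the previously verified
            # block, so drop the oldest ancestor and append the previous head.
            if len(self._window) == WINDOW_SIZE:
                self._window.popleft()
            self._window.append(self._head)
        else:
            # Fallback: rebuild the window by walking parents, as before.
            ancestors = []
            parent = get_block_parent(block)
            while parent is not None and len(ancestors) < WINDOW_SIZE:
                ancestors.append(parent)
                parent = get_block_parent(parent)
            self._window = deque(reversed(ancestors))
        self._head = block
        return list(self._window)
```

On the sequential path the update is O(1) instead of ~200 parent lookups, which is where the speedup comes from.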
Here's a tracing graph for the processing of 3 blocks in `master`:

And here's the same graph for this branch:
We can observe that the proportion of time spent in verification, relative to the total time spent processing a new vertex, decreases significantly.
Future improvements
We can further optimize the weight calculation by caching the solvetimes and weights instead of the blocks themselves, as they're the only values derived from the window. I didn't do it in this PR because that change is a bit more convoluted and the remaining overhead is not that significant.
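As a sketch of that future direction (hypothetical names, not code from this PR), the cache would hold only the two derived values per ancestor, so sliding the window moves a pair of floats instead of a block object:

```python
from collections import deque

class DerivedWindowCache:
    """Keeps only the per-ancestor values the weight formula consumes."""

    def __init__(self, size: int = 200) -> None:
        # deques with maxlen drop the oldest entry automatically on append.
        self.solvetimes: deque = deque(maxlen=size)
        self.weights: deque = deque(maxlen=size)

    def slide(self, new_solvetime: float, new_weight: float) -> None:
        self.solvetimes.append(new_solvetime)
        self.weights.append(new_weight)
```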
Acceptance Criteria

- Implement the sliding block window cache so we no longer call `get_parent_block` ~200 times per verification.
- Increase `DEFAULT_STREAMING_LIMIT` from 1,000 to 10,000.

Checklist

- If this PR is to be merged into `master`, confirm this code is production-ready and can be included in future releases as soon as it gets merged