cosmos / cosmos-sdk

:chains: A Framework for Building High Value Public Blockchains :sparkles:
https://cosmos.network/
Apache License 2.0

[Performance]: Restructure slashing for more performance #19457

Open ValarDragon opened 4 months ago

ValarDragon commented 4 months ago

Summary

Currently slashing makes two state reads per validator, plus always one state write and sometimes two, every single block. This adds excess time to block processing.

In Osmosis this is 150 to 300 state writes per block! On average we see around 175 writes.
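For context, here is a minimal sketch of that per-validator hot path, using a hypothetical slashingStore interface in place of the real keeper methods (whose names and signatures vary across SDK versions, so treat everything below as illustrative):

package slashingsketch

// slashingStore is a hypothetical stand-in for x/slashing's keeper state access.
type slashingStore interface {
  getSigningInfo(addr string) (indexOffset, missedCount int64)
  setSigningInfo(addr string, indexOffset, missedCount int64)
  getMissedBit(addr string, index int64) bool
  setMissedBit(addr string, index int64, missed bool)
}

const signedBlocksWindow = 10000 // illustrative window size

// handleValidatorSignature runs for every validator in the last commit, every block.
func handleValidatorSignature(s slashingStore, addr string, signed bool) {
  indexOffset, missedCount := s.getSigningInfo(addr) // state read 1 (every block)
  index := indexOffset % signedBlocksWindow
  indexOffset++

  previouslyMissed := s.getMissedBit(addr, index) // state read 2 (every block)
  missed := !signed
  if previouslyMissed != missed {
    if missed {
      missedCount++
    } else {
      missedCount--
    }
    s.setMissedBit(addr, index, missed) // state write 2 (only when the bit flips)
  }

  // State write 1 happens every block, because indexOffset always changes.
  s.setSigningInfo(addr, indexOffset, missedCount)
}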

We've just started benchmarking and aiming to reduce our number of writes per block. After we changed x/distribution to only distribute staking rewards once every 50 blocks, it preliminarily looks like we're at around 1000 writes per block, so this is a big contribution to database Commit time!

Problem Definition

No response

Proposed Feature

Right now I'm concerned with lowering the number of writes at hand here. (Later on I'd like to reduce the data fetch time too.)

There are two data writes right now: the validator's signing info (written every block, since its index offset is incremented each block) and the missed-blocks bitmap entry (written only when the bit actually flips).

I see 3 paths forward here:

I think we should do both solutions (1) and (2) personally, but I see (2) as potentially contentious. (I think it's fine, but I'd understand if others disagreed.) I'm happy to PR a fix for (1), if other folks agree. I believe it would be:

However, I would not be taking on the fully burdened work scope of QA'ing the migration in the mainline SDK. I would bring it to the Osmosis fork if we're aligned, but I would only include it in Osmosis if the mainline SDK folks wanted to do it.
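To make fix (1) concrete, here is one plausible shape for it, reusing the hypothetical slashingStore sketch from the summary above (this is my reading of the idea, not necessarily the exact PR): derive the bitmap index from the block height instead of a stored counter, so a validator that keeps signing needs no writes at all.

// Variation on the earlier sketch: index the bitmap by block height rather
// than a stored counter, so no-change blocks require zero writes.
func handleValidatorSignatureOnChange(s slashingStore, addr string, height int64, signed bool) {
  index := height % signedBlocksWindow // no read-modify-write of an index offset

  previouslyMissed := s.getMissedBit(addr, index)
  missed := !signed
  if previouslyMissed == missed {
    return // nothing changed: zero writes for this validator (the common case)
  }

  _, missedCount := s.getSigningInfo(addr)
  if missed {
    missedCount++
  } else {
    missedCount--
  }
  s.setMissedBit(addr, index, missed)
  s.setSigningInfo(addr, 0, missedCount) // the index offset is unused in this variant
}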

ValarDragon commented 4 months ago

This also may not make sense with the SDK's plans for a staking/slashing refactor, hence my asking whether we're aligned on the problem/fix.

alexanderbez commented 4 months ago

Nice, thanks for writing this up @ValarDragon! I'm really stoked to see you guys getting into the nitty-gritty details of improving I/O, specifically writes. Tbh, I'm less concerned with reads, especially with store/v2 on the horizon.

Now with regards to lowering the number of writes, I think (1) should be entirely uncontentious and provides a case where you may have no writes at all (assuming no missed blocks). So I have no objections there. I think (2) is a bit more contentious of a change.

@tac0turtle wdyt?

ValarDragon commented 4 months ago

Sweet! (1) was shockingly easier than I expected; making a draft PR right now. I just have to get proto compilation working on my new laptop, haha.

ValarDragon commented 4 months ago

On our latest benchmarks, using IAVL v1 and no fast nodes, we are now seeing the reads from slashing taking 11% of our sync time! I am still shocked that this is so high, tbh; I don't really understand why it's not hitting caches. We're doing 300 reads and 150 writes per block on the last SDK release. So maybe it's possible we consume the full cache size every block, and thus get a lot of misses?

(We need ~30% more speed improvement on the state-machine side, over the current IAVL v1 with no fast nodes, to hit our target goal. Getting pretty close!!)

As we want to push towards lower block times though, it does seem a bit silly for us not to batch this computation more. Would a batching PR be accepted? (Batching as in processing missed votes for jailing once every N blocks.) This relates to #19474. I don't really want to make Osmosis' code diverge from mainline that much in slashing, so I would only do it if there's alignment here. (Right now it's just backporting my commits to their v0.47.x equivalents, which we'll drop once we're on an SDK version containing them!)

tac0turtle commented 4 months ago

If we did batching, how would you track it over a period of time? Store it in some cache?

ValarDragon commented 4 months ago

If I wanted to process every 5th block, I just need to do the following (pseudocode):

if blockHeight%5 == 0 && blockHeight > 0 {
  for i := blockHeight - 4; i <= blockHeight; i++ {
    header, err := getBlockHeaderAtHeight(i)
    if err != nil {
      // edge case where the chain started within the last 5 blocks; skip until the height exists
      continue
    }
    for _, vote := range header.LastCommit {
      // existing code for handling votes, just pass in i (the historical height)
      // instead of reading the height from ctx
    }
  }
}

Staking already stores those headers!
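Concretely, the getBlockHeaderAtHeight helper above could be backed by x/staking's historical info. A sketch assuming the v0.47-style GetHistoricalInfo signature, with the helper expanded to take the context and staking keeper explicitly (note the stored header carries LastCommitHash rather than the individual votes, so the vote set itself still has to come from somewhere):

package headersketch

import (
  "fmt"

  tmproto "github.com/cometbft/cometbft/proto/tendermint/types"

  sdk "github.com/cosmos/cosmos-sdk/types"
  stakingkeeper "github.com/cosmos/cosmos-sdk/x/staking/keeper"
)

// getBlockHeaderAtHeight fetches a past header from x/staking's historical
// entries, which are retained according to the HistoricalEntries param.
func getBlockHeaderAtHeight(ctx sdk.Context, sk stakingkeeper.Keeper, height int64) (tmproto.Header, error) {
  hi, found := sk.GetHistoricalInfo(ctx, height)
  if !found {
    return tmproto.Header{}, fmt.Errorf("no historical info stored for height %d", height)
  }
  return hi.Header, nil
}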

alexanderbez commented 4 months ago

Yup -- @ValarDragon's proposal makes sense. This doesn't require epochs though; as you've pointed out, you can just perform a modulo check (the signing window must be evenly divisible by that modulus, though).
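A cheap way to enforce that constraint could be a parameter check along these lines (a sketch; batchInterval is a hypothetical new parameter alongside the existing SignedBlocksWindow):

package paramsketch

import "fmt"

// validateBatchInterval is a hypothetical check that the batch interval
// evenly divides the signed-blocks window, so window boundaries always
// line up with batch boundaries.
func validateBatchInterval(batchInterval, signedBlocksWindow int64) error {
  if batchInterval <= 0 {
    return fmt.Errorf("batch interval must be positive, got %d", batchInterval)
  }
  if signedBlocksWindow%batchInterval != 0 {
    return fmt.Errorf("batch interval %d must evenly divide signed blocks window %d",
      batchInterval, signedBlocksWindow)
  }
  return nil
}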