Investigate Two Queue Cache on GetBlock

frrist commented 5 years ago

Description

In #2675 we discovered that we spend ~42% of our cumulative execution time in the GetBlock method when performing the initial chain sync: https://github.com/filecoin-project/go-filecoin/blob/522cefefb2b6db8537d8742a4b86801b5ed4e939/chain/default_store.go#L283-L291 Lets looks at that calls GetBlock the most:

Tipset Iterator

Here is the definition of TipsetIterator. Its Next() method calls GetParentTipSet(), which calls GetBlock in a loop: https://github.com/filecoin-project/go-filecoin/blob/522cefefb2b6db8537d8742a4b86801b5ed4e939/chain/traversal.go#L16-L34 Now let's see how and where the TipsetIterator is used.

syncOne

syncOne is called for each new tipset a node receives. Here we will show that it walks the entire blockchain twice indirectly using GetBlock to fetch the blocks as it traverses:

syncOne first call's GetRecentAncestors: https://github.com/filecoin-project/go-filecoin/blob/522cefefb2b6db8537d8742a4b86801b5ed4e939/chain/default_syncer.go#L206-L215

AncestorRouncsNeeded is currently set to 20100, it is the sum of ProvingPeriodBlocks and GracePeriodBlocks: https://github.com/filecoin-project/go-filecoin/blob/522cefefb2b6db8537d8742a4b86801b5ed4e939/consensus/expected.go#L70-L72

Notice on line 65 in the below code snippet IterAncestors is called and returns a TipsetIterator which we have defined above. The call to CollectTipSetsOfHeightAtLeast will use the iterator when walking the chain.

https://github.com/filecoin-project/go-filecoin/blob/522cefefb2b6db8537d8742a4b86801b5ed4e939/chain/get_ancestors.go#L51-L90

The above code shows that GetRecentAncestors will always walk back at most 20100, and when the chain is less than 20100 blocks long, GetRecentAncestors will walk the whole chain. We call this method everytime we receive a new block from the network. This method uses a TipsetIterator to walk the chain, and we have shown that the iterator makes calls to GetBlock.

Next, let's look at what syncOne does if the tipset we received is heavier than out current: https://github.com/filecoin-project/go-filecoin/blob/522cefefb2b6db8537d8742a4b86801b5ed4e939/chain/default_syncer.go#L258-L268 we call CollectTipSetsOfHeightAtLeast passing in 0 as the value of minHeight: https://github.com/filecoin-project/go-filecoin/blob/522cefefb2b6db8537d8742a4b86801b5ed4e939/chain/get_ancestors.go#L92-L112

This means that CollectTipSetsOfHeightAtLeast will walk the entire chain since every block has a height > 0. CollectTipSetsOfHeightAtLeast uses a IterAncestors to traverse the chain. The means more calls to GetBlock.

Why is GetBlock so slow?

Calls to GetBlock are slow for 2 reasons:

GetBlock uses a block service to fetch the blocks from disk. Going to disk is slow :snail:.
GetBlock will decode the block everytime it fetches it from disk. As we showed in #2675, decoding cbor is slow.

I believe caching the decoded blocks in a TwoQueueCache will improve performance of this operation. There are block that we access frequently (genesis, genesis+1, genesis+2, ...) and there are blocks that we access recently (head, head-1, head-2, ...), a TwoQueueCache is designed for this type of access.

Acceptance criteria

Instrument TwoQueueCache from https://github.com/hashicorp/golang-lru/ on GetBlock method such that it caches the decoded blocks.
Compare performance of cache implementation with non-cache implementation to ensure the above outline is sound.
Initially the size of the cache will be static. If we confirm that caching yields a performance improvement a follow on will be created to make the cache configurable. Currently it is not clear if we should configure cache based on size (bytes) or number of blocks.

Risks + pitfalls

Caching could just be a bandaid on what is a larger problem and therefore may not be worth the time investment. Are there changes down the road that will negate the requirement that we must walk the entire chain (Gut says yes :crossed_fingers: ). However I would be willing to make the argument that the time needed to instrument said cache is relatively short.

Related issues:

The above issue with chain performance have been previously outlined in #2151 and #966.

ZenGround0 commented 5 years ago

Overall looks awesome, really excited to watch the performance accelerate with caching 🏎
Are there changes down the road that will negate the requirement that we must walk the entire chain (Gut says yes 🤞 )

Gut is correct! For the GetRecentAncestors case: Proving period is likely going down by a factor of 20 when that parameter is set for mainnet and we should reduce it speculatively right now anyway. Additionally #2223 opens up possible designs that require traversing much less data (traverse ticket chain separately from blockchain in advance or random access of tickets from state machine).

For the calculate reorg case: I have a patch in the works that got delayed because of the release blocking message pool bug. This patch stops traversing really soon in the common case because it can find the common ancestor without stupidly walking the whole chain. It will take ~1 day to get this in. In the longer term once we have a finality bound this common ancestor traversal will be bounded to a reasonable traversal length (< 1000 blocks) even in the uncommon case where the two chains diverged as far back as genesis.

Caching blocks on the chain store also paves the way for long needed upgrades to our system (e.g. #642 and #556). In particular we will need something like what you propose above to stop keeping all tipsets in memory.

frrist commented 5 years ago

Result of initial performance testing

I did a bit of testing to compare GetBlock performance with and without a cache, below is my setup and initial findings.

Setup

I added a TwoQueueCache around the GetBlock method and set the cache size to 1000. Next I added some metrics to track cache hits and misses. Lastly, some timers around calls to syncOne.

I then created a local network using the localnet fast binary and spawned off 2 different network:

network with chain length 1000

Data Point	GetBlock With Cache	GetBlock Without Cache
Cache Hits	1075717	0
Cache Miss	25707	1101424
SyncOne Average	25.56ms	47.00 ms
SyncOne Total	27.78s	49.16s

network with chain of length 5000

Data Point	GetBlock With Cache	GetBlock Without Cache
Cache Hits	6652743	0
Cache Miss	18622184	25274927
SyncOne Average	99.93ms	117.18ms
SyncOne Total	501.79s	588.84s

Initial observations indicate that chain processing performs really well when everything fits in the cache.

Unfortunately this won't always be the case and the second table with chain length 5000 demonstrates this. Ideally the size of the cache could be configured as a function of AncestorRoundsNeeded and minHeight but the likelihood of that is unclear at the moment.

Next steps might be adding a metric that tracks average size of blocks. This would help to determine a reasonable size for the GetBlock cache if we decide to move forward with it. Another step could be looking at ways to combine the GetBlock cache with the tipIndex cache as they contain similar pieces of data.

anorth commented 5 years ago

I'm moving this to backlog, as I think we've identified algorithmic changes to limit chain traversal that will initially be higher impact (by avoiding GetBlock altogether).

anorth commented 5 years ago

I recently made changes so that most chain traversals will hit the tipset cache (in memory) rather than the store. I expect this to have removed most of this decoding overhead. If profiling proves this out, I propose closing this issue for now.

frrist commented 4 years ago

Recent profiles taken against filecoin nodes on the testnet_0.3.1 indicate that this change is no longer needed. Closing.

filecoin-project / venus