filecoin-project / venus

Filecoin Full Node Implementation in Go
https://venus.filecoin.io

Investigate high level chain sync performance #2675

Closed anorth closed 5 years ago

anorth commented 5 years ago

Description

Initial chain syncing for a node that is new or far behind the consensus head is very slow: it can take many hours to sync a weeks-old chain. "Sync" is a bit of a vague term; here I mean all the work necessary for a node to have processed ~all messages and be ready to mine and send other meaningful messages. It includes at least fetching and storing the blocks (which currently include the messages inline), validating that they form a well-formed blockchain, and computing the actor state tree. There may be other nontrivial work in and around those pieces.

At a high level, we want to understand the costs and bottlenecks in the sync process. Which parts dominate the end-to-end latency? Are they waiting for network or disk? What comprises the major CPU workloads involved? What opportunities exist for greater parallelism? What obvious redundancies or silly things are there?

This is a request for a coarse, high-level understanding, not a detailed profile. The answers will help us focus subsequent optimisation work on the most fruitful avenues.

Note that, separately but relatedly, we want to implement a new fetching protocol, which may or may not play a large role in the total latency.

Acceptance criteria

Risks + pitfalls

Where to begin

Start with an understanding of the discussion in #2050.

Be aware also of #2674, #2065 and eventual plans to fetch blocks with graphsync.

anorth commented 5 years ago

I believe that @frrist is already doing some of this investigation, so assigning this to him now. CC also @ZenGround0 @whyrusleeping.

frrist commented 5 years ago

It has been my experience that the time it takes a filecoin node to process its chain dwarfs the time it spends downloading the chain. Because of this I started looking for rabbit holes in the chain processing code.

I have uploaded a performance profile of a filecoin node connected to the user devnet here. In addition to the profile data, you will find a HOW-TO-VIEW.md that explains how to view it and a pprof.svg you may view in your browser. This profile was taken over the course of 5 minutes while the node was in the processing phase of syncing. During the time the profiler was running, the node was syncing the first 6000 blocks or so. For the sake of keeping this issue clean I will be linking to a gist that has the relevant profiling information.


Looking at the top 20 cumulative calls, we spend a lot of time marshaling/unmarshaling, walking the chain, getting blocks, and using the reflection package.

Showing top 20 nodes out of 294
      flat  flat%   sum%        cum   cum%
     1.14s  0.35%  0.35%    187.84s 57.14%  gx/ipfs/QmdBzoMxsBpojBfN1cv5GnKtB7sfYBMoLH7p9qSyEVYXcu/refmt/shared.TokenPump.Run
     0.03s 0.0091%  0.36%    152.68s 46.45%  gx/ipfs/QmdBzoMxsBpojBfN1cv5GnKtB7sfYBMoLH7p9qSyEVYXcu/refmt/cbor.(*Unmarshaller).Unmarshal
         0     0%  0.36%    146.46s 44.55%  github.com/filecoin-project/go-filecoin/chain.(*DefaultStore).BlockHistory.func1
     0.03s 0.0091%  0.37%    146.46s 44.55%  github.com/filecoin-project/go-filecoin/chain.(*DefaultStore).walkChain
     0.03s 0.0091%  0.37%    139.79s 42.52%  github.com/filecoin-project/go-filecoin/chain.(*DefaultStore).GetBlock
         0     0%  0.37%    138.96s 42.27%  gx/ipfs/QmcZLyosDwMKdB6NLRsiss9HXzDPhVhhRtPy67JFKTDQDX/go-ipld-cbor.DecodeInto
     0.01s 0.003%  0.38%    138.96s 42.27%  gx/ipfs/QmcZLyosDwMKdB6NLRsiss9HXzDPhVhhRtPy67JFKTDQDX/go-ipld-cbor/encoding.(*PooledUnmarshaller).Unmarshal
     0.01s 0.003%  0.38%    138.87s 42.24%  gx/ipfs/QmcZLyosDwMKdB6NLRsiss9HXzDPhVhhRtPy67JFKTDQDX/go-ipld-cbor/encoding.(*Unmarshaller).Unmarshal
     0.04s 0.012%  0.39%    138.71s 42.20%  gx/ipfs/QmcZLyosDwMKdB6NLRsiss9HXzDPhVhhRtPy67JFKTDQDX/go-ipld-cbor/encoding.(*Unmarshaller).Decode
         0     0%  0.39%    134.03s 40.77%  github.com/filecoin-project/go-filecoin/types.DecodeBlock
     0.41s  0.12%  0.52%    126.53s 38.49%  reflect.Value.Call
     4.92s  1.50%  2.01%    126.02s 38.34%  reflect.Value.call

I traced one path of chain processing execution down to the marshaling code; this table summarizes my findings and links to the profiled methods:

| Method | Cum% |
| --- | --- |
| syncOne | 30% |
| RunStateTransition | 28% |
| runMessages | 28% |
| ProcessBlock | 20% |
| ApplyMessagesAndPayReward | 25% |
| ApplyMessages | 25% |
| attemptApplyMessage | 25% |
| Send | 26% |
| CreateChannel | 13% |
| withPayerChannel | 11% |
| Commit | 9% |
| Put | 9% |
| WrapObject | 11% |

Commit and Put call WrapObject which is the entry point of TokenPump.Run. My reasoning for exploring CreateChannel calls comes from the percentage of messages in blocks:

| Method | # of Messages | % |
| --- | --- | --- |
| "createChannel" | 454231 | 69.2% |
| "commitSector" | 169065 | 25.7% |
| "" | 31223 | 4.4% |
| "addAsk" | 637 | < 1% |
| "createMiner" | 451 | < 1% |
| "submitPoSt" | 179 | < 1% |
| "reclaim" | 10 | < 1% |
| "updatePeerID" | 8 | < 1% |
| "redeem" | 1 | < 1% |

how I obtained the data in this table
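For reference, a minimal sketch of how a tally like this could be produced; the Block and Message types and the block source here are simplified stand-ins, not the real go-filecoin types:

```go
// Hypothetical sketch: tally message methods across a slice of decoded blocks.
package main

import "fmt"

type Message struct{ Method string }
type Block struct{ Messages []Message }

// countMethods counts how often each method name appears across the blocks.
func countMethods(blocks []Block) map[string]int {
	counts := make(map[string]int)
	for _, blk := range blocks {
		for _, msg := range blk.Messages {
			counts[msg.Method]++
		}
	}
	return counts
}

func main() {
	// In practice the blocks would come from walking the synced chain.
	blocks := []Block{{Messages: []Message{{Method: "createChannel"}, {Method: "commitSector"}}}}
	counts := countMethods(blocks)
	total := 0
	for _, n := range counts {
		total += n
	}
	for method, n := range counts {
		fmt.Printf("%-15q %8d %5.1f%%\n", method, n, 100*float64(n)/float64(total))
	}
}
```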

One of our bottlenecks in the chain processing code appears to be the VM's marshaling & reflection calls -- this is a :rabbit: :hole:. From the conversation in #2050 this isn't new information, but I thought it was worth sharing the data here.

Another source of slowness is the GetBlock method in the chain store. We spend ~42% of our execution time in this method. Adding an LRU-ish cache here would likely speed things up, as the cbor decoding in the method is a major source of slowness. At first glance this looks like some low-hanging :banana: :kiwi_fruit: :cherries:
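As a rough illustration, here is a minimal sketch of an LRU cache wrapped around a block loader, assuming the hashicorp/golang-lru package; the store interface and Block type are simplified stand-ins, not the actual DefaultStore API:

```go
// Sketch of wrapping a block loader with an LRU cache so that repeatedly
// requested blocks skip the expensive cbor decode.
package chain

import (
	lru "github.com/hashicorp/golang-lru"
)

type Block struct{ /* decoded block fields */ }

// rawStore stands in for the existing store that fetches and cbor-decodes
// a block by its CID (represented here as a string).
type rawStore interface {
	GetBlock(c string) (*Block, error)
}

type cachingStore struct {
	inner rawStore
	cache *lru.Cache
}

func newCachingStore(inner rawStore, size int) (*cachingStore, error) {
	c, err := lru.New(size)
	if err != nil {
		return nil, err
	}
	return &cachingStore{inner: inner, cache: c}, nil
}

// GetBlock returns a cached decoded block when available, falling back to
// the underlying store (and its cbor decode) on a miss.
func (s *cachingStore) GetBlock(c string) (*Block, error) {
	if v, ok := s.cache.Get(c); ok {
		return v.(*Block), nil
	}
	blk, err := s.inner.GetBlock(c)
	if err != nil {
		return nil, err
	}
	s.cache.Add(c, blk)
	return blk, nil
}
```

As noted below, this only pays off if the same small set of blocks is fetched repeatedly during sync.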

Lastly, from the pprof.svg linked above, another area of slowness appears to be CGO calls; these occur when we verify messages with the commitSector method. This is another :rabbit: :hole:. I have little to no insight into improvements that can be made here.

ROUTINE ======================== github.com/filecoin-project/go-filecoin/proofs.(*RustVerifier).VerifySeal in /home/frrist/src/github.com/filecoin-project/go-filecoin/proofs/rustverifier.go
         0     34.12s (flat, cum) 10.38% of Total
         .          .     54:   if err != nil {
         .          .     55:           return VerifySealResponse{}, err
         .          .     56:   }
         .          .     57:
         .          .     58:   // a mutable pointer to a VerifySealResponse C-struct
         .     34.12s     59:   resPtr := (*C.VerifySealResponse)(unsafe.Pointer(C.verify_seal(
         .          .     60:           cfg,
         .          .     61:           (*[32]C.uint8_t)(commRCBytes),
         .          .     62:           (*[32]C.uint8_t)(commDCBytes),
         .          .     63:           (*[32]C.uint8_t)(commRStarCBytes),
         .          .     64:           (*[31]C.uint8_t)(proverIDCBytes),

Takeaways from this comment are:

  1. We spend a ton of time in the VM marshaling data to and from cbor.
  2. The default_store's GetBlock method could benefit from an LRU cache.
  3. Calls to CGO methods are slow but currently appear to be unavoidable.
frrist commented 5 years ago

Another rabbit hole that could use exploring is the merkleization of messages in blocks. Merkleizing messages will yield the following benefits:

These are the benefits that jump out to me right now, but I am sure there are more benefits to this than what I have listed here.

anorth commented 5 years ago

Thanks @frrist this is a great start.

The LRU cache sounds like a great first thing we can do. I note that it will only provide benefit if the same small set of blocks are fetched many times, but if the shape of the code suggests this is likely, we probably need only a relatively brief design sketch to communicate and inform that implementation.

Your last comment is more about memory pressure than chain sync performance, though that is also a pressing issue. I note that we could separate the messages from the blocks internally, independent of any merkleization or other protocol-level changes: it's just a matter of coding to abstract the network-level block structure from the code that accesses it. This abstraction is likely to be helpful and necessary for future model-for-change work to simultaneously support different block structures across protocol upgrades.
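A minimal sketch of what that internal abstraction could look like (the names are illustrative, not an existing go-filecoin interface):

```go
// Hypothetical sketch: consumers ask a MessageProvider for a block's
// messages instead of reading them inline from the block structure, so the
// storage layout (inline today, separate later) can change without touching
// callers.
package chain

type Header struct {
	// Consensus fields, plus (eventually) a CID referencing the messages
	// rather than embedding the messages themselves.
}

type Message struct{ Method string }

// MessageProvider hides where and how a block's messages are stored.
type MessageProvider interface {
	MessagesForHeader(h *Header) ([]Message, error)
}
```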

anorth commented 5 years ago

Since this issue was filed we have made great progress by fixing technical debt that had large performance impacts. At the time it was filed, validating the chain took many times longer than downloading it; validation performance has improved such that, depending on network latency, the two are now on par.

Let's re-profile on the 0.3.2 user devnet in a week or so to learn whether we need to do more critical work right now, and if so what that is, and then close this issue for now.

frrist commented 5 years ago

Below are the results of a profile run against a go-filecoin node on 0.3.2 while it was processing the chain after joining the network for the first time (this does not include the download part of syncing). The profile lasted 5 minutes. Again, let's look at the top 20 cumulative calls:

Showing top 20 nodes out of 277
      flat  flat%   sum%        cum   cum%
     0.01s 0.0031% 0.0031%    245.92s 77.21%  github.com/libp2p/go-libp2p/p2p/host/basic.(*BasicHost).SetStreamHandler.func1
         0     0% 0.0031%    245.02s 76.93%  github.com/filecoin-project/go-filecoin/protocol/hello.(*Handler).handleNewStream
         0     0% 0.0031%    245.01s 76.92%  github.com/filecoin-project/go-filecoin/chain.(*Syncer).HandleNewTipset
         0     0% 0.0031%    245.01s 76.92%  github.com/filecoin-project/go-filecoin/chain.(*Syncer).syncOne
         0     0% 0.0031%    245.01s 76.92%  github.com/filecoin-project/go-filecoin/node.(*Node).Start.func1
         0     0% 0.0031%    245.01s 76.92%  github.com/filecoin-project/go-filecoin/protocol/hello.(*Handler).processHelloMessage
         0     0% 0.0031%    231.48s 72.68%  github.com/filecoin-project/go-filecoin/consensus.(*Expected).RunStateTransition
         0     0% 0.0031%    225.19s 70.70%  github.com/filecoin-project/go-filecoin/consensus.(*Expected).runMessages
     0.18s 0.057%  0.06%    215.66s 67.71%  reflect.Value.Call
     2.19s  0.69%  0.75%    215.56s 67.68%  reflect.Value.call
     0.01s 0.0031%  0.75%    210.78s 66.18%  github.com/filecoin-project/go-filecoin/vm.Send (inline)
     0.02s 0.0063%  0.76%    210.78s 66.18%  github.com/filecoin-project/go-filecoin/vm.send
         0     0%  0.76%    210.73s 66.16%  github.com/filecoin-project/go-filecoin/consensus.(*DefaultProcessor).ApplyMessagesAndPayRewards
         0     0%  0.76%    210.67s 66.14%  github.com/filecoin-project/go-filecoin/consensus.(*DefaultProcessor).ApplyMessage
     0.01s 0.0031%  0.76%    210.60s 66.12%  github.com/filecoin-project/go-filecoin/actor.MakeTypedExport.func2
         0     0%  0.76%    208.75s 65.54%  github.com/filecoin-project/go-filecoin/consensus.(*DefaultProcessor).attemptApplyMessage
         0     0%  0.76%    171.57s 53.87%  github.com/filecoin-project/go-filecoin/consensus.(*DefaultProcessor).ProcessBlock
         0     0%  0.76%    152.60s 47.91%  github.com/filecoin-project/go-filecoin/actor/builtin/paymentbroker.(*Actor).CreateChannel
         0     0%  0.76%    152.57s 47.90%  github.com/filecoin-project/go-filecoin/actor/builtin/paymentbroker.withPayerChannels
     0.76s  0.24%     1%    126.07s 39.58%  github.com/polydawn/refmt/shared.TokenPump.Run

The above marks an improvement since the last profile was taken -- we no longer spend a majority of our time walking the chain and decoding blocks. We do spend a lot of time in the functions that make up chain processing, but this is expected given the window over which the profile was taken. Below are three areas that could use investigation to improve performance, ordered by how easy (my guess) they are to address.

Double Message Application via runMessages

Calling runMessages will call ProcessBlock for each block in a tipset and will also call ProcessTipSet. Both ProcessBlock and ProcessTipSet call ApplyMessagesAndPayReward. This means we are processing each message we receive twice. @ZenGround0 has proposed a solution to this in the comments:

// multiblock tipsets require reapplying messages to get aggregate state
// NOTE: It is possible to optimize further by applying block validation
// in sorted order to reuse first block transitions as the starting state
// for the tipSetProcessor.
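To make the redundancy concrete, here is an illustrative sketch of the current shape and where the saving would come from; all names and types are simplified stand-ins, not the real consensus code:

```go
// Illustrative sketch of the double message application described above.
package consensus

type StateTree struct{}
type Message struct{ Method string }
type Block struct{ Messages []Message }

// applyMessages stands in for ApplyMessagesAndPayReward.
func applyMessages(st *StateTree, msgs []Message) (*StateTree, error) {
	// ... run each message through the VM ...
	return st, nil
}

// processBlock validates a single block by applying its messages to the
// parent state (first application of every message).
func processBlock(parent *StateTree, blk *Block) (*StateTree, error) {
	return applyMessages(parent, blk.Messages)
}

// processTipSet computes the aggregate tipset state by applying the messages
// of every block again (second application of the same messages).
func processTipSet(parent *StateTree, ts []*Block) (*StateTree, error) {
	st := parent
	var err error
	for _, blk := range ts {
		if st, err = applyMessages(st, blk.Messages); err != nil {
			return nil, err
		}
	}
	return st, nil
}

// runMessages shows where the duplication happens. Validating blocks in
// sorted order and reusing the first block's post-state as the tipset
// processor's starting state (per the quoted comment) would avoid one of
// the two passes over that block's messages.
func runMessages(parent *StateTree, ts []*Block) (*StateTree, error) {
	for _, blk := range ts {
		if _, err := processBlock(parent, blk); err != nil {
			return nil, err
		}
	}
	return processTipSet(parent, ts)
}
```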

Hello Protocol

We spend a lot of time handling Hello Protocol messages. We lack a means to blacklist or back off from peers that we cannot sync with (due to an invalid genesis or version), so we are likely reprocessing the same invalid messages over and over: we disconnect from such peers when they fail, only to reconnect to them again later (rinse, repeat). This existed in the previous profile but is more obvious here due to the perf improvements we made in walking the chain. There is an issue in libp2p to address this here.
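A minimal sketch of the kind of back-off bookkeeping a hello handler could consult before re-validating a peer we already failed to sync with (hypothetical, not an existing API):

```go
// Hypothetical sketch of per-peer back-off for the hello handler.
package hello

import (
	"sync"
	"time"
)

type backoffList struct {
	mu    sync.Mutex
	until map[string]time.Time // peer ID -> earliest time we will retry
}

func newBackoffList() *backoffList {
	return &backoffList{until: make(map[string]time.Time)}
}

// ShouldSkip reports whether the peer is still inside its back-off window.
func (b *backoffList) ShouldSkip(peerID string) bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	return time.Now().Before(b.until[peerID])
}

// Fail records a failed hello exchange (e.g. wrong genesis or version) and
// suppresses re-processing of that peer for the given duration.
func (b *backoffList) Fail(peerID string, wait time.Duration) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.until[peerID] = time.Now().Add(wait)
}
```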

VM Send

We spend a decent amount of time in the VM -- this is likely due to the inefficiency of using reflection (e.g. resolving a method name to a function call), among other operations involving modification of actor state. The scope here is broad and I'd prefer to punt on investigating this further since it was already looked at in https://github.com/filecoin-project/go-filecoin/issues/2050.
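For what it's worth, one direction (purely a hypothetical sketch, not the real VM API) is to replace per-call reflection with a dispatch table built at registration time:

```go
// Hypothetical sketch: static export dispatch instead of reflect.Value.Call.
package vm

import "fmt"

type VMContext struct{ Params []byte }

// exportFn is a pre-bound actor method: no reflection needed per call.
type exportFn func(ctx *VMContext) ([]byte, uint8, error)

// exports maps method names to concrete functions once, up front.
var exports = map[string]exportFn{
	"createChannel": createChannel,
}

func createChannel(ctx *VMContext) ([]byte, uint8, error) {
	// ... decode params, mutate actor state ...
	return nil, 0, nil
}

// send dispatches a message through the table rather than reflecting over
// the actor's method set on every message application.
func send(method string, ctx *VMContext) ([]byte, uint8, error) {
	fn, ok := exports[method]
	if !ok {
		return nil, 1, fmt.Errorf("unknown method %q", method)
	}
	return fn(ctx)
}
```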

anorth commented 5 years ago

Thanks!

Refmt is still huge. Can you determine which are the frequent call paths into shared.TokenPump.Run?

Otherwise, great stuff, thanks. I think we can consider this done, and we have some starting points for VM optimisations. FYI @acruikshank @ingar

frrist commented 5 years ago

Areas in purple represent calls to shared.TokenPump.Run [flame graph image: flamepump]

Similar data as a graph here (I have hidden many of the calls that lead to Run in this image). [Screenshot from 2019-07-17 11-41-54]

The takeaway here is that we have many paths into the TokenPump.Run call -- I think it will be easier to see the relationships by interacting with these graphs manually; the screenshots don't tell the full story. At a high level, any time we marshal or unmarshal cbor data we run the TokenPump.Run method.

acruikshank commented 5 years ago

I suspect what we're seeing here is that shared.TokenPump.Run is the consumer of a stream to the data store, and it's getting blamed by the profiler for slow IO.

Part of the reason I suspect this is that PaymentBroker.CreateChannel is among the simplest of our actor methods, and a payment channel is a relatively simple structure. The payment broker is probably managing a huge database of payment channels at this point, though. We might want to investigate how the cbor-ipld-hamt performs against our data store at different sizes before we dig too deeply into refmt optimizations.
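A minimal sketch of the kind of size-scaling measurement suggested here, using a stand-in map-backed store rather than the real cbor-ipld-hamt API:

```go
// Hypothetical sketch: benchmark channel lookups at increasing store sizes
// to see whether per-lookup cost grows with the number of payment channels.
// mapStore is a stand-in; the real measurement would target the
// cbor-ipld-hamt backed actor storage against our datastore.
package paymentbroker_test

import (
	"fmt"
	"testing"
)

type mapStore map[string][]byte

func (m mapStore) Put(id string, data []byte) { m[id] = data }
func (m mapStore) Get(id string) []byte       { return m[id] }

func benchmarkLookup(b *testing.B, size int) {
	store := make(mapStore, size)
	for i := 0; i < size; i++ {
		store.Put(fmt.Sprintf("channel-%d", i), []byte("payload"))
	}
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		_ = store.Get(fmt.Sprintf("channel-%d", i%size))
	}
}

func BenchmarkLookup1k(b *testing.B)   { benchmarkLookup(b, 1_000) }
func BenchmarkLookup100k(b *testing.B) { benchmarkLookup(b, 100_000) }
func BenchmarkLookup1M(b *testing.B)   { benchmarkLookup(b, 1_000_000) }
```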