btcsuite / btcd

An alternative full node bitcoin implementation written in Go (golang)
https://github.com/btcsuite/btcd/blob/master/README.md
ISC License
6.23k stars 2.36k forks source link

bug: daemon stalled due to circular waiting dependency between server can `SyncManager` #2258

Open Roasbeef opened 2 weeks ago

Roasbeef commented 2 weeks ago

Deadlock Scenario

I've seen a deadlock arise a few times on one of my nodes, primarily on testnet3, presumably triggered more often due to the existence of block storms.

Here's a sample goroutine dump:

goroutine profile: total 21172
11514 @ 0x43beee 0x407625 0x407277 0x8a69f3 0x46e961
#    0x8a69f2    main.(*server).peerDoneHandler+0x52    /root/gocode/src/github.com/btcsuite/btcd/server.go:2211

9563 @ 0x43beee 0x407625 0x407277 0x8b6f75 0x8b6f5c 0x82f9a2 0x82ffd2 0x83013f 0x8308e9 0x46e961
#    0x8b6f74    main.(*server).AddPeer+0x34                        /root/gocode/src/github.com/btcsuite/btcd/server.go:2330
#    0x8b6f5b    main.(*serverPeer).OnVerAck+0x1b                    /root/gocode/src/github.com/btcsuite/btcd/server.go:533
#    0x82f9a1    github.com/btcsuite/btcd/peer.(*Peer).processRemoteVerAckMsg+0xa1    /root/gocode/src/github.com/btcsuite/btcd/peer/peer.go:2068
#    0x82ffd1    github.com/btcsuite/btcd/peer.(*Peer).waitToFinishNegotiation+0x1b1    /root/gocode/src/github.com/btcsuite/btcd/peer/peer.go:2195
#    0x83013e    github.com/btcsuite/btcd/peer.(*Peer).negotiateInboundProtocol+0x11e    /root/gocode/src/github.com/btcsuite/btcd/peer/peer.go:2241
#    0x8308e8    github.com/btcsuite/btcd/peer.(*Peer).start.func1+0x28            /root/gocode/src/github.com/btcsuite/btcd/peer/peer.go:2289

1 @ 0x43beee 0x407625 0x407277 0x8a29b7 0x8a2977 0x8a2a47 0x8562cf 0x77f0cd 0x76eea5 0x76ecc6 0x771630 0x76bb7c 0x77f59e 0x77f993 0x8526ae 0x855c85 0x46e961
#    0x8a29b6    main.(*server).RelayInventory+0xb6                            /root/gocode/src/github.com/btcsuite/btcd/server.go:2341
#    0x8a2976    main.(*server).relayTransactions+0x76                            /root/gocode/src/github.com/btcsuite/btcd/server.go:1497
#    0x8a2a46    main.(*server).AnnounceNewTransactions+0x26                        /root/gocode/src/github.com/btcsuite/btcd/server.go:1508
#    0x8562ce    github.com/btcsuite/btcd/netsync.(*SyncManager).handleBlockchainNotification+0x32e    /root/gocode/src/github.com/btcsuite/btcd/netsync/manager.go:1483
#    0x77f0cc    github.com/btcsuite/btcd/blockchain.(*BlockChain).sendNotification+0xcc            /root/gocode/src/github.com/btcsuite/btcd/blockchain/notifications.go:78
#    0x76eea4    github.com/btcsuite/btcd/blockchain.(*BlockChain).connectBlock.func2+0x84        /root/gocode/src/github.com/btcsuite/btcd/blockchain/chain.go:708
#    0x76ecc5    github.com/btcsuite/btcd/blockchain.(*BlockChain).connectBlock+0x3c5            /root/gocode/src/github.com/btcsuite/btcd/blockchain/chain.go:709
#    0x77162f    github.com/btcsuite/btcd/blockchain.(*BlockChain).connectBestChain+0x28f        /root/gocode/src/github.com/btcsuite/btcd/blockchain/chain.go:1219
#    0x76bb7b    github.com/btcsuite/btcd/blockchain.(*BlockChain).maybeAcceptBlock+0x19b        /root/gocode/src/github.com/btcsuite/btcd/blockchain/accept.go:79
#    0x77f59d    github.com/btcsuite/btcd/blockchain.(*BlockChain).processOrphans+0x27d            /root/gocode/src/github.com/btcsuite/btcd/blockchain/process.go:118
#    0x77f992    github.com/btcsuite/btcd/blockchain.(*BlockChain).ProcessBlock+0x332            /root/gocode/src/github.com/btcsuite/btcd/blockchain/process.go:236
#    0x8526ad    github.com/btcsuite/btcd/netsync.(*SyncManager).handleBlockMsg+0x2ad            /root/gocode/src/github.com/btcsuite/btcd/netsync/manager.go:745
#    0x855c84    github.com/btcsuite/btcd/netsync.(*SyncManager).blockHandler+0x424            /root/gocode/src/github.com/btcsuite/btcd/netsync/manager.go:1365

What we see here is that we had a combination of many peers connecting and disconnecting. When a peer connects, we go to notify the server to update the various book keeping information it stores: https://github.com/btcsuite/btcd/blob/67b8efd3ba53b60ff0eba5d79babe2c3d82f6c54/server.go#L1783-L1784

The server then goes to update the SyncManager, to see if we need to pick this peer as a sync node or not, while also setting up some book keeping information: https://github.com/btcsuite/btcd/blob/67b8efd3ba53b60ff0eba5d79babe2c3d82f6c54/netsync/manager.go#L455-L478

While one of these requests is pending, it's possible that the SyncManager is notified of a new block. When receiving a new block, the sync manager will ask the chain to process it: https://github.com/btcsuite/btcd/blob/67b8efd3ba53b60ff0eba5d79babe2c3d82f6c54/netsync/manager.go#L743-L746

Indirectly during processing, a new event is emitted to notify the server that there's a new block: https://github.com/btcsuite/btcd/blob/67b8efd3ba53b60ff0eba5d79babe2c3d82f6c54/netsync/manager.go#L1699

Herein lies our deadlock: peer -> server -> sync manager -> server -> peer:

This halts all new incoming peer handling, and also some RPC calls that want to cal into the SyncManager to query state.

Resolution Paths

One easy path comes to mind: can we just make the call from server to sync manager sync? So:

go s.syncManager.NewPeer(sp.Peer)

The state in the sync manager will be cleaned up once the peerDoneHandler exits: https://github.com/btcsuite/btcd/blob/67b8efd3ba53b60ff0eba5d79babe2c3d82f6c54/server.go#L2215.

Roasbeef commented 2 weeks ago
graph TD
    A[Server: RelayInventory]
    B[Server: relayTransactions]
    C[Server: AnnounceNewTransactions]
    D[SyncManager: handleBlockchainNotification]
    E[BlockChain: sendNotification]
    F[BlockChain: connectBlock]
    G[BlockChain: connectBestChain]
    H[BlockChain: maybeAcceptBlock]
    I[BlockChain: processOrphans]
    J[BlockChain: ProcessBlock]
    K[SyncManager: handleBlockMsg]
    L[SyncManager: blockHandler]

    D --> C
    C --> B
    B --> A
    D --> E
    E --> F
    F --> G
    G --> H
    H --> I
    I --> J
    J --> K
    K --> L
    A --> L

    M[Server: peerDoneHandler] --> A
    N[Server: AddPeer] --> A
    O[SyncManager: NewPeer] --> A