Closed robinbryce closed 5 years ago
We have reproduced this on 2.3.0.
Nodes 1 & 3 are happy. Node 2 forked at block 8264, then logged and added 7 further blocks before halting its chain.
The cluster is under very heavy load when this occurs, and we see a lot of "rafthttp: the clock difference against peer N is too high" messages prior to the fork event. And "high" is ridiculously high: we are seeing as much as 40s. The cluster VMs are Azure hosted and NTP enabled, so presumably this is due to transit time.
The fork occurs on node 2 almost immediately on restart, but we believe the logs from the fork event are accurate. We don't have logs from the 'good' peers at this time window.
DEBUG[10-16|21:03:59.321|geth/main.go:241] Sanitizing Go's GC trigger percent=100
INFO [10-16|21:03:59.322|cmd/utils/flags.go:982] Maximum peer count ETH=25 LES=0 total=25
DEBUG[10-16|21:03:59.323|accounts/keystore/file_cache.go:87] FS scan times list=325.802µs set=5µs diff=3.4µs
TRACE[10-16|21:03:59.323|accounts/keystore/watch.go:73] Started watching keystore folder path=/var/lib/quorum/node/keystore
TRACE[10-16|21:03:59.323|accounts/keystore/account_cache.go:296] Handled keystore changes time=435.601µs
INFO [10-16|21:03:59.324|node/node.go:169] Starting peer-to-peer node instance=Geth/soak-1-avid/v1.8.18-stable-99f7fd67(quorum-v2.3.0)/linux-amd64/go1.11.13
INFO [10-16|21:03:59.324|cmd/utils/flags.go:1226] set gcmode=archive for Raft
INFO [10-16|21:04:18.628|eth/backend.go:133] Initialised chain configuration config="{ChainID: 99 Homestead: 0 DAO: <nil> DAOSupport: false EIP150: 0 EIP155: 0 EIP158: 0 Byzantium: 0 IsQuorum: true Constantinople: <nil> Engine: unknown}"
INFO [10-16|21:03:59.324|ethdb/database.go:73] Allocated cache and file handles database=/var/lib/quorum/node/geth/chaindata cache=768 handles=524288
WARN [10-16|21:04:18.628|eth/backend.go:281] Ethash used in full fake mode
INFO [10-16|21:04:18.628|eth/backend.go:168] Initialising Ethereum protocol versions="[63 62]" network=99
INFO [10-16|21:04:18.629|core/blockchain.go:273] Loaded most recent local full block number=8263 hash=6c5feb…abeb64 td=1083047936 age=0
INFO [10-16|21:04:18.629|core/blockchain.go:272] Loaded most recent local header number=8263 hash=6c5feb…abeb64 td=1083047936 age=0
INFO [10-16|21:04:18.629|core/blockchain.go:274] Loaded most recent local fast block number=8263 hash=6c5feb…abeb64 td=1083047936 age=0
DEBUG[10-16|21:04:18.629|core/tx_pool.go:430] Reinjecting stale transactions count=0
INFO [10-16|21:04:19.525|core/tx_pool.go:715] Setting new local account address=0x87eD0b2Be296fF4a8b5Ef207d74A5Ff1851D8F36
TRACE[10-16|21:04:19.525|core/tx_pool.go:721] Pooled new future transaction hash=2b9200…7c2c44 from=0x87eD0b2Be296fF4a8b5Ef207d74A5Ff1851D8F36 to=0x13d3A5204590920ebc01953cDed1b4786a1Fe340
TRACE[10-16|21:04:19.525|core/tx_pool.go:721] Pooled new future transaction hash=94bf4e…2a0e6f from=0x87eD0b2Be296fF4a8b5Ef207d74A5Ff1851D8F36 to=0x13d3A5204590920ebc01953cDed1b4786a1Fe340
TRACE[10-16|21:04:19.526|core/tx_pool.go:721] Pooled new future transaction hash=14ee35…cdf67d from=0x87eD0b2Be296fF4a8b5Ef207d74A5Ff1851D8F36 to=0x13d3A5204590920ebc01953cDed1b4786a1Fe340
TRACE[10-16|21:04:19.527|core/tx_pool.go:721] Pooled new future transaction hash=b2e501…848541 from=0x87eD0b2Be296fF4a8b5Ef207d74A5Ff1851D8F36 to=0x13d3A5204590920ebc01953cDed1b4786a1Fe340
TRACE[10-16|21:04:19.526|core/tx_pool.go:721] Pooled new future transaction hash=c3c8ee…a7c720 from=0x87eD0b2Be296fF4a8b5Ef207d74A5Ff1851D8F36 to=0x13d3A5204590920ebc01953cDed1b4786a1Fe340
TRACE[10-16|21:04:19.527|core/tx_pool.go:721] Pooled new future transaction hash=0ee30b…e28e1e from=0x87eD0b2Be296fF4a8b5Ef207d74A5Ff1851D8F36 to=0x13d3A5204590920ebc01953cDed1b4786a1Fe340
TRACE[10-16|21:04:19.528|core/tx_pool.go:721] Pooled new future transaction hash=c3965a…aa0052 from=0x87eD0b2Be296fF4a8b5Ef207d74A5Ff1851D8F36 to=0x13d3A5204590920ebc01953cDed1b4786a1Fe340
TRACE[10-16|21:04:19.528|core/tx_pool.go:721] Pooled new future transaction hash=67906c…42a241 from=0x87eD0b2Be296fF4a8b5Ef207d74A5Ff1851D8F36 to=0x13d3A5204590920ebc01953cDed1b4786a1Fe340
TRACE[10-16|21:04:19.528|core/tx_pool.go:721] Pooled new future transaction hash=8a9d7c…f0b2e9 from=0x87eD0b2Be296fF4a8b5Ef207d74A5Ff1851D8F36 to=0x13d3A5204590920ebc01953cDed1b4786a1Fe340
TRACE[10-16|21:04:19.529|core/tx_pool.go:721] Pooled new future transaction hash=3c3bfc…261726 from=0x87eD0b2Be296fF4a8b5Ef207d74A5Ff1851D8F36 to=0x13d3A5204590920ebc01953cDed1b4786a1Fe340
TRACE[10-16|21:04:19.529|core/tx_pool.go:721] Pooled new future transaction hash=946596…bb8716 from=0x87eD0b2Be296fF4a8b5Ef207d74A5Ff1851D8F36 to=0x13d3A5204590920ebc01953cDed1b4786a1Fe340
TRACE[10-16|21:04:19.530|core/tx_pool.go:721] Pooled new future transaction hash=c2b612…8beb81 from=0x87eD0b2Be296fF4a8b5Ef207d74A5Ff1851D8F36 to=0x13d3A5204590920ebc01953cDed1b4786a1Fe340
TRACE[10-16|21:04:19.530|core/tx_pool.go:721] Pooled new future transaction hash=7c8f25…fac2d4 from=0x87eD0b2Be296fF4a8b5Ef207d74A5Ff1851D8F36 to=0x13d3A5204590920ebc01953cDed1b4786a1Fe340
TRACE[10-16|21:04:19.530|core/tx_pool.go:998] Promoting queued transaction hash=2b9200…7c2c44
TRACE[10-16|21:04:19.530|core/tx_pool.go:996] Promoting queued transaction hash=2b9200…7c2c44
TRACE[10-16|21:04:19.530|core/tx_pool.go:996] Promoting queued transaction hash=94bf4e…2a0e6f
TRACE[10-16|21:04:19.530|core/tx_pool.go:996] Promoting queued transaction hash=14ee35…cdf67d
TRACE[10-16|21:04:19.530|core/tx_pool.go:998] Promoting queued transaction hash=94bf4e…2a0e6f
TRACE[10-16|21:04:19.530|core/tx_pool.go:998] Promoting queued transaction hash=14ee35…cdf67d
TRACE[10-16|21:04:19.530|core/tx_pool.go:996] Promoting queued transaction hash=c3c8ee…a7c720
TRACE[10-16|21:04:19.530|core/tx_pool.go:998] Promoting queued transaction hash=c3c8ee…a7c720
TRACE[10-16|21:04:19.530|core/tx_pool.go:996] Promoting queued transaction hash=b2e501…848541
TRACE[10-16|21:04:19.530|core/tx_pool.go:996] Promoting queued transaction hash=0ee30b…e28e1e
TRACE[10-16|21:04:19.530|core/tx_pool.go:998] Promoting queued transaction hash=b2e501…848541
TRACE[10-16|21:04:19.530|core/tx_pool.go:998] Promoting queued transaction hash=0ee30b…e28e1e
TRACE[10-16|21:04:19.530|core/tx_pool.go:996] Promoting queued transaction hash=c3965a…aa0052
TRACE[10-16|21:04:19.530|core/tx_pool.go:998] Promoting queued transaction hash=c3965a…aa0052
TRACE[10-16|21:04:19.530|core/tx_pool.go:996] Promoting queued transaction hash=67906c…42a241
TRACE[10-16|21:04:19.530|core/tx_pool.go:998] Promoting queued transaction hash=67906c…42a241
TRACE[10-16|21:04:19.530|core/tx_pool.go:998] Promoting queued transaction hash=8a9d7c…f0b2e9
TRACE[10-16|21:04:19.530|core/tx_pool.go:996] Promoting queued transaction hash=8a9d7c…f0b2e9
TRACE[10-16|21:04:19.530|core/tx_pool.go:998] Promoting queued transaction hash=946596…bb8716
TRACE[10-16|21:04:19.530|core/tx_pool.go:996] Promoting queued transaction hash=946596…bb8716
TRACE[10-16|21:04:19.531|core/tx_pool.go:998] Promoting queued transaction hash=3c3bfc…261726
TRACE[10-16|21:04:19.530|core/tx_pool.go:996] Promoting queued transaction hash=3c3bfc…261726
TRACE[10-16|21:04:19.531|core/tx_pool.go:996] Promoting queued transaction hash=c2b612…8beb81
TRACE[10-16|21:04:19.531|core/tx_pool.go:996] Promoting queued transaction hash=7c8f25…fac2d4
TRACE[10-16|21:04:19.531|core/tx_pool.go:998] Promoting queued transaction hash=c2b612…8beb81
TRACE[10-16|21:04:19.531|core/tx_pool.go:998] Promoting queued transaction hash=7c8f25…fac2d4
INFO [10-16|21:04:19.613|core/tx_journal.go:114] Loaded local transaction journal transactions=592 dropped=579
INFO [10-16|21:04:19.614|core/tx_journal.go:166] Regenerated local transaction journal transactions=13 accounts=1
DEBUG[10-16|21:04:19.615|eth/downloader/downloader.go:1630] Recalculated downloader QoS values rtt=20s confidence=1.000 ttl=1m0s
DEBUG[10-16|21:04:19.694|p2p/server.go:829] TCP listener up addr=[::]:21000
INFO [10-16|21:04:19.695|raft/handler.go:138] starting raft protocol handler
INFO [10-16|21:04:19.696|p2p/server.go:609] Started P2P networking self="enode://08b00ed309ef5de9f4d8d8ca2669e8af350d065696e818d7e323e15b9a5685da7a16b1e3fd07b9938873feca3f99ba73d31905a00a0b4001045ded808bea9c8c@127.0.0.1:21000?discport=0"
INFO [10-16|21:04:19.696|p2p/enode/localnode.go:240] New local node record seq=4 id=853e17cae5b9df59 ip=127.0.0.1 udp=0 tcp=21000
TRACE[10-16|21:04:19.696|p2p/server.go:641] New dial task task="staticdial 9104e25e7f1d50b0 40.91.225.80:21000"
WARN [10-16|21:04:19.696|p2p/dial.go:199] Removing static dial candidate id=0x8edd20 addr=52.166.166.244:21000 err="is self"
INFO [10-16|21:04:19.698|raft/persistence.go:47] loaded the latest applied index lastAppliedIndex=8318
TRACE[10-16|21:04:19.712|p2p/server.go:641] New dial task task="staticdial 89c2d44d423570ec 40.118.105.71:21000"
INFO [10-16|21:04:19.713|raft/snapshot.go:182] loading snapshot
INFO [10-16|21:04:19.713|raft/snapshot.go:112] updating cluster membership per raft snapshot
INFO [10-16|21:04:19.713|raft/snapshot.go:262] applying snapshot to raft storage
2019-10-16 21:04:19.714186 I | rafthttp: starting peer 1...
INFO [10-16|21:04:19.713|raft/snapshot.go:154] adding new raft peer raft id=1
TRACE[10-16|21:04:19.715|p2p/server.go:669] Adding static node node="enode://0646f311c9762d85f07cad8b39ea61e785205d5031a0d83584e28268f35bde7ff394a821a9f058d3aa91390ba3dd0d94a76acf6fa66a236066f225e1265f796c@40.91.225.80:21000?discport=0&raftport=50400"
2019-10-16 21:04:19.714338 I | rafthttp: started HTTP pipelining with peer 1
2019-10-16 21:04:19.715591 I | rafthttp: started streaming with peer 1 (writer)
2019-10-16 21:04:19.715399 I | rafthttp: started streaming with peer 1 (writer)
2019-10-16 21:04:19.716213 I | rafthttp: started peer 1
INFO [10-16|21:04:19.716|raft/snapshot.go:154] adding new raft peer raft id=3
2019-10-16 21:04:19.716242 I | rafthttp: added peer 1
2019-10-16 21:04:19.716266 I | rafthttp: started streaming with peer 1 (stream MsgApp v2 reader)
2019-10-16 21:04:19.716414 I | rafthttp: started HTTP pipelining with peer 3
2019-10-16 21:04:19.716397 I | rafthttp: starting peer 3...
2019-10-16 21:04:19.717097 I | rafthttp: started streaming with peer 3 (writer)
2019-10-16 21:04:19.716465 I | rafthttp: started streaming with peer 1 (stream Message reader)
2019-10-16 21:04:19.717448 I | rafthttp: started streaming with peer 3 (writer)
TRACE[10-16|21:04:19.716|p2p/server.go:669] Adding static node node="enode://6a174feeb3c0c2b816a40bc6077dadecb9698f59b94a6aad9aa862930f843d16ca51967ddb23374b42fac46574edc7f634f54383f36edab5647ba8d89da965f8@40.118.105.71:21000?discport=0&raftport=50400"
2019-10-16 21:04:19.717929 I | rafthttp: started peer 3
2019-10-16 21:04:19.717971 I | rafthttp: started streaming with peer 3 (stream MsgApp v2 reader)
2019-10-16 21:04:19.717955 I | rafthttp: added peer 3
INFO [10-16|21:04:19.717|raft/snapshot.go:164] updated cluster membership
2019-10-16 21:04:19.717989 I | rafthttp: started streaming with peer 3 (stream Message reader)
INFO [10-16|21:04:19.719|raft/snapshot.go:282] blockchain is caught up; no need to synchronize
INFO [10-16|21:04:19.719|raft/wal.go:27] loading WAL term=0 index=0
INFO [10-16|21:04:19.719|raft/wal.go:42] replaying WAL
TRACE[10-16|21:04:19.721|p2p/server.go:933] Quorum permissioning id=89c2d44d423570ec addr=40.118.105.71:21000 conn=staticdial EnableNodePermission=false DataDir=/var/lib/quorum/node Current Node ID=853e17cae5b9df59c4db5fa82ab4503086288ba981b21ede0812c96d9e39ac28 Node Name=Geth/soak-1-avid/v1.8.18-stable-99f7fd67(quorum-v2.3.0)/linux-amd64/go1.11.13 Dialed Dest="enode://6a174feeb3c0c2b816a40bc6077dadecb9698f59b94a6aad9aa862930f843d16ca51967ddb23374b42fac46574edc7f634f54383f36edab5647ba8d89da965f8@40.118.105.71:21000?discport=0&raftport=50400" Connection ID=89c2d44d423570ec Connection String=89c2d44d423570ec2ba30828e91acb7b6180b9880634583ac5f1d2fbe5486fdd
2019-10-16 21:04:19.725011 I | rafthttp: peer 3 became active
TRACE[10-16|21:04:19.721|p2p/server.go:956] Node Permissioning is Disabled. id=89c2d44d423570ec addr=40.118.105.71:21000 conn=staticdial
2019-10-16 21:04:19.725102 I | rafthttp: established a TCP streaming connection with peer 3 (stream MsgApp v2 reader)
2019-10-16 21:04:19.725049 I | rafthttp: established a TCP streaming connection with peer 3 (stream Message reader)
DEBUG[10-16|21:04:19.726|p2p/server.go:737] Adding p2p peer name=Geth/soak-2-avid/v1.... addr=40.118.105.71:21000 peers=1
DEBUG[10-16|21:04:19.726|eth/handler.go:282] Ethereum peer connected id=89c2d44d423570ec conn=staticdial name=Geth/soak-2-avid/v1.8.18-stable-99f7fd67(quorum-v2.3.0)/linux-amd64/go1.11.13
TRACE[10-16|21:04:19.726|p2p/peer.go:359] Starting protocol eth/63 id=89c2d44d423570ec conn=staticdial
TRACE[10-16|21:04:19.726|eth/downloader/downloader.go:278] Registering sync peer peer=89c2d44d423570ec
TRACE[10-16|21:04:19.812|p2p/server.go:988] connection set up id=89c2d44d423570ec addr=40.118.105.71:21000 conn=staticdial inbound=false
TRACE[10-16|21:04:19.726|eth/sync.go:89] Sending batch of transactions id=89c2d44d423570ec conn=staticdial count=13 bytes=36.25kB
TRACE[10-16|21:04:19.812|p2p/server.go:708] Dial task done task="staticdial 89c2d44d423570ec 40.118.105.71:21000"
TRACE[10-16|21:04:19.813|p2p/server.go:956] Node Permissioning is Disabled. id=9104e25e7f1d50b0 addr=40.91.225.80:21000 conn=staticdial
TRACE[10-16|21:04:19.813|p2p/server.go:933] Quorum permissioning id=9104e25e7f1d50b0 addr=40.91.225.80:21000 conn=staticdial EnableNodePermission=false DataDir=/var/lib/quorum/node Current Node ID=853e17cae5b9df59c4db5fa82ab4503086288ba981b21ede0812c96d9e39ac28 Node Name=Geth/soak-1-avid/v1.8.18-stable-99f7fd67(quorum-v2.3.0)/linux-amd64/go1.11.13 Dialed Dest="enode://0646f311c9762d85f07cad8b39ea61e785205d5031a0d83584e28268f35bde7ff394a821a9f058d3aa91390ba3dd0d94a76acf6fa66a236066f225e1265f796c@40.91.225.80:21000?discport=0&raftport=50400" Connection ID=9104e25e7f1d50b0 Connection String=9104e25e7f1d50b0f8e411a47f0331f9c58cc0958079b93d0e99b37cb976cb66
TRACE[10-16|21:04:19.814|p2p/server.go:988] connection set up id=9104e25e7f1d50b0 addr=40.91.225.80:21000 conn=staticdial inbound=false
DEBUG[10-16|21:04:19.814|p2p/server.go:737] Adding p2p peer name=Geth/soak-0-avid/v1.... addr=40.91.225.80:21000 peers=2
TRACE[10-16|21:04:19.814|p2p/server.go:708] Dial task done task="staticdial 9104e25e7f1d50b0 40.91.225.80:21000"
TRACE[10-16|21:04:19.814|p2p/peer.go:359] Starting protocol eth/63 id=9104e25e7f1d50b0 conn=staticdial
TRACE[10-16|21:04:19.814|p2p/server.go:641] New dial task task="wait for dial hist expire (29.998601695s)"
TRACE[10-16|21:04:19.817|eth/downloader/downloader.go:278] Registering sync peer peer=9104e25e7f1d50b0
DEBUG[10-16|21:04:19.814|eth/handler.go:282] Ethereum peer connected id=9104e25e7f1d50b0 conn=staticdial name=Geth/soak-0-avid/v1.8.18-stable-99f7fd67(quorum-v2.3.0)/linux-amd64/go1.11.13
DEBUG[10-16|21:04:19.818|eth/downloader/downloader.go:1664] Relaxed downloader QoS values rtt=20s confidence=0.500 ttl=1m0s
INFO [10-16|21:04:20.220|raft/handler.go:468] startRaft raft ID=2
TRACE[10-16|21:04:19.818|eth/sync.go:89] Sending batch of transactions id=9104e25e7f1d50b0 conn=staticdial count=13 bytes=36.25kB
INFO [10-16|21:04:20.220|raft/handler.go:471] remounting an existing raft log; connecting to peers.
raft2019/10/16 21:04:20 INFO: newRaft 2 [peers: [1,2,3], term: 198, commit: 8318, applied: 8318, lastindex: 8319, lastterm: 198]
raft2019/10/16 21:04:20 INFO: 2 became follower at term 198
<snip>
[10-16|21:04:20.228|rpc/endpoints.go:39] HTTP registered namespace=net
[10-16|21:04:20.228|node/node.go:346] HTTP endpoint opened url=http://0.0.0.0:22000 cors=* vhosts=*
2019-10-16 21:04:20.242976 I | rafthttp: established a TCP streaming connection with peer 3 (stream Message writer)
raft2019/10/16 21:04:20 INFO: 2 became follower at term 199
raft2019/10/16 21:04:20 INFO: 2 [term: 198] received a MsgHeartbeat message with higher term from 1 [term: 199]
raft2019/10/16 21:04:20 INFO: raft.node: 2 elected leader 1 at term 199
INFO [10-16|21:04:20.250|raft/handler.go:367] peer is currently unreachable peer id=1
INFO [10-16|21:04:20.250|raft/handler.go:367] peer is currently unreachable peer id=1
INFO [10-16|21:04:20.250|raft/handler.go:367] peer is currently unreachable peer id=1
INFO [10-16|21:04:20.288|raft/handler.go:367] peer is currently unreachable peer id=1
2019-10-16 21:04:20.313228 I | rafthttp: peer 1 became active
2019-10-16 21:04:20.313257 I | rafthttp: established a TCP streaming connection with peer 1 (stream MsgApp v2 writer)
2019-10-16 21:04:20.331809 E | rafthttp: failed to dial 1 on stream MsgApp v2 (EOF)
INFO [10-16|21:04:20.388|raft/handler.go:367] peer is currently unreachable peer id=1
2019-10-16 21:04:20.331983 I | rafthttp: peer 1 became inactive
2019-10-16 21:04:20.407586 I | rafthttp: peer 1 became active
2019-10-16 21:04:20.407609 I | rafthttp: established a TCP streaming connection with peer 1 (stream Message writer)
2019-10-16 21:04:20.434286 I | rafthttp: peer 1 became inactive
2019-10-16 21:04:20.434265 E | rafthttp: failed to dial 1 on stream Message (EOF)
INFO [10-16|21:04:20.596|raft/persistence.go:53] persisted the latest applied index index=8319
WARN [10-16|21:04:20.596|raft/handler.go:750] not applying already-applied block block hash=ef3c79…143d38 parent=6c5feb…abeb64 head=6c5feb…abeb64
INFO [10-16|21:04:20.596|raft/handler.go:876] Non-extending block block=fc7cc7…fa4156 parent=ef3c79…143d38 head=6c5feb…abeb64
INFO [10-16|21:04:20.596|raft/persistence.go:53] persisted the latest applied index index=8320
INFO [10-16|21:04:20.596|raft/persistence.go:53] persisted the latest applied index index=8321
INFO [10-16|21:04:20.597|raft/minter.go:146] Someone else mined invalid block; ignoring block=fc7cc7…fa4156
INFO [10-16|21:04:20.597|raft/minter.go:139] Handling InvalidRaftOrdering invalid block=fc7cc7…fa4156 current head=6c5feb…abeb64
INFO [10-16|21:04:20.604|raft/handler.go:876] Non-extending block block=418bdc…fb3aaf parent=fc7cc7…fa4156 head=6c5feb…abeb64
INFO [10-16|21:04:20.605|raft/persistence.go:53] persisted the latest applied index index=8322
INFO [10-16|21:04:20.605|raft/minter.go:146] Someone else mined invalid block; ignoring block=418bdc…fb3aaf
INFO [10-16|21:04:20.605|raft/minter.go:139] Handling InvalidRaftOrdering invalid block=418bdc…fb3aaf current head=6c5feb…abeb64
Note: we are filtering out messages matching "Discarding invalid" or "Failed to add journaled transaction"
And the raft clock errors:
RAFT ID 2
2019-10-16 09:59:48.548168 W | rafthttp: the clock difference against peer 1 is too high [27.435101706s > 1s]
2019-10-16 09:59:48.548168 W | rafthttp: the clock difference against peer 1 is too high [27.435101706s > 1s]
2019-10-16 16:49:40.112827 W | rafthttp: the clock difference against peer 1 is too high [1.015866136s > 1s]
2019-10-16 16:49:40.115093 W | rafthttp: the clock difference against peer 3 is too high [1.021036979s > 1s]
2019-10-17 00:39:33.716701 W | rafthttp: the clock difference against peer 3 is too high [6.481870686s > 1s]
2019-10-17 00:39:33.716701 W | rafthttp: the clock difference against peer 3 is too high [6.481870686s > 1s]
RAFT ID 1
2019-10-16 12:13:14.813774 W | rafthttp: the clock difference against peer 3 is too high [45.062941552s > 1s]
2019-10-16 12:37:51.115223 W | rafthttp: the clock difference against peer 3 is too high [45.375664332s > 1s]
2019-10-16 14:25:32.728978 W | rafthttp: the clock difference against peer 3 is too high [1m2.287023109s > 1s]
2019-10-16 14:25:33.516052 W | rafthttp: the clock difference against peer 2 is too high [1m2.222141763s > 1s]
2019-10-16 14:28:31.110694 W | rafthttp: the clock difference against peer 2 is too high [56.91821301s > 1s]
2019-10-16 14:28:31.110733 W | rafthttp: the clock difference against peer 3 is too high [56.873696849s > 1s]
2019-10-16 17:03:05.116390 W | rafthttp: the clock difference against peer 3 is too high [8.875462204s > 1s]
2019-10-16 17:03:05.117400 W | rafthttp: the clock difference against peer 2 is too high [7.944704472s > 1s]
2019-10-16 17:16:25.417371 W | rafthttp: the clock difference against peer 3 is too high [36.583285569s > 1s]
2019-10-16 17:16:25.419877 W | rafthttp: the clock difference against peer 2 is too high [1m2.751224032s > 1s]
2019-10-16 17:20:00.208480 W | rafthttp: the clock difference against peer 2 is too high [20.349888606s > 1s]
2019-10-16 17:20:37.210882 W | rafthttp: the clock difference against peer 2 is too high [5.669596817s > 1s]
2019-10-16 17:20:37.412883 W | rafthttp: the clock difference against peer 3 is too high [6.158095999s > 1s]
2019-10-16 17:21:11.508418 W | rafthttp: the clock difference against peer 3 is too high [24.538630398s > 1s]
2019-10-16 17:21:11.617807 W | rafthttp: the clock difference against peer 2 is too high [22.789322003s > 1s]
2019-10-16 17:21:48.012392 W | rafthttp: the clock difference against peer 2 is too high [28.443694523s > 1s]
2019-10-16 17:21:48.016292 W | rafthttp: the clock difference against peer 3 is too high [24.538630398s > 1s]
2019-10-16 17:23:19.114783 W | rafthttp: the clock difference against peer 2 is too high [1m3.430416129s > 1s]
2019-10-16 17:23:19.114830 W | rafthttp: the clock difference against peer 3 is too high [1m2.412497441s > 1s]
2019-10-16 17:24:42.109883 W | rafthttp: the clock difference against peer 3 is too high [5.655539993s > 1s]
2019-10-16 17:24:42.116441 W | rafthttp: the clock difference against peer 2 is too high [5.891067439s > 1s]
2019-10-16 17:26:16.312790 W | rafthttp: the clock difference against peer 3 is too high [38.925873747s > 1s]
2019-10-16 17:26:16.312813 W | rafthttp: the clock difference against peer 2 is too high [39.019775262s > 1s]
We aim to provide a standalone reproduction of this.
So in the above log on the forked node we see
RAFT ID 2
WARN [10-16|21:04:20.596|raft/handler.go:750] not applying already-applied block block hash=ef3c79…143d38 parent=6c5feb…abeb64 head=6c5feb…abeb64
Yet, if I now geth attach and dump the hashes of the blocks I get
RAFT ID 2
8263 0x6c5febbf975a82800e274384f8a98b3f61fea4595b24b0657b4df195f3abeb64
8264 0x290756e5af8aeae91960f90aea12b8bb09997c5cf7c00166c17334ad584d94cf
8265 0x1d9614ba6ccac2364bc21848975abb20d8c2fb07a2136f8262266861b2be304a
RAFT ID 1
8263 0x6c5febbf975a82800e274384f8a98b3f61fea4595b24b0657b4df195f3abeb64
8264 0xef3c7932f818a4c3820275621e37385c80c07520522c4c0be63be52cf1143d38
8265 0xfc7cc700e43c1fed8467ccdbd4e1cbacdf0c9429ffe193f08cc4c4dee2fa4156
So RAFT ID 2 is rejecting the block, claiming it already applied it, yet later the 'already applied' block is not there. Speculative minting?
So it looks like we have two block 8264's on RAFT ID 2. Handling of pending transactions on startup is implicated. The first block 8264 is received around 20:57 GMT and has a single tx in it. The second appears around 22:15 and includes 385 txs. The first tx of 'all' block 8264's on all nodes is the same hash. The only tx in block 8265 on the healthy chain matches the 2nd tx in the 'phantom' block.
web3.eth.getBlock(n) for n in 8263-8265 on each node (block number, transaction count, hash):
RAFT ID 1
8263 1 0x6c5febbf975a82800e274384f8a98b3f61fea4595b24b0657b4df195f3abeb64
8264 1 0xef3c7932f818a4c3820275621e37385c80c07520522c4c0be63be52cf1143d38
8265 1 0xfc7cc700e43c1fed8467ccdbd4e1cbacdf0c9429ffe193f08cc4c4dee2fa4156
RAFT ID 2
8263 1 0x6c5febbf975a82800e274384f8a98b3f61fea4595b24b0657b4df195f3abeb64
8264 385 0x290756e5af8aeae91960f90aea12b8bb09997c5cf7c00166c17334ad584d94cf
8265 182 0x1d9614ba6ccac2364bc21848975abb20d8c2fb07a2136f8262266861b2be304a
RAFT ID 3
8263 1 0x6c5febbf975a82800e274384f8a98b3f61fea4595b24b0657b4df195f3abeb64
8264 1 0xef3c7932f818a4c3820275621e37385c80c07520522c4c0be63be52cf1143d38
8265 1 0xfc7cc700e43c1fed8467ccdbd4e1cbacdf0c9429ffe193f08cc4c4dee2fa4156
The various tx's
(soak-0 is RAFT ID 1, sorry)
All tx[0]'s are the same
gethjex.py soak-0 "web3.eth.getBlock(8264).transactions[0]" "0x2b9200ef289cf361ab54c0bfcceff368bb2d1339f0083eca596b17a1b27c2c44"
gethjex.py soak-1 "web3.eth.getBlock(8264).transactions[0]" "0x2b9200ef289cf361ab54c0bfcceff368bb2d1339f0083eca596b17a1b27c2c44"
gethjex.py soak-2 "web3.eth.getBlock(8264).transactions[0]" "0x2b9200ef289cf361ab54c0bfcceff368bb2d1339f0083eca596b17a1b27c2c44"
soak-1 8264 tx[1] == soak-0 8265 tx[0]
gethjex.py soak-1 "web3.eth.getBlock(8264).transactions[1]" "0x94bf4e48bdb18134c7774d698f98eccca86f1b3524e1585e5843a171022a0e6f"
gethjex.py soak-0 "web3.eth.getBlock(8265).transactions[0]" "0x94bf4e48bdb18134c7774d698f98eccca86f1b3524e1585e5843a171022a0e6f"
The two blocks with block number 8264 (RAFT ID 1's first, then RAFT ID 2's):
{
difficulty: 131072,
extraData: "0x0000000000000000000000000000000000000000000000000000000000000000f84432b841b66f9d2dec6fa6601092707c9417dfe899321f55eed355ec6409d3a0f6072a374b2d740a026e254b5c013d4061ed92fb94917db9efef9c0beb8289a98cba80e000",
gasLimit: 3000000000,
gasUsed: 1944208,
hash: "0xef3c7932f818a4c3820275621e37385c80c07520522c4c0be63be52cf1143d38",
logsBloom: "0x00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000004000000008000000000000000000000000000000000000000000000000028000000000000000000800000000000000000000000010000000000000000000000000000000002100000000000000000000000000000000000000000002000000000400000001000000000000000000000000000000000020000000000002000000000000000000000020000000000000000000000000000020000000000000000004000040000000000000000000000000000000000000040000",
miner: "0x0000000000000000000000000000000000000000",
mixHash: "0x0000000000000000000000000000000000000000000000000000000000000000",
nonce: "0x0000000000000000",
number: 8264,
parentHash: "0x6c5febbf975a82800e274384f8a98b3f61fea4595b24b0657b4df195f3abeb64",
receiptsRoot: "0x75987861c66cd68c4500e1f74f51f796fca777660ca64d65d5d30326506dedc4",
sha3Uncles: "0x1dcc4de8dec75d7aab85b567b6ccd41ad312451b948a7413f0a142fd40d49347",
size: 3741,
stateRoot: "0x473718e1a7f814a4b90fb09c496acdd16edf79ad33ab2ee384553bb3dbf396b7",
timestamp: 1571259467322762200,
totalDifficulty: 1083179008,
transactions: ["0x2b9200ef289cf361ab54c0bfcceff368bb2d1339f0083eca596b17a1b27c2c44"],
transactionsRoot: "0x4b56937d30876e6d0848f08957d7b04236419965e396d20fe70c84dceef576af",
uncles: []
}
{
difficulty: 131072,
extraData: "0x0000000000000000000000000000000000000000000000000000000000000000f84432b841dcf9d7fc39e29dc1b7b8144c3c08bd7a38d6ea7ab2b157b074e5bf8fec93280b39e37db1b3e1ce0ebb0b3f50b00bd2f2a87c9d1474c3b29b1bf14642673dc50e01",
gasLimit: 3000000000,
gasUsed: 456170013,
hash: "0x290756e5af8aeae91960f90aea12b8bb09997c5cf7c00166c17334ad584d94cf",
logsBloom: "0xb1048920a004244a1a502b00d04238b00208b0662c82016296402384481854890000106012404d800800201e00609e48234c40c8226201a2c1005c0020b0803281510ac294c10950610c300824ec0cd500000a080290804a08931047d3c8d052009d5812ae880990840900162262c9e4438120920485021d10825a1481508402038433a6100002802800c4546930908c9214020010c1416d0106160114b050086202230c0139401680095e710c60010c05450409014c0a0084448c5048740045988ea35b800813000511061b48a08d2898048404120008480904440c3f40b20a142461230b94c8151429ed4080b16a47880e8310023a0325aa20518666a48200",
miner: "0x0000000000000000000000000000000000000000",
mixHash: "0x0000000000000000000000000000000000000000000000000000000000000000",
nonce: "0x0000000000000000",
number: 8264,
parentHash: "0x6c5febbf975a82800e274384f8a98b3f61fea4595b24b0657b4df195f3abeb64",
receiptsRoot: "0xa4fc7a7049007926657173e15f067af1e780ced6d9e2cad8597a43610f37a980",
sha3Uncles: "0x1dcc4de8dec75d7aab85b567b6ccd41ad312451b948a7413f0a142fd40d49347",
size: 1100413,
stateRoot: "0x4ce60013e947385f7e6441b0b53316993ee3bac55c3bc135d3d65dd83e8dbf2e",
timestamp: 1571264131626095800,
totalDifficulty: 1083179008,
transactions: ["0x2b9200ef289cf361ab54c0bfcceff368bb2d1339f0083eca596b17a1b27c2c44", "0x94bf4e48bdb18134c7774d698f98eccca86f1b3524e1585e5843a171022a0e6f"
We are suspicious of this line, which passes the block number but which the cache check in the implementation ignores: https://github.com/jpmorganchase/quorum/blob/890365de86cd485464a85440baab9453456c7187/raft/handler.go#L738
Hi @robinbryce, are you running the network in a dockerized environment?
Azure AKS. Managed kubernetes, docker images. In an istio service mesh.
In my testing I have seen messages like 2019-10-16 14:25:32.728978 W | rafthttp: the clock difference against peer 3 is too high [1m2.287023109s > 1s] happen in a dockerized environment, and the cause was that under heavy load the containers exhaust the memory assigned to them and docker kills/restarts the nodes. This could cause the network to stall. Do you see a 'killed' log entry in the logs?
Also, is it possible to run the same test in a non-dockerized environment and see if the issue happens there as well?
On the cluster we are requesting an initial 1GB RAM per quorum node with a burst limit of 2GB. Prior to that we were getting OOMKilled. We don't see evidence of that now. We are working on a repro case separate from our cluster infrastructure, but that would also (most easily) use docker. Are you saying it's unreasonable for us to expect geth/quorum to deal with that scenario, or are you just after an extra data point?
In my previous testing using docker, I noticed the clock difference messages after docker killed and restarted the geth nodes, and when a node came back up it stalled. I was wondering if you are seeing the same behavior.
We are wondering if it might be appropriate to merge this with #852, or simply close this one but preserve the logs for contribution to #852?
Closing this issue. Refer to #852 for tracking. Fix pending for the next client release.
After losing contact with one of 3 nodes, the remaining two mined inconsistent blocks and continued minting. We had expected that a fork was not possible and that, at worst, the network would stop accepting transactions (mining blocks).
System information
Geth Version: 1.8.18-stable Git Commit: c894c2d70eacf30740d03b53ed2fb39e42641295 Quorum Version: 2.2.5 Architecture: amd64 Protocol Versions: [63 62] Network Id: 1337 Go Version: go1.11.12 Operating System: linux
Expected behaviour
As per the remarks about transaction finality and the absence of forks at https://docs.goquorum.com/en/latest/Consensus/raft/, I expected the chains would never fork.
In a 3-node network I would expect that if 1 node goes offline and the remaining two nodes have an unstable network connection, then transactions would not get accepted unless the two nodes can establish a connection. Moreover, if I partition 1 node in a 3-node network, I would expect that partitioned node to not be able to mint new blocks. However, we have encountered a situation where a partitioned single node was able to mint blocks.
Actual behaviour
A fork is created at block 31545.
Node 1 becomes completely unresponsive
Node 2 advances ~100 blocks and stops.
Node 3 continues minting past 37213
Node 3 is logging 'Handling InvalidRaftOrdering'
Steps to reproduce the behaviour
Backtrace
Our log data from when the fork occurred is incomplete. We do have logs covering the remaining two nodes from about 12 hours after the fork.