AleoNet / anf-snarkOS

A Decentralized Operating System for Zero-Knowledge Applications
http://snarkos.org
Apache License 2.0
15 stars 4 forks source link

[Bug] Validator stops taking active part of block production #9

Open WietzeSlagman opened 7 months ago

WietzeSlagman commented 7 months ago

πŸ› Bug Report

Validator (aleo1z838c4zukt8qpwndz3gnjd374skg6zucd8q6pz30r3v2gcuh4q8qvdf8p9) on Canary net stopped participating in consensus after round 37666, and stopped producing blocks (never elected as leader) and thus also potentially stopped accruing rewards. The node itself was not down and kept being insync with tip, receiving new blocks and propagating them to it's clients. EDIT: Rewards seem to be unimpacted

Following this change in behaviour the logs produced by the validator skyrocketed, produced the same amount of log data in 2 days as it did in the previous 10. This could be related to it having to receive all memorypool requests over and over again without being able to process any of it into blocks.

Steps to Reproduce

logs around the time of occurence:

Round 337666 elected a leader - aleo1jtmdv03zle4q9wc6tjn7pnhjezhjthjae6rqt669t3a97r58sygs3sg79u

2024-04-22T16:05:29.172167Z  INFO snarkos_node_bft::helpers::storage: Starting round 337667...
2024-04-22T16:05:29.172185Z DEBUG snarkos_node_bft::primary: Primary is ready to propose the next round
2024-04-22T16:05:29.172294Z  INFO snarkos_node_bft::primary: Proposing a batch with 0 transmissions for round 337667...
2024-04-22T16:05:29.175265Z TRACE snarkos_node_bft::gateway: [MemoryPool] Sending 'BatchPropose' to '34.16.172.142:5000'
2024-04-22T16:05:29.175307Z TRACE snarkos_node_bft::gateway: [MemoryPool] Sending 'BatchPropose' to '147.28.147.81:5000'
2024-04-22T16:05:29.175336Z TRACE snarkos_node_bft::gateway: [MemoryPool] Sending 'BatchPropose' to '35.186.144.105:5000'
2024-04-22T16:05:29.175345Z TRACE snarkos_node_bft::gateway: [MemoryPool] Sending 'BatchPropose' to '35.196.17.175:5000'
2024-04-22T16:05:29.175355Z TRACE snarkos_node_bft::gateway: [MemoryPool] Sending 'BatchPropose' to '65.108.121.160:5000'
2024-04-22T16:05:29.175363Z TRACE snarkos_node_bft::gateway: [MemoryPool] Sending 'BatchPropose' to '34.72.164.175:5000'
2024-04-22T16:05:29.175371Z TRACE snarkos_node_bft::gateway: [MemoryPool] Sending 'BatchPropose' to '3.111.151.121:5000'
2024-04-22T16:05:29.175380Z TRACE snarkos_node_bft::gateway: [MemoryPool] Sending 'BatchPropose' to '69.84.134.115:5000'
2024-04-22T16:05:29.175388Z TRACE snarkos_node_bft::gateway: [MemoryPool] Sending 'BatchPropose' to '104.61.63.153:5000'
2024-04-22T16:05:29.206283Z TRACE snarkos_node_bft::gateway: [MemoryPool] Received 'CertificateRequest' from '65.108.121.160:5000'
2024-04-22T16:05:29.206376Z TRACE snarkos_node_bft::gateway: [MemoryPool] Sending 'CertificateResponse' to '65.108.121.160:5000'
2024-04-22T16:05:29.242409Z TRACE snarkos_node_bft::gateway: [MemoryPool] Received 'BatchSignature' from '65.108.121.160:5000'
2024-04-22T16:05:29.243931Z  INFO snarkos_node_bft::primary: Received a batch signature for round 337667 from '65.108.121.160:5000'
2024-04-22T16:05:29.249559Z TRACE snarkos_node_sync::block_sync: Block sync is removing all block requests to peer 147.28.147.81:5000...
2024-04-22T16:05:29.257186Z TRACE snarkos_node_bft::gateway: [MemoryPool] Received 'BatchPropose' from '35.196.17.175:5000'
2024-04-22T16:05:29.264173Z TRACE snarkos_node_bft::gateway: [MemoryPool] Sending 'BatchSignature' to '35.196.17.175:5000'
2024-04-22T16:05:29.264237Z DEBUG snarkos_node_bft::primary: Signed a batch for round 337667 from '35.196.17.175:5000'
2024-04-22T16:05:29.284465Z TRACE snarkos_node_bft::gateway: [MemoryPool] Received 'BatchPropose' from '104.61.63.153:5000'
2024-04-22T16:05:29.289561Z TRACE snarkos_node_bft::gateway: [MemoryPool] Received 'BatchSignature' from '34.72.164.175:5000'
2024-04-22T16:05:29.290592Z TRACE snarkos_node_bft::gateway: [MemoryPool] Sending 'BatchSignature' to '104.61.63.153:5000'
2024-04-22T16:05:29.290650Z DEBUG snarkos_node_bft::primary: Signed a batch for round 337667 from '104.61.63.153:5000'
2024-04-22T16:05:29.290779Z TRACE snarkos_node_bft::gateway: [MemoryPool] Received 'BatchPropose' from '35.186.144.105:5000'
2024-04-22T16:05:29.291298Z  INFO snarkos_node_bft::primary: Received a batch signature for round 337667 from '34.72.164.175:5000'
2024-04-22T16:05:29.296556Z TRACE snarkos_node_bft::gateway: [MemoryPool] Sending 'BatchSignature' to '35.186.144.105:5000'
2024-04-22T16:05:29.296592Z DEBUG snarkos_node_bft::primary: Signed a batch for round 337667 from '35.186.144.105:5000'
2024-04-22T16:05:29.307417Z TRACE snarkos_node_consensus: Received unconfirmed transaction 'at1cgthspn0d02dl..' in the queue
2024-04-22T16:05:29.307462Z TRACE snarkos_node_consensus: Adding unconfirmed transaction 'at1cgthspn0d02dl..' to the memory pool...
2024-04-22T16:05:29.307484Z TRACE snarkos_node_bft::primary: Primary - Received an unconfirmed transaction 'at1cgthspn0d02dl..'
2024-04-22T16:05:29.310742Z TRACE snarkos_node_bft::gateway: [MemoryPool] Received 'BatchPropose' from '3.111.151.121:5000'
2024-04-22T16:05:29.314459Z TRACE snarkos_node_bft::gateway: [MemoryPool] Sending 'BatchSignature' to '3.111.151.121:5000'
2024-04-22T16:05:29.314495Z DEBUG snarkos_node_bft::primary: Signed a batch for round 337667 from '3.111.151.121:5000'
2024-04-22T16:05:29.326674Z TRACE snarkos_node_bft::gateway: [MemoryPool] Received 'BatchPropose' from '34.16.172.142:5000'
2024-04-22T16:05:29.331012Z TRACE snarkos_node_bft::gateway: [MemoryPool] Sending 'TransmissionRequest' to '34.16.172.142:5000'
2024-04-22T16:05:29.363825Z TRACE snarkos_node_bft::gateway: [MemoryPool] Received 'BatchSignature' from '35.196.17.175:5000'
2024-04-22T16:05:29.364801Z  INFO snarkos_node_bft::primary: Received a batch signature for round 337667 from '35.196.17.175:5000'
2024-04-22T16:05:29.395634Z TRACE snarkos_node_bft::gateway: [MemoryPool] Received 'BatchSignature' from '104.61.63.153:5000'
2024-04-22T16:05:29.397709Z  INFO snarkos_node_bft::primary: Received a batch signature for round 337667 from '104.61.63.153:5000'
2024-04-22T16:05:29.429796Z TRACE snarkos_node_bft::worker: Worker 0 - Added unconfirmed transaction 'at1cgthspn0d02dl..'
2024-04-22T16:05:29.437228Z TRACE snarkos_node_bft::gateway: [MemoryPool] Received 'BatchPropose' from '34.72.164.175:5000'
2024-04-22T16:05:29.443604Z TRACE snarkos_node_bft::gateway: [MemoryPool] Sending 'BatchSignature' to '34.72.164.175:5000'
2024-04-22T16:05:29.443669Z DEBUG snarkos_node_bft::primary: Signed a batch for round 337667 from '34.72.164.175:5000'
2024-04-22T16:05:29.451410Z TRACE snarkos_node_bft::gateway: [MemoryPool] Received 'BatchSignature' from '3.111.151.121:5000'
2024-04-22T16:05:29.453652Z  INFO snarkos_node_bft::primary: Received a batch signature for round 337667 from '3.111.151.121:5000'
2024-04-22T16:05:29.492530Z TRACE snarkos_node_bft::gateway: [MemoryPool] Received 'BatchSignature' from '34.16.172.142:5000'
2024-04-22T16:05:29.494833Z  INFO snarkos_node_bft::primary: Received a batch signature for round 337667 from '34.16.172.142:5000'
2024-04-22T16:05:29.495420Z  INFO snarkos_node_bft::primary: Quorum threshold reached - Preparing to certify our batch for round 337667...
2024-04-22T16:05:29.499583Z DEBUG snarkos_node_bft::primary: Stored a batch certificate for round 337667
2024-04-22T16:05:29.499685Z TRACE snarkos_node_bft::bft: Checking if the leader is ready to be committed for round 337666...
2024-04-22T16:05:29.499743Z TRACE snarkos_node_bft::bft: BFT is not ready to commit 337666
2024-04-22T16:05:29.499781Z  INFO snarkos_node_bft::primary: 

Our batch with 0 transmissions for round 337667 was certified!

2024-04-22T16:05:29.499842Z TRACE snarkos_node_bft::bft: BFT failed reach quorum threshold in odd round 337667. 
2024-04-22T16:05:29.499878Z TRACE snarkos_node_bft::gateway: [MemoryPool] Sending 'BatchCertified' to '34.16.172.142:5000'
2024-04-22T16:05:29.499901Z TRACE snarkos_node_bft::gateway: [MemoryPool] Sending 'BatchCertified' to '35.186.144.105:5000'
2024-04-22T16:05:29.499920Z TRACE snarkos_node_bft::gateway: [MemoryPool] Sending 'BatchCertified' to '35.196.17.175:5000'
2024-04-22T16:05:29.499924Z DEBUG snarkos_node_bft::primary: Primary is not ready to propose the next round
2024-04-22T16:05:29.499939Z TRACE snarkos_node_bft::gateway: [MemoryPool] Sending 'BatchCertified' to '65.108.121.160:5000'
2024-04-22T16:05:29.500000Z TRACE snarkos_node_bft::gateway: [MemoryPool] Sending 'BatchCertified' to '34.72.164.175:5000'
2024-04-22T16:05:29.500017Z TRACE snarkos_node_bft::gateway: [MemoryPool] Sending 'BatchCertified' to '3.111.151.121:5000'
2024-04-22T16:05:29.500035Z TRACE snarkos_node_bft::gateway: [MemoryPool] Sending 'BatchCertified' to '69.84.134.115:5000'
2024-04-22T16:05:29.500055Z TRACE snarkos_node_bft::gateway: [MemoryPool] Sending 'BatchCertified' to '104.61.63.153:5000'
2024-04-22T16:05:29.539719Z TRACE snarkos_node_bft::gateway: [MemoryPool] Received 'BatchSignature' from '35.186.144.105:5000'
2024-04-22T16:05:29.567724Z TRACE snarkos_node_bft::gateway: [MemoryPool] Sending 'PrimaryPing' to '34.16.172.142:5000'
2024-04-22T16:05:29.568443Z TRACE snarkos_node_bft::gateway: [MemoryPool] Sending 'PrimaryPing' to '35.186.144.105:5000'
2024-04-22T16:05:29.568483Z TRACE snarkos_node_bft::gateway: [MemoryPool] Sending 'PrimaryPing' to '35.196.17.175:5000'
2024-04-22T16:05:29.568505Z TRACE snarkos_node_bft::gateway: [MemoryPool] Sending 'PrimaryPing' to '65.108.121.160:5000'
2024-04-22T16:05:29.568525Z TRACE snarkos_node_bft::gateway: [MemoryPool] Sending 'PrimaryPing' to '34.72.164.175:5000'
2024-04-22T16:05:29.568545Z TRACE snarkos_node_bft::gateway: [MemoryPool] Sending 'PrimaryPing' to '3.111.151.121:5000'
2024-04-22T16:05:29.568575Z TRACE snarkos_node_bft::gateway: [MemoryPool] Sending 'PrimaryPing' to '69.84.134.115:5000'
2024-04-22T16:05:29.568618Z TRACE snarkos_node_bft::gateway: [MemoryPool] Sending 'PrimaryPing' to '104.61.63.153:5000'
2024-04-22T16:05:29.657280Z TRACE snarkos_node_bft::gateway: [MemoryPool] Received 'TransmissionResponse' from '34.16.172.142:5000'
2024-04-22T16:05:29.671325Z TRACE snarkos_node_bft::gateway: [MemoryPool] Sending 'BatchSignature' to '34.16.172.142:5000'
2024-04-22T16:05:29.671368Z DEBUG snarkos_node_bft::primary: Signed a batch for round 337667 from '34.16.172.142:5000'
2024-04-22T16:05:29.682270Z TRACE snarkos_node_bft::gateway: [MemoryPool] Received 'BatchCertified' from '35.196.17.175:5000'
2024-04-22T16:05:29.693162Z DEBUG snarkos_node_bft::primary: Stored a batch certificate for round 337667 from '35.196.17.175:5000'
2024-04-22T16:05:29.693209Z TRACE snarkos_node_bft::bft: Checking if the leader is ready to be committed for round 337666...
2024-04-22T16:05:29.693251Z TRACE snarkos_node_bft::bft: BFT is not ready to commit 337666
2024-04-22T16:05:29.701138Z TRACE snarkos_node_sync::block_sync: Block sync is removing all block requests to peer 34.16.172.142:5000...
2024-04-22T16:05:29.786003Z TRACE snarkos_node_bft::gateway: [MemoryPool] Received 'PrimaryPing' from '35.186.144.105:5000'
2024-04-22T16:05:29.787105Z TRACE snarkos_node_bft::gateway: [MemoryPool] Received 'BatchCertified' from '35.186.144.105:5000'
2024-04-22T16:05:29.798732Z DEBUG snarkos_node_bft::primary: Stored a batch certificate for round 337667 from '35.186.144.105:5000'
2024-04-22T16:05:29.798773Z TRACE snarkos_node_bft::bft: Checking if the leader is ready to be committed for round 337666...
2024-04-22T16:05:29.798816Z  INFO snarkos_node_bft::bft: Proceeding to commit round 337666 with leader 'aleo1jtmdv03zle4..'
2024-04-22T16:05:29.996729Z TRACE snarkos_node_bft::gateway: [MemoryPool] Received 'PrimaryPing' from '35.196.17.175:5000'
2024-04-22T16:05:30.281334Z  INFO snarkos_node_bft_ledger_service::ledger: 

Advanced to block 160961 at round 337666 - ab1yulrzl0u7rarnjzkjug4ydrc8g3h09jkmv73ts72lswwd9lfmvqqrf9w45

2024-04-22T16:05:30.281771Z  INFO snarkos_node_bft::bft: 

Committing a subdag from round 337666 with 3 transmissions: [(337663, 1), (337664, 7), (337665, 7), (337666, 1)]

2024-04-22T16:05:30.361767Z TRACE snarkos_node_bft::gateway: [MemoryPool] Received 'BatchCertified' from '104.61.63.153:5000'
2024-04-22T16:05:30.372579Z DEBUG snarkos_node_bft::primary: Stored a batch certificate for round 337667 from '104.61.63.153:5000'
2024-04-22T16:05:30.408900Z TRACE snarkos_node_bft::bft: BFT failed reach quorum threshold in odd round 337667.

the message this log ends with can be seen repeating for a few hours but eventually also stopping:

2024-04-23T00:17:00.889032Z TRACE snarkos_node_bft::bft: BFT failed reach quorum threshold in odd round 337667. 

Log file with a larger timeframe: validator-consenus-anomaly.log

The validator was restarted when the anomaly was detected, this did fix the issue and the validator proceeded to take part in block production again

Expected Behavior

Expected the node to continue participating in block production and not require a restart to do so.

Your Environment

iamalwaysuncomfortable commented 7 months ago

Thanks for the report and the logs, I'll look into this when I get the chance!

I also encourage everyone participating in CanaryNet to also look at the codepaths where this was emitted, usually that helps make it apparent why this happened. Let's all attempt to look into this and see if we can discover why this happened!