bnb-chain / bsc

A BNB Smart Chain client based on the go-ethereum fork
GNU Lesser General Public License v3.0

Syncing issues after upgrade to v1.1.21 #1437

Closed: Aracki closed this issue 1 year ago

Aracki commented 1 year ago

After we upgraded from 1.1.20 to 1.1.21, the nodes randomly become stuck and stop importing new blocks entirely.

Here is an example of the logs. You can see that after ~21:44 the node became unhealthy and stopped receiving new blocks:

INFO [04-10|21:43:08.507] Imported new chain segment               blocks=20 txs=1689  mgas=182.628  elapsed=5.672s    mgasps=32.193  number=27,237,294 hash=6331a2..f6a57d dirty=2.25MiB    ignored=1
INFO [04-10|21:43:13.026] Deep froze chain segment                 blocks=20 elapsed=22.551ms  number=27,147,294 hash=88801b..dff61c
INFO [04-10|21:44:02.862] Downloader queue stats                   receiptTasks=0 blockTasks=0 itemSize=23.17KiB throttle=8192
ERROR[04-10|21:44:04.925] Invalid merkle root                      remote=1b9937..aaef75 local=b604ae..6e2805
WARN [04-10|21:44:04.927] Synchronisation failed, dropping peer    peer=448fccf3fcaaad193fc352f3eb4be4166e8ee05f235d847604def13addb73b5a err="retrieved hash chain is invalid: invalid merkle root (remote: 1b99374ba56e64254d0a9d68555ac74ed9e36d92268b97bedf12186d3caaef75 local: b604aee4eed8459ca3ae4061bd98739a8c44b6115b6b9afd1c89f6ea296e2805)"
WARN [04-10|21:44:04.928] Synchronisation failed, retrying         peer=448fccf3fcaaad193fc352f3eb4be4166e8ee05f235d847604def13addb73b5a err="shutting down"
WARN [04-10|21:44:04.929] Synchronisation failed, retrying         peer=448fccf3fcaaad193fc352f3eb4be4166e8ee05f235d847604def13addb73b5a err="write tcp 10.2.21.177:45634->142.132.208.184:30309: use of closed network connection"
WARN [04-10|21:44:04.929] Synchronisation failed, retrying         peer=448fccf3fcaaad193fc352f3eb4be4166e8ee05f235d847604def13addb73b5a err="shutting down"
WARN [04-10|21:44:04.929] Synchronisation failed, retrying         peer=448fccf3fcaaad193fc352f3eb4be4166e8ee05f235d847604def13addb73b5a err="peer is unknown or unhealthy"
INFO [04-10|21:44:13.047] Deep froze chain segment                 blocks=8  elapsed=19.940ms  number=27,147,302 hash=9a5c96..66fb82
INFO [04-10|22:09:29.965] Writing clean trie cache to disk         path=/bsc/.ethereum/geth/triecache threads=1
INFO [04-10|22:09:29.968] Regenerated local transaction journal    transactions=0 accounts=0
INFO [04-10|22:09:43.627] Persisted the clean trie cache           path=/bsc/.ethereum/geth/triecache elapsed=13.662s
INFO [04-10|23:09:29.965] Writing clean trie cache to disk         path=/bsc/.ethereum/geth/triecache threads=1
INFO [04-10|23:09:29.966] Regenerated local transaction journal    transactions=0 accounts=0
INFO [04-10|23:09:43.101] Persisted the clean trie cache           path=/bsc/.ethereum/geth/triecache elapsed=13.135s
INFO [04-11|00:09:29.965] Writing clean trie cache to disk         path=/bsc/.ethereum/geth/triecache threads=1
INFO [04-11|00:09:29.966] Regenerated local transaction journal    transactions=0 accounts=0
INFO [04-11|00:09:43.030] Persisted the clean trie cache           path=/bsc/.ethereum/geth/triecache elapsed=13.064s
INFO [04-11|01:09:29.966] Writing clean trie cache to disk         path=/bsc/.ethereum/geth/triecache threads=1
INFO [04-11|01:09:29.967] Regenerated local transaction journal    transactions=0 accounts=0
INFO [04-11|01:09:43.273] Persisted the clean trie cache           path=/bsc/.ethereum/geth/triecache elapsed=13.306s
INFO [04-11|02:09:29.965] Writing clean trie cache to disk         path=/bsc/.ethereum/geth/triecache threads=1
INFO [04-11|02:09:29.966] Regenerated local transaction journal    transactions=0 accounts=0
INFO [04-11|02:09:43.361] Persisted the clean trie cache           path=/bsc/.ethereum/geth/triecache elapsed=13.395s
INFO [04-11|03:09:29.966] Writing clean trie cache to disk         path=/bsc/.ethereum/geth/triecache threads=1
INFO [04-11|03:09:29.967] Regenerated local transaction journal    transactions=0 accounts=0
INFO [04-11|03:09:43.146] Persisted the clean trie cache           path=/bsc/.ethereum/geth/triecache elapsed=13.179s
INFO [04-11|04:09:29.966] Writing clean trie cache to disk         path=/bsc/.ethereum/geth/triecache threads=1
INFO [04-11|04:09:29.966] Regenerated local transaction journal    transactions=0 accounts=0
INFO [04-11|04:09:43.364] Persisted the clean trie cache           path=/bsc/.ethereum/geth/triecache elapsed=13.397s
INFO [04-11|05:09:29.966] Writing clean trie cache to disk         path=/bsc/.ethereum/geth/triecache threads=1
INFO [04-11|05:09:29.966] Regenerated local transaction journal    transactions=0 accounts=0
INFO [04-11|05:09:43.403] Persisted the clean trie cache           path=/bsc/.ethereum/geth/triecache elapsed=13.437s
INFO [04-11|06:09:29.965] Writing clean trie cache to disk         path=/bsc/.ethereum/geth/triecache threads=1
INFO [04-11|06:09:29.967] Regenerated local transaction journal    transactions=0 accounts=0
INFO [04-11|06:09:42.979] Persisted the clean trie cache           path=/bsc/.ethereum/geth/triecache elapsed=13.013s
INFO [04-11|07:09:29.966] Writing clean trie cache to disk         path=/bsc/.ethereum/geth/triecache threads=1
INFO [04-11|07:09:29.966] Regenerated local transaction journal    transactions=0 accounts=0
INFO [04-11|07:09:43.384] Persisted the clean trie cache           path=/bsc/.ethereum/geth/triecache elapsed=13.417s
INFO [04-11|08:09:29.966] Writing clean trie cache to disk         path=/bsc/.ethereum/geth/triecache threads=1
INFO [04-11|08:09:29.966] Regenerated local transaction journal    transactions=0 accounts=0
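
When a node wedges like this, the peer set is worth inspecting from the console, since "bad peers" keep coming up as the suspect. A quick way to see how many peers the node holds and which client versions they run (the IPC path matches our --datadir above; adjust for yours):

geth --exec 'net.peerCount' attach /bsc/.ethereum/geth.ipc
geth --exec 'admin.peers.map(function (p) { return p.name })' attach /bsc/.ethereum/geth.ipc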
nebojsa-ristovic commented 1 year ago

The same happened to us. Does anyone know how to fix this?

deepcrazy commented 1 year ago

> After we upgraded from 1.1.20 to 1.1.21, the nodes randomly become stuck and stop importing new blocks entirely. [log excerpt quoted above]

Hi there, what's your startup command and geth version? And which snapshot are you using?

Also, please share the outputs of eth.syncing, eth.blockNumber, and net.peerCount. Thanks.
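
For reference, these can be pulled non-interactively through the console (the IPC path below is a placeholder; point it at the geth.ipc under your own --datadir):

geth --exec 'eth.syncing' attach /path/to/datadir/geth.ipc
geth --exec 'eth.blockNumber' attach /path/to/datadir/geth.ipc
geth --exec 'net.peerCount' attach /path/to/datadir/geth.ipc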

PanJ commented 1 year ago

This happened to me too. There seem to be some bad peers; I needed to remove all the peers and restart the node.

geth --exec 'var peers = admin.peers; for (var i = 0; i < peers.length; i++) admin.removePeer(peers[i].enode)' attach data/geth.ipc
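
(The peer list is captured into a local variable first because admin.peers re-queries the node on every access; since the list shrinks as each peer is removed, iterating over it directly drops only about half of them.) If the bad peers keep coming back after a restart, one workaround worth trying is to rebuild the discovery table from scratch: stop geth, delete the node database, and start it again. The path below is an assumption based on the standard geth datadir layout:

rm -rf <your-datadir>/geth/nodes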
Aracki commented 1 year ago

@deepcrazy

We are using the ghcr.io/bnb-chain/bsc:1.1.21 image. The startup command args are:

            "--networkid=56",
            "--datadir=/bsc/.ethereum/",
            "--config=/bsc/.ethereum/config.toml",
            "--syncmode=full",
            "--gcmode=archive",
            "--cache=65536",
            "--maxpeers=100",
            "--http",
            "--http.addr=0.0.0.0",
            "--http.port=8545",
            "--http.api=eth,net,web3,admin,debug,personal,txpool",
            "--http.corsdomain=*",
            "--http.vhosts=*",
            "--ws",
            "--ws.addr=0.0.0.0",
            "--ws.port=8546",
            "--ws.api=eth,net,web3,admin,debug,personal,miner,txpool",
            "--ws.origins=*",
            "--port=30311",
            "--txlookuplimit=0",
            "--rpc.allow-unprotected-txs",
            "--cache.preimages",
            "--bootnodes=enode://1cc4534b14cfe351ab740a1418ab944a234ca2f702915eadb7e558a02010cb7c5a8c295a3b56bcefa7701c07752acd5539cb13df2aab8ae2d98934d712611443@52.71.43.172:30311,enode://28b1d16562dac280dacaaf45d54516b85bc6c994252a9825c5cc4e080d3e53446d05f63ba495ea7d44d6c316b54cd92b245c5c328c37da24605c4a93a0d099c4@34.246.65.14:30311,enode://5a7b996048d1b0a07683a949662c87c09b55247ce774aeee10bb886892e586e3c604564393292e38ef43c023ee9981e1f8b335766ec4f0f256e57f8640b079d5@35.73.137.11:30311",

For eth.syncing I'm not sure what the value was at the time, as we don't have a metric for that. eth.blockNumber was stuck at 27,237,302 and net.peerCount was stuck at 33 (although before the issue it was always in the range of 34-36).

What exactly do you mean by "which snapshot we are using"?

deepcrazy commented 1 year ago

> @deepcrazy We are using the ghcr.io/bnb-chain/bsc:1.1.21 image. [startup args and sync status quoted above] What exactly do you mean by "which snapshot we are using"?

Hi there. By "snapshot" I meant the chaindata snapshot the node was bootstrapped from, if any. In this case, please try setting maxpeers to 200, 300, or 500, and also append the static nodes from here: https://api.binance.org/v1/discovery/peers. Then restart the node and let it run for a while.
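
For example, static nodes can be declared under the [Node.P2P] section of config.toml (the enode below is a placeholder; use the values returned by the discovery endpoint):

[Node.P2P]
StaticNodes = ["enode://<node-pubkey>@<ip>:30311"]

They can also be added to a running node from the console with admin.addPeer("enode://<node-pubkey>@<ip>:30311").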

gituser commented 1 year ago

Try this: take the config.toml from the official mainnet.zip release bundle and copy its StaticNodes and BootstrapNodes into your own config.toml, then restart the node.

scottsisco commented 1 year ago

I am having the same issue as everyone else. I have tried @gituser's recommendation of loading the StaticNodes and BootstrapNodes out of the mainnet.zip linked above. I disabled peer discovery to weed out all of the bad nodes running older client versions, and I have been able to connect to 2 out of the 32 static nodes I grabbed from the config.toml in the mainnet.zip.

> admin.peers
[{
    caps: ["diff/1", "eth/66", "eth/67", "les/2", "les/3", "les/4", "snap/1"],
    enode: "enode://ace8e3b7e96290392a9c6238e005539dd1d1ca7c18aeedd979087789812e0656407e2c4baabcd83b50d583b583aa959ff42e5c95571d76b84995aad722a9a85c@44.198.55.182:30311",
    id: "25b173c00d58eac53442faee5a67f1543e496bd7df818e062cee202b8ec5582e",
    name: "Geth/v1.1.21-6e06ed10/linux-amd64/go1.19.7",
    network: {
      inbound: false,
      localAddress: "172.20.0.15:60284",
      remoteAddress: "44.198.55.182:30311",
      static: true,
      trusted: false
    },
    protocols: {
      diff: {
        diff_sync: false,
        version: 1
      },
      eth: {
        difficulty: 54357728,
        head: "0xd79d41baf9ebac436b2fd3279dc1d764e1eafe88d839d6f4a1ce7842e2ff3ff9",
        version: 67
      },
      snap: {
        version: 1
      }
    }
}, {
    caps: ["diff/1", "eth/66", "eth/67", "les/2", "les/3", "les/4", "snap/1"],
    enode: "enode://bdbcb42ff17a52af7ac120f23ee86f525ffbd854ce76e54bad858cf741fcd524d0f810be399d437bea939682a919c5af5df31811c43ccc270d04485caf4ddaeb@52.206.226.126:30311",
    id: "67950c087377c72d2ec8f38f3a5fc0683097756e994ee546ef5fee3b53400b2a",
    name: "Geth/v1.1.21-6e06ed10/linux-amd64/go1.19.7",
    network: {
      inbound: false,
      localAddress: "172.20.0.15:60570",
      remoteAddress: "52.206.226.126:30311",
      static: true,
      trusted: false
    },
    protocols: {
      diff: {
        diff_sync: false,
        version: 1
      },
      eth: {
        difficulty: 54357856,
        head: "0x9568f5c3fd2a2380e553c386bff7a3282061f7185827a31a0268743a10be17ed",
        version: 67
      },
      snap: {
        version: 1
      }
    }
}]

The node does appear to be working now, but only being able to connect to 2 peers is not ideal for a production environment. How would you recommend connecting to more than 2 peers out of the static nodes I have listed in my config.toml?

scottsisco commented 1 year ago

Okay, so maybe the answer is time. I let the node run over the weekend and eventually it connected to 10 peers and the node never locked up once.

mhd-alawieh commented 1 year ago

> This happened to me too. There seem to be some bad peers; I needed to remove all the peers and restart the node. [peer-reset command quoted above]

Any known fix? It seems impossible to keep my node synced because of this; it happens very frequently. Cleaning the peers with the script brings it back for a while, until it gets stuck again or starts importing blocks in batches instead of one by one every three seconds.
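
Until there is a real fix, I'm considering running the reset from a watchdog that only fires when the head actually stalls, something like this sketch (the IPC path and the 120-second threshold are assumptions to tune for your setup):

#!/bin/sh
# Reset peers only when the head block has not advanced for 120 seconds.
IPC=/path/to/datadir/geth.ipc
BEFORE=$(geth --exec 'eth.blockNumber' attach "$IPC")
sleep 120
AFTER=$(geth --exec 'eth.blockNumber' attach "$IPC")
if [ "$BEFORE" = "$AFTER" ]; then
  geth --exec 'var peers = admin.peers; for (var i = 0; i < peers.length; i++) admin.removePeer(peers[i].enode)' attach "$IPC"
fi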