ethereum / go-ethereum

Go implementation of the Ethereum protocol
https://geth.ethereum.org
GNU Lesser General Public License v3.0

PoA network, all the sealers are waiting for each other after 2 months running, possible deadlock? #18402

Closed marcosmartinez7 closed 4 months ago

marcosmartinez7 commented 5 years ago

System information

My current version is:

Geth
Version: 1.8.17-stable
Git Commit: 8bbe72075e4e16442c4e28d999edee12e294329e
Architecture: amd64
Protocol Versions: [63 62]
Network Id: 1
Go Version: go1.10.1
Operating System: linux
GOPATH=
GOROOT=/usr/lib/go-1.10

Expected behaviour

Keep signing blocks normally.

Actual behaviour

I was running a go-ethereum private network with 6 sealers.

Each sealer is run by:

directory=/home/poa
command=/bin/bash -c 'geth --datadir sealer4/  --syncmode 'full' --port 30393 --rpc --rpcaddr 'localhost' --rpcport 8600 --rpcapi='net,web3,eth' --networkid 30 --gasprice '1' -unlock 'someaddress' --password sealer4/password.txt --mine '

The blockchain had been running fine for about 1-2 months.

Today I found that all the nodes were having issues. Each node was emitting the message "Signed recently, must wait for others".

I checked the logs and found this message every hour, with no other information; the nodes were not mining:

Regenerated local transaction journal transactions=0 accounts=0
Regenerated local transaction journal transactions=0 accounts=0
Regenerated local transaction journal transactions=0 accounts=0
Regenerated local transaction journal transactions=0 accounts=0

Since all 6 sealers were experiencing the same issue, I restarted each node, but now I am stuck at:

INFO [01-07|18:17:30.645] Etherbase automatically configured       address=0x5Bc69DC4dba04b6955aC94BbdF129C3ce2d20D34
INFO [01-07|18:17:30.645] Commit new mining work                   number=488677 sealhash=a506ec…8cb403 uncles=0 txs=0 gas=0 fees=0 elapsed=133.76µs
INFO [01-07|18:17:30.645] Signed recently, must wait for others

The first weird thing is that some nodes are stuck on block 488677 and others on 488676. This behaviour was reported in https://github.com/ethereum/go-ethereum/issues/16406 by @lyhbarry as well.

Example: Signer 1

(screenshot)

Signer 2

(screenshot)

Note that there are no pending votes.

So right now I have shut down and restarted each node, and I have found that:

INFO [01-07|18:41:56.134] Signed recently, must wait for others 
INFO [01-07|19:41:42.125] Regenerated local transaction journal    transactions=0 accounts=0
INFO [01-07|18:41:56.134] Signed recently, must wait for others 

So not only does synchronisation fail, but I also can't start signing again because each node is stuck waiting for the others. Does that mean the network is useless?

The comment by @tudyzhb on that issue mentions:

Ref the clique seal code of v1.8.11, I think there is no effective mechanism to retry sealing when an in-turn/out-of-turn seal failure occurs, so our dev network easily becomes useless.

After this problem I took a look at the logs; each signer has these error messages:

Synchronisation failed, dropping peer peer=7875a002affc775b err="retrieved hash chain is invalid"

INFO [01-02|16:42:10.902] Signed recently, must wait for others 
WARN [01-02|16:42:11.960] Synchronisation failed, dropping peer    peer=7875a002affc775b err="retrieved hash chain is invalid"
INFO [01-02|16:42:12.128] Imported new chain segment               blocks=1  txs=0 mgas=0.000 elapsed=540.282µs mgasps=0.000  number=488116 hash=269920…afd3c7 cache=5.99kB
INFO [01-02|16:42:12.129] Commit new mining work                   number=488117 sealhash=f7b00c…787d5c uncles=2 txs=0 gas=0     fees=0          elapsed=307.314µs
INFO [01-02|16:42:20.929] Successfully sealed new block            number=488117 sealhash=f7b00c…787d5c hash=f17438…93ffe3 elapsed=8.800s
INFO [01-02|16:42:20.929] 🔨 mined potential block                  number=488117 hash=f17438…93ffe3
INFO [01-02|16:42:20.930] Commit new mining work                   number=488118 sealhash=b09b33…1526ba uncles=2 txs=0 gas=0     fees=0          elapsed=520.754µs
INFO [01-02|16:42:20.930] Signed recently, must wait for others 
INFO [01-02|16:42:31.679] Imported new chain segment               blocks=1  txs=0 mgas=0.000 elapsed=2.253ms   mgasps=0.000  number=488118 hash=763a32…a579f5 cache=5.99kB
INFO [01-02|16:42:31.680] 🔗 block reached canonical chain          number=488111 hash=3d44dc…df0be5
INFO [01-02|16:42:31.680] Commit new mining work                   number=488119 sealhash=c8a5e7…db78a1 uncles=2 txs=0 gas=0     fees=0          elapsed=214.155µs
INFO [01-02|16:42:31.680] Signed recently, must wait for others 
INFO [01-02|16:42:40.901] Imported new chain segment               blocks=1  txs=0 mgas=0.000 elapsed=808.903µs mgasps=0.000  number=488119 hash=accc3f…44bc4c cache=5.99kB
INFO [01-02|16:42:40.901] Commit new mining work                   number=488120 sealhash=f73978…c03fa7 uncles=2 txs=0 gas=0     fees=0          elapsed=275.72µs
INFO [01-02|16:42:40.901] Signed recently, must wait for others 
WARN [01-02|16:42:41.961] Synchronisation failed, dropping peer    peer=7875a002affc775b err="retrieved hash chain is invalid"

I also see some:

INFO [01-02|16:58:10.902] 😱 block lost                             number=488205 hash=1fb1c5…a41a42

The hash chain error was just a warning, so the nodes kept mining until the 2nd of January; then I saw this on each of the 6 nodes:

(screenshot)

I have seen that there are a lot of issues about this error; the most similar is the one I linked here, but it is unresolved.

The workaround in most of those issues seems to be a restart, but in this case the chain seems to be in an inconsistent state and the nodes are always waiting for each other.

So,

  1. Any ideas? Peers are connected, accounts are unlocked; the network just entered a deadlock situation after ~450k blocks.
  2. Are there any logs I can provide? I only see the warnings and the "block lost" message described above, but nothing from the moment the nodes stopped mining.
  3. Is this PR related? https://github.com/ethereum/go-ethereum/pull/18072
  4. Maybe it is related to the comment by @karalabe on this issue: https://github.com/ethereum/go-ethereum/issues/16406?
  5. Will upgrading from 1.8.17 to 1.8.20 solve this?
  6. In my opinion this looks like a race condition or something similar, since I have two chains, one running for 2 months and the other for three months, and this is the first time this error has happened.

These are other related issues:

https://github.com/ethereum/go-ethereum/issues/16444 (same issue, but I don't have pending votes in my snapshot)

https://github.com/ethereum/go-ethereum/issues/14381#

https://github.com/ethereum/go-ethereum/issues/16825

https://github.com/ethereum/go-ethereum/issues/16406

fab-spring commented 5 years ago

any news on this issue?

marcosmartinez7 commented 5 years ago

any news on this issue?

There is a PR (https://github.com/ethereum/go-ethereum/pull/19239), but it still needs to be reviewed.

dwalintukan commented 5 years ago

Mine had the same issue recently, with a 5-sealer config.

One of them, I don't know why, had its block history completely messed up: the latest block was ~5100000 and this sealer was at ~3000000.

Two sealers were stuck at 5125155 with matching block hashes. The other two sealers were stuck at 5125155 with a different (but also matching) block hash.

I picked which two sealers I wanted the new head to be at. For the other two stuck ones:

  1. I dumped the blocks to a file using geth export blocks.dump 0 5125100. (Stopping at 5125100 meant the sealers later resynced the last ~55 blocks from the other sealers.)
  2. Deleted the chaindata dirs
  3. geth init ...
  4. Imported the blocks with geth import blocks.dump.

The other weird sealer, the one that was way off, I just resynced from scratch. Everything is working well again; I'm just wondering why this happened in the first place.
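
A rough sketch of that recovery, for anyone needing it: the data directory, genesis file and block numbers below are placeholders, geth must be stopped while the commands run, and this only illustrates the steps described above, not an official procedure.

# Sketch of the export / re-init / import recovery described above.
DATADIR=sealer1          # placeholder data directory
GENESIS=genesis.json     # placeholder: the genesis file the chain was created from

# 1. Export the chain up to a block safely below the fork point.
geth --datadir "$DATADIR" export blocks.dump 0 5125100

# 2. Delete the old chain data (the keystore is left untouched).
rm -rf "$DATADIR/geth/chaindata"

# 3. Re-initialise the data directory from the original genesis file.
geth --datadir "$DATADIR" init "$GENESIS"

# 4. Re-import the exported blocks, then restart the node so it syncs the
#    remaining ~55 blocks from the sealers holding the chosen head.
geth --datadir "$DATADIR" import blocks.dump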

ivica7 commented 5 years ago

We're experiencing this issue too: 5s block times and 4 sealers. It ran for about 5.5 months without any problems; now we have the same deadlock situation as described here.

Is there a plan to fix this in one of the upcoming geth releases? https://github.com/ethereum/go-ethereum/pull/19239 seems to provide a fix, but it looks like it has stalled.

Did we have such deadlocks on Rinkeby too?

ivica7 commented 5 years ago

I was able to resolve the issue for us. First, I noticed that we had a fork, with 2 sealers on one side and the other 2 on the other.

What I did: I picked one sealer node and removed the peer that agreed on the same fork. Then I used debug.setHead to rewind the chain by ~50 blocks. After that, everything worked fine.
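
For anyone wanting to try the same workaround, a minimal sketch from the command line, assuming the node exposes the admin, debug and miner APIs on its IPC endpoint; the IPC path, enode URL and block number are placeholders:

IPC=sealer1/geth.ipc

# Drop the peer that sits on the same side of the fork.
geth attach --exec 'admin.removePeer("enode://<peer-on-same-fork>@10.0.0.2:30303")' "$IPC"

# Rewind the local chain to ~50 blocks below the fork point.
# debug.setHead expects a hex-encoded block number (0x77498 = 488600 here, a placeholder).
geth attach --exec 'debug.setHead("0x77498")' "$IPC"

# Once the node has re-synced onto the surviving fork, start sealing again.
geth attach --exec 'miner.start()' "$IPC"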

So the question for me is: could we work around this issue by using a sealer count that cannot be divided into partitions of equal size? Something like N=3, N=5, N=7, N=11, ...?

However, with N=5, for instance, there seems to be a problem too, as dwalintukan posted on 11 Sep, but I am not sure if it is the same problem.

KebAz commented 4 years ago

Hi, I have the same issue. I created a blockchain with 2 sealers; after mining the first block, the sealers are still just waiting for each other. I'm using Geth version 1.9.6-stable:

Geth
Version: 1.9.6-stable
Git Commit: bd05968077f27f7eb083404dd8448157996a8788
Architecture: amd64
Protocol Versions: [63]
Network Id: 1
Go Version: go1.11.5
Operating System: linux
GOPATH=
GOROOT=/usr/lib/go-1.11

fnaticwang commented 4 years ago

I have the same issue: 1s block time and 4 sealers (not only with 4 sealers, now 5 are in the same problem). Now all sealers are still just waiting for each other.

Geth version: v1.9.7
Git commit: a718daa674a2e23cb0c8a6789f7e5467e705bbbd
OS: Linux (Red Hat)
Go version: go1.10.2 linux/amd64

Do we have any ideas now? @holiman @jmank88 @karalabe @fjl

rajnishtech commented 4 years ago

When I start mining, the node gets stuck:

INFO [01-27|10:26:23.850] Started P2P networking                   self=enode://84acace1e79f154df825955667061c79c37986a6594742999f3909fa7ea9a20b90d71cff58a66fc3ee9fa2f6be6d8f0bc455018be72532475d9405dd5cc79622@127.0.0.1:30311
INFO [01-27|10:26:24.784] Unlocked account                         address=0x1e8eD9837558819ffcBD8Fd20CE97976a4aB6D2f
INFO [01-27|10:26:24.785] Transaction pool price threshold updated price=1
INFO [01-27|10:26:24.785] Transaction pool price threshold updated price=1
INFO [01-27|10:26:24.785] Etherbase automatically configured       address=0x1e8eD9837558819ffcBD8Fd20CE97976a4aB6D2f
INFO [01-27|10:26:24.785] Commit new mining work                   number=2 sealhash=3efe07…1cc8cf uncles=0 txs=0 gas=0 fees=0 elapsed=182.745µs
INFO [01-27|10:26:24.785] Signed recently, must wait for others

marcosmartinez7 commented 4 years ago

@rajnishtech I don't think that's related to the issue described here. It looks like a sealer waiting for another one to sign. Maybe you forgot to add the other peers to that sealer? Or the other peers are not yet authorised to seal/vote (clique.propose("0x...", true))? A quick way to check both is sketched below.
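
A rough sketch of how to check both from the console; the IPC path, enode URL and address are placeholders:

IPC=sealer1/geth.ipc

# Is the other sealer actually connected?
geth attach --exec 'admin.peers.length' "$IPC"

# If it is missing, add it (placeholder enode URL).
geth attach --exec 'admin.addPeer("enode://<other-sealer>@10.0.0.2:30303")' "$IPC"

# Is the address actually authorised to seal? Each existing signer has to
# vote it in with clique.propose before it can sign blocks.
geth attach --exec 'clique.getSigners()' "$IPC"
geth attach --exec 'clique.propose("0x<new-sealer-address>", true)' "$IPC"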

rajnishtech commented 4 years ago

@marcosmartinez7 I am getting this on the other node (node 1):

INFO [01-28|12:21:03.401] Commit new mining work                   number=1 sealhash=b36db4…2358e5 uncles=0 txs=0 gas=0 fees=0 elapsed=329.832µs

marcosmartinez7 commented 4 years ago

@rajnishtech are you sure your two sealers have connectivity? What is the result of admin.peers on each node?

By the way, I think you should create a new issue for your problem, since it is not related to this one in particular.

LongJeongS commented 4 years ago

@marcosmartinez7 have you found any ways to prevent this issue from popping up? Would increasing the block time prevent this issue?

marcosmartinez7 commented 4 years ago

@LongJeongS this issue may be fixed by https://github.com/ethereum/go-ethereum/pull/19239

In my case, I stopped using PoA a year ago; at that time the solution was to use only 2 sealers.

If you have only 2 sealers, only one is in turn and the other is out of turn, so the deadlock cannot happen.

But that's obviously not a great architecture.

am0xf commented 4 years ago

Has this problem been solved? I have the exact same issue with 2 nodes, both sealers (limited-resource setup). Even for contract deployment I get a hash and contract address, but the contract is not present in the chain. A restart of the chain fixes it, though. After that, all transactions time out. Both nodes are connected (admin.peers shows 1 peer).

marcosmartinez7 commented 4 years ago

Has this problem been solved? I have the exact same issue with 2 nodes, both sealers (limited-resource setup). Even for contract deployment I get a hash and contract address, but the contract is not present in the chain. A restart of the chain fixes it, though. After that, all transactions time out. Both nodes are connected (admin.peers shows 1 peer).

Are you sure that is the same problem? A restart should not fix it; the problem behind this issue causes a fork of the chain.

am0xf commented 4 years ago

A restart only makes the initially deployed contract available. Without a restart I get the error "...is chain synced / contract deployed correctly?" I haven't found a fix for further transactions timing out.

am0xf commented 4 years ago

Can we not run a PoA network with 1 sealer? Logs show "Signed recently. Must wait for others"
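
A single sealer should work: per EIP-225 a signer may sign one out of every floor(SIGNER_COUNT / 2) + 1 consecutive blocks, so with one signer that limit is 1 and the node should never log "Signed recently, must wait for others". Seeing that message usually suggests the genesis extraData actually lists more than one signer. A quick check (placeholder IPC path and address):

# Check how many signers the chain actually has.
geth attach --exec 'clique.getSigners()' node1/geth.ipc

# For a genuinely single-sealer chain, the genesis extraData should contain
# exactly one 20-byte signer address between the 32-byte vanity prefix and
# the 65-byte empty seal, roughly (placeholder address):
#   0x<64 hex zeros><40 hex chars of the signer address><130 hex zeros>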

cyrilnavessamuel commented 4 years ago

Hello ,

In my private Clique network with 4 nodes (A, B, C, D), I noticed a fork in the chain with a block period of 1.

I noticed that it happens sometimes with block periods of 1 and 2.

The fork happened at block height 1500, for example. Nodes A & D have the same chain data, while nodes B & C have the same data (the fork).

At block 1500, I noticed the following differences between the two chains:

  1. The block hashes are different.
  2. The block on one fork is an uncle block, while the block on the other fork has 5000 txs included.
  3. Both blocks have the same difficulty of 2, which means both were mined in turn.
  4. A further complication: it was the same sealer who sealed the block on both forks.

This results in a fork of the network and, eventually, a stall, and no reorg can resolve this deadlock situation.

In previous comments I noticed that there were at least different difficulties and different sealers at the same block height between the forks.

Please can someone let me know if you have faced this issue, or offer a logical explanation for it?

lmvirufan commented 4 years ago

Do we have a solution for this yet?

We have also encountered the same issue on our network, which has 5 signers and worked well for almost 2 months. The block generation time was 1 sec.

It suddenly started to show the message: "INFO [09-05|08:50:16.267] Looking for peers peercount=4 tried=4 static=0".

We tried to start mining by calling the miner.start() function on all the miners/signers, but the network did not start mining, and 3 of the nodes showed a response like:

INFO [09-05|08:53:23.471] Commit new mining work                   number=7961999 sealhash="d93ccf…cdb147" uncles=0 txs=0 gas=0 fees=0 elapsed="94.336µs"
INFO [09-05|08:53:23.471] Signed recently, must wait for others
INFO [09-05|08:57:23.483] Commit new mining work                   number=7961999 sealhash="c3f025…388121" uncles=0 txs=1 gas=21000 fees=0.0050589 elapsed="562.983µs"
INFO [09-05|08:57:23.484] Signed recently, must wait for others

and the remaining 2 showed the same response with number = 7961998.

The surprising thing was that the txpool contents differed: 2 nodes showed 3 pending transactions, 2 nodes showed 1 pending transaction, and 1 node showed 0 pending transactions.

Can anyone suggest what I should do so that all the nodes start mining again? I've tried a few steps and solutions, but they did not help.

q9f commented 4 years ago

Reviewed in team call: 5chdn's suggestions are good. We could solve this by making out-of-turn difficulty more complicated; there should be some deterministic order to out-of-turn blocks. karalabe fears that this will introduce too much protocol complexity or large reorgs.

Just came across this again. Here's Peter's comment on the matter: https://github.com/ethereum/EIPs/pull/2181

We drafted EIPs 2181, 2182 and 2183 after EthCapeTown for consideration.

abramsymons commented 3 years ago

It seems the Rinkeby network stopped working for more than an hour 3 times in just one month:

time                       block     stop time in minutes
12/28/2020, 09:28:25 AM    7797794   771
12/02/2020, 07:04:35 AM    7648430   101
11/30/2020, 02:47:51 PM    7639067   64

Are the Rinkeby downtimes related to this issue?

abramsymons commented 3 years ago

To reproduce deadlock:

With such a configuration, you should have 2-3 deadlocks each hour.

abramsymons commented 3 years ago

We experienced such deadlocks on IDChain and solved the issue by running a deadlock resolver script on every sealer node. The script monitors the node and, if the chain has stopped, uses debug.setHead to rewind the node's state to n/2+1 blocks ago, where n is the number of sealers. The only disadvantage of this approach is that it increases the number of blocks you need to wait for finality from n/2+1 to n/2+2. The script uses the eth RPC API to get the latest blocks, clique to get the number of signers (to calculate n/2+1), debug to rewind the state via debug.setHead, and miner to restart the miner after rewinding.
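
A minimal sketch of such a watchdog (not the actual IDChain script): it assumes the eth, clique, debug and miner APIs are reachable over the node's IPC endpoint, and the path and stall threshold are placeholders.

#!/usr/bin/env bash
# Watchdog sketch: if the head stops advancing, rewind n/2+1 blocks and restart the miner.
IPC=sealer1/geth.ipc   # placeholder endpoint
STALL_SECONDS=120      # placeholder: how long without a new block counts as a stall

while true; do
  head=$(geth attach --exec 'eth.blockNumber' "$IPC")
  sleep "$STALL_SECONDS"
  new_head=$(geth attach --exec 'eth.blockNumber' "$IPC")

  if [ "$new_head" -le "$head" ]; then
    # Chain stalled: rewind n/2+1 blocks (n = signer count), then restart sealing.
    geth attach --exec '
      var n = clique.getSigners().length;
      var target = eth.blockNumber - (Math.floor(n / 2) + 1);
      debug.setHead("0x" + target.toString(16));
      miner.start();
    ' "$IPC"
  fi
done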

uxname commented 2 years ago

We experienced such deadlocks on IDChain and solved the issue by running a deadlock resolver script on every sealer node. The script monitors the node and, if the chain has stopped, uses debug.setHead to rewind the node's state to n/2+1 blocks ago, where n is the number of sealers. The only disadvantage of this approach is that it increases the number of blocks you need to wait for finality from n/2+1 to n/2+2. The script uses the eth RPC API to get the latest blocks, clique to get the number of signers (to calculate n/2+1), debug to rewind the state via debug.setHead, and miner to restart the miner after rewinding.

But in this case the last blocks disappear, right? For example, a transaction could be sent through MetaMask, MetaMask would show that the transaction has completed, and then the block containing this transaction could disappear? And a second question: how many nodes do you have in your project? Is it possible that the more nodes there are, the lower the chance of a deadlock?

pangzheng commented 2 years ago

We experienced such deadlocks on IDChain and solved the issue by running a deadlock resolver script on every sealer node. The script monitors the node and, if the chain has stopped, uses debug.setHead to rewind the node's state to n/2+1 blocks ago, where n is the number of sealers. The only disadvantage of this approach is that it increases the number of blocks you need to wait for finality from n/2+1 to n/2+2. The script uses the eth RPC API to get the latest blocks, clique to get the number of signers (to calculate n/2+1), debug to rewind the state via debug.setHead, and miner to restart the miner after rewinding.

But in this case the last blocks disappear, right? For example, a transaction could be sent through MetaMask, MetaMask would show that the transaction has completed, and then the block containing this transaction could disappear? And a second question: how many nodes do you have in your project? Is it possible that the more nodes there are, the lower the chance of a deadlock?

Transactions should not go away

Gelassen commented 1 year ago

I have the same (or a similar?) issue after migrating from ethash to clique.

My scenario is 3 nodes and a 7-second block time: I get err="signed recently, must wait for others" within the first few blocks.

eth.getBlock("latest") showed that the 1st node mined the first block and a 2nd block on top of it. The last two nodes mined (or just share a copy of the ledger with?) only the 1st block; the check was done by comparing hashes. The 1st node's latest block has difficulty 2, while the last two nodes' latest block has difficulty 1.

However, I don't yet understand why a deadlock could happen in this case. A block is mined and shared with the rest of the network; all the other nodes accept it and start computing the next block on top of it. In the case of a race, the algorithm chooses a block by the longest chain length, time of mining, order in the queue, etc. So where is the reason for a deadlock?

@jmank88, it seems you have the most experience with this issue and workarounds for it. May I ask you to share more information about it, please?

jmank88 commented 1 year ago

In the case of a race, the algorithm chooses a block by the longest chain length, time of mining, order in the queue, etc. So where is the reason for a deadlock?

@jmank88, it seems you have the most experience with this issue and workarounds for it. May I ask you to share more information about it, please?

I haven't dealt with this in a few years, but the general problem was that there is no fully deterministic choice in the case of chains of the same length: the tie-breaker was whichever block arrived first, which is subjective due to network latency/partitions/etc. and not representative of when the blocks were created. The tie-breaker logic can be patched in the implementation as a workaround, to avoid upgrading the difficulty protocol itself.

Gelassen commented 1 year ago

@jmank88, thank you for the additional details regarding this issue!

karalabe commented 4 months ago

This was a long-standing issue, but with the drop of Clique we can elegantly close it without addressing it.

marcosmartinez7 commented 4 months ago

This was a long-standing issue, but with the drop of Clique we can elegantly close it without addressing it.

This was undoubtedly the longest-standing issue of my career. It was enjoyable to see all the updates year after year.