Consensys / quorum

A permissioned implementation of Ethereum supporting data privacy
https://www.goquorum.com/
GNU Lesser General Public License v3.0
4.68k stars 1.29k forks source link

Istanbul: bigger blockperiod value cause blockchain stuck #583

Closed yjfcn closed 5 years ago

yjfcn commented 5 years ago

Geth Version: 1.8.12-stable Git Commit: 8c4aea54d13d7ca6f148287d09aea9e0b6b0d8e6 Quorum Version: 2.1.1 Architecture: amd64 Protocol Versions: [63 62] Network Id: 1337 Go Version: go1.10.1 Operating System: linux GOPATH= GOROOT=/usr/lib/go-1.10

I configured 3 quorum nodes with Istanbul consensus. If I set --istanbul.blockperiod 10, the whole blockchain will stuck after mining some blocks. The default blockperiod value which is 1 works fine. I tried to set blockperiod 5, it also works.

Here is the log: INFO [11-30|09:43:01.105] Committed address=0x525ad36D11aA5E8139FC21D82485EF66B54AddB2 hash=5efda5…a129aa number=4 INFO [11-30|09:43:01.107] Imported new chain segment blocks=1 txs=0 mgas=0.000 elapsed=618.701µs mgasps=0.000 number=4 hash=5efda5…a129aa cache=0.00B INFO [11-30|09:43:01.107] Commit new mining work number=5 txs=0 uncles=0 elapsed=159.2µs INFO [11-30|09:43:11.438] Imported new chain segment blocks=1 txs=0 mgas=0.000 elapsed=645.6µs mgasps=0.000 number=5 hash=f1a986…4a93c9 cache=0.00B INFO [11-30|09:43:11.439] Commit new mining work number=6 txs=0 uncles=0 elapsed=276.6µs INFO [11-30|09:43:21.002] Committed address=0x525ad36D11aA5E8139FC21D82485EF66B54AddB2 hash=e785ef…ddc5ac number=6 INFO [11-30|09:43:21.002] Successfully sealed new block number=6 hash=e785ef…ddc5ac INFO [11-30|09:43:21.002] block reached canonical chain number=1 hash=ca631e…874dd2 INFO [11-30|09:43:21.002] mined potential block number=6 hash=e785ef…ddc5ac INFO [11-30|09:43:21.003] Commit new mining work number=7 txs=0 uncles=0 elapsed=111.2µs INFO [11-30|09:43:31.301] Imported new chain segment blocks=1 txs=0 mgas=0.000 elapsed=623.101µs mgasps=0.000 number=7 hash=ea45e4…679276 cache=0.00B INFO [11-30|09:43:31.301] Commit new mining work number=8 txs=0 uncles=0 elapsed=150.2µs INFO [11-30|09:43:41.108] Committed address=0x525ad36D11aA5E8139FC21D82485EF66B54AddB2 hash=b7ab7a…547115 number=8 INFO [11-30|09:43:41.109] Successfully sealed new block number=8 hash=b7ab7a…547115 INFO [11-30|09:43:41.109] block reached canonical chain number=3 hash=967955…995fd8 INFO [11-30|09:43:41.109] mined potential block number=8 hash=b7ab7a…547115 INFO [11-30|09:43:41.109] Commit new mining work number=9 txs=0 uncles=0 elapsed=167.7µs INFO [11-30|09:43:41.438] Imported new chain segment blocks=1 txs=0 mgas=0.000 elapsed=539.7µs mgasps=0.000 number=8 hash=70df0c…773024 cache=0.00B

Looks like if there happens to be two blocks with same block number but different hash, the blockchain will stuck.

vietlq commented 5 years ago

I configured 3 quorum nodes with Istanbul consensus

@yjfcn For BFT you need to have N = 3f + 1, means you need at least 4 nodes to have minimum valid number of nodes to run Istanbul BFT. In this case you need N - f = 4 - 1 = 3 nodes to agree on one hash to make it viable. Anything less will cause deadlock.

fixanoid commented 5 years ago

305

yjfcn commented 5 years ago

I configured 3 quorum nodes with Istanbul consensus

@yjfcn For BFT you need to have N = 3f + 1, means you need at least 4 nodes to have minimum valid number of nodes to run Istanbul BFT. In this case you need N - f = 4 - 1 = 3 nodes to agree on one hash to make it viable. Anything less will cause deadlock.

Thanks for your reply, I will add more nodes to test again.

yjfcn commented 5 years ago

After adding some new nodes, now I have 7 nodes. But three of them have the same error message after running several hours: INFO [12-05|14:54:49.226] Committed address=0x4e595f8d4968372E19A36C5DD87597B76f9ece96 hash=875255…b64481 number=1867 INFO [12-05|14:54:49.230] Imported new chain segment blocks=1 txs=0 mgas=0.000 elapsed=1.727ms mgasps=0.000 number=1867 hash=875255…b64481 cache=0.00B INFO [12-05|14:54:49.231] Commit new mining work number=1868 txs=0 uncles=0 elapsed=321.996µs INFO [12-05|14:54:59.165] Committed address=0x4e595f8d4968372E19A36C5DD87597B76f9ece96 hash=cf1ab9…434ef1 number=1868 INFO [12-05|14:54:59.165] Successfully sealed new block number=1868 hash=cf1ab9…434ef1 INFO [12-05|14:54:59.166] mined potential block number=1868 hash=cf1ab9…434ef1 INFO [12-05|14:54:59.166] Commit new mining work number=1869 txs=0 uncles=0 elapsed=202.497µs INFO [12-05|14:55:09.333] Imported new chain segment blocks=1 txs=0 mgas=0.000 elapsed=1.906ms mgasps=0.000 number=1869 hash=37ce2d…094408 cache=0.00B INFO [12-05|14:55:09.333] Commit new mining work number=1870 txs=0 uncles=0 elapsed=277.096µs INFO [12-05|14:55:19.168] Committed address=0x4e595f8d4968372E19A36C5DD87597B76f9ece96 hash=c046bc…6b8491 number=1870 INFO [12-05|14:55:19.172] Imported new chain segment blocks=1 txs=0 mgas=0.000 elapsed=1.690ms mgasps=0.000 number=1870 hash=c046bc…6b8491 cache=0.00B INFO [12-05|14:55:19.172] Commit new mining work number=1871 txs=0 uncles=0 elapsed=160.697µs INFO [12-05|14:55:29.204] Imported new chain segment blocks=1 txs=0 mgas=0.000 elapsed=1.781ms mgasps=0.000 number=1871 hash=84cf74…259397 cache=0.00B INFO [12-05|14:55:29.204] Commit new mining work number=1872 txs=0 uncles=0 elapsed=618.992µs INFO [12-05|14:55:39.169] Committed address=0x4e595f8d4968372E19A36C5DD87597B76f9ece96 hash=0bf719…7dc254 number=1872 INFO [12-05|14:55:39.172] Imported new chain segment blocks=1 txs=0 mgas=0.000 elapsed=1.633ms mgasps=0.000 number=1872 hash=0bf719…7dc254 cache=0.00B INFO [12-05|14:55:39.173] Commit new mining work number=1873 txs=0 uncles=0 elapsed=188.598µs INFO [12-05|14:55:49.361] Imported new chain segment blocks=1 txs=0 mgas=0.000 elapsed=2.026ms mgasps=0.000 number=1873 hash=183c2a…ca2083 cache=0.00B INFO [12-05|14:55:49.362] Commit new mining work number=1874 txs=0 uncles=0 elapsed=334.095µs INFO [12-05|14:55:49.362] block reached canonical chain number=1868 hash=cf1ab9…434ef1 INFO [12-05|14:55:59.226] Committed address=0x4e595f8d4968372E19A36C5DD87597B76f9ece96 hash=c0686f…02defe number=1874 INFO [12-05|14:55:59.230] Imported new chain segment blocks=1 txs=0 mgas=0.000 elapsed=2.101ms mgasps=0.000 number=1874 hash=c0686f…02defe cache=0.00B INFO [12-05|14:55:59.231] Commit new mining work number=1875 txs=0 uncles=0 elapsed=213.197µs INFO [12-05|14:56:09.177] Committed address=0x4e595f8d4968372E19A36C5DD87597B76f9ece96 hash=40783e…9cd824 number=1875 INFO [12-05|14:56:09.177] Successfully sealed new block number=1875 hash=40783e…9cd824 INFO [12-05|14:56:09.178] mined potential block number=1875 hash=40783e…9cd824 INFO [12-05|14:56:09.178] Commit new mining work number=1876 txs=0 uncles=0 elapsed=253.196µs INFO [12-05|14:56:19.518] Committed address=0x4e595f8d4968372E19A36C5DD87597B76f9ece96 hash=e59aa4…281e80 number=1876 ERROR[12-05|14:56:19.519] not a valid address err="recovery failed" INFO [12-05|15:18:51.667] Regenerated local transaction journal transactions=0 accounts=0 INFO [12-05|16:18:51.667] Regenerated local transaction journal transactions=0 accounts=0 INFO [12-05|17:18:51.667] Regenerated local transaction journal transactions=0 accounts=0 INFO [12-05|18:18:51.667] Regenerated local transaction journal transactions=0 accounts=0

The strange thing is: I tested several times, it's always these 3 nodes which have this error message. Any one knows why?

jbhurat commented 5 years ago

Hi @yjfcn, can you give more details for the below

The error itself is returned while trying to recover the public key of the node which proposed the block

yjfcn commented 5 years ago

Hi @yjfcn, can you give more details for the below

  • Are you using istanbul-tools or any other tools to generate the network configuration
  • Each time you are testing, are you starting afresh (from block 0)
  • Does the issue occurs around the same block numbers
  • Do you see the above error message in all the 3 nodes

The error itself is returned while trying to recover the public key of the node which proposed the block

Hi @jbhurat ,

I searched log file on all 7 nodes, all nodes have the same error:

cat logs/geth.log | grep "not a valid address"

ERROR[12-06|07:23:11.577] not a valid address err="recovery failed" ERROR[12-06|08:24:41.587] not a valid address err="recovery failed" ERROR[12-06|08:37:11.567] not a valid address err="recovery failed" ERROR[12-06|09:08:31.551] not a valid address err="recovery failed" ERROR[12-06|09:47:11.564] not a valid address err="recovery failed" ERROR[12-06|09:47:12.302] not a valid address err="recovery failed" ERROR[12-06|09:47:20.321] not a valid address err="recovery failed" ERROR[12-06|09:50:11.630] not a valid address err="recovery failed" ERROR[12-06|12:02:11.556] not a valid address err="recovery failed" ERROR[12-06|13:04:11.497] not a valid address err="recovery failed" ERROR[12-06|13:10:11.618] not a valid address err="recovery failed" ERROR[12-06|13:13:31.607] not a valid address err="recovery failed" ERROR[12-06|13:22:01.579] not a valid address err="recovery failed" ERROR[12-06|13:56:01.519] not a valid address err="recovery failed" ERROR[12-06|14:24:03.595] not a valid address err="recovery failed" ERROR[12-06|14:33:13.591] not a valid address err="recovery failed" ERROR[12-06|14:53:03.557] not a valid address err="recovery failed" ERROR[12-06|15:14:03.567] not a valid address err="recovery failed" ERROR[12-06|15:30:13.568] not a valid address err="recovery failed" ERROR[12-06|15:45:13.645] not a valid address err="recovery failed" ERROR[12-06|15:48:13.576] not a valid address err="recovery failed" ERROR[12-06|15:51:43.530] not a valid address err="recovery failed" ERROR[12-06|16:35:33.558] not a valid address err="recovery failed" ERROR[12-06|16:43:13.562] not a valid address err="recovery failed" ERROR[12-06|17:03:23.510] not a valid address err="recovery failed" ERROR[12-06|17:57:13.531] not a valid address err="recovery failed" ERROR[12-06|18:00:43.530] not a valid address err="recovery failed" ERROR[12-06|18:02:13.643] not a valid address err="recovery failed" ERROR[12-06|18:51:53.495] not a valid address err="recovery failed" ERROR[12-06|19:05:53.480] not a valid address err="recovery failed" ERROR[12-06|19:22:13.481] not a valid address err="recovery failed" ERROR[12-06|19:29:13.484] not a valid address err="recovery failed" ERROR[12-06|19:35:23.561] not a valid address err="recovery failed" ERROR[12-06|20:38:33.559] not a valid address err="recovery failed" ERROR[12-06|21:23:53.542] not a valid address err="recovery failed" ERROR[12-06|21:56:23.543] not a valid address err="recovery failed" ERROR[12-06|22:15:13.500] not a valid address err="recovery failed" ERROR[12-06|22:49:53.490] not a valid address err="recovery failed" ERROR[12-06|23:10:43.539] not a valid address err="recovery failed" ERROR[12-06|23:13:23.574] not a valid address err="recovery failed" ERROR[12-06|23:16:23.502] not a valid address err="recovery failed" ERROR[12-06|23:19:53.489] not a valid address err="recovery failed" ERROR[12-06|23:23:43.579] not a valid address err="recovery failed" ERROR[12-06|23:25:53.641] not a valid address err="recovery failed" ERROR[12-06|23:46:23.497] not a valid address err="recovery failed" ERROR[12-07|01:56:23.570] not a valid address err="recovery failed" ERROR[12-07|02:13:53.572] not a valid address err="recovery failed" ERROR[12-07|02:31:23.552] not a valid address err="recovery failed"

Because I have 7 nodes, if 3 nodes of them show this error message at the same time, the blockchain will stuck. If only 1 or 2 shows, the blockchain still works. The latest test continued for a whole night, it seems OK for now although I saw this error message occasionally. I am not sure how long they can work because the frequency of this error occurrence is still high.

vietlq commented 5 years ago

Looks like this is related to https://github.com/getamis/istanbul-tools/issues/108

I'm playing with 7nodes example and trying to reproduce the issue.

@yjfcn could you please tell me how to reproduce the issue? Did you create any smart contract & sent any tx while testing? I have ran for a few times but still have not encountered this issue. Thanks.

I'm just curious if this related to using istanbul-tools in certain way :)

yjfcn commented 5 years ago

Looks like this is related to getamis/istanbul-tools#108

@yjfcn could you please tell me how to reproduce the issue? I have ran for a few times but still have not encountered this issue. Thanks.

I installed 7 nodes( different PCs), then use istanbul-tools to generate the configuration files: istanbul setup --num 7 --nodes --quorum --verbose --save

After that, copy nodekey to corresponding node, copy other configuration files to each node and modify the static-nodes.json and tm.conf files(replace the IP address).

When all these finished, I start the blockchain: geth --datadir /qdata/dd init /qdata/genesis.json geth --datadir /qdata/dd --syncmode full --mine --minerthreads 1 --istanbul.blockperiod 10 --rpc --rpcaddr 0.0.0.0 --rpcapi eth,net,web3,quorum --rpccorsdomain=* --nodiscover

After running for several hours, all nodes will have this error message. But the blockchain still works because there are 2 spare nodes.

yjfcn commented 5 years ago

Looks like this is related to getamis/istanbul-tools#108

I'm playing with 7nodes example and trying to reproduce the issue.

@yjfcn could you please tell me how to reproduce the issue? Did you create any smart contract & sent any tx while testing? I have ran for a few times but still have not encountered this issue. Thanks.

I'm just curious if this related to using istanbul-tools in certain way :)

I didn't create any smart contract & sent any tx while testing. You can add the parameter --istanbul.blockperiod 10 when you test. Maybe you can run for several hours then check the log.

yjfcn commented 5 years ago

Looks like this is related to getamis/istanbul-tools#108

I'm playing with 7nodes example and trying to reproduce the issue.

@yjfcn could you please tell me how to reproduce the issue? Did you create any smart contract & sent any tx while testing? I have ran for a few times but still have not encountered this issue. Thanks.

I'm just curious if this related to using istanbul-tools in certain way :)

I tried 7nodes example(set istanbul.blockperiod 10) , it also has the same error message: ERROR[12-07|07:53:44.048] not a valid address err="recovery failed"

Also, there is another warning message in 7nodes example: WARN [12-07|07:39:14.030] Ignore preprepare messages from non-proposer address=0xb131288F355BC27090E542aE0be213c20350B767 from=0xb912De287F9b047B4228436E94B5b78E3Ee16171 state="Accept request"

But this node actually is a proposer except that there are some upper case characters in the address:

istanbul.getValidators() ["0x6571d97f340c8495b661a823f2c2145ca47d63c2", "0x8157d4437104e3b8df4451a85f7b2438ef6699ff", "0xb131288f355bc27090e542ae0be213c20350b767", "0xb912de287f9b047b4228436e94b5b78e3ee16171", "0xd8dba507e85f116b1f7e231ca8525fc9008a6966", "0xe36cbeb565b061217930767886474e3cde903ac5", "0xf512a992f3fb749857d758ffda1330e590fa915e"]

jbhurat commented 5 years ago

@yjfcn do you also see ERROR[12-07|02:31:23.552] not a valid address err="recovery failed" messages when block period is set to 1 second

jbhurat commented 5 years ago

Can you also increase the verbosity and post the logs from all the nodes

jbhurat commented 5 years ago

But this node actually is a proposer except that there are some upper case characters in the address:

istanbul.getValidators() ["0x6571d97f340c8495b661a823f2c2145ca47d63c2", "0x8157d4437104e3b8df4451a85f7b2438ef6699ff", "0xb131288f355bc27090e542ae0be213c20350b767", "0xb912de287f9b047b4228436e94b5b78e3ee16171", "0xd8dba507e85f116b1f7e231ca8525fc9008a6966", "0xe36cbeb565b061217930767886474e3cde903ac5", "0xf512a992f3fb749857d758ffda1330e590fa915e"]

The node is a validator, but nothing in the logs suggests that the node is also the proposer. Also, the proposer changes every round. Can you increase the verbosity and paste the logs, that might give us hints on what the issue is

vietlq commented 5 years ago

Thanks @yjfcn for detailed guide. I set --istanbul.blockperiod 10 in the 7nodes example and after ~8 mins of running I could see the issue on Node 7:

ERROR[12-09|22:05:07] not a valid address                      err="recovery failed"

I haven't seen the dead-lock yet, still waiting. My wild guess that if the above issue err="recovery failed" happens across >= f + 1 validators during the same block the chain could see dead-lock. I'm looking at the code trying to understand why the error err="recovery failed" occurred in the first place.

https://github.com/jpmorganchase/quorum/blob/4b44bea1d7f38382d22f810e598f5f15aba5856e/crypto/secp256k1/secp256.go#L118

https://github.com/jpmorganchase/quorum/blob/4b44bea1d7f38382d22f810e598f5f15aba5856e/crypto/secp256k1/ext.h#L30

vietlq commented 5 years ago

Good news is I was able to reproduce the dead-lock after merely ~2 hours (block 778).

Because I have 7 nodes, if 3 nodes of them show this error message at the same time, the blockchain will stuck. If only 1 or 2 shows, the blockchain still works.

Just what I suspected: >= f + 1 validators with the error err="recovery failed" during the same block will dead-lock the chain. For my case I got issue on 3 nodes 2, 3 & 5. IBFT can tolerate up to f nodes (max 2 in this case).

Here's the latest logs of the 7nodes group:

node3_1       | ERROR[12-10|06:52:50] not a valid address                      err="recovery failed"
node3_1       | DEBUG[12-10|06:52:50] No more headers available                peer=ac6b1096ca56b9f6
node3_1       | DEBUG[12-10|06:52:50] Header download terminated               peer=ac6b1096ca56b9f6
node3_1       | ERROR[12-10|06:52:50]
node3_1       | ########## BAD BLOCK #########
node3_1       | Chain config: {ChainID: 10 Homestead: <nil> DAO: <nil> DAOSupport: false EIP150: 2 EIP155: 0 EIP158: 3 Byzantium: 1 IsQuorum: true Engine: istanbul}
node3_1       |
node3_1       | Number: 778
node3_1       | Hash: 0x4e49285f6da14ba6b7cfde5d09de26f38076df91a9374aafbb54ab02b3ed3a4b
node3_1       |
node3_1       |
node3_1       | Error: invalid signature
node3_1       | ##############################

node2_1       | ERROR[12-10|06:52:52] not a valid address                      err="recovery failed"
node2_1       | ERROR[12-10|06:52:52]
node2_1       | ########## BAD BLOCK #########
node2_1       | Chain config: {ChainID: 10 Homestead: <nil> DAO: <nil> DAOSupport: false EIP150: 2 EIP155: 0 EIP158: 3 Byzantium: 1 IsQuorum: true Engine: istanbul}
node2_1       |
node2_1       | Number: 778
node2_1       | Hash: 0x4e49285f6da14ba6b7cfde5d09de26f38076df91a9374aafbb54ab02b3ed3a4b
node2_1       |
node2_1       |
node2_1       | Error: invalid signature
node2_1       | ##############################

node5_1       | ERROR[12-10|06:52:54] not a valid address                      err="recovery failed"
node5_1       | ERROR[12-10|06:52:54]
node5_1       | ########## BAD BLOCK #########
node5_1       | Chain config: {ChainID: 10 Homestead: <nil> DAO: <nil> DAOSupport: false EIP150: 2 EIP155: 0 EIP158: 3 Byzantium: 1 IsQuorum: true Engine: istanbul}
node5_1       |
node5_1       | Number: 778
node5_1       | Hash: 0x4e49285f6da14ba6b7cfde5d09de26f38076df91a9374aafbb54ab02b3ed3a4b
node5_1       |
node5_1       |
node5_1       | Error: invalid signature
node5_1       | ##############################

node3_1       | ERROR[12-10|06:53:00] not a valid address                      err="recovery failed"
node3_1       | ERROR[12-10|06:53:00]
node3_1       | ########## BAD BLOCK #########
node3_1       | Chain config: {ChainID: 10 Homestead: <nil> DAO: <nil> DAOSupport: false EIP150: 2 EIP155: 0 EIP158: 3 Byzantium: 1 IsQuorum: true Engine: istanbul}
node3_1       |
node3_1       | Number: 778
node3_1       | Hash: 0x4e49285f6da14ba6b7cfde5d09de26f38076df91a9374aafbb54ab02b3ed3a4b
node3_1       |
node3_1       |
node3_1       | Error: invalid signature
node3_1       | ##############################

node2_1       | ERROR[12-10|06:53:02] not a valid address                      err="recovery failed"
node2_1       | ERROR[12-10|06:53:02]
node2_1       | ########## BAD BLOCK #########
node2_1       | Chain config: {ChainID: 10 Homestead: <nil> DAO: <nil> DAOSupport: false EIP150: 2 EIP155: 0 EIP158: 3 Byzantium: 1 IsQuorum: true Engine: istanbul}
node2_1       |
node2_1       | Number: 778
node2_1       | Hash: 0x4e49285f6da14ba6b7cfde5d09de26f38076df91a9374aafbb54ab02b3ed3a4b
node2_1       |
node2_1       |
node2_1       | Error: invalid signature
node2_1       | ##############################

node5_1       | ERROR[12-10|06:57:04] not a valid address                      err="recovery failed"
node5_1       | ERROR[12-10|06:57:04]
node5_1       | ########## BAD BLOCK #########
node5_1       | Chain config: {ChainID: 10 Homestead: <nil> DAO: <nil> DAOSupport: false EIP150: 2 EIP155: 0 EIP158: 3 Byzantium: 1 IsQuorum: true Engine: istanbul}
node5_1       |
node5_1       | Number: 778
node5_1       | Hash: 0x4e49285f6da14ba6b7cfde5d09de26f38076df91a9374aafbb54ab02b3ed3a4b
node5_1       |
node5_1       |
node5_1       | Error: invalid signature
node5_1       | ##############################
yjfcn commented 5 years ago

Thanks @yjfcn for detailed guide. I set --istanbul.blockperiod 10 in the 7nodes example and after ~8 mins of running I could see the issue on Node 7:

ERROR[12-09|22:05:07] not a valid address                      err="recovery failed"

I haven't seen the dead-lock yet, still waiting. My wild guess that if the above issue err="recovery failed" happens across >= f + 1 validators during the same block the chain could see dead-lock. I'm looking at the code trying to understand why the error err="recovery failed" occurred in the first place.

quorum/crypto/secp256k1/secp256.go

Line 118 in 4b44bea

if C.secp256k1_ext_ecdsa_recover(context, (*C.uchar)(unsafe.Pointer(&pubkey[0])), sigdata, msgdata) == 0 { quorum/crypto/secp256k1/ext.h

Line 30 in 4b44bea

static int secp256k1_ext_ecdsa_recover(

After running 7nodes example(--istanbul.blockperiod 10) for more than 6 hours, 3 nodes appeared the error err="recovery failed" which caused the blockchain to dead-lock.

cat 1.log|grep "not a valid address"

cat 2.log|grep "not a valid address"

cat 3.log|grep "not a valid address"

cat 4.log|grep "not a valid address"

ERROR[12-07|07:53:44.048] not a valid address err="recovery failed" ERROR[12-07|12:37:14.034] not a valid address err="recovery failed" ERROR[12-07|13:39:44.044] not a valid address err="recovery failed"

cat 5.log|grep "not a"

ERROR[12-07|13:39:44.045] not a valid address err="recovery failed"

cat 6.log|grep "not a"

ERROR[12-07|13:39:44.041] not a valid address err="recovery failed"

yjfcn commented 5 years ago

Can you also increase the verbosity and post the logs from all the nodes

OK, will post logs when dead lock appear.

jbhurat commented 5 years ago

@yjfcn are the nodes running under docker. If yes, can you try using a non dockerized version. Also, can you confirm if there is no time drift between the nodes. I have tried to replicate the deadlock in a non virtualized environment, but so far have not been able to

yjfcn commented 5 years ago

@yjfcn are the nodes running under docker. If yes, can you try using a non dockerized version. Also, can you confirm if there is no time drift between the nodes. I have tried to replicate the deadlock in a non virtualized environment, but so far have not been able to

@jbhurat These nodes are not running under docker. They are decentralized in different place. Maybe there is time drift between the nodes, but it's not more than 30ms. And I found another thing which is very interesting: If I set --istanbul.blockperiod 10, the error "not a valid address err="recovery failed" will appear on each nodes after running several hours. If using other number, such as 5, 9, 12 etc, this error will not appear on all nodes. 10 is a magic number!!! Can you try to set --istanbul.blockperiod 10 for testing?

yjfcn commented 5 years ago

@jbhurat Most likely it's --istanbul.blockperiod 10 cause this problem. If I use other number, all nodes work fine!

jbhurat commented 5 years ago

@yjfcn I have been running a 7 node network locally for over 24 hours, I do see not a valid address err="recovery failed" error message in the logs, but no deadlock. I am going through the verbose logs to see what is triggering that error. Yes, I see that message only when setting --istanbul.blockperiod 10 and no other number

jbhurat commented 5 years ago

@yjfcn I have reviewed the logs, and one thing I have noticed that the issue always happens, when there is a round change and depending on what state that particular node is in, it might fail to validate the header. Also, the default round change timeout istanbul.requesttimeout is 10000 milliseconds which is 10 seconds. This explains why we only see the error when istanbul.blockperiod is set to 10. Can you try setting --istanbul.requesttimeout 15000 and see if this fixes the issue. I am trying to do the same on my end and will let you know what I find.

yjfcn commented 5 years ago

@jbhurat Tried setting --istanbul.requesttimeout 15000 and --istanbul.blockperiod 10 for 24 hours, looks no problem.

jbhurat commented 5 years ago

Good to hear. I see the similar results on my tests.

I am closing the issue. Please reopen if you need any further assistance.