In some instances where the node doesn't throw missing parent errors, we see a:

ERROR[01-19|08:11:16.626] Block receipts missing, can't freeze number=5123119 hash=226809…b04220

and the node is unable to push blocks into the ancients DB after this.
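For background on why the node stops pushing blocks into ancients: the freezer only migrates a block once it is older than the immutability threshold and once its header, body, and receipts are all still readable from leveldb. The sketch below is a simplified, hypothetical version of that precondition (the real logic lives in go-ethereum's core/rawdb/freezer.go; the interface here is invented purely for illustration):

```go
package freezersketch

import "fmt"

// Simplified, hypothetical view of the freezer migration check; the real
// logic in go-ethereum's core/rawdb/freezer.go is considerably more involved.
type chainStore interface {
	HasHeader(n uint64) bool
	HasBody(n uint64) bool
	HasReceipts(n uint64) bool
}

// canFreeze reports whether block n is eligible to be moved from leveldb into
// the ancients DB: it must be older than the immutability threshold, and its
// header, body, and receipts must all still be present in leveldb.
func canFreeze(db chainStore, n, head, threshold uint64) error {
	if head < threshold || n > head-threshold {
		return fmt.Errorf("block %d is too recent to freeze", n)
	}
	if !db.HasHeader(n) || !db.HasBody(n) {
		return fmt.Errorf("block %d: header/body missing, can't freeze", n)
	}
	if !db.HasReceipts(n) {
		// This corresponds to the "Block receipts missing, can't freeze" error
		// above: the migration stops here and nothing further reaches ancients.
		return fmt.Errorf("block %d: receipts missing, can't freeze", n)
	}
	return nil
}
```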
This issue appears to be fixed in Geth v1.9.10 and subsequently in Geth v1.9.14:

https://github.com/ethereum/go-ethereum/pull/20287
https://github.com/ethereum/go-ethereum/pull/21045
There's also a significant re-write involving the functionality for chain repair in Geth v1.9.20.
hi @vdamle I tried to reproduce the issue at my end with both istanbul and raft consensus and I am not getting any errors.
I did the following to reproduce the issue:

Raft: started the network with --immutabilitythreshold 200 so that old blocks could start freezing when the block height goes beyond 200, then killed the node with a kill -9 <pid> command.

Istanbul: started the network with --immutabilitythreshold 200 so that old blocks could start freezing when the block height goes beyond 200, then killed the node with a kill -9 <pid> command.
Can you give me the following details about your network so that I could analyse further:
Hi @amalrajmani, thank you for taking a look at this. We have hit this error in multiple Kaleido environments running IBFT. I'm using one of the recent occurrences (from which the original logs were posted) as reference for your questions:
Consensus: IBFT
Genesis config:
"config":{
"homesteadBlock":0,
"eip150Block":0,
"eip155Block":0,
"eip158Block":0,
"byzantiumBlock":0,
"isQuorum":true,
"istanbul":{"epoch":30000,"policy":0},
"chainId":2104950325,
"constantinopleBlock":3702732,
"petersburgBlock":5122843,
"istanbulBlock":5122843
}
--datadir <path> --nodiscover --nodekey <path>/nodekey --maxpeers 200 --txpool.pricelimit 0 --rpc --port 30303 --rpcport 8545 --rpcaddr 0.0.0.0 --ws --wsport 8546 --wsaddr 0.0.0.0 --unlock 0 --password <path>/passwords.txt --ipcpath=<path>/geth.ipc --permissioned --mine --rpcapi admin,db,eth,debug,miner,net,shh,txpool,personal,web3,istanbul --wsapi admin,db,eth,debug,miner,net,shh,txpool,personal,web3,istanbul --istanbul.blockperiod 5 --istanbul.requesttimeout 15000 --syncmode full --gcmode full --rpccorsdomain '*' --wsorigins '*' --rpcvhosts '*' --txpool.globalslots 4096 --txpool.accountslots 64 --txpool.globalqueue 1024 --txpool.accountqueue 256 --cache 64 --allow-insecure-unlock --miner.gastarget 804247552 --miner.gasprice 0 --immutabilitythreshold 0 --networkid 2104950325 --verbosity 4
Yes, transactions are being sent to the node (fairly low volume of txns, the tx queue probably had 8-10 transactions). We've also seen this occur on a node to which no transactions are sent during this time.
The node is running in Kubernetes with a specified resource limit on the geth container (memory/CPU). We've observed that the node hits the upper bound of the memory limit (around 2 GB) and Kubernetes terminates the pod as a result - k8s uses SIGKILL to terminate the process (geth) that is being monitored. Kubernetes then auto-restarts the pod.
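For what it's worth, the distinction matters because geth installs handlers for SIGINT/SIGTERM and shuts down cleanly on those, whereas SIGKILL (what k8s and the OOM killer ultimately deliver) cannot be trapped by any process. A minimal Go illustration of that difference, not geth's actual shutdown code:

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

func main() {
	// SIGINT and SIGTERM can be trapped, which is what gives a process such as
	// geth the chance to flush leveldb/freezer state before exiting.
	sigc := make(chan os.Signal, 1)
	signal.Notify(sigc, os.Interrupt, syscall.SIGTERM)

	fmt.Println("waiting for a signal; try `kill -TERM <pid>`")
	s := <-sigc
	fmt.Printf("caught %v, shutting down cleanly\n", s)

	// SIGKILL (kill -9, or a cgroup/OOM kill) never reaches this handler: the
	// kernel terminates the process immediately, so nothing gets flushed and
	// leveldb and the freezer can be left out of sync with each other.
}
```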
Most of the instances where we've seen this error have not resulted in a crash of the node, but the node is stuck at a block that is way behind the chain HEAD. In one instance, we did see the following crash:
DEBUG[01-19|10:24:45.594] Dereferenced trie from memory database nodes=3 size=980.00B time=5.768µs gcnodes=5942 gcsize=1.27MiB gctime=10.122239ms livenodes=34 livesize=5.06KiB
DEBUG[01-19|10:24:45.594] Dereferenced trie from memory database nodes=16 size=2.49KiB time=14.018µs gcnodes=5958 gcsize=1.28MiB gctime=10.136154ms livenodes=18 livesize=2.57KiB
DEBUG[01-19|10:24:45.631] Privacy metadata root hash=000000…000000
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x60 pc=0xbe0de4]
goroutine 2118 [running]:
github.com/ethereum/go-ethereum/core/state.(*StateDB).Prepare(...)
/work/build/_workspace/src/github.com/ethereum/go-ethereum/core/state/statedb.go:802
github.com/ethereum/go-ethereum/core.(*statePrefetcher).Prefetch(0xc0001f8fa0, 0xc00127a120, 0x0, 0xc00517c000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
/work/build/_workspace/src/github.com/ethereum/go-ethereum/core/state_prefetcher.go:63 +0x1e4
github.com/ethereum/go-ethereum/core.(*BlockChain).insertChain.func1(0xc004ffe6c0, 0xc0010a6000, 0x1649858, 0xa655cc1b171fe856, 0x6ef8c092e64583ff, 0xc0ad6c991be0485b, 0x21b463e3b52f6201, 0xc00127a120, 0xc007bae1f0, 0xbff9cabb62b8319f, ...)
/work/build/_workspace/src/github.com/ethereum/go-ethereum/core/blockchain.go:1753 +0x198
created by github.com/ethereum/go-ethereum/core.(*BlockChain).insertChain
/work/build/_workspace/src/github.com/ethereum/go-ethereum/core/blockchain.go:1750 +0x33e2
DEBUG[01-19|10:25:02.037] Sanitizing Go's GC trigger percent=100
INFO [01-19|10:25:02.038] Maximum peer count ETH=200 LES=0 total=200
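The trace points at core/state_prefetcher.go:63 dereferencing a state object that was never successfully created (opening the state for the parent root can fail after the rewind, leaving a nil *StateDB for the prefetch goroutine). Purely to illustrate that failure mode and the shape of a guard against it, using simplified stand-in types rather than the real go-ethereum signatures:

```go
package prefetchsketch

// Stand-ins for the real types in core/state and core/types; the actual code
// paths are statedb.go:802 and state_prefetcher.go:63 in the trace above.
type StateDB struct{ refund uint64 }
type Block struct{ number uint64 }

// Prepare touches a field of s, so it panics with a nil pointer dereference
// (SIGSEGV) when s is nil, matching the crash in the log.
func (s *StateDB) Prepare() { s.refund = 0 }

// prefetch sketches the guard: if the parent state could not be opened (for
// example because it was pruned across a non-graceful restart), skip the
// warm-up instead of dereferencing a nil *StateDB. The block-import path can
// then surface the real missing-state error instead of panicking.
func prefetch(block *Block, statedb *StateDB) {
	if statedb == nil {
		return
	}
	statedb.Prepare()
	// ... warm caches by executing the block's transactions against statedb ...
	_ = block
}
```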
@vdamle can you share the full geth log of the failed node? The log should have all the messages from the time it was restarted.
Hi @amalrajmani sorry for the confusion caused by my previous response - the missing parent error and the panic I reported are from two different chains (not the same block height, genesis config, etc.).

For the missing parent error, here is the genesis config on the node. I have also attached a log from the node in the chain where the error was encountered:

"config":{
"homesteadBlock":0,
"eip150Block":0,
"eip155Block":0,
"eip158Block":0,
"byzantiumBlock":0,
"constantinopleBlock":0,
"petersburgBlock":0,
"istanbulBlock":0,
"isQuorum":true,
"maxCodeSizeConfig":[{"block":0,"size":128}],
"istanbul":{"epoch":30000,"policy":0,"ceil2Nby3Block":0},
"chainId":1730712451
}
The command line arguments are the same as earlier (same blockperiod/syncmode/gcmode, different chainId).

For the panic, the node logs are attached below. The genesis config is what I had posted in my response earlier:

"config":{
"homesteadBlock":0,
"eip150Block":0,
"eip155Block":0,
"eip158Block":0,
"byzantiumBlock":0,
"isQuorum":true,
"istanbul":{"epoch":30000,"policy":0},
"chainId":2104950325,
"constantinopleBlock":3702732,
"petersburgBlock":5122843,
"istanbulBlock":5122843
}
hi @vdamle I am able to reproduce the issue at my end with docker. I am analysing the cause of the issue now. I will get back to you soon.
hi @vdamle This issue is due to a core bug in upstream geth. It has been fixed in geth 1.9.20. It occurs if geth is restarted non-gracefully (kill -9 / start) when geth is running in gcmode=full and has blocks in freezerdb. When the node starts after the non-graceful stop, it ends up in a scenario where there is a gap between leveldb and freezerdb. The node fails with the "missing parent" error when this occurs.
If you run geth in gcmode=archive or do a graceful restart (docker stop/start; don't use docker kill), this error won't occur.
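To make the leveldb/freezerdb gap concrete: after a kill -9, the ancients store can hold blocks up to some height while the next blocks that should be in leveldb were never flushed, so the chain has a hole right where the two stores are supposed to meet. A rough consistency check along those lines, written against hypothetical accessors rather than go-ethereum's actual rawdb API:

```go
package gapsketch

import "fmt"

// Hypothetical accessors standing in for the node's two stores: frozen blocks
// live in the ancients/freezer files, recent blocks in leveldb, and the two
// are expected to meet with no gap in between.
type chainDB interface {
	Frozen() uint64        // blocks [0, Frozen()) are in the ancients store
	HasInKV(n uint64) bool // header+body+receipts for block n present in leveldb
	HeadNumber() uint64
}

// firstGap returns the first block number that is in neither store. On a
// healthy node it returns (0, false); after a non-graceful stop in
// gcmode=full it can return the height where the "missing parent" (or
// "Block receipts missing, can't freeze") errors start showing up.
func firstGap(db chainDB) (uint64, bool) {
	for n := db.Frozen(); n <= db.HeadNumber(); n++ {
		if !db.HasInKV(n) {
			return n, true
		}
	}
	return 0, false
}

func describe(db chainDB) string {
	if n, ok := firstGap(db); ok {
		return fmt.Sprintf("gap between freezerdb and leveldb at block %d", n)
	}
	return "freezerdb and leveldb are contiguous"
}
```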
Thanks for confirming, @amalrajmani. Correct, I'm aware of the fix in Geth 1.9.20, as noted in my earlier comment: https://github.com/ConsenSys/quorum/issues/1117#issuecomment-765124559
I've checked in Quorum Slack about plans for moving to a release of Geth >= 1.9.20 and did not receive any response. I see a PR opened for moving to 1.9.8.
As one would expect, most nodes do not run as archive nodes, which makes the chances of hitting this error fairly high, especially in an environment where nodes run with enforced resource limits and an orchestrator such as k8s restarts nodes (non-gracefully) if those bounds are exceeded. After encountering the error, there is no way to recover the node other than to initiate a full resync, which takes almost a day on long-running chains (which is where this problem is most likely to occur).
Do you have any estimate for when Quorum intends to incorporate a newer release of Geth that will address this issue?
hi @vdamle Quorum geth upgrade releases are managed by @nmvalera and I will let him comment on that.
Re-opening as we still need to validate on the latest GoQuorum master build with the upgrade to v1.9.20.
Hi @vdamle. Can you test the reproduction of this issue using the latest tag of the docker image for GoQuorum? Thank you.
@ricardolyn - Thanks for the PRs to move the Geth version forward! I will test this in the next day or so and let you know.
@ricardolyn I've run into an unrelated issue using code from the latest master: https://go-quorum.slack.com/archives/C825QTQ1Z/p1616037777005000 . Would really like to resolve that before testing this, so that I don't have to test again with private transactions enabled. Will keep you posted on my progress.
@vdamle any update on this testing? thank you
Hi @ricardolyn - Apologies for the delay. I attempted to reproduce the issue with the changes in master and haven't been able to reproduce it. It seems ok to resolve this issue and re-visit with a new issue if we hit anything of this nature again. Thank you for the updates!
that's good news @vdamle! thank you.
we will be releasing this version soon, after we finalise some validation.
System information
Geth version:
Geth/v1.9.7-stable-c6ed3ed2(quorum-v20.10.0)
OS & Version:
linux-amd64
Branch, Commit Hash or Release:
quorum-v20.10.0
Expected behaviour
On a non-graceful shutdown/restart of a gcmode=full, syncmode=full node, it is expected that the local full block may rewind to a block in the past. However, the node must be able to sync to the latest block by fetching missing headers/blocks and rebuilding any missing state.
Actual behaviour
The node should be able to fetch missing headers/blocks and rebuild state from the previous full block to catch up with the rest of the chain. Instead, we see that the node fails with a missing parent error and remains stuck at a block far behind the chain HEAD:
Another instance:
Steps to reproduce the behaviour
Perform a non-graceful restart of a node with blocks in both Ancients/Freezer and Level DB on Quorum v20.10.0.
Backtrace
None