ethereum / go-ethereum

Go implementation of the Ethereum protocol
https://geth.ethereum.org
GNU Lesser General Public License v3.0

retrieved hash chain is invalid: sidechain ghost-state attack #26022

Open anthonyoliai opened 2 years ago

anthonyoliai commented 2 years ago

System information

Geth version: v1.10.25
OS & Version: Ubuntu

Expected behaviour

I'm currently running a private POA network on AWS using Kubernetes EKS. I've successfully deployed all nodes, and the network is operational.

I'm currently running 3 miners, 1 RPC node, and 1 full bootnode. (ethstats screenshot omitted.)

Most of the time the network produces empty blocks, with bursts of transactions from time to time.

Actual behaviour

I notice that at some point some of the nodes start to drop peers and fall out of sync. I made sure the hardware requirements are met, and I'm closely monitoring all my nodes through Prometheus/Grafana.

For example, yesterday my RPC node stopped syncing and was stuck at block 8000. Interestingly, statically adding the peers did not help either. I was forced to kill the node and restart it from scratch.

The reason for the failure is quite interesting, however. Example output:

DEBUG[10-21|10:01:30.341] Skeleton fill failed                     err="syncing canceled (requested)"
DEBUG[10-21|10:01:30.341] Skeleton chain invalid                   peer=a79227b1 err="syncing canceled (requested)"
DEBUG[10-21|10:01:30.341] Header download terminated               peer=a79227b1
DEBUG[10-21|10:01:30.341] Block body download terminated           err="syncing canceled (requested)"
DEBUG[10-21|10:01:30.341] Receipt download terminated              err="syncing canceled (requested)"
DEBUG[10-21|10:01:30.341] Synchronisation terminated               elapsed=18.303ms
WARN [10-21|10:01:30.341] Synchronisation failed, dropping peer    peer=a79227b144390386ffbc6ac44c1b158d92716fb29d93a3f2d9564511068ed7dd err="retrieved hash chain is invalid: sidechain ghost-state attack"
DEBUG[10-21|10:01:30.341] Message handling failed in snap        peer=a79227b1 err=EOF
DEBUG[10-21|10:01:30.341] Message handling failed in eth         id=a79227b153490386 conn=staticdial        err=EOF
DEBUG[10-21|10:01:30.341] Removing Ethereum peer                   peer=a79227b1 snap=true
DEBUG[10-21|10:01:30.341] Removing p2p peer                        peercount=1 id=a79227b153490386 duration=20.029s       req=false err="useless peer"

I can't find much documentation on this!

What does `retrieved hash chain is invalid: sidechain ghost-state attack` mean? And how would I go about preventing this?

Steps to reproduce the behaviour

I can't share much information regarding how to reproduce this, but I have set up a POA network with a block time of 5 seconds, with 3 miners, 1 RPC node, and 1 full bootnode.

All nodes connect to the bootnode as entrypoint.

Backtrace

See above.


anthonyoliai commented 2 years ago

Furthermore, I'm not sure if this is related, but could the following error be occurring as a consequence of the sidechain ghost-state attack?

DEBUG[10-24|10:21:26.001] Discarded delivered header or block, too far away peer=d64565f54f769356073dbba162538f4b5d469209c29e5535fde022dbce540648 number=65812 hash=f73e9b..ff04bf distance=43496
DEBUG[10-24|10:21:26.001] Peer discarded announcement              peer=27ee362ec1dedff967343917a753d2998e9af0d99879bdc400b3d6354fbacd31 number=65812 hash=f73e9b..ff04bf distance=43496
DEBUG[10-24|10:21:26.001] Discarded delivered header or block, too far away peer=cc9586812a41bf3a7a9ac1a5aea898f942acaf73f4d2d0692fe93c739c09c964 number=65812 hash=f73e9b..ff04bf distance=43496
MariusVanDerWijden commented 2 years ago

I think the problem is that you have a very long sidechain where nothing happens, so when we import the long sidechain but the state doesn't change, we stop the chain import. This is only a concern in PoW; in PoS it can happen that multiple blocks don't update the state, so it might be okay to just import the blocks here. Would be good to talk a bit about this on triage.

MariusVanDerWijden commented 2 years ago

For a quick fix you can probably delete the datadir of the affected nodes, so they sync the correct chain directly rather than as a sidechain.

anthonyoliai commented 2 years ago

> I think the problem is that you have a very long sidechain where nothing happens, so when we import the long sidechain but the state doesn't change, we stop the chain import. This is only a concern in PoW; in PoS it can happen that multiple blocks don't update the state, so it might be okay to just import the blocks here. Would be good to talk a bit about this on triage.

First of all, thanks @MariusVanDerWijden! That does make sense: since the txs come in bursts, the miners often produce no state changes, and a lot of empty blocks are imported. It could be, for example, that there are no state changes for 100 blocks.

I'm not sure what you mean by sidechain in this context. Do you just mean any other "state" coming from peers other than the node itself? Just for clarification.

Looking at my nodes, I do see that they are perfectly in sync at the start: every time a miner mines a new block, it gets properly propagated and imported. So I assume that at some point, after there have been no state changes for a longer period of time, this error occurs. (So far it happens around 24 hours in.)

I'm just trying to understand exactly what is happening here, looking at https://github.com/ethereum/go-ethereum/blob/067bac3f2409aec16994163e7a635d36bdb9b956/core/blockchain.go#L1851.

I assume that if we have, for example, state S, which already exists on the canonical chain, and we import a new state N where N == S, then the state already exists, so we can't just proceed to import these blocks.

I do have to mention that I am running all these nodes on AWS EKS, using proof of authority, in separate pods. Whenever I notice this "ghost-state attack" issue I simply tear down the pod, so the container running the node is restarted and the datadir is deleted. The node then properly syncs back up with the already running nodes.

My initial thought was to write a shell script that polls

```
> eth.syncing.currentBlock
22316
> eth.syncing.highestBlock
65604
```

and simply restarts the node if the difference between currentBlock and highestBlock gets very large. However, I would like to avoid this, as scheduled txs might be lost.
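For what it's worth, the core check such a watchdog would make can be sketched in a few lines of Go. This is a hypothetical helper, not part of geth; in practice the two block numbers would come from the node's `eth_syncing` RPC response, and `lagLimit` is an arbitrary threshold:

```go
package main

import "fmt"

// shouldRestart is hypothetical watchdog logic: restart when currentBlock
// trails highestBlock by more than lagLimit blocks. The numbers would be
// fetched from the node's RPC endpoint (eth_syncing) in a real deployment.
func shouldRestart(current, highest, lagLimit uint64) bool {
	if highest <= current {
		return false // in sync, or the reported head is stale
	}
	return highest-current > lagLimit
}

func main() {
	// The numbers from the console output above: 43288 blocks behind.
	fmt.Println(shouldRestart(22316, 65604, 1000))
}
```

Running this prints `true`, i.e. the node at block 22316 against a network head of 65604 would be restarted.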

holiman commented 2 years ago

I have not yet found my old write-up, but I did find some shorter tldr;s about the issue

So, the TLDR; is, if we can

  • create a side chain, which is old enough so that the ancestor is pruned

    • We get blocks [B..Bn] inserted into the database, with only header validation

  • Create a block Bx, which has the same stateroot as an existing state.
  • And then Geth will switch out the canonical chain for the invalid sidechain, if it has higher TD, despite not having validated the block or state on the blocks.

The attack needs to

  • start on a fork-point far enough back that the state is pruned, and
  • be long enough to reach head - 127,
  • and, of course, continue along in order to have higher TD than the chain to overtake.
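As a toy model of the conditions described above (a simplified sketch for intuition, not geth's actual import code; the real logic lives in core/blockchain.go), the error in the logs is geth refusing to take the dangerous path: a sidechain whose blocks were never executed, whose head reuses a state root geth already has, and whose TD would otherwise force a reorg:

```go
package main

import "fmt"

// block is a stripped-down stand-in for a sidechain head: only header
// fields, since sidechain blocks get header-only validation in this model.
type block struct {
	number    uint64
	stateRoot string // claimed post-state root
	td        uint64 // total difficulty at this block
}

// node models the local view: which state roots exist in the database,
// and the canonical chain's total difficulty.
type node struct {
	knownStateRoots map[string]bool
	canonicalTD     uint64
}

// importSideChain sketches the decision: a higher-TD sidechain that merely
// reuses an existing state (without ever having been executed) is rejected
// with the "ghost-state attack" error instead of triggering a reorg.
func (n *node) importSideChain(head block) string {
	if !n.knownStateRoots[head.stateRoot] {
		return "unknown state: execute blocks normally"
	}
	if head.td <= n.canonicalTD {
		return "lower TD: ignore sidechain"
	}
	return "retrieved hash chain is invalid: sidechain ghost-state attack"
}

func main() {
	n := &node{
		// e.g. the state root shared by a long run of empty blocks
		knownStateRoots: map[string]bool{"0xabc": true},
		canonicalTD:     100,
	}
	fmt.Println(n.importSideChain(block{number: 8000, stateRoot: "0xabc", td: 101}))
}
```

Under this model, a private PoA network full of empty blocks trips the check honestly: many blocks share the same state root, so a re-imported copy of the chain can look exactly like the attack pattern.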

anthonyoliai commented 2 years ago

> I have not yet found my old write-up, but I did find some shorter tldr;s about the issue
>
> So, the TLDR; is, if we can
>
>   • create a side chain, which is old enough so that the ancestor is pruned
>     • We get blocks [B..Bn] inserted into the database, with only header validation
>   • Create a block Bx, which has the same stateroot as an existing state.
>   • And then Geth will switch out the canonical chain for the invalid sidechain, if it has higher TD, despite not having validated the block or state on the blocks.
>
> The attack needs to
>
>   • start on a fork-point far enough back that the state is pruned, and
>   • be long enough to reach head - 127,
>   • and, of course, continue along in order to have higher TD than the chain to overtake.

Thanks, interesting. I think it's very odd then that this error is happening. To give some more context: all geth nodes run in a separate container, but each container contains a geth image that was snapshotted, meaning they all have a carbon copy of the same datadir from initialization. (They are now individual dirs, but they were copied over.) Thus, they all start at a specific block. But after that, they stay in sync, and I don't see any nodes falling behind to the point that your second condition (head - 127) is reached. What do you mean by the first condition, the fork point?

anthonyoliai commented 2 years ago

bump