Consensys / teku

Open-source Ethereum consensus client written in Java
https://consensys.io/teku
Apache License 2.0
668 stars 283 forks source link

Recover gracefully from missed deposits #6170

Open rolfyone opened 2 years ago

rolfyone commented 2 years ago

When processing eth1 deposits, there are failure scenarios where we are processing through a list and between batches the eth1 client has a problem, and comes back in a different state.

In this scenario, we need to be able to recover gracefully.

6155 is an example of such an occurrence, where we are in the middle of processing, and the client is no longer able to provide deposits that it should have.

A restart does fix the problem when it occurs, but it would be far better for teku to detect, wait, and retry.

shaddamciv commented 2 years ago

Adding here, since #6155 is closed. Also for anyone that runs into the same issue post restart -

I restarted both teku and besu, however teku still shows the same FATAL error message. However, after I upgraded to latest besu version of 22.7.6, the logs seem to have cleared.

teku version - teku/v22.9.1+4-ge61f57d/linux-x86_64/-privatebuild-openjdk64bitservervm-java-17

16:39 : 2022-10-11 07:16:39.562 INFO  - Loading deposits up to Eth1 block 15718546
Oct 11 07:16:40 : 2022-10-11 07:16:40.981 FATAL - PLEASE CHECK YOUR ETH1 NODE | Encountered a problem retrieving deposit events from eth1 endpoint: Expected next deposit at index 450831, but got 450832
Oct 11 07:16:40 : tech.pegasys.teku.ethereum.pow.api.InvalidDepositEventsException: Expected next deposit at index 450831, but got 450832
Oct 11 07:16:40 : at tech.pegasys.teku.ethereum.pow.api.InvalidDepositEventsException.expectedDepositAtIndex(InvalidDepositEventsException.java:26) ~[teku-ethereum-pow-api-develop.jar:22.9.1+4-ge61f57d]
Oct 11 07:16:40 : at tech.pegasys.teku.beacon.pow.ValidatingEth1EventsPublisher.assertDepositEventIsValid(ValidatingEth1EventsPublisher.java:56) ~[teku-beacon-pow-develop.jar:22.9.1+4-ge61f57d]
Oct 11 07:16:40 : at tech.pegasys.teku.beacon.pow.ValidatingEth1EventsPublisher.onDepositsFromBlock(ValidatingEth1EventsPublisher.java:46) ~[teku-beacon-pow-develop.jar:22.9.1+4-ge61f57d]
Oct 11 07:16:40 : at tech.pegasys.teku.beacon.pow.DepositFetcher.postDeposits(DepositFetcher.java:226) ~[teku-beacon-pow-develop.jar:22.9.1+4-ge61f57d]
Oct 11 07:16:40 : at tech.pegasys.teku.beacon.pow.DepositFetcher.postEventsForBlock(DepositFetcher.java:192) ~[teku-beacon-pow-develop.jar:22.9.1+4-ge61f57d]
Oct 11 07:16:40 : at tech.pegasys.teku.beacon.pow.DepositFetcher.postDepositEvents(DepositFetcher.java:163) ~[teku-beacon-pow-develop.jar:22.9.1+4-ge61f57d]
Oct 11 07:16:40 : at tech.pegasys.teku.beacon.pow.DepositFetcher.lambda$postDepositEvents$4(DepositFetcher.java:179) ~[teku-beacon-pow-develop.jar:22.9.1+4-ge61f57d]
Oct 11 07:16:40 : at java.util.concurrent.CompletableFuture$UniCompose.tryFire(CompletableFuture.java:1150) ~[?:?]
Oct 11 07:16:40 : at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510) ~[?:?]
Oct 11 07:16:40 : at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2147) ~[?:?]
Oct 11 07:16:40 : at tech.pegasys.teku.infrastructure.async.SafeFuture.lambda$propagateResult$3(SafeFuture.java:147) ~[teku-infrastructure-async-develop.jar:22.9.1+4-ge61f57d]
Oct 11 07:16:40 : at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863) ~[?:?]
Oct 11 07:16:40 : at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841) ~[?:?]
Oct 11 07:16:40 : at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510) ~[?:?]
Oct 11 07:16:40 : at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2147) ~[?:?]
Oct 11 07:16:40 : at tech.pegasys.teku.infrastructure.async.SafeFuture.lambda$propagateResult$3(SafeFuture.java:147) ~[teku-infrastructure-async-develop.jar:22.9.1+4-ge61f57d]
Oct 11 07:16:40 : at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863) ~[?:?]
Oct 11 07:16:40 : at java.util.concurrent.CompletableFuture.uniWhenCompleteStage(CompletableFuture.java:887) ~[?:?]
Oct 11 07:16:40 : at java.util.concurrent.CompletableFuture.whenComplete(CompletableFuture.java:2325) ~[?:?]
Oct 11 07:16:40 : at tech.pegasys.teku.infrastructure.async.SafeFuture.whenComplete(SafeFuture.java:580) ~[teku-infrastructure-async-develop.jar:22.9.1+4-ge61f57d]
Oct 11 07:16:40 : at tech.pegasys.teku.infrastructure.async.SafeFuture.whenComplete(SafeFuture.java:32) ~[teku-infrastructure-async-develop.jar:22.9.1+4-ge61f57d]
Oct 11 07:16:40 : at tech.pegasys.teku.infrastructure.async.SafeFuture.propagateResult(SafeFuture.java:142) ~[teku-infrastructure-async-develop.jar:22.9.1+4-ge61f57d]
Oct 11 07:16:40 : at tech.pegasys.teku.infrastructure.async.SafeFuture.lambda$exceptionallyCompose$30(SafeFuture.java:399) ~[teku-infrastructure-async-develop.jar:22.9.1+4-ge61f57d]
Oct 11 07:16:40 : at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863) ~[?:?]
Oct 11 07:16:40 : at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841) ~[?:?]
Oct 11 07:16:40 : at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510) ~[?:?]
Oct 11 07:16:40 : at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2147) ~[?:?]
Oct 11 07:16:40 : at tech.pegasys.teku.infrastructure.async.SafeFuture.lambda$propagateResult$3(SafeFuture.java:147) ~[teku-infrastructure-async-develop.jar:22.9.1+4-ge61f57d]
Oct 11 07:16:40 : at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863) ~[?:?]
Oct 11 07:16:40 : at java.util.concurrent.CompletableFuture.uniWhenCompleteStage(CompletableFuture.java:887) ~[?:?]
Oct 11 07:16:40 : at java.util.concurrent.CompletableFuture.whenComplete(CompletableFuture.java:2325) ~[?:?]
Oct 11 07:16:40 : at tech.pegasys.teku.infrastructure.async.SafeFuture.whenComplete(SafeFuture.java:580) ~[teku-infrastructure-async-develop.jar:22.9.1+4-ge61f57d]
Oct 11 07:16:40 : at tech.pegasys.teku.infrastructure.async.SafeFuture.whenComplete(SafeFuture.java:32) ~[teku-infrastructure-async-develop.jar:22.9.1+4-ge61f57d]
Oct 11 07:16:40 : at tech.pegasys.teku.infrastructure.async.SafeFuture.propagateResult(SafeFuture.java:142) ~[teku-infrastructure-async-develop.jar:22.9.1+4-ge61f57d]
Oct 11 07:16:40 : at tech.pegasys.teku.infrastructure.async.SafeFuture.lambda$exceptionallyCompose$30(SafeFuture.java:399) ~[teku-infrastructure-async-develop.jar:22.9.1+4-ge61f57d]
Oct 11 07:16:40 : at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863) ~[?:?]
Oct 11 07:16:40 : at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841) ~[?:?]
Oct 11 07:16:40 : at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510) ~[?:?]
Oct 11 07:16:40 : at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2147) ~[?:?]
Oct 11 07:16:40 : at tech.pegasys.teku.infrastructure.async.SafeFuture.lambda$propagateResult$3(SafeFuture.java:147) ~[teku-infrastructure-async-develop.jar:22.9.1+4-ge61f57d]
Oct 11 07:16:40 : at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863) ~[?:?]
Oct 11 07:16:40 : at java.util.concurrent.CompletableFuture.uniWhenCompleteStage(CompletableFuture.java:887) ~[?:?]
Oct 11 07:16:40 : at java.util.concurrent.CompletableFuture.whenComplete(CompletableFuture.java:2325) ~[?:?]
Oct 11 07:16:40 : at tech.pegasys.teku.infrastructure.async.SafeFuture.whenComplete(SafeFuture.java:580) ~[teku-infrastructure-async-develop.jar:22.9.1+4-ge61f57d]
Oct 11 07:16:40 : at tech.pegasys.teku.infrastructure.async.SafeFuture.whenComplete(SafeFuture.java:32) ~[teku-infrastructure-async-develop.jar:22.9.1+4-ge61f57d]
Oct 11 07:16:40 : at tech.pegasys.teku.infrastructure.async.SafeFuture.propagateResult(SafeFuture.java:142) ~[teku-infrastructure-async-develop.jar:22.9.1+4-ge61f57d]
Oct 11 07:16:40 : at tech.pegasys.teku.infrastructure.async.SafeFuture.lambda$exceptionallyCompose$30(SafeFuture.java:399) ~[teku-infrastructure-async-develop.jar:22.9.1+4-ge61f57d]
Oct 11 07:16:40 : at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863) ~[?:?]
Oct 11 07:16:40 : at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841) ~[?:?]
Oct 11 07:16:40 : at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510) ~[?:?]
Oct 11 07:16:40 : at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2147) ~[?:?]
Oct 11 07:16:40 : at tech.pegasys.teku.infrastructure.async.SafeFuture.lambda$propagateResult$3(SafeFuture.java:147) ~[teku-infrastructure-async-develop.jar:22.9.1+4-ge61f57d]
Oct 11 07:16:40 : at java.util.concurrent.CompletableFuture.uniWhenComplete(CompletableFuture.java:863) ~[?:?]
Oct 11 07:16:40 : at java.util.concurrent.CompletableFuture$UniWhenComplete.tryFire(CompletableFuture.java:841) ~[?:?]
Oct 11 07:16:40 : at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:510) ~[?:?]
Oct 11 07:16:40 : at java.util.concurrent.CompletableFuture.complete(CompletableFuture.java:2147) ~[?:?]
Oct 11 07:16:40 : at org.web3j.utils.Async.lambda$run$1(Async.java:38) ~[core-4.9.3.jar:?]
Oct 11 07:16:40 : at java.util.concurrent.CompletableFuture$AsyncRun.run(CompletableFuture.java:1804) ~[?:?]
Oct 11 07:16:40 : at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
Oct 11 07:16:40 : at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
Oct 11 07:16:40 : at java.lang.Thread.run(Thread.java:833) ~[?:?]
q9f commented 1 year ago

I'm facing the same issue which affects my Goerli validators. Please let me know if there is anything I can provide.

Besu 22.10.0, Teku 22.11.0

q9f commented 1 year ago

Is there any way to recover from this? My Prater validators are suffering big time. :(

ajsutton commented 1 year ago

@q9f Step one is to fix the EL to ensure it is returning all deposits. Sometimes this happens automatically because its as just syncing or temporary, sometimes there's other issues that need to be fixed. Then restart Teku and it will begin requesting deposits from the last one it successfully imported and should pick up the missed deposits.

q9f commented 1 year ago

There is a chicken and egg problem though. I cannot fix the EL without a CL connected.

Unfortunately, I have to switch to a different consensus client now, I can no longer maintain this setup. :(

benjaminion commented 1 year ago

@q9f I've raised this up the backlog, but we've been fully maxed on protocol updates and other high priority features for a while.

@ajsutton's point still stands - this situation only arises when the execution client misbehaves. If (1) it's temporary misbehaviour by the EL, then restarting Teku should clear it and all will be well. We could also try to recover automatically in this case, as per this ticket. However, (2) if the exec client is buggy or in a bad state, there's nothing we can do to recover on the Teku side. That's what Adrian means by "fix the EL" I think. This is the most common situation we see in the wild. But generally I'm not having any trouble with recent versions of Besu.

q9f commented 1 year ago

But it's a chicken an egg problem. Once the CL fails, the EL also stalls. There is no such thing as "fixing the EL" without fixing the CL.

And my Besu node is not complaining. It just basically sits and waits for Teku to come back.

benjaminion commented 1 year ago

Not really, in my experience. This issue occurs because the EL client has provided bad (incomplete) data, but it doesn't know that it has done so, hence it reports no errors. This can sometimes be resolved by restarting the EL client to clear whatever state it's got itself into, and then the CL client so that it re-requests the data. But sometimes it's due to an EL client bug and needs a genuine software fix. There's not a lot Teku can do in either of these situations. If it's simply a transient issue then it's possible that Teku can recover just by asking for the data again, and we'll look at implementing this, but I suspect it will likely help in only some circumstances.

q9f commented 1 year ago

Ok, thanks. I replaced Besu with Geth and see if this is more stable maybe.

smasher816 commented 1 year ago

I'm running into something similar.

Feb 05 18:56:11 hoenir teku[29816]: 18:56:11.003 INFO  - Syncing     *** Target slot: 5735679, Head slot: 5735674, Remaining slots: 5, Connected peers: 0
Feb 05 18:56:12 hoenir teku[29816]: 18:56:12.718 FATAL - PLEASE CHECK YOUR ETH1 NODE | Encountered a problem retrieving deposit events from eth1 endpoint: Expected next deposit at index 0, but got 192 tech.pegasys.teku.ethereum.pow.api.InvalidDepositEventsException: Expected next deposit at index 0, but got 192 (See log file for full stack trace)
Feb 05 18:56:14 hoenir teku[29816]: 18:56:14.837 FATAL - PLEASE CHECK YOUR ETH1 NODE | Encountered a problem retrieving deposit events from eth1 endpoint: Expected next deposit at index 0, but got 192 tech.pegasys.teku.ethereum.pow.api.InvalidDepositEventsException: Expected next deposit at index 0, but got 192 (See log file for full stack trace)
Feb 05 18:56:16 hoenir teku[29816]: 18:56:16.957 FATAL - PLEASE CHECK YOUR ETH1 NODE | Encountered a problem retrieving deposit events from eth1 endpoint: Expected next deposit at index 0, but got 192 tech.pegasys.teku.ethereum.pow.api.InvalidDepositEventsException: Expected next deposit at index 0, but got 192 (See log file for full stack trace)
Feb 05 18:56:19 hoenir teku[29816]: 18:56:19.080 FATAL - PLEASE CHECK YOUR ETH1 NODE | Encountered a problem retrieving deposit events from eth1 endpoint: Expected next deposit at index 0, but got 192 tech.pegasys.teku.ethereum.pow.api.InvalidDepositEventsException: Expected next deposit at index 0, but got 192 (See log file for full stack trace)
Feb 05 18:56:21 hoenir teku[29816]: 18:56:21.186 FATAL - PLEASE CHECK YOUR ETH1 NODE | Encountered a problem retrieving deposit events from eth1 endpoint: Expected next deposit at index 0, but got 192 tech.pegasys.teku.ethereum.pow.api.InvalidDepositEventsException: Expected next deposit at index 0, but got 192 (See log file for full stack trace)
Feb 05 18:56:23 hoenir teku[29816]: 18:56:23.316 FATAL - PLEASE CHECK YOUR ETH1 NODE | Encountered a problem retrieving deposit events from eth1 endpoint: Expected next deposit at index 0, but got 192 tech.pegasys.teku.ethereum.pow.api.InvalidDepositEventsException: Expected next deposit at index 0, but got 192 (See log file for full stack trace)
Feb 05 18:56:25 hoenir teku[29816]: 18:56:25.501 FATAL - PLEASE CHECK YOUR ETH1 NODE | Encountered a problem retrieving deposit events from eth1 endpoint: Expected next deposit at index 0, but got 192 tech.pegasys.teku.ethereum.pow.api.InvalidDepositEventsException: Expected next deposit at index 0, but got 192 (See log file for full stack trace)

I have opened up an issue with the EL (nethermind in this instance) https://github.com/NethermindEth/nethermind/issues/5235