consensus-shipyard / ipc

🌳 Spawn multi-level trees of customized, scalable, EVM-compatible networks with IPC. L2++ powered by FVM, Wasm, libp2p, IPFS/IPLD, and CometBFT.
https://ipc.space
Apache License 2.0

fix: prevent panic on chain replay #1197

Open karlem opened 3 weeks ago

karlem commented 3 weeks ago

Closes #1196

Removing the dependency on the CometBFT client in favor of caching. When CometBFT is catching up—replaying from the beginning of the chain to synchronize with the ABCI app—it does not start the RPC API. Unfortunately, our ABCI app relied on the API during consensus events, which made it impossible to replay the chain.

Solved by relying on the exec and app states instead of making calls to CometBFT.
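
For illustration, a minimal sketch of the shape of the change, with hypothetical types (the actual fendermint structs and method names differ): the ABCI handler no longer holds a CometBFT RPC client and instead resolves the proposer from the exec/app state it already owns.

```rust
use std::collections::HashMap;

// Hypothetical types; the real fendermint structs differ.
struct ExecState {
    // Validator public keys keyed by their CometBFT address, kept as
    // part of the application state.
    validators: HashMap<[u8; 20], Vec<u8>>,
}

struct App {
    exec_state: ExecState,
    // Note: no CometBFT RPC client field any more.
}

impl App {
    /// Called from begin_block: resolve the proposer's public key from
    /// local state instead of an RPC round-trip to CometBFT, which is
    /// not available while CometBFT is replaying the chain.
    fn proposer_pubkey(&self, proposer: &[u8; 20]) -> Option<&[u8]> {
        self.exec_state
            .validators
            .get(proposer)
            .map(|pk| pk.as_slice())
    }
}
```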

karlem commented 2 weeks ago

@LePremierHomme could you have another look please?

karlem commented 2 weeks ago

> @karlem A quick question, how will the cache be recovered after crashes? Seems not queried when booting?

Yeah, good point. We would lose it after a crash. Do you think saving it to the app state might be the way to go?

cryptoAtwill commented 2 weeks ago

@karlem The rabbit hole gets deeper. But I think it's possible to read the state during boot-up. The validators and gas limits are all stored in the contract or actor, so technically we can call the getters from the store directly.
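
To make the suggestion concrete, a rough sketch with illustrative names (not the actual fendermint APIs): rebuild the cache at boot by calling the on-chain getters, so nothing needs to survive a crash.

```rust
// Rough sketch only; `ChainStore` and its getters are illustrative
// stand-ins for reading the validator and gas-limit state that the
// actors or contracts already keep on chain.
trait ChainStore {
    /// Current validator public keys, as stored by the relevant actor.
    fn current_validators(&self) -> Vec<Vec<u8>>;
    /// Block gas limit, as stored on chain.
    fn block_gas_limit(&self) -> u64;
}

struct ValidatorCache {
    validators: Vec<Vec<u8>>,
    gas_limit: u64,
}

impl ValidatorCache {
    /// Rebuild the cache during node boot-up from on-chain state, so a
    /// crash never loses anything that is not already recoverable.
    fn restore_from_state<S: ChainStore>(store: &S) -> Self {
        Self {
            validators: store.current_validators(),
            gas_limit: store.block_gas_limit(),
        }
    }
}
```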

raulk commented 2 weeks ago

@cryptoAtwill that's a good point; it would also take us one tiny step closer to our desired end state of having as much logic as possible in on-chain actors.

karlem commented 2 weeks ago

@cryptoAtwill Yeah, I think working with the contract makes more sense. I would ditch this approach in favor of using the contracts, as they are part of the app state and much more solid.

karlem commented 1 week ago

The validators and gas limits are all stored in the contract or actor, so technically we can call the getters from the store directly.

@cryptoAtwill, I believe the issue is that validators are only stored in the actor after top-down finality has been finalized, whereas we need them available earlier.

We have two potential approaches to resolve this:

1. Store the initial set of validators from genesis in the top-down finality actor/contract. This approach might introduce unintended side effects, so it may not be ideal.
2. Create a new actor specifically for storing these validators before top-down finality is achieved.

What are your thoughts on this? Also looping in @raulk for input.

karlem commented 1 week ago

Related: https://github.com/consensus-shipyard/ipc/pull/1166

karlem commented 2 days ago

@cryptoAtwill Changed the implementation to rely on app and exec states instead.

raulk commented 23 hours ago

I'm wrapping my head around the problem and the proposed solution to help push things forward. Here are some notes.

If I'm following correctly, this attempts to fix an edge case where, during startup, CometBFT may attempt to perform a chain replay, either from the WAL or from the network, by feeding blocks to the ABCI app. The issue arises because we recently introduced logic to fetch the block proposer's public key so we could credit gas premiums to their on-chain account. This happens in begin_block and was introduced in #1173. Because our local cache is empty, we fall back to CometBFT, whose RPC API has not started yet, and the call fails.

I think all of this can be greatly simplified if instead of querying CometBFT, we query our power table from the gateway actor and match on the block proposer's identity. The gateway already has public keys. Such a call would be entirely inside the state tree, so it has no dependencies on CometBFT, and would break the problematic cycle.

In fact, we already do this in end_block in order to calculate power table updates.

  1. end_block calls interpreter.end(): https://github.com/consensus-shipyard/ipc/blob/c74a21cbb1d990ce19e8b5099854f3f3b18d479c/fendermint/app/src/app.rs#L802
  2. When we create a checkpoint, it may come with power table updates: https://github.com/consensus-shipyard/ipc/blob/3ad2c0f52fff1323fd32c20d1d3507bcc12094a3/fendermint/vm/interpreter/src/fvm/exec.rs#L222
  3. We query the current power table: https://github.com/consensus-shipyard/ipc/blob/3ad2c0f52fff1323fd32c20d1d3507bcc12094a3/fendermint/vm/interpreter/src/fvm/checkpoint.rs#L71
  4. We already have a helper that queries the gateway actor in the state tree and returns domain objects, containing validator pubkeys: https://github.com/consensus-shipyard/ipc/blob/03e3ebdca9859bfcf9ea8a1fa69109fde451d6ed/fendermint/vm/interpreter/src/fvm/state/ipc.rs#L178
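
Roughly, the begin_block side of that would look like the sketch below (the names are illustrative, not the real APIs; the actual query would be the gateway power-table helper linked in point 4): match the proposer reported by CometBFT against the power table read from the state tree, with no CometBFT RPC involved.

```rust
// Illustrative sketch; `Validator` and `resolve_proposer` are made-up
// names, and the power table is assumed to have been read from the
// gateway actor via the state tree (as in the checkpoint code above).
struct Validator {
    /// Address derived from the validator's public key; this is what
    /// CometBFT reports as the block proposer.
    address: [u8; 20],
    /// The public key we need in order to credit gas premiums.
    public_key: Vec<u8>,
}

/// Resolve the proposer's public key by matching against the power
/// table, without touching the CometBFT RPC.
fn resolve_proposer<'a>(power_table: &'a [Validator], proposer: &[u8; 20]) -> Option<&'a [u8]> {
    power_table
        .iter()
        .find(|v| &v.address == proposer)
        .map(|v| v.public_key.as_slice())
}
```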

In a nutshell, I don't think we need one more cache here, nor to manage its lifecycle, nor anything like that. Assuming I'm right, we can kill the ValidatorTracker entirely and simply replace it with the above.

@cryptoAtwill does this sound right to you?