Agoric / agoric-sdk

monorepo for the Agoric Javascript smart contract platform
Apache License 2.0

Consensus failure on cosmos DB pruning #8354

Open mhofman opened 11 months ago

mhofman commented 11 months ago

Describe the bug

A validator reported experiencing a consensus failure with the following error:

ERR CONSENSUS FAILURE!!! err="unable to delete version 11686000 with 1 active readers" module=consensus stack="goroutine 7581"

Tracking the error message, it stems from cosmos IAVL pruning logic: https://github.com/cosmos/iavl/blob/v0.17.3/nodedb.go#L203
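For readers unfamiliar with this kind of check, here is a minimal sketch of a reader-count guard that refuses to delete a version while something still has it open. It is not the iavl code: `versionGuard`, `acquire`, and `deleteVersion` are hypothetical names, and only the error text is modelled on the log line above.

```go
package main

import (
	"fmt"
	"sync"
)

// versionGuard is a hypothetical, simplified stand-in for a versioned store
// that counts how many readers currently hold each version open.
type versionGuard struct {
	mu      sync.Mutex
	readers map[int64]int
}

func newVersionGuard() *versionGuard {
	return &versionGuard{readers: make(map[int64]int)}
}

// acquire registers a reader on version and returns a release function.
// Calling release more than once is harmless.
func (g *versionGuard) acquire(version int64) (release func()) {
	g.mu.Lock()
	g.readers[version]++
	g.mu.Unlock()

	var once sync.Once
	return func() {
		once.Do(func() {
			g.mu.Lock()
			g.readers[version]--
			g.mu.Unlock()
		})
	}
}

// deleteVersion refuses to delete a version that still has open readers.
// The message format is copied from the reported log line (assumption:
// the real check is similar in spirit, not identical in code).
func (g *versionGuard) deleteVersion(version int64) error {
	g.mu.Lock()
	defer g.mu.Unlock()
	if n := g.readers[version]; n > 0 {
		return fmt.Errorf("unable to delete version %d with %d active readers", version, n)
	}
	delete(g.readers, version)
	return nil
}

func main() {
	g := newVersionGuard()
	release := g.acquire(11686000)

	// Refused: one reader is still open on this version.
	fmt.Println(g.deleteVersion(11686000))

	release()

	// Succeeds once the reader has been released; prints <nil>.
	fmt.Println(g.deleteVersion(11686000))
}
```

In this sketch the failure only needs one reader that was never released, or that is released later than the pruner expects, which is why the question of who could be holding version 11686000 open matters.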

Our cosmos-sdk uses v0.17.3 of that package, while the latest cosmos-sdk (v0.47) has bumped the dependency to v0.20.0. However, searching the iavl repo for changes and issues doesn't turn up any change in the logic related to deleting versions that have existing readers.

There is a known issue with mismatched snapshot-interval and keep-interval configs in cosmos, but 1) it is supposed to be mitigated in our version of cosmos-sdk, and 2) the validator claims the node is not creating state-sync snapshots.

The config relating to pruning shared by the validator:

pruning = "custom"
pruning-keep-recent = "100"
pruning-keep-every = "0"
pruning-interval = "10"

A keep-recent of 100 should leave enough room for any state-sync snapshot of the cosmos DB to complete. Our full snapshots usually take about 150 blocks, so a full snapshot may not be finished after 100 blocks. However, the multistore is snapshotted first, and its reader is closed before reaching the swingset extension, which is where all the time is spent. See https://github.com/agoric-labs/cosmos-sdk/blob/v0.45.11-alpha.agoric.3/snapshots/manager.go#L176-L186
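To put numbers on the keep-recent reasoning, the sketch below models the reported config. It assumes pruning works roughly like this: on each commit, the height that just fell out of the keep-recent window is queued (unless it is a keep-every multiple), and the queue is flushed whenever the committed height is a multiple of pruning-interval. That model is an assumption made for illustration, and the types and functions (`pruningOptions`, `pruner`, `prunedAt`) are hypothetical; the actual bookkeeping in cosmos-sdk may differ.

```go
package main

import "fmt"

// pruningOptions mirrors the three custom pruning settings from the report.
type pruningOptions struct {
	KeepRecent uint64 // pruning-keep-recent
	KeepEvery  uint64 // pruning-keep-every
	Interval   uint64 // pruning-interval
}

// pruner is a simplified, assumed model of batched pruning.
type pruner struct {
	opts    pruningOptions
	pending []int64 // heights queued for deletion at the next flush
}

// commit models committing `height` and returns the heights deleted by it.
func (p *pruner) commit(height int64) []int64 {
	previous := height - 1
	if int64(p.opts.KeepRecent) < previous {
		candidate := previous - int64(p.opts.KeepRecent)
		// Heights that are keep-every multiples are retained forever.
		if p.opts.KeepEvery == 0 || candidate%int64(p.opts.KeepEvery) != 0 {
			p.pending = append(p.pending, candidate)
		}
	}
	// Flush the queue only on pruning-interval boundaries.
	if p.opts.Interval > 0 && height%int64(p.opts.Interval) == 0 {
		flushed := p.pending
		p.pending = nil
		return flushed
	}
	return nil
}

// prunedAt returns the commit height at which `version` is deleted under
// this model, or false if it is never deleted.
func prunedAt(opts pruningOptions, version int64) (int64, bool) {
	p := &pruner{opts: opts}
	// Under this model a version is deleted at most KeepRecent+Interval
	// blocks after it was committed, so this bound is sufficient.
	limit := version + int64(opts.KeepRecent) + int64(opts.Interval) + 1
	for h := version + 1; h <= limit; h++ {
		for _, v := range p.commit(h) {
			if v == version {
				return h, true
			}
		}
	}
	return 0, false
}

func main() {
	opts := pruningOptions{KeepRecent: 100, KeepEvery: 0, Interval: 10}
	h, ok := prunedAt(opts, 11686000)
	fmt.Println(h, ok) // prints: 11686110 true
}
```

Under these assumptions, a reader opened on version 11686000 has on the order of 100 blocks before a deletion is attempted, which is the margin the multistore portion of a snapshot would need to finish within.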

A somewhat related cosmos-sdk issue regarding "prune everything" doesn't seem applicable, since keep-recent is set to 100.

Expected behavior

No crash on pruning

Platform Environment

agoric-upgrade-11 on mainnet

JimLarson commented 11 months ago

Original thought: Underlying Cosmos issue - need to confirm. If it's broken-as-intended, at least make an FAQ entry.

@mhofman says that the reporter wasn't doing state sync exports, and even if they were, it's already mitigated.

Validator recovered on restart.