cosmos / cosmos-sdk

:chains: A Framework for Building High Value Public Blockchains :sparkles:
https://cosmos.network/
Apache License 2.0
6.24k stars 3.61k forks source link

bug: pruning while acitve reader is present #9570

Closed tac0turtle closed 3 years ago

tac0turtle commented 3 years ago

Summary of Bug

Node tried to prune with an active reader, this should not be possible

Version

0.42.5

Steps to Reproduce

you can reproduce by syncing an osmosis node with pruning set to everything.

logs:

Jun 23 13:27:20 angl-1 osmosisd[75297]: panic: unable to delete version 1500 with 1 active readers
Jun 23 13:27:20 angl-1 osmosisd[75297]: goroutine 150625 [running]:
Jun 23 13:27:20 angl-1 osmosisd[75297]: github.com/cosmos/cosmos-sdk/store/rootmulti.(*Store).pruneStores(0xc0000ee280)
Jun 23 13:27:20 angl-1 osmosisd[75297]:         github.com/cosmos/cosmos-sdk@v0.42.5/store/rootmulti/store.go:388 +0x26e
Jun 23 13:27:20 angl-1 osmosisd[75297]: github.com/cosmos/cosmos-sdk/store/rootmulti.(*Store).Commit(0xc0000ee280, 0x5e6, 0x0, 0xc0037bf8c0, 0x71374491c28a2f98)
Jun 23 13:27:20 angl-1 osmosisd[75297]:         github.com/cosmos/cosmos-sdk@v0.42.5/store/rootmulti/store.go:362 +0x1c6
Jun 23 13:27:20 angl-1 osmosisd[75297]: github.com/cosmos/cosmos-sdk/baseapp.(*BaseApp).Commit(0xc000dfb380, 0x0, 0x0, 0x0, 0x0)
Jun 23 13:27:20 angl-1 osmosisd[75297]:         github.com/cosmos/cosmos-sdk@v0.42.5/baseapp/abci.go:293 +0x27c
Jun 23 13:27:20 angl-1 osmosisd[75297]: github.com/tendermint/tendermint/abci/client.(*localClient).CommitSync(0xc000e753e0, 0x0, 0x0, 0x0)
Jun 23 13:27:20 angl-1 osmosisd[75297]:         github.com/tendermint/tendermint@v0.34.10/abci/client/local_client.go:258 +0xab
Jun 23 13:27:20 angl-1 osmosisd[75297]: github.com/tendermint/tendermint/proxy.(*appConnConsensus).CommitSync(0xc000e0b350, 0x0, 0x0, 0x10930ef)
Jun 23 13:27:20 angl-1 osmosisd[75297]:         github.com/tendermint/tendermint@v0.34.10/proxy/app_conn.go:93 +0x33
Jun 23 13:27:20 angl-1 osmosisd[75297]: github.com/tendermint/tendermint/state.(*BlockExecutor).Commit(0xc000114a80, 0xb, 0x1, 0x0, 0x0, 0xc00f041630, 0x9, 0x1, 0x5e6, 0xc019f77000, ...)
Jun 23 13:27:20 angl-1 osmosisd[75297]:         github.com/tendermint/tendermint@v0.34.10/state/execution.go:228 +0x244
Jun 23 13:27:20 angl-1 osmosisd[75297]: github.com/tendermint/tendermint/state.(*BlockExecutor).ApplyBlock(0xc000114a80, 0xb, 0x1, 0x0, 0x0, 0xc00f041630, 0x9, 0x1, 0x5e6, 0xc019f77000, ...)
Jun 23 13:27:20 angl-1 osmosisd[75297]:         github.com/tendermint/tendermint@v0.34.10/state/execution.go:180 +0x725
Jun 23 13:27:20 angl-1 osmosisd[75297]: github.com/tendermint/tendermint/blockchain/v0.(*BlockchainReactor).poolRoutine(0xc006e9a540, 0x0)
Jun 23 13:27:20 angl-1 osmosisd[75297]:         github.com/tendermint/tendermint@v0.34.10/blockchain/v0/reactor.go:398 +0x1033
Jun 23 13:27:20 angl-1 osmosisd[75297]: created by github.com/tendermint/tendermint/blockchain/v0.(*BlockchainReactor).OnStart
Jun 23 13:27:20 angl-1 osmosisd[75297]:         github.com/tendermint/tendermint@v0.34.10/blockchain/v0/reactor.go:110 +0x8c

cc @ValarDragon @sunnya97 not sure if this coming from one of your modules but should make note in a README.


For Admin Use

tac0turtle commented 3 years ago

This is a bug where snapshotting was configured to x and pruning was configured to y. It would be good to error out in the case where snapshotting is not happening on a multiple of pruning. this way you avoid getting to a hieght and erroring out

clevinson commented 3 years ago

Do you think this could be state machine breaking, or is can it be safely back ported to v0.42.x ?

alexanderbez commented 3 years ago

I thought we already added a check for this in BaseApp? Not sure if its in 0.42. I could be wrong here, but I thought we added a check to make sure the snapshotting configuration takes into consideration the pruning configuration.

alexanderbez commented 3 years ago

https://github.com/cosmos/cosmos-sdk/blob/master/baseapp/baseapp.go#L307-L318

tac0turtle commented 3 years ago

https://github.com/cosmos/cosmos-sdk/blob/master/baseapp/baseapp.go#L307-L318

ah then it should be backported.

amaury1093 commented 3 years ago

This is actually already in v0.42.x: https://github.com/cosmos/cosmos-sdk/blob/release/v0.42.x/baseapp/baseapp.go#L296-L308

so this issue still needs more investigating

alexanderbez commented 3 years ago

So can we close this?

amaury1093 commented 3 years ago

Or at least @marbar3778 could you help us reproduce this? @likhita-809 on our side ran an osmosis node with pruning = "everything" until ~17000 and still can't repro.

tac0turtle commented 3 years ago

Or at least @marbar3778 could you help us reproduce this? @likhita-809 on our side ran an osmosis node with pruning = "everything" until ~17000 and still can't repro.

then it seems there is some non-determinism somewhere. I was able to reproduce it on two nodes. Will try again

cyberbono3 commented 3 years ago

@marbar3778 can you leave any comments on this?

likhita-809 commented 3 years ago

@marbar3778 any progress on this ?

likhita-809 commented 3 years ago

Or at least @marbar3778 could you help us reproduce this? @likhita-809 on our side ran an osmosis node with pruning = "everything" until ~17000 and still can't repro.

then it seems there is some non-determinism somewhere. I was able to reproduce it on two nodes. Will try again

@marbar3778 did you try this again ?

tac0turtle commented 3 years ago

trying to connect to nodes now

tac0turtle commented 3 years ago

I wasnt able to reproduce, super weird. Will close for now