Open beer-1 opened 5 months ago
hmm... am I reading this correctly that SimulateTx never had explicit version control for its underlying branch store?
I don't think you have the correct fix. The proxy.NewLocalClientCreator already has a mutex.
Simulation comes directly from the Cosmos-side gRPC interface, not through the Comet ABCI, so we need a mutex at the Cosmos level: https://github.com/cosmos/cosmos-sdk/blob/6d2f6ff068c81c5783e01319beaa51c7dbb43edd/x/auth/tx/service.go#L76
CheckTx and Commit are serialized by a mutex held by Comet, but Simulate & CheckTx and Simulate & Commit still need a mutex at the Cosmos level.
Hi @beer-1 , thanks for reporting. I am trying to understand the problem better. Can you share your use case, please? Do you have an error output or example messages?
When I look at the code in baseapp, I can see that message execution is done in a cached store for each simulation or check operation. The state is not committed. Therefore, I assume you run into some collision within the ante handlers. I have modified your example to reflect what I found in baseapp:
func (suite *KeeperTestSuite) Test_CacheContext() {
	ctx := suite.ctx

	// ante commits to the ctx store for CheckTx but never for simulations
	checkCtx, _ := ctx.CacheContext()
	simCtx, _ := ctx.CacheContext()

	pubKey1 := ed25519.GenPrivKey().PubKey()
	pubKey2 := ed25519.GenPrivKey().PubKey()
	addr1 := sdk.AccAddress(pubKey1.Address())
	addr2 := sdk.AccAddress(pubKey2.Address())

	// create account with the check cache store
	acc1 := suite.accountKeeper.NewAccountWithAddress(checkCtx, addr1)
	acc1.SetAccountNumber(100)
	suite.accountKeeper.SetAccount(checkCtx, acc1)

	// create account with the other cached store
	acc2 := suite.accountKeeper.NewAccountWithAddress(simCtx, addr2)
	acc2.SetAccountNumber(100)
	suite.accountKeeper.SetAccount(simCtx, acc2)
}
Let me give you more context. Let's look at the branching of all the contexts:

Root: app.cms
checkState: app.cms.CacheMultiStore() [ref]
checkTxContext (ante): app.cms.CacheMultiStore() [ref1, ref2]
checkTxContext (runMsgs): app.cms.CacheMultiStore().CacheContext()
simulateContext: app.cms.CacheMultiStore().CacheContext()

So when a block commit happens, checkState will be affected, and consequently all derived states will be affected, as in the example below.
func (suite *KeeperTestSuite) Test_CacheContext() {
	ctx := suite.ctx
	cacheCtx, _ := ctx.CacheContext()

	pubKey1 := ed25519.GenPrivKey().PubKey()
	pubKey2 := ed25519.GenPrivKey().PubKey()
	addr1 := sdk.AccAddress(pubKey1.Address())
	addr2 := sdk.AccAddress(pubKey2.Address())

	// create account on ctx
	acc1 := suite.accountKeeper.NewAccountWithAddress(ctx, addr1)
	acc1.SetAccountNumber(100)
	suite.accountKeeper.SetAccount(cacheCtx, acc1)

	// create account on cacheCtx
	acc2 := suite.accountKeeper.NewAccountWithAddress(cacheCtx, addr2)
	acc2.SetAccountNumber(100)
	suite.Require().Panics(func() {
		suite.accountKeeper.SetAccount(cacheCtx, acc2)
	})
}
The problem here is that simulation is async with respect to Commit and CheckTx, because simulation is called directly from the Cosmos gRPC interface, not through the Comet interface.

Scenario 1:
T1 - Simulation started
T2 - block commit
T3 - Simulation finished (state corrupted during the simulation call)

Scenario 2:
T1 - Simulation started
T2 - CheckTx: ante commits the state update
T3 - Simulation finished (state corrupted during the simulation call)
Thanks for the additional context. I still have some trouble following your issue. As I tried to point out, both checkTx and simulations run on separate storage forks via CacheContext. Therefore there is no dirty read or committed state (except for the checkTx ante handlers). This is where your code example is not correct: you use a single cache context for both operations. Only deliverTx persists the message execution.
Nevertheless, I would like to step back from the implementation details and talk about your use case and scenarios. What TX are you submitting? What was the error output?
Maybe the context was not fully conveyed. What I want to say is that "both checkTx and simulations run on separate storage forks" is not true.
Simply, the branching looks like this:
`app.cms` => `checkState` => checkTx -------------- +
^ ^ => simulation |
| | |
| +--------------- ante(update) ----- +
| => `finalizeState` ----+
| |
+---- block commit ------+
If state updates are committed to app.cms or checkState, both checkTx and simulation will use dirty state.
Did you see these links [ref1, ref2, ref3]?
checkTx's ante directly writes state updates to checkState, and block commit directly writes state updates to app.cms [ref].
I'm not sure what the difference is between this situation and the above test code. You can think of ctx := suite.ctx as app.cms or checkState, and of cacheCtx, _ := ctx.CacheContext() as the simulation cache ctx or the check tx cache ctx.
A more exact test case looks like this:
func (suite *KeeperTestSuite) Test_CacheContext() {
	ctx := suite.ctx                       // app.cms       || app.cms       || checkState
	cacheCtx1, write := ctx.CacheContext() // finalizeBlock || finalizeBlock || checkTx(ante)
	cacheCtx2, _ := ctx.CacheContext()     // checkTx(ante) || simulation    || simulation

	pubKey1 := ed25519.GenPrivKey().PubKey()
	pubKey2 := ed25519.GenPrivKey().PubKey()
	addr1 := sdk.AccAddress(pubKey1.Address())
	addr2 := sdk.AccAddress(pubKey2.Address())

	// t: create account with cacheCtx1
	acc1 := suite.accountKeeper.NewAccountWithAddress(cacheCtx1, addr1)

	// t+1: create account with cacheCtx2
	acc2 := suite.accountKeeper.NewAccountWithAddress(cacheCtx2, addr2)

	// t+2: set account with cacheCtx1 and commit the write
	suite.accountKeeper.SetAccount(cacheCtx1, acc1)
	write()

	// t+3: setting the account with cacheCtx2 fails
	suite.Require().Panics(func() {
		suite.accountKeeper.SetAccount(cacheCtx2, acc2)
	})
}
NOTE: we already fixed the problem with this fix.
The error logs show a uniqueness error from the account number index of the account keeper. We added detailed log output and found the situations below (T{N} means thread-N), involving app.cms and checkState.
Our chain customized the bank module to use our Move coin contract, so the ante handler's fee payment also executes the Move contract and creates the account.
Thank you for the additional information! I also took a brief look into the referenced code examples in your project. What I noticed is that you let your moveVM assign account numbers for the accounts to be created. That can easily go out of sync if you integrate with cosmos bank messages, for example, that create accounts for new addresses. But I don't have the full context. You may want to look at wasmd to see how accounts are managed.
To ensure there is no issue on the SDK side, I have written a new test to cover different scenarios. Feel free to modify the test to make it fail.
@alpe Thanks for your test case.
Actually, that does not cover this case, so I changed it to do so.
Please see the below links.
I have changed the bank MsgSend handler to show a more accurate error case :) Also, for test case 2, I added an ante handler that creates an account at check tx.
BTW, I want a more robust solution than this versioned IAVL; it seems slow, I guess?
Also, it seems Simulate does not set ExecMode properly.
@beer-1 thanks for the updates on the test scenarios. This helps to understand the problem better!
any plan or update on this?
Our temporary solution is adding 1_000_000 for check tx and 2_000_000 for simulate to NextAccountNumber in the ante handler to update the cache store; check tx and simulate will then hit this updated cache instead of looking up the parent store (which can be corrupted).
Also, even if some accounts are added to the parent state, the 1_000_000 and 2_000_000 offsets provide a buffer against collisions.
Is there an existing issue for this?
What happened?
Cosmos receives simulation requests asynchronously (from the app side), which means app.Commit() or app.CheckTx() can happen during a simulation call. The checkState will then be overwritten during the simulation. To avoid this, we have to make checkState a versioned CMS so it acts as a snapshot. We should also introduce an RwMutex between app.CheckTx and app.Simulate to prevent app.CheckTx from updating app.checkState during app.Simulate.
Simulation comes directly from the Cosmos-side gRPC interface, not through the Comet ABCI, so we need a mutex at the Cosmos level: https://github.com/cosmos/cosmos-sdk/blob/6d2f6ff068c81c5783e01319beaa51c7dbb43edd/x/auth/tx/service.go#L76
CheckTx and Commit are serialized by a mutex held by Comet, but Simulate & CheckTx and Simulate & Commit still need a mutex at the Cosmos level.
Cosmos SDK Version
main, v0.50
How to reproduce?
Concurrently send txs that create multiple accounts while also running tx simulations that create multiple accounts (some kind of time delay is needed between authkeeper.NextAccountNumber and SetAccount).
This is a simple test of what happens in the CMS when we overwrite the contents.