cosmos / cosmos-sdk

:chains: A Framework for Building High Value Public Blockchains :sparkles:
https://cosmos.network/

[Bug]: simulation is async from the commit and checkTx, so it should use cms with version #20685

Open beer-1 opened 5 months ago

beer-1 commented 5 months ago

Is there an existing issue for this?

What happened?

The Cosmos SDK receives simulation requests asynchronously (from the app's point of view), so app.Commit() or app.CheckTx() can happen in the middle of a simulation call. When that happens, checkState is overwritten while the simulation is still running.

To avoid this, we have to turn checkState into a versioned CMS so that simulation runs against a snapshot. We should also introduce an RWMutex between app.CheckTx and app.Simulate to prevent app.CheckTx from updating app.checkState while app.Simulate is running.

Simulation comes in directly through the Cosmos-side gRPC interface, not through the CometBFT ABCI, so we need a mutex at the Cosmos SDK level. https://github.com/cosmos/cosmos-sdk/blob/6d2f6ff068c81c5783e01319beaa51c7dbb43edd/x/auth/tx/service.go#L76

The CheckTx/Commit mutex is held by CometBFT, but Simulate & CheckTx and Simulate & Commit still need a mutex at the Cosmos SDK level.
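
A minimal sketch of what the mutex half of this could look like at the app level. The wrapper functions and the simMtx variable below are illustrative assumptions, not existing SDK APIs; the versioned-snapshot half of the proposal would additionally need changes inside baseapp and is only noted in a comment.

package app

import (
    "sync"

    abci "github.com/cometbft/cometbft/abci/types"
    "github.com/cosmos/cosmos-sdk/baseapp"
    sdk "github.com/cosmos/cosmos-sdk/types"
)

// simMtx serializes state-mutating calls against simulations:
// CheckTx and Commit take the write lock because they touch checkState / app.cms,
// Simulate takes the read lock so it never observes those writes mid-run.
var simMtx sync.RWMutex

func GuardedCheckTx(app *baseapp.BaseApp, req *abci.RequestCheckTx) (*abci.ResponseCheckTx, error) {
    simMtx.Lock()
    defer simMtx.Unlock()
    return app.CheckTx(req)
}

func GuardedCommit(app *baseapp.BaseApp) (*abci.ResponseCommit, error) {
    simMtx.Lock()
    defer simMtx.Unlock()
    return app.Commit()
}

// GuardedSimulate would be called from the x/auth tx service's Simulate handler.
// The second half of the proposal -- running the simulation on a snapshot taken
// via MultiStore.CacheMultiStoreWithVersion instead of on checkState -- requires
// changes inside baseapp itself and is not shown here.
func GuardedSimulate(app *baseapp.BaseApp, txBytes []byte) (sdk.GasInfo, *sdk.Result, error) {
    simMtx.RLock()
    defer simMtx.RUnlock()
    return app.Simulate(txBytes)
}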

Cosmos SDK Version

main, v0.50

How to reproduce?

Concurrently send txs that create multiple accounts while also running tx simulations that create multiple accounts. (Some kind of time delay is needed between authkeeper.NextAccountNumber and SetAccount.)

This is a simple test showing what happens in the CMS when we overwrite its contents.

func (suite *KeeperTestSuite) Test_CacheContext() {
    ctx := suite.ctx
    cacheCtx, _ := ctx.CacheContext()

    pubKey1 := ed25519.GenPrivKey().PubKey()
    pubKey2 := ed25519.GenPrivKey().PubKey()
    addr1 := sdk.AccAddress(pubKey1.Address())
    addr2 := sdk.AccAddress(pubKey2.Address())

    // create account to ctx
    acc1 := suite.accountKeeper.NewAccountWithAddress(ctx, sdk.AccAddress(addr1))
    acc1.SetAccountNumber(100)

    suite.accountKeeper.SetAccount(cacheCtx, acc1)

    // create account to cacheCtx
    acc2 := suite.accountKeeper.NewAccountWithAddress(cacheCtx, sdk.AccAddress(addr2))
    acc2.SetAccountNumber(100)

    suite.Require().Panics(func() {
        suite.accountKeeper.SetAccount(cacheCtx, acc2)
    })
}
kjessec commented 5 months ago

hmm... am I reading this correctly that SimulateTx never had explicit version control for its underlying branch store?

islishude commented 5 months ago

I don't think you have the correct fix.

https://github.com/cosmos/cosmos-sdk/blob/62212dfc04b3790f3cc052cdc2fb757b32ca97d0/server/start.go#L382

the proxy.NewLocalClientCreator already has a mutex

beer-1 commented 5 months ago

> I don't think you have the correct fix.
>
> https://github.com/cosmos/cosmos-sdk/blob/62212dfc04b3790f3cc052cdc2fb757b32ca97d0/server/start.go#L382
>
> the proxy.NewLocalClientCreator already has a mutex

Simulation comes directly through the Cosmos-side gRPC interface, not through the CometBFT ABCI, so we need a mutex at the Cosmos SDK level. https://github.com/cosmos/cosmos-sdk/blob/6d2f6ff068c81c5783e01319beaa51c7dbb43edd/x/auth/tx/service.go#L76

The CheckTx/Commit mutex is held by CometBFT, but Simulate & CheckTx and Simulate & Commit still need a mutex at the Cosmos SDK level.

alpe commented 5 months ago

Hi @beer-1, thanks for reporting. I am trying to understand the problem better. Can you share your use case, please? Do you have an error output or example messages?

When I look at the code in baseapp, I can see that message execution is done in a cached store for each simulation or check operation. The state is not committed. Therefore, I assume you are running into some collision within the ante handlers. I have modified your example to reflect what I found in baseapp:

func (suite *KeeperTestSuite) Test_CacheContext() {
    ctx := suite.ctx
    // ante commits to ctx store for check TX but never for sims
    checkCtx, _ := ctx.CacheContext()
    simCtx, _ := ctx.CacheContext()

    pubKey1 := ed25519.GenPrivKey().PubKey()
    pubKey2 := ed25519.GenPrivKey().PubKey()
    addr1 := sdk.AccAddress(pubKey1.Address())
    addr2 := sdk.AccAddress(pubKey2.Address())

    // create account with cache store
    acc1 := suite.accountKeeper.NewAccountWithAddress(checkCtx, sdk.AccAddress(addr1))
    acc1.SetAccountNumber(100)
    suite.accountKeeper.SetAccount(checkCtx, acc1)

    // create account with other cached store
    acc2 := suite.accountKeeper.NewAccountWithAddress(simCtx, sdk.AccAddress(addr2))
    acc2.SetAccountNumber(100)
    suite.accountKeeper.SetAccount(simCtx, acc2)
}
beer-1 commented 5 months ago

Let me give you more context.

Let's look at how all the contexts branch:

- Root: app.cms
- checkState: app.cms.CacheMultiStore() [ref]
- checkTxContext(ante): app.cms.CacheMultiStore() [ref1, ref2]
- checkTxContext(runMsgs): app.cms.CacheMultiStore().CacheContext()
- simulateContext: app.cms.CacheMultiStore().CacheContext()

So when a block commit happens, checkState will be affected, and consequently all states branched from it will be affected, as the example below shows.

func (suite *KeeperTestSuite) Test_CacheContext() {
    ctx := suite.ctx
    cacheCtx, _ := ctx.CacheContext()

    pubKey1 := ed25519.GenPrivKey().PubKey()
    pubKey2 := ed25519.GenPrivKey().PubKey()
    addr1 := sdk.AccAddress(pubKey1.Address())
    addr2 := sdk.AccAddress(pubKey2.Address())

    // create account to ctx
    acc1 := suite.accountKeeper.NewAccountWithAddress(ctx, sdk.AccAddress(addr1))
    acc1.SetAccountNumber(100)

    suite.accountKeeper.SetAccount(cacheCtx, acc1)

    // create account to cacheCtx
    acc2 := suite.accountKeeper.NewAccountWithAddress(cacheCtx, sdk.AccAddress(addr2))
    acc2.SetAccountNumber(100)

    suite.Require().Panics(func() {
        suite.accountKeeper.SetAccount(cacheCtx, acc2)
    })
}

The problem here is that simulation is async from commit and check tx, because simulation is called directly from the Cosmos gRPC interface, not through the CometBFT interface.

Problem Situation 1

T1 - Simulation started
T2 - block commit
T3 - Simulation finished (state corrupted during the simulation call)

Problem Situation 2

T1 - Simulation started
T2 - CheckTx ante commits the state update
T3 - Simulation finished (state corrupted during the simulation call)

alpe commented 5 months ago

Thanks for the additional context. I still have some trouble following your issue. As I tried to point out, both checkTx and simulations run on separate storage forks via CacheContext. Therefore there are no dirty reads and no state is committed (except for the checkTx ante handlers). This is where your code example is not correct: you use a single cache context for both operations. Only deliverTx persists the message execution.

Nevertheless, I would like to step back from the implementation details and talk about your use case and scenarios. What txs are you submitting? What was the error output?

beer-1 commented 5 months ago

Maybe the context was not fully shared.

What I want to say is that "both checkTx and simulations run on separate storage forks" is not true.

The branching, simplified:

`app.cms` => `checkState`  => checkTx -------------- +
     ^          ^          => simulation             |
     |          |                                    |
     |          +---------------  ante(update) ----- +
     |    => `finalizeState` ----+
     |                           |
     +----  block commit   ------+

If state updates are committed to app.cms or checkState, both checkTx and simulation will read dirty state.

Did you see these links [ref1, ref2, ref3]?

checkTx's ante directly writes state updates to checkState.

And block commit directly writes state updates to app.cms [ref].

I'm not sure what the difference is between this situation and the above test code. You can think of `ctx := suite.ctx` as app.cms or checkState, and `cacheCtx, _ := ctx.CacheContext()` as the simulation cache ctx or the check tx cache ctx.

A more exact test case looks like this:

func (suite *KeeperTestSuite) Test_CacheContext() {
    ctx := suite.ctx                       // app.cms       || app.cms       || checkState 
    cacheCtx1, write := ctx.CacheContext() // finalizeBlock || finalizeBlock || checkTx(ante)
    cacheCtx2, _ := ctx.CacheContext()     // checkTx(ante) || simulation    || simulation

    pubKey1 := ed25519.GenPrivKey().PubKey()
    pubKey2 := ed25519.GenPrivKey().PubKey()
    addr1 := sdk.AccAddress(pubKey1.Address())
    addr2 := sdk.AccAddress(pubKey2.Address())

    // t: create account with cacheCtx1
    acc1 := suite.accountKeeper.NewAccountWithAddress(cacheCtx1, sdk.AccAddress(addr1))

    // t+1: create account with cacheCtx2 at time
    acc2 := suite.accountKeeper.NewAccountWithAddress(cacheCtx2, sdk.AccAddress(addr2))

    // t+2: set account with cacheCtx1 and commit write
    suite.accountKeeper.SetAccount(cacheCtx1, acc1)
    write()

    // t+3: set account with cacheCtx2 failed
    suite.Require().Panics(func() {
        suite.accountKeeper.SetAccount(cacheCtx2, acc2)
    })
}

Here is our case

NOTE: we already fixed the problem with this fix.

  1. Load the next account number before executing the Move VM [ref].
  2. Execute the Move VM [ref].
  3. Create the new accounts that were produced during the execution [ref] (see the sketch after this list).
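
A rough sketch of that ordering follows. The types moveVM, moveExecResult, and the Keeper wiring are hypothetical stand-ins for the Initia code behind the links, not actual SDK or Initia APIs; only the x/auth keeper calls are real.

import (
    sdk "github.com/cosmos/cosmos-sdk/types"
    authkeeper "github.com/cosmos/cosmos-sdk/x/auth/keeper"
)

// Hypothetical shapes standing in for the referenced Initia code.
type moveExecResult struct {
    NewAccounts []sdk.AccountI // accounts created during VM execution
}

type moveVM interface {
    Execute(ctx sdk.Context, msg []byte, nextAccNum uint64) (moveExecResult, error)
}

type Keeper struct {
    authKeeper authkeeper.AccountKeeper
    vm         moveVM
}

func (k Keeper) executeMoveMsg(ctx sdk.Context, msg []byte) error {
    // 1. reserve the next account number up front, before the VM runs
    nextAccNum := k.authKeeper.NextAccountNumber(ctx)

    // 2. execute the Move VM, letting it assign numbers starting from nextAccNum
    res, err := k.vm.Execute(ctx, msg, nextAccNum)
    if err != nil {
        return err
    }

    // 3. persist the accounts created during execution, after the VM has finished
    for _, acc := range res.NewAccounts {
        k.authKeeper.SetAccount(ctx, acc)
    }
    return nil
}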

The error logs showed a uniqueness error from the account number index of the account keeper. We added detailed logging and found the situations below.

T{N}: means thread-N.

Case 1. (block commit state update during simulation)

Case 2. (check tx state update during simulation)

Case 3. (block commit state update during async recheck tx)

Our chain customizes the bank module to use our Move coin contract, so the ante handler's fee payment also executes a Move contract and can create accounts.

alpe commented 5 months ago

Thank you for the additional information! I also took a brief look at the referenced code examples in your project. What I noticed is that you let your moveVM assign account numbers for accounts to be created. That can easily get out of sync if you integrate, for example, with Cosmos bank messages that create accounts for new addresses. But I don't have the full context. You may want to look at how wasmd manages accounts.

To ensure there is no issue on the SDK side, I have written a new test to cover different scenarios. Feel free to modify the test to make it fail.

beer-1 commented 5 months ago

@alpe Thanks for your test case.

Actually, that test does not cover this case, so I changed it so that it does.

Please see the below links.

Test Case 1

Test Case 2

I have changed the bank MsgSend handler to show a more accurate error case :) Also, for test case 2, I added an ante handler that creates an account during check tx.

beer-1 commented 5 months ago

Btw, I would like a more robust solution than this versioned IAVL approach; it seems slow, I guess?

Also, it seems simulate does not set ExecMode properly.

alpe commented 5 months ago

@beer-1 thanks for the updates on the test scenarios. This helps to understand the problem better!

beer-1 commented 5 months ago

any plan or update on this?

beer-1 commented 4 months ago

Our temporary solution is to add 1_000_000 (for check tx) or 2_000_000 (for simulate) to NextAccountNumber in the ante handler, updating the cache store; check tx and simulate then hit this updated cache instead of looking up the parent store (which can be corrupted).

Also, even if some accounts are added to the parent state, the 1_000_000 and 2_000_000 offsets leave a buffer against collisions.

https://github.com/initia-labs/initia/pull/220
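
For illustration, a rough sketch of such an ante decorator, assuming SDK v0.50 APIs (the exported AccountKeeper.AccountNumber sequence, ctx.ExecMode()); the decorator name, the wiring, and the exec-mode checks are assumptions based on the description above, and the real change lives in the linked PR.

import (
    sdk "github.com/cosmos/cosmos-sdk/types"
    authkeeper "github.com/cosmos/cosmos-sdk/x/auth/keeper"
)

// accountNumberOffsetDecorator shifts the cached account-number sequence so that
// numbers handed out during check tx / simulation cannot collide with numbers
// committed to consensus state; the write only lands in the check/simulate
// cache store, never in the committed state.
type accountNumberOffsetDecorator struct {
    ak authkeeper.AccountKeeper
}

func (d accountNumberOffsetDecorator) AnteHandle(
    ctx sdk.Context, tx sdk.Tx, simulate bool, next sdk.AnteHandler,
) (sdk.Context, error) {
    var offset uint64
    switch {
    case simulate:
        offset = 2_000_000
    case ctx.ExecMode() == sdk.ExecModeCheck || ctx.ExecMode() == sdk.ExecModeReCheck:
        offset = 1_000_000
    }
    if offset > 0 {
        // read the current counter and write it back shifted into this
        // execution mode's cache store
        cur, err := d.ak.AccountNumber.Peek(ctx)
        if err != nil {
            return ctx, err
        }
        if err := d.ak.AccountNumber.Set(ctx, cur+offset); err != nil {
            return ctx, err
        }
    }
    return next(ctx, tx, simulate)
}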