NibiruChain / nibiru

Nibiru Chain: The breakthrough smart contract platform ushering in the next era of money. Nibiru powers an ecosystem of dApps including perps, RWAs, and more.
https://nibiru.fi
Apache License 2.0
179 stars 200 forks source link

bug(sequence_test): Data race condition when committing BaseApp.setState in gosdk test #1903

Open Unique-Divine opened 1 month ago

Unique-Divine commented 1 month ago

Summary

A data race was getting past some CI tests and creating inconsistency in successful runs. It occured in the following test that broadcasts many transactions in rapid succession, expecting each to succeed and be queriable by tx hash.

Reading the error logs, I found that:

  1. The write operation panicking with a data race was the setState method of the BaseApp from the Cosmos-SDK.
  2. The call stack shows that the write happens during the commit phase of the blockchain consensus process.
  3. A read operation occurs in the CreateQueryContext method of the BaseApp.
  4. The call stack indicates the data race read is happening during the handling of a gRPC query request related to the "x/auth" module.
[Expand to see test] ```go package gosdk_test import ( "encoding/json" "github.com/MakeNowJust/heredoc/v2" cmtcoretypes "github.com/cometbft/cometbft/rpc/core/types" "github.com/NibiruChain/nibiru/gosdk" ) // TestSequenceExpectations validates the behavior of account sequence numbers // and transaction finalization in a blockchain network. It ensures that sequence // numbers increment correctly with each transaction and that transactions can be // queried successfully after the blocks are completed. func (s *TestSuite) TestSequenceExpectations() { t := s.T() t.Log("Get sequence and block") // Go to next block _, err := s.network.WaitForNextBlockVerbose() s.Require().NoError(err) accAddr := s.val.Address getLatestAccNums := func() gosdk.AccountNumbers { accNums, err := s.nibiruSdk.GetAccountNumbers(accAddr.String()) s.NoError(err) return accNums } seq := getLatestAccNums().Sequence t.Logf("starting sequence %v should not change from waiting a block", seq) s.NoError(s.network.WaitForNextBlock()) newSeq := getLatestAccNums().Sequence s.EqualValues(seq, newSeq) t.Log("broadcast msg n times, expect sequence += n") numTxs := uint64(5) seqs := []uint64{} txResults := make(map[string]*cmtcoretypes.ResultTx) for broadcastCount := uint64(0); broadcastCount < numTxs; broadcastCount++ { s.NoError(s.network.WaitForNextBlock()) // Ensure block increment from, _, _, msgSend := s.msgSendVars() txResp, err := s.nibiruSdk.BroadcastMsgsGrpcWithSeq( from, seq+broadcastCount, msgSend, ) s.NoError(err) txHashHex := s.AssertTxResponseSuccess(txResp) s.T().Log(heredoc.Docf( `Query for tx %v should fail b/c it's the same block and finalization cannot have possibly occurred yet.`, broadcastCount)) txResult, err := s.nibiruSdk.TxByHash(txHashHex) jsonBz, _ := json.MarshalIndent(txResp, "", " ") s.Assert().Errorf(err, "txResp: %s", jsonBz) txResults[txHashHex] = txResult seqs = append(seqs, getLatestAccNums().Sequence) } s.T().Log("expect sequence += n") newNewSeq := getLatestAccNums().Sequence txResultsJson, _ := json.MarshalIndent(txResults, "", " ") s.EqualValuesf(int(seq+numTxs-1), int(newNewSeq), "seqs: %v\ntxResults: %s", seqs, txResultsJson) s.T().Log("After the blocks are completed, tx queries by hash should work.") for times := 0; times < 2; times++ { s.NoError(s.network.WaitForNextBlock()) } s.T().Log("Query each tx by hash (successfully)") for txHashHex := range txResults { _, err := s.nibiruSdk.TxByHash(txHashHex) s.NoError(err) } } ```

Ref: Failed GitHub Action on 2024-05-31

Unique-Divine commented 1 month ago

Date 2024-05-31: Some findings on e7e708d79353c85f4ad78862de242af2f928f8ce

Looking through the commit history, the problem seems to surface after the version changes for

Based purely on this, I tried an experiment where I run

just clean-cache && go mod tidy
just test-coverage-integration 2>&1
# and pipe the output to Neovim to read/debug

Test Results

Note that these results were inconclusive, as the race condition issue doesn't seem to be affected by the version increment of the Cosmos-SDK or CometBFT. I re-ran the below process 3 times with fresh cache state and found inconsistent results.

Unique-Divine commented 1 month ago

This is still happening pretty frequently

image

Example: https://github.com/NibiruChain/nibiru/actions/runs/9425664163/job/25967637912