cosmos / cosmos-sdk

:chains: A Framework for Building High Value Public Blockchains :sparkles:
https://cosmos.network/
Apache License 2.0
6.18k stars 3.57k forks source link

Flaky store test #19834

Closed julienrbrt closed 4 months ago

julienrbrt commented 5 months ago

https://github.com/cosmos/cosmos-sdk/actions/runs/8377413442/job/22977580333?pr=19804

panic: Log in goroutine after TestStore_Save has completed: 
    Error Trace:    /home/runner/work/cosmos-sdk/cosmos-sdk/store/snapshots/store_test.go:327
                                /nix/store/v2hwzp398da6s8vwgckf59xxx0va00pc-go-1.22.0/share/go/src/runtime/asm_amd64.s:1695
    Error:          Received unexpected error:

                    cosmossdk.io/store/v2/snapshots.(*Store).saveSnapshot
                        /home/runner/work/cosmos-sdk/cosmos-sdk/store/snapshots/store.go:341
                    cosmossdk.io/store/v2/snapshots.(*Store).Save
                        /home/runner/work/cosmos-sdk/cosmos-sdk/store/snapshots/store.go:294
                    cosmossdk.io/store/v2/snapshots_test.TestStore_Save.func1
                        /home/runner/work/cosmos-sdk/cosmos-sdk/store/snapshots/store_test.go:326
                    failed to write snapshot metadata: open /tmp/nix-shell.47Ml54/TestStore_Save3962288928/001/metadata/00000000000000000007-00000001: no such file or directory
    Test:           TestStore_Save

goroutine 78 [running]:
testing.(*common).logDepth(0xc000427040, {0xc00021e800, 0x38f}, 0x3)
    /nix/store/v2hwzp398da6s8vwgckf59xxx0va00pc-go-1.22.0/share/go/src/testing/testing.go:1028 +0x4c5
testing.(*common).log(...)
    /nix/store/v2hwzp398da6s8vwgckf59xxx0va00pc-go-1.22.0/share/go/src/testing/testing.go:1010
testing.(*common).Errorf(0xc000427040, {0xbd0da3?, 0x411605?}, {0xc00005b230?, 0xac3360?, 0x301?})
    /nix/store/v2hwzp398da6s8vwgckf59xxx0va00pc-go-1.22.0/share/go/src/testing/testing.go:1074 +0x54
github.com/stretchr/testify/assert.Fail({0xd7f5c0, 0xc000427040}, {0xc00027e480, 0x22d}, {0x0, 0x0, 0x0})
    /home/runner/go/pkg/mod/github.com/stretchr/testify@v1.9.0/assert/assertions.go:363 +0x370
github.com/stretchr/testify/assert.NoError({0xd7f5c0, 0xc000427040}, {0xd7fac0, 0xc000064f40}, {0x0, 0x0, 0x0})
    /home/runner/go/pkg/mod/github.com/stretchr/testify@v1.9.0/assert/assertions.go:[154](https://github.com/cosmos/cosmos-sdk/actions/runs/8377413442/job/22977580333?pr=19804#step:8:155)5 +0x125
github.com/stretchr/testify/require.NoError({0xd817f8, 0xc000427040}, {0xd7fac0, 0xc000064f40}, {0x0, 0x0, 0x0})
    /home/runner/go/pkg/mod/github.com/stretchr/testify@v1.9.0/require/require.go:1354 +0xb0
cosmossdk.io/store/v2/snapshots_test.TestStore_Save.func1()
    /home/runner/work/cosmos-sdk/cosmos-sdk/store/snapshots/store_test.go:327 +0x85
created by cosmossdk.io/store/v2/snapshots_test.TestStore_Save in goroutine 76
    /home/runner/work/cosmos-sdk/cosmos-sdk/store/snapshots/store_test.go:324 +0x1279
FAIL    cosmossdk.io/store/v2/snapshots 0.128s
EmilGeorgiev commented 5 months ago

Can I take this issue?

EmilGeorgiev commented 4 months ago

I found that panic is not the only problem this test has; the test can also fail for other reasons. All the issues arise from improperly managing the workflow execution of the goroutines:

  1. Panic Issue: The panic "Log in goroutine after has completed" occurs when the test finishes but a non-terminated goroutine tries to write something to the test log. This happens because calling the method _, err = store.Save(7, 2, makeChunks(nil)) blocks as it reads from the channel ch in a loop, and this read operation blocks until the channel is closed. The channel is closed at the end of the test, and in some rare cases, the test finishes while the goroutine is still active.

  2. Failure to Handle Error: The test can also fail with the following error:

--- FAIL: TestStore_Save (0.01s)
    store_test.go:335:
        Error Trace: /Users/georgievem/go/src/github.com/EmilGeorgiev/cosmos-sdk/store/snapshots/store_test.go:335
        Error:      Expected an error but got nil.
        Test:       TestStore_Save

This failure occurs because at line 335, we expect an error due to the goroutine having already called the method store.Save with height 7. Calling it a second time with the same height should return an error. However, the current mutex's Lock and Unlock methods do not guarantee that the method inside the goroutine will be called first. In some rare cases, it is possible for the method in the goroutine to be called second; then, the method will return nil instead of an error, and the assertion will fail.

These issues can be resolved by properly managing the execution of goroutines. You can see how this is fixed in the PR. This is the result after executing the test 100 000 times.

$ go test -run=TestStore_Save -count=100000
PASS
ok      cosmossdk.io/store/v2/snapshots 940.395s