celestiaorg / celestia-app

PoS application for the consensus portion of the Celestia network. Built using celestia-core (fork of CometBFT) and the cosmos-sdk
https://celestia.org
Apache License 2.0
327 stars 266 forks source link

TestMaxBlockSize flake #3086

Closed rootulp closed 3 months ago

rootulp commented 5 months ago

Problem

--- FAIL: TestIntegrationTestSuite (21.86s)
    --- FAIL: TestIntegrationTestSuite/TestMaxBlockSize (13.12s)
        integration_test.go:[27](https://github.com/celestiaorg/celestia-app/actions/runs/7817375365/job/21325106901?pr=3085#step:4:28)1: 
                Error Trace:    /home/runner/work/celestia-app/celestia-app/app/test/integration_test.go:271
                                            /home/runner/work/celestia-app/celestia-app/app/test/integration_test.go:174
                                            /home/runner/go/pkg/mod/github.com/stretchr/testify@v1.8.4/suite/suite.go:112
                Error:          Not equal: 
                                expected: []byte{0xe8, 0xf8, 0x5b, 0xba, 0xe5, 0x85, 0xc6, 0x19, 0xf3, 0xbd, 0x98, 0x4, 0x5e, 0xe5, 0x4d, 0x63, 0xc1, 0xd, 0x77, 0x8d, 0x22, 0x8a, 0x33, 0xea, 0xc6, 0x60, 0x[28](https://github.com/celestiaorg/celestia-app/actions/runs/7817375365/job/21325106901?pr=3085#step:4:29), 0x78, 0x82, 0x28, 0xbf, 0xbd}
                                actual  : []byte{0xd5, 0x5c, 0xaf, 0xbb, 0x44, 0x65, 0x4e, 0xc1, 0xb5, 0x7e, 0xf9, 0x43, 0x83, 0x97, 0x9e, 0x7c, 0x[31](https://github.com/celestiaorg/celestia-app/actions/runs/7817375365/job/21325106901?pr=3085#step:4:32), 0xc, 0xd, 0xcc, 0x92, 0x75, 0x7c, 0xde, 0xfd, 0xdd, 0x9b, 0xa5, 0x53, 0x6b, 0xc4, 0x5}

                                Diff:
                                --- Expected
                                +++ Actual
                                @@ -1,4 +1,4 @@
                                 ([]uint8) (len=[32](https://github.com/celestiaorg/celestia-app/actions/runs/7817375365/job/21325106901?pr=3085#step:4:33)) {
                                - 00000000  e8 f8 5b ba e5 85 c6 19  f3 bd 98 04 5e e5 4d 63  |..[.........^.Mc|
                                - 00000010  c1 0d 77 8d 22 8a [33](https://github.com/celestiaorg/celestia-app/actions/runs/7817375365/job/21325106901?pr=3085#step:4:34) ea  c6 60 28 78 82 28 bf bd  |..w.".3..`(x.(..|
                                + 00000000  d5 5c af bb 44 65 4e c1  b5 7e f9 [43](https://github.com/celestiaorg/celestia-app/actions/runs/7817375365/job/21325106901?pr=3085#step:4:44) 83 97 9e 7c  |.\..DeN..~.C...||
                                + 00000010  31 0c 0d cc 92 75 7c de  fd dd 9b a5 53 6b c4 05  |1....u|.....Sk..|
                                 }
                Test:           TestIntegrationTestSuite/TestMaxBlockSize
    network.go:48: tearing down testnode
FAIL

Occurences

rootulp commented 5 months ago

I got this to reproduce locally so here's more info: bad_block_EJVzDj.json

My hypothesis was that this test starts with app version 1 and then the bad block gets produced with app version 2 because some other test upgraded the testnode version but I re-ran the test locally with debug logs and this time it passed and all blocks appear to use app version 2.

height 238 app version 2
height 239 app version 2
height 240 app version 2
height 241 app version 2
height 242 app version 2
rootulp commented 5 months ago

It looks like the test node app version is on 0 but the blocks are getting produced with app version 2.

app version 0 block header 0
app version 0 block header 2
app version 0 block header 0
app version 0 block header 2
rootulp commented 5 months ago

Notes while revisiting this:

  1. The test name TestMaxBlockSize isn't accurate because this test verifies the max square size is hit, not the max block size. Word choice matters here because block size != square size so the max block size != max square size.
  2. I don't understand why the TestMaxSquareSize performs ExtendBlockTest. IMO these should be two separate tests to make it easier to identify what broke when one test fails.
  3. I'm trying to figure out why blocks get produced with app version 2 when it looks like app version is 0 in prepare and process proposal.
app version in prepare proposal 0
app version 0 block header 0
app version in process proposal 0
app version 0 block header 2
cmwaters commented 4 months ago

I'm trying to figure out why blocks get produced with app version 2 when it looks like app version is 0 in prepare and process proposal.

The app version of the block is actually set by comet which is probably on version 2 whereas prepare proposal and process proposal have v0 because the context its not correct set

rootulp commented 4 months ago

Sorry I put this on pause because a few weeks ago we thought this would be resolved by some app version fixes:

  1. https://github.com/celestiaorg/celestia-core/pull/1202
  2. https://github.com/celestiaorg/celestia-core/pull/1227

TBH I thought there was one other issue / PR version related but can't find it (cc: @cmwaters ). Regardless, since this is still occurring on main, it's likely that none of the version fixes actually resolved this issue.

rootulp commented 4 months ago

To repro more quickly, one can run:

cd app/test && go test -run TestIntegrationTestSuite/TestMaxBlockSize -count 5 -v
rootulp commented 4 months ago

The test was skipped in https://github.com/celestiaorg/celestia-app/pull/3142 so this issue should make it unflaky and then re-enable it

rootulp commented 4 months ago

I wasn't able to get visibility because I was trying to log the version in prepare/process proposal using:

    fmt.Printf("app version in prepare proposal %v\n", sdkCtx.ConsensusParams().Version.AppVersion)

which was silently panicing so I only observed test timeouts. It is possible to print the version via:

    fmt.Printf("app version in prepare proposal %v\n", sdkCtx.ConsensusParams().Version.GetAppVersion())
rootulp commented 4 months ago

Interestingly can't repro with single blob or multi blob test cases. Can only repro with randomTxGen. The app version in prepare / process proposal was 2 before and after:

app version in prepare proposal 2
app version in process proposal 2
=== NAME  TestIntegrationTestSuite/TestMaxBlockSize
    integration_test.go:277:
            Error Trace:    /Users/rootulp/git/rootulp/celestiaorg/celestia-app/app/test/integration_test.go:277
                                        /Users/rootulp/git/rootulp/celestiaorg/celestia-app/app/test/integration_test.go:174
                                        /Users/rootulp/go/pkg/mod/github.com/stretchr/testify@v1.9.0/suite/suite.go:115
            Error:          Not equal:
                            expected: []byte{0xd7, 0x72, 0x7a, 0xd0, 0x5f, 0xef, 0xa3, 0x9f, 0x52, 0x7f, 0xe3, 0x4f, 0x82, 0xcf, 0xa2, 0x87, 0xbc, 0xbb, 0x8a, 0xc0, 0xcf, 0x52, 0x72, 0x3e, 0xc4, 0x21, 0x3d, 0x43, 0x8c, 0x97, 0xf9, 0x62}
                            actual  : []byte{0x35, 0x48, 0x18, 0x18, 0x2b, 0xba, 0x90, 0xb5, 0xd8, 0x6b, 0xde, 0xad, 0x9c, 0x36, 0x29, 0x75, 0x60, 0xa1, 0x6, 0x6d, 0x3, 0xe4, 0x1, 0x2e, 0x4c, 0x4b, 0x92, 0x1f, 0x81, 0x19, 0x67, 0x18}

                            Diff:
                            --- Expected
                            +++ Actual
                            @@ -1,4 +1,4 @@
                             ([]uint8) (len=32) {
                            - 00000000  d7 72 7a d0 5f ef a3 9f  52 7f e3 4f 82 cf a2 87  |.rz._...R..O....|
                            - 00000010  bc bb 8a c0 cf 52 72 3e  c4 21 3d 43 8c 97 f9 62  |.....Rr>.!=C...b|
                            + 00000000  35 48 18 18 2b ba 90 b5  d8 6b de ad 9c 36 29 75  |5H..+....k...6)u|
                            + 00000010  60 a1 06 6d 03 e4 01 2e  4c 4b 92 1f 81 19 67 18  |`..m....LK....g.|
                             }
            Test:           TestIntegrationTestSuite/TestMaxBlockSize
app version in prepare proposal 2
app version in process proposal 2
rootulp commented 4 months ago

There is a discrepancy in the square sizes used in process_proposal.go and extend_block.go. The latter looks incorrect because it always uses the square size upper bound rather than the max effective square size (i.e. the minimum of: gov max square size, upper bound versioned const).

https://github.com/celestiaorg/celestia-app/blob/f3fb834dd3e0b6e8c79eb077d2fbe97257eb8bcb/app/process_proposal.go#L122

https://github.com/celestiaorg/celestia-app/blob/f3fb834dd3e0b6e8c79eb077d2fbe97257eb8bcb/app/extend_block.go#L18

rootulp commented 4 months ago

At least locally, I was able to resolve flakiness via the diff in https://github.com/celestiaorg/celestia-app/pull/3150

rootulp commented 4 months ago

Fortunately it looks like celestia-node doesn't use the ExtendBlock function exported from celestia-app (see this search). But it does have it's own buggy implementation (see here) because I think that should be using the max effective square size.

So ExtendBlock in this repo should be updated to account for the max effective square size. This means it needs access to the app (i.e. it could be moved from a stand-alone exported function to a method on app).

Separately, celestia-node needs a way to figure out the max effective square size from app and then use that to extend the block.

rootulp commented 4 months ago

A few concrete action items:

  1. [breaking] Modify ExtendBlock to be a method on app. It likely also needs an sdkCtx so the function signature will change.
  2. [breaking] Consider un-exporting ExtendBlock because it isn't used by celestia-node.
  3. [breaking] Rename GovSquareSizeUpperBound to MaxEffectiveSquareSize.
  4. Add unit tests to ExtendBlock that are stand-alone from the TestMaxBlockSize.
evan-forbes commented 4 months ago

this might stem all the way back to https://github.com/celestiaorg/celestia-app/pull/1772, where we might have covered up an underlying issue with square by accidently using the same stateful param in ProcessProposal instead of fixing the non-determinism issue we're seeing now

rootulp commented 4 months ago

Copying from a Slack convo, we may actually want to use the upper bound in process proposal and extend block so that celestia-node isn't stateful (i.e. doesn't need access to gov max square size) when it tries to extend squares.

cmwaters commented 4 months ago

Separately, celestia-node needs a way to figure out the max effective square size from app and then use that to extend the block.

It can query the gov max square size and perform the same method of taking the min of the two. The problem is that if this were to change celestia-node wouldn't know unless it was constantly querying the square size each block. I remember talking about this problem several months ago and just concluding that it wasn't a problem if light nodes just use the upper bound and we rely on the validator set to ensure that it is within bounds.

As an aside, I thought the whole point of ExtendBlock was for celestia-node to use it

rootulp commented 4 months ago

As an aside, I thought the whole point of ExtendBlock was for celestia-node to use it

+1 but celestia-node doesn't use it currently.

I remember talking about this problem several months ago and just concluding that it wasn't a problem if light nodes just use the upper bound and we rely on the validator set to ensure that it is within bounds.

I vaguely recall the same outcome. I think we agreed to try and avoid celestia-node having to query the gov max square size param each block. If celestia-node doesn't query the gov max square size then this should hold: square.Construct(_, 128, _) == square.Construct(_, 64, _) for the same set of transactions but based on the flakiness of this test case, I don't think that's the case.

rootulp commented 4 months ago

On celestia-app v1.x, ProcessProposal and ExtendBlock both have a worstCaseShareIndexes that always uses 128.

https://github.com/celestiaorg/celestia-app/blob/b25766f98fb0ac31fce43c280e1f25c68d8d9d1b/pkg/square/builder.go#L424-L425

On celestia-app main, we use go-square's implementation of worstCaseShareIndexes which uses a max square size plumbed in via square.Construct(). See here:

func worstCaseShareIndexes(blobs, maxSquareSize int) []uint32 {
    shareIndexes := make([]uint32, blobs)

ProcessProposal calls square.Construct with 64 and ExtendBlock calls square.Construct with 128.

Therefore, the bug is not present on celestia-app v1.x or any celestia-node release. The bug is present on go-square v1.0.0 and celestia-app main.

rootulp commented 4 months ago

Created https://github.com/celestiaorg/go-square/issues/47

rootulp commented 3 months ago

Blocked on a release of go-square with https://github.com/celestiaorg/go-square/pull/49. The team has different opinions on whether it should be released as 1.0.1 or 2.0.0. We're meeting tomorrow to decide.