rootulp closed this issue 3 months ago.
I got this to reproduce locally so here's more info: bad_block_EJVzDj.json
My hypothesis was that this test starts with app version 1 and then the bad block gets produced with app version 2 because some other test upgraded the testnode version. However, I re-ran the test locally with debug logs and this time it passed, and all blocks appear to use app version 2.
height 238 app version 2
height 239 app version 2
height 240 app version 2
height 241 app version 2
height 242 app version 2
It looks like the test node app version is on 0 but the blocks are getting produced with app version 2.
app version 0 block header 0
app version 0 block header 2
app version 0 block header 0
app version 0 block header 2
Notes while revisiting this:
- TestMaxBlockSize isn't accurate because this test verifies the max square size is hit, not the max block size. Word choice matters here because block size != square size, so the max block size != max square size.
- TestMaxSquareSize performs the ExtendBlock test. IMO these should be two separate tests to make it easier to identify what broke when one test fails.

app version in prepare proposal 0
app version 0 block header 0
app version in process proposal 0
app version 0 block header 2
I'm trying to figure out why blocks get produced with app version 2 when it looks like app version is 0 in prepare and process proposal.
The app version of the block is actually set by comet, which is probably on version 2, whereas prepare proposal and process proposal have v0 because the context is not set correctly.
Sorry I put this on pause because a few weeks ago we thought this would be resolved by some app version fixes:
TBH I thought there was one other version-related issue / PR but I can't find it (cc: @cmwaters). Regardless, since this is still occurring on main, it's likely that none of the version fixes actually resolved this issue.
To repro more quickly, one can run:
cd app/test && go test -run TestIntegrationTestSuite/TestMaxBlockSize -count 5 -v
The test was skipped in https://github.com/celestiaorg/celestia-app/pull/3142, so resolving this issue should make it unflaky so that it can be re-enabled.
I wasn't able to get visibility because I was trying to log the version in prepare/process proposal using:
fmt.Printf("app version in prepare proposal %v\n", sdkCtx.ConsensusParams().Version.AppVersion)
which was silently panicking, so I only observed test timeouts. It is possible to print the version via:
fmt.Printf("app version in prepare proposal %v\n", sdkCtx.ConsensusParams().Version.GetAppVersion())
Interestingly, I can't repro with the single blob or multi blob test cases; I can only repro with randomTxGen. The app version in prepare / process proposal was 2 before and after:
app version in prepare proposal 2
app version in process proposal 2
=== NAME TestIntegrationTestSuite/TestMaxBlockSize
integration_test.go:277:
Error Trace: /Users/rootulp/git/rootulp/celestiaorg/celestia-app/app/test/integration_test.go:277
/Users/rootulp/git/rootulp/celestiaorg/celestia-app/app/test/integration_test.go:174
/Users/rootulp/go/pkg/mod/github.com/stretchr/testify@v1.9.0/suite/suite.go:115
Error: Not equal:
expected: []byte{0xd7, 0x72, 0x7a, 0xd0, 0x5f, 0xef, 0xa3, 0x9f, 0x52, 0x7f, 0xe3, 0x4f, 0x82, 0xcf, 0xa2, 0x87, 0xbc, 0xbb, 0x8a, 0xc0, 0xcf, 0x52, 0x72, 0x3e, 0xc4, 0x21, 0x3d, 0x43, 0x8c, 0x97, 0xf9, 0x62}
actual : []byte{0x35, 0x48, 0x18, 0x18, 0x2b, 0xba, 0x90, 0xb5, 0xd8, 0x6b, 0xde, 0xad, 0x9c, 0x36, 0x29, 0x75, 0x60, 0xa1, 0x6, 0x6d, 0x3, 0xe4, 0x1, 0x2e, 0x4c, 0x4b, 0x92, 0x1f, 0x81, 0x19, 0x67, 0x18}
Diff:
--- Expected
+++ Actual
@@ -1,4 +1,4 @@
([]uint8) (len=32) {
- 00000000 d7 72 7a d0 5f ef a3 9f 52 7f e3 4f 82 cf a2 87 |.rz._...R..O....|
- 00000010 bc bb 8a c0 cf 52 72 3e c4 21 3d 43 8c 97 f9 62 |.....Rr>.!=C...b|
+ 00000000 35 48 18 18 2b ba 90 b5 d8 6b de ad 9c 36 29 75 |5H..+....k...6)u|
+ 00000010 60 a1 06 6d 03 e4 01 2e 4c 4b 92 1f 81 19 67 18 |`..m....LK....g.|
}
Test: TestIntegrationTestSuite/TestMaxBlockSize
app version in prepare proposal 2
app version in process proposal 2
There is a discrepancy in the square sizes used in process_proposal.go and extend_block.go. The latter looks incorrect because it always uses the square size upper bound rather than the max effective square size (i.e. the minimum of: gov max square size, upper bound versioned const).
At least locally, I was able to resolve flakiness via the diff in https://github.com/celestiaorg/celestia-app/pull/3150
Fortunately it looks like celestia-node doesn't use the ExtendBlock function exported from celestia-app (see this search). But it does have its own buggy implementation (see here), which I think should be using the max effective square size.
So ExtendBlock in this repo should be updated to account for the max effective square size. This means it needs access to the app (i.e. it could be moved from a stand-alone exported function to a method on app).
Separately, celestia-node needs a way to figure out the max effective square size from app and then use that to extend the block.
A few concrete action items:
- Refactor ExtendBlock to be a method on app. It likely also needs an sdkCtx, so the function signature will change.
- Consider unexporting ExtendBlock because it isn't used by celestia-node.
- Rename GovSquareSizeUpperBound to MaxEffectiveSquareSize.
- Add tests for ExtendBlock that are stand-alone from TestMaxBlockSize.
This might stem all the way back to https://github.com/celestiaorg/celestia-app/pull/1772, where we might have covered up an underlying issue with square by accidentally using the same stateful param in ProcessProposal instead of fixing the non-determinism issue we're seeing now.
Copying from a Slack convo, we may actually want to use the upper bound in process proposal and extend block so that celestia-node isn't stateful (i.e. doesn't need access to gov max square size) when it tries to extend squares.
> Separately, celestia-node needs a way to figure out the max effective square size from app and then use that to extend the block.
It can query the gov max square size and perform the same method of taking the min of the two. The problem is that if this were to change, celestia-node wouldn't know unless it was constantly querying the square size each block. I remember talking about this problem several months ago and just concluding that it wasn't a problem if light nodes just use the upper bound and we rely on the validator set to ensure that it is within bounds.
As an aside, I thought the whole point of ExtendBlock was for celestia-node to use it.
> As an aside, I thought the whole point of ExtendBlock was for celestia-node to use it
+1 but celestia-node doesn't use it currently.
> I remember talking about this problem several months ago and just concluding that it wasn't a problem if light nodes just use the upper bound and we rely on the validator set to ensure that it is within bounds.
I vaguely recall the same outcome. I think we agreed to try and avoid celestia-node having to query the gov max square size param each block. If celestia-node doesn't query the gov max square size, then this should hold: square.Construct(_, 128, _) == square.Construct(_, 64, _) for the same set of transactions. But based on the flakiness of this test case, I don't think that's the case.
On celestia-app v1.x, ProcessProposal and ExtendBlock both have a worstCaseShareIndexes that always uses 128.
On celestia-app main, we use go-square's implementation of worstCaseShareIndexes which uses a max square size plumbed in via square.Construct(). See here:
func worstCaseShareIndexes(blobs, maxSquareSize int) []uint32 {
	shareIndexes := make([]uint32, blobs)
	// ... (snippet truncated)
}
ProcessProposal calls square.Construct with 64 and ExtendBlock calls square.Construct with 128.
Therefore, the bug is not present on celestia-app v1.x or any celestia-node release. The bug is present on go-square v1.0.0 and celestia-app main.
Blocked on a release of go-square with https://github.com/celestiaorg/go-square/pull/49. The team has different opinions on whether it should be released as 1.0.1 or 2.0.0. We're meeting tomorrow to decide.
Problem
Occurrences