cockroachdb / cockroach


roachtest: cdc/filtering/session fails due to duplicated events #117590

Closed srosenberg closed 10 months ago

srosenberg commented 10 months ago

This came up during an ad hoc run in AWS using graviton3 instances:

07:06:20 test_runner.go:824: [w4] test failed: cdc/filtering/session (run 1)
07:06:20 test_runner.go:840: [w4] destroying cluster srosenberg-1704783858-01-n3cpu4 [tag:] (3 nodes) because: cdc/filtering/session (1) - (assertions.go:333).Fail:
        Error Trace:    github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cdc_filtering.go:278
                                                github.com/cockroachdb/cockroach/pkg/cmd/roachtest/tests/cdc_filtering.go:192
                                                main/pkg/cmd/roachtest/test_runner.go:1107
                                                src/runtime/asm_amd64.s:1650
        Error:          Not equal:
                        expected: []string{"A@1", "B@1", "C@1", "B@2 (before: B@1)", "C@2 (before: C@1)", "C@3 (before: C@2)", "A@4 (before: A@3)", "D@1"}
                        actual  : []string{"A@1", "B@1", "C@1", "B@2 (before: B@1)", "B@2 (before: B@1)", "C@2 (before: C@1)", "C@2 (before: C@1)", "C@3 (before: C@2)", "C@3 (before: C@2)", "A@4 (before: A@3)", "A@4 (before: A@3)", "D@1", "D@1"}

We can see that several events were emitted twice, whereas the test assertion expects each event exactly once. Logs: cdc_filtering_logs.tar.gz

Jira issue: CRDB-35256

Epic CRDB-13169

blathers-crl[bot] commented 10 months ago

cc @cockroachdb/cdc

srosenberg commented 10 months ago

Looking at n1, we can see that a range was split immediately after a changefeed was created:

I240109 07:05:53.804169 5274 kv/kvserver/replica_command.go:440 ⋮ [T1,Vsystem,n1,split,s1,r67/1:‹/{Table/65-Max}›] 205  initiating a split of this range at key /Table/104 [r68] (span config)

Note that replication of system ranges is happening concurrently with the changefeed. It might be possible to rule out duplicates by waiting for replication to finish before creating the changefeed (see WaitFor3XReplication). Given the current implementation of this test, we would then expect no other range splits. Otherwise, if duplicates cannot be provably excluded, the assertion should be weakened.

CC @andyyang890 @nicktrav

nicktrav commented 10 months ago

@andyyang890 - wdyt about just updating the testing logic in here to eliminate dupes? It's expected that we will be encountering them on a changefeed. This should be a pretty easy fix.
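
For illustration only, here is a minimal sketch of what eliminating dupes before the comparison could look like. The helper name `dedupeEvents` is hypothetical and this is not the actual fix; it assumes the events are collected as a `[]string` in the same format as the failure output above, and that identical strings always denote the same event.

```go
package main

// dedupeEvents is a hypothetical helper (not part of roachtest). It returns
// events with repeated entries removed, keeping the first occurrence of each
// entry and preserving the original order.
func dedupeEvents(events []string) []string {
	seen := make(map[string]struct{}, len(events))
	unique := make([]string, 0, len(events))
	for _, e := range events {
		if _, ok := seen[e]; ok {
			// Already emitted once; a changefeed may redeliver it.
			continue
		}
		seen[e] = struct{}{}
		unique = append(unique, e)
	}
	return unique
}
```

Applied to the `actual` slice from the failure above, this yields the eight entries in `expected`, so the existing equality assertion could be run against the deduplicated slice instead.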

andyyang890 commented 10 months ago

Sure, I'll do that!