cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com
Other
30.07k stars 3.8k forks source link

cdc: Unskip TestChangefeedNodeShutdown #32232

Closed mrtracy closed 1 year ago

mrtracy commented 5 years ago

TestChangefeedNodeShutdown is a new test which is partially complete, but has been checked in Skip()ed due to unfortunate flakiness issues, of which there are several:

This unit test was intended as a faster-running replacement for the cdc/crdbChaos roachtest, so this functionality is not going untested. We are logging this follow-up issue and skipping the test so that we get this needed (and verified) functionality checked in.

Jira issue: CRDB-4756

mrtracy commented 5 years ago

I've been heads-down on this for a few days, here's what I've been working towards:

A third angle of verification would be to verify the replication history; that is, make sure replica counts make sense, replicas are spread evenly, and that there isn't excess replication. This might be possible through time series, though, if we are unconcerned with specifics. However, I am leaving this out of my V1 product.

tbg commented 5 years ago

@danhhz can you clarify if this test provides anything in addition to the cdc-chaos roachtests? This sounds like we're mostly shaving yaks.

danhhz commented 5 years ago

My intention was to replace the cdc/crdb-chaos roachtest. Roachtests are necessary for testing large and long-running things, but when something can be a TestCluster-based unit test, I always prefer that. Unit tests are certainly easier while developing something, but in this case it's mostly because Roachtests flake at a dramatically higher rate, and when that ongoing load can be avoided, I'd like to.

danhhz commented 5 years ago

Also just remembered that I'd like to solve some of these anyway. I'm about to start working on a jepsen style test for changefeeds. The overall correctness of changefeeds is dependent on them maintaining our user-guaranteed invariants across anything we throw at it: crdb chaos, sink chaos, schema changes, job pause/resume, etc. There are a huge number of moving pieces that all have to work together exactly right and this is currently under-tested. Being able to do this in a go test with TestCluster would be huge.

tbg commented 5 years ago

I agree that it's worthwhile but my main concern is that getting these tests to run fast enough to run on every CI is not worth the hacks. They seem to live at a stage between roachtests and unit tests, where we want something that's really easy to run but we'd rather not run it in CI (but, say, in a nightly stress test).

I think if we skipped the test if !testutils.NightlyStress() we'd get the desired behavior (or you can run it at any cadence you like, just have to set up the jobs the right way). This could also be the pattern used for this jepsen-style suite you're working on. Thoughts?

https://github.com/cockroachdb/cockroach/blob/2558dccbcda09c0eae28f816a05adb8620155175/pkg/testutils/stress.go#L19-L25

danhhz commented 5 years ago

I'm certainly amenable to the argument, but I don't see how it applies to this case? What about this test makes you worried that it will take too long to run on every CI?

tbg commented 5 years ago

Looking at this test now. Right now it fails pretty much on every run. Will look a bit as to why that is.

tbg commented 5 years ago

Doesn't fail for any CDC related reasons. Basically after shutting down the first node, things fall apart. I can reduce this to

func TestChangefeedNodeShutdown(t *testing.T) {
    defer leaktest.AfterTest(t)()

    args := base.TestServerArgs{
        UseDatabase: "d",
    }

    tc := serverutils.StartTestCluster(t, 3, base.TestClusterArgs{
        ServerArgs: args,
    })
    defer tc.Stopper().Stop(context.Background())

    db := tc.ServerConn(1)
    sqlDB := sqlutils.MakeSQLRunner(db)

    sqlDB.Exec(t, `CREATE DATABASE d`)
    sqlDB.Exec(t, `CREATE TABLE foo (a INT PRIMARY KEY, b STRING)`)
    sqlDB.Exec(t, `INSERT INTO foo VALUES (0, 'initial')`)

    time.Sleep(10 * time.Second)
    tc.StopServer(0)

    sqlDB.Exec(t, `UPSERT INTO foo VALUES(0, 'updated')`)
    sqlDB.Exec(t, `INSERT INTO foo VALUES (3, 'third')`)
}
--- FAIL: TestChangefeedNodeShutdown (24.95s)
    changefeed_test.go:1741: error executing 'UPSERT INTO foo VALUES(0, 'updated')': pq: result is ambiguous (error=failed to connect to n1 at 127.0.0.1:59716: initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:59716: connect: connection refused" [exhausted])
FAIL

Plenty of failed heartbeats before. I'm not quite sure why, the ranges should be upreplicated at this point. We'll see.

miretskiy commented 1 year ago

@shermanCRL why GA-blocker on an issue that's 5 years old?