cdc: Unskip TestChangefeedNodeShutdown

mrtracy commented 5 years ago

TestChangefeedNodeShutdown is a new test which is partially complete, but has been checked in Skip()ed due to unfortunate flakiness issues, of which there are several:

Primarily, while the job is usually picked up by another node, occasionally the test takes several minutes to complete, and may actually never complete - this it the primary flake causing me to pause this work.
SQL Statements after the node shutdown sometimes return with an ambiguous result. This could possibly be fixed with a succeedsSoon, but might further exacerbate the indeterministic nature of the test.
The output of the changefeed is not deterministic with the shutdown, there may be duplicates, meaning that it is not appropriate to use the assertPayloads() infrastructure and will instead require a verifier setup.
Finally, it is my intuition that these three issues may be combining to mask other existing flakes.

This unit test was intended as a faster-running replacement for the cdc/crdbChaos roachtest, so this functionality is not going untested. We are logging this follow-up issue and skipping the test so that we get this needed (and verified) functionality checked in.

Jira issue: CRDB-4756

mrtracy commented 5 years ago

I've been heads-down on this for a few days, here's what I've been working towards:

The first thing that is verified (and needed for additional verification) is the "availability history" of the cluster, which is synthesized from cluster events and the current node liveness values. This gives a point-in-time picture of how many nodes were available at any timestamp in the cluster. The history itself can be verified; for instance, it may be required that there is never more than one node down at a time. However, this history can also be used to adjust expectations elsewhere in the verification.
The primary verification is to query different time series values across the whole history of the cluster.
- We break down time into fixed periods, and verify that each period meets expectations; we adjust those expectations according to the values in the availability history; for example, how many nodes there are, how many nodes are down, whether we are near the beginning of the test, etc.
- The time series being verified can be configured; or rather, we will modularize the verification so that we only select modules that are appropriate. For example, if we are measuring an expected txn/s target, we might adjust that based on the workload that is being run against the cluster.
- Initial time series being verified will be per-machine latencies and manually configurable txn/s based on workload.

A third angle of verification would be to verify the replication history; that is, make sure replica counts make sense, replicas are spread evenly, and that there isn't excess replication. This might be possible through time series, though, if we are unconcerned with specifics. However, I am leaving this out of my V1 product.

tbg commented 5 years ago

@danhhz can you clarify if this test provides anything in addition to the cdc-chaos roachtests? This sounds like we're mostly shaving yaks.

danhhz commented 5 years ago

My intention was to replace the cdc/crdb-chaos roachtest. Roachtests are necessary for testing large and long-running things, but when something can be a TestCluster-based unit test, I always prefer that. Unit tests are certainly easier while developing something, but in this case it's mostly because Roachtests flake at a dramatically higher rate, and when that ongoing load can be avoided, I'd like to.

danhhz commented 5 years ago

Also just remembered that I'd like to solve some of these anyway. I'm about to start working on a jepsen style test for changefeeds. The overall correctness of changefeeds is dependent on them maintaining our user-guaranteed invariants across anything we throw at it: crdb chaos, sink chaos, schema changes, job pause/resume, etc. There are a huge number of moving pieces that all have to work together exactly right and this is currently under-tested. Being able to do this in a go test with TestCluster would be huge.

tbg commented 5 years ago

I agree that it's worthwhile but my main concern is that getting these tests to run fast enough to run on every CI is not worth the hacks. They seem to live at a stage between roachtests and unit tests, where we want something that's really easy to run but we'd rather not run it in CI (but, say, in a nightly stress test).

I think if we skipped the test if !testutils.NightlyStress() we'd get the desired behavior (or you can run it at any cadence you like, just have to set up the jobs the right way). This could also be the pattern used for this jepsen-style suite you're working on. Thoughts?

https://github.com/cockroachdb/cockroach/blob/2558dccbcda09c0eae28f816a05adb8620155175/pkg/testutils/stress.go#L19-L25

danhhz commented 5 years ago

I'm certainly amenable to the argument, but I don't see how it applies to this case? What about this test makes you worried that it will take too long to run on every CI?

tbg commented 5 years ago

Looking at this test now. Right now it fails pretty much on every run. Will look a bit as to why that is.

tbg commented 5 years ago

Doesn't fail for any CDC related reasons. Basically after shutting down the first node, things fall apart. I can reduce this to

func TestChangefeedNodeShutdown(t *testing.T) {
    defer leaktest.AfterTest(t)()

    args := base.TestServerArgs{
        UseDatabase: "d",
    }

    tc := serverutils.StartTestCluster(t, 3, base.TestClusterArgs{
        ServerArgs: args,
    })
    defer tc.Stopper().Stop(context.Background())

    db := tc.ServerConn(1)
    sqlDB := sqlutils.MakeSQLRunner(db)

    sqlDB.Exec(t, `CREATE DATABASE d`)
    sqlDB.Exec(t, `CREATE TABLE foo (a INT PRIMARY KEY, b STRING)`)
    sqlDB.Exec(t, `INSERT INTO foo VALUES (0, 'initial')`)

    time.Sleep(10 * time.Second)
    tc.StopServer(0)

    sqlDB.Exec(t, `UPSERT INTO foo VALUES(0, 'updated')`)
    sqlDB.Exec(t, `INSERT INTO foo VALUES (3, 'third')`)
}

--- FAIL: TestChangefeedNodeShutdown (24.95s)
    changefeed_test.go:1741: error executing 'UPSERT INTO foo VALUES(0, 'updated')': pq: result is ambiguous (error=failed to connect to n1 at 127.0.0.1:59716: initial connection heartbeat failed: rpc error: code = Unavailable desc = all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial tcp 127.0.0.1:59716: connect: connection refused" [exhausted])
FAIL

Plenty of failed heartbeats before. I'm not quite sure why, the ranges should be upreplicated at this point. We'll see.

miretskiy commented 1 year ago

@shermanCRL why GA-blocker on an issue that's 5 years old?

cockroachdb / cockroach

cdc: Unskip TestChangefeedNodeShutdown #32232