cockroachdb / cockroach

CockroachDB — the cloud native, distributed SQL database designed for high availability, effortless scale, and control over data placement.
https://www.cockroachlabs.com

kvserver: consider additional reproposal chaos testing #110551

Open erikgrinaker opened 11 months ago

erikgrinaker commented 11 months ago

After the recent reproposal revamp in https://github.com/cockroachdb/cockroach/pull/97779, we added TestKVNemesisSingleNode_ReproposalChaos to do reproposal chaos testing (e.g. inject various errors, drop proposals, etc). Is this sufficient test coverage of reproposals, or do we need more?

Jira issue: CRDB-31499

Epic CRDB-39898

blathers-crl[bot] commented 11 months ago

Hi @erikgrinaker, please add branch-* labels to identify which branch(es) this release-blocker affects.

:owl: Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

blathers-crl[bot] commented 11 months ago

cc @cockroachdb/replication

pav-kv commented 11 months ago

I was stressing TestKVNemesisSingleNode_ReproposalChaos: it normally runs for a few seconds, but times out at 5 minutes once in a few thousand runs. Setting the rand seed to the one it printed reproduces the failure reliably (master @ c704759):

$ dev test --verbose pkg/kv/kvnemesis --filter=TestKVNemesisSingleNode_ReproposalChaos -- --test_env=COCKROACH_RANDOM_SEED=-1416151305199981714
...
I230927 10:58:05.045395 1 (gostd) rand.go:250  [-] 1  random seed: -1416151305199981714
=== RUN   TestKVNemesisSingleNode_ReproposalChaos
    test_log_scope.go:167: test logs captured to: ...
    test_log_scope.go:81: use -show-logs to present logs inline
    kvnemesis_test.go:283: seed: 4107035115243961836
    kvnemesis_test.go:197: kvnemesis logging to ...
panic: test timed out after 4m55s
running tests:
        TestKVNemesisSingleNode_ReproposalChaos (4m55s)
...

Looking into what causes the timeout.

Upd: bisect points at #110651. It looks like the addition of shared locks to kvnemesis sometimes makes the test get stuck. To make sure, I successfully stressed the preceding commit, 9856622, 10k times (without pinning the rand seed this time). @arulajmani Could you take a look?

I can see that exactly two steps are blocked on Apply: one is a ClosureTxnOperation, and the other is a "primitive" op like put/get. Both have been blocked in lockTableWaiterImpl.WaitOn for 4 minutes, so I assume there is a deadlock caused by the way they conflict.

Here is the content of the blocked steps (run in this order):

// blocked for 4 minutes
start: x.DelRange(ctx, tk(340099245535959704), tk(16963214484241583826), true /* @s10 */)

// in the meantime, this command ends
// note that the key is in the DelRange above
end: x.Put(ctx, tk(6360205682711281622), sv(9)) // on 1 Put at /Table/100/"5843fd5ee06d3fd6": injected error

// blocked for 4 minutes
// note that the Del key is in the DelRange too
// as well as the key in GetForUpdateSkipLocked
start: x.Txn(ctx, func(ctx context.Context, txn *kv.Txn) error {
  txn.SetIsoLevel(isolation.ReadCommitted)
  b := &kv.Batch{}
  b.GetForUpdateSkipLocked(tk(15321639564963858289))
  b.Del(tk(13939358664956423074) /* @s11 */)
  txn.CommitInBatch(ctx, b)
  return nil
})
arulajmani commented 11 months ago

Thanks for investigating, Pavel. I'll pick this up. Interestingly, it looks like this isn't because of shared locks -- it's (maybe) because of the GetForUpdateSkipLocked. The reason it bisects to https://github.com/cockroachdb/cockroach/pull/110651 is that I realized we were missing that operation and decided to include it in that PR.

pav-kv commented 11 months ago

@arulajmani I forked this off as #111352 to keep it separate from this issue.

pav-kv commented 10 months ago

> Is this sufficient test coverage of reproposals, or do we need more?

Here is what a typical TestKVNemesisSingleNode_ReproposalChaos run reports for the [re]proposal metrics:

kvnemesis_test.go:312: proposed: 947
kvnemesis_test.go:314: reproposed unchanged: 53
kvnemesis_test.go:316: reproposed with new LAI: 6

The TestKVNemesisSingleNode and TestKVNemesisMultiNode tests can repropose too (with a lower probability). Typical runs:

TestKVNemesisSingleNode

```
kvnemesis_test.go:312: proposed: 682
kvnemesis_test.go:314: reproposed unchanged: 60
kvnemesis_test.go:316: reproposed with new LAI: 1
```

TestKVNemesisMultiNode

```
kvnemesis_test.go:312: proposed: 2370
kvnemesis_test.go:314: reproposed unchanged: 54
kvnemesis_test.go:316: reproposed with new LAI: 0
```

I think we should bump the probability of LAI reproposals in TestKVNemesisMultiNode.

erikgrinaker commented 10 months ago

Are these multiplicative, i.e. given a reproposal, is there a 10% chance of it changing the LAI? If so, bumping it to e.g. 20% or 30% seems fine.

pav-kv commented 10 months ago

The probability logic isn't super intuitive here. Both kinds of reproposals share the same attempts counter (1, 2), which skews the probabilities: the actual probability is at most half of what's configured. This needs fixing.

But apart from that, the two injections are independent (and can both happen at the same time). The LAI injection overrides the LAI while the proposal is being flushed, while the non-LAI injection errors out below Raft.

The second kind of injection, though, also gets translated to an invalid-LAI error. So all of this logic seems to exercise only the new-LAI kind of reproposals. I'm not sure yet where the "unchanged" reproposals in this test are coming from.

pav-kv commented 10 months ago

> I'm not sure yet where the "unchanged" reproposals are coming from in this test.

All of these come from reasonNewLeader here: https://github.com/cockroachdb/cockroach/blob/e8367041383e1123b2f71cfd4b54439ab4ac56b8/pkg/kv/kvserver/replica_raft.go#L1441

This happens regardless of the configured probabilities; it's probably just inherent to this test.

pav-kv commented 10 months ago

Something in these tests keeps the reproposal rate from matching the configured probability. I ran TestKVNemesisMultiNode with 1000 steps and (invalidLeaseAppliedIndexProb = 0.2, injectReproposalErrorProb = 0.5), and got barely a few reproposals:

[n1] proposed: 4957
[n1] reproposed with new LAI: 3
[n2] proposed: 263
[n2] reproposed with new LAI: 1
[n3] proposed: 1167
[n3] reproposed with new LAI: 11

This could be because the kvnemesis test is configured by default to run many operation types with uniform probabilities, so the number of affected proposals is small.

OTOH, I counted the affected proposals that go through these hooks with a modified LAI, and there are a couple of hundred. It's unclear why they are "lost" and don't show up in the reproposal metrics.