
kvserver: NLHE returned from tryReproposeWithNewLeaseError when ts got closed #49513

Open cockroach-teamcity opened 4 years ago

cockroach-teamcity commented 4 years ago

(roachtest).kv50/rangelookups/relocate/nodes=8 failed on master@809829bfe7ff27a610fa78f409fae658b9f2d9d9:

The test failed on branch=master, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/kv50/rangelookups/relocate/nodes=8/run_1
    cluster.go:2456,kv.go:656,kv.go:685,test_runner.go:753: monitor failure: monitor task failed: pq: unable to select removal target from [(n1,s1):?]; current replicas [(n2,s2):1 (n7,s7):2 (n1,s1):3 (n8,s8):4]: could not select an appropriate replica to be removed
        (1) attached stack trace
          | main.(*monitor).WaitE
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2444
          | main.(*monitor).Wait
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2452
          | main.registerKVRangeLookups.func1
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/kv.go:656
          | main.registerKVRangeLookups.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/kv.go:685
          | [...repeated from below...]
        Wraps: (2) monitor failure
        Wraps: (3) attached stack trace
          | main.(*monitor).wait.func2
          |     /home/agent/work/.go/src/github.com/cockroachdb/cockroach/pkg/cmd/roachtest/cluster.go:2500
          | runtime.goexit
          |     /usr/local/go/src/runtime/asm_amd64.s:1357
        Wraps: (4) monitor task failed
        Wraps: (5) pq: unable to select removal target from [(n1,s1):?]; current replicas [(n2,s2):1 (n7,s7):2 (n1,s1):3 (n8,s8):4]: could not select an appropriate replica to be removed
        Error types: (1) *withstack.withStack (2) *errutil.withMessage (3) *withstack.withStack (4) *errutil.withMessage (5) *pq.Error

Artifacts: [/kv50/rangelookups/relocate/nodes=8](https://teamcity.cockroachdb.com/viewLog.html?buildId=1964552&tab=artifacts#/kv50/rangelookups/relocate/nodes=8) [See this test on roachdash](https://roachdash.crdb.dev/?filter=status%3Aopen+t%3A.%2Akv50%2Frangelookups%2Frelocate%2Fnodes%3D8.%2A&sort=title&restgroup=false&display=lastcommented+project) powered by [pkg/cmd/internal/issues](https://github.com/cockroachdb/cockroach/tree/master/pkg/cmd/internal/issues)

Jira issue: CRDB-4220

knz commented 4 years ago

The kv workload client failed with

pq: unable to select removal target from [(n1,s1):?]; current replicas [(n2,s2):1 (n7,s7):2 (n1,s1):3 (n8,s8):4]: could not select an appropriate replica to be removed

At the time of the failure, several ranges were stuck unable to repropose, for example:

W200524 15:58:06.375362 202 kv/kvserver/replica_application_result.go:146  [n1,s1,r629/2:/Table/53/1/41{55585…-74013…}] failed to repropose with new lease index: [NotLeaseHolderError] r629: replica (n1,s1):2 not lease holder; current lease is repl=(n1,s1):2 seq=12 start=1590335828.141040543,23 epo=1 pro=1590335828.141040543,24

(direct link: https://teamcity.cockroachdb.com/repository/download/Cockroach_Nightlies_WorkloadNightly/1964552:id/kv50/rangelookups/relocate/nodes%3D8/run_1/1.logs/cockroach.teamcity-1964552-1590305390-61-n9cpu8-0001.ubuntu.2020-05-24T15_56_25Z.004143.log )

I see two separate concerns here:

1) What's the proximate situation: why is this range unable to repropose, and why does that cause a client query to fail?

2) The KV-level error message is bubbling up to the client unmodified. It should be masked behind a SQL error code, and the details should perhaps be abstracted a bit (a rough sketch of what such masking could look like is below).
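
A minimal standalone sketch, purely illustrative and not CockroachDB's actual pgerror API, of what masking a KV-level error behind a stable SQLSTATE could look like; the error type, code choice, and message here are hypothetical:

```go
package main

import (
	"errors"
	"fmt"
)

// sqlError is a hypothetical error type carrying a SQLSTATE code.
type sqlError struct {
	code  string // five-character SQLSTATE, e.g. "XX000" (internal_error)
	msg   string
	cause error
}

func (e *sqlError) Error() string { return fmt.Sprintf("%s: %s", e.code, e.msg) }
func (e *sqlError) Unwrap() error { return e.cause }

// maskKVError hides replica/store details from the client-visible message
// while keeping the original error available for logs and debugging.
func maskKVError(err error) error {
	return &sqlError{
		code:  "XX000",
		msg:   "internal error while relocating range replicas",
		cause: err,
	}
}

func main() {
	kvErr := errors.New("unable to select removal target from [(n1,s1):?]")
	err := maskKVError(kvErr)
	fmt.Println(err)                // what the client would see
	fmt.Println(errors.Unwrap(err)) // detail retained for server-side logs
}
```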

cc @andreimatei for dealing with the KV error (filed separate issue #49868)

I'd like help from both @tbg and @andreimatei to explain what the "failed to repropose" error means and when/why it's likely to happen in this test.

andreimatei commented 4 years ago

what's the proximate situation - why is this range unable to repropose, and why does this cause a client query to fail

I spent some time staring at this. I don't know exactly what's going on. The "failed to repropose" is coming from here: https://github.com/cockroachdb/cockroach/blob/47a3b687cbd016f1e6d828337549e1b88c9bedc0/pkg/kv/kvserver/replica_application_result.go#L146 so a command was rejected with proposalIllegalLeaseIndex, which means it was reordered with another command. I don't remember exactly which situations allow commands to be reordered around the Raft transport, but I believe them to be rare (involving a remote Raft leader and reconnects between the leaseholder and that leader?).

Then, the reason why the reproposal fails is weird: the lease didn't change since the original evaluation of the request (if it had, this code would have fired and we wouldn't have tried to repropose). The reproposal failure comes from here: https://github.com/cockroachdb/cockroach/blob/ecb371b4654d0f3158be56b4d6efefe0f442ce78/pkg/kv/kvserver/replica_application_state_machine.go#L251 It's telling us that the request needs to be re-evaluated at a new timestamp because its timestamp was already closed. I'm kinda surprised that this can happen - doesn't the closing of timestamps synchronize with in-flight commands in any way? Maybe @tbg can shed some light.
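
For illustration, here's a very simplified, self-contained sketch of the decision flow described above. All names and types are hypothetical stand-ins, not the real kvserver code: a command rejected with an illegal lease applied index is normally a candidate for reproposal, but if the range's closed timestamp has already passed the command's write timestamp, the reproposal is abandoned and an NLHE comes back instead.

```go
package main

import "fmt"

// timestamp is a stand-in for hlc.Timestamp.
type timestamp int64

type proposal struct {
	writeTS           timestamp
	illegalLeaseIndex bool // rejection reason reported from below Raft
}

type replica struct {
	closedTS timestamp
}

var errNotLeaseHolder = fmt.Errorf("NotLeaseHolderError: cannot repropose, timestamp already closed")

// maybeRepropose mirrors, at a very high level, the decision described above:
// only illegal-lease-index rejections are candidates for reproposal, and the
// closed-timestamp check can still veto the reproposal.
func (r *replica) maybeRepropose(p *proposal) error {
	if !p.illegalLeaseIndex {
		return fmt.Errorf("permanent rejection, no reproposal")
	}
	if p.writeTS <= r.closedTS {
		// A re-evaluation at a higher timestamp would be needed; with the
		// current code this surfaces as a NotLeaseHolderError instead.
		return errNotLeaseHolder
	}
	// Otherwise the command is reproposed under a new lease applied index.
	return nil
}

func main() {
	r := &replica{closedTS: 100}
	fmt.Println(r.maybeRepropose(&proposal{writeTS: 90, illegalLeaseIndex: true}))  // NLHE
	fmt.Println(r.maybeRepropose(&proposal{writeTS: 110, illegalLeaseIndex: true})) // <nil>: reproposed
}
```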

Anyway, that NLHE should have had a better message - particularly given that it's a pretty funky error to return in this case to begin with. And I'm not sure whether I should expect the request to be re-evaluated at a new timestamp on the server side. It could be that this error never made it to the client, but I'm not sure.

andreimatei commented 4 years ago

For the relocation error that caused the failure: if it turns out to be expected, it can be added to this list to make the test happy: https://github.com/cockroachdb/cockroach/blob/f80ec9c292d43676d739fffabc847fb8b3195396/pkg/cmd/roachtest/bank.go#L330
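
Assuming that list behaves like a plain substring allowlist (the names below are hypothetical illustrations, not the actual bank.go code), tolerating the error would look roughly like this:

```go
package main

import (
	"fmt"
	"strings"
)

// expectedRelocateErrors would hold substrings of errors the test tolerates.
var expectedRelocateErrors = []string{
	"descriptor changed",
	"unable to select removal target from",
	"could not select an appropriate replica to be removed",
}

func isExpectedRelocateError(err error) bool {
	for _, s := range expectedRelocateErrors {
		if strings.Contains(err.Error(), s) {
			return true
		}
	}
	return false
}

func main() {
	err := fmt.Errorf("pq: unable to select removal target from [(n1,s1):?]: could not select an appropriate replica to be removed")
	fmt.Println(isExpectedRelocateError(err)) // true: the test would not fail on this
}
```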

andreimatei commented 4 years ago

It's telling us that the request needs to be re-evaluated at a new timestamp because its timestamp was already closed. I'm kinda surprised that this can happen - doesn't the closing of timestamps synchronize with in-flight commands in any way? Maybe @tbg can shed some light.

I've educated myself a little bit, and it seems that the timestamp can get closed once the original proposal is rejected - I think that's the point where we remove it from the minprop.Tracker. So in order to allow a reproposal, I guess we'd have to rework how this tracking of in-flight commands works. But I don't really know what I'm talking about.
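
To make the mechanism concrete, here is a toy, standalone version of a minimum-proposal tracker (hypothetical API, not the real minprop package): the closed timestamp can only advance up to just below the lowest tracked timestamp, so once a rejected proposal has been untracked, nothing stops the closed timestamp from passing its write timestamp, and a reproposal at that timestamp becomes illegal.

```go
package main

import "fmt"

// tracker is a toy minimum-proposal tracker: the closed timestamp may only
// advance up to (just below) the lowest timestamp of tracked proposals.
type tracker struct {
	inFlight map[int]int64 // proposal id -> write timestamp
	closed   int64
}

func newTracker() *tracker { return &tracker{inFlight: map[int]int64{}} }

func (t *tracker) track(id int, ts int64) { t.inFlight[id] = ts }
func (t *tracker) untrack(id int)         { delete(t.inFlight, id) }

// closeTo advances the closed timestamp as far as the tracked proposals allow.
func (t *tracker) closeTo(target int64) {
	for _, ts := range t.inFlight {
		if ts-1 < target {
			target = ts - 1
		}
	}
	if target > t.closed {
		t.closed = target
	}
}

func main() {
	t := newTracker()
	t.track(1, 100) // proposal evaluating with write timestamp 100
	t.closeTo(200)
	fmt.Println(t.closed) // 99: held back by the in-flight proposal

	t.untrack(1) // proposal rejected below Raft and no longer tracked
	t.closeTo(200)
	fmt.Println(t.closed) // 200: a reproposal at timestamp 100 is now illegal
}
```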

And I'm not sure whether I should expect the request to be re-evaluated at a new timestamp on the server side. It could be that this error never made it to the client, but I'm not sure.

I think the answer here is that there won't be any server-side re-evaluation. We only do the re-evaluations for specific errors and the NLHE is not one of them. So I guess what'll happen is that the NLHE is awkwardly telling the DistSender to retry, but I think it doesn't even tell it that the write timestamp needs to be bumped - it'll take another attempt to figure that out.
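
A rough client-side illustration of that guess, with hypothetical types rather than the real DistSender: the NLHE only causes the same batch to be re-sent as-is, and the write-timestamp bump has to come from a later, separate mechanism (simulated below on the second attempt).

```go
package main

import (
	"errors"
	"fmt"
)

var errNotLeaseHolder = errors.New("NotLeaseHolderError")

type batch struct {
	writeTS int64
}

// send simulates a server whose closed timestamp has advanced: a batch whose
// write timestamp is at or below it keeps coming back with an NLHE.
func send(b batch, closedTS int64) error {
	if b.writeTS <= closedTS {
		return errNotLeaseHolder
	}
	return nil
}

// sendWithRetries mimics a DistSender-like loop: an NLHE triggers a plain
// retry of the same batch; the timestamp bump arrives only later, via a
// separate mechanism that this sketch fakes on the second attempt.
func sendWithRetries(b batch, closedTS int64, maxAttempts int) (int, error) {
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err := send(b, closedTS)
		if err == nil {
			return attempt, nil
		}
		if !errors.Is(err, errNotLeaseHolder) {
			return attempt, err
		}
		if attempt == 2 {
			b.writeTS = closedTS + 1 // stand-in for the eventual timestamp bump
		}
	}
	return maxAttempts, errors.New("gave up")
}

func main() {
	attempts, err := sendWithRetries(batch{writeTS: 90}, 100, 5)
	fmt.Println(attempts, err) // 3 <nil>: extra round-trips before success
}
```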

tbg commented 4 years ago

The proposal is only in the tracker while it's evaluating, so when it comes out of raft and needs a reproposal the original timestamp may already be closed out.

I'm improving the error's message in this case and will also massage the allowlist.

tbg commented 4 years ago

https://github.com/cockroachdb/cockroach/pull/50272 gives the NLHE a custom message indicating why the error occurred.

Leaving this open at @andreimatei's request, to investigate returning a different error / re-evaluating locally (this is not trivial since we no longer have a client goroutine for it in the case of async consensus).
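
For reference, the gist of such a message change can be sketched as follows (a hypothetical stand-in type, not the actual roachpb.NotLeaseHolderError or the #50272 diff): the error simply carries an explicit reason so this case can be told apart from a genuine lease mismatch.

```go
package main

import "fmt"

// notLeaseHolderError is a stand-in for roachpb.NotLeaseHolderError: the idea
// is to carry an explicit reason so this case reads differently from an
// ordinary "wrong leaseholder" response.
type notLeaseHolderError struct {
	rangeID   int64
	customMsg string // empty for the ordinary case
}

func (e *notLeaseHolderError) Error() string {
	if e.customMsg != "" {
		return fmt.Sprintf("[NotLeaseHolderError] r%d: %s", e.rangeID, e.customMsg)
	}
	return fmt.Sprintf("[NotLeaseHolderError] r%d: replica not lease holder", e.rangeID)
}

func main() {
	err := &notLeaseHolderError{
		rangeID:   629,
		customMsg: "reproposal failed; the command's timestamp has been closed and re-evaluation is needed",
	}
	fmt.Println(err)
}
```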