cockroachdb / cockroach

roachtest: decommission/randomized failed [waiting for backport] #65877

Closed. cockroach-teamcity closed this issue 3 years ago.

cockroach-teamcity commented 3 years ago

roachtest.decommission/randomized failed with artifacts on release-21.1 @ 4837b15a513fe9f0283b1583fece7fe8d3cd49ae:

The test failed on branch=release-21.1, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/decommission/randomized/run_1
    decommission.go:848,retry.go:197,decommission.go:794,decommission.go:61,test_runner.go:767: unexpected diff(matrix, expMatrix):
        [[][]string[19] != [][]string[20]]
        [[node_decommissioning] [node_recommissioned] [node_decommissioning] [node_decommissioning] [node_decommissioning] [node_decommissioning] [node_decommissioning] [node_decommissioning] [node_recommissioned] [node_recommissioned] [node_recommissioned] [node_recommissioned] [node_recommissioned] [node_recommissioned] [node_decommissioning] [node_decommissioning] [node_decommissioned] [node_decommissioned] [node_decommissioned]]
        vs.
        [[node_decommissioning] [node_recommissioned] [node_decommissioning] [node_decommissioning] [node_decommissioning] [node_decommissioning] [node_decommissioning] [node_decommissioning] [node_recommissioned] [node_recommissioned] [node_recommissioned] [node_recommissioned] [node_recommissioned] [node_recommissioned] [node_decommissioning] [node_decommissioning] [node_decommissioned] [node_decommissioned] [node_decommissioning] [node_decommissioned]]
Reproduce

To reproduce, try:

```bash
# From https://go.crdb.dev/p/roachstress, perhaps edited lightly.
caffeinate ./roachstress.sh decommission/randomized
```

Same failure on other branches

- #65589 roachtest: decommission/randomized failed [missing event] [C-test-failure O-roachtest O-robot branch-master release-blocker]

/cc @cockroachdb/kv

cockroach-teamcity commented 3 years ago

roachtest.decommission/randomized failed with artifacts on release-21.1 @ 587aaedda1a00fe03dd16317ec6da329f69a3ac9:

The test failed on branch=release-21.1, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/decommission/randomized/run_1
    decommission.go:848,retry.go:197,decommission.go:794,decommission.go:61,test_runner.go:767: unexpected diff(matrix, expMatrix):
        [[][]string[19] != [][]string[20]]
        [[node_decommissioning] [node_recommissioned] [node_decommissioning] [node_decommissioning] [node_decommissioning] [node_decommissioning] [node_decommissioning] [node_decommissioning] [node_recommissioned] [node_recommissioned] [node_recommissioned] [node_recommissioned] [node_recommissioned] [node_recommissioned] [node_decommissioning] [node_decommissioning] [node_decommissioned] [node_decommissioned] [node_decommissioned]]
        vs.
        [[node_decommissioning] [node_recommissioned] [node_decommissioning] [node_decommissioning] [node_decommissioning] [node_decommissioning] [node_decommissioning] [node_decommissioning] [node_recommissioned] [node_recommissioned] [node_recommissioned] [node_recommissioned] [node_recommissioned] [node_recommissioned] [node_decommissioning] [node_decommissioning] [node_decommissioned] [node_decommissioned] [node_decommissioning] [node_decommissioned]]
Reproduce

To reproduce, try:

```bash
# From https://go.crdb.dev/p/roachstress, perhaps edited lightly.
caffeinate ./roachstress.sh decommission/randomized
```

Same failure on other branches

- #65589 roachtest: decommission/randomized failed [missing event] [C-test-failure O-roachtest O-robot branch-master release-blocker]

/cc @cockroachdb/kv

aliher1911 commented 3 years ago

The issue comes from the way we assert on command output. We execute commands with CombinedOutput(), which redirects stderr into stdout and merges the two streams. The command result goes to stdout and warnings go to stderr. Commands write warnings prior to executing the action, but in the merged output the warnings sometimes get reordered with the output. Our assert matches all expected lines against the tail of the output, so if a warning slips below the output the test fails. Having err and out in the same buffer is not ideal because it complicates verification, but changing that would touch a lot of code. As a fix I suggest we filter known warnings for now.
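
For illustration, here is a tiny standalone sketch of why a shared buffer can reorder things; it uses the standard library's CombinedOutput, while the test itself may go through its own roachprod wrappers:

```go
package main

import (
	"fmt"
	"os/exec"
)

func main() {
	// Both streams land in one buffer, so the relative order of the warning
	// (stderr) and the result row (stdout) is not guaranteed once the two
	// writes are interleaved.
	cmd := exec.Command("sh", "-c",
		`echo "warning: node 5 is already decommissioning or decommissioned" 1>&2; echo "5,false,0,true,decommissioned,false"`)
	out, err := cmd.CombinedOutput()
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s", out)
}
```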

tbg commented 3 years ago

Interesting. Do you have more details? What does the output look like in the error case? I agree that making the output parsing in this test more robust is a good immediate fix.

aliher1911 commented 3 years ago

Here are two example cases that failed: https://gist.github.com/aliher1911/fc94d7ba65fbd519f8d3dd8c19002943

The warning in question is `^warning: node [0-9]+ is already decommissioning or decommissioned$`. The failure sometimes happens on one of the two asserts where we try to decommission a node once again and expect a no-op.
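
A minimal sketch of the suggested filtering, assuming the combined output is available as a string; stripKnownWarnings is an illustrative name, not one of the actual roachtest helpers:

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// decommissionWarningRE matches the warning that can get interleaved with the
// command's stdout when stderr and stdout share a buffer.
var decommissionWarningRE = regexp.MustCompile(
	`^warning: node [0-9]+ is already decommissioning or decommissioned$`)

// stripKnownWarnings removes known warning lines from combined output so that
// tail-based assertions only see the command's real result rows.
func stripKnownWarnings(combined string) string {
	var kept []string
	for _, line := range strings.Split(combined, "\n") {
		if decommissionWarningRE.MatchString(strings.TrimSpace(line)) {
			continue
		}
		kept = append(kept, line)
	}
	return strings.Join(kept, "\n")
}

func main() {
	out := "id,is_live\nwarning: node 5 is already decommissioning or decommissioned\n5,false"
	fmt.Println(stripKnownWarnings(out))
}
```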

aliher1911 commented 3 years ago

There are other issues with the test: it expects nodes to be live after decommission, but because of the aforementioned flake it waits too long for the warnings to appear in the right place, and the node goes dark because the RPC circuit breaker kills the connection.

aliher1911 commented 3 years ago

Okay, this is another failure in the same test. It was frequently failing with what was described above, and now that I changed that, it started failing with the error in the issue description. So fixing the output ordering only addresses the other issue, not this one. The failure described here is reproducible now, and it looks like the events table is missing one state.

aliher1911 commented 3 years ago

So what it looks like is: if the node has lots of replicas, then after ./cockroach node decommission we see a draining sequence starting from 6,false,28,true,decommissioning,false and going down to 6,false,0,true,decommissioning,false over 20+ seconds, and the test passes. But if the sequence starts from 2,false,3,true,decommissioning,false, it takes about 10 seconds to finish and the test fails without seeing the node_decommissioning status next to last. I'm going to add more logging to see which node is in each state in that query to maybe get more insight.
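
For reference, my reading of the CSV rows above is (node id, is_live, replicas, is_decommissioning, membership, is_draining); that column interpretation is an assumption, and the parser below is only an illustrative sketch, not the test's code:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// statusRow is one CSV row of the decommission status output as quoted above.
type statusRow struct {
	NodeID            int
	IsLive            bool
	Replicas          int
	IsDecommissioning bool
	Membership        string
	IsDraining        bool
}

func parseStatusRow(row string) (statusRow, error) {
	f := strings.Split(row, ",")
	if len(f) != 6 {
		return statusRow{}, fmt.Errorf("expected 6 fields, got %d", len(f))
	}
	id, err := strconv.Atoi(f[0])
	if err != nil {
		return statusRow{}, err
	}
	replicas, err := strconv.Atoi(f[2])
	if err != nil {
		return statusRow{}, err
	}
	return statusRow{
		NodeID:            id,
		IsLive:            f[1] == "true",
		Replicas:          replicas,
		IsDecommissioning: f[3] == "true",
		Membership:        f[4],
		IsDraining:        f[5] == "true",
	}, nil
}

func main() {
	r, _ := parseStatusRow("6,false,28,true,decommissioning,false")
	fmt.Printf("n%d still has %d replicas to drain\n", r.NodeID, r.Replicas)
}
```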

aliher1911 commented 3 years ago

OK, I found the culprit:

E210603 22:03:19.644374 191965 1@server/server.go:2377 ⋮ [-] 1793 unable to record event: ‹common:<timestamp:1622757789936846453 event_type:"node_decommissioning" > node:<requesting_node_id:1 target_node_id:5 >›: ‹log-event›: failed to connect to n2 at ‹10.142.0.54:26257›: ‹initial connection heartbeat failed›: ‹rpc error: code = PermissionDenied desc = n2 was permanently removed from the cluster at 2021-06-03 22:02:25.848597039 +0000 UTC; it is not allowed to rejoin the cluster›

I can see this on the node that registers the decommissioning event. Node 5 was killed and we are trying to decommission it while it's unavailable; node 2 was also decommissioned as part of the test prior to that. We are running with only 2 nodes, which should be... still ok?

tbg commented 3 years ago

Interesting. So there are two cases here, and what you're posting is the second one.

case 1: we're decommissioning (running) node X through node X. We'll set the decommissioned status and next we'll try to log the event. This may fail, since by decommissioning itself the node is about to be rejected by all other nodes. The write to the event log will race with the other nodes realizing (through gossip) to reject us.

case 2: this is what we see above. We're decommissioning node X through node Y (!= X) but we still get this error and fail to log the event. It seems that our node has n2 cached as a leaseholder for the range that the event log write is addressed to. The RPC layer (on node Y) is already aware that X has been removed, and so it returns a hard error from trying to dial it. The error is generated here:

https://github.com/cockroachdb/cockroach/blob/904807640f6a28aa82c5e63f0b62de2dbf27b3a0/pkg/server/server.go#L291-L297

As a general rule, we don't retry on these errors, see for example

https://github.com/cockroachdb/cockroach/blob/904807640f6a28aa82c5e63f0b62de2dbf27b3a0/pkg/kv/kvclient/kvcoord/dist_sender.go#L1864-L1870

This is because otherwise, operations that go through a decommissioned node hang forever (since anywhere you try to connect you'll get the auth error). But I think we made this too symmetrical: when healthy nodes try to dial a removed node (which they shouldn't, but it's hard to avoid entirely and at once due to caches etc.), they also get this permission denied error, but it really indicates something different from what it means when the decommissioned node tries to dial someone. I think we need to break the symmetry (when you dial a removed node, you get a different error that we consider retriable, the idea being that you won't actually retry).

@erikgrinaker mind checking that plan? We'd split up https://github.com/cockroachdb/cockroach/blob/904807640f6a28aa82c5e63f0b62de2dbf27b3a0/pkg/server/server.go#L309-L314 so that when trying to connect to a removed node (OnOutgoingPing), we get a FailedPrecondition (is that the right one? Or NotFound?) back, which will by default be retriable. We reserve the PermissionDenied for the case in which the remote actually rejects our connection attempt (OnIncomingPing, where we are the remote and we're serving the error to the removed node).
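
A minimal sketch of the proposed asymmetry, using gRPC status codes directly; checkDecommissioned and the ping-direction enum are hypothetical names for illustration, not the actual server hooks:

```go
package main

import (
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

type pingDirection int

const (
	outgoingPing pingDirection = iota // we are dialing the target node
	incomingPing                      // a removed node is dialing us
)

// checkDecommissioned sketches the split: dialing OUT to a removed node yields
// FailedPrecondition (retriable for range-addressed callers), while an
// INCOMING ping from a removed node is rejected with PermissionDenied.
func checkDecommissioned(target int32, removed map[int32]bool, dir pingDirection) error {
	if !removed[target] {
		return nil
	}
	if dir == outgoingPing {
		// We reached a removed node (e.g. via a stale leaseholder cache);
		// signal a condition the caller can recover from by evicting the
		// cache entry and retrying elsewhere.
		return status.Errorf(codes.FailedPrecondition,
			"n%d was permanently removed from the cluster", target)
	}
	// The removed node is trying to talk to us; reject it outright so it
	// does not retry forever.
	return status.Errorf(codes.PermissionDenied,
		"n%d was permanently removed from the cluster; it is not allowed to rejoin", target)
}

func main() {
	removed := map[int32]bool{2: true}
	fmt.Println(checkDecommissioned(2, removed, outgoingPing))
	fmt.Println(checkDecommissioned(2, removed, incomingPing))
}
```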

aliher1911 commented 3 years ago

We definitely have case 2 here; the only difference is that we have node X that we decommission, node Y that we go through, and node Z that is the cached leaseholder but was also decommissioned in a previous step of the test. (Just for clarity, so it matches the error above.) The observed behaviour is no different from what is described.

Looking at the recommended use of the error codes, FailedPrecondition's intended use is when the system state needs to be explicitly fixed before retrying, while Unavailable seems to be the code you are supposed to retry immediately. So if we return FailedPrecondition at the transport layer, it should trigger a lease refresh and a retry. Sounds logical to me.

I think we should split the fixes to the test from the fixes to error handling here. By test fixes I mean what I mentioned earlier: filtering stderr and maybe a few other minor things there that are unrelated to this particular history mismatch.

erikgrinaker commented 3 years ago

Right, this is an interesting state of affairs. I get that we want to retry the log write after a lease refresh, but if we make all RPCs to decommissioned nodes return a retryable error, isn't there a risk that we can end up in retry loops for some operations?

tbg commented 3 years ago

You're right, I think we need to be selective about it. What I should've said is: in DistSender, we should treat FailedPrecondition as retriable, in effect by not special casing it as an auth error - so we'll treat it like an opaque SendError (leading to a cache eviction + retry, if memory serves correctly). When a node in the cluster happens to run an RPC targeting that particular decommissioned node directly - it should not retry, i.e. retain the current behavior.

erikgrinaker commented 3 years ago

> in DistSender, we should treat FailedPrecondition as retriable, in effect by not special casing it as an auth error - so we'll treat it like an opaque SendError (leading to a cache eviction + retry, if memory serves correctly). When a node in the cluster happens to run an RPC targeting that particular decommissioned node directly - it should not retry, i.e. retain the current behavior.

Aren't these mutually exclusive? If "a node in the cluster happens to run an RPC targeting that particular decommissioned node", wouldn't that RPC typically go through the DistSender, who we're saying should retry it?

erikgrinaker commented 3 years ago

Actually, I guess that's not true. The DistSender primarily uses key/range addressing, not node addressing, so we'd want it to eventually get to that range. I.e., key/range-addressed operations should retry, but node-addressed operations shouldn't.
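
To make the distinction concrete, here is a rough sketch of the classification being proposed; shouldRetryReplicaError is a hypothetical helper, not the actual DistSender code:

```go
package main

import (
	"fmt"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// shouldRetryReplicaError sketches the proposed rule: key/range-addressed
// requests that hit a removed replica (FailedPrecondition) evict the cached
// entry and retry elsewhere, while node-addressed RPCs and rejections of the
// removed node itself (PermissionDenied) stay terminal.
func shouldRetryReplicaError(err error, rangeAddressed bool) bool {
	switch status.Code(err) {
	case codes.FailedPrecondition:
		return rangeAddressed
	case codes.PermissionDenied:
		return false
	default:
		return false
	}
}

func main() {
	removedErr := status.Error(codes.FailedPrecondition, "n2 was permanently removed from the cluster")
	rejectedErr := status.Error(codes.PermissionDenied, "n5 is not allowed to rejoin the cluster")

	fmt.Println(shouldRetryReplicaError(removedErr, true))  // true: evict cache, try the next replica
	fmt.Println(shouldRetryReplicaError(removedErr, false)) // false: node-addressed, fail fast
	fmt.Println(shouldRetryReplicaError(rejectedErr, true)) // false: we are the removed node
}
```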

aliher1911 commented 3 years ago

So this is the interesting part where we handle the auth error, which currently doesn't allow retries but would allow them if we changed the outgoing heartbeat check to return codes.FailedPrecondition: https://github.com/cockroachdb/cockroach/blob/904807640f6a28aa82c5e63f0b62de2dbf27b3a0/pkg/kv/kvclient/kvcoord/dist_sender.go#L1860-L1938

I actually tried the change and was not able to repro the test failure with it. But I haven't run the full test suite to verify whether anything else will fail.

erikgrinaker commented 3 years ago

Great! As outlined, we need to be careful to handle FailedPrecondition as a non-retryable error for RPC calls outside of the DistSender -- otherwise they may end up in ~infinite retry loops. Looking at the IsAuthError() call sites and code paths would be a good start. And we should make sure there are ample comments and tests for this, since it's a bit of a footgun.

tbg commented 3 years ago

This will be fixed with the backport of https://github.com/cockroachdb/cockroach/issues/66199.

tbg commented 3 years ago

@aliher1911 could you backport your change? Or are we intentionally holding off to let it bake on master for a bit?

aliher1911 commented 3 years ago

@tbg letting it brew on master for a bit was my plan. I'm also running some local tests to verify whether it addresses some other issues with talking to decommissioned nodes that were raised against 19.1 (#47460 and #47459), and checking that nothing else is broken by this fix.

cockroach-teamcity commented 3 years ago

roachtest.decommission/randomized failed with artifacts on release-21.1 @ db3cdb8913ba56599cc52766411893438b5c4b54:

The test failed on branch=release-21.1, cloud=gce:
test artifacts and logs in: /home/agent/work/.go/src/github.com/cockroachdb/cockroach/artifacts/decommission/randomized/run_1
    decommission.go:848,retry.go:197,decommission.go:794,decommission.go:61,test_runner.go:728: unexpected diff(matrix, expMatrix):
        [[][]string[19] != [][]string[20]]
        [[node_decommissioning] [node_recommissioned] [node_decommissioning] [node_decommissioning] [node_decommissioning] [node_decommissioning] [node_decommissioning] [node_decommissioning] [node_recommissioned] [node_recommissioned] [node_recommissioned] [node_recommissioned] [node_recommissioned] [node_recommissioned] [node_decommissioning] [node_decommissioning] [node_decommissioned] [node_decommissioned] [node_decommissioned]]
        vs.
        [[node_decommissioning] [node_recommissioned] [node_decommissioning] [node_decommissioning] [node_decommissioning] [node_decommissioning] [node_decommissioning] [node_decommissioning] [node_recommissioned] [node_recommissioned] [node_recommissioned] [node_recommissioned] [node_recommissioned] [node_recommissioned] [node_decommissioning] [node_decommissioning] [node_decommissioned] [node_decommissioned] [node_decommissioning] [node_decommissioned]]
Reproduce

To reproduce, try:

```bash
# From https://go.crdb.dev/p/roachstress, perhaps edited lightly.
caffeinate ./roachstress.sh decommission/randomized
```

/cc @cockroachdb/kv

tbg commented 3 years ago

It's been 26 days; should we go ahead with the backport after the local verifications you mention above, @aliher1911?

aliher1911 commented 3 years ago

Actually, we already backported it to 21.1: https://github.com/cockroachdb/cockroach/pull/66831

As for previous versions, it is not possible to backport because the code to short-circuit decommissioned nodes didn't exist before 21.1. So I'm going to close this as fixed.