Closed: petermattis closed this issue 3 years ago.
As @bdarnell guessed, the slowness is in retrieving the range descriptor. The problem is not with SendNextTimeout. That timeout controls how long DistSender waits before sending to the next replica, but in this test we're sending to the next replica almost immediately because the first replica has replied indicating it does not have the desired range. So we send to the next replica and it says "I'm not the lease holder". And we send to the third replica and it says "I'm not the lease holder". And then DistSender backs off and retries and we go through that dance again. If the timing works out perfectly, after 3 seconds one of the 2 remaining replicas becomes the lease holder. If the timing isn't perfect, the DistSender retry loop will just miss the new lease acquisition and will delay the test by another 3-4 seconds.
I see two options here:
@andreimatei, @tschottdorf Thoughts?
I thought we were going to make use of leadership transfers here.
@tschottdorf We're currently not (as far as I can tell). Do you recall where that discussion was had?
@tschottdorf We're currently not (as far as I can tell).
No, we're definitely not. Lease transfers only landed recently (#7345) and we haven't done it yet (or even properly discussed it on an issue). It was during discussions on that PR that we realized that the leader could transfer away its lease without any downtime. I'm sure I mentioned that a few times during #7297 as well, but perhaps not online. Looks like the obvious thing to do, but let's discuss if it isn't.
@bdarnell is wary of introducing production usage of lease transfers. In the short term, disallowing removal of the replica which is the lease holder, or making sure replicateQueue does not remove the lease holder, seems safer.
Works for me, though that runs contrary to balancing load (since all active replicas stick to where the load is), but it'll probably do for a while.
We don't rebalance based on load right now. Perhaps we can have an overloaded node gracefully shed range-leases.
We don't rebalance on load, but with that proposed change we rebalance based on no load. If block-writer writes to a single node, that node will have all the leader leases, so it will be unlikely to rebalance away from that node. In effect this can create a hotspot node on which all ranges are active and which has no way of getting rid of them.
There's no way to gracefully shed active leases without lease transfers. I'm personally not wary of them (though they could use a bit more testing), but I don't object to the decision of deferring their use for simplicity. Once we see those hotspot problems in practice we should bite the bullet and put in the work, not a workaround. I hope that won't be very soon (given the absence of production users at this point).
Ah, that's a good point about only rebalancing ranges that are not receiving local load.
Ok, so where do we stand with this issue?
Should I turn off removing the replica with the range lease to avoid the 3s delay?
I do plan to add load info for rebalancing soon, but that's a few weeks away.
Yes, turn off removal of the range-lease-holder replica for rebalancing. But this isn't as urgent as the disk space issues.
We could also remove the randomness here by adding the lease expiration time to NotLeaderError. DistSender could synchronize its retries with the expiration of the old lease. See also #3196.
We now have two tests (TestBasicManualReplication and TestGossipFirstRange) that transfer the range-lease immediately before removing the replica that was holding the lease. As far as I can tell, this works fine. Perhaps we should push harder on testing to get @bdarnell over his apprehension of using TransferLease.
Ben, any particular thing you're afraid of? Or what kind of test would you like to see?
One thing I should have said here is that there's one more thing to be done for lease transfers before they can be used in production: the old lease holder doesn't persist to disk the fact that it can no longer use the current lease. And so if it restarts before the transfer comes out of raft, it will think it has the lease and we'll have a bad time. It's on my list to figure out this disk state.
wanna create an issue for that?
Ben, any particular thing you're afraid of? Or what kind of test would you like to see?
I don't have anything in particular in mind, just a general fear that it would be very easy to introduce bugs like #7899. I'm concerned that we may have missed things (like the issue you just raised) because the original proposal was that it was just for tests.
Specifically, we need more tests that use TransferLease on a cluster that is under load (and these tests need to be sensitive to any inconsistencies that may be introduced. Maybe we need to use jepsen?). Our current tests only transfer leases while everything is quiet.
I'm concerned that we may have missed things (like the issue you just raised) because the original proposal was that it was just for tests.
I've reviewed it under the assumption that it wasn't only for tests. The restart scenario was discussed during the review of the original PR, so it's been on the radar throughout (though we should fix it; it isn't terribly difficult). I filed https://github.com/cockroachdb/cockroach/issues/7996.
Specifically, we need more tests that use TransferLease on a cluster that is under load (and these tests need to be sensitive to any inconsistencies that may be introduced. Maybe we need to use jepsen?).
I think we should consider enabling lease transfers behind a debug flag. That way we can test them in all the ways we test regular builds (in particular including Jepsen) and can at some future point flip the switch. That is, of course, after fixing #7996 and having unit tested various of these moving parts better.
This issue seems to be done.
I don't think this is done since TestBasicManualReplication still includes the comment:
// TODO(peter): Removing the range leader (tc.Target(1)) causes the test to
// take ~13s vs ~1.5s for removing a non-leader. Track down that slowness.
Ok, I'm going to unassign it from me, as I'm not actively looking into this. Putting it on Peter's plate.
Feel free to put it back on my plate if you want me to work on it.
There's still this reference to the issue:
@peter, mind remembering what the deal is?
@peter, mind remembering what the deal is?
Oh boy, I should have left a better comment. At one point in time it was problematic to remove the leaseholder replica. It still is problematic. You first have to transfer the leaseholder somewhere else and then remove it. I'm not sure if TestCluster.RemoveReplicas() will do that automatically or not.
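The required ordering can be sketched with a toy model. The type and method names here are hypothetical, not the actual TestCluster API: the point is only that removing the lease holder must be rejected until the lease has been transferred elsewhere.

```go
package main

import "fmt"

// cluster is a toy model of a range's replica set plus its lease holder.
type cluster struct {
	replicas    map[int]bool
	leaseHolder int
}

// removeReplica refuses to remove the lease holder: the lease must be
// transferred first, which is the ordering described above.
func (c *cluster) removeReplica(id int) error {
	if id == c.leaseHolder {
		return fmt.Errorf("replica %d holds the lease; transfer it first", id)
	}
	delete(c.replicas, id)
	return nil
}

// transferLease moves the lease to another existing replica.
func (c *cluster) transferLease(to int) error {
	if !c.replicas[to] {
		return fmt.Errorf("replica %d does not exist", to)
	}
	c.leaseHolder = to
	return nil
}

func main() {
	c := &cluster{replicas: map[int]bool{1: true, 2: true, 3: true}, leaseHolder: 1}
	fmt.Println(c.removeReplica(1)) // rejected: replica 1 still holds the lease
	if err := c.transferLease(2); err != nil {
		panic(err)
	}
	fmt.Println(c.removeReplica(1)) // succeeds after the transfer
}
```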
TestBasicManualReplication currently tests removal of a non-lease-holder replica. If the test is changed to remove the lease-holder replica, the runtime of the test increases ~8.5x. Need to track down where that slowness is coming from. @bdarnell hypothesizes: