Closed: petermattis closed this issue 3 years ago.
As @bdarnell guessed, the slowness is in retrieving the range descriptor. The problem is not with SendNextTimeout. That timeout controls how long DistSender waits before sending to the next replica, but in this test we're sending to the next replica almost immediately because the first replica has replied indicating it does not have the desired range. So we send to the next replica and it says "I'm not the lease holder". And we send to the third replica and it says "I'm not the lease holder". And then DistSender backs off and retries and we go through that dance again. If the timing works out perfectly, after 3 seconds one of the 2 remaining replicas becomes the lease holder. If the timing isn't perfect, the DistSender retry loop will just miss the new lease acquisition and will delay the test by another 3-4 seconds.
I see two options here:
@andreimatei, @tschottdorf Thoughts?
I thought we were going to make use of leadership transfers here.
@tschottdorf We're currently not (as far as I can tell). Do you recall where that discussion was had?
@tschottdorf We're currently not (as far as I can tell).
No, we're definitely not. Lease transfers only landed recently (#7345) and we haven't done it yet (or even properly discussed it on an issue). It was during discussions on that PR that we realized that the leader could transfer away its lease without any downtime. I'm sure I mentioned that a few times during #7297 as well, but perhaps not online. Looks like the obvious thing to do, but let's discuss if it isn't.
@bdarnell is wary of introducing production usage of lease transfers. In the short term, disallowing removal of the replica which is the lease holder, or making sure replicateQueue does not remove the lease holder, seems safer.
Works for me, though that runs contrary to balancing load (since all active replicas stick to where the load is), but it'll probably do for a while.
We don't rebalance based on load right now. Perhaps we can have an overloaded node gracefully shed range-leases.
We don't rebalance on load, but with that proposed change we rebalance based on no load. If block-writer writes to a single node, that node will have all the leader leases, so it will be unlikely to rebalance away from that node. In effect this can create a hotspot node on which all ranges are active and which has no way of getting rid of them.
There's no way to gracefully shed active leases without lease transfers. I'm personally not wary of them (though they could use a bit more testing), but I don't object to the decision of deferring their use for simplicity. Once we see those hotspot problems in practice we should bite the bullet and put in the work, not a workaround. I hope that won't be very soon (given the absence of production users at this point).
Ah, that's a good point about only rebalancing ranges that are not receiving local load.
Ok, so where do we stand with this issue?
Should I turn off removing the replica with the range lease to avoid the 3s delay?
I do plan to add load info for rebalancing soon, but that's a few weeks away.
Yes, turn off removal of the range-lease-holder replica for rebalancing. But this isn't as urgent as the disk space issues.
We could also remove the randomness here by adding the lease expiration time to NotLeaderError. DistSender could synchronize its retries with the expiration of the old lease. See also #3196.
We now have two tests (TestBasicManualReplication and TestGossipFirstRange) that transfer the range-lease immediately before removing the replica that was holding the lease. As far as I can tell, this works fine. Perhaps we should push harder on testing to get @bdarnell over his apprehension of using TransferLease.
Ben, any particular thing you're afraid of? Or what kind of test would you like to see?
One thing I should have said here is that there's one more thing to be done for lease transfers before they can be used in production: the old lease holder doesn't persist to disk the fact that it can no longer use the current lease. And so if it restarts before the transfer comes out of raft, it will think it has the lease and we'll have a bad time. It's on my list to figure out this disk state.
wanna create an issue for that?
Ben, any particular thing you're afraid of? Or what kind of test would you like to see?
I don't have anything in particular in mind, just a general fear that it would be very easy to introduce bugs like #7899. I'm concerned that we may have missed things (like the issue you just raised) because the original proposal was that it was just for tests.
Specifically, we need more tests that use TransferLease on a cluster that is under load (and these tests need to be sensitive to any inconsistencies that may be introduced. Maybe we need to use jepsen?). Our current tests only transfer leases while everything is quiet.
I'm concerned that we may have missed things (like the issue you just raised) because the original proposal was that it was just for tests.
I've reviewed it under the assumption that it wasn't only for tests. The restart scenario was discussed during the review of the original PR, so it's been on the radar throughout (though we should fix it; it isn't terribly difficult). I filed https://github.com/cockroachdb/cockroach/issues/7996.
Specifically, we need more tests that use TransferLease on a cluster that is under load (and these tests need to be sensitive to any inconsistencies that may be introduced. Maybe we need to use jepsen?).
I think we should consider enabling lease transfers behind a debug flag. That way we can test them in all the ways we test regular builds (in particular including Jepsen) and can at some future point flip the switch. That is, of course, after fixing #7996 and having unit tested various of these moving parts better.
This issue seems to be done.
I don't think this is done since TestBasicManualReplication still includes the comment:
// TODO(peter): Removing the range leader (tc.Target(1)) causes the test to
// take ~13s vs ~1.5s for removing a non-leader. Track down that slowness.
Ok, I'm going to unassign it from me, as I'm not actively looking into this. Putting it on Peter's plate.
Feel free to put it back on my plate if you want me to work on it.
There's still this reference to the issue:
@peter, mind remembering what the deal is?
@peter, mind remembering what the deal is?
Oh boy, I should have left a better comment. At one point in time it was problematic to remove the leaseholder replica. It still is problematic. You first have to transfer the leaseholder somewhere else and then remove it. I'm not sure if TestCluster.RemoveReplicas() will do that automatically or not.
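The required ordering can be sketched with a toy model. The type and method names here are hypothetical, not the actual TestCluster API: the point is only that removing the lease holder must be rejected until the lease has been transferred elsewhere.

```go
package main

import "fmt"

// cluster is a toy model of a range's replica set plus its lease holder.
type cluster struct {
	replicas    map[int]bool
	leaseHolder int
}

// removeReplica refuses to remove the lease holder: the lease must be
// transferred first, which is the ordering described above.
func (c *cluster) removeReplica(id int) error {
	if id == c.leaseHolder {
		return fmt.Errorf("replica %d holds the lease; transfer it first", id)
	}
	delete(c.replicas, id)
	return nil
}

// transferLease moves the lease to another existing replica.
func (c *cluster) transferLease(to int) error {
	if !c.replicas[to] {
		return fmt.Errorf("replica %d does not exist", to)
	}
	c.leaseHolder = to
	return nil
}

func main() {
	c := &cluster{replicas: map[int]bool{1: true, 2: true, 3: true}, leaseHolder: 1}
	fmt.Println(c.removeReplica(1)) // rejected: replica 1 still holds the lease
	if err := c.transferLease(2); err != nil {
		panic(err)
	}
	fmt.Println(c.removeReplica(1)) // succeeds after the transfer
}
```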
TestBasicManualReplication currently tests removal of a non-lease-holder replica. If the test is changed to remove the lease-holder replica, the runtime of the test increases ~8.5x. Need to track down where that slowness is coming from. @bdarnell hypothesizes: