
kvserver: avoid need for manual tuning of rebalance rate setting #14768

Open petermattis opened 7 years ago

petermattis commented 7 years ago

#14718 limited the bandwidth for preemptive snapshots (i.e. rebalancing) to 2 MB/sec. This is a blunt instrument. @bdarnell says:

What we really need is some sort of prioritization scheme that would allow snapshots to use the bandwidth only if it's not needed for other stuff. But I don't have any concrete suggestions so maybe we should go ahead and do this anyway.
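For context, the existing cap is essentially a fixed byte-rate limit on the outgoing snapshot stream. A minimal sketch of that kind of limiter, using golang.org/x/time/rate and hypothetical helper names (`sendSnapshot`, `sendChunk`), might look like this:

```go
// Sketch only: a fixed-rate cap on an outgoing snapshot stream, i.e. the
// "blunt instrument" described above. Not CockroachDB's actual snapshot code.
package main

import (
	"context"

	"golang.org/x/time/rate"
)

const chunkSize = 256 << 10 // 256 KiB per send

// sendSnapshot streams data without exceeding limitBytesPerSec on average.
func sendSnapshot(ctx context.Context, data []byte, limitBytesPerSec int, sendChunk func([]byte) error) error {
	lim := rate.NewLimiter(rate.Limit(limitBytesPerSec), chunkSize)
	for len(data) > 0 {
		n := chunkSize
		if n > len(data) {
			n = len(data)
		}
		// Block until the limiter allows another n bytes on the wire.
		if err := lim.WaitN(ctx, n); err != nil {
			return err
		}
		if err := sendChunk(data[:n]); err != nil {
			return err
		}
		data = data[n:]
	}
	return nil
}

func main() {
	data := make([]byte, 4<<20) // pretend this is a 4 MiB snapshot
	_ = sendSnapshot(context.Background(), data, 2<<20, func(chunk []byte) error {
		return nil // stand-in for the real gRPC stream send
	})
}
```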

Jira issue: CRDB-6099

petermattis commented 7 years ago

Perhaps something like weighted fair queueing. Here's a Go implementation.
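To make the idea concrete, here is a tiny single-threaded sketch of weighted fair queueing over two traffic classes (foreground vs. snapshot). It is an illustration only, not the linked package and not anything in CockroachDB: each packet is stamped with a virtual finish time of max(vtime, lastFinish) + size/weight, and the packet with the smallest finish time is served next.

```go
// Minimal weighted fair queueing sketch for two traffic classes.
package main

import "fmt"

type packet struct {
	class  string
	size   int
	finish float64 // virtual finish time assigned on enqueue
}

type flow struct {
	weight     float64
	queue      []packet
	lastFinish float64
}

type wfq struct {
	vtime float64
	flows []*flow
}

// enqueue stamps the packet with finish = max(vtime, lastFinish) + size/weight.
func (q *wfq) enqueue(f *flow, class string, size int) {
	start := q.vtime
	if f.lastFinish > start {
		start = f.lastFinish
	}
	p := packet{class: class, size: size, finish: start + float64(size)/f.weight}
	f.lastFinish = p.finish
	f.queue = append(f.queue, p)
}

// dequeue serves the queued packet with the smallest virtual finish time.
func (q *wfq) dequeue() (packet, bool) {
	var best *flow
	for _, f := range q.flows {
		if len(f.queue) == 0 {
			continue
		}
		if best == nil || f.queue[0].finish < best.queue[0].finish {
			best = f
		}
	}
	if best == nil {
		return packet{}, false
	}
	p := best.queue[0]
	best.queue = best.queue[1:]
	q.vtime = p.finish // simplified virtual-time update
	return p, true
}

func main() {
	fg := &flow{weight: 4} // foreground gets 4x the byte share
	snap := &flow{weight: 1}
	q := &wfq{flows: []*flow{fg, snap}}
	for i := 0; i < 8; i++ {
		q.enqueue(fg, "fg", 16<<10)      // small foreground writes
		q.enqueue(snap, "snap", 256<<10) // large snapshot chunks
	}
	for p, ok := q.dequeue(); ok; p, ok = q.dequeue() {
		fmt.Printf("%s %d bytes (finish=%.0f)\n", p.class, p.size, p.finish)
	}
}
```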

tbg commented 7 years ago

Would an ideal solution have the limiting sit on the incoming bandwidth and not the outgoing? I'm thinking of the case in which node 1 and 2 both send to node 3, but node 1 streams a snapshot while node 2 has foreground traffic. It would really have to be node 3 which backs off the other two; node 1 and 2 each would try to go at full speed without a chance of determining their relative priorities. Naively, I'd think that WFQ for reading from the connection should work since there's flow control and the sender has to wait once they've filled up their window.

I've searched around for precedent which I'm sure must exist somewhere, but haven't really been successful.

petermattis commented 7 years ago

I think we'd want foreground/background traffic prioritization on both the sender and receiver. The receiver might not have any foreground traffic, but the sender might. And vice versa.

cuongdo commented 7 years ago

@m-schneider Can you take this on during 1.2? The issue is a little vague now, and this will require an RFC. So, it'd be helpful if you worked with @tschottdorf on making this specific enough to be actionable as a first step.

petermattis commented 7 years ago

I'd estimate that the coding portion of this is a small fraction of the work. The bigger task is to test the prioritization mechanism under a variety of rebalancing and recovery scenarios on real clusters.

m-schneider commented 7 years ago

@cuongdo Sure sounds like a great project for 1.2!

m-schneider commented 6 years ago

Toby and I discussed this and looked into what we can do with gRPC. Prioritizing traffic on the sender is fairly straightforward: we can use gRPC interceptors together with the WFQ implementation that @petermattis linked. However, on the receiver there doesn't seem to be any straightforward way to do this. Before a recent optimization in gRPC (https://grpc.io/2017/08/22/grpc-go-perf-improvements.html), we could have throttled connections by stalling reads on the connections we wished to deprioritize via interceptors. Now, by the time the interceptor is invoked, the message has already been consumed and its connection window quota released, so there is an off-by-one that becomes an off-by-infinity for non-streaming RPCs (which we think use a new stream each time).

If we fork gRPC, we can modify the connection and interceptor code to give us everything we need to block a connection and throttle on the receiver side.

We're following issue #17370 because it touches many of the same pathways.
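To sketch the sender-side interceptor idea mentioned above: a gRPC stream client interceptor could route only snapshot streams through a shared limiter. The method-name match, the per-message byte charge, and the rates below are assumptions for illustration, not CockroachDB's actual RPC plumbing:

```go
// Sketch: a stream client interceptor that throttles only snapshot streams.
package main

import (
	"context"
	"strings"

	"golang.org/x/time/rate"
	"google.golang.org/grpc"
)

// snapshotLimiter is shared by all outgoing snapshot streams on this node.
var snapshotLimiter = rate.NewLimiter(rate.Limit(2<<20), 2<<20) // ~2 MiB/s

type throttledStream struct {
	grpc.ClientStream
}

func (s *throttledStream) SendMsg(m interface{}) error {
	// Crude accounting: wait for a fixed budget per message; a real version
	// would charge the actual encoded size of m.
	if err := snapshotLimiter.WaitN(s.Context(), 64<<10); err != nil {
		return err
	}
	return s.ClientStream.SendMsg(m)
}

func snapshotThrottleInterceptor(
	ctx context.Context, desc *grpc.StreamDesc, cc *grpc.ClientConn,
	method string, streamer grpc.Streamer, opts ...grpc.CallOption,
) (grpc.ClientStream, error) {
	cs, err := streamer(ctx, desc, cc, method, opts...)
	if err != nil {
		return nil, err
	}
	if strings.Contains(method, "RaftSnapshot") { // assumed method name
		return &throttledStream{ClientStream: cs}, nil
	}
	return cs, nil
}

func main() {
	// Installed at dial time; the target is a placeholder.
	_, _ = grpc.Dial("localhost:26257",
		grpc.WithInsecure(),
		grpc.WithStreamInterceptor(snapshotThrottleInterceptor))
}
```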

petermattis commented 6 years ago

Do we need to prioritize traffic at both the sender and recipient? I was imagining that we'd only prioritize traffic on the sender side, though I haven't thought this through in depth.

tbg commented 6 years ago

Change of heart from https://github.com/cockroachdb/cockroach/issues/14768#issuecomment-305552297? :-)

The basic problem is that if we only throttle on the sender and a node sends snapshots at full volume (because nothing else is going on), it doesn't matter what's going on on the recipients -- the foreground traffic will be impacted.

There's also a question of scope here: is this specifically to put something into place that allows snapshots to go fast when nothing else is going on, or is the next thing we're going to want a prioritization of Raft traffic vs foreground traffic also?

If there's a relatively clean, workable solution for snapshots only that's not as invasive, that might be something to consider, but to "really" solve the problem it seems that we'd be adding some hooks to gRPC's internals where we need them, and living with a fork.

a-robinson commented 6 years ago

cc @rytaft, who may have some wisdom to share

petermattis commented 6 years ago

Change of heart from #14768 (comment)? :-)

Heh, I knew I had thought about this before.

There's also a question of scope here: is this specifically to put something into place that allows snapshots to go fast when nothing else is going on, or is the next thing we're going to want a prioritization of Raft traffic vs foreground traffic also?

The initial scope was all about snapshots: allowing snapshots to be sent as fast as possible as long as there is nothing else going on. Prioritizing Raft traffic vs foreground traffic seems trickier as sometimes that Raft traffic is necessary to service the foreground traffic.

rytaft commented 6 years ago

@a-robinson Happy to help, but I think I need a bit more context. Is the main issue that multiple nodes are sending snapshots to the same recipient simultaneously? If so, would it be a problem to have them coordinate?

Also, is the bottleneck happening at the RPC layer, network, CPU utilization or something else? I could also talk about this offline with someone if that would be easier...

tbg commented 6 years ago

@rytaft the TL;DR is that we currently have two restrictions in place:

  1. a node only accepts one incoming snapshot at a time
  2. on the sending side, the snapshot stream is rate limited at 2-4 MB/s

The main goal here is to relax 2) by allowing snapshots to be transferred faster, but without impacting foreground traffic (as in, use the bandwidth if nobody else is using it, but don't try to compete, at least not too hard).
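Purely as a sketch of what "use the bandwidth only if nobody else is using it" could look like: a naive control loop that widens the snapshot limiter when measured foreground throughput is low and drops it back otherwise. The thresholds, the metric source, and the names are invented for illustration:

```go
// Sketch: adapt the snapshot rate limit based on observed foreground traffic.
package main

import (
	"sync/atomic"
	"time"

	"golang.org/x/time/rate"
)

var foregroundBytes int64 // incremented by the (hypothetical) foreground send path

const (
	floorRate   = 2 << 20   // 2 MiB/s: today's conservative default
	ceilingRate = 128 << 20 // 128 MiB/s: what we'd like when the link is idle
)

func adaptSnapshotRate(lim *rate.Limiter, stop <-chan struct{}) {
	ticker := time.NewTicker(time.Second)
	defer ticker.Stop()
	for {
		select {
		case <-stop:
			return
		case <-ticker.C:
			fg := atomic.SwapInt64(&foregroundBytes, 0) // bytes in the last second
			if fg < 1<<20 {
				lim.SetLimit(ceilingRate) // link looks idle, let snapshots go fast
			} else {
				lim.SetLimit(floorRate) // back off to protect foreground traffic
			}
		}
	}
}

func main() {
	lim := rate.NewLimiter(floorRate, 1<<20)
	stop := make(chan struct{})
	go adaptSnapshotRate(lim, stop)
	time.Sleep(3 * time.Second)
	close(stop)
}
```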

rytaft commented 6 years ago

Makes sense. I was talking about this with Cuong at lunch, and the main question I have is: are you sure it's the network bandwidth that is the bottleneck, or could it be the processing of RPC calls? In the latter case, you could just create a different channel/port for snapshot traffic....

tbg commented 6 years ago

We're somewhat certain that it's the network bandwidth, but @m-schneider is running more experiments now as the original issue https://github.com/cockroachdb/cockroach/issues/10972 didn't conclusively prove that.

By different channel, do you mean using a different port and then letting the OS throttle that against the rest? That's probably unlikely to happen, for two reasons: a) we only have one IANA assigned port (26257) and b) we don't want to burden the operator with setting up said limits.

rytaft commented 6 years ago

Yes that's what I meant - but if it's not possible to add a different port then I guess that's not an option.

So if a node only accepts one snapshot, the concern is that the single incoming snapshot may overburden the receiving node because the sender is not aware of the receiver's query load? If so, is it possible for the receiver to send "backpressure" messages to the sender somehow?

m-schneider commented 6 years ago

Currently there is backpressure, but in the latest gRPC implementation it's only on a per-stream basis, not on a per-connection basis. While this can limit a single snapshot, it doesn't really limit other traffic, because most of our requests are unary and each unary request uses a new stream.

petermattis commented 6 years ago

Sending a snapshot can also overburden the sending node's network. Consider that the snapshot is ~64MB. Sending that snapshot as fast as possible can starve other network traffic out of the node for a second or two on a fast network and tens of seconds on a slower (higher latency) link. At least, that's the theory about what was happening in #10972 and why placing a fixed rate-limit on how fast a snapshot is sent helped. Definitely seems worth understanding this better before trying to do something more sophisticated.

tbg commented 6 years ago

Yes, the receiver can consume the stream more slowly, but I'm not clear on how it would decide how much to throttle. Sure, we could measure the incoming foreground throughput, but that doesn't really solve the problem because we don't know what the max throughput is, which leads back to considering something like weighted fair queueing on the receiver (and, by symmetry, you also want it on the sender).

petermattis commented 6 years ago

Our current gRPC-based internode communication is composed of Batch, RaftMessageBatch and RaftSnapshot. Batch and RaftMessageBatch contain foreground traffic. RaftSnapshot contains background traffic. All of these RPCs are sent over a single connection between nodes. A thought I had on my walk to the train is that we can segregate this traffic into 2 connections between nodes, wrap the raw network connections, and use weighted-fair-queueing to control the writes and reads from the network. This would obviate the need to hook into gRPC but is a somewhat blunter instrument.
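A rough sketch of that two-connection idea, assuming a dedicated gRPC connection for snapshots whose underlying net.Conn charges every write to a shared limiter (the dial target, rates, and helper names are placeholders, not CockroachDB's actual wiring):

```go
// Sketch: dial snapshots on their own gRPC connection over a rate-limited conn.
package main

import (
	"context"
	"net"

	"golang.org/x/time/rate"
	"google.golang.org/grpc"
)

type limitedConn struct {
	net.Conn
	lim *rate.Limiter
}

func (c *limitedConn) Write(b []byte) (int, error) {
	// Charge the whole write before handing it to the kernel. A real version
	// would split very large writes so they fit within the limiter's burst.
	if err := c.lim.WaitN(context.Background(), len(b)); err != nil {
		return 0, err
	}
	return c.Conn.Write(b)
}

func dialSnapshotConn(ctx context.Context, target string, lim *rate.Limiter) (*grpc.ClientConn, error) {
	return grpc.DialContext(ctx, target,
		grpc.WithInsecure(),
		grpc.WithContextDialer(func(ctx context.Context, addr string) (net.Conn, error) {
			var d net.Dialer
			conn, err := d.DialContext(ctx, "tcp", addr)
			if err != nil {
				return nil, err
			}
			return &limitedConn{Conn: conn, lim: lim}, nil
		}))
}

func main() {
	lim := rate.NewLimiter(rate.Limit(8<<20), 1<<20) // background budget: ~8 MiB/s
	_, _ = dialSnapshotConn(context.Background(), "localhost:26257", lim)
}
```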

m-schneider commented 6 years ago

I looked into a couple of proxy implementations in Go yesterday; it would be fairly straightforward to write something like that. I also tried to reproduce your original experiment from issue #10972, and there seem to be other problems with foreground traffic when a new node joins a cluster or restarts, which I'm also looking into. The question now is how much of this is network related.

tbg commented 6 years ago

nginx now supports gRPC. I don't really think that helps here, but you could open another port for snapshots without having to plumb another parameter.

petermattis commented 6 years ago

I think this is still worth revisiting. Or perhaps highlighting the cluster setting that affects how much bandwidth background operations can take. During tpcc-10k rebalancing (after partitioning), I would run `SET CLUSTER SETTING kv.snapshot_rebalance.max_rate='128MiB'` in order to speed up the rebalancing.

tbg commented 6 years ago

This is something we're really going to want more rather than less as time goes by, but for now the technical complexity to get this in by 2.1 is too big. Moving to 2.2.

tbg commented 3 years ago

The rebalancing rate knob has come up a few more times in customer incidents: the default is too conservative to allow rapid data movement when it's needed, so operators raise it, but then, if the setting isn't reset, it remains too high and causes problems again later. Here are a few things we could do:

tbg commented 3 years ago

The other thing I'm noticing is that it looks like we're using the same connection for snapshots and for "general" traffic: https://github.com/cockroachdb/cockroach/blob/8c5253bbccea8966fe20e3cb8cc6111712f6f509/pkg/kv/kvserver/raft_transport.go#L656-L664

I wonder if changing this alone can produce any benefits.

irfansharif commented 2 years ago

+cc @shralex.

irfansharif commented 2 years ago

x-ref https://github.com/cockroachdb/cockroach/issues/63728.

andrewbaptist commented 1 year ago

There are likely a few things we should do here: