bazelbuild / bazel

a fast, scalable, multi-language and extensible build system
https://bazel.build
Apache License 2.0

Remote concurrency limited by gRPC connections #11801

Closed EricBurnett closed 4 years ago

EricBurnett commented 4 years ago

Description of the problem / feature request:

For remote RPCs, bazel only opens one connection per server (resolved IP). This constrains the maximum parallelism of a build - gRPC limits the number of in-flight requests per connection to MAX_CONCURRENT_STREAMS, which is a per-server HTTP2 setting, often set to 100 or 128.

In short, this means that when talking to a server resolved to a single IP, bazel can only have ~100 remote executions in flight, regardless of the setting of --jobs. If the server resolves to more IPs, this limit increases roughly linearly, but rarely above 500. Most remote builds can exceed 100-way parallelism, and many (especially builds with long-duration actions) could exceed 500.

Note that because bazel displays full parallelism even when actions are actually queued waiting for gRPC streams, it's hard to know just how many users this affects - more than realize it, I'm quite sure. But at least 3 independent projects have root-caused their limited parallelism to this issue so far, so it seems to be fairly common in practice (at least amongst RBE users).

The necessary fix in bazel is to replace the netty 'round_robin' load balancer with one that can open more than 1 connection (subchannel) per resolved address. That could mean dynamically adding connections when bottlenecked, but feeding in a target number of connections such as max(2, jobs/50) would probably do just fine.
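
To make this concrete, here is a minimal, hypothetical sketch (not Bazel's actual remote code; the class and method names are made up) of a fixed pool of gRPC channels to one endpoint, handed out round-robin so that concurrent RPCs spread across several HTTP2 connections instead of queueing behind a single connection's MAX_CONCURRENT_STREAMS cap:

import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration only: a fixed pool of channels to one target,
// rotated round-robin so in-flight RPCs spread across several HTTP/2
// connections instead of queueing behind a single one.
final class RoundRobinChannelPool {
  private final List<ManagedChannel> channels;
  private final AtomicInteger cursor = new AtomicInteger();

  RoundRobinChannelPool(String target, int poolSize) {
    channels = new ArrayList<>(poolSize);
    for (int i = 0; i < poolSize; i++) {
      channels.add(ManagedChannelBuilder.forTarget(target).useTransportSecurity().build());
    }
  }

  // Returns the next channel in rotation; stubs are bound to it per call.
  ManagedChannel nextChannel() {
    int i = Math.floorMod(cursor.getAndIncrement(), channels.size());
    return channels.get(i);
  }

  void shutdown() {
    for (ManagedChannel channel : channels) {
      channel.shutdown();
    }
  }
}

A caller would size the pool from --jobs (e.g. max(2, jobs/50)) and bind each stub to pool.nextChannel() before issuing an RPC.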

Feature requests: what underlying problem are you trying to solve with this feature?

Enable higher build parallelism for remote builds

Bugs: what's the simplest, easiest way to reproduce this bug? Please provide a minimal example if possible.

Run a remote build with --jobs=1000 and observe bazel reporting '1000 remote'. Then check from the server's perspective and see that bazel never actually had more than a few hundred RPCs in flight.

What operating system are you running Bazel on?

Linux (but applies to all)

What's the output of bazel info release?

3.4.1

Have you found anything relevant by searching the web?

cc @buchgr @bergsieker @gkousik @ulfjack

gkousik commented 4 years ago

Thanks for filing the bug, @EricBurnett!

A sample repo through which you can reproduce the issue from the client side: https://github.com/gkousik/bazel-parallelism-test

The above Bazel repo contains 500 genrules, each with a random value in --action_env so that none of them get cache hits. Each action sleeps for 10 seconds and then generates an output file. When you build that repo against a remote-execution pool with 500 bots, I would expect the build to complete in roughly 10 seconds: all rules are independent, so with 500 bots all of them should be sent to RBE at once.

What happens instead is that the build proceeds in batches of 200 actions and takes 30+ seconds to complete, no matter how many times I run it.

➜  bazel-parallelism-test git:(master) ✗ time ./runbuild.sh
INFO: Invocation ID: 803973fb-817c-4814-b27e-053086c56174
INFO: Build option --action_env has changed, discarding analysis cache.
DEBUG: /usr/local/google/home/kousikk/.cache/bazel/_bazel_kousikk/07c64333bc485db1757df85f8da29b91/external/bazel_toolchains/rules/rbe_repo.bzl:491:10: Bazel 3.2.0 is used in rbe_default.
INFO: Analyzed 500 targets (6 packages loaded, 511 targets configured).
INFO: Found 500 targets...
INFO: Elapsed time: 35.933s, Critical Path: 35.29s
INFO: 500 processes: 500 remote.
INFO: Build completed successfully, 501 total actions
./runbuild.sh  0.02s user 0.04s system 0% cpu 35.982 total

vmrautio commented 4 years ago

Hitting the same issue. We have a back-end that would scale to O(1000s) of concurrent actions, but this limits the parallelism to O(100).

ulfjack commented 4 years ago

This depends on the server-side configuration. For a server written using grpc-java, the documentation of the NettyServerBuilder says:

public NettyServerBuilder maxConcurrentCallsPerConnection(int maxCalls)
The maximum number of concurrent calls permitted for each incoming connection. Defaults to no limit.

See here: https://grpc.github.io/grpc-java/javadoc/io/grpc/netty/NettyServerBuilder.html#maxConcurrentCallsPerConnection-int-

I am not sure Bazel should be circumventing the server-side configuration by creating multiple connections in this case.
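
For reference, a minimal server-side sketch of the setting being discussed (NettyServerBuilder and maxConcurrentCallsPerConnection are real grpc-java APIs; ExecutionServiceImpl and the port are placeholders):

import io.grpc.Server;
import io.grpc.netty.NettyServerBuilder;

public final class ServerLimitExample {
  public static void main(String[] args) throws Exception {
    // Explicitly cap in-flight calls per connection; grpc-java's default is no limit,
    // so a low effective cap usually comes from an intermediary (e.g. a load balancer)
    // advertising a small MAX_CONCURRENT_STREAMS rather than from the application server.
    Server server =
        NettyServerBuilder.forPort(8980)
            .maxConcurrentCallsPerConnection(100)
            .addService(new ExecutionServiceImpl()) // placeholder service implementation
            .build()
            .start();
    server.awaitTermination();
  }
}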

EricBurnett commented 4 years ago

@ulfjack I don't think REAPI servers are (or should be) using a per-connection limit to indicate to clients the max parallelism they want to support. If the goal is to tell clients an overall limit they want honoured (per IP or endpoint-wide), that should go into Capabilities so that the client can apply it. If it's per-connection, it should not also be interpreted as a limit on the number of connections.

Per @werkt, the default is unlimited, and bazel happily works with servers allowing many more RPCs. The most common case I'm aware of where MAX_CONCURRENT_STREAMS is set is load balancers, where multiple connections are a legitimate and appropriate way to spread load across multiple LBs and backends. Are you aware of anyone using it to mean something else?

ulfjack commented 4 years ago

Good point. I think one of the goals of HTTP2 was to only have one connection from the client to the server rather than multiple, but that may not apply to gRPC. If you have an LB with a single IP, should the client create multiple connections? The intention of the load balancer is to balance the load to the actual servers, so it isn't immediately obvious to me that you'd need that - the load balancers can terminate the HTTP2 connection and distribute the individual requests to different servers (i.e., an HTTP2 proxy). Do we need to load balance to the load balancers? If so, why would the service not announce multiple IPs?

I appreciate that it may not be possible in some cases to control MAX_CONCURRENT_STREAMS on the server-side when using pre-existing HTTP proxies, so maybe we need something in the protocol to tell the client that they can / should open multiple connections. It's a bit odd that there isn't such a thing in HTTP2 - that would seem the obvious place to put it. Maybe they intentionally didn't want to add such a mechanism because it's hard to enforce on the server-side (in a distributed system)?

Maybe we should ask the gRPC folks what they think?

gkousik commented 4 years ago

@ulfjack Per https://github.com/grpc/grpc/issues/21386#issuecomment-564742173, the recommendation from the gRPC folks was to create additional connections, leaning towards "create a pool of connections and round-robin across it" (as opposed to reactively creating additional connections when a certain threshold of concurrent RPCs is reached).

ulfjack commented 4 years ago

Ah, ok. Sorry, I forgot about that in my previous post. What's the right behavior, then? Inspect MAX_CONCURRENT_STREAMS and decide based on that and --jobs how many channels to open? Do we need to add something to the protocol?

EricBurnett commented 4 years ago

Ulf:

Do we need to load balance to the load balancers?

I think so - e.g. consider a 64-core, well-networked builder doing a byte-heavy build. My understanding is this will either be throughput-constrained (possibly because most LBs use at most one core per connection, possibly from bottlenecks in HTTP2 itself), or will hotspot the LB (remembering that the bandwidth required of the LB is 2x the bandwidth of the client, since it has to proxy everything to the backend as well). I haven't actually sat down and measured it yet though - I know it's hard to exceed ~2Gbps on a single stream, but I haven't yet evaluated our effective per-connection bandwidth limit.

If so, why would the service not announce multiple IPs?

Not an expert in this area, but my understanding: Anycast and DNS. With DNS caching layers, IP-per-proxy can lead to roving hotspots and generally scales poorly, and with Anycast it's straightforward to put multiple servers behind one IP anyway.

So you'll often see multiple IPs, but only a small number (like 2) that function more as independent failure domains than as having any logical mapping to the servers backing them. See e.g. https://aws.amazon.com/blogs/networking-and-content-delivery/using-static-ip-addresses-for-application-load-balancers/ for an example of this pattern.

To be clear, this is not universal: in some setups load balancing with one-connection-per-ip is probably just fine. But even in these scenarios, using a few extra connections (e.g. 2 or 3 connections per IP) should not cause any harm - bazel connections will be comparatively few and heavy vs what you might see from "average" clients on the network (web browsers, apps, etc), and should in no way strain anything on "number of connections" grounds.

What's the right behavior, then? Inspect MAX_CONCURRENT_STREAMS and decide based on that and --jobs how many channels to open? Do we need to add something to the protocol?

From what we've seen of java gRPC, and heard from the team when we file bugs there, inspecting MAX_CONCURRENT_STREAMS will be hard, as it's learned fairly late in the process (after connections have started to be opened). But I've looked around and I've also not seen any evidence of anyone defaulting to particularly low values there, with 100+ being normal. So my suggestion is to pick a target number of connections as --jobs/50*, learn the number of IPs available, and open ceil(target_connections/ips) connections per IP. (If the number of IPs is also not knowable at that time, I guess we could just assume 2?)

I think this should work fairly well for all REAPI environments - at the small end, someone doing a --jobs=2000 build against a single server would see 40 connections opened, which shouldn't be a problem at all for that server (servers can generally handle tens of thousands of open connections); at the large end, that same build against an LB setup will spread the load over 40 different proxy servers and should see higher effective throughput as a result.

*Why 50 and not 100? gRPC behaviour again: RPCs are bound to connections early and then queue against them. Uniformly distributing random-length RPCs across connections means they fill up non-uniformly, so some connections will have queueing while others have spare capacity. Leaving 2x headroom should make that very unlikely.
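
As a rough sketch of that sizing heuristic (hypothetical helper, not Bazel code; the 50 and the fallback of 2 come from the suggestion above):

// Hypothetical sizing helper: derive a total connection target from --jobs,
// then spread it across the resolved IPs.
final class ConnectionSizing {
  static int targetConnections(int jobs) {
    return Math.max(2, jobs / 50); // ~2x headroom under a typical MAX_CONCURRENT_STREAMS of 100
  }

  static int connectionsPerIp(int jobs, int resolvedIps) {
    // ceil(target / ips); e.g. jobs=2000 against a single IP -> 40 connections.
    return (targetConnections(jobs) + resolvedIps - 1) / resolvedIps;
  }
}

With --jobs=2000 and two resolved IPs this would open 20 connections to each IP, matching the 40-connection example above.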

coeuvre commented 3 years ago

FYI: the fixed-size channel pool has been replaced with a dynamic channel pool that opens new connections on demand: de8f69dd9ef2e06d2809a2ae3eedc97810e0d181
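
For illustration, a simplified, hypothetical version of the on-demand idea (not the actual implementation in that commit): keep a list of channels, track in-flight RPCs per channel, and open a new channel only when every existing one is at its per-channel concurrency budget.

import io.grpc.ManagedChannel;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

// Hypothetical illustration of a dynamic channel pool: channels are created
// lazily, and a new one is opened only when every existing channel already
// carries maxConcurrencyPerChannel in-flight RPCs.
final class DynamicChannelPool {
  private final Supplier<ManagedChannel> channelFactory;
  private final int maxConcurrencyPerChannel;
  private final List<ManagedChannel> channels = new ArrayList<>();
  private final List<Integer> inFlight = new ArrayList<>();

  DynamicChannelPool(Supplier<ManagedChannel> channelFactory, int maxConcurrencyPerChannel) {
    this.channelFactory = channelFactory;
    this.maxConcurrencyPerChannel = maxConcurrencyPerChannel;
  }

  // Acquires a channel with spare capacity, opening a new one if none has any.
  synchronized ManagedChannel acquire() {
    for (int i = 0; i < channels.size(); i++) {
      if (inFlight.get(i) < maxConcurrencyPerChannel) {
        inFlight.set(i, inFlight.get(i) + 1);
        return channels.get(i);
      }
    }
    channels.add(channelFactory.get());
    inFlight.add(1);
    return channels.get(channels.size() - 1);
  }

  // Releases a slot on the given channel when its RPC completes.
  synchronized void release(ManagedChannel channel) {
    int i = channels.indexOf(channel);
    if (i >= 0 && inFlight.get(i) > 0) {
      inFlight.set(i, inFlight.get(i) - 1);
    }
  }
}

Callers would acquire() a channel before each RPC and release() it when the RPC completes.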