bufbuild / httplb

Client-side load balancing for net/http
https://pkg.go.dev/github.com/bufbuild/httplb
Apache License 2.0

Proposal: Add Host concept to Resolver and implement RFC 6555 #67

Open jchadwick-buf opened 6 months ago

jchadwick-buf commented 6 months ago

Overview

In dual-stack IPv4+IPv6 environments, it is not always possible to know with certainty whether an IPv6 address is actually routable from the current environment. In containerized environments especially, “dual stack” often means an IPv4 stack with only IPv6 loopback, or a similar configuration in which Internet IPv6 addresses do not route properly. “Happy Eyeballs” algorithms, as proposed by RFC 6555, are therefore used wherever dual-stack networking is encountered, to ensure a swift fallback when IPv6 networking is non-functional.

In Go’s net stack, this is implemented at the net.Dialer level: when resolving addresses, the net.Dialer treats IPv6 addresses as preferred “primary” addresses and IPv4 addresses as “fallback” addresses. It first attempts to connect to the IPv6 address and, after a short delay (the Dialer's FallbackDelay, 300ms by default), starts a concurrent attempt to the IPv4 address. The first connection that succeeds is used; the other is closed and discarded.

In httplb, this doesn’t work: httplb resolves names itself, before the net.Dialer is called, and creates individual connections to the resolved hosts. The net.Dialer is therefore never given a name to resolve and cannot perform its RFC 6555 race.

Proposal

This proposal asserts that the core problem with httplb today is that the Resolver returns unstructured addresses. That matches DNS, which carries no information about hosts: it only has records that point to addresses. However, a clean IPv6-to-IPv4 fallback cannot be implemented because the Resolver does not return enough information. Ideally, the Resolver would return both the IPv4 and IPv6 address(es) for a host as a single unit.

The current default behavior of httplb at HEAD is to prefer IPv4 addresses when present, using IPv6 addresses only when no IPv4 addresses are present. This is probably compatible with most existing deployments, and it is likely better than using both the resolved IPv4 and IPv6 addresses: in dual-stack environments some of those addresses very likely point to the same hosts, so using both would lead to improperly balanced load.

Resolver Interface

In order to rectify this, this proposal suggests that the Resolver return a slice of Host structures, which shall have:

More advanced resolvers may be able to fill in these values in a way that supports a proper RFC 6555 fallback. DNS, however, does not provide enough information, so any solution to this problem will carry at least some downsides.

DNS Resolver implementation

There is no ideal way to implement this with DNS, so a heuristic must be used. I suggest the following:

The net effect is that the fallback IPv4 address for a given IPv6 address is effectively arbitrary. If any of the hosts are unhealthy and IPv4 fallback addresses come into use, multiple entries in the pool may end up balancing to the same host, making the balancing uneven. Unfortunately, there's really no way to prevent this with DNS alone, but I believe this is a better overall outcome: today, not explicitly specifying IPv4 or IPv6 results in other suboptimal behavior, which could make httplb difficult to adopt in systems that need to tolerate a wide variety of production environments.

Transport implementation

Right now, the transport implementation handles targeting by overriding the host in the URL, which requires some overhead:

In order to allow for this fallback behavior, we need to move this override to a lower level. This gives us the opportunity to lower the overhead of the transport implementation in the vast majority of cases, since there are far fewer cases where the request or URL will need to be cloned, and the TLS configuration will never need to be patched.

Implement RFC 6555 fallback manually

Users can provide a custom Dial function. We could wrap it in another dial function that performs the RFC 6555 race on top of the underlying implementation. The downside is that we need to implement this race ourselves, though that is not insurmountable.

Implement a custom *net.Resolver

It is challenging but possible to override the behavior of *net.Resolver. This can be done by setting the PreferGo field to true and setting the Dial function to return an in-memory net.Pipe() that speaks DNS, ideally using the x/net/dns/dnsmessage package. (I recently did this in my test implementation.)

While this looks ugly, it seems like it is actually intended by the Go developers, and despite the text on PreferGo being somewhat unclear, it will in fact work on all platforms:

    if runtime.GOOS == "plan9" {
        // TODO(bradfitz): for now we only permit use of the PreferGo
        // implementation when there's a non-nil Resolver with a
        // non-nil Dialer. This is a sign that the code is trying
        // to use their DNS-speaking net.Conn (such as an in-memory
        // DNS cache) and they don't want to actually hit the network.
        // Once we add support for looking the default DNS servers
        // from plan9, though, then we can relax this.
        if r == nil || r.Dial == nil {
            return false
        }
    }

Furthermore, while this is seemingly intended to work for the foreseeable future, there also seems to be intent to implement this more properly in the future:

    // TODO(bradfitz): optional interface impl override hook

By doing this, we gain the ability to tell net.Dial about the list of hosts, and it should be able to perform graceful IPv6 fallback, probably avoiding the fallback delay as necessary.

The only problem with this approach is that it relies on being able to override the Resolver field of the net.Dialer, which precludes specifying a custom Dial function. We would need to refactor this API so that the *net.Resolver gets passed back to the user, who can then use it in their Dial function implementation. We may need to overhaul the way the override works; it should probably be a function that returns a Dial function, given a *net.Resolver.

The advantage of this approach is that if applied properly, it should give us fallback behavior that is very close to what Go is able to offer out-of-the-box.

Summary