TrueLayer / ginepro

A client-side gRPC channel implementation for tonic
Apache License 2.0

Connection timeouts #37

ThomWright opened this issue 1 year ago

ThomWright commented 1 year ago

Bug description

Symptoms

  1. gRPC requests are taking too long to time out when Kubernetes network policies are misconfigured.
  2. The connection_timeout_is_not_fatal test takes ~75 seconds to finish.

Well-configured timeouts are important for system stability. Requests which take too long can hog resources and block other work from happening.

Causes

I can see two separate timeout problems:

  1. DNS resolution – when ResolutionStrategy::Lazy is used, there is currently no way to apply a timeout just for DNS resolution. If DNS never resolves, requests never complete.
  2. TCP connection – there is currently no way to set a connection timeout. On my machine, the socket seems to time out after ~75s. Even when we do set a connection timeout, tonic doesn't use it!

Even though we're setting our own fairly short timeouts around the overall request, I've seen some strange behaviour where requests are hanging for a long time. I think there's still something else going on that I don't understand, but I expect addressing the two points above will be generally helpful anyway.

To Reproduce

For the TCP connection timeout, just run the tests. I'll supply a test for lazy DNS resolution timeouts in a separate PR.

Expected behavior

Ability to control timeouts for TCP connections and DNS resolution.

Solutions

The TCP connection timeout is simpler to solve (though I will admit it took me a long time to find): we just need to set connect_timeout in the right places. First, tonic doesn't respect connect_timeout, which will be fixed by hyperium/tonic#1215. When that is merged, we can create our own connect_timeout option on top of it in #38.
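For illustration only, here's a minimal sketch of where that timeout would live on a plain tonic Endpoint once hyperium/tonic#1215 lands; the URI and durations are placeholders, and ginepro's builder would expose an equivalent knob via #38:

```rust
use std::time::Duration;
use tonic::transport::Endpoint;

fn endpoint() -> Endpoint {
    // Placeholder URI for illustration.
    Endpoint::from_static("http://my-service.my-namespace:8080")
        // Bounds only the TCP connect phase (respected once tonic#1215 lands).
        .connect_timeout(Duration::from_secs(2))
        // Separate deadline for the whole request.
        .timeout(Duration::from_secs(10))
}
```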

DNS resolution is harder. There are currently two options:

  1. Lazy resolution seems to be the default, but it also seems very hard to put a timeout around (at least, I can't think of a way!). It's possible to put an overall timeout around every request in the application, but that is both overly broad and error-prone: it's very easy to forget, or simply not to know that it's needed ("why don't the other timeouts take care of it?").
  2. Eager resolution has a timeout option. In practice, using this will probably mean doing DNS resolution on service startup, when the LoadBalancedChannel is created. This might be a good thing, preventing services from successfully starting when DNS would never resolve.

Of the two, I wonder if we should favour Eager resolution, and consider changing the default to this.
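For reference, a hedged sketch of option 2; I'm writing the builder calls from memory, so treat the exact method and variant names as approximate:

```rust
use std::time::Duration;

use ginepro::{LoadBalancedChannel, ResolutionStrategy};

async fn connect() -> anyhow::Result<LoadBalancedChannel> {
    // Placeholder service name and port.
    let channel = LoadBalancedChannel::builder(("my-service.my-namespace", 8080))
        // Resolve DNS up front when the channel is built, and fail fast if
        // resolution takes longer than the given timeout, instead of lazily
        // resolving in the background with no bound at all.
        .resolution_strategy(ResolutionStrategy::Eager {
            timeout: Duration::from_secs(5),
        })
        .channel()
        .await?;
    Ok(channel)
}
```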

~However, we might want a third option: Active lazy resolution (for want of a better name). Lazy resolution is currently passive, as in it happens in the background on a schedule. It is never actively called in the request flow, which is why it's hard to put a timeout around. Instead, could we implement something which actively calls probe_once() (with a timeout!) as part of the first request (or alternatively when GrpcServiceProbe.endpoints is empty)? This could give us lazy DNS resolution, but with timeouts.~

Scratch that, I took a different approach: tower-rs/tower#715. EDIT: Nope, that hasn't worked out. Back to the drawing board.

ThomWright commented 1 year ago

I'm considering a new approach to the DNS timeout problem. The design looks something like the diagram below.

What I have in mind is a new DnsTimeout middleware wrapped around the LoadBalancedChannel. I'm imagining it would work roughly as shown in the diagram below.

@conradludgate (or anyone else) what do you think?

Diagram: proposed DnsTimeout middleware wrapped around the LoadBalancedChannel (image not reproduced here).
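To make the idea concrete, here is a rough, untested sketch of my own of what such a middleware could look like as a tower Service wrapper; ReadyTimeout and ReadyTimeoutError are invented names, and this is not existing ginepro code. The idea is to bound how long we sit in poll_ready (which is where the balancer waits for discovery/DNS) rather than putting a timeout around the whole request:

```rust
use std::{
    future::Future,
    pin::Pin,
    task::{Context, Poll},
    time::Duration,
};

use tokio::time::{sleep, Sleep};
use tower::Service;

/// Hypothetical wrapper: fail fast if the inner service (e.g. the balancer
/// waiting on DNS discovery) does not become ready within `timeout`.
pub struct ReadyTimeout<S> {
    inner: S,
    timeout: Duration,
    /// Armed on the first Pending from `poll_ready`, cleared once ready.
    deadline: Option<Pin<Box<Sleep>>>,
}

impl<S> ReadyTimeout<S> {
    pub fn new(inner: S, timeout: Duration) -> Self {
        Self { inner, timeout, deadline: None }
    }
}

#[derive(Debug)]
pub struct ReadyTimeoutError;

impl std::fmt::Display for ReadyTimeoutError {
    fn fmt(&self, f: &mut std::fmt::Formatter<'_>) -> std::fmt::Result {
        write!(f, "service did not become ready within the readiness/DNS timeout")
    }
}

impl std::error::Error for ReadyTimeoutError {}

impl<S, Req> Service<Req> for ReadyTimeout<S>
where
    S: Service<Req>,
    S::Error: From<ReadyTimeoutError>,
{
    type Response = S::Response;
    type Error = S::Error;
    type Future = S::Future;

    fn poll_ready(&mut self, cx: &mut Context<'_>) -> Poll<Result<(), Self::Error>> {
        match self.inner.poll_ready(cx) {
            Poll::Ready(res) => {
                // Inner service resolved (ready or errored): disarm the timer.
                self.deadline = None;
                Poll::Ready(res)
            }
            Poll::Pending => {
                // Still waiting, e.g. no endpoints yet because DNS has not
                // resolved. Arm the timer on the first Pending and fail if it
                // fires before the inner service becomes ready.
                let timeout = self.timeout;
                let deadline = self
                    .deadline
                    .get_or_insert_with(|| Box::pin(sleep(timeout)));
                match deadline.as_mut().poll(cx) {
                    Poll::Ready(()) => {
                        self.deadline = None;
                        Poll::Ready(Err(ReadyTimeoutError.into()))
                    }
                    Poll::Pending => Poll::Pending,
                }
            }
        }
    }

    fn call(&mut self, req: Req) -> Self::Future {
        self.inner.call(req)
    }
}
```

The fiddly parts would be the error plumbing (hence the From bound) and deciding whether the timeout should only apply when the balancer has zero endpoints, rather than on every Pending.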

conradludgate commented 1 year ago

I'm going to take some time to fully digest what ginepro/tower/tonic are currently doing under the hood. Then I'll have a think about how this DnsTimeout fits in.

conradludgate commented 1 year ago

Ok, just to confirm that I have a clear picture of what the issue is:

Let's say we have 3 connections in our balance queue, but 1 of them has gone offline. We don't need to wait for DNS in this case. After a few attempts of using the dead connection we can remove it and continue using the 2 remaining connections (not sure if tower::Balance does this. TBC).

Let's say those 2 connections go down, we now have 0 connections in our balance queue. Therefore we need to wait for DNS to get new connections. This is also the init case. When this happens, we want to timeout early waiting for DNS, which would be a different timeout from the unary request round trip.

Of course, for DNS local to a kubernetes cluster, I would hope those DNS latencies would be low, so you could set the timeout for the DNS to be low but the timeout for the connection to be higher. Is that your understanding too?

conradludgate commented 1 year ago

> After a few attempts of using the dead connection we can remove it

From https://docs.rs/tower/0.4.13/tower/ready_cache/cache/struct.ReadyCache.html:

> If an error is returned, this indicates that the server failed and has been removed from the cache entirely.

ThomWright commented 1 year ago

Agreed!

> After a few attempts of using the dead connection we can remove it and continue using the 2 remaining connections

I had to have a look, but I'm guessing this is because tonic's Reconnect will error from poll_ready() iff the connection broke and then the subsequent reconnection attempt also failed?

So, services can be removed from the balancer for one of two reasons:

  1. Evicted by our GrpcServiceProbe (DNS stopped returning it).
  2. Failing to become ready, most likely via the failed-reconnect behaviour described above.

Cool. So yeah, either way we can end up with nothing to load balance across.

> Of course, for DNS local to a kubernetes cluster, I would hope those DNS latencies would be low, so you could set the timeout for the DNS to be low but the timeout for the connection to be higher. Is that your understanding too?

Yep. I mean, how long DNS vs TCP + TLS vs actual requests will take can vary a lot, which is why separate timeouts are useful. E.g. we might be happy to wait for a really long request which can take 10s P99, but we wouldn't want to wait that long just for DNS or TCP + TLS. We already have (well, will soon have) a timeout for the TCP connection, so DNS seems like the last piece of the puzzle.

conradludgate commented 1 year ago

Cool. I'm all caught up then. I'll have a think about yours and some other solutions then.

In the end I think we might already have to re-implement some of tonic for #38, so I think we can move the DnsTimeout further down the chain (in the Discover layer). But we'll still somehow have to surface the number of services registered to this layer.