Hot reload downtime under load

utay commented 6 months ago

Describe the bug

In production a router process serves hundreds of requests per second; when it hot reloads because the supergraph has been refreshed, a few requests get 502s from the load balancer because it got connection refused on the router.

To Reproduce

I've been able to reproduce pretty consistently:

Start the router with hot reload enabled
Run any load testing tool such as ab -n 1000000 -c 20 http://127.0.0.1:4000/graphql
Modify the supergraph to make the router reload
:boom: Right around "reload complete", at least one request will fail with connection refused

Expected behavior

I'd expect 0 downtime as per the docs.

Additional context

Router 1.45.0

Geal commented 6 months ago

hi, thank you for the report. This looks like an issue we have seen elsewhere, we'll investigate and get back to you

abernix commented 5 months ago

This might be related to https://github.com/apollographql/router/pull/5235, which should land reasonably soon. If you wanted to try with that PR included, that might be a worthwhile try. 😄

apollographql / router

Hot reload downtime under load #5185