apollographql / router

A configurable, high-performance routing runtime for Apollo Federation 🚀
https://www.apollographql.com/docs/router/

Router blocks and doesn't handle any requests after being hit with load beyond a threshold #2377

Closed - akhilesh-godi closed this issue 1 year ago

akhilesh-godi commented 1 year ago

Describe the bug The router moves into a state where it cannot handle any requests after being hit with load beyond some threshold, causing clients communicating with it to time out. No logs or metrics show up either, yet CPU utilisation increases. Even after the load is stopped, any attempt to communicate with the router continues to time out, no matter how many requests are sent. Sending requests again increases CPU consumption on the router, but the server remains unresponsive.

To Reproduce Steps to reproduce the behavior:

  1. Set container limits for CPU and memory (in our case: 1 core).
  2. Run the router service in the container.
  3. Once the load reaches a certain rps threshold (in our case ~2000 rps per core), the router becomes unresponsive. All requests appear to be stuck, CPU utilisation continues to increase, and all client requests time out.
  4. Stop the load.
  5. Attempt to send a single request - it hangs with no response, and the client observes a timeout.
  6. Send some more traffic - CPU utilisation increases, but no metrics or logs appear and requests remain stuck.

Expected behavior Requests should receive an appropriate status code. Metrics and logs should reflect the number of requests. The router should be able to respond to requests once the load is removed, and it should not end up in a non-responsive state.

Output (Screenshot shows no response from the server.)


Additional context A client timeout of 4s is set, which might be relevant.
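For concreteness, a minimal sketch of what such a client-side timeout looks like, assuming reqwest (with its json feature), serde_json, and tokio; the router address, path, and query below are placeholders rather than details from the report:

```rust
use std::time::Duration;

use serde_json::json;

#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    // A 4s client timeout mirroring the one mentioned above.
    let client = reqwest::Client::builder()
        .timeout(Duration::from_secs(4))
        .build()?;

    let rsp = client
        // Placeholder address/path for a locally running router.
        .post("http://localhost:4000/graphql")
        .json(&json!({ "query": "{ __typename }" }))
        .send()
        .await?;

    println!("status: {}", rsp.status());
    Ok(())
}
```

When the router enters the stuck state described above, this call returns a timeout error on the client side after 4s instead of any HTTP status.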

akhilesh-godi commented 1 year ago

> could you try tokio-console with the 1.7 version? What did you change in the fork?

The fork had changes to disable http2, since that was causing problems for us (it was later identified as an issue by someone else as well: https://github.com/apollographql/router/issues/2063). However, we disabled http2 in the fork well before the option to disable it via config was merged by the Apollo team in response to that issue. We had reported this too: https://github.com/apollographql/router/issues/1956.

We also made changes to support conditional queries in the query planner for federation v2 (@NoobMaster-96 will raise a PR for this). However, the issue was reproducible even without the fork, using the base images (v1.1.0), so we have no reason to believe the changes in the fork were causing the problem.

Geal commented 1 year ago

There are various ways in which Buffer can fail:
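A minimal sketch of one way a bounded tower::buffer::Buffer can leave callers hanging, assuming tower 0.4 (buffer, limit, util features) and tokio 1, with a hypothetical never-ready inner service; this is an illustration, not the router's actual service stack:

```rust
use std::time::Duration;

use tower::{buffer::Buffer, limit::ConcurrencyLimit, service_fn, Service, ServiceExt};

#[tokio::main]
async fn main() {
    // Hypothetical inner service standing in for a stuck downstream call:
    // every response future sleeps for a very long time.
    let stuck = service_fn(|_req: &'static str| async {
        tokio::time::sleep(Duration::from_secs(3600)).await;
        Ok::<_, tower::BoxError>("response")
    });

    // With a concurrency limit of 1, the inner service stops reporting
    // readiness as soon as a single response future is in flight.
    let limited = ConcurrencyLimit::new(stuck, 1);

    // Buffer queues requests for a single worker task. Once the worker is
    // blocked waiting for inner readiness and the small queue is full,
    // every new caller is parked in `ready()` with no error and no log,
    // which looks like a hung server from the client's side. (Older tower
    // versions reportedly also leaked queue capacity when a caller dropped
    // between `poll_ready` and `call`, e.g. on a client timeout.)
    let buffered = Buffer::new(limited, 2);

    for i in 0..5 {
        let mut svc = buffered.clone();
        tokio::spawn(async move {
            let rsp = svc.ready().await.unwrap().call("query").await;
            println!("request {i} finished: {rsp:?}");
        });
    }

    // After a few seconds nothing has completed: one request is stuck in the
    // inner service, a couple are queued, and the rest wait for capacity.
    tokio::time::sleep(Duration::from_secs(5)).await;
    println!("no responses after 5s; callers are parked behind the Buffer");
}
```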

akhilesh-godi commented 1 year ago

Ah I see - that makes sense. Thank you for explaining! I'll read through the code to get a better understanding.

Geal commented 1 year ago

@akhilesh-godi is it still happening?

akhilesh-godi commented 1 year ago

Hi @Geal - I've been unable to repro this from v1.8+ onwards.

However, this is consistently reproducible on lower versions. I set up the router locally without bringing up any subgraphs, limited fds to about 1,000,000, and bombarded the router with load. The router consistently goes into a stuck state.

While the Buffer explanation seems valid, do you think there are other, less easily reproducible situations that could land us in such an issue?

Stability of the router is extremely important to us, as we see very large volumes of traffic and cannot afford resiliency bugs that bring down our stack.

Geal commented 1 year ago

alright, then that confirms that #2296 fixed it and we can close this.

There can always be unforeseen scaling bugs in the router, like the file descriptor issue (which we did not catch earlier because all our load tests are done with high limits), or other Buffer instances (we've been removing them gradually, but there are still cases where they are unavoidable).

The places where I would look for possible scaling bugs right now are request deduplication, timeout and rate limiting on the subgraph, and hyper's connection pool (like #2063), but so far they have held up well.
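As a rough illustration of the kind of layering referred to here, a generic tower middleware composition, assuming tower 0.4 with its full feature set; the subgraph_call service and the numbers are placeholders, not the router's actual subgraph stack. Each layer is a point where queueing and back-pressure behaviour can change under load:

```rust
use std::time::Duration;

use tower::{service_fn, BoxError, Service, ServiceBuilder, ServiceExt};

#[tokio::main]
async fn main() -> Result<(), BoxError> {
    // Hypothetical stand-in for an HTTP call to a subgraph.
    let subgraph_call = service_fn(|query: String| async move {
        // A real implementation would perform the fetch; this just echoes.
        Ok::<_, BoxError>(format!("response to {query}"))
    });

    // A generic middleware stack around the subgraph service. Each layer
    // shapes behaviour under load and is a potential source of hangs or
    // queue build-up if it misbehaves.
    let mut stack = ServiceBuilder::new()
        // Fail requests that take longer than 4s instead of waiting forever.
        .timeout(Duration::from_secs(4))
        // Allow at most 100 requests per second through to the subgraph.
        .rate_limit(100, Duration::from_secs(1))
        // Cap the number of in-flight subgraph requests.
        .concurrency_limit(512)
        .service(subgraph_call);

    let rsp = stack
        .ready()
        .await?
        .call("query { me { id } }".to_string())
        .await?;
    println!("{rsp}");
    Ok(())
}
```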