Closed by akhilesh-godi 1 year ago
could you try tokio-console with the 1.7 version? What did you change in the fork?
The fork had some changes where we disabled http2, since this was a problem that was later identified by someone else as an issue too: https://github.com/apollographql/router/issues/2063. However, we disabled http2 in the fork well before the config-based disable was merged by the Apollo team for the above issue. We had reported this too: https://github.com/apollographql/router/issues/1956.
We also made changes to support conditional queries in the query planner for federation v2 (@NoobMaster-96 will raise a PR for this). However, the issue was reproducible even without the fork, using the base images (v1.1.0), so we have no reason to believe that the changes in the fork were causing the problem.
There are various ways in which `Buffer` can fail. One is that `poll_ready` is called but `call` is not: it takes a slot and never gives it back, so at some point the buffer stops handling requests and new ones wait indefinitely.

Ah, I see - that makes sense. Thank you for explaining! I'll read through the code to get a better understanding.
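The `poll_ready`-without-`call` failure mode described above can be illustrated with a minimal std-only sketch. This is not the actual `tower::buffer` code; the `Slots` type and its `try_reserve`/`release` methods are hypothetical stand-ins for the bounded capacity that `poll_ready` reserves and a completed `call` gives back:

```rust
/// A counting "slot pool" standing in for Buffer's bounded capacity.
/// Reserving a slot models poll_ready succeeding; releasing it models
/// the request actually being sent and completed via call().
struct Slots {
    free: std::sync::Mutex<usize>,
}

impl Slots {
    fn new(capacity: usize) -> Self {
        Slots { free: std::sync::Mutex::new(capacity) }
    }

    /// Like poll_ready: take a slot if one is free.
    fn try_reserve(&self) -> bool {
        let mut free = self.free.lock().unwrap();
        if *free > 0 {
            *free -= 1;
            true
        } else {
            false
        }
    }

    /// Like a completed call: give the slot back.
    fn release(&self) {
        *self.free.lock().unwrap() += 1;
    }
}

fn main() {
    let slots = Slots::new(2);

    // Two callers run poll_ready (reserve a slot) but never follow up
    // with call(), so the slots are leaked:
    assert!(slots.try_reserve());
    assert!(slots.try_reserve());

    // Capacity is now exhausted forever; every new request finds no
    // free slot and, in the real Buffer, would wait indefinitely.
    assert!(!slots.try_reserve());

    // Had a caller completed its request, capacity would recover:
    slots.release();
    assert!(slots.try_reserve());

    println!("leaked reservations wedge the buffer until released");
}
```

In the real `tower::buffer::Buffer` the reservation is a permit on a bounded channel rather than a counter, but the consequence is the same: a caller that reserves readiness and is then dropped or cancelled without issuing the call leaks capacity, and once every slot is leaked the service stops making progress.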
@akhilesh-godi is it still happening?
Hi @Geal - I've been unable to repro this from v1.8+ onwards.
However, this is consistently reproducible on lower versions. I set up the router locally without bringing up any subgraphs, limited file descriptors to around 1,000,000, and bombarded the router with load. The router consistently goes into a stuck state.
While the `Buffer` explanation seems valid, do you think there are other situations, less easily reproducible, that could land us in such an issue?
Stability of the router is extremely important to us, as we see very large volumes of traffic and cannot afford resiliency bugs bringing down our stack.
alright, then that confirms that #2296 fixed it and we can close this.
There can always be unforeseen scaling bugs in the router, like the file descriptor issue (which we did not catch earlier because all our load tests are done with high limits), or other `Buffer` instances (we've been removing them gradually, but there are still cases where they are unavoidable).
The places where I would look for possible scaling bugs right now are request deduplication, subgraph timeouts and rate limiting, and hyper's connection pool (like #2063), but so far they have held up well.
Describe the bug
After being hit with load beyond some threshold, the router moves into a state where it cannot handle any requests, causing clients communicating with it to time out. No logs or metrics show up either, yet CPU utilisation increases. Even after the load is stopped, subsequent requests to the router continue to time out, however many are sent. Sending requests again still raises CPU consumption on the router, but the server remains non-responsive.
To Reproduce Steps to reproduce the behavior:
Expected behavior
Requests should receive an appropriate status code. Metrics and logging should reflect the number of requests. The router should be able to respond to requests once the load is removed, and should not remain in a non-responsive state.
Output
The server returns no response.
Additional context
A client timeout of 4s is set, which might be relevant.