apollographql / router

A configurable, high-performance routing runtime for Apollo Federation 🚀
https://www.apollographql.com/docs/router/

Router blocks and doesn't handle any requests after being hit with load beyond a threshold #2377

Closed akhilesh-godi closed 1 year ago

akhilesh-godi commented 1 year ago

Describe the bug
The router seems to move into a state where it cannot handle any requests after being hit with load beyond some threshold. This causes the client trying to communicate with the router to time out. No logs or metrics show up either, yet CPU utilisation increases. Even after the load is stopped, any subsequent requests to the router continue to time out, no matter how many are sent. Sending requests again increases CPU consumption on the router, but the server remains in a non-responsive state.

To Reproduce
Steps to reproduce the behavior:

  1. Set container limits for CPU, Memory (in our case: cores = 1)
  2. Run router service on the container
  3. Once we reach a certain rps limit (in our case ~2000 rps per core), the router becomes unresponsive to any requests. All requests appear to be stuck, CPU utilisation continues to increase, and all client requests time out.
  4. Stop running load.
  5. Try sending a single request - the request is stuck (no response; the client observes a timeout).
  6. Send some more traffic - CPU utilisation increases but no metrics or logs appear. Requests appear to be stuck.

Expected behavior
Requests should receive an appropriate status code. Metrics and logging should reflect the number of requests. The router should be able to respond to requests once the load is removed; it shouldn't stay in a non-responsive state.

Output
image (shows no response from the server)


Additional context
There is a client timeout of 4s set, which might be relevant.

Geal commented 1 year ago

is there something specific to know about the queries? Are you sending the exact same query every time?

akhilesh-godi commented 1 year ago

There is nothing specific about the queries being sent. But the load test is being performed with the same set of queries (in our case this is 10 unique queries).

garypen commented 1 year ago

Which version of the router are you running? Are you using any customizations (rust plugins, etc...)? If this is kubernetes, do you have a networking sidecar (istio, etc...)?

If you remove the client timeout or set it to be a large number (120 seconds), does this affect behaviour?

akhilesh-godi commented 1 year ago

We are running v1.1.0. However, the behavior was reproducible in v1.7.0 as well.

We are not using any customisations.

This is a kubernetes deployment where we have two containers in the same pod, with the client and the server running in those containers. (We have a golang http client in the main container, and the router (server) is in another container, which is a sidecar.) We have no reason to believe that there is a problem in the networking layer, as we have analysed the tcpdump for the connections.

We haven't yet set the client timeout to a much higher value, but increasing it from 4s to 5s did not seem to change the behaviour. I'll try setting it to a very large value and observe the behavior.

Geal commented 1 year ago

are you using the telemetry options with that router deployment?

Geal commented 1 year ago

what's the limit on the number of file descriptors? And the number of client connections?

Geal commented 1 year ago

I think I reproduced it locally, with a low limit on file descriptors: timeouts, then the router stays unresponsive, at 100% CPU usage. @Meemaw is this the same issue you encountered?

Meemaw commented 1 year ago

This sounds very similar to what we are observing, though hard to verify 🙂

vamsi360 commented 1 year ago

@Geal - we brought our router down to 1 core and are able to reproduce the same behavior of client timeouts when the router goes to 100% CPU - the throughput drop is ~50% even though the latencies at the router are very good.

You can see in the throughput graph that between 21:36 and 21:42 the throughput dropped while we were at 100% CPU, even though the latencies are very good.

image

But this issue is different from what was explained in the main issue - that occurred for us even when the router's CPU was not maxed out.

cc: - @noobmaster-96 @lalithsrinivas @krishna15898 @akhilesh-godi

Geal commented 1 year ago

that could be consistent with a file descriptor limit: the client sessions that were properly established and set up connections to subgraphs will continue to work correctly as long as the connection is not closed, but new connections will time out because the router cannot accept them

krishna15898 commented 1 year ago

Stack trace and flamegraph of a 1-second CPU profile taken with pprof at the time the issue was reproduced:

profiles-issue-reproduces.zip

vamsi360 commented 1 year ago

@Geal ack. Would the router log "too many open files" to stdout in that case? We haven't seen it in our logs. We do see a lot of TIME_WAIT connections during that time.

The chart below shows that our connection re-use actually dropped during this time because of golang context timeouts, while the rate of new connections increased. The new connections did not translate into increased re-use, which is consistent with the TIME_WAIT behavior seen in netstat: connections were established but quickly went into TIME_WAIT without being re-used.

image

Also, one other observation regarding the wireshark tcp dump from the main post: the router was accepting the request but was not responding after ACKing it. This makes the golang client's context time out and close the connection.

Geal commented 1 year ago

@krishna15898 this seems different, looking at the stacktrace. @vamsi360 try to increase the number of available file descriptors (ulimit -n or LimitNOFILE in systemd) and see if it changes anything

akhilesh-godi commented 1 year ago
ulimit -n
1000000

This is already set to a large enough value. I will try increasing this further and report back on the observations tomorrow.

akhilesh-godi commented 1 year ago

While I do that, I echo @vamsi360's questions. Shouldn't we be seeing this in the logs in that case? I can see code which seems to suggest that error/debug/info logs should have been printed when the max open fds limit is reached or similar errors occur.

Which also brings me to the other question: if the hypothesis is that it is indeed because the fd limit is breached, why would there be so many open fds? Is there some part of the code where you suspect a file descriptor leak?

Meemaw commented 1 year ago

@akhilesh-godi AFAIK each tcp connection needs a file descriptor. If you have requests piling up you can quickly run out of them.

Geal commented 1 year ago

While I do that, I echo @vamsi360's questions. Shouldn't we be seeing this in the logs in that case? I can see code which seems to suggest that error/debug/info logs should have been printed when the max open fds limit is reached or similar errors occur.

I saw today that that code has an issue: the max open files error came with a different error code than what was expected.

Which also brings me to the other question: if the hypothesis is that it is indeed because the fd limit is breached, why would there be so many open fds? Is there some part of the code where you suspect a file descriptor leak?

For now I don't suspect a leak; the router is just very tolerant about keeping idle client connections, so they can pile up

Geal commented 1 year ago

@akhilesh-godi AFAIK each tcp connection needs a file descriptor. If you have requests piling up you can quickly run out of them.

Especially if one client request then triggers multiple subgraph requests. Subgraph connections are pooled so usually they're not in large numbers, but they count towards the limit
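For reference, a minimal sketch (assuming the libc crate on Linux; this is not router code) of reading the RLIMIT_NOFILE the process actually sees inside its container, since the container's limit can differ from the host's ulimit -n:

// Read the soft/hard open-file limits for the current process.
// Assumption: Linux and the `libc` crate; illustrative only.
fn current_fd_limit() -> std::io::Result<(u64, u64)> {
    let mut lim = libc::rlimit { rlim_cur: 0, rlim_max: 0 };
    // SAFETY: `lim` is a valid, writable rlimit struct.
    let rc = unsafe { libc::getrlimit(libc::RLIMIT_NOFILE, &mut lim) };
    if rc != 0 {
        return Err(std::io::Error::last_os_error());
    }
    Ok((lim.rlim_cur as u64, lim.rlim_max as u64))
}

fn main() {
    match current_fd_limit() {
        Ok((soft, hard)) => println!("RLIMIT_NOFILE soft={soft} hard={hard}"),
        Err(e) => eprintln!("getrlimit failed: {e}"),
    }
}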

akhilesh-godi commented 1 year ago

Thank you for your responses.

@Meemaw I am aware that each new TCP connection opens an FD. If requests pile up and new connections are attempted, open FDs go up. But closing those TCP connections should lower the count of open FDs. So the router being in a "stuck" state after the load has gone down is anomalous: the suspicion then is that connections are not being closed gracefully and therefore the number of open FDs is not going down.

I will report back with an update after running the experiment with increased FDs. However, it might be worth investigating whether there is some buggy behavior too, or some parameter that leaves FDs open for a prolonged duration after they are no longer used before eventually closing them. Tuning that parameter might be a possible solution if this is not a bug.

krishna15898 commented 1 year ago

We reproduced the issue where the throughput to the router drops drastically. Attached are the thread dump and flamegraph of a 1-second CPU profile taken with a sampling rate of 1000. We noted that open FDs were at 8000 while the limit was set to 1,000,000 (1 million), so we were not running into FD limit issues. Archive.zip

LalithSrinivas commented 1 year ago

Adding to the above, we were well under the CPU limit [200% out of 300% (3 cores) was being used]

Geal commented 1 year ago

right, there are issues I have to fix with file descriptors, but this particular bug seems unrelated. Here the flamegraph shows that some requests are still handled. Is that from connections that were previously established?

vamsi360 commented 1 year ago

We see that out of the 5 threads in the above thread dump, only 1 has the majority of the frames and the others have just one entry - epoll_ctl.

We are not able to generate more requests to the pod in this state. Could you check the below - we have given 3 CPUs to the docker container but don't see activity from all the threads:

FRAME: epoll_ctl
FRAME: <mio::net::tcp::stream::TcpStream as mio::event::source::Source>::deregister
FRAME: <tokio::io::poll_evented::PollEvented<E> as core::ops::drop::Drop>::drop
FRAME: core::ptr::drop_in_place<tokio::net::tcp::stream::TcpStream>
FRAME: core::ptr::drop_in_place<hyper::server::conn::Connection<tokio::net::tcp::stream::TcpStream,axum::routing::Router>>
FRAME: <core::future::from_generator::GenFuture<T> as core::future::future::Future>::poll
FRAME: <core::panic::unwind_safe::AssertUnwindSafe<F> as core::ops::function::FnOnce<()>>::call_once
FRAME: tokio::runtime::task::harness::Harness<T,S>::poll
FRAME: std::thread::local::LocalKey<T>::with
FRAME: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
FRAME: tokio::runtime::scheduler::multi_thread::worker::Context::run
FRAME: tokio::macros::scoped_tls::ScopedKey<T>::set
FRAME: tokio::runtime::scheduler::multi_thread::worker::run
FRAME: <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll
FRAME: tokio::runtime::task::harness::Harness<T,S>::poll
FRAME: tokio::runtime::blocking::pool::Inner::run
FRAME: std::sys_common::backtrace::__rust_begin_short_backtrace
FRAME: core::ops::function::FnOnce::call_once{{vtable.shim}}
FRAME: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
FRAME: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
FRAME: std::sys::unix::thread::Thread::new::thread_start
FRAME: start_thread
FRAME: clone
THREAD: tokio-runtime-w 4

vamsi360 commented 1 year ago

right, there are issues I have to fix with file descriptors, but this particular bug seems unrelated. Here the flamegraph shows that some requests are still handled. Is that from connections that were previously established?

we would also want to find out why the throughput drop happens and why the router is not able to perform well in this scenario - even if it's handling a portion of the requests, is there an internal limit of the router that we are hitting?

Geal commented 1 year ago

it feels like the threads are stuck, but they should not be stuck on epoll_ctl under mio's register (EPOLL_CTL_ADD) or deregister (EPOLL_CTL_DEL). Would it be possible to run the router under strace -Tttf to see the actual syscalls and timings?

Geal commented 1 year ago

looking at the traces, you are using APQ and header propagation with regex. Are there any other options you are using? Is APQ using the Redis cache?

akhilesh-godi commented 1 year ago

It appears that the threads are indeed stuck. And perhaps the router accepting no throughput at all is just an extreme case of the reduced throughput it is able to respond to while other requests are stuck. This might not have anything to do with fd limits, which seem to be under control.

Regarding the configurations we have:

supergraph:
  path: /graphql
  listen: 0.0.0.0:4000
  preview_defer_support: false
  introspection: true
sandbox:
  enabled: true
homepage:
  enabled: false
health-check:
  listen: 0.0.0.0:8088
  enabled: true
include_subgraph_errors:
  all: true
telemetry:
  metrics:
    prometheus:
      enabled: true
      listen: 0.0.0.0:4000
      path: /metrics
    common:
      attributes:
        subgraph:
          all:
            errors:
              extensions: # Include extensions data
                - name: subgraph_error_extended_type # Name of the attribute
                  path: .type # JSON query path to fetch data from extensions

headers:
  # Header rules for all subgraphs
  all:
    request:
      - propagate:
          matching: (?i)(User-Agent|...|...) 
traffic_shaping:
  all:
    timeout: 500ms
  subgraphs:
    xxxx:
      timeout: 300ms
    ...

And APQ between the client and the router is enabled. We aren't using Redis, as the number of unique queries we want to cache is very small and in-memory caching suffices for us.

Geal commented 1 year ago

FYI I'm moving the file descriptor issues here https://github.com/apollographql/router/pull/2395

krishna15898 commented 1 year ago

Hi @Geal, after a high-load scenario when the router starts being latent and we are not able to make even a single curl to it, we tried to take a thread dump and got the below error -

{"timestamp":"2023-01-12T15:13:53.313514Z","level":"ERROR","fields":{"message":"router service call failed: request timed out"},"target":"apollo_router::axum_factory::handlers"}
{"timestamp":"2023-01-12T15:13:53.313549Z","level":"ERROR","fields":{"message":"response failed","classification":"Status code: 500 Internal Server Error","latency":"30000 ms"},"target":"tower_http::trace::on_failure"}

vamsi360 commented 1 year ago

@Geal We tried to simplify the workflow to reproduce and tried the below:

  1. Disable APQ from the client requests.
  2. Load test the router directly [using Vegeta] with only one GQL type query [a heavy query] at a higher rps - 1500 rps.
  3. The router CPU is within limits (at max reaching ~80%) but it soon goes into a state where it times out the requests.
  4. Existing connections to the router which are active still continue to serve requests fine.
  5. New requests via curl fail with a "router service call failed" error and a 5xx after taking 30 secs.
  6. The health check port returns responses correctly.

Could you reproduce the above?

>
* Mark bundle as not supporting multiuse
< HTTP/1.1 500 Internal Server Error
< content-type: text/plain; charset=utf-8
< content-length: 26
< date: Thu, 12 Jan 2023 16:10:32 GMT
<
* Connection #0 to host 10.160.56.198 left intact
router service call failed%

Geal commented 1 year ago

I'll try to reproduce. What's the latency of the subgraph? Is it hitting the timeouts?

vamsi360 commented 1 year ago

I'll try to reproduce. What's the latency of the subgraph? Is it hitting the timeouts?

Yes. They all have timeouts configured - the max being 2 seconds for one of them. It was hitting timeouts for some of them sometimes.

vamsi360 commented 1 year ago

We have set the router timeout to 3s (1s more than the max subgraph timeout) and still see the same "router service call failed" errors, but after 3s. Could you point us to how to fix this - which parts of the code to check - to identify the cause of the timeouts?

NoobMaster-96 commented 1 year ago
Screenshot 2023-01-13 at 3 35 55 PM

In the image, you can see that when the throughput is high and the subgraphs are latent and failing, after some time the router's throughput drops, it tries to recover, then drops again and goes into a state where it is not able to recover. Please refer to the above graphs for details. We suspect this has something to do with the subgraph failures.

Geal commented 1 year ago

@vamsi360 the timeout code is here: https://github.com/apollographql/router/tree/dev/apollo-router/src/plugins/traffic_shaping/timeout

The "router service call failed" error is here: https://github.com/apollographql/router/blob/dev/apollo-router/src/axum_factory/axum_http_server_factory.rs#L342-L347 That message and status do not make sense here; we should probably return an HTTP 504 status (gateway timeout). I'm opening https://github.com/apollographql/router/issues/2400 about this.
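For context, a minimal tower-style sketch (illustrative only; not the router's actual traffic_shaping timeout plugin linked above) of how a timeout layer surfaces an elapsed error that the HTTP layer then has to map to a status code:

use std::time::Duration;
use tower::{BoxError, ServiceBuilder, ServiceExt};

// Requires the tower "timeout" and "util" features; illustrative only.
#[tokio::main]
async fn main() -> Result<(), BoxError> {
    // A stand-in for a slow supergraph/subgraph call that exceeds the timeout.
    let svc = ServiceBuilder::new()
        .timeout(Duration::from_secs(3))
        .service_fn(|req: &'static str| async move {
            tokio::time::sleep(Duration::from_secs(5)).await;
            Ok::<_, BoxError>(format!("handled {req}"))
        });

    match svc.oneshot("query").await {
        Ok(body) => println!("{body}"),
        // The timeout layer yields an Elapsed error; it is up to the HTTP
        // layer to decide the status code (504 rather than 500, as above).
        Err(e) if e.is::<tower::timeout::error::Elapsed>() => {
            println!("timed out -> should map to 504 Gateway Timeout");
        }
        Err(e) => return Err(e),
    }
    Ok(())
}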

@NoobMaster-96 this is an interesting piece of info. When the subgraphs are failing, what are they doing? Accepting connections then not answering? Rejecting connections? (So far I've been testing with a subgraph that accepts all connections but does not answer.)

krishna15898 commented 1 year ago

These are the logs from the time our subgraphs were failing - they look like they are emitted from the subgraph_service call or fetch_node

{"timestamp":"2023-01-13T10:31:16.557922Z","level":"ERROR","fields":{"message":"Fetch error: HTTP fetch failed from 'subgraph_1': request timed out"},"target":"apollo_router::query_planner::execution"}
{"timestamp":"2023-01-13T10:31:16.557977Z","level":"ERROR","fields":{"message":"Fetch error: HTTP fetch failed from 'subgraph_2': request timed out"},"target":"apollo_router::query_planner::execution"}

Geal commented 1 year ago

Ok so the connection is established and the request is sent but there's no answer in time

Geal commented 1 year ago

I think I'm close to reproducing it. Sending about 2k rps to 2 cores, with one subgraph that accepts requests but never answers, the others are easily overloaded. If I send another request while the benchmark is running, I often see that request just never getting an answer, even though there's a timeout on the entire request. That would point us towards an issue before calling the supergraph service, because that's where the timeout is applied.
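For anyone trying to reproduce this, a minimal sketch (a hypothetical helper, not part of the router or the benchmark above) of a "subgraph" that accepts connections but never answers, matching the test setup described here:

use tokio::net::TcpListener;

// A stub server that accepts TCP connections and then never responds, so the
// router's subgraph timeout is the only thing that ends the call.
#[tokio::main]
async fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("127.0.0.1:4001").await?;
    loop {
        let (stream, peer) = listener.accept().await?;
        tokio::spawn(async move {
            println!("accepted connection from {peer}; never answering");
            // Keep the connection open indefinitely without reading or writing.
            let _held = stream;
            std::future::pending::<()>().await;
        });
    }
}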

akhilesh-godi commented 1 year ago

Interesting. Thank you for keeping us posted.

Are you able to reproduce what happens after the load ends? The router never recovers and does not respond after?

Geal commented 1 year ago

still not reproducing: the requests were not receiving an answer mainly because I overloaded the router, so it was taking time to get to them, but once the benchmark ended it was able to answer. Another option here would be to:

Geal commented 1 year ago

There's another idea that you could test: remove the call to set_nodelay. I'm wondering if getting into a syscall right after spawning the task might be an issue
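For reference, an illustrative accept loop (a generic sketch, not the router's actual listener code) showing where a set_nodelay call typically sits - right after the connection is accepted, i.e. a syscall made at the start of the spawned task:

use tokio::net::TcpListener;

#[tokio::main]
async fn main() -> std::io::Result<()> {
    let listener = TcpListener::bind("127.0.0.1:4000").await?;
    loop {
        let (stream, _peer) = listener.accept().await?;
        tokio::spawn(async move {
            // Disabling Nagle's algorithm; this is the kind of call the comment
            // above suggests removing as an experiment.
            if let Err(e) = stream.set_nodelay(true) {
                eprintln!("set_nodelay failed: {e}");
            }
            // ... hand `stream` to the HTTP connection handler ...
        });
    }
}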

akhilesh-godi commented 1 year ago

Thanks! Will replicate this with tokio-console first and see if we get better logs to get a handle on the root cause. I'll keep you posted.
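For reference, a minimal sketch of how tokio-console instrumentation is typically wired into a tokio application (assuming the console-subscriber crate and a build with RUSTFLAGS="--cfg tokio_unstable"; the router's own telemetry setup is more involved than this):

#[tokio::main]
async fn main() {
    // Registers the console layer as the global tracing subscriber and starts
    // the gRPC endpoint that the `tokio-console` CLI connects to
    // (127.0.0.1:6669 by default).
    console_subscriber::init();

    // ... run the workload under investigation here ...
    tokio::time::sleep(std::time::Duration::from_secs(3600)).await;
}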

krishna15898 commented 1 year ago

Hey @Geal, after the router goes into a stuck state, we made a curl to the router which gives a 500 error "router service call failed". The logs printed are as follows:

2023-01-16T10:56:25.030737Z TRACE runtime.resource.async_op{source="Sleep::reset"}:runtime.resource.async_op.poll: runtime::resource::poll_op: op_name="poll_elapsed" is_ready=false
2023-01-16T10:56:25.435927Z TRACE tokio::task::waker: op="waker.wake" task.id=2037879656018870304
2023-01-16T10:56:25.436036Z TRACE runtime.spawn{kind=task task.name= task.id=7151 loc.file="apollo-router/src/axum_factory/listeners.rs" loc.line=219 loc.col=29}:request{http.method=POST http.route=/graphql http.flavor=HTTP/1.1 otel.kind=server}:runtime.resource{concrete_type="Sleep" kind="timer" loc.file="apollo-router/src/plugins/traffic_shaping/timeout/mod.rs" loc.line=57 loc.col=40}:runtime.resource.async_op{source="Sleep::new_timeout"}:runtime.resource.async_op.poll: runtime::resource::poll_op: op_name="poll_elapsed" is_ready=true
2023-01-16T10:56:25.436076Z TRACE runtime.spawn{kind=task task.name= task.id=7151 loc.file="apollo-router/src/axum_factory/listeners.rs" loc.line=219 loc.col=29}:request{http.method=POST http.route=/graphql http.flavor=HTTP/1.1 otel.kind=server}:runtime.resource{concrete_type="Sleep" kind="timer" loc.file="apollo-router/src/plugins/traffic_shaping/timeout/mod.rs" loc.line=57 loc.col=40}:runtime.resource.async_op{source="Sleep::new_timeout"}:runtime.resource.async_op.poll: tokio::task::waker: op="waker.clone" task.id=2037879656018870304
2023-01-16T10:56:25.436110Z TRACE runtime.spawn{kind=task task.name= task.id=7151 loc.file="apollo-router/src/axum_factory/listeners.rs" loc.line=219 loc.col=29}:request{http.method=POST http.route=/graphql http.flavor=HTTP/1.1 otel.kind=server}:runtime.resource{concrete_type="Sleep" kind="timer" loc.file="apollo-router/src/plugins/traffic_shaping/timeout/mod.rs" loc.line=57 loc.col=40}:runtime.resource.async_op{source="Sleep::new_timeout"}:runtime.resource.async_op.poll: runtime::resource::poll_op: op_name="poll_elapsed" is_ready=true
2023-01-16T10:56:25.436153Z TRACE runtime.spawn{kind=task task.name= task.id=7151 loc.file="apollo-router/src/axum_factory/listeners.rs" loc.line=219 loc.col=29}:request{http.method=POST http.route=/graphql http.flavor=HTTP/1.1 otel.kind=server}:supergraph{graphql.document="query <<QUERY REDACTED>>"}: tokio::task::waker: op="waker.drop" task.id=2037879656018870304
2023-01-16T10:56:25.436292Z TRACE runtime.spawn{kind=task task.name= task.id=7151 loc.file="apollo-router/src/axum_factory/listeners.rs" loc.line=219 loc.col=29}:request{http.method=POST http.route=/graphql http.flavor=HTTP/1.1 otel.kind=server}:supergraph{graphql.document="query <<QUERY REDACTED>>"}: tokio::task::waker: op="waker.drop" task.id=2037879656018870304
2023-01-16T10:56:25.436350Z TRACE runtime.spawn{kind=task task.name= task.id=7151 loc.file="apollo-router/src/axum_factory/listeners.rs" loc.line=219 loc.col=29}:request{http.method=POST http.route=/graphql http.flavor=HTTP/1.1 otel.kind=server}:supergraph{graphql.document="query <<QUERY REDACTED>>"}: apollo_router::plugins::telemetry: span is sampled then skip the apollo metrics
2023-01-16T10:56:25.436398Z ERROR runtime.spawn{kind=task task.name= task.id=7151 loc.file="apollo-router/src/axum_factory/listeners.rs" loc.line=219 loc.col=29}:request{http.method=POST http.route=/graphql http.flavor=HTTP/1.1 otel.kind=server}: apollo_router::axum_factory::handlers: router service call failed: request timed out
2023-01-16T10:56:25.436470Z ERROR runtime.spawn{kind=task task.name= task.id=7151 loc.file="apollo-router/src/axum_factory/listeners.rs" loc.line=219 loc.col=29}:request{http.method=POST http.route=/graphql http.flavor=HTTP/1.1 otel.kind=server apollo_private.duration_ns=3002551078 otel.status_code="ERROR"}: tower_http::trace::on_failure: response failed classification=Status code: 500 Internal Server Error latency=3002 ms
2023-01-16T10:56:25.436565Z TRACE runtime.spawn{kind=task task.name= task.id=7151 loc.file="apollo-router/src/axum_factory/listeners.rs" loc.line=219 loc.col=29}:encode_headers: hyper::proto::h1::role: Server::encode status=500, body=Some(Unknown), req_method=Some(POST)
2023-01-16T10:56:25.436606Z TRACE runtime.spawn{kind=task task.name= task.id=7151 loc.file="apollo-router/src/axum_factory/listeners.rs" loc.line=219 loc.col=29}: hyper::proto::h1::encode: sized write, len = 26
2023-01-16T10:56:25.436621Z TRACE runtime.spawn{kind=task task.name= task.id=7151 loc.file="apollo-router/src/axum_factory/listeners.rs" loc.line=219 loc.col=29}: hyper::proto::h1::io: buffer.queue self.len=136 buf.len=26
2023-01-16T10:56:25.436636Z TRACE runtime.spawn{kind=task task.name= task.id=7151 loc.file="apollo-router/src/axum_factory/listeners.rs" loc.line=219 loc.col=29}: hyper::proto::h1::dispatch: no more write body allowed, user body is_end_stream = false
2023-01-16T10:56:25.436703Z DEBUG runtime.spawn{kind=task task.name= task.id=7151 loc.file="apollo-router/src/axum_factory/listeners.rs" loc.line=219 loc.col=29}: hyper::proto::h1::io: flushed 162 bytes
2023-01-16T10:56:25.436713Z TRACE runtime.spawn{kind=task task.name= task.id=7151 loc.file="apollo-router/src/axum_factory/listeners.rs" loc.line=219 loc.col=29}: hyper::proto::h1::conn: flushed({role=server}): State { reading: Init, writing: Init, keep_alive: Idle }
2023-01-16T10:56:25.436730Z TRACE runtime.resource.async_op{source="Sleep::reset"}:runtime.resource.async_op.poll: runtime::resource::poll_op: op_name="poll_elapsed" is_ready=true
2023-01-16T10:56:25.436739Z TRACE runtime.resource.async_op{source="Sleep::reset"}:runtime.resource.async_op.poll: runtime::resource::poll_op: op_name="poll_elapsed" is_ready=false
2023-01-16T10:56:25.436748Z TRACE runtime::resource::state_update: permits=1 permits.op="add"
2023-01-16T10:56:25.436753Z TRACE runtime::resource::state_update: permits=1 permits.op="add"
2023-01-16T10:56:25.436757Z TRACE runtime.resource.async_op{source="Sleep::reset"}:runtime.resource.async_op.poll: runtime::resource::poll_op: op_name="poll_elapsed" is_ready=true
2023-01-16T10:56:25.436772Z TRACE runtime.resource.async_op{source="Sleep::reset"}:runtime.resource.async_op.poll: runtime::resource::poll_op: op_name="poll_elapsed" is_ready=false
2023-01-16T10:56:25.436779Z TRACE runtime::resource::state_update: permits=1 permits.op="add"
2023-01-16T10:56:25.436783Z TRACE runtime::resource::state_update: permits=1 permits.op="add"
2023-01-16T10:56:25.436786Z TRACE runtime.resource.async_op{source="Sleep::reset"}:runtime.resource.async_op.poll: runtime::resource::poll_op: op_name="poll_elapsed" is_ready=true
2023-01-16T10:56:25.436791Z TRACE runtime.resource.async_op{source="Sleep::reset"}:runtime.resource.async_op.poll: runtime::resource::poll_op: op_name="poll_elapsed" is_ready=false
2023-01-16T10:56:25.436797Z TRACE runtime::resource::state_update: permits=1 permits.op="add"
2023-01-16T10:56:25.436800Z TRACE runtime::resource::state_update: permits=1 permits.op="add"
2023-01-16T10:56:25.436803Z TRACE runtime.resource.async_op{source="Sleep::reset"}:runtime.resource.async_op.poll: runtime::resource::poll_op: op_name="poll_elapsed" is_ready=true
2023-01-16T10:56:25.436810Z TRACE runtime.resource.async_op{source="Sleep::reset"}:runtime.resource.async_op.poll: runtime::resource::poll_op: op_name="poll_elapsed" is_ready=false
2023-01-16T10:56:25.436816Z TRACE runtime::resource::state_update: permits=1 permits.op="add"
2023-01-16T10:56:25.436819Z TRACE runtime.resource.async_op{source="Sleep::reset"}:runtime.resource.async_op.poll: runtime::resource::poll_op: op_name="poll_elapsed" is_ready=true
2023-01-16T10:56:25.436826Z TRACE runtime.resource.async_op{source="Sleep::reset"}:runtime.resource.async_op.poll: runtime::resource::poll_op: op_name="poll_elapsed" is_ready=false
2023-01-16T10:56:25.437746Z TRACE tokio::task::waker: op="waker.wake" task.id=2037879656018870304
2023-01-16T10:56:25.437766Z TRACE runtime.spawn{kind=task task.name= task.id=7151 loc.file="apollo-router/src/axum_factory/listeners.rs" loc.line=219 loc.col=29}: hyper::proto::h1::conn: Conn::read_head
2023-01-16T10:56:25.437782Z TRACE runtime.spawn{kind=task task.name= task.id=7151 loc.file="apollo-router/src/axum_factory/listeners.rs" loc.line=219 loc.col=29}: hyper::proto::h1::io: received 0 bytes
2023-01-16T10:56:25.437788Z TRACE runtime.spawn{kind=task task.name= task.id=7151 loc.file="apollo-router/src/axum_factory/listeners.rs" loc.line=219 loc.col=29}: hyper::proto::h1::io: parse eof
2023-01-16T10:56:25.437793Z TRACE runtime.spawn{kind=task task.name= task.id=7151 loc.file="apollo-router/src/axum_factory/listeners.rs" loc.line=219 loc.col=29}: hyper::proto::h1::conn: State::close_read()
2023-01-16T10:56:25.437800Z DEBUG runtime.spawn{kind=task task.name= task.id=7151 loc.file="apollo-router/src/axum_factory/listeners.rs" loc.line=219 loc.col=29}: hyper::proto::h1::conn: read eof
2023-01-16T10:56:25.437821Z TRACE runtime.spawn{kind=task task.name= task.id=7151 loc.file="apollo-router/src/axum_factory/listeners.rs" loc.line=219 loc.col=29}: hyper::proto::h1::conn: State::close_write()
2023-01-16T10:56:25.437828Z TRACE runtime.spawn{kind=task task.name= task.id=7151 loc.file="apollo-router/src/axum_factory/listeners.rs" loc.line=219 loc.col=29}: hyper::proto::h1::conn: State::close_read()
2023-01-16T10:56:25.437832Z TRACE runtime.spawn{kind=task task.name= task.id=7151 loc.file="apollo-router/src/axum_factory/listeners.rs" loc.line=219 loc.col=29}: hyper::proto::h1::conn: State::close_write()
2023-01-16T10:56:25.437836Z TRACE runtime.spawn{kind=task task.name= task.id=7151 loc.file="apollo-router/src/axum_factory/listeners.rs" loc.line=219 loc.col=29}: hyper::proto::h1::conn: flushed({role=server}): State { reading: Closed, writing: Closed, keep_alive: Disabled }
2023-01-16T10:56:25.437888Z TRACE runtime.spawn{kind=task task.name= task.id=7151 loc.file="apollo-router/src/axum_factory/listeners.rs" loc.line=219 loc.col=29}: hyper::proto::h1::conn: shut down IO complete
2023-01-16T10:56:25.437896Z TRACE runtime.spawn{kind=task task.name= task.id=7151 loc.file="apollo-router/src/axum_factory/listeners.rs" loc.line=219 loc.col=29}: tokio::task::waker: op="waker.drop" task.id=2037879656018870304
2023-01-16T10:56:25.437907Z TRACE runtime.spawn{kind=task task.name= task.id=7151 loc.file="apollo-router/src/axum_factory/listeners.rs" loc.line=219 loc.col=29}: mio::poll: deregistering event source from poller    
2023-01-16T10:56:25.437961Z TRACE runtime.resource.async_op{source="Sleep::reset"}:runtime.resource.async_op.poll: runtime::resource::poll_op: op_name="poll_elapsed" is_ready=true
2023-01-16T10:56:25.437972Z TRACE runtime.resource.async_op{source="Sleep::reset"}:runtime.resource.async_op.poll: runtime::resource::poll_op: op_name="poll_elapsed" is_ready=false
2023-01-16T10:56:25.437979Z TRACE runtime::resource::state_update: permits=1 permits.op="add"
2023-01-16T10:56:25.437986Z TRACE runtime.resource.async_op{source="Sleep::reset"}:runtime.resource.async_op.poll: runtime::resource::poll_op: op_name="poll_elapsed" is_ready=true
2023-01-16T10:56:25.437991Z TRACE runtime.resource.async_op{source="Sleep::reset"}:runtime.resource.async_op.poll: runtime::resource::poll_op: op_name="poll_elapsed" is_ready=false

And below is a screenshot of tokio-console in that state (after a single curl)-

Screenshot 2023-01-16 at 4 26 46 PM

Following is a screenshot of tokio-console when the router is getting a 1 rps load after getting into a stuck state -

Screenshot 2023-01-16 at 4 30 07 PM

Below is the configuration -

supergraph:
  path: /graphql
  listen: 0.0.0.0:4000
  preview_defer_support: false
  introspection: true
sandbox:
  enabled: true
homepage:
  enabled: false
health-check:
  listen: 0.0.0.0:8088
  enabled: true
cpu-profile:
  listen: 0.0.0.0:8090
  enabled: true
include_subgraph_errors:
  all: true
telemetry:
  metrics:
    prometheus:
      enabled: true
      listen: 0.0.0.0:4000
      path: /metrics
    common:
      attributes:
        subgraph:
          all:
            errors:
              extensions: # Include extensions data
                - name: subgraph_error_extended_type # Name of the attribute
                  path: .type # JSON query path to fetch data from extensions

headers:
  # Header rules for all subgraphs
  all:
    request:
      - propagate:
          matching: (?i)(x-headername)
traffic_shaping:
  router: # Rules applied to requests from clients to the router
    timeout: 3s
  all:
    timeout: 500ms
  subgraphs:
    subgraph_name1:
      timeout: 300ms

Geal commented 1 year ago

the presence of the task stuck in tower buffer feels weird, as there should not be any use of tower's Buffer layer in the hot path. Is it the stock router 1.7, or a custom build with a rust plugin?

Geal commented 1 year ago

what is cpu-profile in the configuration?

Geal commented 1 year ago

I'm asking about this because we have seen Buffer cause issues like that, where a router that otherwise looks fine stops responding to requests. Buffer is used to make a service clonable by isolating it in a spawned task that receives requests through a channel, and returning a clonable wrapper around that channel that implements Service. It is useful but can create subtle bugs.
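Illustrative only (a generic tower sketch, assuming the tower "buffer" and "util" features; not the router's actual service stack), showing the pattern just described - a spawned worker task, a bounded channel, and clonable handles that all stall if the worker does:

use std::time::Duration;
use tower::{buffer::Buffer, BoxError, ServiceExt};

#[tokio::main]
async fn main() -> Result<(), BoxError> {
    // The inner (non-Clone) service runs inside a worker task spawned by Buffer.
    let inner = tower::service_fn(|req: String| async move {
        tokio::time::sleep(Duration::from_millis(10)).await;
        Ok::<_, BoxError>(format!("handled {req}"))
    });

    // 32 is the channel capacity; callers beyond it wait in poll_ready, so a
    // stalled worker stalls every clone of the handle.
    let buffered = Buffer::new(inner, 32);

    // The handle is cheap to clone and can be used from many tasks.
    let mut handles = Vec::new();
    for i in 0..4 {
        let svc = buffered.clone();
        handles.push(tokio::spawn(async move {
            svc.oneshot(format!("request {i}")).await
        }));
    }
    for h in handles {
        println!("{:?}", h.await?);
    }
    Ok(())
}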

krishna15898 commented 1 year ago

Attaching logs for one single curl from start to finish (3 seconds = router timeout) (the queries have been redacted) single-curl-logs-till-timeout.zip

This is a fork of the apollo router itself, but v1.1 (not 1.7), with no plugins.

cpu-profile is an endpoint we added which uses the pprof library to generate a flamegraph and print the stack trace. But it was not called at any point in this reproduction of the issue.

Geal commented 1 year ago

could you try tokio-console with the 1.7 version? What did you change in the fork?

Geal commented 1 year ago

or even 1.8, because it comes with that change https://github.com/apollographql/router/pull/2296. There's a chance it might help here if it is a Buffer related issue

akhilesh-godi commented 1 year ago

We've been caught up with other things and haven't been able to get back to this. It appears, though, that we have not been able to reproduce this with 1.8 at all - which is surprising, because I don't understand why that could have been the root cause and how it solved the problem.

Will try to repro this further today/tomorrow and post an update here.