apollographql / router

A configurable, high-performance routing runtime for Apollo Federation 🚀
https://www.apollographql.com/docs/router/

Better performance debugging experience #2748

Closed: radekmie closed this issue 4 months ago

radekmie commented 1 year ago

**Describe the solution you'd like**

The traces sent to Apollo Studio could include more context. I'm not sure whether Apollo Studio supports it, but maybe a "waiting on server" marker should be used here?

**Describe alternatives you've considered**

We didn't look into the OpenTelemetry logs, but I assume they would be on par with the Studio traces.

**Additional context**

Here are two sample traces we see in our Apollo Studio: Sample A and Sample B.

[Screenshots: trace waterfalls for Sample A and Sample B]

(First three lines are from Subgraph A, next two from Subgraph B.)

Both are at the 99.99+ percentile, and we wanted to optimize the "long tail" of our API response times. However, based on this information, we have little to no context about what the problem is here:

  1. Is it Subgraph A? (Deployed to N containers; responds quickly to its own queries.)
  2. Is it Subgraph B? (Deployed to AWS Lambda; responds quickly to its own queries; cold starts are confirmed not to have happened for these two traces.)
  3. Is it Apollo Router? (Deployed to 1 container; responds quickly to single-subgraph queries.)
  4. Is it the network? (All subgraphs and Apollo Router are in the same subnet in the same AWS region.)
  5. Anything else?
We use Apollo Router v1.9.0 with the following config:

```yml
supergraph:
  listen: 0.0.0.0:4000
  path: /
  introspection: true
apq:
  subgraph:
    all:
      enabled: true
cors:
  allow_any_origin: true
  allow_headers: []
homepage:
  enabled: false
headers:
  all:
    request:
      - propagate:
          matching: .*
traffic_shaping:
  deduplicate_variables: true
  router:
    timeout: 90s
  all:
    deduplicate_query: true
    compression: gzip
    timeout: 90s
include_subgraph_errors:
  all: true
telemetry:
  apollo:
    field_level_instrumentation_sampler: 1.0
    send_headers: all
    send_variable_values: all
  tracing:
    trace_config:
      sampler: 1.0
```
radekmie commented 1 year ago

Just wanted to give an update here: we've tried every version up to 1.19.1, each for some time, and all of them suffer from the same problem. Is there anything we could do to help the investigation?

abernix commented 1 year ago

For what it's worth, I do think that OpenTelemetry would give you a better representation of the data. Have you tried configuring an exporter such as Datadog, Honeycomb, or even Jaeger locally to see if that gives you a better picture?
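For example, a minimal Router 1.x telemetry snippet that exports traces over OTLP to a local collector might look roughly like this (a sketch only: the endpoint value is an assumption, and the exact keys should be checked against the tracing docs for your router version):

```yml
# Sketch: export traces via OTLP to a locally running collector
# (e.g. Jaeger all-in-one). Keys follow the Router 1.x telemetry
# schema; verify them against the docs for your exact version.
telemetry:
  tracing:
    trace_config:
      sampler: 1.0
    otlp:
      # Assumed local OTLP gRPC endpoint; adjust to your collector.
      endpoint: http://127.0.0.1:4317
```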

radekmie commented 1 year ago

I haven't, thanks for the ideas. (I wasn't able to reproduce it locally, though.)
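If we do end up trying Jaeger locally, a starting point could be the all-in-one image (a hypothetical sketch; the image tag and port mapping are assumptions):

```yml
# docker-compose.yml sketch: local Jaeger for inspecting router traces.
services:
  jaeger:
    image: jaegertracing/all-in-one:1.47
    environment:
      # OTLP ingest must be enabled explicitly on older Jaeger releases.
      - COLLECTOR_OTLP_ENABLED=true
    ports:
      - "16686:16686"  # Jaeger UI
      - "4317:4317"    # OTLP over gRPC (the router exports here)
```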

abernix commented 4 months ago

I'll close this issue as it hasn't picked up a lot of traction and we're trying to clean up some old issues. Please open a new issue with a reproduction if you continue to struggle with this. Thanks!