apollographql / router

A configurable, high-performance routing runtime for Apollo Federation 🚀
https://www.apollographql.com/docs/router/

Metric tag "subgraph" is not applied to a first request to a subgraph #3918

Open kira-u opened 1 year ago

kira-u commented 1 year ago

Describe the bug
Metric: apollo_router_http_requests_total
Scenario: performed 10 queries to subgraph A.
Observed result: only 9 of the 10 requests were tagged with subgraph. The tag was not applied to the first request; subsequent requests were tagged correctly. Without restarting the router, the same behavior repeated when I then called subgraph B, so the issue appears to be per-subgraph.

To Reproduce
Router version: 1.29.1
Router telemetry config:

telemetry:
  tracing:
    trace_config:
      service_name: "my-router"
      service_namespace: "my-router-123"
      sampler: 1.0
      parent_based_sampler: false
    experimental_response_trace_id:
      enabled: true
    propagation:
      trace_context: true
    otlp:
      endpoint: http://127.0.0.1:4317
  metrics:
    common:
      service_name: "my-router"
      service_namespace: "my-router-123"
    otlp:
      endpoint: http://127.0.0.1:4317

Steps to reproduce the behavior:

  1. Call router 2 times, invoking subgraph A
  2. Call router 2 times, invoking subgraph B
  3. In Datadog, run the query sum:apollo_router_http_requests_total{$env,$region,$service} by {subgraph}.as_count(). It shows 1 request to subgraph A, 1 request to subgraph B, and 4 requests with subgraph tag value N/A.

Expected behavior
The first request to a subgraph should be tagged with the subgraph name.

Desktop (please complete the following information):

Additional context
One could argue the problem is minor because it affects only the first request to each subgraph. However, some subgraphs are invoked rarely, and once that is multiplied by the number of router instances and the frequency of restarts, the data loss becomes noticeable (we noticed it, after all). I did not check any other metrics, only the one mentioned at the beginning.

kira-u commented 1 year ago

I reran my test scenario on a fresh server to confirm whether the tag was simply not applied or the data point was lost entirely. After another server restart, for the same scenario of 2 requests to subgraph A and 2 requests to subgraph B, I got the following (subgraph tag on the left, metric value on the right):

Suite 1:
  N/A = 4
  A = 2
  B = 1 (should be 2; a value of 1 means the metric was lost)

Suite 2 (after another server restart):
  N/A = 4
  A = 2
  B = 2

From what I can see, the issue is intermittent: it does not reproduce on every run.
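
To make the comparison between runs less error-prone, the tallies above can be checked with a small script. This is only a sketch under one assumption inferred from suite 2 (not confirmed by the router docs): each call to the router yields one data point with subgraph = N/A for the router-level request, plus one tagged data point for the subgraph it invokes. The function name and the dictionary layout are hypothetical.

```python
def missing_counts(calls_per_subgraph, observed):
    """Return {tag: number of missing data points} for tags that differ from expectation.

    calls_per_subgraph: how many router calls invoked each subgraph, e.g. {"A": 2, "B": 2}
    observed: metric values grouped by subgraph tag, e.g. {"N/A": 4, "A": 2, "B": 1}

    Assumption (inferred from suite 2): every router call also produces one
    data point tagged N/A for the router-level request.
    """
    expected = dict(calls_per_subgraph)
    expected["N/A"] = sum(calls_per_subgraph.values())

    diff = {}
    for tag, want in expected.items():
        got = observed.get(tag, 0)
        if got != want:
            # Positive value: data points missing; negative: more than expected.
            diff[tag] = want - got
    return diff


if __name__ == "__main__":
    calls = {"A": 2, "B": 2}
    print("suite 1:", missing_counts(calls, {"N/A": 4, "A": 2, "B": 1}))
    print("suite 2:", missing_counts(calls, {"N/A": 4, "A": 2, "B": 2}))
```

An empty result means the run matched the assumed expectation; a non-empty result lists which tags lost data points, which is what distinguishes "tag not applied" (N/A too high) from "metric lost" (a tagged count too low with N/A unchanged).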