apollographql / router

A configurable, high-performance routing runtime for Apollo Federation 🚀
https://www.apollographql.com/docs/router/
Other
804 stars 271 forks source link

Include schema hash in Uplink logs #4368

Open smyrick opened 10 months ago

smyrick commented 10 months ago

Is your feature request related to a problem? Please describe. When debugging Uplink issues it would be helpful to include the schema hash directly inline in error messages in things like #4147 or in any other error or info logs (See https://github.com/apollographql/router/blob/1ea5b695498a809bba6cf1e1fe7c9fe1638e8bf7/apollo-router/src/router/error.rs#L30)

Describe the solution you'd like Include extrametadata in logs like:

Describe alternatives you've considered We could do some splunk linking to tie error message by time ranges but often GH issues are created from just one error message so this would help others debug before they created GH issues

abernix commented 10 months ago

@BrynCooke Are we already including the schema hash on any of the attributes we emit during Uplink polling passes? Is that something that might be easily surfaced in logs?

abernix commented 10 months ago

Ah, sorry, I misread this, but the question is still about putting schema hashes on log contexts — not necessarily logs specifically about Uplink — but rather in the general sense. I assume this is to correlate schema-in-service to error messages?

smyrick commented 10 months ago

@abernix This was the customer error that they got using Uplink which should never deliver an invalid supergraph. We had no context to which launch or change caused this issue from logs and doing log searching is a little tedious with many instances and only ERROR logs enabled

could not create router: couldn't build Router Service: couldn't instantiate query planner; invalid schema: schema validation errors: Error extracting subgraph "my-subgraph-a" from the supergraph: this might be due to errors in subgraphs that were mistakenly ignored by federation 0.x versions but are rejected by federation 2. Please try composing your subgraphs with federation 2: this should help precisely pinpoint the problems and, once fixed, generate a correct federation 2 supergraph.
abernix commented 10 months ago

@smyrick That's useful context and I agree that such a log could benefit from more metadata to be concretely useful aside from lining up timestamps with graph variants. I also think a general solution that applies to all logs is probably also very useful — and might be implemented as simply as a launch_id attribute on an existing span or something — but that having the launch ID would go a long way here for most cases.

In terms of the schema that was received, Uplink will deliver a schema that causes this, though it should never be a thing that manifests out of the blue. In this case, this appears to just be a Federation v0.x schema being run on a Federation 2 runtime, is it possible that occurred?

smyrick commented 10 months ago

@abernix Yes that is what happened but the Router can run both Fed 1 or Fed 2 supergraphs, correct? Could it switch from Fed 2 back to a Fed 1 graph on a launch or would that break things in the Router (if they were not using Fed 2 features like Auth)

abernix commented 10 months ago

It's a best-effort compatibility mode with Federation 0.x, the same of which that exists in Gateway 2.x for the same graph. Best effort means that there were some legit bugs in 0.x composition which created impossible conditions that weren't validated against. The principles of Federation 2 simply makes it impossible to actually do the right thing.

The changes to the subgraphs to fix it are often trivial and as the message suggests, running Federation 2 composition will most always reveal those impossible conditions and ask for them to be fixed via validation errors. This all gets into the nuances here in the docs for migrating to Federation 2. Specifically, "Federation 2 is backward compatible with most Federation 1 supergraphs" (emphasis mine). In this case, this error results from a graph we just cannot be compatible with without someone fixing the impossible conditions in the subgraph (usually in a @key or @requires directive, if I'm not mistaken).

There is no fallback the Router itself can provide here without including the entirety of Federation 0.x's query planning execution, and also maintaining Rust-based execution, duplicatively for each.

garypen commented 9 months ago

If we added launch-id to the logged error we believe this would simplify the resolution of uplink issues.

bnjjj commented 9 months ago

Then I think it should be part of the new telemetry configuration with standard attributes https://www.apollographql.com/docs/router/configuration/telemetry/instrumentation/standard-attributes

It will end up as a span attribute but later when we will expand the logging and event configuration it will also be available for these configurations