apollographql / router

A configurable, high-performance routing runtime for Apollo Federation 🚀
https://www.apollographql.com/docs/router/
Other
800 stars 268 forks source link

Router produces 2 errors per request when subgraph response fails and content-type is not application/json #3572

Open nathanmarcos opened 1 year ago

nathanmarcos commented 1 year ago

Describe the bug Hello, I'm not sure if this was changed intentionally or it is a regression.

Given a subgraph that responds 5xx with content-type: text/plain, router produces the following error:

And the following errors:

When the subgraph is unavailable or the request times out due to a circuit breaker, we cannot control its response content type to always be application/json or its response body to contain a valid json with the data/errors attributes. Therefore, generating these two types of errors seems like a waste of resources since they do not add any extra value that helps us debug the issue.

To Reproduce Steps to reproduce the behavior:

  1. Setup router
  2. Set up a subgraph that responds 5xx with content-type: text/plain
  3. Shoot a query that hits the respective subgraph

Expected behavior Only one error is returned as it was on version 1.23.0.

abernix commented 1 year ago

This relates to https://github.com/apollographql/router/issues/2687 and was intentional because we were swallowing some error messages before. What was the impact to you? Was this just something you observed and were curious about?

nathanmarcos commented 1 year ago

Hello @abernix, we are constantly monitoring the application logs to ensure its health and find possible error messages that can be resolved to reduce log volume. We noticed extra messages in there and traced it down to router’s version 1.24.0. The application works fine, but unfortunately, these extra errors seem like a waste of resources since they do not add any extra value that helps us debug the issue. This reflects in extra 2x more storage usage for those logs and 2x more data transfer between router and the logs backend services and router and its clients.

Ideally, we would like to bring back the previous behavior. We could do that via string matching on a plugin, but that would be very error-prone. So I opened this issue to see if we could get it “fixed” at the root.

BrynCooke commented 1 year ago

I could get behind not specifying an error if the content type is not json/graphl and HTTP 200 is returned. We are also attaching the body of the response if this exists as a separate error.

Suggest that all errors for non-graphql responses are attached as extensions rather than separate errors.

nathanmarcos commented 11 months ago

Hello @BrynCooke , is there any progress on this?

We keep seeing our logs flooded with useless errors.

Currently running version 1.33.2.

nathanmarcos commented 6 months ago

Hello @BrynCooke and @abernix , is there any update on this one? We are currently running version 1.43.0.

nathanmarcos commented 6 months ago

This issue is happening for both 4xx and 5xx errors returned by the subgraphs.

BrynCooke commented 5 months ago

We have been working on a design to generally overhaul error handling in the router. The design includes work to ensure that the errors that all errors are documented, errors are not reused inappropriately, extensions contains all the information that is required rather than expanding to multiple errors, router to subgraph error handling is covered as well as client to router.

A poc was recently worked on https://github.com/apollographql/router/pull/4788 but we are now getting agreements from relevant teams before moving forward.

I realise that this has been hanging around for a long time, but we are close to getting agreement and we don't want to change the error handling only to change it again in the near future.

nathanmarcos commented 5 months ago

Thanks for the update @BrynCooke. I hope we can get some progress on that soon.