apollographql / router

A configurable, high-performance routing runtime for Apollo Federation 🚀
https://www.apollographql.com/docs/router/

Enough functionality to implement adaptive load shedding #6148

Open garypen opened 1 month ago

garypen commented 1 month ago

An investigation into backpressure issues in the router.

Most of the changes are in various plugins to implement backpressure. However, those fixes are not enough to provide useful functionality...

The current implementation of the router creates a new pipeline for each connection. This has the unfortunate effect of discarding state that various load-limiting layers need in order to work correctly.

This exploration modifies the router to hold a single master pipeline which is cloned for each connection. This allows the various tower connection limiting layers to work correctly.
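To illustrate why cloning matters, here is a minimal std-only sketch. `InFlightLimit` and the two helper functions are hypothetical names, not types from the router or from tower. The assumption is that a limiting layer keeps its counter behind an `Arc`, so clones of one master share the limit while freshly built instances do not.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical stand-in for a connection-limiting layer
/// (not the router's or tower's actual type).
#[derive(Clone)]
pub struct InFlightLimit {
    // Shared by every clone, so all clones enforce one global limit.
    in_flight: Arc<AtomicUsize>,
    max: usize,
}

impl InFlightLimit {
    pub fn new(max: usize) -> Self {
        Self { in_flight: Arc::new(AtomicUsize::new(0)), max }
    }

    /// Try to admit one request; returns false once the shared limit is reached.
    pub fn try_acquire(&self) -> bool {
        self.in_flight
            .fetch_update(Ordering::AcqRel, Ordering::Acquire, |n| {
                if n < self.max { Some(n + 1) } else { None }
            })
            .is_ok()
    }

    /// Mark one previously admitted request as finished.
    pub fn release(&self) {
        self.in_flight.fetch_sub(1, Ordering::AcqRel);
    }
}

/// One request per connection, each connection using a clone of `master`.
pub fn admitted_with_cloned(master: &InFlightLimit, connections: usize) -> usize {
    (0..connections).filter(|_| master.clone().try_acquire()).count()
}

/// One request per connection, each connection building a fresh limiter.
pub fn admitted_with_fresh(max: usize, connections: usize) -> usize {
    (0..connections).filter(|_| InFlightLimit::new(max).try_acquire()).count()
}
```

With a limit of 2 and 5 connections, the cloned variant admits only 2 requests, while the fresh-per-connection variant admits all 5 because each limiter starts from zero.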

I've got a version which works with standard tower layers, commented out here, but I've also got a potentially more interesting version which uses a load shedder based on Little's Law, which is what is active in this code.
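For readers unfamiliar with the idea, Little's Law states that the average number of requests in a system equals throughput times average latency (L = λW). The sketch below is one possible way to turn that into an admission check; it is not the implementation in this PR. `LoadShedder`, its fields, and the smoothing factor of 0.2 are all assumptions made for illustration.

```rust
use std::time::Duration;

/// Little's Law: L = lambda * W. Given a target throughput and the average
/// latency, return how many requests may be in flight at once.
pub fn littles_law_capacity(throughput_per_sec: f64, avg_latency: Duration) -> usize {
    (throughput_per_sec * avg_latency.as_secs_f64()).ceil() as usize
}

/// Hypothetical load shedder (not the PR's actual type).
pub struct LoadShedder {
    throughput_per_sec: f64,
    avg_latency: Duration,
    in_flight: usize,
}

impl LoadShedder {
    pub fn new(throughput_per_sec: f64, initial_latency: Duration) -> Self {
        Self { throughput_per_sec, avg_latency: initial_latency, in_flight: 0 }
    }

    /// Admit the request only if in-flight work is below the derived capacity.
    pub fn try_admit(&mut self) -> bool {
        // Always allow at least one request so the latency estimate can update.
        let capacity = littles_law_capacity(self.throughput_per_sec, self.avg_latency).max(1);
        if self.in_flight < capacity {
            self.in_flight += 1;
            true
        } else {
            false
        }
    }

    /// Record a completed request and fold its latency into a moving average.
    pub fn complete(&mut self, latency: Duration) {
        self.in_flight = self.in_flight.saturating_sub(1);
        let alpha = 0.2; // assumed smoothing factor
        let smoothed =
            alpha * latency.as_secs_f64() + (1.0 - alpha) * self.avg_latency.as_secs_f64();
        self.avg_latency = Duration::from_secs_f64(smoothed);
    }
}
```

For example, a target of 8 requests per second at 250 ms average latency gives a capacity of 2, so a third concurrent request would be rejected until one completes.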

Notes:

Modifying the pipeline to be cloneable has generally worked fine, but it has caused issues for the Limit layer. This layer looks "generally problematic" since it appears to make a number of assumptions about what request rejection actually means. I've done some minimal modification to try to make it work in a cloned pipeline, but tests are still failing and I'm not sure it does what it should do.

I also noticed that when implementing backpressure, various mock tests needed to be modified since test rejection happened earlier in the pipeline and a map_result() somewhere isn't triggered. That needs some investigation, but I think it's a small problem to address.

I modified the bridge query planner pool to prevent excessive queueing in this layer. Since I now want to control this by load_shedding before this service is reached, I only want enough channels to support the number of planners.
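The queue-sizing idea can be sketched with a bounded channel whose capacity equals the number of planners, so work that does not fit is rejected immediately rather than queued. This uses `std::sync::mpsc::sync_channel` purely for illustration; the channel type the router actually uses is not shown here, and `PlanRequest`, `planner_queue`, and `submit` are hypothetical names.

```rust
use std::sync::mpsc::{sync_channel, Receiver, SyncSender, TrySendError};

/// Hypothetical planning job (not the router's actual request type).
pub struct PlanRequest(pub String);

/// Create a queue that buffers at most one pending job per planner.
pub fn planner_queue(planners: usize) -> (SyncSender<PlanRequest>, Receiver<PlanRequest>) {
    sync_channel(planners)
}

/// Submit without blocking: a full queue is reported to the caller
/// instead of letting requests pile up in front of the planners.
pub fn submit(tx: &SyncSender<PlanRequest>, query: &str) -> Result<(), &'static str> {
    match tx.try_send(PlanRequest(query.to_string())) {
        Ok(()) => Ok(()),
        Err(TrySendError::Full(_)) => Err("planner queue full"),
        Err(TrySendError::Disconnected(_)) => Err("planner pool stopped"),
    }
}
```

With two planners, two submissions are buffered and a third returns an error until a planner receives a job. In the design described above, load shedding earlier in the pipeline would normally keep requests from reaching this point at all.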

Summary:

This PR provides a router which operates with approximately the same performance as the base router, but which controls memory and rejects excess load to prevent "over-commit" by the router. This is a very desirable property.

More testing is required, but this is looking promising so far.


svc-apollo-docs commented 1 month ago

✅ Docs Preview Ready

No new or changed pages found.

github-actions[bot] commented 1 month ago

@garypen, please consider creating a changeset entry in /.changesets/. These instructions describe the process and tooling.