Document recommended system requirements

sekhavati commented 2 years ago

Describe the solution you'd like

I looked through the documentation but didn't find anything related to recommended system requirements for the Apollo Router. Specifically I'm thinking about things like:

how many requests can one instance handle at a time
how much RAM do I need to provision for my expected volume of traffic
same as above but for CPU
etc

It doesn't need to be a precise science but a ballpark figure would be helpful.

abernix commented 2 years ago

Thanks for opening this great question! The router is capable of processing large amounts of traffic. We will certainly work on building out improved guidance on this in the future, but I will caveat that with: the answer gets pretty scientific very quickly, and there is no single ballpark figure, since various factors come into play:

The size of the schema
The performance of the subgraphs (latency here has material impact)
The number of (necessarily!) sequential subgraph fetches in-flight. Parallel fetches have much less impact, but are sometimes not possible depending on the subgraphs being queried.

That's to say, any ballpark figures we could produce can become variable very quickly based on conditions that are unique to you. We don't want to mislead people and there is really no substitute for real-world testing of your actual conditions.
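To make the sequential-versus-parallel point concrete, here is a minimal sketch of how subgraph latency adds up in each case. The latency values and function names are made up for illustration, and the model ignores the router's own overhead and network variance, so treat it as a back-of-the-envelope aid rather than a description of how the router plans queries.

```python
# Hypothetical subgraph fetch latencies in milliseconds (illustrative only).
fetch_latencies_ms = [40, 25, 60]


def sequential_latency(latencies):
    """Each fetch must wait for the previous one, so latencies add up."""
    return sum(latencies)


def parallel_latency(latencies):
    """Independent fetches overlap, so the slowest fetch dominates."""
    return max(latencies)


print("sequential:", sequential_latency(fetch_latencies_ms), "ms")
print("parallel:  ", parallel_latency(fetch_latencies_ms), "ms")
```

With the same three subgraphs, the sequential plan holds each request open roughly twice as long as the parallel one, which is why in-flight sequential fetches weigh more heavily on any sizing estimate.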

As a way of demonstrating some of this variability, this blog post I wrote in 2021 shows some of the potential, and also what it looks like to introduce variables into subgraph performance. It's been a while since I wrote it and performance has undoubtedly changed, so take the numbers with a grain of salt until we have the opportunity to re-evaluate them (something we will do eventually). The operations and schema we're working with in that blog post are still relatively simple, but we've consistently heard from adopters seeing large performance improvements over the Gateway with their real-world workloads.

As hopefully a practical suggestion: The Router does scale horizontally and is able to use many CPU cores and available memory. This pairs well with automatic horizontal scaling patterns, like those offered by Kubernetes. For example, a conceivably fine initial approach is to start with lower values and set up pod scaling policies which allow Kubernetes to scale based on, for example, resource metrics. Keeping your eye on CPU throttling metrics in Kubernetes also affords a better understanding of where you might want to configure the allocation of additional resources.
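For a feel of how scaling on resource metrics behaves, the sketch below applies the replica calculation described in the Kubernetes Horizontal Pod Autoscaler documentation: desired replicas = ceil(current replicas × current metric / target metric). The function name, the default bounds, and the example numbers are assumptions for illustration. The real controller also applies a tolerance band and stabilization windows, which this sketch leaves out.

```python
import math


def desired_replicas(current_replicas, current_utilization, target_utilization,
                     min_replicas=1, max_replicas=10):
    """Simplified HPA-style calculation, clamped to hypothetical bounds."""
    raw = math.ceil(current_replicas * current_utilization / target_utilization)
    return max(min_replicas, min(max_replicas, raw))


# 3 router pods averaging 90% CPU against a 60% target: scale out.
print(desired_replicas(3, 90, 60))

# 4 router pods averaging 30% CPU against a 60% target: scale in.
print(desired_replicas(4, 30, 60))
```

Starting with modest per-pod resources and a utilization target like this lets the replica count absorb traffic changes, while throttling metrics indicate whether each pod's own allocation is too small.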

Of course, you certainly shouldn't start with anything prohibitively low. For example, I wouldn't start with less than 0.5 CPU or 256MB of RAM just because you're being conservative. Unlike the Node.js Gateway runtime (which is single-threaded and has its own memory management that poses challenges when sitting in a proxy position), the Router can work a lot more efficiently with the resources that you do allocate. You'll see benefits to providing more memory and CPU, so if you're in the position to allocate a bit more, you can start higher and reel things in if you find that it's over-provisioned.

I hope this helps answer some questions in the near term, and we will work on building recommendations over time. We really need quantitative data to do this in a meaningful way and we want to spend the time to do that. I wouldn't expect this in the next few months, but we will convey our findings as we have them. We hope to make our performance infrastructure more dynamic as well, and make that information available more publicly.

Thanks again for opening this!

sekhavati commented 2 years ago

Thanks for the detailed response @abernix, I appreciate there's not a simple answer.

I think for now I'll do what you suggested, which is to over-provision, monitor, then scale back as required.

Regarding your blog post and the graphs showing figures of up to 20k req/sec, do you happen to recall the system specs of the machine you used during those tests? Not sure if I missed that detail, but it would help give some context to those numbers.

garypen commented 2 years ago

@sekhavati The blog post does document those details. A range of machines was used, with the router(s) executing on GCP e2-standard-8 systems.

abernix commented 2 years ago

@sekhavati I'm on my phone but I wrote the experiment setup details into the blog post. The machine sizes are there!

apollographql / router

Document recommended system requirements #1532