Closed: marc-barry closed this issue 1 year ago.
We are tracking the memory usage over the weekend (which is a low-traffic time). Here is a point-in-time view of top pods:
kubectl top pods -n apollo-router
NAME CPU(cores) MEMORY(bytes)
apollo-router-7fd44dcc9b-27zcg 11m 711Mi
apollo-router-7fd44dcc9b-2fxkc 16m 740Mi
apollo-router-7fd44dcc9b-4tgp5 12m 789Mi
apollo-router-7fd44dcc9b-6xbbd 14m 793Mi
apollo-router-7fd44dcc9b-757nf 17m 762Mi
apollo-router-7fd44dcc9b-7x55r 15m 465Mi
apollo-router-7fd44dcc9b-8q8bd 12m 510Mi
apollo-router-7fd44dcc9b-96whd 12m 284Mi
apollo-router-7fd44dcc9b-99krm 12m 346Mi
apollo-router-7fd44dcc9b-btchx 14m 464Mi
apollo-router-7fd44dcc9b-fvmr6 18m 693Mi
apollo-router-7fd44dcc9b-gf8qv 13m 463Mi
apollo-router-7fd44dcc9b-gj55t 17m 396Mi
apollo-router-7fd44dcc9b-jps72 11m 745Mi
apollo-router-7fd44dcc9b-kjqg8 13m 733Mi
apollo-router-7fd44dcc9b-kvjfr 14m 681Mi
apollo-router-7fd44dcc9b-l9dxb 14m 356Mi
apollo-router-7fd44dcc9b-mqxtp 12m 773Mi
apollo-router-7fd44dcc9b-mwx5m 14m 640Mi
apollo-router-7fd44dcc9b-ncmcl 14m 781Mi
apollo-router-7fd44dcc9b-nj5l7 13m 546Mi
apollo-router-7fd44dcc9b-pchls 10m 669Mi
apollo-router-7fd44dcc9b-pmzs5 15m 473Mi
apollo-router-7fd44dcc9b-pqsxx 14m 520Mi
apollo-router-7fd44dcc9b-t5nhd 15m 803Mi
apollo-router-7fd44dcc9b-vkrlv 17m 747Mi
apollo-router-7fd44dcc9b-vpkhq 13m 776Mi
apollo-router-7fd44dcc9b-vxj46 14m 754Mi
apollo-router-7fd44dcc9b-wg6cz 13m 689Mi
apollo-router-7fd44dcc9b-z7rfj 13m 604Mi
I think it is useful to see the scale of memory usage we are seeing. Our resource limits are:
resources:
  limits:
    cpu: 1000m
    memory: 1500Mi
  requests:
    cpu: 100m
    memory: 350Mi
We do get OOMKilled termination reasons for pods, which means they did climb to greater than 1500Mi.
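For reference, this is how we check which pods actually hit OOMKilled (standard kubectl against the namespace above; the pod name is a placeholder):

# List each router pod with the reason for its last container termination (OOMKilled, Error, etc.)
kubectl get pods -n apollo-router \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}'

# Or inspect a single pod in detail
kubectl describe pod <pod-name> -n apollo-router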
Do you have an estimation of the leak rate? Is it proportional to traffic or independent? Is it something you have seen in previous router versions? We fixed a leak related to telemetry in 1.11; this might be another one.
We made quite a few changes to try and isolate the issue. We were using OpenTelemetry tracing with https://github.com/apollographql/router/releases/tag/v1.10.0, but we suspected something was up with the tracing and so we disabled it. We then updated to https://github.com/apollographql/router/releases/tag/v1.11.0 as we thought there might be some improvements (but we have left tracing disabled).
Do you have an estimation of the leak rate?
I'll try and get some better data on this. I'll restart the pods and then look at their base memory and we'll start from there.
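A minimal sketch of how I plan to sample this (just kubectl top on a loop; the interval and log file name are arbitrary):

# Append a timestamped snapshot of per-pod memory every 15 minutes
while true; do
  echo "--- $(date -u +%FT%TZ)" >> router-memory.log
  kubectl top pods -n apollo-router --no-headers >> router-memory.log
  sleep 900
done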
Is it proportional to traffic or independent?
It appears to be proportional to traffic volume.
Is it something you have seen in previous router versions?
Yes, we are confident we have seen this in https://github.com/apollographql/router/releases/tag/v1.10.0 but we didn't collect good enough metrics before that to know if the releases before also exhibited the same behaviour.
Here is the memory usage for all 30 pods after a kubectl rollout restart deployment/apollo-router -n apollo-router. I'll keep collecting so we can observe the change.
kubectl top pods -n apollo-router
NAME CPU(cores) MEMORY(bytes)
apollo-router-797f5b9d59-2gphr 15m 96Mi
apollo-router-797f5b9d59-5v554 6m 86Mi
apollo-router-797f5b9d59-6tl7b 7m 84Mi
apollo-router-797f5b9d59-76jzd 6m 86Mi
apollo-router-797f5b9d59-7hhqh 6m 95Mi
apollo-router-797f5b9d59-88822 6m 89Mi
apollo-router-797f5b9d59-9wlrq 5m 96Mi
apollo-router-797f5b9d59-bkkxv 5m 89Mi
apollo-router-797f5b9d59-g2fs5 4m 84Mi
apollo-router-797f5b9d59-g9cv7 11m 96Mi
apollo-router-797f5b9d59-gz92w 5m 84Mi
apollo-router-797f5b9d59-hjnpg 6m 87Mi
apollo-router-797f5b9d59-jqxtw 5m 97Mi
apollo-router-797f5b9d59-km6wk 5m 94Mi
apollo-router-797f5b9d59-lftxg 5m 83Mi
apollo-router-797f5b9d59-qd8h8 5m 88Mi
apollo-router-797f5b9d59-qdxg6 7m 97Mi
apollo-router-797f5b9d59-qxgjz 9m 95Mi
apollo-router-797f5b9d59-r947q 6m 89Mi
apollo-router-797f5b9d59-t8sfd 7m 88Mi
apollo-router-797f5b9d59-tn46x 6m 93Mi
apollo-router-797f5b9d59-v9b4p 6m 86Mi
apollo-router-797f5b9d59-vbp5k 5m 85Mi
apollo-router-797f5b9d59-w7952 7m 88Mi
apollo-router-797f5b9d59-wm2s9 7m 91Mi
apollo-router-797f5b9d59-wmqr8 37m 92Mi
apollo-router-797f5b9d59-wmzxs 6m 93Mi
apollo-router-797f5b9d59-wv2nf 6m 83Mi
apollo-router-797f5b9d59-z6dns 25m 99Mi
apollo-router-797f5b9d59-zxkgf 5m 87Mi
After about 4.5 hours this is the memory consumption now:
date
Sun 26 Feb 2023 22:37:58 EST
kubectl top pods -n apollo-router
NAME CPU(cores) MEMORY(bytes)
apollo-router-797f5b9d59-2gphr 6m 150Mi
apollo-router-797f5b9d59-5v554 5m 179Mi
apollo-router-797f5b9d59-6tl7b 5m 137Mi
apollo-router-797f5b9d59-76jzd 4m 483Mi
apollo-router-797f5b9d59-7hhqh 7m 133Mi
apollo-router-797f5b9d59-88822 6m 147Mi
apollo-router-797f5b9d59-9wlrq 5m 128Mi
apollo-router-797f5b9d59-bkkxv 6m 157Mi
apollo-router-797f5b9d59-g2fs5 6m 139Mi
apollo-router-797f5b9d59-g9cv7 7m 647Mi
apollo-router-797f5b9d59-gz92w 5m 669Mi
apollo-router-797f5b9d59-hjnpg 6m 139Mi
apollo-router-797f5b9d59-jqxtw 6m 132Mi
apollo-router-797f5b9d59-km6wk 6m 136Mi
apollo-router-797f5b9d59-lftxg 5m 164Mi
apollo-router-797f5b9d59-qd8h8 6m 131Mi
apollo-router-797f5b9d59-qdxg6 6m 573Mi
apollo-router-797f5b9d59-qxgjz 6m 514Mi
apollo-router-797f5b9d59-r947q 6m 137Mi
apollo-router-797f5b9d59-t8sfd 5m 145Mi
apollo-router-797f5b9d59-tn46x 6m 166Mi
apollo-router-797f5b9d59-v9b4p 6m 145Mi
apollo-router-797f5b9d59-vbp5k 5m 138Mi
apollo-router-797f5b9d59-w7952 5m 671Mi
apollo-router-797f5b9d59-wm2s9 5m 143Mi
apollo-router-797f5b9d59-wmqr8 6m 145Mi
apollo-router-797f5b9d59-wmzxs 6m 148Mi
apollo-router-797f5b9d59-wv2nf 6m 307Mi
apollo-router-797f5b9d59-z6dns 5m 142Mi
apollo-router-797f5b9d59-zxkgf 5m 443Mi
All pods get roughly the same request rate but we have some larger requests which happen periodically. As you can see some pods are using substantially more memory than others. We are at low traffic now and so I'll monitor tomorrow morning and throughout the day. I'll also pull some graphs for overall memory consumption across 24 hours and longer.
Here's the memory trend of all pods added together since the restart event. The red line is the total memory limit, the green is the request and the blue is the actual used. You can see the gradual climb. As I mentioned earlier our traffic volume at this time is low and fairly consistent.
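For anyone who wants to reproduce the blue line ad hoc, a rough way to sum current usage across the namespace (a quick sanity check, not our actual charting, which comes from our monitoring stack):

# Sum the MEMORY(bytes) column from kubectl top; awk's numeric coercion drops the "Mi" suffix
kubectl top pods -n apollo-router --no-headers | awk '{sum += $3} END {printf "%d Mi total\n", sum}'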
Does this happen at the same rate in 1.10.3? What is the Rhai script doing?
Does this happen at the same rate in 1.10.3?
Unfortunately, we don't have the data on this.
We do push a lot of subgraphs changes out during the day for subgraphs which get picked up by managed federation, so we'll also be able to see what impact that has throughout the day today.
What is the Rhai script doing?
Here is the Rhai script copied verbatim:
// vendored from https://www.apollographql.com/docs/router/configuration/header-propagation/#response-header-propagation
fn supergraph_service(service) {
    let add_cookies_to_response = |response| {
        if response.context["set_cookie_headers"]?.len > 0 {
            let cookie = "";
            for header in response.context["set_cookie_headers"] {
                cookie += header + "; ";
            }
            response.headers["set-cookie"] = cookie;
        }
        response.headers["rhai"] = "true";
    };
    service.map_response(add_cookies_to_response);
}

fn subgraph_service(service, subgraph) {
    let store_cookies_from_subgraphs = |response| {
        if response.headers.values("set-cookie")?.len > 0 {
            if response.context["set_cookie_headers"] == () {
                response.context.set_cookie_headers = []
            }
            response.context.set_cookie_headers += response.headers.values("set-cookie");
        }
    };
    service.map_response(store_cookies_from_subgraphs);
}
Memory update from this morning from before traffic ramp up:
date
Mon 27 Feb 2023 08:00:02 EST
kubectl top pods -n apollo-router
NAME CPU(cores) MEMORY(bytes)
apollo-router-797f5b9d59-2gphr 16m 160Mi
apollo-router-797f5b9d59-5v554 13m 183Mi
apollo-router-797f5b9d59-6tl7b 14m 146Mi
apollo-router-797f5b9d59-76jzd 10m 161Mi
apollo-router-797f5b9d59-7hhqh 10m 156Mi
apollo-router-797f5b9d59-88822 14m 150Mi
apollo-router-797f5b9d59-9wlrq 10m 168Mi
apollo-router-797f5b9d59-bkkxv 8m 173Mi
apollo-router-797f5b9d59-g2fs5 13m 165Mi
apollo-router-797f5b9d59-g9cv7 10m 338Mi
apollo-router-797f5b9d59-gz92w 9m 253Mi
apollo-router-797f5b9d59-hjnpg 9m 660Mi
apollo-router-797f5b9d59-jqxtw 11m 152Mi
apollo-router-797f5b9d59-km6wk 13m 188Mi
apollo-router-797f5b9d59-lftxg 18m 170Mi
apollo-router-797f5b9d59-qd8h8 13m 473Mi
apollo-router-797f5b9d59-qdxg6 14m 194Mi
apollo-router-797f5b9d59-qxgjz 13m 309Mi
apollo-router-797f5b9d59-r947q 13m 155Mi
apollo-router-797f5b9d59-t8sfd 16m 172Mi
apollo-router-797f5b9d59-tn46x 13m 175Mi
apollo-router-797f5b9d59-v9b4p 10m 149Mi
apollo-router-797f5b9d59-vbp5k 10m 160Mi
apollo-router-797f5b9d59-w7952 9m 146Mi
apollo-router-797f5b9d59-wm2s9 14m 151Mi
apollo-router-797f5b9d59-wmqr8 9m 175Mi
apollo-router-797f5b9d59-wmzxs 14m 570Mi
apollo-router-797f5b9d59-wv2nf 13m 317Mi
apollo-router-797f5b9d59-z6dns 13m 465Mi
apollo-router-797f5b9d59-zxkgf 10m 454Mi
Another snapshot of memory as we saw some traffic ramp up:
date
Mon 27 Feb 2023 09:41:34 EST
kubectl top pods -n apollo-router
NAME CPU(cores) MEMORY(bytes)
apollo-router-797f5b9d59-2gphr 63m 259Mi
apollo-router-797f5b9d59-5v554 49m 262Mi
apollo-router-797f5b9d59-6tl7b 48m 268Mi
apollo-router-797f5b9d59-76jzd 93m 342Mi
apollo-router-797f5b9d59-7hhqh 99m 492Mi
apollo-router-797f5b9d59-88822 119m 207Mi
apollo-router-797f5b9d59-9wlrq 49m 216Mi
apollo-router-797f5b9d59-bkkxv 67m 470Mi
apollo-router-797f5b9d59-g2fs5 67m 493Mi
apollo-router-797f5b9d59-g9cv7 27m 360Mi
apollo-router-797f5b9d59-gz92w 52m 282Mi
apollo-router-797f5b9d59-hjnpg 108m 676Mi
apollo-router-797f5b9d59-jqxtw 134m 260Mi
apollo-router-797f5b9d59-km6wk 80m 230Mi
apollo-router-797f5b9d59-lftxg 125m 206Mi
apollo-router-797f5b9d59-qd8h8 27m 365Mi
apollo-router-797f5b9d59-qdxg6 103m 226Mi
apollo-router-797f5b9d59-qxgjz 89m 331Mi
apollo-router-797f5b9d59-r947q 31m 254Mi
apollo-router-797f5b9d59-t8sfd 60m 318Mi
apollo-router-797f5b9d59-tn46x 107m 263Mi
apollo-router-797f5b9d59-v9b4p 74m 275Mi
apollo-router-797f5b9d59-vbp5k 31m 483Mi
apollo-router-797f5b9d59-w7952 104m 249Mi
apollo-router-797f5b9d59-wm2s9 22m 210Mi
apollo-router-797f5b9d59-wmqr8 34m 220Mi
apollo-router-797f5b9d59-wmzxs 80m 597Mi
apollo-router-797f5b9d59-wv2nf 34m 338Mi
apollo-router-797f5b9d59-z6dns 67m 486Mi
apollo-router-797f5b9d59-zxkgf 27m 473Mi
There was a blip in Google Cloud Platform's observability stack this morning, and that is the blip you see in the chart (it is not significant and can be ignored):
I can confirm that there seems to be another memory leak in the router (1.11.0). It's not as bad as the previous one (with tracing), but it is not negligible.
Memory seems to grow by about 1.5GB in ~4 days, so roughly 400MB/day.
I assume the main leak is still related to tracing, as the graphs looked better before it was enabled on the 22nd of February (though there was still a leak back then).
Here's our current state:
date
Mon 27 Feb 2023 22:23:15 EST
kubectl top pods -n apollo-router
NAME CPU(cores) MEMORY(bytes)
apollo-router-797f5b9d59-2gphr 19m 888Mi
apollo-router-797f5b9d59-6lb7d 28m 439Mi
apollo-router-797f5b9d59-76jzd 24m 735Mi
apollo-router-797f5b9d59-7hhqh 17m 664Mi
apollo-router-797f5b9d59-88822 18m 602Mi
apollo-router-797f5b9d59-8mmfg 19m 439Mi
apollo-router-797f5b9d59-9rtcv 14m 667Mi
apollo-router-797f5b9d59-9wlrq 14m 674Mi
apollo-router-797f5b9d59-bkm8q 18m 776Mi
apollo-router-797f5b9d59-bxvbj 15m 678Mi
apollo-router-797f5b9d59-cmxcl 16m 713Mi
apollo-router-797f5b9d59-cxtkm 18m 424Mi
apollo-router-797f5b9d59-g2fs5 16m 692Mi
apollo-router-797f5b9d59-g9cv7 12m 703Mi
apollo-router-797f5b9d59-gfzcm 18m 577Mi
apollo-router-797f5b9d59-hjnpg 20m 837Mi
apollo-router-797f5b9d59-jqxtw 21m 845Mi
apollo-router-797f5b9d59-km6wk 19m 879Mi
apollo-router-797f5b9d59-lftxg 14m 803Mi
apollo-router-797f5b9d59-pb7pd 18m 784Mi
apollo-router-797f5b9d59-plpjw 19m 444Mi
apollo-router-797f5b9d59-r947q 17m 801Mi
apollo-router-797f5b9d59-t8sfd 17m 737Mi
apollo-router-797f5b9d59-tn46x 18m 937Mi
apollo-router-797f5b9d59-vbp5k 15m 657Mi
apollo-router-797f5b9d59-w7952 12m 677Mi
apollo-router-797f5b9d59-wmzxs 16m 706Mi
apollo-router-797f5b9d59-wv2nf 15m 877Mi
apollo-router-797f5b9d59-z6dns 17m 707Mi
apollo-router-797f5b9d59-zxkgf 19m 604Mi
You can see that we are at > 20GiB of total used memory and climbing. We'll be in OOMKilled territory tomorrow; I'll let this run until we see the first signs of it and then we'll have to restart. As a temporary precaution, we are considering a cron-based restart on something like a 6-hour cadence while we work with you on finding the issue.
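For the cron-based restart we would likely just run the same rollout restart on a schedule; a sketch (the 6-hour cadence is the figure mentioned above, and this assumes a host or job with kubectl access to the cluster):

# crontab entry: restart the router deployment every 6 hours
0 */6 * * * kubectl rollout restart deployment/apollo-router -n apollo-router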
Is there anything we can do to dump memory usage by an object or anything like that? I'm assuming that it would be pretty evident what is consuming all the memory if this were possible. We do have the ability to run canary pods with slightly different configurations should that be helpful for troubleshooting.
I found https://medium.com/lumen-engineering-blog/tutorial-profiling-cpu-and-ram-usage-of-rust-micro-services-running-on-kubernetes-fbc32714da93 which is a great article. I was hoping that there was a Docker image built with debug symbols but I didn't find one. While I know it would be easy to build one it would be awesome if there was one readily available so we could run CPU and memory profiles easily.
Is any of this useful: https://github.com/apollographql/router/blob/dev/CHANGELOG.md#router-debug-docker-images-now-run-under-the-control-of-heaptrack-issue-2135 ?
We don't have debug symbols enabled, but the information is still useful. I did mention in that PR that "at some point" we could maybe consider enabling debug symbols for our debug builds. Maybe now is the right time?
Debug symbols only increase the size of the build and don't affect the performance of the profile, and there is value in the information that they provide. So +1 to enabling them for debug builds. If you wanted to have debug builds with and without symbols, you could have images of the form ghcr.io/apollographql/router:<image version>-debug and ghcr.io/apollographql/router:<image version>-debug-symbols.
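If we go down that road, this is roughly how I would expect to use the heaptrack-wrapped debug image from that changelog entry (the image tag, container name, and output file location below are assumptions on my part; heaptrack normally writes heaptrack.<binary>.<pid>.gz in its working directory):

# Point a canary deployment at the -debug image so the router runs under heaptrack
kubectl set image deployment/apollo-router router=ghcr.io/apollographql/router:v1.11.0-debug -n apollo-router
# After letting it soak, locate the heaptrack output, copy it off the pod, and inspect it locally
kubectl exec -n apollo-router <pod-name> -- sh -c 'ls heaptrack.*'
kubectl cp apollo-router/<pod-name>:heaptrack.router.<pid>.gz ./heaptrack.router.gz
heaptrack_print ./heaptrack.router.gz | less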
An update on this: I've tracked it down to an issue with the runtime we use for query planning; we're looking at ways to fix it now.
Any theories on how to fix this yet? We are basically cron-restarting the Router at this time.
The fix is coming: https://github.com/apollographql/federation-rs/pull/259
The problem comes from creating a new runtime for every schema, which leaks memory. This fix reuses the same runtime, but we had to make sure it supported planning queries for two schemas at the same time (during the update).
https://github.com/apollographql/federation-rs/pull/259 has merged. I'm now going to watch for when this makes it into a Router release so we can update.
@marc-barry @Meemaw could you try with https://github.com/apollographql/router/pull/2706? The new router-bridge version was not enough; it needed some changes on the router side.
@Geal we deploy this with your Helm Chart. How would we test this specific PR using the Helm setup? I assume an image gets built and placed somewhere that we are able to pick up through values configuration.
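For context, this is roughly how we would point the chart at a branch-built image once one exists (the chart OCI path and values keys here are our assumptions based on common chart conventions; adjust to the actual chart):

helm upgrade apollo-router oci://ghcr.io/apollographql/helm-charts/router \
  -n apollo-router \
  --reuse-values \
  --set image.repository=ghcr.io/apollographql/router \
  --set image.tag=<branch-build-tag>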
@garypen was there a way to use the helm chart with a specific branch?
There isn't. It works against a docker image. I think if you cut a release from a branch, and that produces a docker image, then that would work.
Is it possible to simply get #2706 merged into the mainline? Or is it that you wanted us to test it first before doing so? The changes look generally useful and beneficial for all users.
@marc-barry We were looking for help testing it. The changes are definitely both generally useful and beneficial for all users, but the changes themselves are a bit more than trivial, so we were looking for as much validation as we could get to make sure they were solid.
That said, we're now at the point where we're comfortable putting this into a release, based in part on testing that was afforded to us by folks who were able to run from the branch.
We're looking to get this into a release today or tomorrow.
We are running https://github.com/apollographql/router/releases/tag/v1.13.1 with a reasonably high load now. I will report back in a day or two on the results.
@marc-barry any update?
OK. We needed to collect a few days' worth of data, and during those few days we also had to redeploy the Router as we made some changes to its configuration. So far the Router has been up for multiple days and we see no evidence of the ever-increasing memory issue we saw before. Even better, we see memory reduce a little at some points during low-volume periods.
Thanks for addressing this issue, and I'll report back if we see any regression, as we have alerts set up on memory usage and growth rates.
We managed to run an entire week without an OOM from Apollo Router. But if you look at the following you will see that there is still memory growth. For our setup, it seems that about a week is what it takes to hit our defined limits. I just deployed https://github.com/apollographql/router/releases/tag/v1.14.0 and will watch that one. Is there a ticket or plan to tackle the remaining memory growth issues that I can make reports towards?
I wonder if https://github.com/apollographql/router/pull/2882 will have any impact.
It should. The router can get high memory fragmentation, and jemalloc is good at dealing with it. We've seen memory usage drop and OOMs disappear in some deployments. You should test it from this branch, which also brings good performance gains: https://github.com/apollographql/router/pull/2995
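In case it helps anyone else who wants to test before a release, a sketch of how we would try that PR's branch directly (fetching a PR head by number is standard GitHub behaviour; the config file and managed-federation env vars are from our own setup):

# Build the router from the PR branch and run it against our existing configuration
git clone https://github.com/apollographql/router && cd router
git fetch origin pull/2995/head:pr-2995 && git checkout pr-2995
cargo build --release
APOLLO_KEY=<key> APOLLO_GRAPH_REF=<graph-ref> ./target/release/router --config router.yaml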
Describe the bug
We see Apollo Router grow in memory usage over a few days until it finally gets OOM killed. The memory keeps increasing over time little by little.
To Reproduce
Steps to reproduce the behavior:
We are running Apollo Router with the following configuration:
Note that the origins for CORS have been adjusted to remove our domains. We restarted all pods this morning as they were close to being OOM killed, but this is what a 6-hour window of memory usage looks like:
The memory will continue to grow until it hits the 1.5GiB limit and is finally killed by the underlying system.
Expected behavior
We expect Apollo Router to hit a steady state of memory usage, but instead it increases linearly until it runs out of memory.
Additional context
We are running this on Kubernetes and we currently have 20 pods serving traffic. The CPU in the time window looks like the following; the spikes are due to the managed federation updates when we update the subgraphs in Apollo Studio's system. Hot reload is disabled.