apollographql / router

A configurable, high-performance routing runtime for Apollo Federation πŸš€
https://www.apollographql.com/docs/router/

Suspect a memory leak in Apollo Router #2690

Closed marc-barry closed 1 year ago

marc-barry commented 1 year ago

Describe the bug

We see Apollo Router grow in memory usage over a few days until it finally gets OOM killed. The memory keeps increasing over time little by little.

To Reproduce

Steps to reproduce the behavior:

We are running Apollo Router with the following configuration:

cors:
  allow_credentials: true
  origins:
  - http://localhost
headers:
  all:
    request:
    - propagate:
        matching: .*
health_check:
  listen: 0.0.0.0:8088
include_subgraph_errors:
  all: true
rhai:
  main: main.rhai
  scripts: /dist/rhai
supergraph:
  listen: 0.0.0.0:80
  path: /graphql
telemetry:
  metrics:
    common:
      resources:
        service.name: apollo-router
    prometheus:
      enabled: true
      listen: 0.0.0.0:9090
      path: /metrics
traffic_shaping:
  all:
    compression: gzip
    deduplicate_query: false
    timeout: 95s
  deduplicate_variables: false
  router:
    timeout: 95s

Note that the origins for CORS have been adjusted to remove our domains.

We restarted all pods this morning as they were close to being OOM killed, but this is what a 6-hour window of memory usage looks like:

[chart: memory usage across the router pods over the 6-hour window]

The memory will continue to grow until it hits the 1.5GiB limit and is finally killed by the underlying system.

Expected behavior

We expect Apollo Router to reach a steady state of memory usage, but instead it increases linearly until it runs out of memory.

Additional context

We are running this on Kubernetes and we currently have 20 pods serving traffic. The CPU in the time window looks like the following:

[chart: CPU usage across the router pods over the same window]

The spikes are due to the managed federation updates when we update the subgraphs in Apollo Studio's system. Hot reload is disabled.

marc-barry commented 1 year ago

We are tracking the memory usage over the weekend (which is a low-traffic time). Here is a point-in-time snapshot from kubectl top pods:

kubectl top pods -n apollo-router
NAME                             CPU(cores)   MEMORY(bytes)
apollo-router-7fd44dcc9b-27zcg   11m          711Mi
apollo-router-7fd44dcc9b-2fxkc   16m          740Mi
apollo-router-7fd44dcc9b-4tgp5   12m          789Mi
apollo-router-7fd44dcc9b-6xbbd   14m          793Mi
apollo-router-7fd44dcc9b-757nf   17m          762Mi
apollo-router-7fd44dcc9b-7x55r   15m          465Mi
apollo-router-7fd44dcc9b-8q8bd   12m          510Mi
apollo-router-7fd44dcc9b-96whd   12m          284Mi
apollo-router-7fd44dcc9b-99krm   12m          346Mi
apollo-router-7fd44dcc9b-btchx   14m          464Mi
apollo-router-7fd44dcc9b-fvmr6   18m          693Mi
apollo-router-7fd44dcc9b-gf8qv   13m          463Mi
apollo-router-7fd44dcc9b-gj55t   17m          396Mi
apollo-router-7fd44dcc9b-jps72   11m          745Mi
apollo-router-7fd44dcc9b-kjqg8   13m          733Mi
apollo-router-7fd44dcc9b-kvjfr   14m          681Mi
apollo-router-7fd44dcc9b-l9dxb   14m          356Mi
apollo-router-7fd44dcc9b-mqxtp   12m          773Mi
apollo-router-7fd44dcc9b-mwx5m   14m          640Mi
apollo-router-7fd44dcc9b-ncmcl   14m          781Mi
apollo-router-7fd44dcc9b-nj5l7   13m          546Mi
apollo-router-7fd44dcc9b-pchls   10m          669Mi
apollo-router-7fd44dcc9b-pmzs5   15m          473Mi
apollo-router-7fd44dcc9b-pqsxx   14m          520Mi
apollo-router-7fd44dcc9b-t5nhd   15m          803Mi
apollo-router-7fd44dcc9b-vkrlv   17m          747Mi
apollo-router-7fd44dcc9b-vpkhq   13m          776Mi
apollo-router-7fd44dcc9b-vxj46   14m          754Mi
apollo-router-7fd44dcc9b-wg6cz   13m          689Mi
apollo-router-7fd44dcc9b-z7rfj   13m          604Mi

I think it is useful to see the scale of the memory usage we are seeing. Our resource limits are:

resources:
  limits:
    cpu: 1000m
    memory: 1500Mi
  requests:
    cpu: 100m
    memory: 350Mi

We do get OOMKilled termination reasons for pods, which means they did climb above 1500Mi.
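
For completeness, the OOMKilled termination reason can be confirmed with a generic kubectl query along these lines (nothing router-specific here):

# Print each pod's last terminated reason; OOMKilled pods show up explicitly.
kubectl get pods -n apollo-router \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].lastState.terminated.reason}{"\n"}{end}'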

Geal commented 1 year ago

Do you have an estimate of the leak rate? Is it proportional to traffic, or independent of it? Is it something you have seen in previous router versions? We fixed a telemetry-related leak in 1.11; this might be another one.

marc-barry commented 1 year ago

We made quite a few changes to try to isolate the issue. We were using OpenTelemetry tracing with https://github.com/apollographql/router/releases/tag/v1.10.0, but we suspected something was wrong with the tracing, so we disabled it. We then updated to https://github.com/apollographql/router/releases/tag/v1.11.0 as we thought there might be some improvements (but we have left tracing disabled).

Do you have an estimation of the leak rate?

I'll try and get some better data on this. I'll restart the pods and then look at their base memory and we'll start from there.

Is it proportional to traffic or independent?

It appears to be proportional to traffic volume.

Is it something you have seen in previous router versions?

Yes, we are confident we have seen this in https://github.com/apollographql/router/releases/tag/v1.10.0, but we didn't collect good enough metrics before that to know whether earlier releases exhibited the same behaviour.

marc-barry commented 1 year ago

Here is the memory usage for all 30 pods after a kubectl rollout restart deployment/apollo-router -n apollo-router. I'll keep collecting so we can observe the change.

kubectl top pods -n apollo-router
NAME                             CPU(cores)   MEMORY(bytes)
apollo-router-797f5b9d59-2gphr   15m          96Mi
apollo-router-797f5b9d59-5v554   6m           86Mi
apollo-router-797f5b9d59-6tl7b   7m           84Mi
apollo-router-797f5b9d59-76jzd   6m           86Mi
apollo-router-797f5b9d59-7hhqh   6m           95Mi
apollo-router-797f5b9d59-88822   6m           89Mi
apollo-router-797f5b9d59-9wlrq   5m           96Mi
apollo-router-797f5b9d59-bkkxv   5m           89Mi
apollo-router-797f5b9d59-g2fs5   4m           84Mi
apollo-router-797f5b9d59-g9cv7   11m          96Mi
apollo-router-797f5b9d59-gz92w   5m           84Mi
apollo-router-797f5b9d59-hjnpg   6m           87Mi
apollo-router-797f5b9d59-jqxtw   5m           97Mi
apollo-router-797f5b9d59-km6wk   5m           94Mi
apollo-router-797f5b9d59-lftxg   5m           83Mi
apollo-router-797f5b9d59-qd8h8   5m           88Mi
apollo-router-797f5b9d59-qdxg6   7m           97Mi
apollo-router-797f5b9d59-qxgjz   9m           95Mi
apollo-router-797f5b9d59-r947q   6m           89Mi
apollo-router-797f5b9d59-t8sfd   7m           88Mi
apollo-router-797f5b9d59-tn46x   6m           93Mi
apollo-router-797f5b9d59-v9b4p   6m           86Mi
apollo-router-797f5b9d59-vbp5k   5m           85Mi
apollo-router-797f5b9d59-w7952   7m           88Mi
apollo-router-797f5b9d59-wm2s9   7m           91Mi
apollo-router-797f5b9d59-wmqr8   37m          92Mi
apollo-router-797f5b9d59-wmzxs   6m           93Mi
apollo-router-797f5b9d59-wv2nf   6m           83Mi
apollo-router-797f5b9d59-z6dns   25m          99Mi
apollo-router-797f5b9d59-zxkgf   5m           87Mi
marc-barry commented 1 year ago

After about 4.5 hours this is the memory consumption now:

date
Sun 26 Feb 2023 22:37:58 EST
kubectl top pods -n apollo-router
NAME                             CPU(cores)   MEMORY(bytes)
apollo-router-797f5b9d59-2gphr   6m           150Mi
apollo-router-797f5b9d59-5v554   5m           179Mi
apollo-router-797f5b9d59-6tl7b   5m           137Mi
apollo-router-797f5b9d59-76jzd   4m           483Mi
apollo-router-797f5b9d59-7hhqh   7m           133Mi
apollo-router-797f5b9d59-88822   6m           147Mi
apollo-router-797f5b9d59-9wlrq   5m           128Mi
apollo-router-797f5b9d59-bkkxv   6m           157Mi
apollo-router-797f5b9d59-g2fs5   6m           139Mi
apollo-router-797f5b9d59-g9cv7   7m           647Mi
apollo-router-797f5b9d59-gz92w   5m           669Mi
apollo-router-797f5b9d59-hjnpg   6m           139Mi
apollo-router-797f5b9d59-jqxtw   6m           132Mi
apollo-router-797f5b9d59-km6wk   6m           136Mi
apollo-router-797f5b9d59-lftxg   5m           164Mi
apollo-router-797f5b9d59-qd8h8   6m           131Mi
apollo-router-797f5b9d59-qdxg6   6m           573Mi
apollo-router-797f5b9d59-qxgjz   6m           514Mi
apollo-router-797f5b9d59-r947q   6m           137Mi
apollo-router-797f5b9d59-t8sfd   5m           145Mi
apollo-router-797f5b9d59-tn46x   6m           166Mi
apollo-router-797f5b9d59-v9b4p   6m           145Mi
apollo-router-797f5b9d59-vbp5k   5m           138Mi
apollo-router-797f5b9d59-w7952   5m           671Mi
apollo-router-797f5b9d59-wm2s9   5m           143Mi
apollo-router-797f5b9d59-wmqr8   6m           145Mi
apollo-router-797f5b9d59-wmzxs   6m           148Mi
apollo-router-797f5b9d59-wv2nf   6m           307Mi
apollo-router-797f5b9d59-z6dns   5m           142Mi
apollo-router-797f5b9d59-zxkgf   5m           443Mi

All pods get roughly the same request rate, but we have some larger requests that happen periodically. As you can see, some pods are using substantially more memory than others. We are at low traffic now, so I'll monitor tomorrow morning and throughout the day. I'll also pull some graphs of overall memory consumption across 24 hours and longer.

marc-barry commented 1 year ago

Here's the memory trend of all pods added together since the restart event. The red line is the total memory limit, the green is the request, and the blue is actual usage. You can see the gradual climb. As I mentioned earlier, our traffic volume at this time is low and fairly consistent.

[chart: total memory across all pods since the restart, against the limit and request lines]

Geal commented 1 year ago

Does this happen at the same rate in 1.10.3? What is the Rhai script doing?

marc-barry commented 1 year ago

does this happen at the same rate in 1.10.3?

Unfortunately, we don't have the data on this.

We do push a lot of subgraph changes out during the day, which get picked up by managed federation, so we'll also be able to see what impact that has throughout the day today.

What is the rhai script doing?

Here is the Rhai script copied verbatim:

// vendored from https://www.apollographql.com/docs/router/configuration/header-propagation/#response-header-propagation

fn supergraph_service(service) {
  let add_cookies_to_response = |response| {
    if response.context["set_cookie_headers"]?.len > 0 {
      let cookie = "";
      for header in response.context["set_cookie_headers"] {
        cookie += header + "; ";
      }
      response.headers["set-cookie"] = cookie;
    }
    response.headers["rhai"] = "true";
  };

  service.map_response(add_cookies_to_response);
}

fn subgraph_service(service, subgraph) {
  let store_cookies_from_subgraphs = |response| {
    if response.headers.values("set-cookie")?.len > 0 {
      if response.context["set_cookie_headers"] == () {
        response.context.set_cookie_headers = []
      }

      response.context.set_cookie_headers += response.headers.values("set-cookie");
    }
  };

  service.map_response(store_cookies_from_subgraphs);
}
marc-barry commented 1 year ago

Memory update from this morning, before traffic ramps up:

date
Mon 27 Feb 2023 08:00:02 EST
kubectl top pods -n apollo-router
NAME                             CPU(cores)   MEMORY(bytes)
apollo-router-797f5b9d59-2gphr   16m          160Mi
apollo-router-797f5b9d59-5v554   13m          183Mi
apollo-router-797f5b9d59-6tl7b   14m          146Mi
apollo-router-797f5b9d59-76jzd   10m          161Mi
apollo-router-797f5b9d59-7hhqh   10m          156Mi
apollo-router-797f5b9d59-88822   14m          150Mi
apollo-router-797f5b9d59-9wlrq   10m          168Mi
apollo-router-797f5b9d59-bkkxv   8m           173Mi
apollo-router-797f5b9d59-g2fs5   13m          165Mi
apollo-router-797f5b9d59-g9cv7   10m          338Mi
apollo-router-797f5b9d59-gz92w   9m           253Mi
apollo-router-797f5b9d59-hjnpg   9m           660Mi
apollo-router-797f5b9d59-jqxtw   11m          152Mi
apollo-router-797f5b9d59-km6wk   13m          188Mi
apollo-router-797f5b9d59-lftxg   18m          170Mi
apollo-router-797f5b9d59-qd8h8   13m          473Mi
apollo-router-797f5b9d59-qdxg6   14m          194Mi
apollo-router-797f5b9d59-qxgjz   13m          309Mi
apollo-router-797f5b9d59-r947q   13m          155Mi
apollo-router-797f5b9d59-t8sfd   16m          172Mi
apollo-router-797f5b9d59-tn46x   13m          175Mi
apollo-router-797f5b9d59-v9b4p   10m          149Mi
apollo-router-797f5b9d59-vbp5k   10m          160Mi
apollo-router-797f5b9d59-w7952   9m           146Mi
apollo-router-797f5b9d59-wm2s9   14m          151Mi
apollo-router-797f5b9d59-wmqr8   9m           175Mi
apollo-router-797f5b9d59-wmzxs   14m          570Mi
apollo-router-797f5b9d59-wv2nf   13m          317Mi
apollo-router-797f5b9d59-z6dns   13m          465Mi
apollo-router-797f5b9d59-zxkgf   10m          454Mi

Another snapshot of memory as we saw some traffic ramp up:

date
Mon 27 Feb 2023 09:41:34 EST
kubectl top pods -n apollo-router
NAME                             CPU(cores)   MEMORY(bytes)
apollo-router-797f5b9d59-2gphr   63m          259Mi
apollo-router-797f5b9d59-5v554   49m          262Mi
apollo-router-797f5b9d59-6tl7b   48m          268Mi
apollo-router-797f5b9d59-76jzd   93m          342Mi
apollo-router-797f5b9d59-7hhqh   99m          492Mi
apollo-router-797f5b9d59-88822   119m         207Mi
apollo-router-797f5b9d59-9wlrq   49m          216Mi
apollo-router-797f5b9d59-bkkxv   67m          470Mi
apollo-router-797f5b9d59-g2fs5   67m          493Mi
apollo-router-797f5b9d59-g9cv7   27m          360Mi
apollo-router-797f5b9d59-gz92w   52m          282Mi
apollo-router-797f5b9d59-hjnpg   108m         676Mi
apollo-router-797f5b9d59-jqxtw   134m         260Mi
apollo-router-797f5b9d59-km6wk   80m          230Mi
apollo-router-797f5b9d59-lftxg   125m         206Mi
apollo-router-797f5b9d59-qd8h8   27m          365Mi
apollo-router-797f5b9d59-qdxg6   103m         226Mi
apollo-router-797f5b9d59-qxgjz   89m          331Mi
apollo-router-797f5b9d59-r947q   31m          254Mi
apollo-router-797f5b9d59-t8sfd   60m          318Mi
apollo-router-797f5b9d59-tn46x   107m         263Mi
apollo-router-797f5b9d59-v9b4p   74m          275Mi
apollo-router-797f5b9d59-vbp5k   31m          483Mi
apollo-router-797f5b9d59-w7952   104m         249Mi
apollo-router-797f5b9d59-wm2s9   22m          210Mi
apollo-router-797f5b9d59-wmqr8   34m          220Mi
apollo-router-797f5b9d59-wmzxs   80m          597Mi
apollo-router-797f5b9d59-wv2nf   34m          338Mi
apollo-router-797f5b9d59-z6dns   67m          486Mi
apollo-router-797f5b9d59-zxkgf   27m          473Mi

There was a blip in Google Cloud Platform's observability stack this morning; that is the blip you see in the chart, and it can be ignored as anything significant:

[chart: total memory across all pods, with the observability blip visible]

Meemaw commented 1 year ago

I can confirm that there seems to be another memory leak in the router (1.11.0). It's not as bad as the previous one (with tracing), but it's not negligible.

Seems like memory grows by 1.5GB in ~4 days, so ~400MB / day.

[chart: router memory growing by roughly 1.5GB over ~4 days on 1.11.0]

I assume the main leak is still related to tracing, as the graphs looked better before it was enabled on the 22nd of February (though there was still a leak back then).

[chart: memory trend before and after tracing was enabled on February 22]
marc-barry commented 1 year ago

Here's our current state:

date
Mon 27 Feb 2023 22:23:15 EST
kubectl top pods -n apollo-router
NAME                             CPU(cores)   MEMORY(bytes)
apollo-router-797f5b9d59-2gphr   19m          888Mi
apollo-router-797f5b9d59-6lb7d   28m          439Mi
apollo-router-797f5b9d59-76jzd   24m          735Mi
apollo-router-797f5b9d59-7hhqh   17m          664Mi
apollo-router-797f5b9d59-88822   18m          602Mi
apollo-router-797f5b9d59-8mmfg   19m          439Mi
apollo-router-797f5b9d59-9rtcv   14m          667Mi
apollo-router-797f5b9d59-9wlrq   14m          674Mi
apollo-router-797f5b9d59-bkm8q   18m          776Mi
apollo-router-797f5b9d59-bxvbj   15m          678Mi
apollo-router-797f5b9d59-cmxcl   16m          713Mi
apollo-router-797f5b9d59-cxtkm   18m          424Mi
apollo-router-797f5b9d59-g2fs5   16m          692Mi
apollo-router-797f5b9d59-g9cv7   12m          703Mi
apollo-router-797f5b9d59-gfzcm   18m          577Mi
apollo-router-797f5b9d59-hjnpg   20m          837Mi
apollo-router-797f5b9d59-jqxtw   21m          845Mi
apollo-router-797f5b9d59-km6wk   19m          879Mi
apollo-router-797f5b9d59-lftxg   14m          803Mi
apollo-router-797f5b9d59-pb7pd   18m          784Mi
apollo-router-797f5b9d59-plpjw   19m          444Mi
apollo-router-797f5b9d59-r947q   17m          801Mi
apollo-router-797f5b9d59-t8sfd   17m          737Mi
apollo-router-797f5b9d59-tn46x   18m          937Mi
apollo-router-797f5b9d59-vbp5k   15m          657Mi
apollo-router-797f5b9d59-w7952   12m          677Mi
apollo-router-797f5b9d59-wmzxs   16m          706Mi
apollo-router-797f5b9d59-wv2nf   15m          877Mi
apollo-router-797f5b9d59-z6dns   17m          707Mi
apollo-router-797f5b9d59-zxkgf   19m          604Mi

[chart: total memory across all pods, now above 20GiB and climbing]

You can see that we are at >20GiB of total used memory and climbing. We'll be in OOMKilled territory tomorrow; I'll let this run until we see the first signs of that, and then we'll have to restart. As a temporary precaution, we are considering a cron-based restart on something like a 6-hour cadence while we work with you on finding the issue.
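
A rough sketch of what that cron-based restart could look like (the names and schedule are illustrative, and it assumes a ServiceAccount bound to RBAC that allows restarting the deployment):

apiVersion: batch/v1
kind: CronJob
metadata:
  name: apollo-router-restart
  namespace: apollo-router
spec:
  schedule: "0 */6 * * *"   # every 6 hours
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: apollo-router-restart   # needs permission to patch the deployment
          restartPolicy: Never
          containers:
          - name: kubectl
            image: bitnami/kubectl:latest
            command:
            - kubectl
            - rollout
            - restart
            - deployment/apollo-router
            - --namespace=apollo-router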

Is there anything we can do to dump memory usage by object, or anything like that? I'm assuming it would be pretty evident what is consuming all the memory if that were possible. We do have the ability to run canary pods with slightly different configurations, should that be helpful for troubleshooting.

marc-barry commented 1 year ago

I found https://medium.com/lumen-engineering-blog/tutorial-profiling-cpu-and-ram-usage-of-rust-micro-services-running-on-kubernetes-fbc32714da93 which is a great article. I was hoping that there was a Docker image built with debug symbols but I didn't find one. While I know it would be easy to build one it would be awesome if there was one readily available so we could run CPU and memory profiles easily.

garypen commented 1 year ago

I found https://medium.com/lumen-engineering-blog/tutorial-profiling-cpu-and-ram-usage-of-rust-micro-services-running-on-kubernetes-fbc32714da93 which is a great article. I was hoping that there was a Docker image built with debug symbols but I didn't find one. While I know it would be easy to build one it would be awesome if there was one readily available so we could run CPU and memory profiles easily.

Is any of this useful: https://github.com/apollographql/router/blob/dev/CHANGELOG.md#router-debug-docker-images-now-run-under-the-control-of-heaptrack-issue-2135 ?

We don't have debug symbols enabled, but the information is still useful. I did mention in that PR that "at some point" we could maybe consider enabling debug symbols for our debug builds. Maybe now is the right time?

marc-barry commented 1 year ago

We don't have debug symbols enabled, but the information is still useful. I did mention in that PR that "at some point" we could maybe consider enabling debug symbols for our debug builds. Maybe now is the right time?

Debug symbols only increase the size of the build; they don't affect runtime performance, and there is value in the information they provide. So πŸ‘ from me on enabling them for debug builds. If you wanted debug builds both with and without symbols, you could publish images of the form ghcr.io/apollographql/router:<image version>-debug and ghcr.io/apollographql/router:<image version>-debug-symbols.

Geal commented 1 year ago

An update on this: I've tracked it down to an issue with the runtime we use for query planning. We're looking at ways to fix it now.

marc-barry commented 1 year ago

Any theories on how to fix this yet? We are basically cron-restarting the Router at this point.

Geal commented 1 year ago

The fix is coming: https://github.com/apollographql/federation-rs/pull/259

The problem comes from creating a new runtime for every schema, which leaks memory. The fix reuses a single runtime, but we had to make sure it supports planning queries for two schemas at the same time (during the update).
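
To make the shape of the change easier to picture, here is a purely hypothetical sketch (this is not the router-bridge code; the worker thread below just stands in for the planning runtime):

// Hypothetical illustration only, not the actual router-bridge implementation.
use std::sync::mpsc;
use std::thread;

// Stand-in for the query-planning runtime: a worker that plans against
// whatever schema it was most recently sent.
struct PlannerRuntime {
    tx: mpsc::Sender<String>,
}

impl PlannerRuntime {
    fn new() -> Self {
        let (tx, rx) = mpsc::channel::<String>();
        thread::spawn(move || {
            for schema in rx {
                // ... plan incoming queries against `schema` ...
                let _ = schema;
            }
        });
        PlannerRuntime { tx }
    }

    fn load_schema(&self, schema: &str) {
        self.tx.send(schema.to_string()).expect("planner runtime has shut down");
    }
}

fn main() {
    // Leaky pattern (before): a new PlannerRuntime was created for every
    // managed-federation update, and the old one was never fully reclaimed.
    // Fixed pattern (after): one long-lived runtime accepts each new schema,
    // and can hold the old and new schema while the swap happens.
    let runtime = PlannerRuntime::new();
    runtime.load_schema("supergraph schema v1");
    runtime.load_schema("supergraph schema v2"); // reuse instead of re-create
}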

marc-barry commented 1 year ago

https://github.com/apollographql/federation-rs/pull/259 merged. I'm now going to watch for when this makes it into a Router release so we can update.

Geal commented 1 year ago

@marc-barry @Meemaw could you try with https://github.com/apollographql/router/pull/2706? The new router-bridge version was not enough; it needed some changes on the router side.

marc-barry commented 1 year ago

@marc-barry @Meemaw could you try with #2706? The new router-bridge version was not enough; it needed some changes on the router side.

@Geal we deploy this with your Helm Chart. How would we test this specific PR using the Helm setup? I assume an image gets built and placed somewhere that we are able to pick up through values configuration.
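
For reference, this is roughly the override we'd expect to pass via values if a pullable image existed for that branch (assuming the chart exposes the usual image.repository/image.tag settings; the tag below is a placeholder, not a published one):

# values override for the router Helm chart (hypothetical tag)
image:
  repository: ghcr.io/apollographql/router
  tag: <branch-build-tag>   # placeholder until a branch image is published somewhere pullable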

Geal commented 1 year ago

@garypen was there a way to use the helm chart with a specific branch?

garypen commented 1 year ago

@garypen was there a way to use the helm chart with a specific branch?

There isn't; it's tied to a docker image. I think if you cut a release from a branch and that produces a docker image, then that would work.

marc-barry commented 1 year ago

@marc-barry @Meemaw could you try with #2706? The new router-bridge version was not enough; it needed some changes on the router side.

Is it possible to simply get #2706 merged into the mainline? Or did you want us to test it first before doing so? The changes look generally useful and beneficial for all users.

abernix commented 1 year ago

@marc-barry We were looking for help testing it. The changes are definitely both generally useful and beneficial for all users, but the changes themselves are a bit more than trivial, so we were looking for as much validation as we could get to make sure the improvement was tangible.

That said, we're now at the point where we're comfortable putting this into a release, based in part on testing that was afforded to us by folks who were able to run from the branch. πŸ˜„

We're looking to get this into a release today or tomorrow.

marc-barry commented 1 year ago

We are running https://github.com/apollographql/router/releases/tag/v1.13.1 with a reasonably high load now. I will report back in a day or two on the results.

Meemaw commented 1 year ago

@marc-barry any update?

marc-barry commented 1 year ago

OK. We needed to collect a few days' worth of data, and during those few days we also had to redeploy the Router as we made some changes to its configuration. So far the Router has been up for multiple days and we see no evidence of the ever-increasing memory issue we saw before. Even better, we see memory drop a little at some points during low-volume periods.

Thanks for addressing this issue. I'll report back if we see any regression, as we have alerts set up on memory usage and growth rates.
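
For anyone setting up something similar, a rough sketch of that kind of growth alert (the metric is the standard cAdvisor working-set gauge; the threshold mirrors our 1500Mi limit and the labels are illustrative):

groups:
- name: apollo-router-memory
  rules:
  - alert: ApolloRouterMemoryGrowth
    # Projects working-set memory 24h ahead from the last 6h of samples.
    expr: |
      predict_linear(container_memory_working_set_bytes{namespace="apollo-router"}[6h], 24 * 3600)
        > 1500 * 1024 * 1024
    for: 30m
    labels:
      severity: warning
    annotations:
      summary: "Apollo Router pod projected to exceed its memory limit within 24 hours"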

marc-barry commented 1 year ago

We managed to run an entire week without an OOM from Apollo Router, but if you look at the following you will see that there is still memory growth. For our setup, it seems about a week is what it takes to hit our defined limits. I just deployed https://github.com/apollographql/router/releases/tag/v1.14.0 and will watch that one. Is there a ticket or plan to tackle the remaining memory growth that I can report against?

[chart: memory over the past week, still trending upward]

marc-barry commented 1 year ago

I wonder if https://github.com/apollographql/router/pull/2882 will have any impact.

Geal commented 1 year ago

It should. The router can suffer from high memory fragmentation, and jemalloc is good at dealing with that. We've seen memory usage drop and OOMs disappear in some deployments. You should test it from this branch, which also brings good performance gains: https://github.com/apollographql/router/pull/2995
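
For illustration, this is the usual way a Rust binary opts into jemalloc as its global allocator via the tikv-jemallocator crate; it is a sketch of the technique, not necessarily the exact change in that PR:

// Sketch only; assumes a Cargo dependency on tikv-jemallocator.
#[cfg(target_os = "linux")]
use tikv_jemallocator::Jemalloc;

// With this in place, every heap allocation in the binary goes through jemalloc
// on Linux, which tends to cope better with fragmentation and return freed
// memory to the OS.
#[cfg(target_os = "linux")]
#[global_allocator]
static GLOBAL: Jemalloc = Jemalloc;

fn main() {
    let buffers: Vec<Vec<u8>> = (0..8).map(|_| vec![0u8; 64 * 1024]).collect();
    println!("allocated {} buffers via the configured allocator", buffers.len());
}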