1Password / connect

Access your 1Password secrets using a 1Password Connect Server
https://developer.1password.com/docs/connect

High memory usage/possible memory leak #23

Closed simoncolincap closed 1 week ago

simoncolincap commented 2 years ago

We use 1Password Connect and the operator with the Connect helm chart across several Kubernetes clusters and have noticed very high memory usage (up to 32GB on a large cluster). In our monitoring we can see that the memory consumption goes up steadily even when no new 1Password items are created. Could this be a memory leak?

1Password Connect version: 1.2.0
1Password operator version: 1.0.1
1Password Connect helm chart version: 1.4.0
Kubernetes versions: between 1.18.5 and 1.19.10

jpcoenen commented 2 years ago

Hi @simoncolincap,

Thank you for reaching out 🙌 What you describe definitely sounds like it's worth investigating.

To reproduce this locally, some extra information would be really helpful:

If you could try to answer these questions as well as possible, that would help us a great deal in locating the issue 😃

Joris

simoncolincap commented 2 years ago

Hi,

The memory consumption of both sync and api is increasing, with the api container using around twice the memory of the sync container. On most of our clusters we have fewer than 10 secrets that come from 1Password, and the memory increases steadily at around 10MB/day, but it will sometimes jump up by tens of MBs at once. If we restart the Connect server, memory usage is low again but still increases over time. The only requests to Connect come from the 1Password operator; in the logs there are usually between one and two requests per second.

The 32GB memory usage was on a larger cluster with many 1Password secrets (more than 1000, 1300 at the moment) that are being created and destroyed automatically (around 100 creations and destructions a day), but even then this seems like a lot of memory.

jpcoenen commented 2 years ago

Thank you for the extra details. That will definitely help us reproduce the issue.

I just noticed that you're not on the latest version of Connect and the operator. Could you try updating to the latest version of Connect (v1.5.0) so we can confirm whether the issue is still present?

simoncolincap commented 2 years ago

We upgraded the Connect Helm chart to 1.5.0, which upgraded the 1Password Operator to 1.1.0 and 1Password Connect to 1.5.0. The issue is still there, but it looks like the memory consumption has slowed down slightly; it's closer to 7MB per day now on the normal clusters. On our largest cluster the memory usage is still very high; it gains around 10GB a day.

jpcoenen commented 2 years ago

Hey! I wanted to let you know that we released v1.5.2 of Connect a while back, which addresses a memory issue we found.

Let me know if that fixes the problem you're seeing.

simoncolincap commented 2 years ago

Hi, we tried the new version of Connect. Before the update, the memory usage would end up between 6 and 7GB in a day; after the update, memory usage either goes up to 7GB or to around 3.8GB. We haven't found a pattern or a reason why it's higher/lower on some days.

tomjohnburton commented 2 years ago

I also raised this in #46

jpcoenen commented 2 years ago

Thanks for the updates. Quick update from our side: we recently found another potential cause.

While we're working on a fix, could you give an indication of how many requests per day these clusters are currently handling and by how much the memory consumption is increasing during that period? That could tell us whether our finding could explain the memory consumption you're both seeing.

tomjohnburton commented 2 years ago

We have around 14 secrets, polling at the default (every 15s)

Between 08/09/22 and 12/09/22 the:

Screenshot 2022-09-12 at 17 12 42

jpcoenen commented 2 years ago

Thank you for providing those extra details! 🙌

Although at this moment it is difficult to say if it is the only cause, I am convinced at least a share of it is caused by the issue we're investigating. We'll keep you posted on the progress.

tomjohnburton commented 1 year ago

Hey @jpcoenen, we had a 1Password pod restart due to 5GB memory usage. Just wondering if you had an update?

jpcoenen commented 1 year ago

I'm sorry for the delay, @tomjohnburton. We are still trying to get the problem we identified fixed. Unfortunately, the root cause lies in a library Connect uses and getting the fix in there is taking longer than expected.

We are going to look into forking that library to bypass the external approval required.

tomjohnburton commented 1 year ago

Hi @jpcoenen , any update? Which library is holding us up?

Thanks

jpcoenen commented 1 year ago

Hey @tomjohnburton! Apologies for the late response; I've been out for a while.

Thanks for your patience so far. I am working on this right now to get the changes into a private fork of the package in question. I'll report back in this thread when that is ready to test.

jpcoenen commented 1 year ago

Hey all! We have just released v1.6.1 of Connect which includes a fix for 2 separate memory leaks:

Feel free to give it a try and let me know if it makes any difference!

TimCostrop commented 1 year ago

Since no one has commented on the new release yet:

I applied the 1.6.1 connect-api and connect-sync via overriding the helm chart yesterday. So far, it looks like the connect-sync performs better memory-wise than before.

image

But the connect-api's memory still keeps rising. Please note this is only data from a single day of running, so I may be jumping to conclusions, but if someone could confirm these findings that would be great.

I can add another graph within a week of running these versions.

Both connect pods are running with a 1G memory resource limit.

jpcoenen commented 1 year ago

Sorry for my late reply; I was out of office for a while.

Happy to see that we made some progress, but that line for connect-api is still way too steep. Apparently, trying to reproduce the issue locally is not the perfect solution here.

What would be really valuable is some profiling data to see what is going on inside your Connect instance. If we added an option to expose Go's profiling data on a specific port (not the same one as the Connect API), would you be willing to download and share that with us? That file would not contain any memory contents itself, just a list of memory allocations.

It'd require updating Connect (the current version does not yet have support for this), setting a specific environment variable to enable profiling and then downloading that file after a day or so. @TimCostrop or anyone else in this thread, would you be up for that?

TimCostrop commented 1 year ago

If those are easily applied via my configuration file injected into the helm chart, or can easily be set active via a deployment change, I'm game.

But if my understanding is correct, this will only be the case with a future version of the connect-api?

jpcoenen commented 1 year ago

That's great, thanks!

We've been discussing internally (it was a lengthy discussion) the best method to expose this data without accidentally making it publicly accessible. We've concluded that writing it to the container's filesystem would be an acceptable method. However, that would require some means of extracting the data, and we realize that is not always trivial. So would running something like kubectl cp <pod>:/path/to/profiling/data ./ be an option in your environment?

TimCostrop commented 1 year ago

I've copied items from a pod before, that's not really an issue for our environment. The only thing which seems different from other pods is the fact that the connect-api has a sidecar connect-sync, so I'm not 100% sure how the kubectl copy command will handle that.
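
Looking at the kubectl docs, kubectl cp does accept a -c/--container flag to pick a container inside a multi-container pod, so presumably it's just a matter of passing the right container name; something along these lines (untested on my side, and assuming the data ends up at the path you mentioned):

kubectl cp -c <container> <pod>:/path/to/profiling/data ./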

antoineco commented 1 year ago

As of v1.7.0 this issue still exists. More than happy to provide profiling data.

In a very much idle cluster (4 secrets):

leak

jpcoenen commented 1 year ago

@antoineco Thanks for offering to help! We have recently introduced a profiler in Connect. However, the changes to the Helm chart were missing. I've just opened a PR to get that in.

antoineco commented 1 year ago

@jpcoenen works for me. Please reach out and let me know how you'd like that profiling data to be sent to you.

jpcoenen commented 1 year ago

We've just released the new version of the chart. To get the profiler set up, update the chart and then enable the profiler. Looking at the graph you shared, I think we can increase the interval to 12h.

helm repo update
helm upgrade connect 1password/connect --set connect.profiler.enabled=true --set connect.profiler.interval=12h 

That will make the internal profiler write a profiler snapshot once every 12h. It will keep only the 12 most recent snapshots.

Then after having Connect run for a few days, you can run the following to extract the profiles:

kubectl cp -c profiler-data <pod>:/data profiler
tar czf profiler.tar.gz profiler

Could you share that with me via email (joris.coenen@1password.com)? Thanks!

TimCostrop commented 1 year ago

Hi @jpcoenen I tried to apply the latest helm chart with the requested profiler settings, but chart version 1.12.0 with connect version 1.7.1 does not run on our cluster:

Warning Failed 5s (x3 over 7s) kubelet Error: container has runAsNonRoot and image will run as root (pod: "onepassword-connect-7f8d97b57b-b5wdv_general(ed611dcd-60a2-4416-bc96-04413a838ef9)", container: profiler-data)

The profiler sidecar wants to run as root, which is not allowed in our environment. I'm also not seeing a lot of configuration options in the profiler sidecar image definition to change this.
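
For reference, our policy effectively requires a container-level securityContext roughly like the one below (the UID is just an example), and as far as I can tell the chart does not currently expose anything like this for the profiler-data sidecar:

securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  allowPrivilegeEscalation: false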

jpcoenen commented 1 year ago

@TimCostrop Thanks for looking into getting the profiler snapshots! I really appreciate how willing you all are to help us figure out the root cause of this issue.

In the meantime, I have received some profiler snapshots from @antoineco, which do seem to confirm a suspicion I already had: Go does not report any memory leaks, even though Kubernetes reports steadily increasing memory consumption. Since that aligns with what I am seeing locally, this one set of profiler data is enough for now. I'll let you know if that changes.

My current hypothesis is that we're not actually dealing with a leak in the traditional meaning, but that there is either an issue with how Kubernetes is reporting memory usage or how it is taking back memory from the Go runtime.

There are quite a few similar reports, especially when it comes to Kubernetes and Golang:

https://github.com/golang/go/issues/44554
https://stackoverflow.com/questions/68289644/relationship-between-container-memory-working-set-bytes-and-process-resident-mem

I am currently trying to figure out what we can best do there. I am wondering whether it would work to apply some "memory pressure" to the container by limiting its maximum memory consumption. That might trigger the container runtime to take back the memory the Go runtime has released.

jpcoenen commented 1 year ago

I am really starting to get convinced of the above hypothesis that this is caused by k8s not reclaiming the memory that Go/Connect has released. Adding the following simple resource limits completely removes the upwards trend of the memory consumption for me:

resources:
  limits:
    memory: 20Mi
  requests:
    memory: 15Mi

Without any memory limits:

image

With some memory limits in place:

image


That specific limit might be a bit too harsh, but it does seem to suggest that it is a part of the problem.

You should be able to specify resource limits for Connect through our helm chart via the connect.api.resources and connect.sync.resources values. Feel free to give that a shot, and let me know if that changes anything for you. If you have already set limits, could you let me know whether Connect goes over those (i.e. the pod gets killed) or stays around/below them?
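
As a rough sketch, that could look something like this in a values file and be applied with helm upgrade connect 1password/connect -f values.yaml (the numbers here are purely illustrative, not recommended defaults):

connect:
  api:
    resources:
      limits:
        memory: 256Mi
      requests:
        memory: 64Mi
  sync:
    resources:
      limits:
        memory: 256Mi
      requests:
        memory: 64Mi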

In the meantime, I'll see if we can set some sane defaults for these values.

jpcoenen commented 1 year ago

I've done some further experimentation today. Using Chaos Mesh, I've simulated a pod that consumes a lot of memory. The experiment suggests that Kubernetes will reclaim the memory that is apparently in use by Connect the moment it needs it for another container.
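
For anyone who wants to reproduce this kind of memory pressure with Chaos Mesh, a StressChaos resource roughly like the following should do it (a sketch; the name, namespace, size and duration are illustrative and not necessarily what I used):

apiVersion: chaos-mesh.org/v1alpha1
kind: StressChaos
metadata:
  name: memory-stress
spec:
  mode: one
  selector:
    namespaces:
      - default
  stressors:
    memory:
      workers: 1
      size: "1GB"
  duration: 30m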

Here is a graph of the memory consumption of my Connect pod:

CleanShot 2023-08-02 at 15 01 46

Around 14:10 I started the memory stress test which claimed a lot of memory in a different pod:

CleanShot 2023-08-02 at 15 05 56

This immediately led to a drop in apparent memory consumption for the Connect pod.

This suggests that the high numbers reported for the container_memory_working_set_bytes (blue) and especially container_memory_usage_bytes (yellow) are not really an issue. They will drop when needed.


Having the reported memory consumption grow over time is still not nice. So we need to put some good default memory limitations in place. However, I'd also like to hear if anyone in this thread ever encountered any problems with the high reported memory usage. For example, has the Connect pod ever been recycled for anyone because of high memory usage?

TimCostrop commented 1 year ago

AFAIK the connect pod has never been recycled in our environment, but it has triggered various monitoring alerts due to memory pressure on our cluster when no resource limits were defined.

Our current setup limits the connect-api and connect-sync to 1G of memory, which seems like a lot now that I go through the thread again and look at the memory usage examples given. But that may also be why we never really see a memory issue: the limit is far higher than what is actually needed.

jpcoenen commented 1 year ago

Thanks for the extra info, @TimCostrop.

AFAIK the connect pod has never been recycled in our environment, but it has triggered various monitoring alerts due to memory pressure on our cluster when no resource limits were defined.

Okay, that's good to know. Is there maybe some flexibility in what kind of memory consumption you are monitoring? My experiments indicate that the container_memory_rss metric reported by k8s is actually a more realistic representation of Connect's memory consumption. It does not grow over time. Is there any chance that you can monitor that instead?

Our current setup limits the connect-api and connect-sync to 1G of memory, which seems like a lot now that I go through the thread again and look at the memory usage examples given. But that may also be why we never really see a memory issue: the limit is far higher than what is actually needed.

In my experiments, Connect's RSS memory consumption never went beyond 50MiB, even when continuously bombarding it with read requests. So I think 128 MiB or 256 MiB should be plenty. You should be able to verify this by checking container_memory_rss of your Connect pod.
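
If you're scraping the usual kubelet/cAdvisor metrics with Prometheus, a query along these lines should show it (the pod name pattern is just an example; adjust it to your release and namespace):

container_memory_rss{pod=~"onepassword-connect-.*", container=~"connect-api|connect-sync"}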

The only exception I can think of, is if you are using Connect to read very big files. Currently, Connect decrypts files in-memory. So if you want it to serve a file of 500 MiB, it will need at least 500 MiB of memory.

TimCostrop commented 1 year ago

I've added container_memory_rss to our general metrics dashboard, and monitoring that usage, it has stayed fairly steady at around 20MB over the past week.

2023-08-03_13-44

I will update our connect-api and connect-sync resource limits accordingly, since it doesn't really need the 1G we have assigned to it. We use the connect-api mainly to provide user/pass type credentials, not to serve any files.

jpcoenen commented 1 year ago

That's good to hear!

Digging a bit deeper, it seems that the increase mainly comes from cache pages, which are included in k8s's default memory consumption number (discussed in https://github.com/kubernetes/kubernetes/issues/43916). These are a result of the OS trying to optimize filesystem access, which we use to interact with the database. Memory used for this can indeed be reclaimed by the OS at any point.

If anyone else wants to confirm that this is indeed what they're seeing, have a look at the container_memory_cache metric. For my Connect pod, that steadily increases at the same rate as container_memory_usage_bytes. If that is the case, the pod should not cause any out-of-memory issues.
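
With a Prometheus/cAdvisor setup, plotting these two next to each other should make that easy to check (again, the pod name pattern is just an example):

container_memory_cache{pod=~"onepassword-connect-.*", container="connect-api"}
container_memory_usage_bytes{pod=~"onepassword-connect-.*", container="connect-api"}

If the cache line accounts for most of the growth in the usage line, you're seeing the same behaviour I am.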

If you are bothered by the high numbers, it is possible to instruct your OS to drop the memory used for caching by (regularly) running the following on the k8s host:

sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches"

antoineco commented 1 year ago

As already confirmed per email, since updating from v1.7.0 to v1.7.1 the memory usage has become much more reasonable for us (no change to the running cluster). It seems to remain more or less stable below 20 MiB.

zoom-out, pre+post update:

image

zoomed-in, post update:

image

tomjohnburton commented 1 year ago

I'm extremely confident that it has something to do with the operator

image

You can see clearly when I scale down the operator to 0

jpcoenen commented 1 year ago

Hey @tomjohnburton. So far, the investigation has mostly focused on the Connect containers. But if you are still experiencing issues, it makes sense to dig into the operator as well.

The graph looks interesting to us. Would you mind sharing a slightly bigger version, including the legend and full axes? And could you share how you obtained it? That would help us reproduce the same measurement.

omairvaiyani commented 1 year ago

Without distracting too much from the conversation, I wanted to add an anecdote from our experience. At just 50-100 concurrent requests, our Connect server (on ECS Fargate, vCPU=0.25, mem=512MB) sees its availability drop to near 0 as the CPU bottlenecks at 100%. We can bump this capacity to 200-300 by increasing to vCPU=1, mem=2GB, but the CPU still spikes to 100% until the requests drop. To note, we're just fetching 1-2 text items from a vault using a title filter. Nothing intense. I would not expect such hardware requirements.

tomjohnburton commented 1 year ago

@jpcoenen it turned out in my case that because I was using a forked version of the operator, it was spamming the Connect server with requests. It would poll the 1Password server thousands of times rather than respecting the POLL_INTERVAL. I have since reverted to using the official Docker images.

I am now running the latest chart 0.13.0 on all my clusters. Half have no memory leak and half do.

I honestly can't see what the difference could be

I'm going to run another experiment and scale down the operator on a relatively unused cluster for a while. I'll report back with graphs and metrics to see if it made a difference

tomjohnburton commented 1 year ago

UPDATE @jpcoenen

Just from turning off the operator, I'm already seeing a slight drop in memory (1MB); I will track that over the next couple of days. It's a mystery why the operator causes this on some of the clusters and not others.

Screenshot 2023-09-06 at 16 51 03

AndyTitu commented 1 week ago

We have conducted more research and validated what's been discussed in this issue. Please reference this other comment for the full findings. Closing this issue for now as solved, but feel free to open another one if you still encounter similar issues.