GoogleCloudPlatform / gcs-fuse-csi-driver

The Google Cloud Storage FUSE Container Storage Interface (CSI) Plugin.
Apache License 2.0

gcsfusecsi-metrics-collector container getting OOM killed #373

Open pdfrod opened 1 week ago

pdfrod commented 1 week ago

I'm experiencing occasional OOM kills of the gcsfusecsi-metrics-collector container (part of the gcsfusecsi-node DaemonSet). This container has a fairly low memory limit (30Mi). Is there a way to customize the memory limit of this container?

hime commented 1 week ago

Hi @pdfrod, this is interesting behavior. Can you provide the GKE cluster version and the number of pods you are running on each node? Can you also confirm whether this is causing issues in your workload? If so, I can provide the steps to disable metrics exporting.

Could you share the cluster ID with me? You can get the ID by running `gcloud container clusters describe --location | grep id:`

pdfrod commented 1 week ago

Sure, here's the info you requested @hime.

I should probably mention that I don't remember seeing this problem when there were just a couple of deployments using this driver. Now that I have 12 deployments using the driver, I'm seeing OOM kills of the metrics collector every day.

If there's a way to disable the metrics collector container, that would be even better as currently I'm not using those metrics.

Let me know if you need more info.

hime commented 1 week ago

Thank you @pdfrod. Could you add the following volumeAttribute to your spec? See details here

volumeAttributes:
  ...
  disableMetrics: "true"

Please let me know if that stops the OOMs. We are working on fixing this issue.
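
For context, here is a minimal sketch of where that attribute might sit in a full Pod manifest. Only `disableMetrics: "true"` comes from this thread. The Pod name, service account, image, mount path, and bucket name are hypothetical placeholders, and the annotation and driver name are assumed from the usual GKE setup for this driver, so check them against your own manifests.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gcsfuse-example              # hypothetical name
  annotations:
    gke-gcsfuse/volumes: "true"      # assumed: enables sidecar injection on GKE
spec:
  serviceAccountName: example-ksa    # hypothetical service account with bucket access
  containers:
  - name: app
    image: busybox                   # placeholder workload image
    command: ["sleep", "3600"]
    volumeMounts:
    - name: gcs-volume
      mountPath: /data               # hypothetical mount path
  volumes:
  - name: gcs-volume
    csi:
      driver: gcsfuse.csi.storage.gke.io   # assumed driver name
      volumeAttributes:
        bucketName: example-bucket   # hypothetical bucket
        disableMetrics: "true"       # the attribute suggested in this thread
```

The value is quoted because CSI volume attributes are passed as strings. With 12 deployments using the driver, as described above, each Pod spec that mounts a bucket would presumably need the same attribute.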

pdfrod commented 1 week ago

Cool, I'll give that a try. Thanks!

hime commented 1 week ago

Hi @pdfrod, thanks for reporting this issue! I have created and merged #375 to disable metrics exporting by default. We still need to research a good way to scale metrics collection for customers running many workloads on the same VM.

pdfrod commented 1 week ago

Cool, thanks a lot!

Since I disabled metrics on my cluster, I haven't seen any OOM kills, so it's looking good so far.