argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0

Excessive disk usage by argocd-repo-server up to Pod Eviction #9665

Closed RandGenXYZ closed 2 years ago

RandGenXYZ commented 2 years ago

Describe the bug

After updating to 2.4.0, argocd-repo-server's disk usage in /var/lib/kubelet spikes until no ephemeral storage is left, which leads to the Pod's eviction.

Previously, disk usage was pretty much unaffected by the repo-server. It now clearly pulls Helm charts to local storage over and over, until the ephemeral storage allocated to it by the kubelet can no longer hold the data. The Pod is then evicted and rescheduled on another node, where it shows the same behaviour.

I updated 2 days ago from v2.3.0-rc5, and since then the repo-server has kept crashing on the nodes because of the DiskPressure it causes itself.

It looks like the repo-server is stuck pulling the same version of the "metrics-server" Helm chart, multiple times a second, into the container's /tmp. This clogs /var/lib/kubelet extremely fast: the Pod fills a partition of about 5GB and crashes in less than 2 minutes.
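One way to confirm this kind of runaway growth is to summarize a copy of the container's /tmp and look for file names that repeat many times. The following is only a minimal sketch: the helper name `summarize_dir` is hypothetical, it assumes Python is available wherever the directory can be read, and it makes no assumption about how the repo-server actually names the files it downloads.

```python
import os
from collections import Counter


def summarize_dir(root):
    """Walk root and return (total_bytes, Counter of file base names).

    total_bytes is the sum of apparent file sizes (st_size), which can
    differ slightly from the blocks actually allocated on disk.
    """
    total = 0
    names = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                # lstat so that symlinks are not followed or double-counted
                total += os.lstat(path).st_size
            except OSError:
                # The file disappeared between listing and stat; skip it.
                continue
            names[name] += 1
    return total, names
```

Calling `summarize_dir("/tmp")` twice a few seconds apart and comparing the totals gives a rough growth rate, and `names.most_common(5)` shows whether one file name accounts for most of the entries.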

To Reproduce

Expected behavior

I'd expect the same behavior as in the previous version: disk usage is pretty much unaffected by the repo-server activity.

Screenshots

This shows the activity on a node where the repo-server is scheduled:

This shows the disk usage on the ephemeral disk partition (/var/lib/kubelet) of a node where the repo-server is scheduled:

The spikes are the moments when the Pod is scheduled on the node, right before it is evicted and crashes; the sharp drop in disk usage that follows is the kubelet cleaning up the mess. The calm period before the first spikes is when ArgoCD was still on v2.3.0; the spikes start on the day of the 2.4.0 update.

Evicted pods:

Version

argocd: v2.4.0+91aefab
  BuildDate: 2022-06-10T17:44:14Z
  GitCommit: 91aefabc5b213a258ddcfe04b8e69bb4a2dd2566
  GitTreeState: clean
  GoVersion: go1.18.3
  Compiler: gc
  Platform: linux/amd64

Logs

RandGenXYZ commented 2 years ago

The problem seems to stem from the fact that the helm chart in question is not available anymore on the remote helm repository.

RandGenXYZ commented 2 years ago

Downgrading to the previous version did not seem to change anything, so this may actually be an issue in previous versions too. I had never seen this behaviour before with charts that could no longer be fetched.

crenshaw-dev commented 2 years ago

Thanks for the report @RandGenXYZ. I think this is a duplicate of https://github.com/argoproj/argo-cd/issues/8773. lmk if the issues are distinct, and I'll reopen.