argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0

The timeout of git fetch in repo server correlates with unbounded growth of ephemeral storage use, up to tens of Gi #18831

Open andrii-korotkov-verkada opened 2 months ago

andrii-korotkov-verkada commented 2 months ago


Describe the bug

At some point, there were quite a lot of log entries like:

Failed to load target state: failed to generate manifest for source 1 of 1: rpc error: code = Unknown desc = failed to initialize repository resources: rpc error: code = Internal desc = Failed to fetch default: `git fetch origin --tags --force --prune` failed timeout after 1m30s

The disk usage of the repo server was growing unbounded, and even with a large ephemeral storage request and limit the pods were evicted rather quickly. After increasing the exec timeout to 2m30s the timeouts were gone and ephemeral storage use was stable at ~2.3Gi instead of 50Gi+. I'm fairly sure the two are correlated, since that was the only repo-server-related change I made at the time. It looks like partially fetched data is not cleaned up when the fetch times out.
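For reference, a minimal sketch of the timeout change described above, assuming the repo server's exec timeout is controlled by the ARGOCD_EXEC_TIMEOUT environment variable (consistent with the default 1m30s seen in the logs) and that Argo CD runs in the argocd namespace:

```shell
# Raise the repo server exec timeout to 2m30s (namespace and variable name
# are assumptions; adjust to your installation).
kubectl -n argocd set env deployment/argocd-repo-server ARGOCD_EXEC_TIMEOUT=2m30s

# Wait for the repo server to roll out with the new setting.
kubectl -n argocd rollout status deployment/argocd-repo-server
```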

To Reproduce

Use a large repo (multiple Gi) with many updates, so that fetches trigger exec timeouts on the repo server.

Expected behavior

Ephemeral storage usage remains bounded, even when fetches time out.

Screenshots

(Screenshot attached: 2024-06-26, 9:37 AM)

Version

Custom build from master plus https://github.com/argoproj/argo-cd/pull/18694, around the time of the v2.12.0-rc1 release.

Logs

Failed to load target state: failed to generate manifest for source 1 of 1: rpc error: code = Unknown desc = failed to initialize repository resources: rpc error: code = Internal desc = Failed to fetch default: `git fetch origin --tags --force --prune` failed timeout after 1m30s

todaywasawesome commented 2 months ago

Can you try exec-ing into the container to look at the files and figure out which files are getting stuck here? OOM is ok but not great. Maybe we can add cleanup of specific files to prevent the leak in the first place. (via @crenshaw-dev)
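A rough sketch of that kind of inspection, assuming the repo server checks repositories out under /tmp (its default TMPDIR) and runs in the argocd namespace:

```shell
# Exec into the running repo server and list the largest entries in /tmp,
# where cloned repositories are expected to accumulate (path is an assumption).
kubectl -n argocd exec -it deploy/argocd-repo-server -- \
  sh -c 'du -sh /tmp/* 2>/dev/null | sort -rh | head -20'
```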

andrii-korotkov-verkada commented 2 months ago

OOM is ok but not great

nit: OODisk (it's disk pressure, not out-of-memory)

I'm trying to figure out a repro. I'd probably need to reduce the exec timeout during off-business hours and exec into a live pod, since I can't do that with a pod that has already errored out.

christianh814 commented 2 months ago

@andrii-korotkov-verkada you might be able to exec into a failed pod with `kubectl debug`.
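For example, something along these lines (a sketch; the pod name is a placeholder, and an already-evicted pod may no longer accept ephemeral containers):

```shell
# Attach an ephemeral debug container that shares the repo server container's
# process namespace; the target container's filesystem is then reachable
# under /proc/1/root inside the debug shell.
kubectl -n argocd debug -it argocd-repo-server-<pod-suffix> \
  --image=busybox --target=argocd-repo-server -- sh
```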