argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0

Helm implementation fails badly with just >10 Helm-using applications #5933

Open KlavsKlavsen opened 3 years ago

KlavsKlavsen commented 3 years ago

Describe the bug
With 20+ Helm-using applications from 10+ different Helm repos (which is the reality after charts.helm.sh went away), Argo CD often hangs in sync state Unknown or with a "repo add" error - and all applications go to that state - rendering Argo CD unusable.

The ONLY fix currently is to scale up the repo server (I need 4 replicas for my 25 applications - a few of them being larger, like the kube-prometheus chart) and make sure all repoServer pods are deleted and recreated. Then I can choose "Hard refresh" on one application at a time - and it resolves.

The problem seems to be Helm actions being run simultaneously - certain actions corrupt the Helm cache, so the pod needs removing (so the Helm cache is cleared).
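
For reference, the manual workaround described above can be scripted roughly like this (a sketch, assuming the default argocd namespace and the standard argocd-repo-server deployment name and labels from the upstream manifests):

```sh
# Scale out the repo server, then recycle its pods so each one starts with a clean Helm cache.
kubectl -n argocd scale deployment argocd-repo-server --replicas=4
kubectl -n argocd delete pod -l app.kubernetes.io/name=argocd-repo-server

# Afterwards, trigger a hard refresh per application from the UI, or via the CLI:
argocd app get <app-name> --hard-refresh
```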

To Reproduce
See issue #3451

Expected behavior
The fix for this would seem to be something like:

1) Ensure the "Helm result cache" is NOT long-lived. A "positive cache" may live as long as the git repo has not changed, but for sync, negative results should NOT live for more than a few minutes at most, as they seem to hit some Helm issue which then needs a manual hard refresh on every application.

2) Helm apparently messes up its cache at times, so on negative Helm results you should probably run the following inside the pod (see the sketch below this list):
rm -rf ~/.helm/cache/archive/
rm -rf ~/.helm/repository/cache/
helm repo update

This is effectively what I get by killing the repo server pods - hence fixing the issue.

3) Helm runs "repo add" way too often - on every sync run? I've seen it fail a lot with "`helm repo add stable https://charts.helm.sh/stable` failed signal: killed", which really should be cached FOREVER in each repo server pod, so it should NEVER need to add the repo again. This should be fixed so it does not add the repo again and again.
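
A rough sketch of doing item 2 by hand against running repo server pods, assuming the default argocd namespace and the standard argocd-repo-server labels (note the paths above are the Helm 2 layout; Helm 3 keeps its caches under ~/.cache/helm instead):

```sh
# Wipe the Helm caches inside each repo server pod in place, instead of deleting the pods.
for pod in $(kubectl -n argocd get pods -l app.kubernetes.io/name=argocd-repo-server -o name); do
  kubectl -n argocd exec "$pod" -- sh -c 'rm -rf ~/.helm/cache/archive/ ~/.helm/repository/cache/'
done
```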

Using ArgoCD 1.8.4.

KlavsKlavsen commented 3 years ago

I just upgraded to 2.0 - and it has the exact same issues :(

KlavsKlavsen commented 3 years ago

Currently I get the "Error" status and this text: "rpc error: code = Unknown desc = `helm dependency build` failed signal: killed" - as described in #3451

KlavsKlavsen commented 3 years ago

Killing all 4 repoServer pods and doing a "hard refresh" resolves the issue (for now).

KlavsKlavsen commented 3 years ago

I now have to do this several times a day, running 2.0. It gives the error "rpc error: code = Unknown desc = `helm repo add bitnami https://charts.bitnami.com/bitnami` failed signal: killed" or "rpc error: code = Unknown desc = `helm dependency build` failed signal: killed" EVERY time - even after I set ARGOCD_HELM_INDEX_CACHE_DURATION to 300s, so that didn't help :(
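
For anyone else trying this, the index cache duration was set roughly like this (a kustomize-style patch sketch against the repo server deployment; only the variable name and the 300s value come from the comment above, the rest is assumed from a default install):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: argocd-repo-server
  namespace: argocd
spec:
  template:
    spec:
      containers:
        - name: argocd-repo-server
          env:
            # How long the repo server caches Helm repo index data.
            - name: ARGOCD_HELM_INDEX_CACHE_DURATION
              value: "300s"
```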

KlavsKlavsen commented 3 years ago

We've found in the node kernel logs that it quite often OOM-kills Helm processes (our memory limit was set to 384Mi). We've now increased it to 1Gi to see if that helps too. We're using 20 different Helm repos, and we're moving to our own cache of Helm repos - so we'll have ONLY one repo, with ONLY the charts we need. That should also help on memory usage, I suspect.
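
The limit bump looks roughly like this on the repo server deployment (a sketch; only the 1Gi limit comes from the comment above, the request value is an assumption):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: argocd-repo-server
  namespace: argocd
spec:
  template:
    spec:
      containers:
        - name: argocd-repo-server
          resources:
            requests:
              memory: 256Mi
            limits:
              # Headroom for helm dependency build / repo index downloads,
              # so the kernel OOM killer stops sending SIGKILL to helm.
              memory: 1Gi
```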

agaudreault commented 2 years ago

I get intermittent errors with the repo-server. I have 5 different pods, each using ~80Mi, and the limit is 512Mi (the request is 128Mi).

rpc error: code = Unknown desc = `helm dependency build` failed signal: killed

@KlavsKlavsen I was wondering if and how you fixed it

I am using version 2.2.3

KlavsKlavsen commented 2 years ago

I have not seen the problem since setting the limit to 1Gi, and we also switched to OCI.
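
In case it helps others, pointing Argo CD at an OCI Helm registry is done declaratively with a repository secret along these lines (a sketch; the registry URL and names are placeholders, and the exact fields may vary by Argo CD version):

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: helm-oci-repo
  namespace: argocd
  labels:
    # Marks this secret as a repository definition for Argo CD.
    argocd.argoproj.io/secret-type: repository
stringData:
  name: our-charts
  type: helm
  enableOCI: "true"
  # OCI registries are referenced without the oci:// prefix here.
  url: registry.example.com/helm-charts
```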

Gabryel8818 commented 1 year ago

Any updates?

sksaranraj commented 1 year ago

Any updates?

ahoka commented 3 weeks ago

For what it's worth, increasing the memory limit for the repo server seems to fix this. Better diagnostics around this would be an improvement.