argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0

Failing to deploy services due to "helm dependency build" failure #5107

Open zonnie opened 3 years ago

zonnie commented 3 years ago

Describe the bug

Some Helm-based Applications fail to deploy due to some kind of internal filesystem issue.

For example, one of the apps that is in the Unknown state:

Name:               redis-rule-hit-count-buffer-test-2
Project:            default
Server:             https://kubernetes.default.svc
Namespace:          test-2
URL:                https://35.232.222.64/applications/redis-rule-hit-count-buffer-test-2
Repo:               git@github.com:gc-org/gc-saas-prod.git
Target:             HEAD
Path:               prod/common/infra/redis
SyncWindow:         Sync Allowed
Sync Policy:        Automated (Prune)
Sync Status:        Unknown
Health Status:      Healthy

CONDITION        MESSAGE                                                                                                                                                                                                                                                  LAST TRANSITION
ComparisonError  rpc error: code = Unknown desc = Manifest generation error (cached): `helm dependency build` failed exit status 1: Error: unable to move current charts to tmp dir: link error: cannot rename charts to tmpcharts: rename charts tmpcharts: file exists  2020-12-22 10:14:56 +0200 IST

This doesn't eventually resolve itself; it stays this way...

To Reproduce

I'm not sure how to reproduce this; it happens from time to time and causes a complete deadlock.

My Chart.yaml

name: redis
version: 0.1.0
apiVersion: v2
dependencies:
  - name: redis
    version: 11.2.2
    repository: https://charts.bitnami.com/bitnami

My app-of-apps

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: redis-servers-test-2
  namespace: argocd
  finalizers:
    - resources-finalizer.argocd.argoproj.io
spec:
  project: default
  source:
    repoURL: git@github.com:gc-org/gc-saas-prod.git
    targetRevision: HEAD
    path: prod/cluster_1/customers/customer_2/redis
    helm:
      releaseName: redis-servers-test-2
  destination:
    server: https://kubernetes.default.svc
    namespace: test-2
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
      allowEmpty: true

My template

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: redis-dashboard-top-widgets-cache-test-2
  namespace: argocd
  finalizers:
    - resources-finalizer.argocd.argoproj.io
  annotations:
    argocd.argoproj.io/sync-wave: "3"
spec:
  project: default
  source:
    repoURL: git@github.com:gc-org/gc-saas-prod.git
    targetRevision: HEAD
    path: prod/common/infra/redis
    helm:
      values: |
        redis:
          fullnameOverride: redis-dashboard-top-widgets-cache
          redisPort: 6413
          master:
            nodeSelector:
              cus: test-2
            service:
              port: 6413
          image:
            tag: 6.0.6
          cluster:
            enabled: false
          existingSecret: "redis"
          existingSecretPasswordKey: redis-password
      releaseName: redis-dashboard-top-widgets-cache
  destination:
    server: https://kubernetes.default.svc
    namespace: test-2
  syncPolicy:
    syncOptions:
    - CreateNamespace=true
    retry:
      limit: -1 # unlimited
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 5m
    automated:
      prune: true
      selfHeal: true
      allowEmpty: true
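
The retry block above can be read as an exponential backoff. Below is a minimal sketch, assuming each delay is `duration * factor^n` capped at `maxDuration`; the function name and the formula are illustrative assumptions, not Argo CD's actual implementation. Note also that this setting governs sync retries, and nothing in this thread establishes that it helps with a cached manifest generation error.

```python
from typing import List


def backoff_schedule(duration_s: int, factor: int, max_duration_s: int, attempts: int) -> List[int]:
    """Return the assumed wait (in seconds) before each of the first `attempts` retries.

    Hypothetical helper: delay n is duration * factor**n, capped at max_duration.
    """
    return [min(duration_s * factor ** n, max_duration_s) for n in range(attempts)]


if __name__ == "__main__":
    # Values from the template above: duration 5s, factor 2, maxDuration 5m.
    print(backoff_schedule(5, 2, 300, 8))
```

Under that assumption, the delays grow from 5s and hit the 5m cap by the seventh retry, after which an unlimited retry count (`limit: -1`) keeps retrying at the cap.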

Expected behavior

The Application should be deployed successfully

Screenshots

[Screenshot: Screen Shot 2020-12-22 at 10 37 37]

Version

argocd: v1.7.10+bcb05b0.dirty
  BuildDate: 2020-11-21T00:34:29Z
  GitCommit: bcb05b0c2e0f8006aa2d2abaf780e73c9e73c945
  GitTreeState: dirty
  GoVersion: go1.15.5
  Compiler: gc
  Platform: darwin/amd64
argocd-server: v1.8.1+c2547dc
  BuildDate: 2020-12-10T02:59:21Z
  GitCommit: c2547dca95437fdbb4d1e984b0592e6b9110d37f
  GitTreeState: clean
  GoVersion: go1.14.12
  Compiler: gc
  Platform: linux/amd64
  Ksonnet Version: v0.13.1
  Kustomize Version: v3.8.1 2020-07-16T00:58:46Z
  Helm Version: v3.4.1+gc4e7485
  Kubectl Version: v1.17.8

Logs

Logs from the argocd-application-controller

time="2020-12-22T08:14:58Z" level=info msg="Normalized app spec: {\"status\":{\"conditions\":[{\"lastTransitionTime\":\"2020-12-22T08:14:56Z\",\"message\":\"rpc error: code = Unknown desc = Manifest generation error (cached): `helm dependency build` failed exit status 1: Error: unable to move current charts to tmp dir: link error: cannot rename charts to tmpcharts: rename charts tmpcharts: file exists\",\"type\":\"ComparisonError\"}]}}" application=redis-rule-hit-count-buffer-test-2

Logs from argocd-repo-server

time="2020-12-22T08:40:44Z" level=error msg="finished unary call with code Unknown" error="Manifest generation error (cached): `helm dependency build` failed exit status 1: Error: unable to move current charts to tmp dir: link error: cannot rename charts to tmpcharts: rename charts tmpcharts: file exists" grpc.code=Unknown grpc.method=GenerateManifest grpc.request.deadline="2020-12-22T08:41:43Z" grpc.service=repository.RepoServerService grpc.start_time="2020-12-22T08:40:44Z" grpc.time_ms=474.433 span.kind=server system=grpc
time="2020-12-22T08:40:44Z" level=info msg="manifest cache hit: &ApplicationSource{RepoURL:git@github.com:gc-org/gc-saas-prod.git,Path:prod/common/infra/redis,TargetRevision:HEAD,Helm:&ApplicationSourceHelm{ValueFiles:[],Parameters:[]HelmParameter{},ReleaseName:,Values:redis:\n  redisPort: 6380\n  master:\n    service:\n      port: 6380\n  image:\n    tag: 6.0.6\n  nodeSelector:\n    cus: test-1\n  cluster:\n    enabled: false\n  existingSecret: \"redis\"\n  existingSecretPasswordKey: redis-password\n,FileParameters:[]HelmFileParameter{},Version:,},Kustomize:nil,Ksonnet:nil,Directory:nil,Plugin:nil,Chart:,}/cead2aa7818699b6c3ef04fc7e35390ee0fcbee0"

zonnie commented 3 years ago

@zonnie the only fix I've found ("until 2.1.7 hopefully solves it with Helm 3.7.1") is to simply KILL all argocd repo pods, which flushes the cache that causes the problem. In the same way, I have to kill all application-controller pods when sync hangs forever (this happens with the kube-prometheus chart, for example).

Doesn't work for me... it gets to the "crapped out" state pretty quickly. Only @gzur's solution mitigated the issue.

gzur commented 3 years ago

The below example, AFAIK, is helm v3, correct? apiVersion: v2 is for helm 3 while apiVersion: v1 is for helm 2, correct?

Correct.

I must admit that I am a bit surprised that https://github.com/helm/helm/pull/9889 did not resolve your issue. There was a possible race-condition in the previous code-path, which that fix has probably not addressed.

  1. How many charts are you running?
  2. And do these charts have any subcharts?
  3. How is the connectivity to the Helm Repository hosting these charts?

The reason I ask is that at my previous job - where we were experiencing this issue - we were downloading a metric ton of subcharts hosted by a Helm Repo that had connectivity issues - which was causing helm dep update to take a long time.

We suspected that this was what exposed the aforementioned race condition.
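
To make the suspected failure mode concrete, here is a minimal sketch, assuming the error comes from a `tmpcharts` directory left behind by an interrupted or concurrent run. The function names are hypothetical and this is not Helm's actual code; it only reproduces the same kind of rename collision on a scratch directory.

```python
import os
import tempfile


def swap_in_new_charts(chart_dir: str) -> None:
    """Move charts/ aside to tmpcharts/ before fetching fresh dependencies.

    Hypothetical stand-in for the step that fails in this thread.
    """
    charts = os.path.join(chart_dir, "charts")
    tmp = os.path.join(chart_dir, "tmpcharts")
    # Raises OSError if tmpcharts/ already exists and is non-empty, for
    # example because another run has not cleaned it up yet.
    os.rename(charts, tmp)


def demo() -> str:
    with tempfile.TemporaryDirectory() as chart_dir:
        # State left behind by a first run that moved charts/ aside and was
        # interrupted before removing tmpcharts/.
        os.makedirs(os.path.join(chart_dir, "tmpcharts", "redis"))
        # A second run finds charts/ present again and attempts the same move.
        os.makedirs(os.path.join(chart_dir, "charts", "redis"))
        try:
            swap_in_new_charts(chart_dir)
        except OSError as err:
            return f"rename failed: {err.strerror}"
        return "rename succeeded"


if __name__ == "__main__":
    print(demo())
```

The exact error text depends on the operating system and filesystem, so it may not match the "file exists" wording in the logs above. The point is only that the second run cannot succeed until something removes the leftover directory, which would be consistent with the state never resolving on its own.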

zonnie commented 3 years ago

So

  1. When you say "running", do you mean the entire ArgoCD application count? If so, we have 600+. If you mean the specific creation act, we create around 10 at that specific moment.
  2. If by that you mean app-of-apps, then we do. Our "tree" has 3 top-level applications; in 2 of those we have around 10-20 Argo applications. The last one is a single Argo application.
  3. Our charts are of 2 types: in-house charts are stored in our GitHub repo, and the third-party ones are mostly Bitnami charts.

Thanks so much for your attention @gzur

gzur commented 3 years ago

Thanks so much for your attention @gzur

Yeah, I don't understand why I'm so inordinately invested in this issue 😂

One thing caught my eye though, @zonnie, you wrote:

[...] Still getting

rpc error: code = Unknown desc = Manifest generation error (cached): `helm dependency build` failed exit status 1: Error: unable to move current charts to tmp dir: link error: cannot rename charts to tmpcharts: rename charts tmpcharts: file exists

Sad 😢

That error string: unable to move current charts to tmp dir got removed in the commit that addressed https://github.com/helm/helm/pull/9889

So what version of ArgoCD are you ACTUALLY running?

EDIT: Oh wait, nevermind, I just realized that the helm version bump PR has NOT been released (as stated by @KlavsKlavsen above)

So I guess it's just best to wait for the new version.

foxracle commented 2 years ago

I have the exact same issue with argocd 2.2.0.rc1. I have tried every solution mentioned above; they do not work at all.

After downgrading to v2.1.7 yesterday, it seems to sync quickly now, but an error message like "`helm repo add charts.helm.sh https://charts.helm.sh/stable` failed" appeared once after the downgrade.

All the other error messages, like 'context deadline exceeded' or '"helm dependency build" failure', are gone.

zonnie commented 2 years ago

@foxracle I have to say that @gzur's solution worked for me. I have quite a lot of Argo Applications (hundreds)... it used to be unusable, now I have no issues.

foxracle commented 2 years ago

@zonnie I have tried these, and I also tried flushing the entire redis cache, but nothing changed. I do not think it is a problem of load, or a lack of resources on argo-repo-server, or anything like that.

argocd:
  controller:
    extraArgs:
      - --repo-server-timeout-seconds
      - "500" 
  repoServer:
    env:
      - name: "ARGOCD_EXEC_TIMEOUT"
        value: "5m"

In fact, I have installed two argocd services in two different k8s clusters: one is v1.7.8 and manages two k8s clusters, the other is v2.2.0.rc1 and manages one k8s cluster. They both use the same git repo with 50+ apps to sync. The v1.7.8 one is OK with no args tuning, but v2.2.0.rc1 is not. After trying every solution I googled, I gave up and downgraded to the v2.1.7 stable version.

qtheya commented 2 years ago

Seeing as this issue is "resolved" temporarily by killing repo-server pods so they get recreated, it's clearly a caching problem.

repo-server and redis pods ...

sdelrio commented 2 years ago

After this week's security upgrade of ArgoCD to v2.2.5 (from v2.2.2 in my case), repoServer started to give this "helm dependency build" failure. Even v2.2.2 already includes helm 3.7. I've increased the timeouts and added a 2nd, 3rd and 4th replica, but the repoServer starts to eat all the CPU. When rolling back to v2.2.2 this issue disappears. It works for a while, even when manually forcing a sync of 220 apps, but after some hours it starts failing with the "helm dependency build" failure.

begemotik commented 2 years ago

I suppose you are using the helm chart to deploy ArgoCD; the problem is related to the chart itself. Try out the latest argo version with chart 3.29.5.

sdelrio commented 2 years ago

Yes, I'm using the helm chart to deploy ArgoCD. I'm using the latest chart I saw, 3.33.5 (5 Feb 2022). Using this chart version and rolling back to v2.2.2 also fixes the issue. The previous chart version I was using was 3.29.5.

begemotik commented 2 years ago

Just try out 3.29.5 and specify image tag 2.2.5; it will work fine. We were trying to identify the problem with the chart causing that, and we have some clues, but no confirmation so far.

sdelrio commented 2 years ago

Thank you very much. I will test 2.2.5 + chart 3.29.5. From what I just saw, the main differences are the PodDisruptionBudget and the copyutil initContainer for copying the argocd binary, but they seem to be disabled by default.

sdelrio commented 2 years ago

We were trying to identify the problem with chart causing that and we have some clues, but no confirmation so far

From what I saw doing a helm diff, it seems some env vars were added, and a new volume. But when I was using chart 3.33.5 and argo v2.2.5, the files were still being generated at /tmp, so maybe the repoServer is trying to delete files in the new /helm-working-dir that were never there.

        env:
        - name: HELM_CACHE_HOME
          value: /helm-working-dir
        - name: HELM_CONFIG_HOME
          value: /helm-working-dir
        - name: HELM_DATA_HOME
          value: /helm-working-dir

        volumes:
        - name: helm-working-dir
          emptyDir: {}

willemm commented 2 years ago

We're having similar issues.

One thing I would like to add (which is why I'm commenting) is that I believe an application should never get into a permanently broken state just because of a temporary issue like this.

i.e. if I click 'hard refresh' on the applications that are broken like this, they become fixed. I don't want to have to do that manually, it should just sort itself out.
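
Until that happens automatically, the manual step can be scripted. The following is a minimal sketch, assuming the input is JSON shaped like the output of `argocd app list -o json` (a list of Application objects whose `status.conditions` entries carry `type` and `message`, as in the controller log earlier in this thread). The function names are hypothetical; the sketch only builds `argocd app get <name> --hard-refresh` command lines and does not run them.

```python
import json
from typing import List


def stuck_apps(apps_json: str, needle: str = "helm dependency build") -> List[str]:
    """Return names of apps with a ComparisonError whose message contains `needle`."""
    names = []
    for app in json.loads(apps_json):
        # `conditions` may be missing or null on healthy applications.
        conditions = app.get("status", {}).get("conditions") or []
        for cond in conditions:
            if cond.get("type") == "ComparisonError" and needle in cond.get("message", ""):
                names.append(app["metadata"]["name"])
                break
    return names


def refresh_commands(names: List[str]) -> List[str]:
    """Build one hard-refresh command line per application name."""
    return [f"argocd app get {name} --hard-refresh" for name in names]


if __name__ == "__main__":
    sample = json.dumps([
        {"metadata": {"name": "redis-rule-hit-count-buffer-test-2"},
         "status": {"conditions": [
             {"type": "ComparisonError",
              "message": "Manifest generation error (cached): `helm dependency build` failed"}]}},
        {"metadata": {"name": "healthy-app"}, "status": {}},
    ])
    for cmd in refresh_commands(stuck_apps(sample)):
        print(cmd)
```

This is a workaround sketch, not a fix: it does not address why the cached error is never invalidated.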

tomikonio commented 2 years ago

Hi, Long thread here. Is there a TLDR - can I tell ArgoCD to ignore helm chart dependencies so that the sync will apply changes to other components?

decipher27 commented 2 years ago

I am also facing the same issue with the below error on argoCD:

ComparisonError: rpc error: code = Unknown desc = Manifest generation error (cached): open /tmp/https___github.com_atlanhq_cloud-common/platform/k8/ui/Chart.yaml: no such file or directory

decipher27 commented 2 years ago

This sync error was not present earlier but when I added the Chart.yaml it worked!

foxracle commented 2 years ago

These are two terrible problems I've had after working with argocd for two years, from version 1.7.x to 2.3.x, especially in an urgent deploy case.

Most of the time, I know it is not a performance issue, since I followed everything in the official doc [high availability]; it is just a cache problem. During these 2 years, there has still been only one workaround to fix it: recreate argocd-repo-server and flush all data in the redis cluster. Thank God!

crenshaw-dev commented 2 years ago

Quick note on this one:

ComparisonError: rpc error: code = DeadlineExceeded desc = context deadline exceeded

That's a generic error message from the golang context package, and it just means "a timeout happened somewhere."

We've made efforts recently to always wrap all error messages to provide more context. Hopefully in future versions the reason for the timeout will be much more clear.
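
As an illustration of why wrapping helps, here is a minimal sketch in Python rather than Go, so it is only an analogy to the approach described above and not Argo CD's code. The class, function, and step names are hypothetical: a generic timeout is re-raised with the name of the step that was running, while the original error stays attached as the cause.

```python
class StepError(Exception):
    """Error that records which step was running when a lower-level error occurred."""


def run_step(step: str, fn):
    """Run `fn`, re-raising any timeout with the step name attached."""
    try:
        return fn()
    except TimeoutError as err:
        # `from err` keeps the original exception available as __cause__.
        raise StepError(f"{step}: {err}") from err


def slow_dependency_build():
    # Stand-in for an operation that exceeds its deadline.
    raise TimeoutError("context deadline exceeded")


def describe_failure() -> str:
    try:
        run_step("helm dependency build", slow_dependency_build)
    except StepError as err:
        return str(err)
    return "no error"


if __name__ == "__main__":
    print(describe_failure())
```

With the step name in the message, a reader can tell which operation timed out instead of seeing only the bare deadline error.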

chenshap commented 1 year ago

hello, did anyone find a solution? we are experiencing this on argocd version 2.5.6

ArieLevs commented 1 year ago

+1 here, still happens with v2.6.1: constant "helm dependency build" failure errors. I cannot see any pressure on the repo server.

zengzhengrong commented 1 year ago

I have the same issue. Does it mean I should delete the Chart.lock file from the git repo?

rpc error: code = Unknown desc = Manifest generation error (cached): `helm dependency build` failed exit status 1: Error: the lock file (Chart.lock) is out of sync with the dependencies file (Chart.yaml). Please update the dependencies

slntopp commented 2 months ago

Had the same issue, though it was completely unrelated to Helm dependencies. The actual issue was a CronJob template which had a typo in Kind.