zonnie opened this issue 3 years ago
@zonnie the only fix I've found - until 2.1.7 hopefully solves it with Helm 3.7.1 - is to simply KILL all argocd repo pods, which flushes the cache that causes the problem. The same way I have to kill all application-controller pods when a sync hangs forever (happens with the kube-prometheus chart, for example).
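Roughly what I run, as a sketch - assuming the default `argocd` namespace and the pod labels from the official install manifests:

```sh
# Flush the repo-server cache by killing its pods (the Deployment recreates them)
kubectl -n argocd delete pod -l app.kubernetes.io/name=argocd-repo-server
# Same trick for the application-controller when a sync hangs forever
kubectl -n argocd delete pod -l app.kubernetes.io/name=argocd-application-controller
```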
Doesn't work for me... it gets to the "crapped out" state pretty quickly - only @gzur's solution mitigated the issue.
> The below example, AFAIK, is `helm` v3, correct? `apiVersion: v2` is for `helm` 3 while `apiVersion: v1` is for `helm` 2 - correct?

Correct.
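For reference, a minimal sketch of the Helm 3 format (the chart name and dependency are made up): with `apiVersion: v2` the dependencies live in `Chart.yaml` itself, whereas Helm 2 (`apiVersion: v1`) kept them in a separate `requirements.yaml`:

```yaml
apiVersion: v2          # Helm 3 chart format
name: my-app            # hypothetical chart name
version: 0.1.0
dependencies:           # in Helm 2 (apiVersion: v1) these lived in requirements.yaml
  - name: redis
    version: "16.x.x"
    repository: https://charts.bitnami.com/bitnami
```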
I must admit that I am a bit surprised that https://github.com/helm/helm/pull/9889 did not resolve your issue. There was a possible race condition in the previous code path, which that fix has apparently not addressed.

How many charts are you running?
And do these charts have any subcharts?
How is the connectivity to the Helm repository hosting these charts?

The reason I ask is that at my previous job - where we were experiencing this issue - we were downloading a metric ton of subcharts hosted by a Helm repo that had connectivity issues, which was causing `helm dep update` to take a long time. We suspected that this was what exposed the aforementioned race condition.
Thanks so much for your attention @gzur
> Thanks so much for your attention @gzur
Yeah, I don't understand why I'm so inordinately invested in this issue 😂
One thing caught my eye though, @zonnie, you wrote:
> [...] Still getting

```
rpc error: code = Unknown desc = Manifest generation error (cached): `helm dependency build` failed exit status 1: Error: unable to move current charts to tmp dir: link error: cannot rename charts to tmpcharts: rename charts tmpcharts: file exists
```
Sad 😢
That error string, `unable to move current charts to tmp dir`, was removed in the commit that addressed https://github.com/helm/helm/pull/9889.
So what version of ArgoCD are you ACTUALLY running?
EDIT: Oh wait, nevermind, I just realized that the helm version bump PR has NOT been released (as stated by @KlavsKlavsen above)
So I guess it's just best to wait for the new version.
I have the exact same issue with argocd 2.2.0-rc1. I have tried every solution mentioned above; none of them work at all.
After downgrading to v2.1.7 yesterday, it seems like it can sync quickly now, but an error message like "`helm repo add charts.helm.sh https://charts.helm.sh/stable` failed" appeared once after the downgrade.
All other error messages like "context deadline exceeded" or "`helm dependency build` failure" are gone.
@foxracle I have to say that @gzur's solution worked for me. I have quite a lot of Argo Applications (hundreds)... it used to be unusable; now I have no issues.
@zonnie I have tried these, and I also tried flushing the entire Redis cache, but nothing changed. I do not think it is a problem of load or lack of resources on argo-repo-server or anything else.
```yaml
argocd:
  controller:
    extraArgs:
      - --repo-server-timeout-seconds
      - "500"
  repoServer:
    env:
      - name: "ARGOCD_EXEC_TIMEOUT"
        value: "5m"
```
In fact, I have installed two ArgoCD services in two different k8s clusters: one is v1.7.8 and manages two k8s clusters, the other is v2.2.0-rc1 and manages one k8s cluster. They all use the same git repo with 50+ apps to sync. The v1.7.8 one is fine with no argument tuning, but v2.2.0-rc1 is not. After trying every solution I could google, I gave up and downgraded to the stable v2.1.7.
Seeing as this issue is temporarily "resolved" by killing the repo-server and redis pods so they get recreated, it's clearly a caching problem.
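What works as a stopgap, sketched out assuming the default `argocd` namespace and the stock non-HA resource names (adjust for HA setups):

```sh
# Recreate the repo-server pods and drop the redis-backed cache
kubectl -n argocd rollout restart deployment argocd-repo-server
kubectl -n argocd delete pod -l app.kubernetes.io/name=argocd-redis
```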
After this week's security upgrade of ArgoCD to v2.2.5 (from v2.2.2 in my case), the repoServer started giving this `helm dependency build` failure, even though v2.2.2 already includes Helm 3.7. I've increased the timeouts and added a 2nd, 3rd and 4th replica, but the repoServer starts to eat all the CPU. When rolling back to v2.2.2 this issue disappears. It works for a while, even when manually forcing a sync of 220 apps, but after some hours it starts failing with the `helm dependency build` failure again.
I suppose you are using the Helm chart to deploy ArgoCD; the problem is related to the chart itself. Try the latest Argo version with chart 3.29.5.
Yes, I'm using the Helm chart to deploy ArgoCD. The latest chart I saw was 3.33.5 (5 Feb 2022). Using this chart version and rolling back to v2.2.2 also fixes the issue. The previous chart version I was using was 3.29.5.
Just try out 3.29.5 and specify image tag 2.2.5; it will work fine. We have been trying to identify the problem in the chart and have some clues, but no confirmation so far.
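If it helps, a sketch of pinning the image while staying on the older chart, assuming the argoproj/argo-helm `argo-cd` chart's values layout:

```yaml
# values.yaml override: chart 3.29.5 templates running the 2.2.5 image
global:
  image:
    tag: v2.2.5
```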
Thank you very much. I will test 2.2.5 + chart 3.29.5.
From what I just saw, the main differences are the `PodDisruptionBudget` and the `copyutil` init container that copies the argocd binary, but they seem to be disabled by default.
> We have been trying to identify the problem in the chart and have some clues, but no confirmation so far.
From what I saw doing a helm diff, some env vars and a new volume are added. But when I was using chart 3.33.5 and Argo v2.2.5 the files were still being generated at `/tmp`, so maybe the repoServer is trying to delete from the new `/helm-working-dir` but the files are never there.
```yaml
env:
  - name: HELM_CACHE_HOME
    value: /helm-working-dir
  - name: HELM_CONFIG_HOME
    value: /helm-working-dir
  - name: HELM_DATA_HOME
    value: /helm-working-dir
volumes:
  - name: helm-working-dir
    emptyDir: {}
```
We're having similar issues.
One thing I would like to add (which is why I'm commenting) is that I believe an application should never get into a permanently broken state just because of a temporary issue like this.
i.e. if I click "Hard Refresh" on the applications that are broken like this, they become fixed. I don't want to have to do that manually; it should just sort itself out.
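Until that's the case, the manual workaround can at least be scripted, since a hard refresh bypasses the cached manifests. A sketch with the `argocd` CLI (`my-app` is a placeholder app name):

```sh
argocd app get my-app --hard-refresh
```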
Hi, long thread here. Is there a TL;DR? Can I tell ArgoCD to ignore Helm chart dependencies so that the sync will apply changes to other components?
I am also facing the same issue, with the below error on ArgoCD:

```
ComparisonError: rpc error: code = Unknown desc = Manifest generation error (cached): open /tmp/https___github.com_atlanhq_cloud-common/platform/k8/ui/Chart.yaml: no such file or directory
```

This sync error was not present earlier, but when I added the missing Chart.yaml it worked!
These are two terrible problems after working with ArgoCD for two years, from version 1.7.x through 2.3.x, especially in urgent deploy situations: `helm dependency build` failures and "timeout after 1m30s" errors. Most of the time I know it is not a performance issue - I followed everything in the official High Availability doc - it is just a cache problem. In these two years there has only ever been a workaround to fix it: recreate argocd-repo-server and flush all data in the Redis cluster. Thank God!
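That workaround, spelled out as a sketch (assumptions: the `argocd` namespace, the non-HA `argocd-redis` Deployment, and no Redis auth configured):

```sh
# Flush every cached manifest, then recreate the repo-server pods
kubectl -n argocd exec deploy/argocd-redis -- redis-cli FLUSHALL
kubectl -n argocd rollout restart deployment argocd-repo-server
```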
Quick note on this one:

```
ComparisonError: rpc error: code = DeadlineExceeded desc = context deadline exceeded
```

That's a generic error message from the golang `context` package, and it just means "a timeout happened somewhere."
We've made efforts recently to always wrap all error messages to provide more context. Hopefully in future versions the reason for the timeout will be much more clear.
Hello, did anyone find a solution? We are experiencing this on ArgoCD version 2.5.6.
+1 here, still happens with v2.6.1: constant `helm dependency build` failure errors, and I cannot see any resource pressure on the repo server.
I have the same. Does it mean I should delete the Chart.lock file in the git repo?

```
rpc error: code = Unknown desc = Manifest generation error (cached): `helm dependency build` failed exit status 1: Error: the lock file (Chart.lock) is out of sync with the dependencies file (Chart.yaml). Please update the dependencies
```
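You shouldn't need to delete it; regenerating the lock file is usually enough to clear that error. A sketch, run from the chart's directory and committed back to the repo:

```sh
helm dependency update    # rewrites Chart.lock to match Chart.yaml
git add Chart.yaml Chart.lock
git commit -m "Sync Chart.lock with Chart.yaml"
```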
Had the same issue, though it was completely unrelated to Helm dependencies. The actual issue was a CronJob template that had a typo in `kind`.
Checklist:

- [x] I've pasted the output of `argocd version`.

Describe the bug

Some `Application`s based on `helm` fail to deploy due to some kind of internal filesystem issue. For example, one of the apps stuck in the `Unknown` state. This doesn't eventually resolve itself; it stays this way...

To Reproduce

I'm not sure how to reproduce; this happens from time to time and causes a complete deadlock.

My `Chart.yaml`

My app-of-apps

My `template`

Expected behavior

The `Application` should be deployed successfully.

Screenshots

Version

Logs

Logs from the `argocd-application-controller`

Logs from `argocd-repo-server`