argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0

Getting random "error getting cached app state: cache: key is missing" #9622

Open · ghost opened this issue 2 years ago

ghost commented 2 years ago

Describe the bug

I have a GitLab pipeline running every night that uses argocd to delete and recreate several resources (to reset an automated test environment).

That pipeline fails about 30-50% of the time because random resources give:

level=fatal msg="rpc error: code = Unknown desc = error getting cached app state: cache: key is missing"

To Reproduce

What I do is:

# pause project so resources actually get deleted and won't be re-synced immediately
argocd proj windows add testapps --applications '*' --duration 24h --kind deny --schedule "0 0 * * *"

# delete services 
argocd app delete-resource service1 --kind "Deployment" --all
argocd app delete-resource service2 --kind "Deployment" --all
argocd app delete-resource service3 --kind "Deployment" --all

# do non argo-related stuff here, setup test database etc

# re-enable argocd syncing
argocd proj windows delete testapps 0

# refresh parent of service1-3 (application of applications), just in case
argocd app get --refresh testapps-projects >/dev/null
argocd app wait testapps-projects --sync --health --timeout 120

# refresh and wait for service1-3 to be ready

argocd app get --refresh service1 >/dev/null
argocd app get --refresh service2 >/dev/null
argocd app get --refresh service3 >/dev/null

argocd app wait service1 --sync --health --timeout 120
argocd app wait service2 --sync --health --timeout 120
argocd app wait service3 --sync --health --timeout 120

Sometimes one of the "app wait" commands fails, sometimes even one of the "app delete-resource" commands at the beginning fails with "cache: key is missing". It is completely random.

I even added an argocd app list -p testapps right before the app delete-resource, and it shows all of services 1-3 as Synced and Healthy, yet deleting the resource still fails.

Things I ruled out:

Expected behavior

Apps listed as Synced and Healthy should not fail when you try to manually delete or refresh them a few seconds later.

Version

```
argocd: v2.3.1+b65c169
  BuildDate: 2022-03-11T00:01:03Z
  GitCommit: b65c1699fa2a2daa031483a3890e6911eac69068
  GitTreeState: clean
  GoVersion: go1.17.6
  Compiler: gc
  Platform: linux/amd64
argocd-server: v2.3.4+ac8b7df
  BuildDate: 2022-05-18T11:41:37Z
  GitCommit: ac8b7df9467ffcc0920b826c62c4b603a7bfed24
  GitTreeState: clean
  GoVersion: go1.17.10
  Compiler: gc
  Platform: linux/amd64
  Ksonnet Version: v0.13.1
  Kustomize Version: v4.4.1 2021-11-11T23:36:27Z
  Helm Version: v3.8.0+gd141386
  Kubectl Version: v0.23.1
  Jsonnet Version: v0.18.0
```

Logs

The server log shows nothing special, just the same error repeated that the argocd client reports.
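
For anyone debugging the same thing, here is one way to check whether the server-side components log the same error (a sketch that assumes the default argocd namespace and the standard deployment names):

```shell
# Grep the last hour of argocd-server and argocd-repo-server logs for the cache error.
kubectl -n argocd logs deployment/argocd-server --since=1h | grep -i "cache: key is missing"
kubectl -n argocd logs deployment/argocd-repo-server --since=1h | grep -i "cache: key is missing"
```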

migueleliasweb commented 2 years ago

My experience has been that the argo components might have come up faster than the redis container, causing a bunch of problems with the cache structure. Usually killing all argocd components but redis works for me.

That said, I'm still looking for a way to properly fix this, as I have a similar issue when provisioning ArgoCD in ephemeral environments.
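
In case it is useful, a rough sketch of that "kill everything but redis" workaround, assuming the default argocd namespace and the component names from the upstream manifests (adjust for your install):

```shell
# Restart every Argo CD component except redis so they reconnect to the
# already-running cache instead of racing it at startup.
kubectl -n argocd rollout restart deployment argocd-server argocd-repo-server argocd-applicationset-controller
kubectl -n argocd rollout restart statefulset argocd-application-controller
```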

ghost commented 2 years ago

I put an argocd app get --refresh appname >/dev/null right before each and every argocd app command (like argocd app delete or argocd app wait). This seems to help.

What puzzles me is that a missing cache entry can cause trouble. To me a cache is something that can vanish anytime, and missing cache entries should only slow things down (because you have to recreate the cached data from slower sources) but never make things fail. But maybe the wording of this error message is just misleading.
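
As a rough sketch, that refresh-before-every-command workaround can be wrapped in a small helper (argocd_do is a hypothetical name, not part of the argocd CLI):

```shell
# Hypothetical wrapper: refresh the app's cached state right before running
# any "argocd app" subcommand against it.
argocd_do() {
  local app="$1" subcmd="$2"; shift 2
  argocd app get --refresh "$app" >/dev/null
  argocd app "$subcmd" "$app" "$@"
}

# Usage examples:
argocd_do service1 delete-resource --kind "Deployment" --all
argocd_do service1 wait --sync --health --timeout 120
```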

migueleliasweb commented 2 years ago

Yeah, I agree with you. The cache in this case is more like a critical piece of infrastructure. It's odd it has been coded this way.

gerbal commented 1 year ago

We're seeing the same error throughout the ArgoCD GUI and CLI. A number of basic functions are broken by this error, including previewing change diffs and running argocd app manifests.

dgrezza commented 1 year ago

I faced a similar issue on ArgoCD v2.5.4. I've tried replacing Redis and restarting all ArgoCD-related services, but it does not help. Is there any fix or permanent solution for this?

lucasoarruda commented 1 year ago

Same here, using Argo CD core, the UI, and the CLI.

crenshaw-dev commented 1 year ago

> Yeah, I agree with you. The cache in this case is more like a critical piece of infrastructure. It's odd it has been coded this way.

I believe the intent has always been for everything to work even without Redis. But clearly something or some things were not coded according to that intent.

rumstead commented 1 year ago

We also occasionally see this during cluster/node upgrades.

The issues below feel related:

1. #10554
2. #12970

It also looks like the log line changed.

shay-ul commented 1 year ago

In my case, this issue was happening because not all ArgoCD components were running the same image version. Some components had the :latest tag instead of a pinned ArgoCD image version, and after fixing this everything came back to normal. Make sure all ArgoCD components run the exact same image version.
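
A quick way to verify that, assuming everything runs in the default argocd namespace:

```shell
# Print the image used by each Argo CD workload; all tags should match exactly.
kubectl -n argocd get deployments,statefulsets \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.template.spec.containers[*].image}{"\n"}{end}'
```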

rumstead commented 1 year ago

> In my case, this issue was happening because not all ArgoCD components were running the same image version. Some components had the :latest tag instead of a pinned ArgoCD image version, and after fixing this everything came back to normal. Make sure all ArgoCD components run the exact same image version.

I have a feeling that changing the image version and bouncing the pods is what fixed the issue.

PRNDA commented 7 months ago

> In my case, this issue was happening because not all ArgoCD components were running the same image version. Some components had the :latest tag instead of a pinned ArgoCD image version, and after fixing this everything came back to normal. Make sure all ArgoCD components run the exact same image version.

It works! Thank you my hero!