argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0

Argo UI shows old pods as healthy #9226

Open 0dragosh opened 2 years ago

0dragosh commented 2 years ago


Describe the bug

The Argo CD UI is showing old pods (that no longer exist) from old ReplicaSets as healthy. When you open the details for one of those pods, it's empty, and deleting it errors out, because the pod hasn't existed in a while.

It also shows the new ReplicaSet with its new pods correctly, alongside the old pods. I've tried a hard refresh to no avail. We're running Argo CD in HA mode.

To Reproduce

Deploy a new revision, creating a new replicaset.

Expected behavior

Old pods that no longer exist in the Kubernetes API should not show up in the UI.

Version

argocd: v2.3.3+07ac038.dirty
  BuildDate: 2022-03-30T05:14:36Z
  GitCommit: 07ac038a8f97a93b401e824550f0505400a8c84e
  GitTreeState: dirty
  GoVersion: go1.18
  Compiler: gc
  Platform: darwin/arm64
argocd-server: v2.3.3+07ac038
  BuildDate: 2022-03-30T00:06:18Z
  GitCommit: 07ac038a8f97a93b401e824550f0505400a8c84e
  GitTreeState: clean
  GoVersion: go1.17.6
  Compiler: gc
  Platform: linux/amd64
  Ksonnet Version: v0.13.1
  Kustomize Version: v4.4.1 2021-11-11T23:36:27Z
  Helm Version: v3.8.0+gd141386
  Kubectl Version: v0.23.1
  Jsonnet Version: v0.18.0

I suspect it's got something to do with improper cache invalidation on the Redis side.

0dragosh commented 2 years ago
[screenshot: old_pod]

Adding a screenshot to better showcase what I mean, in case it's not clear.

crenshaw-dev commented 2 years ago

Looked at this today with @0dragosh. The bug only occurs on instances using Azure AKS's instance start/stop feature. We cleared the resource-tree key for this app in Redis, and the controller re-populated the resource tree with the phantom Pods. We confirmed that kubectl get po shows the Pods as no longer existing on the cluster.

So this looks like a bug in how the Argo CD controller populates the resource tree.
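
For anyone who wants to repeat that check, roughly what it looked like; treat this as a sketch, since the Redis service name and the exact key layout depend on the Argo CD version and on whether you run the HA setup, so scan for the key rather than hard-coding it:

$ # port-forward the Argo CD Redis service (non-HA default shown; adjust for your install)
$ kubectl port-forward svc/argocd-redis -n argocd 6379:6379 &
$ # list cache keys mentioning the app, pick the resource-tree entry, then delete it
$ redis-cli --scan --pattern '*<app-name>*' | grep -i tree
$ redis-cli del '<key-from-previous-command>'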

OmerKahani commented 2 years ago

We had the same case in a very old GKE cluster (it was created a long time ago but has been upgraded to 1.20). My thought was that Argo CD uses a different API than kubectl get does.

@0dragosh can it be related to an old cluster?

crenshaw-dev commented 2 years ago

@alexmt did you have any luck reproducing this?

rs-cole commented 2 years ago

Running into this as well; each sync leaves more and more old ReplicaSets around. revisionHistoryLimit is set to 3, but I currently have 23 and counting sitting around.

crenshaw-dev commented 2 years ago

@rs-cole as a workaround, we found that a Force Refresh cleared the old pods.

0dragosh commented 2 years ago

> @rs-cole as a workaround, we found that a Force Refresh cleared the old pods.

Plus a restart of the argocd application controller.
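
Roughly, from the command line (assuming a standard install in the argocd namespace; the hard refresh can also be done from the UI dropdown):

$ # hard refresh the app, ignoring the manifest cache
$ argocd app get <app-name> --hard-refresh
$ # restart the application controller so it rebuilds its cluster cache
$ kubectl rollout restart statefulset argocd-application-controller -n argocd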

rs-cole commented 2 years ago

~Hitting Refresh -> Hard Refresh and restarting the application controller didn't seem to do the trick. Old non-existent ReplicaSets continued to grow; the application's revisionHistoryLimit is set to 3. Perhaps I'm utilizing Argo CD incorrectly?~

Disregard, my reading comprehension is terrible. Fixed my issue with: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#clean-up-policy
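
In other words, capping how many old ReplicaSets the Deployment keeps around. A minimal sketch of the live patch (deployment name and namespace are placeholders; with Argo CD it's usually cleaner to set spec.revisionHistoryLimit in the manifests in Git):

$ kubectl patch deployment <deployment-name> -n <namespace> \
    --type merge -p '{"spec":{"revisionHistoryLimit":3}}'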

ilyastoliar commented 2 years ago

> Looked at this today with @0dragosh. The bug only occurs on instances using Azure AKS's instance start/stop feature. We cleared the resource-tree key for this app in Redis, and the controller re-populated the resource tree with the phantom Pods. We confirmed that kubectl get po shows the Pods as no longer existing on the cluster.
>
> So this looks like a bug in how the Argo CD controller populates the resource tree.

We are experiencing it with EKS as well

crenshaw-dev commented 2 years ago

Just encountered it at Intuit for (afaik) the first time. A force refresh and controller restart cleared the error.

Argo CD version: v2.4.6
Kubernetes provider: EKS
Kubernetes version: v1.21.12-eks-a64ea69

yoshigev commented 1 year ago

We encountered a similar issue with AKS.

We have a CronJob that creates Job + Pod resources. Those resources are automatically deleted due to the ttlSecondsAfterFinished, successfulJobsHistoryLimit, or failedJobsHistoryLimit configuration of the CronJob.
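
For context, a minimal sketch of that kind of CronJob configuration; the name, schedule, image, and limit values are illustrative, not taken from our setup:

$ kubectl apply -f - <<'EOF'
apiVersion: batch/v1
kind: CronJob
metadata:
  name: example-cron               # hypothetical name
spec:
  schedule: "*/15 * * * *"
  successfulJobsHistoryLimit: 1    # old successful Jobs (and their Pods) are pruned
  failedJobsHistoryLimit: 1        # old failed Jobs (and their Pods) are pruned
  jobTemplate:
    spec:
      ttlSecondsAfterFinished: 300 # finished Jobs are garbage-collected after 5 minutes
      template:
        spec:
          restartPolicy: Never
          containers:
            - name: main
              image: busybox
              command: ["sh", "-c", "echo done"]
EOF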

Quite often, ArgoCD continues displaying those deleted resources for months after they were deleted, both for failed and for successful job executions.

Analysis of the API calls of ArgoCD (from the browser's log):

Regarding the Hard Refresh - is it possible to make it work without restarting the controller?

joseaio commented 1 year ago

Same problem here with a start/stop AKS cluster.

Is there a workaround (via the API or similar) to remove the old/ghost nodes without restarting the Argo CD controller?

0dragosh commented 1 year ago

@joseaio I could not figure out another workaround

crenshaw-dev commented 1 year ago

Again on 2.5.7.

h4tt3n commented 1 year ago

We are seeing the exact same behavior. Argo CD shows ghost resources that were deleted days or weeks ago, and the UI does not expose any means of removing them (delete obviously doesn't work).

CPlommer commented 1 year ago

We're encountering this as well, specifically with cronjobs.

Argo CD version: v2.5.10
Kubernetes provider: GKE
Kubernetes server version: v1.24.9-gke.3200

The workaround to stop the old pods from surfacing in ArgoCD is, as stated earlier in this thread:

  1. Force Refresh to clear the old pods
  2. Restart the argocd application controller

Sorry for the silly questions, but I'm coming up empty googling...
Is Force Refresh the same as Hard Refresh in the Argo CD UI? How do you restart the argocd application controller? https://argo-cd.readthedocs.io/en/stable/operator-manual/server-commands/argocd-application-controller/ doesn't say.

mikesmitty commented 1 year ago

@crenshaw-dev we're seeing this on a daily basis and would prefer not to resort to cycling the app controller every day. Is there any info we can provide to help diagnose the cause?

crenshaw-dev commented 1 year ago

> Is Force Refresh the same as Hard Refresh, via the ArgoCD UI?

@CPlommer yep, they're the same!

@mikesmitty honestly, I'm stumped. I mean I know that the app controller is filling Redis with "resource tree" values that include ghost resources. But I have no clue where to start figuring out why the app controller thinks they still exist.

When the app controller starts up, it launches a bunch of watches to keep track of what's happening on the destination cluster (like pods being deleted). So I think either the app controller isn't correctly handling the updates, or k8s isn't correctly sending them. I tend to think it's the former.

And it's possible that the app controller is logging those failures, but I'm not sure what kind of messages to look for. I'd have to start with "any error or warn messages" and see if anything looks suspicious.
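
A rough sketch of that log trawl, assuming the default argocd namespace and the standard app.kubernetes.io/name label on the controller pods (adjust the grep patterns to your log format):

$ kubectl logs -n argocd -l app.kubernetes.io/name=argocd-application-controller --tail=5000 \
    | grep -iE '"level":"(error|warn)"|level=(error|warn)' \
    | grep -i watch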

mikesmitty commented 1 year ago

Now that you mention it, our issue is caused by a few particularly large apps that somewhat routinely cause ArgoCD to get throttled by Anthos Connect Gateway when syncing the app to remote clusters. The connect gateway api is essentially a reverse proxy for the control plane that, for better or worse, uses control plane errors to throttle (e.g. "the server has received too many requests and has asked us to try again later" or "the server rejected our request for an unknown reason"). The quota on these clusters is pretty high so the throttling has largely been only a minor nuisance when devs sync a large number of apps at once. Looking for errors around watches in the app controller logs, I did find these messages however:

{"error":"the server rejected our request for an unknown reason", "level":"error", "msg":"Failed to start missing watch", "server":"https://connectgateway.googleapis.com/..."}
{"error":"error getting openapi resources: the server rejected our request for an unknown reason", "level":"error", "msg":"Failed to reload open api schema", "server":"https://connectgateway.googleapis.com/..."}
"Watch failed" err="the server rejected our request for an unknown reason"
sourceLocation: {
    file: "retrywatcher.go"
    line: "130"
}
{"error":"the server has received too many requests and has asked us to try again later", "level":"error", "msg":"Failed to start missing watch", "server":"https://connectgateway.googleapis.com/..."}
{"error":"error getting openapi resources: the server has received too many requests and has asked us to try again later", "level":"error", "msg":"Failed to reload open api schema", "server":"https://connectgateway.googleapis.com/..."}
"Watch failed" err="the server has received too many requests and has asked us to try again later"
sourceLocation: {
    file: "retrywatcher.go"
    line: "130"
}

If you need more verbose logs let me know. I'd prefer to not turn on debug logging on this ArgoCD instance due to volume, but I think I might be able to artificially induce the errors on a test cluster.

While doing some searching around I found this issue that seems like it could be tangentially related as well: https://github.com/argoproj/argo-cd/issues/9339

crenshaw-dev commented 1 year ago

Latest theory: we miss watch events, and Argo CD goes on blissfully unaware that the resources have been deleted.

I still don't know why the 24hr full-resync doesn't clear the old resources. https://github.com/argoproj/gitops-engine/blob/ed70eac8b7bd6b2f276502398fdbccccab5d189a/pkg/cache/cluster.go#L712

crenshaw-dev commented 1 year ago

@ashutosh16 do you have a link to the branch where you modified the code to reproduce this?

nipun-groww commented 1 year ago

> Now that you mention it, our issue is caused by a few particularly large apps that somewhat routinely cause ArgoCD to get throttled by Anthos Connect Gateway when syncing the app to remote clusters. [...]
>
> While doing some searching around I found this issue that seems like it could be tangentially related as well: #9339

Hi, we are using kube-oidc-proxy and are facing a somewhat similar issue. In our case, we see a dip in argocd_app_reconcile_count, and some of the resources (Pods, ReplicaSets) in each application are not shown in the Argo CD UI even though they are present in the cluster; sometimes it shows older data.

Whenever there is a dip in the reconcile count, we have found the below logs in the OIDC proxy pod:

E0613 16:02:42.441439 1 proxy.go:215] unable to authenticate the request via TokenReview due to an error:() rate: Wait(n=1) would exceed context deadline

At the same time, the ArgoCD application-controller throws the below error: { "error": "unable to retrieve the complete list of server APIs: apm.k8s.elastic.co/v1: the server has asked for the client to provide credentials, apm.k8s.elastic.co/v1beta1: the server has asked for the client to provide credentials, auto.gke.io/v1: the server has asked for the client to provide credentials, auto.gke.io/v1alpha1: the server has asked for the client to provide credentials, billingbudgets.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, binaryauthorization.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, cloudbuild.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, cloudfunctions.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, cloudscheduler.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, configcontroller.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, core.strimzi.io/v1beta2: the server has asked for the client to provide credentials, custom.metrics.k8s.io/v1beta1: the server has asked for the client to provide credentials, external.metrics.k8s.io/v1beta1: the server has asked for the client to provide credentials, k8s.nginx.org/v1: the server has asked for the client to provide credentials, kafka.strimzi.io/v1alpha1: the server has asked for the client to provide credentials, kafka.strimzi.io/v1beta1: the server has asked for the client to provide credentials, kafka.strimzi.io/v1beta2: the server has asked for the client to provide credentials, keda.sh/v1alpha1: the server has asked for the client to provide credentials, kiali.io/v1alpha1: the server has asked for the client to provide credentials, kibana.k8s.elastic.co/v1: the server has asked for the client to provide credentials, kms.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, logging.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, monitoring.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, monitoring.coreos.com/v1alpha1: the server has asked for the client to provide credentials, networkconnectivity.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, networking.istio.io/v1alpha3: the server has asked for the client to provide credentials, networking.istio.io/v1beta1: the server has asked for the client to provide credentials, networkservices.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, osconfig.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, recaptchaenterprise.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, servicenetworking.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, serviceusage.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, sourcerepo.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, spanner.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, storage.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, storagetransfer.cnrm.cloud.google.com/v1beta1: the server has asked for the client to 
provide credentials, vpcaccess.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, wgpolicyk8s.io/v1alpha2: the server has asked for the client to provide credentials", "level": "error", "msg": "Partial success when performing preferred resource discovery", "server": "https://cluster-1.data.int", "time": "2023-06-13T16:02:53Z" }

Attaching a graph of argocd_app_k8s_request_total:

mkilchhofer commented 11 months ago

I saw this issue today for the first time in our setup at @swisspost, on an Aqua Security DaemonSet Pod. There was a "ghost" Pod which I could not remove.

The only thing that helped was to do:

$ kubectl rollout restart sts/argocd-application-controller
statefulset.apps/argocd-application-controller restarted

After this the Pod was gone.

Before restarting the application-controller I tried:

We are currently on version "v2.8.2+dbdfc71".

dtrouillet commented 3 months ago

Hi,

For information, we faced this issue too, on Argo CD v2.10.7+b060053.

RoyerRamirez commented 1 month ago

Hi,

We're also seeing this same issue. It only occurs on deployments that contain a significant number of pods (1k plus). Our other, smaller deployments are fine.

jmmclean commented 1 week ago

Seeing this issue on EKS with Argo CD v2.11.3: the pods returned by kubectl get pods are different from what Argo CD displays. I snooped the eventsource response and saw that it's populated with old, stale data (ghost pods). A hard refresh took an extremely long time and did not help.