Open 0dragosh opened 2 years ago
Adding a screenshot to better showcase what I mean, maybe it's not clear.
Looked at this today with @0dragosh. The bug only occurs on instances using Azure AKS's instance start/stop feature. We cleared the resource-tree
key for this app in Redis, and the controller re-populated the resource tree with the phantom Pods. We confirmed that kubectl get po
shows the Pods as no longer existing on the cluster.
So this looks like a bug in how the Argo CD controller populates the resource tree.
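For reference, a sketch of how one might inspect those cached entries before clearing them (the key pattern and the Redis Deployment name are assumptions and vary by install and version; verify with a SCAN before deleting anything):

```shell
# List candidate keys first (pattern is an assumption; adjust to what you actually see):
kubectl -n argocd exec deploy/argocd-redis -- redis-cli --scan --pattern '*resource-tree*'

# Then delete a specific key only after confirming it belongs to the affected app:
# kubectl -n argocd exec deploy/argocd-redis -- redis-cli del '<key-from-scan>'
```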
We had the same use case in a very old GKE cluster (it was created a long time ago but has been upgraded to 1.20). My thought was that Argo CD uses a different API than kubectl get.
@0dragosh can it be related to an old cluster?
@alexmt did you have any luck reproducing this?
Running into this as well, each sync creates more and more. revisionHistoryLimit is set to 3 but I currently have 23 and counting sitting around.
@rs-cole as a workaround, we found that a Force Refresh cleared the old pods.
Plus a restart of the argocd application controller.
~Hitting Refresh -> Hard Refresh and restarting the application controller didn't seem to do the trick. Old non-existent ReplicaSets continued to grow, even though the application's revisionHistoryLimit is set to 3. Perhaps I'm using Argo CD incorrectly?~
Disregard, my reading comprehension is terrible. Fixed my issue with: https://kubernetes.io/docs/concepts/workloads/controllers/deployment/#clean-up-policy
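For anyone else landing here: the clean-up behavior linked above is controlled on the Deployment spec itself. A minimal sketch (name and image are hypothetical):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-app            # hypothetical name
spec:
  revisionHistoryLimit: 3      # keep only the 3 most recent old ReplicaSets
  selector:
    matchLabels:
      app: example-app
  template:
    metadata:
      labels:
        app: example-app
    spec:
      containers:
        - name: app
          image: example/app:1.0   # hypothetical image
```

Note this limits real old ReplicaSets kept by Kubernetes; it does not remove the ghost resources this issue is about, which Kubernetes has already deleted.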
We are experiencing it with EKS as well
Just encountered it at Intuit for (afaik) the first time. A force refresh and controller restart cleared the error.
Argo CD version: v2.4.6 Kubernetes provider: EKS Kubernetes version: v1.21.12-eks-a64ea69
We encountered a similar issue with AKS.
We have a CronJob that creates Job + Pod resources. Those resources are automatically deleted due to either the ttlSecondsAfterFinished, successfulJobsHistoryLimit, or failedJobsHistoryLimit configuration of the CronJob.
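For context, those three settings live on the CronJob/Job spec; a minimal sketch (name, schedule, and image are hypothetical):

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: example-cron              # hypothetical name
spec:
  schedule: "*/5 * * * *"
  successfulJobsHistoryLimit: 3   # keep the 3 most recent successful Jobs
  failedJobsHistoryLimit: 1       # keep the most recent failed Job
  jobTemplate:
    spec:
      ttlSecondsAfterFinished: 300  # delete the Job (and its Pods) 5 min after it finishes
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: task
              image: example/task:1.0   # hypothetical image
```

Any of these can trigger the server-side deletions that Argo CD then fails to notice.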
Quite often, ArgoCD continues displaying those deleted resources for months after they were deleted, both for failed and for successful job executions.
Analysis of Argo CD's API calls (from the browser's log):
- managed-resources API with the resource name: the response is an empty JSON.
- resource API with the resource name: the response is 404.
Regarding the Hard Refresh: is it possible to make it work without restarting the controller?
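For anyone wanting to reproduce that analysis outside the browser, the same calls can be made against the Argo CD server API. The server address, app name, resource identifiers, and token below are placeholders, and the exact query parameters on the resource endpoint can vary by version:

```shell
# Managed resources for the app (returned an empty JSON body for the ghost resource):
curl -H "Authorization: Bearer $ARGOCD_TOKEN" \
  "https://argocd.example.com/api/v1/applications/my-app/managed-resources"

# Live resource lookup (returned 404 for the ghost resource):
curl -H "Authorization: Bearer $ARGOCD_TOKEN" \
  "https://argocd.example.com/api/v1/applications/my-app/resource?name=my-pod&namespace=default&version=v1&kind=Pod"
```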
Same problem here with an AKS start/stop cluster.
Is there a workaround to remove old/ghost nodes without restarting the Argo CD controller (via the API or similar)?
@joseaio I could not figure out another workaround
Again on 2.5.7.
We are seeing the exact same behavior. Argo CD shows ghost resources that were deleted days or weeks ago, and the UI does not expose any means of removing them (delete obviously doesn't work).
We're encountering this as well, specifically with cronjobs.
Argo CD version: v2.5.10 Kubernetes provider: GKE Kubernetes server version: v1.24.9-gke.3200
The workaround to stop the old pods from surfacing in ArgoCD is, as stated earlier in this thread:
Sorry for the silly questions, but I'm coming up empty googling...
Is Force Refresh the same as Hard Refresh, via the ArgoCD UI?
How do you restart the argocd application controller? https://argo-cd.readthedocs.io/en/stable/operator-manual/server-commands/argocd-application-controller/ doesn't say....
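To answer my own second question from poking around: the application controller ships as a StatefulSet, so restarting it is just a rollout restart (the argocd namespace is the default install location; yours may differ):

```shell
# Restart the application controller (default install namespace assumed to be "argocd"):
kubectl -n argocd rollout restart statefulset argocd-application-controller
```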
@crenshaw-dev we're seeing this on a daily basis and would prefer not to resort to cycling the app controller every day. Is there any info we can provide to help diagnose the cause?
Is Force Refresh the same as Hard Refresh, via the ArgoCD UI?
@CPlommer yep, they're the same!
@mikesmitty honestly, I'm stumped. I mean I know that the app controller is filling Redis with "resource tree" values that include ghost resources. But I have no clue where to start figuring out why the app controller thinks they still exist.
When the app controller starts up, it launches a bunch of watches to keep track of what's happening on the destination cluster (like Pods being deleted). So I think either the app controller isn't correctly handling the updates, or k8s isn't correctly sending them. I tend to think it's the former.
And it's possible that the app controller is logging those failures, but I'm not sure what kind of messages to look for. I'd have to start with "any error or warn messages" and see if anything looks suspicious.
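As a starting point for that search, here's a minimal sketch of scanning controller output for watch-related errors. It assumes JSON-formatted log lines with level/msg/error fields like the messages quoted in this thread; real field names may differ by Argo CD version, and the sample lines are illustrative:

```python
import json

# Sample controller log lines (format assumed from messages quoted in this thread).
raw_logs = [
    '{"error":"the server rejected our request for an unknown reason",'
    '"level":"error","msg":"Failed to start missing watch","server":"https://example"}',
    '{"level":"info","msg":"Refreshing app status"}',
]

def watch_errors(lines):
    """Return error-level log entries whose message mentions a watch."""
    hits = []
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON lines
        if entry.get("level") == "error" and "watch" in entry.get("msg", "").lower():
            hits.append(entry)
    return hits

for e in watch_errors(raw_logs):
    print(e["msg"], "-", e["error"])
```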
Now that you mention it, our issue is caused by a few particularly large apps that somewhat routinely cause ArgoCD to get throttled by Anthos Connect Gateway when syncing the app to remote clusters. The connect gateway api is essentially a reverse proxy for the control plane that, for better or worse, uses control plane errors to throttle (e.g. "the server has received too many requests and has asked us to try again later" or "the server rejected our request for an unknown reason"). The quota on these clusters is pretty high so the throttling has largely been only a minor nuisance when devs sync a large number of apps at once. Looking for errors around watches in the app controller logs, I did find these messages however:
{"error":"the server rejected our request for an unknown reason", "level":"error", "msg":"Failed to start missing watch", "server":"https://connectgateway.googleapis.com/..."}
{"error":"error getting openapi resources: the server rejected our request for an unknown reason", "level":"error", "msg":"Failed to reload open api schema", "server":"https://connectgateway.googleapis.com/..."}
"Watch failed" err="the server rejected our request for an unknown reason"
sourceLocation: {
file: "retrywatcher.go"
line: "130"
}
{"error":"the server has received too many requests and has asked us to try again later", "level":"error", "msg":"Failed to start missing watch", "server":"https://connectgateway.googleapis.com/..."}
{"error":"error getting openapi resources: the server has received too many requests and has asked us to try again later", "level":"error", "msg":"Failed to reload open api schema", "server":"https://connectgateway.googleapis.com/..."}
"Watch failed" err="the server has received too many requests and has asked us to try again later"
sourceLocation: {
file: "retrywatcher.go"
line: "130"
}
If you need more verbose logs let me know. I'd prefer to not turn on debug logging on this ArgoCD instance due to volume, but I think I might be able to artificially induce the errors on a test cluster.
While doing some searching around I found this issue that seems like it could be tangentially related as well: https://github.com/argoproj/argo-cd/issues/9339
Latest theory: we miss watch events, Argo CD goes on blissfully unaware that the resources have been deleted.
I still don't know why the 24hr full-resync doesn't clear the old resources. https://github.com/argoproj/gitops-engine/blob/ed70eac8b7bd6b2f276502398fdbccccab5d189a/pkg/cache/cluster.go#L712
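To make the theory concrete, here's a toy simulation (not Argo CD code) of why a dropped watch DELETE event leaves a ghost behind, and why a periodic resync only heals the cache if it replaces the cached set rather than merging the fresh list into it:

```python
def apply_watch_event(cache, event, obj):
    """Apply a single watch event to a naive resource cache."""
    if event in ("ADDED", "MODIFIED"):
        cache[obj] = True
    elif event == "DELETED":
        cache.pop(obj, None)

cache = {}
apply_watch_event(cache, "ADDED", "pod-old")
apply_watch_event(cache, "ADDED", "pod-new")
# The DELETE for pod-old is dropped (e.g. the watch was throttled or restarted
# mid-stream), so the cache keeps a ghost entry.
live_cluster = {"pod-new"}  # what a fresh list call would return

def resync_merge(cache, live):
    # Only adds what the list call returned; never removes anything.
    for obj in live:
        cache[obj] = True

def resync_replace(cache, live):
    # Drops everything, then repopulates from the list.
    cache.clear()
    for obj in live:
        cache[obj] = True

resync_merge(cache, live_cluster)
print(sorted(cache))   # ghost 'pod-old' survives a merge-style resync

resync_replace(cache, live_cluster)
print(sorted(cache))   # a replace-style resync clears the ghost
```

If the 24hr full-resync in cluster.go behaves more like the merge variant for some resource kinds, that would be consistent with ghosts surviving indefinitely.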
@ashutosh16 do you have a link to the branch where you modified the code to reproduce this?
Hi, we are using kube-oidc-proxy and facing a somewhat similar issue. In our case, we see a dip in argocd_app_reconcile_count, and some of the resources (Pods, ReplicaSets) in each application are not shown in the Argo CD UI but are present in the cluster; sometimes it shows older data.
Whenever there is a dip in the reconcile count, we have found the below logs in the OIDC proxy pod:
E0613 16:02:42.441439 1 proxy.go:215] unable to authenticate the request via TokenReview due to an error:() rate: Wait(n=1) would exceed context deadline
At the same time, the ArgoCD application-controller throws the below error:
{ "error": "unable to retrieve the complete list of server APIs: apm.k8s.elastic.co/v1: the server has asked for the client to provide credentials, apm.k8s.elastic.co/v1beta1: the server has asked for the client to provide credentials, auto.gke.io/v1: the server has asked for the client to provide credentials, auto.gke.io/v1alpha1: the server has asked for the client to provide credentials, billingbudgets.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, binaryauthorization.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, cloudbuild.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, cloudfunctions.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, cloudscheduler.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, configcontroller.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, core.strimzi.io/v1beta2: the server has asked for the client to provide credentials, custom.metrics.k8s.io/v1beta1: the server has asked for the client to provide credentials, external.metrics.k8s.io/v1beta1: the server has asked for the client to provide credentials, k8s.nginx.org/v1: the server has asked for the client to provide credentials, kafka.strimzi.io/v1alpha1: the server has asked for the client to provide credentials, kafka.strimzi.io/v1beta1: the server has asked for the client to provide credentials, kafka.strimzi.io/v1beta2: the server has asked for the client to provide credentials, keda.sh/v1alpha1: the server has asked for the client to provide credentials, kiali.io/v1alpha1: the server has asked for the client to provide credentials, kibana.k8s.elastic.co/v1: the server has asked for the client to provide credentials, kms.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, 
logging.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, monitoring.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, monitoring.coreos.com/v1alpha1: the server has asked for the client to provide credentials, networkconnectivity.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, networking.istio.io/v1alpha3: the server has asked for the client to provide credentials, networking.istio.io/v1beta1: the server has asked for the client to provide credentials, networkservices.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, osconfig.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, recaptchaenterprise.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, servicenetworking.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, serviceusage.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, sourcerepo.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, spanner.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, storage.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, storagetransfer.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, vpcaccess.cnrm.cloud.google.com/v1beta1: the server has asked for the client to provide credentials, wgpolicyk8s.io/v1alpha2: the server has asked for the client to provide credentials", "level": "error", "msg": "Partial success when performing preferred resource discovery", "server": "https://cluster-1.data.int, "time": "2023-06-13T16:02:53Z" }
Attaching Graph of argocd_app_k8s_request_total:
I saw this issue today for the first time in our setup at @swisspost, on a DaemonSet Pod of Aqua Security. There was a "ghost" Pod which I could not remove.
The only thing that helped was:
$ kubectl rollout restart sts/argocd-application-controller
statefulset.apps/argocd-application-controller restarted
After this the Pod was gone.
Before restarting the application-controller I tried:
We are currently on version "v2.8.2+dbdfc71".
Hi,
For information: we faced this issue too, on Argo CD v2.10.7+b060053.
Hi,
We're also seeing this same issue. It only occurs on deployments that contain a significant number of pods (1k plus). Our other, smaller deployments are fine.
Issue on EKS with Argo CD v2.11.3: the pod list from kubectl get pods differs from what Argo CD displays. I snooped the eventsource response and saw that it's populated with old stale data (ghost pods). A hard refresh took an extremely long time and did not help.
Checklist:
- [x] I've pasted the output of argocd version.
Describe the bug
The Argo CD UI shows old Pods (that no longer exist) from old ReplicaSets as healthy. When you open the details of those Pods, the view is empty, and delete errors out, because they haven't existed in a while.
It also correctly shows the new ReplicaSet with the new Pods, in parallel with the old ones. I've tried a hard refresh to no avail; running Argo CD HA.
To Reproduce
Deploy a new revision, creating a new replicaset.
Expected behavior
Old pods that don't exist in the Kubernetes API should not show up in the UI
Version
I suspect it's got something to do with improper cache invalidation on the Redis side.