Open alexanderdalloz opened 3 years ago
Hints about how to debug the error condition any further are highly welcome.
@sbose78 this looks OpenShift-specific. Could someone on your team take a look?
A cluster cache refresh (or recycling the Redis deployment in a non-HA setup) may help here.
Just as a side note: I guess you probably know that the underlying Kubernetes version is pretty outdated (1.11), and this might well be a bug in K8s.
Thanks @jannfis for the hint. Unfortunately, neither resetting the cluster cache, nor restarting the Redis pod, nor even completely re-deploying Argo CD has changed anything regarding the connection failure. The same error message persisted.
We are aware of the age of the k8s release OpenShift 3.11 ships with. Curiously, the described error now occurs for the first time after months of operation, and only on a single cluster, while other OCP 3.11 clusters running Argo CD v2.0.0 work without issues.
Our hope was to get a clue on how to further diagnose the root cause, ideally by getting an idea of what to query against the API server to find out whether anything differs from the other clusters. Maybe you folks could point us to the part of the Argo CD code where the check happens that leads to the error case.
Hey @alexanderdalloz, the error is actually thrown by gitops-engine when loading the available resources on the cluster. It uses the Kubernetes client API to list the resources with a paged approach. You are using Argo CD v2.0.0, which was built against gitops-engine v0.3.2.
You can have a look at the following methods:
Updates to resources: https://github.com/argoproj/gitops-engine/blob/85ff5d98624364c3121ebcd6fd8dfc6aa80a8a6a/pkg/cache/cluster.go#L447
Syncing cluster to cache: https://github.com/argoproj/gitops-engine/blob/85ff5d98624364c3121ebcd6fd8dfc6aa80a8a6a/pkg/cache/cluster.go#L594
I believe this may be a subtle incompatibility or bug that shows up under certain conditions between a more recent Kubernetes client (Argo CD v2.0.0 and gitops-engine v0.3.2 both use Kubernetes 1.20.4 client libraries) and a Kubernetes 1.11 API server.
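For reference, the paged listing looks roughly like the sketch below. This is illustrative only (it uses client-go's dynamic client, a placeholder kubeconfig, an example resource, and an arbitrary page size, not the exact code path in gitops-engine); the relevant part is the continue token handed back with each page, which the API server later rejects as too old:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the local kubeconfig (placeholder setup).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client, err := dynamic.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	// Example resource; gitops-engine iterates over all discovered APIs.
	gvr := schema.GroupVersionResource{Group: "image.openshift.io", Version: "v1", Resource: "images"}

	opts := metav1.ListOptions{Limit: 500} // page size is illustrative
	for {
		page, err := client.Resource(gvr).List(context.TODO(), opts)
		if err != nil {
			// A 410 "Expired" here corresponds to the error in this issue:
			// the continue token references a resourceVersion that etcd
			// has already compacted away.
			panic(err)
		}
		fmt.Printf("received %d items in this page\n", len(page.Items))

		opts.Continue = page.GetContinue()
		if opts.Continue == "" {
			break // last page reached
		}
	}
}
```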
Hi @jannfis, thank you for your last comment. It brought me to some new results for which I'd like to get your opinion and advice.
The exact error message we are seeing in Argo CD comes from this code https://github.com/kubernetes/kubernetes/blob/v1.11.0/staging/src/k8s.io/apiserver/pkg/storage/etcd3/errors.go
which has the explanation:
ErrCompacted is returned by Storage.Entries/Compact when a requested index is unavailable because it predates the last snapshot.
I found in addition on https://etcd.io/docs/v3.2/learning/api/#range
Revision - the point-in-time of the key-value store to use for the range. If revision is less or equal to zero, the range is over the latest key-value store. If the revision is compacted, ErrCompacted is returned as a response.
I interpret the behaviour as etcd taking too much time to answer the API server request (in our case, paging the images on the cluster), so that the requested revision is obsoleted by a new snapshot/compaction in the meantime. Do you agree with this?
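For what it's worth, the etcd-level behaviour from the docs can be illustrated directly with the etcd v3 Go client. This is only a sketch; the endpoint, key prefix, and revision are placeholders, and a real OpenShift etcd would additionally require TLS client certificates:

```go
package main

import (
	"context"
	"fmt"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Placeholder endpoint; a real cluster needs TLS configuration here.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		panic(err)
	}
	defer cli.Close()

	// Read a key range at an old, already compacted revision. The revision
	// and key prefix are placeholders for illustration.
	const oldRev = 1000
	_, err = cli.Get(context.TODO(),
		"/kubernetes.io/images/",
		clientv3.WithPrefix(),
		clientv3.WithRev(oldRev))

	// Expected: "etcdserver: mvcc: required revision has been compacted",
	// which the API server then surfaces as the "continue parameter is
	// too old" message seen in Argo CD.
	fmt.Println(err)
}
```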
Based on this I ran the following oc queries:
$ oc get is -A | wc -l
9038
$ oc adm top imagestreams
Error from server: rpc error: code = ResourceExhausted desc = grpc: message too large (4990678454 bytes)
$ oc get images -A | wc -l
Error from server (Expired): The provided from parameter is too old to display a consistent list result. You must start a new list without the from.
275001
The last one is the most interesting, as it brings up the very same error situation with no Argo CD involved. We will check whether we can prune images on the specific cluster during the Argo CD blackout.
Nonetheless, it would be good if Argo CD did not completely fail in such situations but could tolerate them in some way.
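One possible shape for such tolerance, as a sketch only (the function, page size, and fallback strategy are illustrative assumptions, not current Argo CD behaviour): detect the expired continue token via the API error code and fall back to an un-paged list instead of marking the whole cluster sync as failed. If I am not mistaken, client-go's tools/pager package follows a similar idea.

```go
package listutil

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
)

// ListAllTolerant pages through a resource and, if the continue token has
// expired (HTTP 410 / "Expired"), retries once with a full, un-paged list
// instead of failing the whole sync. Hypothetical helper for illustration.
func ListAllTolerant(ctx context.Context, client dynamic.Interface, gvr schema.GroupVersionResource) ([]unstructured.Unstructured, error) {
	var items []unstructured.Unstructured
	opts := metav1.ListOptions{Limit: 500} // page size is illustrative
	for {
		page, err := client.Resource(gvr).List(ctx, opts)
		if apierrors.IsResourceExpired(err) {
			// The continue token outlived etcd's compaction window.
			// Discard the partial result and list everything in one go.
			full, ferr := client.Resource(gvr).List(ctx, metav1.ListOptions{})
			if ferr != nil {
				return nil, ferr
			}
			return full.Items, nil
		}
		if err != nil {
			return nil, err
		}
		items = append(items, page.Items...)
		opts.Continue = page.GetContinue()
		if opts.Continue == "" {
			return items, nil
		}
	}
}
```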
Same issue using GKE on k8s 1.20
argocd@argocd-server-574d798f99-75nmb:~$ argocd --server localhost:8080 cluster list
SERVER NAME VERSION STATUS MESSAGE
https://kubernetes.default.svc in-cluster 1.20+ Failed failed to sync cluster https://10.4.128.1:443: failed to load initial state of resource Secret: The provided continue parameter is too old to display a consistent list result. You can start a new list without the continue parameter, or use the continue token in this response to retrieve the remainder of the results. Continuing with the provided token results in an inconsistent list - objects that were created, modified, or deleted between the time the first chunk was returned and now may show up in the list.
Same issue on GKE, k8s 1.21.
Also running into this.
We are having a similar issue in one of our clusters. I did some investigation and this is how far I got:
Configure cluster cache retries. Cluster cache retry configuration was introduced in Argo CD but is disabled by default. This is more of a remediation than a fix, but it could help address the issue in some cases.
@jannfis did you have a positive outcome when configuring the cluster cache with a retry policy?
Argo CD cluster communication goes into a failed state due to a problem enumerating images.
As a consequence, all deployed applications report the same error and are in the state Unknown:
Restarting the cluster API didn't change the situation; the error persists. The Argo CD pod logs do not contain obvious hints about the root cause. Apart from Argo CD, the cluster works without noticeable issues. The same Argo CD deployment (Helm chart based) works on a different cluster (same generation / release).