argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0

cluster connection fails with: The provided from parameter is too old to display a consistent list result #6915

Open alexanderdalloz opened 3 years ago

alexanderdalloz commented 3 years ago

Argo CD cluster communication goes into a failed state due to a problem enumerating images.

PS C:\tools> .\argocd.exe cluster list
SERVER                          NAME        VERSION  STATUS  MESSAGE
https://kubernetes.default.svc  in-cluster  1.11+    Failed  failed to sync cluster https://100.68.0.1:443: failed to load initial state of resource Image.image.openshift.io: The provided from parameter is too old to display a consistent list result. You must start a new list without the from.

As a consequence, all deployed applications report the same error and are in state Unknown:

ComparisonError
failed to sync cluster https://100.68.0.1:443: failed to load initial state of resource Image.image.openshift.io: The provided from parameter is too old to display a consistent list result. You must start a new list without the from.
PS C:\tools> .\argocd.exe app get <app>
CONDITION        MESSAGE                                                                                                                                                                                                                                    LAST TRANSITION
ComparisonError  failed to sync cluster https://100.68.0.1:443: failed to load initial state of resource Image.image.openshift.io: The provided from parameter is too old to display a consistent list result. You must start a new list without the from.  2021-08-05 08:14:39 +0200 CEST
ComparisonError  failed to sync cluster https://100.68.0.1:443: failed to load initial state of resource Image.image.openshift.io: The provided from parameter is too old to display a consistent list result. You must start a new list without the from.  2021-08-05 08:14:39 +0200 CEST

Restarting the Cluster API didn't change the situation; the error persists. The Argo CD pod logs do not contain obvious hints about the root cause. Apart from Argo CD, the cluster works without noticeable issues. The same Argo CD deployment (Helm chart based) works on a different cluster of the same generation/release.

PS C:\tools> .\argocd.exe version
argocd: v2.0.0+f5119c0
  BuildDate: 2021-04-07T06:03:59Z
  GitCommit: f5119c06686399134b3f296d44445bcdbc778d42
  GitTreeState: clean
  GoVersion: go1.16
  Compiler: gc
  Platform: windows/amd64
argocd-server: v2.0.0+f5119c0
  BuildDate: 2021-04-07T06:00:33Z
  GitCommit: f5119c06686399134b3f296d44445bcdbc778d42
  GitTreeState: clean
  GoVersion: go1.16
  Compiler: gc
  Platform: linux/amd64
  Ksonnet Version: v0.13.1
  Kustomize Version: v3.9.4 2021-02-09T19:22:10Z
  Helm Version: v3.5.1+g32c2223
  Kubectl Version: v0.20.4
  Jsonnet Version: v0.17.0
alexanderdalloz commented 3 years ago

Hints on how to debug this error condition any further are highly welcome.

jessesuen commented 3 years ago

@sbose78 this looks OpenShift-specific. Could someone on your team take a look?

jannfis commented 3 years ago

A cluster cache refresh (or recycling the Redis deployment in a non-HA setup) may help here.

Just as a side note: you probably know that the underlying Kubernetes version (1.11) is pretty outdated, and this might well be a bug in K8s.

alexanderdalloz commented 3 years ago

Thanks @jannfis for the hint. Unfortunately, neither resetting the cluster cache, nor restarting the Redis pod, nor even completely re-deploying Argo CD has changed anything regarding the connection failure. The same error message persisted.

We are aware of the age of the k8s release OpenShift 3.11 ships with. Curiously, the described error now appears for the first time after months of operation, and only on a single cluster, while other OCP 3.11 clusters running Argo CD v2.0.0 work without issues.

Our hope was to get a clue about how to further diagnose the root cause, probably by getting an idea of what to query against the API server to find out whether anything differs from the other clusters. Maybe you folks could point us to the part of the Argo CD code where the check that leads to this error happens.

jannfis commented 3 years ago

Hey @alexanderdalloz, the error is actually returned by gitops-engine when it loads the available resources on the cluster. It uses the Kubernetes client API to list the resources in a paged fashion. You are using Argo CD v2.0.0, which was built against gitops-engine v0.3.2.

You can have a look at the following methods:

Updates to resources: https://github.com/argoproj/gitops-engine/blob/85ff5d98624364c3121ebcd6fd8dfc6aa80a8a6a/pkg/cache/cluster.go#L447

Syncing cluster to cache: https://github.com/argoproj/gitops-engine/blob/85ff5d98624364c3121ebcd6fd8dfc6aa80a8a6a/pkg/cache/cluster.go#L594

I believe this may be a subtle incompatibility, or a bug that surfaces under certain conditions, between a more recent Kubernetes client (Argo CD 2.0.0 and gitops-engine 0.3.2 both use Kubernetes 1.20.4) and a Kubernetes 1.11 API server.
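
For reference, here is a minimal sketch of the kind of paged LIST that gitops-engine performs when it loads the initial cluster state, written against the client-go dynamic client. This is not the actual Argo CD or gitops-engine code; the kubeconfig path and the page size of 500 are placeholders for illustration.

```go
// Sketch: paged LIST of images.image.openshift.io, roughly the pattern
// gitops-engine uses when syncing the cluster cache (illustrative only).
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"k8s.io/client-go/dynamic"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	config, err := clientcmd.BuildConfigFromFlags("", "/path/to/kubeconfig") // placeholder path
	if err != nil {
		panic(err)
	}
	dyn, err := dynamic.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	gvr := schema.GroupVersionResource{Group: "image.openshift.io", Version: "v1", Resource: "images"}
	opts := metav1.ListOptions{Limit: 500} // fetch the list in chunks
	total := 0
	for {
		page, err := dyn.Resource(gvr).List(context.TODO(), opts)
		if err != nil {
			// With ~275k images, the snapshot behind the continue token can be
			// compacted in etcd before the enumeration finishes; the API server
			// then returns the "too old to display a consistent list result"
			// error quoted above.
			panic(err)
		}
		total += len(page.Items)
		if page.GetContinue() == "" {
			break
		}
		// The continue token pins the list to the snapshot of the first chunk.
		opts.Continue = page.GetContinue()
	}
	fmt.Println("listed", total, "images")
}
```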

alexanderdalloz commented 3 years ago

Hi @jannfis, thank you for your last comment. It led me to some new results on which I would like to get your opinion and advice.

The exact error message we are seeing in Argo CD originates from this code: https://github.com/kubernetes/kubernetes/blob/v1.11.0/staging/src/k8s.io/apiserver/pkg/storage/etcd3/errors.go

which carries the explanation:

ErrCompacted is returned by Storage.Entries/Compact when a requested index is unavailable because it predates the last snapshot.

In addition, I found the following at https://etcd.io/docs/v3.2/learning/api/#range:

Revision - the point-in-time of the key-value store to use for the range. If revision is less or equal to zero, the range is over the latest key-value store. If the revision is compacted, ErrCompacted is returned as a response.

My interpretation of this behaviour is that the request against the API server (in our case paging through the images on the cluster) takes so long that the underlying revision is obsoleted by a compaction in etcd in the meantime. Do you agree?

Based on this, I ran the following oc queries:

$ oc get is -A | wc -l
9038

$ oc adm top imagestreams
Error from server: rpc error: code = ResourceExhausted desc = grpc: message too large (4990678454 bytes)

$ oc get images -A | wc -l
Error from server (Expired): The provided from parameter is too old to display a consistent list result. You must start a new list without the from.
275001

The last one is the most interesting, as it reproduces the very same error with no Argo CD involved. We will check whether we can prune images on this specific cluster during the Argo CD blackout.

Nonetheless, it would be good if Argo CD did not completely fail in such situations but were able to tolerate them in some way.
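
One way a client could tolerate this, sketched here under the assumption that the caller drives the paged list itself (ListAllTolerant and listPage are hypothetical names, not Argo CD or gitops-engine APIs): detect the Expired status reason and restart the enumeration from the beginning without the stale continue token, rather than failing the whole sync.

```go
// Sketch only: restart an expired paged list once instead of failing hard.
package listutil

import (
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
)

// ListAllTolerant pages through a resource via the supplied listPage callback
// and restarts once from scratch if the continue token expires because etcd
// compacted the revision it referred to.
func ListAllTolerant(listPage func(opts metav1.ListOptions) (*unstructured.UnstructuredList, error)) ([]unstructured.Unstructured, error) {
	var items []unstructured.Unstructured
	opts := metav1.ListOptions{Limit: 500}
	restarted := false
	for {
		page, err := listPage(opts)
		if err != nil {
			if apierrors.IsResourceExpired(err) && !restarted {
				// "The provided continue parameter is too old ..." maps to the
				// Expired status reason: drop the token and list again.
				items = nil
				opts = metav1.ListOptions{Limit: 500}
				restarted = true
				continue
			}
			return nil, err
		}
		items = append(items, page.Items...)
		if page.GetContinue() == "" {
			return items, nil
		}
		opts.Continue = page.GetContinue()
	}
}
```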

drekle commented 3 years ago

Same issue using GKE on k8s 1.20

argocd@argocd-server-574d798f99-75nmb:~$ argocd --server localhost:8080 cluster list
SERVER                          NAME        VERSION  STATUS  MESSAGE
https://kubernetes.default.svc  in-cluster  1.20+    Failed  failed to sync cluster https://10.4.128.1:443: failed to load initial state of resource Secret: The provided continue parameter is too old to display a consistent list result. You can start a new list without the continue parameter, or use the continue token in this response to retrieve the remainder of the results. Continuing with the provided token results in an inconsistent list - objects that were created, modified, or deleted between the time the first chunk was returned and now may show up in the list.
sourceful-karlson commented 2 years ago

Same issue GKE k8s 1.21

alexionescu commented 2 years ago

Also running into this.

leoluz commented 1 year ago

We are having a similar issue in one of our clusters. I did some investigation, and this is how far I got:

Rationale

Possible solution

Configure cluster cache retry. Cluster cache retry configuration was introduced in Argo CD, but it is disabled by default. This is more of a remediation, but it could help address the issue in some cases.
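
For illustration only, here is roughly what such a retry policy amounts to: re-run the failed cluster sync when the error carries the Expired (or Gone) status reason, instead of immediately marking the cluster connection as failed. This is not the actual Argo CD configuration knob, and syncCluster is a hypothetical stand-in for the cache sync call.

```go
// Sketch: retry a cluster sync on "expired continue token" style failures.
package retryexample

import (
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/util/retry"
)

func syncWithRetry(syncCluster func() error) error {
	backoff := wait.Backoff{Steps: 3, Duration: 2 * time.Second, Factor: 2.0, Jitter: 0.1}
	retriable := func(err error) bool {
		// Expired covers "the provided continue parameter is too old ..."
		return apierrors.IsResourceExpired(err) || apierrors.IsGone(err)
	}
	return retry.OnError(backoff, retriable, syncCluster)
}
```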

@jannfis did you have a positive outcome configuring the cluster cache with a retry policy?