argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0

ArgoCD cannot connect to AWS EKS 1.29, failed to load initial state of resource AWSManagedControlPlane.controlplane.cluster.x-k8s.io #19889

Open andrea-avanzi opened 2 months ago

andrea-avanzi commented 2 months ago

Checklist:

Describe the bug

Argo CD cannot sync the demo app and cannot connect to the cluster: the connection fails with "failed to load initial state of resource AWSManagedControlPlane.controlplane.cluster.x-k8s.io".

To Reproduce

I've already checked https://bit.ly/argocd-faq, but I couldn't try the steps under "Argo CD is unable to connect to my cluster, how do I troubleshoot it?" because kubectl is missing from the argocd-server pod.

Expected behavior

Argo CD connects to the cluster and syncs the demo app.

Screenshots

Version

argocd: v2.11.7+e4a0246
  BuildDate: 2024-07-24T10:10:59Z
  GitCommit: e4a0246c4d920bc1e5ee5f9048a99eca7e1d53cb
  GitTreeState: clean
  GoVersion: go1.21.12
  Compiler: gc
  Platform: darwin/amd64
argocd-server: v2.11.7+e4a0246
  BuildDate: 2024-07-24T09:33:49Z
  GitCommit: e4a0246c4d920bc1e5ee5f9048a99eca7e1d53cb
  GitTreeState: clean
  GoVersion: go1.21.10
  Compiler: gc
  Platform: linux/amd64
  Kustomize Version: v5.2.1 2023-10-19T20:13:51Z
  Helm Version: v3.14.4+g81c902a
  Kubectl Version: v0.26.11
  Jsonnet Version: v0.20.0

Logs

[
  {
    "server": "https://kubernetes.default.svc",
    "name": "in-cluster",
    "config": {
      "tlsClientConfig": {
        "insecure": false
      }
    },
    "connectionState": {
      "status": "Failed",
      "message": "failed to sync cluster https://10.100.0.1:443: failed to load initial state of resource AWSManagedControlPlane.controlplane.cluster.x-k8s.io: Internal error occurred: error resolving resource",
      "attemptedAt": "2024-09-11T11:35:35Z"
    },
    "serverVersion": "1.29+",
    "info": {
      "connectionState": {
        "status": "Failed",
        "message": "failed to sync cluster https://10.100.0.1:443: failed to load initial state of resource AWSManagedControlPlane.controlplane.cluster.x-k8s.io: Internal error occurred: error resolving resource",
        "attemptedAt": "2024-09-11T11:35:35Z"
      },
      "serverVersion": "1.29+",
      "cacheInfo": {},
      "applicationsCount": 1,
      "apiVersions": [
        "acme.cert-manager.io/v1",
        "acme.cert-manager.io/v1/Challenge",
        "acme.cert-manager.io/v1/Order",
        "addons.cluster.x-k8s.io/v1beta1",
        "addons.cluster.x-k8s.io/v1beta1/ClusterResourceSet",
        "addons.cluster.x-k8s.io/v1beta1/ClusterResourceSetBinding",
        "admissionregistration.k8s.io/v1",
        "admissionregistration.k8s.io/v1/MutatingWebhookConfiguration",
        "admissionregistration.k8s.io/v1/ValidatingWebhookConfiguration",
        "apiextensions.k8s.io/v1",
        "apiextensions.k8s.io/v1/CustomResourceDefinition",
        "apiregistration.k8s.io/v1",
        "apiregistration.k8s.io/v1/APIService",
        "apps/v1",
        "apps/v1/ControllerRevision",
        "apps/v1/DaemonSet",
        "apps/v1/Deployment",
        "apps/v1/ReplicaSet",
        "apps/v1/StatefulSet",
        "argoproj.io/v1alpha1",
        "argoproj.io/v1alpha1/AppProject",
        "argoproj.io/v1alpha1/Application",
        "argoproj.io/v1alpha1/ApplicationSet",
        "autoscaling/v1",
        "autoscaling/v1/HorizontalPodAutoscaler",
        "autoscaling/v2",
        "autoscaling/v2/HorizontalPodAutoscaler",
        "batch/v1",
        "batch/v1/CronJob",
        "batch/v1/Job",
        "bootstrap.cluster.x-k8s.io/v1beta1",
        "bootstrap.cluster.x-k8s.io/v1beta1/KubeadmConfig",
        "bootstrap.cluster.x-k8s.io/v1beta1/KubeadmConfigTemplate",
        "bootstrap.cluster.x-k8s.io/v1beta2",
        "bootstrap.cluster.x-k8s.io/v1beta2/EKSConfig",
        "bootstrap.cluster.x-k8s.io/v1beta2/EKSConfigTemplate",
        "cert-manager.io/v1",
        "cert-manager.io/v1/Certificate",
        "cert-manager.io/v1/CertificateRequest",
        "cert-manager.io/v1/ClusterIssuer",
        "cert-manager.io/v1/Issuer",
        "certificates.k8s.io/v1",
        "certificates.k8s.io/v1/CertificateSigningRequest",
        "cluster.x-k8s.io/v1beta1",
        "cluster.x-k8s.io/v1beta1/Cluster",
        "cluster.x-k8s.io/v1beta1/ClusterClass",
        "cluster.x-k8s.io/v1beta1/Machine",
        "cluster.x-k8s.io/v1beta1/MachineDeployment",
        "cluster.x-k8s.io/v1beta1/MachineHealthCheck",
        "cluster.x-k8s.io/v1beta1/MachinePool",
        "cluster.x-k8s.io/v1beta1/MachineSet",
        "clusterctl.cluster.x-k8s.io/v1alpha3",
        "clusterctl.cluster.x-k8s.io/v1alpha3/Provider",
        "controlplane.cluster.x-k8s.io/v1beta1",
        "controlplane.cluster.x-k8s.io/v1beta1/KubeadmControlPlane",
        "controlplane.cluster.x-k8s.io/v1beta1/KubeadmControlPlaneTemplate",
        "controlplane.cluster.x-k8s.io/v1beta2",
        "controlplane.cluster.x-k8s.io/v1beta2/AWSManagedControlPlane",
        "controlplane.cluster.x-k8s.io/v1beta2/ROSAControlPlane",
        "coordination.k8s.io/v1",
        "coordination.k8s.io/v1/Lease",
        "crd.k8s.amazonaws.com/v1alpha1",
        "crd.k8s.amazonaws.com/v1alpha1/ENIConfig",
        "discovery.k8s.io/v1",
        "discovery.k8s.io/v1/EndpointSlice",
        "elbv2.k8s.aws/v1alpha1",
        "elbv2.k8s.aws/v1alpha1/TargetGroupBinding",
        "elbv2.k8s.aws/v1beta1",
        "elbv2.k8s.aws/v1beta1/IngressClassParams",
        "elbv2.k8s.aws/v1beta1/TargetGroupBinding",
        "events.k8s.io/v1",
        "events.k8s.io/v1/Event",
        "flowcontrol.apiserver.k8s.io/v1",
        "flowcontrol.apiserver.k8s.io/v1/FlowSchema",
        "flowcontrol.apiserver.k8s.io/v1/PriorityLevelConfiguration",
        "flowcontrol.apiserver.k8s.io/v1beta3",
        "flowcontrol.apiserver.k8s.io/v1beta3/FlowSchema",
        "flowcontrol.apiserver.k8s.io/v1beta3/PriorityLevelConfiguration",
        "infrastructure.cluster.x-k8s.io/v1beta2",
        "infrastructure.cluster.x-k8s.io/v1beta2/AWSCluster",
        "infrastructure.cluster.x-k8s.io/v1beta2/AWSClusterControllerIdentity",
        "infrastructure.cluster.x-k8s.io/v1beta2/AWSClusterRoleIdentity",
        "infrastructure.cluster.x-k8s.io/v1beta2/AWSClusterStaticIdentity",
        "infrastructure.cluster.x-k8s.io/v1beta2/AWSClusterTemplate",
        "infrastructure.cluster.x-k8s.io/v1beta2/AWSFargateProfile",
        "infrastructure.cluster.x-k8s.io/v1beta2/AWSMachine",
        "infrastructure.cluster.x-k8s.io/v1beta2/AWSMachinePool",
        "infrastructure.cluster.x-k8s.io/v1beta2/AWSMachineTemplate",
        "infrastructure.cluster.x-k8s.io/v1beta2/AWSManagedCluster",
        "infrastructure.cluster.x-k8s.io/v1beta2/AWSManagedMachinePool",
        "infrastructure.cluster.x-k8s.io/v1beta2/ROSACluster",
        "infrastructure.cluster.x-k8s.io/v1beta2/ROSAMachinePool",
        "ipam.cluster.x-k8s.io/v1alpha1",
        "ipam.cluster.x-k8s.io/v1alpha1/IPAddress",
        "ipam.cluster.x-k8s.io/v1alpha1/IPAddressClaim",
        "ipam.cluster.x-k8s.io/v1beta1",
        "ipam.cluster.x-k8s.io/v1beta1/IPAddress",
        "ipam.cluster.x-k8s.io/v1beta1/IPAddressClaim",
        "k8s.nginx.org/v1",
        "k8s.nginx.org/v1/VirtualServer",
        "k8s.nginx.org/v1/VirtualServerRoute",
        "kube-green.com/v1alpha1",
        "kube-green.com/v1alpha1/SleepInfo",
        "monitoring.coreos.com/v1",
        "monitoring.coreos.com/v1/Alertmanager",
        "monitoring.coreos.com/v1/PodMonitor",
        "monitoring.coreos.com/v1/Probe",
        "monitoring.coreos.com/v1/Prometheus",
        "monitoring.coreos.com/v1/PrometheusRule",
        "monitoring.coreos.com/v1/ServiceMonitor",
        "monitoring.coreos.com/v1/ThanosRuler",
        "monitoring.coreos.com/v1alpha1",
        "monitoring.coreos.com/v1alpha1/AlertmanagerConfig",
        "networking.k8s.aws/v1alpha1",
        "networking.k8s.aws/v1alpha1/PolicyEndpoint",
        "networking.k8s.io/v1",
        "networking.k8s.io/v1/Ingress",
        "networking.k8s.io/v1/IngressClass",
        "networking.k8s.io/v1/NetworkPolicy",
        "node.k8s.io/v1",
        "node.k8s.io/v1/RuntimeClass",
        "policy/v1",
        "policy/v1/PodDisruptionBudget",
        "rbac.authorization.k8s.io/v1",
        "rbac.authorization.k8s.io/v1/ClusterRole",
        "rbac.authorization.k8s.io/v1/ClusterRoleBinding",
        "rbac.authorization.k8s.io/v1/Role",
        "rbac.authorization.k8s.io/v1/RoleBinding",
        "runtime.cluster.x-k8s.io/v1alpha1",
        "runtime.cluster.x-k8s.io/v1alpha1/ExtensionConfig",
        "scheduling.k8s.io/v1",
        "scheduling.k8s.io/v1/PriorityClass",
        "storage.k8s.io/v1",
        "storage.k8s.io/v1/CSIDriver",
        "storage.k8s.io/v1/CSINode",
        "storage.k8s.io/v1/CSIStorageCapacity",
        "storage.k8s.io/v1/StorageClass",
        "storage.k8s.io/v1/VolumeAttachment",
        "v1",
        "v1/ConfigMap",
        "v1/Endpoints",
        "v1/Event",
        "v1/LimitRange",
        "v1/Namespace",
        "v1/Node",
        "v1/PersistentVolume",
        "v1/PersistentVolumeClaim",
        "v1/Pod",
        "v1/PodTemplate",
        "v1/ReplicationController",
        "v1/ResourceQuota",
        "v1/Secret",
        "v1/Service",
        "v1/ServiceAccount",
        "vpcresources.k8s.aws/v1alpha1",
        "vpcresources.k8s.aws/v1alpha1/CNINode",
        "vpcresources.k8s.aws/v1beta1",
        "vpcresources.k8s.aws/v1beta1/SecurityGroupPolicy"
      ]
    }
  }
]

Thanks

nitishfy commented 2 months ago

What error are you getting? Also, have you tried adding the EKS cluster to Argo CD using the argocd cluster add command?
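
For reference, a minimal sketch of registering the cluster from the CLI (the context name my-eks is hypothetical; substitute your own kubeconfig context):

# List the context names available in your kubeconfig
kubectl config get-contexts -o name

# Register the EKS cluster with Argo CD (my-eks is an assumed context name)
argocd cluster add my-eks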

andrea-avanzi commented 2 months ago

Error is "failed to sync cluster https://10.100.0.1:443: failed to load initial state of resource AWSManagedControlPlane.controlplane.cluster.x-k8s.io: Internal error occurred: error resolving resource" I have the same error when using cluster added by argocd cluster add command

andrea-avanzi commented 2 months ago

I have re-created the sequence; these are the argocd-application-controller-0 logs:

time="2024-09-11T15:33:16Z" level=info msg="Processing all cluster shards"
time="2024-09-11T15:33:16Z" level=info msg="Processing all cluster shards"
time="2024-09-11T15:33:16Z" level=info msg="appResyncPeriod=3m0s, appHardResyncPeriod=0s, appResyncJitter=0s"
time="2024-09-11T15:33:16Z" level=info msg="Starting configmap/secret informers"
time="2024-09-11T15:33:17Z" level=info msg="Configmap/secret informer synced"
time="2024-09-11T15:33:17Z" level=warning msg="Cannot init sharding. Error while querying clusters list from database: server.secretkey is missing"
time="2024-09-11T15:33:17Z" level=warning msg="Failed to save clusters info: server.secretkey is missing"
time="2024-09-11T15:33:17Z" level=info msg="0xc000e209c0 subscribed to settings updates"
time="2024-09-11T15:33:17Z" level=info msg="Cluster https://kubernetes.default.svc has been assigned to shard 0"
time="2024-09-11T15:33:17Z" level=info msg="Starting secretInformer forcluster"
time="2024-09-11T15:33:17Z" level=warning msg="Unable to parse updated settings: server.secretkey is missing"
time="2024-09-11T15:33:17Z" level=info msg="Notifying 1 settings subscribers: [0xc000e209c0]"
time="2024-09-11T15:35:28Z" level=info msg="Refreshing app status (spec.source differs), level (3)" application=argocd/guestbook
time="2024-09-11T15:35:28Z" level=info msg="Comparing app state (cluster: https://kubernetes.default.svc, namespace: default)" application=argocd/guestbook
time="2024-09-11T15:35:28Z" level=info msg="Start syncing cluster" server="https://kubernetes.default.svc"
time="2024-09-11T15:35:30Z" level=error msg="Failed to sync cluster" error="failed to load initial state of resource AWSManagedControlPlane.controlplane.cluster.x-k8s.io: Internal error occurred: error resolving resource" server="https://kubernetes.default.svc"
time="2024-09-11T15:35:30Z" level=info msg="Normalized app spec: {\"status\":{\"conditions\":[{\"lastTransitionTime\":\"2024-09-11T15:35:28Z\",\"message\":\"Failed to load live state: failed to get cluster info for \\\"https://kubernetes.default.svc\\\": error synchronizing cache state : failed to sync cluster https://10.100.0.1:443: failed to load initial state of resource AWSManagedControlPlane.controlplane.cluster.x-k8s.io: Internal error occurred: error resolving resource\",\"type\":\"ComparisonError\"},{\"lastTransitionTime\":\"2024-09-11T15:35:28Z\",\"message\":\"Failed to load target state: failed to get cluster version for cluster \\\"https://kubernetes.default.svc\\\": failed to get cluster info for \\\"https://kubernetes.default.svc\\\": error synchronizing cache state : failed to sync cluster https://10.100.0.1:443: failed to load initial state of resource AWSManagedControlPlane.controlplane.cluster.x-k8s.io: Internal error occurred: error resolving resource\",\"type\":\"ComparisonError\"},{\"lastTransitionTime\":\"2024-09-11T15:35:28Z\",\"message\":\"error synchronizing cache state : failed to sync cluster https://10.100.0.1:443: failed to load initial state of resource AWSManagedControlPlane.controlplane.cluster.x-k8s.io: Internal error occurred: error resolving resource\",\"type\":\"UnknownError\"}],\"sync\":{\"comparedTo\":{\"destination\":{},\"source\":{\"repoURL\":\"\"}}}}}" application=argocd/guestbook
time="2024-09-11T15:35:30Z" level=error msg="Failed to cache app resources: error getting resource tree: failed to get app hosts: error synchronizing cache state : failed to sync cluster https://10.100.0.1:443: failed to load initial state of resource AWSManagedControlPlane.controlplane.cluster.x-k8s.io: Internal error occurred: error resolving resource" application=argocd/guestbook dedup_ms=0 dest-name= dest-namespace=default dest-server="https://kubernetes.default.svc" diff_ms=0 fields.level=3 git_ms=2406 health_ms=0 live_ms=0 settings_ms=0 sync_ms=0
time="2024-09-11T15:35:30Z" level=info msg="Updated sync status:  -> Unknown" application=guestbook dest-namespace=default dest-server="https://kubernetes.default.svc" reason=ResourceUpdated type=Normal
time="2024-09-11T15:35:30Z" level=info msg="Updated health status:  -> Healthy" application=guestbook dest-namespace=default dest-server="https://kubernetes.default.svc" reason=ResourceUpdated type=Normal
time="2024-09-11T15:35:30Z" level=info msg="Update successful" application=argocd/guestbook
time="2024-09-11T15:35:30Z" level=info msg="Reconciliation completed" application=argocd/guestbook dedup_ms=0 dest-name= dest-namespace=default dest-server="https://kubernetes.default.svc" diff_ms=0 fields.level=3 git_ms=2406 health_ms=0 live_ms=0 patch_ms=9 setop_ms=0 settings_ms=0 sync_ms=0 time_ms=2447
reggie-k commented 2 months ago

Did you configure sharding, and if so, which algorithm? Is the cluster Argo CD tries to connect to local or remote? Did the issue occur after upgrading the EKS cluster, after upgrading Argo CD, or were both upgraded together? Which versions did you upgrade from and to? And does restarting the controller solve the issue (not recommending that as a workaround, of course; just asking to understand the problem better)?

andrea-avanzi commented 2 months ago

The cluster is local; I use https://kubernetes.default.svc. The issue occurred after I updated both: EKS first, then Argo CD. On EKS I cannot restart the controller. I haven't modified the sharding config in Argo CD; I used the default configuration during installation.

reggie-k commented 2 months ago

It may be a caching issue. Can you connect to Argo CD's Redis and clear the cluster info?
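
For anyone trying this, a minimal sketch assuming the default argocd namespace and service names; Argo CD's Redis holds only disposable cache, so flushing it simply forces a rebuild on the next sync:

# Port-forward the Argo CD Redis service to localhost
kubectl -n argocd port-forward svc/argocd-redis 6379:6379 &

# Flush the cache (if your install sets a Redis password, add -a <password>)
redis-cli -p 6379 FLUSHALL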

sidewinder12s commented 1 month ago

I also just ran into this with v2.10.5+335875d. Remote cluster, default sharding algo (not any of the new ones). One of my engineers installed a custom-developed CRD, and the application controller then threw this error.

It turned out the CRD had Helm templating left inside it, which made it invalid. Once the CRD was corrected, the error cleared for Argo CD.

sambo2021 commented 1 month ago

I have the same problem. Argo CD v2.12; Argo CD's own cluster is EKS v1.29; the remote cluster is EKS v1.30 with Istio v1.22, and API version v1alpha3 is supported. The app's sync status is Unknown, and the last sync failed with the error below, although other apps with similar Istio resources deployed fine:

ComparisonError: Failed to load live state: Get "https://remote-cluster/apis/networking.istio.io/v1alpha3?timeout=32s"

EDIT: It seems that k8s v1.30 and Istio v1.22 no longer work consistently with networking.istio.io/v1alpha3 for DestinationRule, VirtualService, ServiceEntry, and Gateway, while EnvoyFilter still works with v1alpha3, even though https://istio.io/latest/blog/2024/v1-apis/ does not mention a deprecation or replacement of v1alpha3 by v1.
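
A quick way to check whether the API server still serves that group/version (the group and version below are taken from the error above):

# List the resources served under networking.istio.io/v1alpha3
kubectl get --raw /apis/networking.istio.io/v1alpha3

# Show all versions the group advertises, including the preferred one
kubectl get --raw /apis/networking.istio.io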

andrii-korotkov-verkada commented 2 weeks ago

I couldn't find Argo-specific code for the "error resolving resource" message. Do you have any EKS logs that can help?

sidewinder12s commented 2 weeks ago

In my case, it seemed the issue was the control plane/Kubernetes SDK choking on a bad CRD. Once the CRD was corrected, all was well again (so I'd almost say this is out of scope for Argo CD).
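
For anyone else debugging this, a minimal sketch for checking whether a broken CRD is the culprit (the CRD name below matches the resource from the original report):

# Inspect the CRD's status conditions for schema or establishment errors
kubectl get crd awsmanagedcontrolplanes.controlplane.cluster.x-k8s.io -o jsonpath='{.status.conditions}'

# Check whether the CRD declares a conversion webhook whose service may be unreachable
kubectl get crd awsmanagedcontrolplanes.controlplane.cluster.x-k8s.io -o yaml

In the case reported above, an invalid CRD (or, plausibly, an unreachable conversion webhook) surfaced as "Internal error occurred: error resolving resource" whenever a client listed that resource.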