aws / eks-anywhere

Run Amazon EKS on your own infrastructure 🚀
https://anywhere.eks.amazonaws.com
Apache License 2.0

installing gitops fails occasionally with connection refused on eksa webhook #3199

Open maxdrib opened 2 years ago

maxdrib commented 2 years ago

What happened: I’ve seen the CloudStackLegacyFlux e2e test fail a number of times now (seemingly at random) with the following error:

```
2022-08-26T03:50:36.132Z        V0      ❌ Error when installing GitOps toolkits on workload cluster; EKS-A will continue with cluster creation, but GitOps will not be enabled {"error": "installing GitHub gitops: executing flux bootstrap github:
► connecting to github.com
► cloning branch "main" from Git repository "https://github.com/that-jetpack-guy/spacecraft-aws-eks-anywhere-test-3be30196-e0c1-457e-93d7-98de27788052-129.git"
✔ cloned repository
► generating component manifests
✔ generated component manifests
✔ component manifests are up to date
► installing components in "default" namespace
✔ installed components
✔ reconciled components
► determining if source secret "default/flux-system" exists
✔ source secret up to date
► generating sync manifests
✔ generated sync manifests
✔ sync manifests are up to date
► applying sync manifests
✔ reconciled sync configuration
◎ waiting for Kustomization "default/default" to be reconciled
✗ CloudStackDatacenterConfig/default/main-i-0ff4e-5b8dff2 apply failed, error: Internal error occurred: failed calling webhook "validation.cloudstackdatacenterconfig.anywhere.amazonaws.com": Post "https://eksa-webhook-service.eksa-system.svc:443/validate-anywhere-eks-amazonaws-com-v1alpha1-cloudstackdatacenterconfig?timeout=10s": dial tcp 10.109.213.213:443: connect: connection refused
CustomResourceDefinition/alerts.notification.toolkit.fluxcd.io configured
CustomResourceDefinition/buckets.source.toolkit.fluxcd.io configured
CustomResourceDefinition/gitrepositories.source.toolkit.fluxcd.io configured
CustomResourceDefinition/helmcharts.source.toolkit.fluxcd.io configured
CustomResourceDefinition/helmreleases.helm.toolkit.fluxcd.io configured
CustomResourceDefinition/helmrepositories.source.toolkit.fluxcd.io configured
CustomResourceDefinition/kustomizations.kustomize.toolkit.fluxcd.io configured
CustomResourceDefinition/providers.notification.toolkit.fluxcd.io configured
CustomResourceDefinition/receivers.notification.toolkit.fluxcd.io configured
Namespace/default configured

► confirming components are healthy
✔ helm-controller: deployment ready
✔ kustomize-controller: deployment ready
✔ notification-controller: deployment ready
✔ source-controller: deployment ready
✔ all components are healthy
✗ bootstrap failed with 1 health check failure(s)
"}
```

which indicates to me that we are installing GitOps before the eksa webhooks are available. We should wait to install GitOps until after the eksa pod is ready, so that the webhooks can serve requests.
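For illustration, a readiness gate like the one below could run right before the flux bootstrap. This is only a sketch, not the actual EKS-A code: the namespace and deployment names (`eksa-system` / `eksa-controller-manager`) and the helper name are assumptions on my part.

```go
// Sketch: block GitOps installation until the deployment serving the eksa
// validating webhooks reports all replicas available. Namespace, deployment
// name, and function name are assumptions for illustration.
package gitops

import (
	"context"
	"fmt"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/kubernetes"
)

func waitForEksaWebhooks(ctx context.Context, client kubernetes.Interface) error {
	const namespace, name = "eksa-system", "eksa-controller-manager"

	return wait.PollImmediateWithContext(ctx, 5*time.Second, 5*time.Minute,
		func(ctx context.Context) (bool, error) {
			d, err := client.AppsV1().Deployments(namespace).Get(ctx, name, metav1.GetOptions{})
			if apierrors.IsNotFound(err) {
				return false, nil // deployment not created yet; keep polling
			}
			if err != nil {
				return false, err
			}
			desired := int32(1)
			if d.Spec.Replicas != nil {
				desired = *d.Spec.Replicas
			}
			if d.Status.AvailableReplicas < desired {
				fmt.Printf("waiting for %s/%s: %d/%d replicas available\n",
					namespace, name, d.Status.AvailableReplicas, desired)
				return false, nil
			}
			return true, nil
		})
}
```

Calling something like this from the GitOps installer before `flux bootstrap` would avoid racing the webhook service as it comes up.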

What you expected to happen: I would expect this operation to succeed without a connection refused error

How to reproduce it (as minimally and precisely as possible): Run the TestCloudStackUpgradeMulticlusterWorkloadClusterWithFluxLegacy e2e test a number of times

Anything else we need to know?:

Environment:

maxdrib commented 2 years ago

After discussing with @danbudris, it might make sense to introduce some validation of the webhook server's service endpoints, such as https://eksa-webhook-service.eksa-system.svc:443/validate-anywhere-eks-amazonaws-com-v1alpha1-cloudstackdatacenterconfig?timeout=10s, before installing GitOps
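As a rough illustration of that kind of check (not a concrete proposal for where it would live), something like the sketch below could probe the webhook endpoint before GitOps installation. The service DNS name only resolves in-cluster, so a real implementation would more likely go through the management cluster's API server or the Service's Endpoints; the address and function name here are assumptions.

```go
// Sketch: a pure reachability probe against the eksa webhook Service before
// kicking off flux bootstrap. It only checks that something is accepting TLS
// connections; it does not validate the certificate or call the webhook path.
package gitops

import (
	"crypto/tls"
	"fmt"
	"net"
	"time"
)

func probeWebhookService() error {
	// Address taken from the error message above; resolvable only in-cluster.
	addr := "eksa-webhook-service.eksa-system.svc:443"

	dialer := &net.Dialer{Timeout: 10 * time.Second}
	conn, err := tls.DialWithDialer(dialer, "tcp", addr, &tls.Config{
		InsecureSkipVerify: true, // reachability check only, not cert validation
	})
	if err != nil {
		return fmt.Errorf("webhook service %s is not accepting connections yet: %w", addr, err)
	}
	defer conn.Close()
	return nil
}
```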

maxdrib commented 2 years ago

There already appears to be retry logic in place (https://github.com/aws/eks-anywhere/blob/6104510c396ae57863b43a497f5c19c2293b8173/pkg/gitops/flux/client.go#L60-L65), so it's unclear why this operation is still failing; a rough sketch of that kind of retry wrapper follows the list below. Next steps would include the following:

  1. Fetch the support bundle for the e2e test run to see if the eks-a controller had issues starting the webhook service
  2. Run the e2e test with verbosity 9 instead of 4 to see whether the operation is in fact retried and something else is the issue
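For context on what I mean by "retry logic": conceptually it is a backoff wrapper around the bootstrap call, roughly like the sketch below. This is not the actual pkg/gitops/flux code; the function names and retry policy are made up for illustration. If the bootstrap really is being retried, the verbose logs from step 2 should show multiple attempts.

```go
// Sketch: a generic retry-with-backoff wrapper around the GitOps bootstrap,
// illustrating the kind of retry referenced above. Names and policy are
// hypothetical, not copied from pkg/gitops/flux/client.go.
package gitops

import (
	"context"
	"fmt"
	"time"
)

func bootstrapWithRetry(ctx context.Context, bootstrap func(context.Context) error) error {
	const maxAttempts = 5
	backoff := 2 * time.Second

	var lastErr error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if lastErr = bootstrap(ctx); lastErr == nil {
			return nil
		}
		fmt.Printf("flux bootstrap attempt %d/%d failed: %v\n", attempt, maxAttempts, lastErr)

		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
			backoff *= 2 // exponential backoff between attempts
		}
	}
	return fmt.Errorf("flux bootstrap failed after %d attempts: %w", maxAttempts, lastErr)
}
```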