argoproj / argo-helm

ArgoProj Helm Charts
https://argoproj.github.io/argo-helm/
Apache License 2.0

AWS ALB Redirect loop caused by unhealthy target group #2936

Closed ToddMurphy92 closed 3 weeks ago

ToddMurphy92 commented 4 weeks ago

Describe the bug

Target health status of target groups is unhealthy for both the argo-cd-argocd-server and the argo-cd-argocd-server-grpc service.

We've attempted to set up an Argo CD instance using the Helm chart. In our values file we've specified that we want to use an AWS ALB (see the values file below). When the service starts, all of the pods appear to be running correctly and, as far as we can tell, all of the other EKS resources are in a healthy state. The ALB is created by the AWS Load Balancer Controller from the ingress that we specify in the values file. As noted above, the target groups are unhealthy and we cannot figure out why.

Related helm chart

argo-cd

Helm chart version

7.5.0

To Reproduce

helm repo add eks https://aws.github.io/eks-charts

helm install aws-load-balancer-controler eks/aws-load-balancer-controller \
-n kube-system \
--set clusterName=<cluster name> \
--set serviceAccount.create=true \
--set serviceAccount.name=aws-load-balancer-controller

helm repo add argo-cd https://argoproj.github.io/argo-helm

helm dep update argocd-charts

helm install argo-cd argocd-charts --create-namespace \
--values ./argocd-values/argocd-values.yaml --wait --namespace argo-cd 

Once the above is done, we then add an alias to argocd-dev.aws.example.com in our internal DNS, pointing to the address of the load balancer.

argocd-values.yaml contents (referenced above)

global:
  domain: argocd-dev.aws.example.com

configs:
  params:
    server.insecure: true

argo-cd:
  server:
    ingress:
      annotations:
        alb.ingress.kubernetes.io/backend-protocol: HTTP
        alb.ingress.kubernetes.io/listen-ports: '[{"HTTP": 80}, {"HTTPS": 443}]'
        alb.ingress.kubernetes.io/scheme: internal
        alb.ingress.kubernetes.io/ssl-redirect: '443'
        alb.ingress.kubernetes.io/subnets: subnet-redacted, subnet-redacted, subnet-redacted
        alb.ingress.kubernetes.io/target-type: ip
        alb.ingress.kubernetes.io/tags: cost-centre=A111, environment=DEV, function=ArgoCD, managed-by=redacted@example.com, owner=redacted@example.com
      aws:
        backendProtocolVersion: GRPC
        serviceType: ClusterIP
      controller: aws
      enabled: true
      ingressClassName: alb
    replicas: 1

controller:
  replicas: 1

repoServer:
  replicas: 1

applicationSet:
  replicaCount: 1

redis-ha:
  enabled: false

Expected behavior

We would expect the target groups to be healthy and for Argo CD to be accessible via the load balancer.

Additional context

No response

mkilchhofer commented 4 weeks ago

Can you try removing the port 80 listener from the annotation?

alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]'

Ref: Argo CD docs

Background: When a plain HTTP request hits the Argo CD server component, it redirects to HTTPS with a status code of 307.

Another option is to use either alb.ingress.kubernetes.io/healthcheck-protocol: HTTPS or alb.ingress.kubernetes.io/success-codes: 200-399.
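
The two suggestions above could look roughly like this in the values file. This is only a sketch: it assumes the same argo-cd.server.ingress.annotations nesting used in the values file earlier in this issue, and the remaining annotations are omitted for brevity.

```yaml
argo-cd:
  server:
    ingress:
      annotations:
        # Option 1: expose only the HTTPS listener (no port 80 listener)
        alb.ingress.kubernetes.io/listen-ports: '[{"HTTPS":443}]'
        # Option 2: run the ALB health check over HTTPS so that it is
        # not answered with the 307 redirect
        alb.ingress.kubernetes.io/healthcheck-protocol: HTTPS
```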

ToddMurphy92 commented 4 weeks ago

Thanks for responding.

Removing the port 80 listener didn't make much difference. We then added alb.ingress.kubernetes.io/healthcheck-protocol: HTTPS to our values file and the target groups reported healthy (so an improvement). However, we were still stuck in a 307 redirect loop when visiting the UI from a web browser.

When we tried using success codes 200-399 we kept getting this error back from the AWS Load Balancer Controller:

{
    "level": "error",
    "ts": "2024-09-26T03:41:40Z",
    "msg": "Reconciler error",
    "controller": "ingress",
    "object":
    {
        "name": "argo-cd-argocd-server",
        "namespace": "argocd"
    },
    "namespace": "argocd",
    "name": "argo-cd-argocd-server",
    "reconcileID": "790e57e4-745a-4881-aeb7-c0460648e6c4",
    "error": "ValidationError: Health check matcher GRPC code '200' must be within '0-99' inclusive\n\tstatus code: 400, request id: 9443f4ee-df46-4347-a0d4-66ff5a86a3ea"
}

Here's the curl output for the 307 redirect loop:

* Connected to argocd-dev.aws.example.com (10.1.1.5) port 443 (#0)
* ALPN: offers h2,http/1.1
* (304) (OUT), TLS handshake, Client hello (1):
*  CAfile: /etc/ssl/cert.pem
*  CApath: none
* (304) (IN), TLS handshake, Server hello (2):
* TLSv1.2 (IN), TLS handshake, Certificate (11):
* TLSv1.2 (IN), TLS handshake, Server key exchange (12):
* TLSv1.2 (IN), TLS handshake, Server finished (14):
* TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
* TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (OUT), TLS handshake, Finished (20):
* TLSv1.2 (IN), TLS change cipher, Change cipher spec (1):
* TLSv1.2 (IN), TLS handshake, Finished (20):
* SSL connection using TLSv1.2 / ECDHE-RSA-AES128-GCM-SHA256
* ALPN: server accepted h2
* Server certificate:
*  subject: CN=argocd-dev.aws.example.com
*  start date: Sep 15 21:38:37 2024 GMT
*  expire date: Sep 15 21:38:37 2026 GMT
*  subjectAltName: host "argocd-dev.aws.example.com" matched cert's "argocd-dev.aws.example.com"
*  issuer: DC=com; DC=example; DC=aws; CN=a-pssvc-casub1
*  SSL certificate verify ok.
* using HTTP/2
* h2h3 [:method: GET]
* h2h3 [:path: /]
* h2h3 [:scheme: https]
* h2h3 [:authority: argocd-dev.aws.example.com]
* h2h3 [user-agent: curl/7.88.1]
* h2h3 [accept: */*]
* Using Stream ID: 1 (easy handle 0x12d80a800)
> GET / HTTP/2
> Host: argocd-dev.aws.example.com
> user-agent: curl/7.88.1
> accept: */*
>
< HTTP/2 307
< date: Thu, 26 Sep 2024 03:46:45 GMT
< content-type: text/html; charset=utf-8
< content-length: 80
< location: https://argocd-dev.aws.example.com/
<
<a href="https://argocd-dev.aws.example.com/">Temporary Redirect</a>.

* Connection #0 to host argocd-dev.aws.example.com left intact

ToddMurphy92 commented 4 weeks ago

Quick update - We fixed it!

The solution was to add --insecure under argo-cd.server.extraArgs

argo-cd:
  server:
    extraArgs:
      - --insecure

Thanks again for your help @mkilchhofer

bryan-srg commented 4 weeks ago

Just to be clear - @mkilchhofer gave us the clue with the "When a plain HTTP request hits the Argo CD server component, it redirects to HTTPS with a status code of 307" comment.

Since we're doing SSL termination at the ALB, we of course needed to make sure the Argo CD app was okay with receiving non-SSL requests, which I'd missed on the first review of the docs. So thanks once again!
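
The original values file already set configs.params.server.insecure: true, yet the redirect persisted until --insecure was added under argo-cd.server.extraArgs. One possible explanation (an assumption, not confirmed in this thread): if argocd-charts is a wrapper chart that pulls in argo-cd as a dependency, as the helm dep update step and the argo-cd: key suggest, Helm only passes values nested under argo-cd: (plus global) to the subchart, so a top-level configs block would not reach it. A sketch of the same setting nested under the subchart key:

```yaml
argo-cd:
  configs:
    params:
      # Serve plain HTTP from argocd-server; TLS is terminated at the ALB
      server.insecure: true
```

If that assumption holds, the top-level controller, repoServer, applicationSet and redis-ha blocks in the values file would need the same nesting to take effect.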

mkilchhofer commented 4 weeks ago

Ahh, alright 👍 Yup, this also works, and it is the officially documented solution.

Just a little FYI: At work we also run Argo CD with an ALB and the aws-load-balancer-controller, and we recently removed --insecure by adding the annotation alb.ingress.kubernetes.io/backend-protocol: HTTPS.
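
As a rough sketch of that alternative, assuming the same argo-cd.server.ingress.annotations nesting as in the values file above and that neither --insecure nor server.insecure is set, the ALB would talk HTTPS to the pods. The health check annotation is included on the assumption that the health check also has to use HTTPS in this setup; other annotations are omitted.

```yaml
argo-cd:
  server:
    ingress:
      annotations:
        # ALB connects to the argocd-server pods over HTTPS
        alb.ingress.kubernetes.io/backend-protocol: HTTPS
        # Assumed to be needed as well so the health check matches
        alb.ingress.kubernetes.io/healthcheck-protocol: HTTPS
```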

It's your decision now 😎👍

bryan-srg commented 4 weeks ago

We tried that (and indeed, still have it in the ALB config), but it didn't help. Maybe we also need to get the cert into the app somehow for that to work properly?

Edit: I'm confusing myself - it was the healthcheck protocol we set as HTTPS, not the backend-protocol.