MicrosoftDocs / azure-docs

Open source documentation of Microsoft Azure
https://docs.microsoft.com/azure
Creative Commons Attribution 4.0 International

No Healthy Upstream following MS Docs on AAG for Containers #124454

Open dan93-93 opened 1 day ago

dan93-93 commented 1 day ago

I've been following the docs verbatim over the past two days and pulling my hair out trying to work out why this error is occurring.

Firstly, I ran all the az shell commands from this article:

Quickstart: Deploy Application Gateway for Containers ALB Controller.

All resources were deployed successfully, including the ALB controller. I did find it odd that HELM_NAMESPACE='<your cluster name>' is described as taking the cluster name (is that a convention?); I just set it to 'default'.
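For context, the controller install step boiled down to something like the below. This is paraphrased from my notes of the quickstart rather than copied verbatim, so treat the chart path, identity name and chart version as approximate; $RESOURCE_GROUP and $AKS_NAME are my own variable names:

az aks get-credentials --resource-group $RESOURCE_GROUP --name $AKS_NAME

helm install alb-controller oci://mcr.microsoft.com/application-lb/charts/alb-controller \
  --namespace $HELM_NAMESPACE \
  --version 1.2.3 \
  --set albController.namespace=azure-alb-system \
  --set albController.podIdentity.clientID=$(az identity show -g $RESOURCE_GROUP -n azure-alb-identity --query clientId -o tsv)

Whatever HELM_NAMESPACE is set to, the controller pods themselves end up in azure-alb-system (which is where I query them further down), so the naming of that variable threw me a bit.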

Then I ran through the BYO deployment article:

Quickstart: Create Application Gateway for Containers - bring your own deployment

Created a new vNet and subnet as I'm doing BYO:

VNET_ADDRESS_PREFIX='10.0.0.0/16'  # Allows for multiple subnets
SUBNET_ADDRESS_PREFIX='10.0.1.0/24'  # Provides 256 addresses, meeting the 250 requirement
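For completeness, the vNet/subnet creation plus the subnet delegation that the BYO article calls for looked roughly like this. Commands are paraphrased from the article; $VNET_NAME and $ALB_SUBNET_NAME are my own variable names:

az network vnet create \
  --resource-group $RESOURCE_GROUP \
  --name $VNET_NAME \
  --address-prefixes $VNET_ADDRESS_PREFIX \
  --subnet-name $ALB_SUBNET_NAME \
  --subnet-prefixes $SUBNET_ADDRESS_PREFIX

az network vnet subnet update \
  --resource-group $RESOURCE_GROUP \
  --vnet-name $VNET_NAME \
  --name $ALB_SUBNET_NAME \
  --delegations 'Microsoft.ServiceNetworking/trafficControllers'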

I noticed that this article references creating a frontend whose name doesn't match the frontend name used in the next article. It would be better to reference the same frontend name throughout the documentation to avoid gateway misconfiguration: FRONTEND_NAME='test-frontend'?
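For reference, the frontend itself is just created through the alb CLI extension, so the name is whatever gets passed in here (flags paraphrased from my notes; $ALB_NAME is my variable for the Application Gateway for Containers resource):

az network alb frontend create \
  --resource-group $RESOURCE_GROUP \
  --alb-name $ALB_NAME \
  --frontend-name $FRONTEND_NAME   # 'test-frontend' in the BYO article, 'frontend' in the SSL offloading one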

Last article followed was regarding SSL offloading:

SSL offloading with Application Gateway for Containers - Gateway API

As mentioned above, the frontend referenced here is FRONTEND_NAME='frontend', whereas the previous article used FRONTEND_NAME='test-frontend' (please correct me if I'm wrong, but it would seem more appropriate to reuse the previous article's frontend name).
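If it helps pinpoint where a mismatch would bite: as far as I understand the BYO flow (recalled from memory rather than copied from the article, so the article is authoritative, and I've simplified the listener to plain HTTP here), the frontend name is ultimately consumed by the Gateway resource, along these lines:

kubectl apply -f - <<EOF
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: gateway-01
  namespace: test-infra
  annotations:
    alb.networking.azure.io/alb-id: $RESOURCE_ID
spec:
  gatewayClassName: azure-alb-external
  listeners:
  - name: http-listener
    port: 80
    protocol: HTTP
    allowedRoutes:
      namespaces:
        from: Same
  addresses:
  - type: alb.networking.azure.io/alb-frontend
    value: $FRONTEND_NAME
EOF

So if the Gateway references 'frontend' while the frontend that actually exists on the ALB resource is 'test-frontend' (or vice versa), things won't line up, which is why I made sure to use the same name on my side.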

Going through the documentation, doing nothing beyond what the docs describe (bar the frontend name change) and confirming that the route and gateway both report success, curling the FQDN still returns a "no healthy upstream" response.
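Concretely, this is the check (gateway-01 being my Gateway's name; the FQDN is taken from the address the controller assigns to the Gateway):

fqdn=$(kubectl get gateway gateway-01 -n test-infra -o jsonpath='{.status.addresses[0].value}')
curl --insecure https://$fqdn/
# returns: no healthy upstream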

I've attached a .txt file I created to track all the shell commands that needed to be run: AAG4C Deployment.txt. If there's any misconfiguration in there, I'd really appreciate knowing where. I've redacted some info in it; you can also convert it to a .sh for easier reading.

I have also reviewed the troubleshooting guidance and pulled the ALB controller logs, and it looks like the resource config update went through:

{"level":"info","version":"1.2.3","AGC":"alb-test","alb-resource-id":"/subscriptions/b6f333ab-db5c-42fe-a88f-508eef404579/resourceGroups/rg-dan-microservices-001/providers/Microsoft.ServiceNetworking/trafficControllers/alb-test","operationID":"f70b34f7-39ad-4359-be93-6fc2412708ed","Timestamp":"2024-09-18T16:32:53.603597699Z","message":"Application Gateway for Containers resource config update OPERATION_STATUS_SUCCESS with operation ID f70b34f7-39ad-4359-be93-6fc2412708ed"}

So I'm a bit stuck, I've got to admit. I'd really like to know where I've gone wrong.

ManoharLakkoju-MSFT commented 1 day ago

@dan93-93 Thanks for your feedback! We will investigate and update as appropriate.

dan93-93 commented 22 hours ago

I'd also like to add the following:

I've tried troubleshooting the issue and looking at backend health:

kubectl get pods -n azure-alb-system
kubectl logs alb-controller-5cdcb6459b-ck2pf -n azure-alb-system -c alb-controller # Standby controller
kubectl logs alb-controller-5cdcb6459b-lqdhv -n azure-alb-system -c alb-controller # Elected controller

kubectl port-forward alb-controller-5cdcb6459b-lqdhv -n $CONTROLLER_NAMESPACE 8000 8001
curl 'http://127.0.0.1:8000/backendHealth?service-name=test-infra/echo/80&detailed=true'

What was returned:

{
  "services": [
    {
      "serviceName": "test-infra/echo/80",
      "serviceHealth": [
        {
          "albId": "/subscriptions/xxxx-xxxx-xxxx-xxxx/resourceGroups/rg-dan-microservices-001/providers/Microsoft.ServiceNetworking/trafficControllers/alb-test",
          "totalEndpoints": 1,
          "totalHealthyEndpoints": 0,
          "totalUnhealthyEndpoints": 1,
          "endpoints": [
            {
              "address": "10.224.0.42",
              "health": {
                "status": "UNHEALTHY"
              }
            }
          ]
        }
      ]
    }
  ]
}

I created a deployment.yaml file and added a HealthCheckPolicy (applied with kubectl apply -f deployment.yaml):

apiVersion: alb.networking.azure.io/v1
kind: HealthCheckPolicy
metadata:
  name: gateway-health-check-policy
  namespace: test-infra
spec:
  targetRef:
    group: ""
    kind: Service
    name: echo
    namespace: test-infra
  default:
    interval: 5s
    timeout: 3s
    healthyThreshold: 1
    unhealthyThreshold: 1
    port: 80
    http:
      path: /
      match:
        statusCodes:
        - start: 200
          end: 299
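
For anyone reproducing this, the policy can be confirmed as picked up by the controller with something like the below (going from the troubleshooting guidance here; the controller should report acceptance under status.conditions):

kubectl get healthcheckpolicy gateway-health-check-policy -n test-infra -o yaml
# check status.conditions to confirm the policy was accepted by the ALB controller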

And I also configured a readiness probe:

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: echo
  name: echo
  namespace: test-infra
spec:
  replicas: 1
  selector:
    matchLabels:
      app: echo
  template:
    metadata:
      labels:
        app: echo
    spec:
      containers:
      - image: gcr.io/k8s-staging-ingressconformance/echoserver:v20220815-e21d1a4
        name: echo
        lifecycle:
          preStop:
            exec:
              command: ["sleep", "10"]
        ports:
          - containerPort: 3000
        readinessProbe:
          httpGet:
            path: /
            port: 3000
          periodSeconds: 3
          timeoutSeconds: 1
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace

Unfortunately, even with these additions, the backend health endpoint still reports the endpoint as UNHEALTHY.