dan93-93 opened 2 months ago
@dan93-93 Thanks for your feedback! We will investigate and update as appropriate.
I'd also like to add the following:
I've tried troubleshooting the issue and looking at backend health:
kubectl get pods -n azure-alb-system
kubectl logs alb-controller-5cdcb6459b-ck2pf -n azure-alb-system -c alb-controller # Standby controller
kubectl logs alb-controller-5cdcb6459b-lqdhv -n azure-alb-system -c alb-controller # Elected controller
kubectl port-forward alb-controller-5cdcb6459b-lqdhv -n $CONTROLLER_NAMESPACE 8000 8001
curl 'http://127.0.0.1:8000/backendHealth?service-name=test-infra/echo/80&detailed=true'
What was returned:
{
  "services": [
    {
      "serviceName": "test-infra/echo/80",
      "serviceHealth": [
        {
          "albId": "/subscriptions/xxxx-xxxx-xxxx-xxxx/resourceGroups/rg-dan-microservices-001/providers/Microsoft.ServiceNetworking/trafficControllers/alb-test",
          "totalEndpoints": 1,
          "totalHealthyEndpoints": 0,
          "totalUnhealthyEndpoints": 1,
          "endpoints": [
            {
              "address": "10.224.0.42",
              "health": {
                "status": "UNHEALTHY"
              }
            }
          ]
        }
      ]
    }
  ]
}
I created a deployment.yaml file and added a HealthCheckPolicy (applied with kubectl apply -f deployment.yaml):
apiVersion: alb.networking.azure.io/v1
kind: HealthCheckPolicy
metadata:
  name: gateway-health-check-policy
  namespace: test-infra
spec:
  targetRef:
    group: ""
    kind: Service
    name: echo
    namespace: test-infra
  default:
    interval: 5s
    timeout: 3s
    healthyThreshold: 1
    unhealthyThreshold: 1
    port: 80
    http:
      path: /
      match:
        statusCodes:
        - start: 200
          end: 299
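To confirm the controller accepted this policy, a quick check using the policy name above (the status should surface an Accepted condition, as shown later in this thread):
kubectl describe healthcheckpolicy gateway-health-check-policy -n test-infra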
And I also configured a readiness probe:
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: echo
  name: echo
  namespace: test-infra
spec:
  replicas: 1
  selector:
    matchLabels:
      app: echo
  template:
    metadata:
      labels:
        app: echo
    spec:
      containers:
      - image: gcr.io/k8s-staging-ingressconformance/echoserver:v20220815-e21d1a4
        name: echo
        lifecycle:
          preStop:
            exec:
              command: ["sleep", "10"]
        ports:
        - containerPort: 3000
        readinessProbe:
          httpGet:
            path: /
            port: 3000
          periodSeconds: 3
          timeoutSeconds: 1
        env:
        - name: POD_NAME
          valueFrom:
            fieldRef:
              fieldPath: metadata.name
        - name: NAMESPACE
          valueFrom:
            fieldRef:
              fieldPath: metadata.namespace
Unfortunately, even with these additions it still returns UNHEALTHY. Please note that 10.224.0.42 is also the Pod IP; surely it shouldn't care about the Pod IP but rather the Service itself?
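To double-check what the controller is actually probing (as far as I understand, Application Gateway for Containers targets the pod endpoints published in the Service's EndpointSlices rather than the ClusterIP), two quick checks using the names from the manifests above:
# readiness probe should be passing (READY 1/1), otherwise the endpoint is never marked ready
kubectl get pods -n test-infra -l app=echo
# the EndpointSlice for the Service should list 10.224.0.42 with conditions.ready: true
kubectl get endpointslices -n test-infra -l kubernetes.io/service-name=echo -o yaml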
I deployed a curl-pod:
kubectl apply -f - <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: curl-pod
  namespace: test-infra
spec:
  containers:
  - name: curl-container
    image: curlimages/curl:latest
    command: ["sleep", "3600"]
EOF
Running the below returns the same information:
kubectl exec -it curl-pod -n test-infra -- /bin/sh
curl http://echo.test-infra.svc.cluster.local:80/
curl http://10.224.0.42:3000/
{
  "path": "/",
  "host": "echo.test-infra.svc.cluster.local",
  "method": "GET",
  "proto": "HTTP/1.1",
  "headers": {
    "Accept": [
      "*/*"
    ],
    "User-Agent": [
      "curl/8.10.1"
    ]
  },
  "namespace": "test-infra",
  "ingress": "",
  "service": "",
  "pod": "echo-7965899f7d-hvw4l"
}
So why is backendHealth reporting the endpoint as unhealthy? What determines whether it is unhealthy? Even the detailed output doesn't give any indication as to why...
@dan93-93 Thank you for bringing this to our attention. I've delegated this to the content author, who will review it and offer their insight.
Is there any update or is there any more information you need from me?
@dan93-93 @greg-lindsay I believe I've found the issue here, as I was also running into the dreaded No healthy upstream issue with my containers. I believe the problem is with the default HealthCheckPolicy. What I found is that you need to configure your own custom HealthCheckPolicy pointing to the targetPort of the Service that the HTTPRoute is pointing to, not the backendRefs port. As soon as that custom HealthCheckPolicy was created and deployed, my App Gateways started working as intended.
This is likely something that needs to be better defined in the documentation, as it will catch a lot of people out. Following the example in SSL offloading with Application Gateway for Containers - Gateway API, if you add a health check like the following, things should work:
apiVersion: alb.networking.azure.io/v1
kind: HealthCheckPolicy
metadata:
  name: echo-healthcheck
  namespace: test-infra
spec:
  targetRef:
    group: ''
    kind: Service
    name: echo
    namespace: tig
  default:
    interval: 10s
    timeout: 3s
    healthyThreshold: 1
    unhealthyThreshold: 5
    port: 3000 # targetPort of the service, not port
    http:
      path: /
      match:
        statusCodes:
        - start: 200
          end: 299
    useTLS: false
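For context on the port vs targetPort distinction: the echo Service in that example maps the Service port to the container's targetPort, roughly as sketched below (reconstructed from the thread, so treat the exact manifest as an assumption). The HTTPRoute backendRef uses port 80, while the health check above targets 3000:
apiVersion: v1
kind: Service
metadata:
  name: echo
  namespace: test-infra
spec:
  selector:
    app: echo
  ports:
  - port: 80          # what the HTTPRoute backendRef references
    targetPort: 3000  # what the container listens on, and what the HealthCheckPolicy above probes
    protocol: TCP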
@smithjw thanks for the input. I just ran the below and unfortunately I'm still running into this error, even with the policy in place using the targetPort:
kubectl apply -f https://trafficcontrollerdocs.blob.core.windows.net/examples/https-scenario/ssl-termination/deployment.yaml
Then created a healthcheckpolicy:
kubectl apply -f - <<EOF
apiVersion: alb.networking.azure.io/v1
kind: HealthCheckPolicy
metadata:
  name: echo-healthcheck
  namespace: test-infra
spec:
  targetRef:
    group: ''
    kind: Service
    name: echo
    namespace: test-infra
  default:
    interval: 10s
    timeout: 3s
    healthyThreshold: 1
    unhealthyThreshold: 5
    port: 3000
    http:
      path: /
      match:
        statusCodes:
        - start: 200
          end: 299
    useTLS: false
EOF
Appears to be valid:
Status:
  Conditions:
    Last Transition Time:  2024-09-30T14:33:41Z
    Message:               Valid HealthCheckPolicy
    Observed Generation:   2
    Reason:                Accepted
    Status:                True
    Type:                  Accepted
Events:                    <none>
Then created the gateway and httpRoute as per the docs, ensuring successful deployment of each resource:
kubectl apply -f - <<EOF
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: gateway-01
  namespace: test-infra
  annotations:
    alb.networking.azure.io/alb-id: $RESOURCE_ID
spec:
  gatewayClassName: azure-alb-external
  listeners:
  - name: https-listener
    port: 443
    protocol: HTTPS
    allowedRoutes:
      namespaces:
        from: Same
    tls:
      mode: Terminate
      certificateRefs:
      - kind: Secret
        group: ""
        name: listener-tls-secret
  addresses:
  - type: alb.networking.azure.io/alb-frontend
    value: $FRONTEND_NAME
EOF
# Check gateway
kubectl get gateway gateway-01 -n test-infra -o yaml
# Create HttpRoute - do not deploy manually if using PythonApp.yaml (it's in the file)
kubectl apply -f - <<EOF
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: https-route
  namespace: test-infra
spec:
  parentRefs:
  - name: gateway-01
    sectionName: https-listener
  rules:
  - backendRefs:
    - name: echo
      port: 80
EOF
# Check HttpRoute
kubectl get httproute https-route -n test-infra -o yaml
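For completeness, the end-to-end symptom can be reproduced by curling the frontend FQDN (placeholder below; --insecure because the example uses a self-signed certificate):
curl --insecure https://<frontend-fqdn>/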
Still thinks the endpoint of the pod is unhealthy:
{
  "services": [
    {
      "serviceName": "test-infra/echo/80",
      "serviceHealth": [
        {
          "albId": "/subscriptions/xxxx-xxxx-xxxx-xxxx/resourceGroups/rg-dan-microservices-001/providers/Microsoft.ServiceNetworking/trafficControllers/alb-test",
          "totalEndpoints": 1,
          "totalHealthyEndpoints": 0,
          "totalUnhealthyEndpoints": 1,
          "endpoints": [
            {
              "address": "10.224.0.7",
              "health": {
                "status": "UNHEALTHY"
              }
            }
          ]
        }
      ]
    }
  ]
}
I also deleted the pod to see whether a fresh pod would help, but sadly not.
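One small sanity check after recreating the pod: the new Pod IP should match the address reported by backendHealth above:
kubectl get pods -n test-infra -l app=echo -o wide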
@dan93-93 OK, I've seen something else that I think may be the cause. I spun up a new deployment and service, modified my HTTPRoute, and added a custom health check using the backend port, but saw the same Unhealthy message.
On a whim I tried setting the HealthCheckPolicy back to the Service port, deployed that, changed the HealthCheckPolicy back to the backend port, deployed again, and now it's reporting healthy.
I have no idea why it doesn't work the first time, but it seems to require creating the HealthCheckPolicy, changing the port and then changing it back before it reports as healthy.
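For anyone wanting to script that toggle, a rough sketch using kubectl patch against the echo-healthcheck policy from earlier in the thread (assuming kubectl resolves the healthcheckpolicy resource name; 80 and 3000 are the Service port and targetPort from the example):
# flip the probe to the Service port...
kubectl patch healthcheckpolicy echo-healthcheck -n test-infra --type merge \
  -p '{"spec":{"default":{"port":80}}}'
# ...give the controller a moment to propagate, then flip back to the targetPort
kubectl patch healthcheckpolicy echo-healthcheck -n test-infra --type merge \
  -p '{"spec":{"default":{"port":3000}}}'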
I've been trying the same scenario and tested the different port combinations and suggestions in this thread, and nothing works. I verified traffic on the backend pod using tcpdump: no packets arrive, regardless of the port. It seems like the alb-controller isn't actually attempting the HTTP/TCP health check call, since no traffic reaches the backend pod.
The alb-controller logs are clean; there's nothing besides info messages.
All the backends are UNHEALTHY, regardless of the presence of a HealthCheckPolicy and its configuration.
Is there any way to diagnose why the alb-controller is not making health check requests?
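For anyone wanting to reproduce that capture without modifying the image, one way is an ephemeral debug container (assuming ephemeral containers are enabled on the cluster and using the container name from earlier in the thread; replace the pod name with your own):
kubectl debug -it <echo-pod-name> -n test-infra --image=nicolaka/netshoot --target=echo \
  -- tcpdump -ni any tcp port 3000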
@greg-lindsay is anyone looking into this? I no longer have access to Azure due to a role change, but I would like to recommend this product in the future and can't until there is some clarity on the issue.
I was able to set the log level to debug on the alb-controller Helm chart, but can't see any obvious reasons in there. Attaching the collected logs: albController-dbug-log.txt
And attaching my deployment YAML, just in case:
deployment.yaml.txt
I just re-created the ALB gateway setup according to the new documentation, since a new version was published yesterday, but still can't make it work.
It seems the sources for this component aren't open, but it would be great to understand which process executes the health check, and why there are no errors, no attempts to reach the backend (proven by tcpdump), and no error or warning logs... Maybe we are looking in the wrong place?
Any help on this topic would be super helpful...
Hello,
We are seeing exactly the same thing. We've tested all possible combinations and configurations: with and without a HealthCheckPolicy, with HTTPS/HTTP, different ports, etc., but it's impossible to get a single healthy backend, even with Microsoft's demo templates... Tested on 3 different clusters, all created from scratch.
Also tested in a different namespace with a ReferenceGrant and in the same namespace: same problem. The controller logs and the YAML output tell us that everything is fine with the configuration.
The lack of more detailed logs on the state of the backends is very problematic, even if we imagine it might start working later. In a production environment, it's essential to know why backends are unhealthy.
Can you give us some feedback on this subject? We'd really like to test it soon if possible.
I was actually able to make it work, but only when the cluster was created using the console commands from the manual; a cluster created through the Azure portal UI never worked in terms of proper health evaluation.
Not sure what the reason is, but you can try this as a workaround.
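That matches my understanding: the quickstart's CLI command enables the OIDC issuer and workload identity, which the ALB controller needs and which I believe aren't enabled by default on a portal-created cluster. From memory, the quickstart creates the cluster roughly like this (variable names are placeholders):
az aks create \
  --resource-group $RESOURCE_GROUP \
  --name $AKS_CLUSTER_NAME \
  --location $LOCATION \
  --network-plugin azure \
  --enable-oidc-issuer \
  --enable-workload-identity \
  --generate-ssh-keys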
I've been following the docs verbatim over the past two days and pulling my hair out over why this error is occurring.
Firstly, I ran all the az shell commands from this article:
Quickstart: Deploy Application Gateway for Containers ALB Controller.
All resources were deployed successfully, including the ALB controller. The description of HELM_NAMESPACE='<your cluster name>' as being the cluster name seemed odd (a convention?); I just named it 'default'.
Then I ran through the BYO deployment article:
Quickstart: Create Application Gateway for Containers - bring your own deployment
Created a new vNet and subnet, as I'm doing BYO.
I noticed this specific article references creating a frontend whose name doesn't match the frontend name used in the next article. It'd be better to reference the same frontend name throughout the documentation to avoid gateway misconfiguration:
FRONTEND_NAME='test-frontend'
The last article I followed was regarding SSL offloading:
SSL offloading with Application Gateway for Containers - Gateway API
As mentioned above, the referenced frontend is called FRONTEND_NAME='frontend' here, whereas before it was referenced as FRONTEND_NAME='test-frontend' (obviously please correct me if this isn't right, but it would seem more appropriate to reference the previous article's frontend name).
Going through the documentation, not doing anything outside of what the docs reference (bar the frontend name change), and making sure the route and gateway deploy successfully, curling the FQDN still returns a No healthy upstream response.
I've attached a .txt file I created to track all the shell commands that needed to be run: AAG4C Deployment.txt - if there's been any misconfiguration I'd really appreciate knowing why. I redacted some info in it and converted it to a .sh for ease of reading.
Equally, I have also reviewed the troubleshooting guidance and pulled the ALB controller logs, which indicate the config update succeeded:
{"level":"info","version":"1.2.3","AGC":"alb-test","alb-resource-id":"/subscriptions/b6f333ab-db5c-42fe-a88f-508eef404579/resourceGroups/rg-dan-microservices-001/providers/Microsoft.ServiceNetworking/trafficControllers/alb-test","operationID":"f70b34f7-39ad-4359-be93-6fc2412708ed","Timestamp":"2024-09-18T16:32:53.603597699Z","message":"Application Gateway for Containers resource config update OPERATION_STATUS_SUCCESS with operation ID f70b34f7-39ad-4359-be93-6fc2412708ed"}
So I'm a bit stuck, I've got to admit; I'd really like to know where I've gone wrong...
services: application-gateway
author: @greglin
ms.service: azure-application-gateway
ms.subservice: appgw-for-containers
ms.custom: devx-track-azurecli
ms.topic: quickstart
ms.author: @greglin