envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0

AWS ALB Healthchecks return 404 after adding Envoy proxy #36030

Open rajcheval opened 1 week ago

rajcheval commented 1 week ago

Title: AWS ALB Healthchecks return 404 after adding Envoy proxy

Description: I am using Envoy with ECS Fargate. Envoy runs as a sidecar for SSL termination. The ALB forwards each request to Envoy, which forwards it to the application container. The application works as expected, but the health checks always return 404.

Repro steps: As soon as the service starts up, the ALB health checks begin. These health checks return 404.

Note: Due to client restrictions, I cannot share the tarball of debug logs and config that the Envoy_collect tool gathers. I am on version 1.32.0-dev.

Admin and Stats Output: I can answer specific questions but I cannot share stats directly.

Note: If there are privacy concerns, sanitize the data prior to sharing.

Config:
static_resources:
  listeners:
  - address:
      socket_address: 0.0.0.0
      port_value: 443
    filter_chains:
      transport_socket:
        name: envoy.transport_sockets.tlf
        typed_config:
          type.googleapis.com/env.extensions.transport_sockets.tlf.v3.DownstreamTlsContext
          common_tls_context:
            tls_certificates:

Logs: "user_agent":"ELB-HealthChecker/2.0" "http.response_code":404 "http.hostname":"ip-xxx-xxx-xxx.us-east-1.compute.internal"

Please let me know how I can configure Envoy so that the ALB health checks return 200 instead of 404.

ravenblackx commented 1 week ago

Is there no more to the log entry than that? Response flags? Something that indicates whether the 404 came from upstream or was a local response? Request path?

I'm not super familiar with ALB, but this answer (not envoy-related) suggests that a name-based vhost doesn't work for ALB health checks, so it might work to set a default/fallback behavior that will do the health checks? I would think the certs would be a problem for *.us-east-1.compute.internal too, if it's supposed to be using TLS, but if that was the main problem it'd be a 500-series error not 404.

@alyssawilk is probably better equipped to understand the problem.
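To illustrate the default/fallback idea above, here is a minimal sketch of a catch-all virtual host in the HTTP connection manager's route configuration. The names `local_route`, `fallback`, and `app_service` are hypothetical and do not come from the reporter's config, which is only partially shown.

```yaml
# Sketch only: names and the cluster are hypothetical, not taken from the reporter's config.
route_config:
  name: local_route
  virtual_hosts:
  - name: fallback
    domains: ["*"]            # wildcard: matches any Host / :authority value
    routes:
    - match:
        prefix: "/"           # every path, including the health check path
      route:
        cluster: app_service  # hypothetical cluster pointing at the application container
```

Forwarding the probe to the application, rather than answering it in Envoy, means the ALB health check still reflects whether the application container is reachable.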

rajcheval commented 1 week ago

Thanks for your help.

I agree *.us-east-1.compute.internal may not work due to SSL cert.

I tried setting domains: to "*" and I started getting 503.

Here is a sanitized full access log entry. I hope this helps in identifying the problem. I can start logging any other access log fields that may be helpful.

{"timestamp":"2024-09-09T14:44:23.831Z","http.response.body.bytes":0,"service.name":"envoy","client.local.address":"1XX.1YY.194.183:443","http.request.method":"GET","http.request.headers.x_forwarded_for":null,"http.request.headers.x_forwarded_proto":"https","http.request.headers.authority":"1XX.1YY.194.183","envoy.route.name":null,"envoy.upstream.cluster":null,"http.request.duration":0,"http.request.headers.accept":null,"host.hostname":"ip-1XX-1YY-194-183.us-west-1.compute.internal","http.request.headers.id":"00beec21-1d1d-XXXX-a6a3-RRRR9e24412f","http.response_code":404,"user_agent":"ELB-HealthChecker/2.0","client.address":"1XX.YYY.194.19:25452","http.request.body.bytes":0}

alyssawilk commented 1 week ago

Do you have details in your log? https://www.envoyproxy.io/docs/envoy/latest/faq/debugging/why_is_envoy_sending_internal_responses

ravenblackx commented 1 week ago

us-east-1 in your previous config and us-west-1 in the log explains why you were getting 404 rather than 503 before, so that's a step forward.

Logging %RESPONSE_CODE_DETAILS% and %RESPONSE_FLAGS% is likely to provide more information about what the 503 is (which is probably going to be about certs.)
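As a rough sketch of what that logging could look like, the fragment below adds both operators to a JSON access log. It assumes a stdout access logger, and the JSON key names are hypothetical; the reporter's actual logger and field names are not shown in the thread.

```yaml
# Sketch only: assumes a stdout access logger; JSON key names are hypothetical.
access_log:
- name: envoy.access_loggers.stdout
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.access_loggers.stream.v3.StdoutAccessLog
    log_format:
      json_format:
        "http.response_code": "%RESPONSE_CODE%"
        "envoy.response_code_details": "%RESPONSE_CODE_DETAILS%"  # why the response was sent
        "envoy.response_flags": "%RESPONSE_FLAGS%"                # short flags such as NR or UF
        "http.request.path": "%REQ(X-ENVOY-ORIGINAL-PATH?:PATH)%" # path the health checker asked for
        "http.request.headers.authority": "%REQ(:AUTHORITY)%"     # Host value used for vhost matching
```

Logging the request path and authority alongside the flags makes it easier to compare the health check request against the configured virtual host domains and route matches.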

rajcheval commented 1 week ago

The us-east-1 and us-west-1 values were introduced when I was sanitizing the config file. The actual region is neither east nor west. I had to type in the entire config manually due to client restrictions, so it may have other accidental typos.

I will log %RESPONSE_CODE_DETAILS% and %RESPONSE_FLAGS%

Thank you for your support.

rajcheval commented 1 week ago

I got RESPONSE_CODE_DETAILS: "route_not_found" and RESPONSE_FLAGS: "NR".
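For context, Envoy's access log documentation describes NR as "no route configured", which fits the route_not_found detail: the request did not match any virtual host or route. One possible approach that does not depend on route matching is Envoy's HTTP health check filter, which can answer the probe before the router filter runs. A minimal sketch, assuming the ALB health check path is `/health` (a hypothetical value; the thread does not state the path):

```yaml
# Sketch only: "/health" is an assumed health check path, not taken from the thread.
http_filters:
- name: envoy.filters.http.health_check
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.health_check.v3.HealthCheck
    pass_through_mode: false   # Envoy answers the probe itself instead of forwarding it upstream
    headers:
    - name: ":path"
      string_match:
        exact: "/health"       # only requests for this exact path are treated as health checks
- name: envoy.filters.http.router   # the router stays last in the filter chain
  typed_config:
    "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
```

With `pass_through_mode: false` the probe only confirms that Envoy itself is up, so it would not detect a failing application container. Whether that trade-off is acceptable depends on the deployment.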

rajcheval commented 1 week ago

Please let me know if there is anything else you need.