alanprot opened this issue 2 weeks ago (status: Open)
It seems that this may have the same root cause as https://github.com/emissary-ingress/emissary/pull/4447.
When we add the health check, we create a new cluster with the same name, which can trigger this bug...
Ok...
Indeed, this seems to be the same root cause as https://github.com/emissary-ingress/emissary/pull/4447.
It seems that anything that changes the cluster object can trigger this bug:
I kept changing `connect_timeout_ms` or `cluster_idle_timeout_ms` on the Mapping and could reproduce this problem as well, with no health check configured at all:
Cluster object (from the Envoy config dump):

```json
"cluster": {
  "@type": "type.googleapis.com/envoy.config.cluster.v3.Cluster",
  "name": "cluster_https___servoce-1_namespace-a_443_o-32E2014365DD7432-0",
  "type": "EDS",
  "eds_cluster_config": {
    "eds_config": {
      "ads": {},
      "resource_api_version": "V3"
    },
    "service_name": "k8s/servoce-1/namespace-a/443"
  },
  "connect_timeout": "2s",
  "dns_lookup_family": "V4_ONLY",
  "transport_socket": {
    "name": "envoy.transport_sockets.tls",
    "typed_config": {
      "@type": "type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.UpstreamTlsContext",
      "common_tls_context": {}
    }
  },
  "alt_stat_name": "distributor_cortex_443",
  "common_http_protocol_options": {
    "idle_timeout": "90s"
  }
},
"last_updated": "2024-10-08T00:52:01.553Z"
```
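For reference, a Mapping edit of the kind described above might look like the following sketch. This is a hypothetical example (the metadata, prefix, and resolver name are placeholders, not taken from the issue); the timeout values mirror the `connect_timeout: "2s"` and `idle_timeout: "90s"` seen in the cluster dump, and changing either of them regenerates the Envoy cluster:

```yaml
# Hypothetical Mapping sketch; editing connect_timeout_ms or
# cluster_idle_timeout_ms regenerates the cluster and can trigger the bug.
apiVersion: getambassador.io/v3alpha1
kind: Mapping
metadata:
  name: servoce-1-mapping        # placeholder name
  namespace: namespace-a
spec:
  hostname: "*"
  prefix: /servoce-1/            # placeholder prefix
  service: https://servoce-1.namespace-a:443
  resolver: endpoint             # assumed KubernetesEndpointResolver name
  connect_timeout_ms: 2000       # matches "connect_timeout": "2s"
  cluster_idle_timeout_ms: 90000 # matches "idle_timeout": "90s"
```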
**Describe the bug**

Hi,
I'm trying to enable active health checks on a specific Mapping, which uses a KubernetesEndpointResolver. Upon configuring the health check, we can see that in some Ambassador pods all upstream hosts for the cluster associated with the Mapping disappear, and they stay missing until something else changes in the cluster (a pod scale-up, for instance). This causes those Ambassador pods to return 503 errors, as no upstream targets are found.
Here's the configuration I'm trying to add to the Mapping:
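The exact configuration isn't reproduced here; a minimal sketch of what an active health check on a Mapping looks like under the Emissary `v3alpha1` schema might be (the path, thresholds, and timings below are placeholders, not the reporter's actual values):

```yaml
# Hypothetical health-check fragment for a Mapping spec; values are placeholders.
health_checks:
- unhealthy_threshold: 3   # consecutive failures before marking a host unhealthy
  healthy_threshold: 1     # consecutive successes before marking it healthy again
  timeout: 3s
  health_check:
    http:
      path: /healthz       # placeholder health endpoint
```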
This issue only affects some Emissary pods. When comparing pods that experience the issue with those that work correctly, the Envoy config dump (with EDS info) reveals that the list of hosts is empty for the faulty pods:
Bad:
Good:
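The dumps themselves aren't shown above; for illustration, the difference is whether the `ClusterLoadAssignment` for the cluster carries any endpoints. A fabricated fragment of what a "bad" pod's EDS entry looks like in the config dump:

```json
"endpoint_config": {
  "@type": "type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment",
  "cluster_name": "k8s/servoce-1/namespace-a/443",
  "endpoints": []
}
```

A "good" pod instead has `"endpoints"` populated with locality entries whose `lb_endpoints` list one address per upstream pod.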
I haven't been able to pinpoint why some pods end up with an empty ClusterLoadAssignment, but it looks like a race condition, possibly in the service that populates the assignment via EDS. The issue occurs randomly in different pods if I keep removing and re-adding the health-check config.
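To compare pods without eyeballing whole dumps, a small helper can count EDS endpoints per cluster in a saved config dump. This is a hypothetical diagnostic sketch, not part of Emissary; the sample input below is fabricated to show the shape of the `EndpointsConfigDump` section:

```python
def eds_host_counts(config_dump: dict) -> dict:
    """Count lb_endpoints per cluster in an Envoy config dump
    taken with include_eds=on (EndpointsConfigDump section)."""
    counts = {}
    for section in config_dump.get("configs", []):
        if not section.get("@type", "").endswith("EndpointsConfigDump"):
            continue
        for entry in section.get("dynamic_endpoint_configs", []):
            cla = entry.get("endpoint_config", {})
            name = cla.get("cluster_name", "<unknown>")
            counts[name] = sum(
                len(loc.get("lb_endpoints", []))
                for loc in cla.get("endpoints", [])
            )
    return counts

# Fabricated dump fragment for illustration only:
sample = {
    "configs": [{
        "@type": "type.googleapis.com/envoy.admin.v3.EndpointsConfigDump",
        "dynamic_endpoint_configs": [{
            "endpoint_config": {
                "@type": "type.googleapis.com/envoy.config.endpoint.v3.ClusterLoadAssignment",
                "cluster_name": "k8s/service-a/default/443",
                "endpoints": [{"lb_endpoints": [{}, {}]}],
            }
        }]
    }]
}
print(eds_host_counts(sample))  # → {'k8s/service-a/default/443': 2}
```

In practice one would `json.load` the output of the admin `config_dump` endpoint from each pod; a faulty pod would report 0 for the affected cluster.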
**To Reproduce**

Steps to reproduce the behavior:
1. Create a Mapping that uses a KubernetesEndpointResolver and a service pointing to a k8s service, e.g. `service-a.default`.
2. Configure an active health check on the Mapping.
3. Observe that on some pods the hosts for the cluster are not registered on the ClusterLoadAssignment (`http://localhost:8001/config_dump?resource=&mask=&name_regex=&include_eds=on`).

**Expected behavior**

Configuring the health check should not wipe the instances of the cluster associated with the Mapping resource.
**Versions (please complete the following information):**