Open style95 opened 1 month ago
OK, I noticed this only happens when an active health check is configured, and I've just found this page.
> Host absent / health check OK: Envoy will route to the target host. This is very important since the design assumes that the discovery service can fail at any time. If a host continues to pass health check even after becoming absent from the discovery data, Envoy will still route. Although it would be impossible to add new hosts in this scenario, existing hosts will continue to operate normally. When the discovery service is operating normally again the data will eventually re-converge.
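The documented matrix above can be sketched as a tiny decision function. This is only an illustration of the quoted rule, not Envoy's actual code; all names (`routable_hosts`, the sample IPs) are made up for the example.

```python
def routable_hosts(discovered, known, health_ok):
    """Sketch of the documented behavior: with active health checking,
    a host absent from the latest discovery (DNS) data is still routable
    as long as its health check keeps passing."""
    kept = set(discovered)
    for host in known:
        if host not in kept and health_ok.get(host, False):
            kept.add(host)  # absent from DNS, but HC still OK: keep routing
    return kept

known = {"1.1.1.1", "2.2.2.2", "3.3.3.3", "4.4.4.4"}
dns_now = {"1.1.1.1", "2.2.2.2", "3.3.3.3"}   # 4.4.4.4 removed from DNS
hc = {h: True for h in known}                  # all hosts still pass HC

print(sorted(routable_hosts(dns_now, known, hc)))
# → ['1.1.1.1', '2.2.2.2', '3.3.3.3', '4.4.4.4']
```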
So a DNS server initially responds with 4 IPs, and if I remove one of the IPs from the DNS records, Envoy will keep sending requests to the removed IP as long as health checks succeed?
I confirmed that the removed host is not removed from the endpoint list when active health checks are enabled.
But I am still curious: when I update the max_connection_duration setting, the endpoint is suddenly removed.
Could anyone share what's going on under the hood?
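For reference, `max_connection_duration` lives in the cluster's `HttpProtocolOptions` next to the `idle_timeout` already present in the config below. This is a hedged sketch of where the field would go; the `60s` value is an assumption for illustration, not taken from the original config.

```yaml
typed_extension_protocol_options:
  envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
    '@type': type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
    common_http_protocol_options:
      idle_timeout: 300s
      max_connection_duration: 60s   # assumed value, for illustration only
    explicit_http_config:
      http_protocol_options: {}
```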
Addition:
It seems that not only max_connection_duration but also changing connection settings like the ones below triggers the deletion of endpoints.
But I still couldn't understand the correlation between them.
```yaml
trafficPolicy:
  connectionPool:
    http:
      idleTimeout: 300s
    tcp:
      connectTimeout: 11s
      tcpKeepalive:
        interval: 74s
        time: 600s
```
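For context, a fragment like the one above would typically sit inside an Istio DestinationRule. The name and namespace here are taken from the `config:` path in the cluster metadata below (`destination-rule/my-backend-options` in namespace `my-backend`); the rest of the wrapper is a hedged sketch, not the author's actual manifest.

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: my-backend-options
  namespace: my-backend
spec:
  host: backend.my-backend
  trafficPolicy:
    connectionPool:
      http:
        idleTimeout: 300s
      tcp:
        connectTimeout: 11s
        tcpKeepalive:
          interval: 74s
          time: 600s
```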
@alyssawilk Could you take a look at this? I'm mentioning you since you made a DNS-related commit here. Please bear with me if you are not in charge of this module.
cc @alyssawilk
DNS TTL isn't currently connected to upstream connection lifetime in any way. When the DNS TTL expires, the DNS cache will re-resolve DNS. If the endpoint resolution changes, Envoy won't drain the endpoints. Generally, endpoints that want to do a graceful drain would both update their DNS addresses (no new connections) and send Connection: close / GOAWAY on open connections (drain old connections). If they're not doing the latter, you'd need to set max_connection_duration, or a feature would have to be added to optionally drain on address-resolution change.
An optional drain would be complicated by the fact that many endpoints (e.g. Google) will advertise different DNS addresses on each resolution, even when not draining, in order to spread connections more granularly across frontend fleets. So often an address "changing" doesn't invalidate the old endpoint, and draining would simply result in lots of cold-start connections.
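The cold-start argument above can be made concrete with a toy simulation. This is not Envoy code; the function, the address pool, and the rotation pattern are all invented for illustration. A fleet that rotates which addresses it advertises forces far more reconnects under a drain-on-change policy than under keep-while-healthy.

```python
def simulate(resolutions, drain_on_change):
    """Count cold-start connections across successive DNS resolutions.

    resolutions: list of address sets returned by each DNS refresh.
    drain_on_change: if True, drop every connection whose address is
    absent from the latest resolution (the hypothetical optional-drain
    mode); if False, keep existing connections alive, mirroring the
    keep-while-healthy behavior described above.
    """
    live = set()       # addresses we hold open connections to
    cold_starts = 0
    for addrs in resolutions:
        if drain_on_change:
            live &= addrs            # drop connections to absent addresses
        new = addrs - live           # addresses needing a fresh connection
        cold_starts += len(new)
        live |= new
    return cold_starts

# A fleet of 8 frontends that advertises a rotating window of 4 addresses
# on each resolution, without actually draining anything:
pool = [f"10.0.0.{i}" for i in range(8)]
rotating = [set(pool[i:i + 4]) for i in (0, 2, 4, 0, 2, 4)]

print(simulate(rotating, drain_on_change=False))  # → 8  (connect to each host once)
print(simulate(rotating, drain_on_change=True))   # → 16 (reconnect churn)
```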
@alyssawilk
Thank you for the answer.
One strange thing is that if I don't configure active health checks, DNS resolution works as expected even without max_connection_duration.
This is my cluster configuration. There is only an HTTP idleTimeout, and I kept sending requests every 0.1s, so I suppose the connections stayed alive. So when active health checks are not set up, are endpoints drained based on the DNS TTL?
```yaml
- circuitBreakers:
    thresholds:
    - maxConnections: 4294967295
      maxPendingRequests: 4294967295
      maxRequests: 4294967295
      maxRetries: 4294967295
      trackRemaining: true
  commonLbConfig:
    healthyPanicThreshold: {}
    localityWeightedLbConfig: {}
  connectTimeout: 11s
  dnsLookupFamily: V4_ONLY
  dnsRefreshRate: 60s
  filters:
  - name: istio.metadata_exchange
    typedConfig:
      '@type': type.googleapis.com/envoy.tcp.metadataexchange.config.MetadataExchange
      protocol: istio-peer-exchange
  lbPolicy: LEAST_REQUEST
  loadAssignment:
    clusterName: outbound|443||backend.my-backend
    endpoints:
    - lbEndpoints:
      - endpoint:
          address:
            socketAddress:
              address: backend.my-backend
              portValue: 443
        loadBalancingWeight: 1
        metadata:
          filterMetadata:
            istio:
              workload: ;;;;
      loadBalancingWeight: 1
      locality: {}
    - lbEndpoints:
      - endpoint:
          address:
            socketAddress:
              address: my-backend.com
              portValue: 80
          healthCheckConfig: {}
      locality: {}
      priority: 4
    policy:
      overprovisioningFactor: 200
  metadata:
    filterMetadata:
      istio:
        config: /apis/networking.istio.io/v1alpha3/namespaces/my-backend/destination-rule/my-backend-options
        default_original_port: 443
        services:
        - host: backend.my-backend
          name: backend.my-backend
          namespace: my-backend
  name: outbound|443||backend.my-backend
  outlierDetection:
    baseEjectionTime: 1s
    consecutive5xx: 4294967295
    enforcingConsecutive5xx: 100
    enforcingSuccessRate: 0
    interval: 1s
  respectDnsTtl: true
  type: STRICT_DNS
  typedExtensionProtocolOptions:
    envoy.extensions.upstreams.http.v3.HttpProtocolOptions:
      '@type': type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions
      commonHttpProtocolOptions:
        idleTimeout: 300s
      explicitHttpConfig:
        httpProtocolOptions: {}
  upstreamConnectionOptions:
    tcpKeepalive:
      keepaliveInterval: 74
      keepaliveTime: 600
```
Title: Envoy does not respect DNS TTL when no max_connection_duration is configured
Description: My Envoy acts as a proxy to a certain DNS endpoint.

envoy ---(DNS)---> servers (dynamic IP change)

When max_connection_duration is configured, it works as expected, but if no duration is configured, Envoy does not respect the DNS TTL. So even if I remove some IPs from the DNS records, the removed servers keep receiving requests. I am not sure if this is expected behavior; I expected the DNS TTL to always be respected with this change.
I am using this version:

envoy version: e546bf5fc89b063bb911dc717c9beb26efa27a9f/1.25.4-dev/Clean/RELEASE/BoringSSL

Please let me know if any further information is needed.
Config: see the cluster configuration above.