envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0

Envoy cannot connect to the XDS Server #36150

Open ShivanshiDhawan opened 4 days ago

ShivanshiDhawan commented 4 days ago

The XDS server was restarted and Envoy got disconnected, but Envoy was unable to reconnect to the XDS server for around 1.5 hours. Envoy was then restarted and was able to connect back to the XDS server.

The Envoy logs show only the following warning message being logged repeatedly: [2024-09-11 09:43:24.904][1][warning][config] [./source/extensions/config_subscription/grpc/grpc_stream.h:190] DeltaAggregatedResources gRPC config stream to [] closed since 49223s ago: 14, upstream connect error or disconnect/reset before headers. reset reason: remote connection failure, transport failure reason: delayed connect error: 113

I understand that Envoy uses a backoff strategy for retries, and that this warning message is logged when the error still persists after each backoff cycle.
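For reference, a minimal sketch of where that reconnect backoff can be tuned, assuming ADS over Envoy gRPC with the delta protocol; the cluster name xds_cluster and the interval values are placeholders, not taken from the setup below. The retry_policy on the gRPC service covers re-establishing the xDS stream (the built-in default is roughly a 500ms base interval capped at 30s):

 "dynamic_resources":
   "ads_config":
     "api_type": "DELTA_GRPC"
     "transport_api_version": "V3"
     "grpc_services":
     - "envoy_grpc":
         "cluster_name": "xds_cluster"  # placeholder; must match the static cluster name
         # Optional override of the stream re-establishment backoff.
         "retry_policy":
           "retry_back_off":
             "base_interval": "1s"
             "max_interval": "30s"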

I have a few questions about this:

XDS server cluster config (the XDS server is running as a headless service in Kubernetes):

 "static_resources":
    "clusters":
    - "circuit_breakers":
        "thresholds":
        - "max_connections": 100000
          "max_pending_requests": 100000
          "max_requests": 60000000
          "max_retries": 50
          "priority": "HIGH"
        - "max_connections": 100000
          "max_pending_requests": 100000
          "max_requests": 60000000
          "max_retries": 50
          "priority": "DEFAULT"
      "connect_timeout": "1s"
      "dns_lookup_family": "V4_ONLY"
      "typed_extension_protocol_options":
        "envoy.extensions.upstreams.http.v3.HttpProtocolOptions":
          "@type": "type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions"
          "explicit_http_config":
            "http2_protocol_options":
              "connection_keepalive":
                "interval": "30s"
                "timeout": "20s"
      "lb_policy": "RANDOM"
      "load_assignment":
        "cluster_name": ""
        "endpoints":
        - "lb_endpoints":
          - "endpoint":
              "address":
                "socket_address":
                  "address": {{ "xx.namespace.svc.cluster.local." }}
                  "port_value": {{ $xds_server.port }}
      "name": ""
      "type": "LOGICAL_DNS"
      "upstream_connection_options":
        "tcp_keepalive":
          "keepalive_interval": 10
          "keepalive_probes": 3
          "keepalive_time": 30
zuercher commented 3 days ago

Why wasn't Envoy able to reconnect for 1.5 hours?

Difficult to say, but the transport failure reason delayed connect error: 113 maps to EHOSTUNREACH (no route to host), so perhaps the first entry in the DNS result of xx.namespace.svc.cluster.local. continued to be a bad host? (See the discussion of LOGICAL_DNS below.)

Will Envoy keep retrying with the backoff strategy until it connects back, or is there a max_attempts?

It will continue to try to connect, but will eventually give up waiting for configuration and proceed with whatever static configuration may be available. See the initial_fetch_timeout configuration on ConfigSource.
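For illustration, a minimal sketch of that field, assuming CDS is delivered over ADS; the 30s value is illustrative, and initial_fetch_timeout defaults to 15s (0s means wait indefinitely):

 "dynamic_resources":
   "cds_config":
     "resource_api_version": "V3"
     "ads": {}
     # Wait at most this long for the first CDS response before moving on
     # with initialization (default 15s; 0s waits indefinitely).
     "initial_fetch_timeout": "30s"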

The XDS server is configured with LOGICAL_DNS as the service discovery type. Does Envoy select a host for every retry? Or is it the same host as the last attempt after some n number of retries?

https://www.envoyproxy.io/docs/envoy/v1.31.1/intro/arch_overview/upstream/service_discovery#logical-dns

I think each retry will be a new connection attempt. Whether it chooses the same host is dependent on the cluster's endpoints and the load balancing policy. Here it goes back to the DNS result and will always choose the first host in the most recent DNS response (this is the definition of LOGICAL_DNS). You might consider whether STRICT_DNS is a better choice here. That will cause Envoy to apply its load balancing policy so if there are multiple hosts each will be attempted eventually.
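A sketch of that change on the cluster above (only the fields that differ are shown; the 5s refresh rate is illustrative, and 5000ms is already the default). With a headless Kubernetes service, the DNS answer contains one A record per pod, so STRICT_DNS would give Envoy one endpoint per pod to load-balance across:

      "type": "STRICT_DNS"        # resolve every A record and create an endpoint per address
      "dns_refresh_rate": "5s"    # optional; 5000ms is the default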

The circuit breaker is configured with max_retries of 50, but Envoy retried more than that number of times. The metric upstream_cx_connect_fail has value 3034 and upstream_cx_overflow is 0.

That circuit breaker doesn't apply here. This is partly because the max_retries circuit breaker limits the maximum number of concurrent retries on a cluster (e.g. retries as configured in an HttpConnectionManager retry_policy), and partly because I don't think we do the circuit breaker accounting in the XDS client code. In any event, XDS only ever has a single gRPC request open to an XDS server at a given time, so even if the circuit breaker accounting is taking place, it will never hit the limits.
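To make the distinction concrete, the max_retries circuit breaker bounds concurrent data-plane retries configured like this (a sketch of a router retry_policy; the virtual host and cluster names are hypothetical):

 "route_config":
   "virtual_hosts":
   - "name": "backend"
     "domains": ["*"]
     "routes":
     - "match": {"prefix": "/"}
       "route":
         "cluster": "some_backend"   # hypothetical data-plane cluster
         "retry_policy":
           "retry_on": "connect-failure"
           "num_retries": 3
 # Concurrent retries like these, across all in-flight requests to some_backend,
 # are what that cluster's max_retries circuit breaker caps.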

ShivanshiDhawan commented 3 days ago

Hey @zuercher, thanks for the response. So, with LOGICAL_DNS, we would have the default 5000ms dns_refresh_rate. Hence, the first IP address in the DNS result for xx.namespace.svc.cluster.local. should eventually have pointed to a healthy host, yet Envoy still wasn't able to connect back to the XDS server for 1.5 hours. Another point to note: there were 3 Envoy pods and only 1 of them faced this issue; the remaining Envoy pods connected back to the XDS server.