envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0

Envoy fails to connect to external rate limit service: connection failure, transport failure reason: delayed connect error: 111 #22368

Closed: shashankram closed this issue 2 years ago

shashankram commented 2 years ago

Title: Envoy fails to connect to external rate limit service: connection failure, transport failure reason: delayed connect error: 111. A global network rate limit filter is configured on the ingress listener's filter_chain.

Description:


I have set up an external Rate Limit Service (RLS) using https://github.com/envoyproxy/ratelimit. I am experiencing an issue where the Envoy instance is unable to communicate with the RLS.
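
For reference, the RLS is loaded with a rule for the same domain and descriptor that the Envoy filter (config below) sends. A minimal ratelimit config along these lines (the unit and requests_per_unit values here are illustrative, not my exact settings):

domain: test
descriptors:
  # Matches the descriptor entry sent by the network rate limit filter.
  - key: my_key
    value: my_value
    rate_limit:
      unit: minute
      requests_per_unit: 100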

Envoy version: v1.22.2

Admin and Stats Output:


Relevant config dump: RLS cluster:

     "cluster": {
      "@type": "type.googleapis.com/envoy.config.cluster.v3.Cluster",
      "name": "ratelimiter.rls.svc.cluster.local|8081",
      "type": "STRICT_DNS",
      "connect_timeout": "10s",
      "dns_lookup_family": "V4_ONLY",
      "alt_stat_name": "ratelimiter.rls.svc.cluster.local|8081",
      "load_assignment": {
       "cluster_name": "ratelimiter.rls.svc.cluster.local|8081",
       "endpoints": [
        {
         "locality": {
          "zone": "zone"
         },
         "lb_endpoints": [
          {
           "endpoint": {
            "address": {
             "socket_address": {
              "address": "ratelimiter.rls.svc.cluster.local",
              "port_value": 8081
             }
            }
           },
           "load_balancing_weight": 100
          }
         ]
        }
       ]
      },
      "typed_extension_protocol_options": {
       "envoy.extensions.upstreams.http.v3.HttpProtocolOptions": {
        "@type": "type.googleapis.com/envoy.extensions.upstreams.http.v3.HttpProtocolOptions",
        "explicit_http_config": {
         "http2_protocol_options": {}
        }
       }
      },
      "respect_dns_ttl": true
     },

Rate limit filter config on ingress listener:

         "filters": [
          {
           "name": "l4_global_rate_limit",
           "typed_config": {
            "@type": "type.googleapis.com/envoy.extensions.filters.network.ratelimit.v3.RateLi
mit",
            "stat_prefix": "inbound_demo/fortio_8078_tcp",
            "domain": "test",
            "descriptors": [
             {
              "entries": [
               {
                "key": "my_key",
                "value": "my_value"
               }
              ]
             }
            ],
            "timeout": "10s",
            "failure_mode_deny": true,
            "rate_limit_service": {
             "grpc_service": {
              "envoy_grpc": {
               "cluster_name": "ratelimiter.rls.svc.cluster.local|8081"
              },
              "timeout": "10s"
             },
             "transport_api_version": "V3"
            }
           }
          },
          {
           "name": "tcp_proxy",
           "typed_config": {
            "@type": "type.googleapis.com/envoy.extensions.filters.network.tcp_proxy.v3.TcpProxy",
            "stat_prefix": "inbound-mesh-tcp-proxy.demo/fortio|8078|local",
            "cluster": "demo/fortio|8078|local"
           }
          }
         ],
         "transport_socket": {
          "name": "inbound_demo/fortio_8078_tcp",
          "typed_config": {
           "@type": "type.googleapis.com/envoy.extensions.transport_sockets.tls.v3.DownstreamTlsContext",

Stats pertaining to the RLS cluster and global rate limit:

cluster.ratelimiter.rls.svc.cluster.local|8081.default.total_match_count: 620
cluster.ratelimiter.rls.svc.cluster.local|8081.internal.upstream_rq_503: 5
cluster.ratelimiter.rls.svc.cluster.local|8081.internal.upstream_rq_5xx: 5
cluster.ratelimiter.rls.svc.cluster.local|8081.internal.upstream_rq_completed: 5
cluster.ratelimiter.rls.svc.cluster.local|8081.max_host_weight: 100
cluster.ratelimiter.rls.svc.cluster.local|8081.membership_change: 4
cluster.ratelimiter.rls.svc.cluster.local|8081.membership_healthy: 1
cluster.ratelimiter.rls.svc.cluster.local|8081.membership_total: 1
cluster.ratelimiter.rls.svc.cluster.local|8081.update_attempt: 620
cluster.ratelimiter.rls.svc.cluster.local|8081.update_no_rebuild: 616
cluster.ratelimiter.rls.svc.cluster.local|8081.update_success: 620
cluster.ratelimiter.rls.svc.cluster.local|8081.upstream_cx_connect_fail: 111
cluster.ratelimiter.rls.svc.cluster.local|8081.upstream_cx_connect_timeout: 5
cluster.ratelimiter.rls.svc.cluster.local|8081.upstream_cx_destroy: 111
cluster.ratelimiter.rls.svc.cluster.local|8081.upstream_cx_destroy_local: 5
cluster.ratelimiter.rls.svc.cluster.local|8081.upstream_cx_destroy_remote: 106
cluster.ratelimiter.rls.svc.cluster.local|8081.upstream_cx_http1_total: 19
cluster.ratelimiter.rls.svc.cluster.local|8081.upstream_cx_http2_total: 92
cluster.ratelimiter.rls.svc.cluster.local|8081.upstream_cx_total: 111
cluster.ratelimiter.rls.svc.cluster.local|8081.upstream_rq_503: 5
cluster.ratelimiter.rls.svc.cluster.local|8081.upstream_rq_5xx: 5
cluster.ratelimiter.rls.svc.cluster.local|8081.upstream_rq_cancelled: 16
cluster.ratelimiter.rls.svc.cluster.local|8081.upstream_rq_completed: 116
cluster.ratelimiter.rls.svc.cluster.local|8081.upstream_rq_pending_failure_eject: 100
cluster.ratelimiter.rls.svc.cluster.local|8081.upstream_rq_pending_total: 116
cluster.ratelimiter.rls.svc.cluster.local|8081.upstream_rq_timeout: 16
ratelimit.inbound_demo/fortio_8078_tcp.cx_closed: 114
ratelimit.inbound_demo/fortio_8078_tcp.error: 116
ratelimit.inbound_demo/fortio_8078_tcp.failure_mode_allowed: 2
ratelimit.inbound_demo/fortio_8078_tcp.total: 116

Logs:



[2022-07-22 16:38:21.661][24][debug][filter] [source/extensions/filters/listener/tls_inspector/tls_inspector.cc:88] tls inspector: new connection accepted
[2022-07-22 16:38:21.661][24][trace][filter] [source/extensions/filters/listener/tls_inspector/tls_inspector.cc:175] tls inspector: recv: 1457
[2022-07-22 16:38:21.661][24][trace][filter] [source/extensions/filters/listener/tls_inspector/tls_inspector.cc:145] tls:onALPN(), ALPN: osm
[2022-07-22 16:38:21.661][24][debug][filter] [source/extensions/filters/listener/tls_inspector/tls_inspector.cc:154] tls:onServerName(), requestedServerName: fortio.demo.svc.cluster.local
[2022-07-22 16:38:21.661][24][debug][filter] [source/extensions/filters/listener/original_dst/original_dst.cc:20] original_dst: new connection accepted
[2022-07-22 16:38:21.661][24][trace][filter] [source/extensions/filters/listener/original_dst/original_dst.cc:66] original_dst: set destination to 10.244.2.14:8078
[2022-07-22 16:38:21.662][24][debug][filter] [source/common/tcp_proxy/tcp_proxy.cc:197] [C241] new tcp proxy session
[2022-07-22 16:38:21.662][24][trace][connection] [source/common/network/connection_impl.cc:357] [C241] readDisable: disable=true disable_count=0 state=0 buffer_length=0
[2022-07-22 16:38:21.664][24][debug][router] [source/common/router/router.cc:470] [C0][S11366889662080980910] cluster 'ratelimiter.rls.svc.cluster.local|8081' match for URL '/envoy.service.ratelimit.v3.RateLimitService/ShouldRateLimit'
[2022-07-22 16:38:21.666][24][debug][router] [source/common/router/router.cc:673] [C0][S11366889662080980910] router decoding headers:
':method', 'POST'
':path', '/envoy.service.ratelimit.v3.RateLimitService/ShouldRateLimit'
':authority', 'ratelimiter.rls.svc.cluster.local|8081'
':scheme', 'http'
'te', 'trailers'
'grpc-timeout', '10000m'
'content-type', 'application/grpc'
'x-envoy-internal', 'true'
'x-forwarded-for', '10.244.2.14'
'x-envoy-expected-rq-timeout-ms', '10000'

[2022-07-22 16:38:21.666][24][debug][pool] [source/common/http/conn_pool_base.cc:76] queueing stream due to no available connections
[2022-07-22 16:38:21.667][24][debug][pool] [source/common/conn_pool/conn_pool_base.cc:268] trying to create new connection
[2022-07-22 16:38:21.667][24][trace][pool] [source/common/conn_pool/conn_pool_base.cc:269] ConnPoolImplBase 0x71ebea0d450, ready_clients.size(): 0, busy_clients.size(): 0, connecting_clients.size(): 0, connecting_stream_capacity: 0, num_active_streams: 0, pending_streams.size(): 1 per upstream preconnect ratio: 1
[2022-07-22 16:38:21.667][24][debug][pool] [source/common/conn_pool/conn_pool_base.cc:145] creating a new connection
[2022-07-22 16:38:21.668][24][trace][http2] [source/common/http/http2/codec_impl.cc:1922] Codec does not have Metadata frame support.
[2022-07-22 16:38:21.670][24][debug][http2] [source/common/http/http2/codec_impl.cc:1729] [C242] updating connection-level initial window size to 268435456
[2022-07-22 16:38:21.670][24][debug][connection] [./source/common/network/connection_impl.h:89] [C242] current connecting state: true
[2022-07-22 16:38:21.670][24][debug][client] [source/common/http/codec_client.cc:57] [C242] connecting
[2022-07-22 16:38:21.670][24][debug][connection] [source/common/network/connection_impl.cc:912] [C242] connecting to 10.96.227.8:8081
[2022-07-22 16:38:21.671][24][debug][connection] [source/common/network/connection_impl.cc:931] [C242] connection in progress
[2022-07-22 16:38:21.671][24][trace][pool] [source/common/conn_pool/conn_pool_base.cc:131] not creating a new connection, shouldCreateNewConnection returned false.
[2022-07-22 16:38:21.672][24][trace][router] [source/common/router/upstream_request.cc:256] [C0][S11366889662080980910] buffering 33 bytes
[2022-07-22 16:38:21.672][24][debug][conn_handler] [source/server/active_tcp_listener.cc:142] [C241] new connection from 10.244.1.9:59536
[2022-07-22 16:38:21.673][24][trace][connection] [source/common/network/connection_impl.cc:563] [C241] socket event: 2
[2022-07-22 16:38:21.673][24][trace][connection] [source/common/network/connection_impl.cc:674] [C241] write ready
[2022-07-22 16:38:21.679][24][trace][connection] [source/extensions/transport_sockets/tls/ssl_handshaker.cc:52] [C241] ssl error occurred while read: WANT_READ
[2022-07-22 16:38:21.679][24][trace][connection] [source/common/network/connection_impl.cc:563] [C242] socket event: 3
[2022-07-22 16:38:21.679][24][trace][connection] [source/common/network/connection_impl.cc:674] [C242] write ready
[2022-07-22 16:38:21.679][24][debug][connection] [source/common/network/connection_impl.cc:695] [C242] delayed connect error: 111
[2022-07-22 16:38:21.679][24][debug][connection] [source/common/network/connection_impl.cc:250] [C242] closing socket: 0
[2022-07-22 16:38:21.679][24][trace][connection] [source/common/network/connection_impl.cc:418] [C242] raising connection event 0
[2022-07-22 16:38:21.679][24][debug][client] [source/common/http/codec_client.cc:107] [C242] disconnect. resetting 0 pending requests
[2022-07-22 16:38:21.680][24][debug][pool] [source/common/conn_pool/conn_pool_base.cc:439] [C242] client disconnected, failure reason: delayed connect error: 111
[2022-07-22 16:38:21.680][24][debug][router] [source/common/router/router.cc:1170] [C0][S11366889662080980910] upstream reset: reset reason: connection failure, transport failure reason: delayed connect error: 111
[2022-07-22 16:38:21.680][24][trace][main] [source/common/event/dispatcher_impl.cc:249] item added to deferred deletion list (size=1)
[2022-07-22 16:38:21.680][24][debug][http] [source/common/http/async_client_impl.cc:101] async http request response headers (end_stream=true):
':status', '200'
'content-type', 'application/grpc'
'grpc-status', '14'
'grpc-message', 'upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: delayed connect error: 111'

[2022-07-22 16:38:21.680][24][debug][filter] [source/extensions/filters/common/ratelimit/ratelimit_impl.cc:117] rate limit fail, status=14 msg=upstream connect error or disconnect/reset before headers. reset reason: connection failure, transport failure reason: delayed connect error: 111
[2022-07-22 16:38:21.680][24][debug][connection] [source/common/network/connection_impl.cc:139] [C241] closing data_to_write=0 type=1
[2022-07-22 16:38:21.680][24][debug][connection] [source/common/network/connection_impl.cc:250] [C241] closing socket: 1

phlax commented 2 years ago

cc @wbpcode

shashankram commented 2 years ago

Figured out the issue: Envoy is operating correctly. Connect error 111 is ECONNREFUSED; in my environment the cluster hostname resolved to an IP address that had no endpoints behind it in k8s, so every connection attempt was refused. After fixing this misconfiguration, rate limiting works as expected.
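
For anyone else hitting this: with the cluster address ratelimiter.rls.svc.cluster.local:8081, the k8s Service must have a selector that actually matches the RLS pods; otherwise the Service still gets a ClusterIP (so DNS resolves), but it has no endpoints and connections are refused. A sketch of a correctly wired Service, assuming the RLS pods carry an app: ratelimiter label (the label is illustrative):

apiVersion: v1
kind: Service
metadata:
  name: ratelimiter
  namespace: rls
spec:
  selector:
    app: ratelimiter   # must match the labels on the RLS pods
  ports:
    - name: grpc
      port: 8081       # the port Envoy's RLS cluster dials
      targetPort: 8081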

Thanks!