envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0
24.85k stars 4.77k forks source link

Consecutive Gateway Failure outlier detection ejects WebSocket endpoints on graceful connection close #24690

Open Hexta opened 1 year ago

Hexta commented 1 year ago

Title: Consecutive Gateway Failure outlier detection ejects WebSocket endpoints on graceful connection close

Description: Consecutive Gateway Failure outlier detection ejects WebSocket endpoints on graceful connection close.

Repro steps:

Config:

lds.yaml

filter_chains:
  - filters:
    - name: envoy.http_connection_manager
      typed_config:
        "@type": type.googleapis.com/envoy.extensions.filters.network.http_connection_manager.v3.HttpConnectionManager
        stat_prefix: ingress_http
        upgrade_configs:
          upgrade_type: websocket

        codec_type: AUTO

        http_filters:
          - name: envoy.filters.http.router
            typed_config:
              "@type": type.googleapis.com/envoy.extensions.filters.http.router.v3.Router
        route_config:
          name: ws
          virtual_hosts:
            - name: ws
              domains: ["*"]
              routes:
                - match:
                    prefix: /ws
                  route:
                    cluster: ws
                    prefix_rewrite: /

cds.yaml

resources:
  - "@type": type.googleapis.com/envoy.config.cluster.v3.Cluster
    name: ws
    outlier_detection:
      interval: 2s
      consecutive_gateway_failure: 1
      enforcing_consecutive_gateway_failure: 100
      base_ejection_time: 2s
      max_ejection_time: 2s
    load_assignment:
      cluster_name: ws
      endpoints:
        - lb_endpoints:
            - endpoint:
                address:
                  socket_address:
                    address: 127.0.0.1
                    port_value: 9910

Logs: app log

[159][debug][conn_handler] [source/server/active_tcp_listener.cc:147] [C3] new connection from 127.0.0.1:58836
[159][debug][http] [source/common/http/conn_manager_impl.cc:306] [C3] new stream
[159][debug][http] [source/common/http/conn_manager_impl.cc:930] [C3][S7468752995169847881] request headers complete (end_stream=false):
':authority', 'localhost'
':path', '/ws'
':method', 'GET'
'connection', 'Upgrade'
'upgrade', 'websocket'
'sec-websocket-version', '13'
'sec-websocket-key', 'jj7EkpIJfFGA/Ycw7L1ZZw=='

[159][debug][connection] [./source/common/network/connection_impl.h:92] [C3] current connecting state: false
[159][debug][router] [source/common/router/router.cc:470] [C3][S7468752995169847881] cluster 'ws' match for URL '/ws'
[159][debug][router] [source/common/router/router.cc:678] [C3][S7468752995169847881] router decoding headers:
':authority', 'localhost'
':path', '/'
':method', 'GET'
':scheme', 'http'
'connection', 'Upgrade'
'upgrade', 'websocket'
'sec-websocket-version', '13'
'sec-websocket-key', 'jj7EkpIJfFGA/Ycw7L1ZZw=='
'x-forwarded-proto', 'http'
'x-envoy-internal', 'true'
'x-request-id', '83455f65-2ae3-4d79-a28e-14181fd1096b'
'x-envoy-expected-rq-timeout-ms', '15000'
'x-envoy-original-path', '/ws'

[159][debug][pool] [source/common/http/conn_pool_base.cc:78] queueing stream due to no available connections (ready=0 busy=0 connecting=0)
[159][debug][pool] [source/common/conn_pool/conn_pool_base.cc:290] trying to create new connection
[159][debug][pool] [source/common/conn_pool/conn_pool_base.cc:145] creating a new connection (connecting=0)
[159][debug][connection] [./source/common/network/connection_impl.h:92] [C4] current connecting state: true
[159][debug][client] [source/common/http/codec_client.cc:57] [C4] connecting
[159][debug][connection] [source/common/network/connection_impl.cc:924] [C4] connecting to 127.0.0.1:9910
[159][debug][connection] [source/common/network/connection_impl.cc:943] [C4] connection in progress
[159][debug][connection] [source/common/network/connection_impl.cc:683] [C4] connected
[159][debug][client] [source/common/http/codec_client.cc:88] [C4] connected
[159][debug][pool] [source/common/conn_pool/conn_pool_base.cc:327] [C4] attaching to next stream
[159][debug][pool] [source/common/conn_pool/conn_pool_base.cc:181] [C4] creating stream
[159][debug][router] [source/common/router/upstream_request.cc:564] [C3][S7468752995169847881] pool ready
[159][debug][router] [source/common/router/router.cc:1371] [C3][S7468752995169847881] upstream headers complete: end_stream=false
[159][debug][http] [source/common/http/conn_manager_impl.cc:1551] [C3][S7468752995169847881] encoding headers via codec (end_stream=false):
':status', '101'
'upgrade', 'websocket'
'connection', 'Upgrade'
'sec-websocket-accept', 'vvzZD2L5yLMhIP/7YsDaw2W/jbc='
'date', 'Tue, 27 Dec 2022 11:48:10 GMT'

[159][debug][connection] [source/common/network/connection_impl.cc:651] [C4] remote close
[159][debug][connection] [source/common/network/connection_impl.cc:250] [C4] closing socket: 0
[159][debug][client] [source/common/http/codec_client.cc:107] [C4] disconnect. resetting 1 pending requests
[159][debug][client] [source/common/http/codec_client.cc:156] [C4] request reset
[159][debug][router] [source/common/router/router.cc:1212] [C3][S7468752995169847881] upstream reset: reset reason: connection termination, transport failure reason: 
[159][debug][http] [source/common/http/filter_manager.cc:894] [C3][S7468752995169847881] Resetting stream due to upstream_reset_after_response_started{connection_termination}. Prior headers have already been sent
[159][debug][http] [source/common/http/conn_manager_impl.cc:212] [C3][S7468752995169847881] doEndStream() resetting stream
[159][debug][http] [source/common/http/conn_manager_impl.cc:1604] [C3][S7468752995169847881] stream reset
[151][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1179] membership update for TLS cluster ws added 0 removed 0
[151][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:861] host 127.0.0.1:9910 in cluster ws was ejected by the outlier detector
[159][debug][connection] [source/common/network/connection_impl.cc:139] [C3] closing data_to_write=0 type=0
[159][debug][connection] [source/common/network/connection_impl.cc:250] [C3] closing socket: 1
[159][debug][conn_handler] [source/server/active_stream_listener_base.cc:120] [C3] adding to cleanup list
[159][debug][pool] [source/common/conn_pool/conn_pool_base.cc:483] [C4] client disconnected, failure reason: 
[159][debug][pool] [source/common/conn_pool/conn_pool_base.cc:453] invoking idle callbacks - is_draining_for_deletion_=false
[159][debug][pool] [source/common/conn_pool/conn_pool_base.cc:214] [C4] destroying stream: 0 remaining
[159][debug][upstream] [source/common/upstream/cluster_manager_impl.cc:1179] membership update for TLS cluster ws added 0 removed 0

outlier detection event log

{"type":"CONSECUTIVE_GATEWAY_FAILURE","cluster_name":"ws","upstream_url":"127.0.0.1:9910","action":"EJECT","num_ejections":1,"enforced":true,"eject_consecutive_event":{},"timestamp":"2022-12-27T11:48:10.950Z"}
snowp commented 1 year ago

I would think we're not observing the close socket frame and using it to recognize the graceful shutdown and just treating it as an abrupt connection termination, seems like something we should be able to do.

cc @alyssawilk

A short term solution might just be to increase your consecutive errors to more than 1 - that way you'd only trip the outlier detection if the connections repeatedly fails within a short period of time.

cpakulski commented 1 year ago

I looked at the events happening on the wire and a client sends ws opcode 8 (Connection close) to the server. The server sends TCP FIN which is interpreted as RemoteClose. It is local origin event and is mapped to HTTP code 503 which lands in outlier detector as gateway error. I will try to fix it.

cpakulski commented 1 year ago

It appears that Envoy does NOT parse websockets headers. The codebase contains websocket codec, which could be used to catch opcode 8 (FIN/Connection Close) coming from the server and "prepare" for TCP FIN coming from upstream, but that codec seems to be not used and not linked into the binary.

nagavijayan-nagarathinam commented 6 months ago

https://github.com/envoyproxy/envoy/issues/33398

hypnoce commented 2 weeks ago

Hi, we recently encountered this issue where a graceful WS close (code 1001) was detected as failed endpoint after the TCP socket was closed. Seems that envoy considers that the "http response headers" were not received before closing the tCP connection, thus the outlier. Looking at the code I'm not sure how to fix this. Seems a bigger development than my C++ skills can handle. Let me know if someone plan to fix this and if I can help. @cpakulski did you manage to have somth working ? Thanks

cpakulski commented 2 weeks ago

@hypnoce It is still an issue. Fixing it requires adding websockets parser to detect that client requests server to close the connection and not interpret server's FIN as error.