envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0

Envoy running a short-lived WebSocket load starts returning 503 responses #36237

Closed: afrancoc2000 closed this issue 1 week ago

afrancoc2000 commented 1 month ago

Envoy running a short-lived WebSocket load starts returning 503 responses

Description: I have a WebSocket application that uses Envoy as its routing load balancer, and I'm running load tests against it to see how it behaves under load and what to expect in load scenarios. With a persistent connection under high load (streaming messages over the same open connection for 3 minutes) I can drive Envoy to 100% CPU and measure the limit of requests it can handle. But when I change the test to open a connection, send a message, and close the connection after receiving the answer, I very quickly start getting 503 errors instead of the expected 101 when trying to open new connections.

Repro steps: Using the following k6 script:

import ws from "k6/ws";
import { check } from "k6";

const msg = "test message";

export default function () {
  const url = __ENV.WS_URL || "ws://localhost:8080/chat";
  const params = { headers: { Host: "example.com" } };

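  // Open a connection, send one message, and close as soon as the first reply arrives.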
  const res = ws.connect(url, params, function (socket) {
    socket.on("open", function open() {
      socket.send(msg);
    });
    socket.on("message", (data) => {
      socket.close();
    });
    socket.on("error", (e) => {
      if (e.error() != "websocket: close sent") {
        console.error("An unexpected error occurred: ", e.error());
      }
    });
  });

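  // The WebSocket upgrade handshake should return HTTP 101 Switching Protocols.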
  const result = check(res, { "status is 101": (r) => r && r.status === 101 });
  if (!result) {
    console.error(`Check failed: status ${res.status}, body ${res.body}`);
  }
}

Running it like this:

k6 run --vus 10 --duration 180s -e WS_URL=ws://mycompany/chat script.js

it starts returning 503 errors:


          /\      |‾‾| /‾‾/   /‾‾/
     /\  /  \     |  |/  /   /  /
    /  \/    \    |     (   /   ‾‾\
   /          \   |  |\  \ |  (‾)  |
  / __________ \  |__| \__\ \_____/ .io

     execution: local
        script: script-simple.js
        output: -

     scenarios: (100.00%) 1 scenario, 10 max VUs, 3m30s max duration (incl. graceful stop):
              * default: 10 looping VUs for 3m0s (gracefulStop: 30s)

ERRO[0006] Check failed: status 503, body upstream connect error or disconnect/reset before headers. reset reason: connection timeout  source=console
ERRO[0006] Check failed: status 503, body upstream connect error or disconnect/reset before headers. reset reason: connection timeout  source=console
ERRO[0006] Check failed: status 503, body upstream connect error or disconnect/reset before headers. reset reason: connection timeout  source=console
ERRO[0006] Check failed: status 503, body upstream connect error or disconnect/reset before headers. reset reason: connection timeout  source=console
......

Looking at a TCP dump on the Envoy host, we can see this happening while there are retransmission errors when closing the WebSocket connection:

Dump against the client: (screenshot)

Dump against the backend: (screenshot)

For comparison, this is what a successful WebSocket call against the backend looks like: (screenshot)

Config:

The config just uses the router filter and enables the WebSocket upgrade:

....
            "http_filters": [
             {
              "name": "envoy.filters.http.router",
              "typed_config": {
               "@type": "type.googleapis.com/envoy.extensions.filters.http.router.v3.Router"
              }
             }
            ],
....
....
            "upgrade_configs": [
             {
              "upgrade_type": "websocket"
             }
            ],
....

Logs:

[2024-09-19 20:58:36.656][3612][debug][conn_handler] [source/common/listener_manager/active_tcp_listener.cc:160] [Tags: "ConnectionId":"11628"] new connection from 168.63.129.16:56980
[2024-09-19 20:58:37.363][3614][debug][pool] [source/common/conn_pool/conn_pool_base.cc:803] [Tags: "ConnectionId":"10029"] connect timeout
[2024-09-19 20:58:37.363][3614][debug][connection] [source/common/network/connection_impl.cc:149] [Tags: "ConnectionId":"10029"] closing data_to_write=0 type=1
[2024-09-19 20:58:37.363][3614][debug][connection] [source/common/network/connection_impl.cc:281] [Tags: "ConnectionId":"10029"] closing socket: 1
[2024-09-19 20:58:37.364][3614][debug][client] [source/common/http/codec_client.cc:107] [Tags: "ConnectionId":"10029"] disconnect. resetting 0 pending requests
[2024-09-19 20:58:37.364][3614][debug][pool] [source/common/conn_pool/conn_pool_base.cc:495] [Tags: "ConnectionId":"10029"] client disconnected, failure reason: 
[2024-09-19 20:58:37.364][3614][debug][router] [source/common/router/router.cc:1354] [Tags: "ConnectionId":"11612","StreamId":"8531618248070408344"] upstream reset: reset reason: connection timeout, transport failure reason: 
[2024-09-19 20:58:37.364][3614][debug][http] [source/common/http/filter_manager.cc:1077] [Tags: "ConnectionId":"11612","StreamId":"8531618248070408344"] Sending local reply with details upstream_reset_before_response_started{connection_timeout}
[2024-09-19 20:58:37.364][3614][debug][http] [source/common/http/conn_manager_impl.cc:1838] [Tags: "ConnectionId":"11612","StreamId":"8531618248070408344"] encoding headers via codec (end_stream=false):
':status', '503'
'content-length', '91'
'content-type', 'text/plain'
'date', 'Thu, 19 Sep 2024 20:58:36 GMT'
'server', 'Microsoft-Azure-Application-LB/AGC'
'connection', 'close'

[2024-09-19 20:58:37.364][3614][debug][http] [source/common/http/conn_manager_impl.cc:1950] [Tags: "ConnectionId":"11612","StreamId":"8531618248070408344"] Codec completed encoding stream.
[2024-09-19 20:58:37.364][3614][debug][http] [source/common/http/conn_manager_impl.cc:252] [Tags: "ConnectionId":"11612","StreamId":"8531618248070408344"] doEndStream() resetting stream
[2024-09-19 20:58:37.364][3614][debug][http] [source/common/http/conn_manager_impl.cc:1907] [Tags: "ConnectionId":"11612","StreamId":"8531618248070408344"] stream reset: reset reason: local reset, response details: -

Is there something that I'm missing in my configuration? What can I do to prevent this from happening? Is this the normal expected behavior?

Thanks

zuercher commented 1 month ago

There is a connection timeout configuration for the upstream cluster that might need to be adjusted, and there are also default circuit breaker settings that often need to be raised. You can see whether you're triggering the circuit breaker by looking at metrics.
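For example, a cluster entry with a longer connect timeout and raised circuit-breaker thresholds might look roughly like this, in the same JSON style as the config above (the cluster name and the numbers are only illustrative, and only the relevant fields are shown):

....
            "clusters": [
             {
              "name": "websocket_backend",
              "connect_timeout": "5s",
              "circuit_breakers": {
               "thresholds": [
                {
                 "priority": "DEFAULT",
                 "max_connections": 10000,
                 "max_pending_requests": 10000,
                 "max_requests": 10000
                }
               ]
              }
             }
            ],
....

If the circuit breaker is what is rejecting the new connections, the cluster.<name>.upstream_cx_overflow and cluster.<name>.upstream_rq_pending_overflow counters on the admin /stats endpoint will be incrementing.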

Since you're using WebSockets, connections cannot be reused between the client and Envoy or between Envoy and the upstream. When connections are opened and closed quickly between the same pair of hosts (say, Envoy and the upstream), it's possible to exhaust the ephemeral ports available for outbound connections: the combination of protocol/source-ip/source-port/destination-ip/destination-port must be unique, and when everything except the OS-chosen source port is constant, it's easy to run out. You can usually tell this is happening if netstat or similar tools show a large number of connections in CLOSE_WAIT or FIN_WAIT states when the errors start occurring.
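As a quick way to check for that (this is just a sketch; the backend address below is a placeholder), ss on the Envoy host can show how many upstream sockets are sitting in each TCP state:

# Count TCP connections to the backend by state; a large pile of sockets in
# FIN-WAIT, CLOSE-WAIT, or TIME-WAIT when the 503s start suggests port exhaustion.
ss -tan dst 10.0.0.5 | awk 'NR>1 {print $1}' | sort | uniq -c

# Or count just the sockets still waiting to be torn down.
ss -tan state time-wait dst 10.0.0.5 | wc -l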

afrancoc2000 commented 1 month ago

I will check my connection timeout and the circuit breaker settings. Thank you so much!

github-actions[bot] commented 2 weeks ago

This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or "no stalebot" or other activity occurs. Thank you for your contributions.

github-actions[bot] commented 1 week ago

This issue has been automatically closed because it has not had activity in the last 37 days. If this issue is still valid, please ping a maintainer and ask them to label it as "help wanted" or "no stalebot". Thank you for your contributions.