envoyproxy / envoy

Cloud-native high-performance edge/middle/service proxy
https://www.envoyproxy.io
Apache License 2.0

Support a configurable queue for pending requests to use adaptive LIFO and CoDel #29070

Open balrawi-figma opened 1 year ago

balrawi-figma commented 1 year ago

Title: Support a configurable queue for pending requests to use adaptive LIFO and CoDel

Description: As far as I'm aware, there is currently no way to use different queuing strategies for pending requests in Envoy. This issue proposes adding a way to configure the pending-request queue.

Why is this important? Under overload, a simple FIFO queue significantly degrades the p50 and p75 latencies (data from tests comparing different strategies is shared below).

Controlled delay (aka CoDel) has long been standard practice in the Linux kernel, and combining CoDel with an adaptive LIFO queue is a well-proven strategy for mitigating request delays under overload. Below are some numbers I collected using a draft PR, which shed light on the p50 and p75 latency improvements across different configurations.
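
For illustration, here is a minimal sketch of the adaptive LIFO idea (illustrative only, not the draft PR's actual code): the queue serves requests in FIFO order under normal load and switches to LIFO once the queue depth crosses a threshold, so the newest requests, which are most likely to still be within their callers' deadlines, get served first.

#include <cstddef>
#include <deque>
#include <utility>

// Minimal adaptive-LIFO sketch: FIFO under normal load, LIFO once the
// queue depth crosses a threshold. All names are illustrative.
template <typename Request>
class AdaptiveLifoQueue {
public:
  explicit AdaptiveLifoQueue(size_t lifo_threshold)
      : lifo_threshold_(lifo_threshold) {}

  void push(Request r) { queue_.push_back(std::move(r)); }

  Request pop() {
    if (queue_.size() > lifo_threshold_) {
      // Overloaded: serve the newest request first; the oldest entries
      // have likely already blown their callers' deadlines.
      Request r = std::move(queue_.back());
      queue_.pop_back();
      return r;
    }
    // Normal load: plain FIFO.
    Request r = std::move(queue_.front());
    queue_.pop_front();
    return r;
  }

  size_t size() const { return queue_.size(); }

private:
  const size_t lifo_threshold_;
  std::deque<Request> queue_;
};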

Relevant Links:
Related issue: https://github.com/envoyproxy/envoy/issues/9606
A draft PR for a possible implementation: https://github.com/envoyproxy/envoy/pull/28982

Testing

Setup: Test an upstream with a controlled, random delay, sleeping between 0 and 3 seconds per request, to simulate an overload that saturates upstream capacity.
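
For concreteness, the upstream's behavior amounts to a handler along these lines (a hypothetical sketch, not the actual test server):

#include <chrono>
#include <random>
#include <thread>

// Hypothetical upstream handler: each request blocks for a uniformly
// random 0-3 s before responding, saturating capacity under load.
void handleRequest(/* Response& resp */) {
  static thread_local std::mt19937 rng{std::random_device{}()};
  std::uniform_int_distribution<int> delay_ms(0, 3000);
  std::this_thread::sleep_for(std::chrono::milliseconds(delay_ms(rng)));
  // resp.sendStatus(200);  // hypothetical response call
}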

Test 1: Existing FIFO queue in Envoy:

➜  hey -n 500 -c 120 -t 0 https://test/circuit_breaking

Summary:
  Total:        94.1187 secs
  Slowest:        51.9708 secs
  Fastest:        0.0456 secs
  Average:        14.8038 secs
  Requests/sec:        5.0999

  Total data:        960 bytes
  Size/request:        2 bytes

Response time histogram:
  0.046 [1]        |
  5.238 [256]        |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  10.431 [27]        |■■■■
  15.623 [15]        |■■
  20.816 [28]        |■■■■
  26.008 [14]        |■■
  31.201 [28]        |■■■■
  36.393 [33]        |■■■■■
  41.586 [33]        |■■■■■
  46.778 [22]        |■■■
  51.971 [23]        |■■■■

Latency distribution:
  10% in 0.9155 secs
  25% in 2.0353 secs
  50% in 4.7792 secs
  75% in 29.8650 secs
  90% in 41.4760 secs
  95% in 46.5524 secs
  99% in 51.0998 secs

Test 2: Adaptive LIFO queue in Envoy (per PR):

➜  hey -n 500 -c 120 -t 0 https://test/circuit_breaking

Summary:
  Total:        115.0796 secs
  Slowest:        109.0175 secs
  Fastest:        0.0432 secs
  Average:        13.0082 secs
  Requests/sec:        4.1710

  Total data:        960 bytes
  Size/request:        2 bytes

Response time histogram:
  0.043 [1]        |
  10.941 [373]        |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  21.838 [22]        |■■
  32.735 [11]        |■
  43.633 [12]        |■
  54.530 [14]        |■■
  65.428 [17]        |■■
  76.325 [7]        |■
  87.223 [11]        |■
  98.120 [5]        |■
  109.017 [7]        |■

Latency distribution:
  10% in 0.0928 secs
  25% in 1.0895 secs
  50% in 2.1509 secs
  75% in 7.4232 secs
  90% in 54.1354 secs
  95% in 74.5611 secs
  99% in 103.9622 secs

Test 3: Adaptive LIFO queue + CoDel in Envoy (per PR):

CoDel is configured with a 5ms target delay and a 100ms interval.
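
As a rough sketch of what those two parameters control, following the standard CoDel control law (the draft PR's implementation may differ): a request is shed only once queueing delay has stayed above the target for at least one full interval.

#include <chrono>
#include <optional>

using Clock = std::chrono::steady_clock;

// Simplified CoDel shed decision with target = 5 ms, interval = 100 ms
// as configured above. Illustrative only.
class CodelState {
public:
  // `sojourn` is how long the request just dequeued spent waiting.
  // Returns true if it should be shed (e.g. answered with a 503).
  bool shouldShed(Clock::duration sojourn, Clock::time_point now) {
    if (sojourn < kTarget) {
      // Delay dipped below target: reset the above-target timer.
      shed_after_.reset();
      return false;
    }
    if (!shed_after_) {
      // Delay just went above target: start a full interval's grace.
      shed_after_ = now + kInterval;
      return false;
    }
    // Shed once delay has stayed above target for a whole interval.
    return now >= *shed_after_;
  }

private:
  static constexpr std::chrono::milliseconds kTarget{5};
  static constexpr std::chrono::milliseconds kInterval{100};
  std::optional<Clock::time_point> shed_after_;
};

Full CoDel additionally shortens the interval as consecutive drops accumulate (interval divided by the square root of the drop count); that refinement is omitted here for brevity.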

➜  hey -n 500 -c 120 -t 0 https://test/circuit_breaking

Summary:
  Total:        125.8912 secs
  Slowest:        120.8633 secs
  Fastest:        0.0805 secs
  Average:        4.5031 secs
  Requests/sec:        3.1773

  Total data:        1553924 bytes
  Size/request:        3884 bytes

Response time histogram:
  0.081 [1]        |
  12.159 [394]      |■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■■
  24.237 [0]        |
  36.315 [0]        |
  48.394 [0]        |
  60.472 [0]        |
  72.550 [0]        |
  84.628 [0]        |
  96.707 [0]        |
  108.785 [0]        |
  120.863 [5]        |■

Latency distribution:
  10% in 0.9964 secs
  25% in 1.7487 secs
  50% in 3.1075 secs
  75% in 3.8031 secs
  90% in 5.8382 secs
  95% in 5.8405 secs
  99% in 120.0763 secs

Status code distribution:
  [200]        52 responses
  [503]        343 responses

yanavlasov commented 1 year ago

@htuch @mattklein123 @zuercher @alyssawilk @wbpcode for any comments

alyssawilk commented 1 year ago

I think configurable queuing would be a completely reasonable feature. I'd say the first step would be to factor out the existing queue into an extension with clean APIs, defaulting to the current mechanism, then add an extension point and add your algorithms as an extension, potentially in contrib if you can't find a sponsor per the extension guidelines.
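
To make that factoring concrete, a pluggable pending-request queue extension point might expose an interface along these lines (hypothetical names, not an existing Envoy API):

#include <chrono>
#include <cstddef>
#include <memory>

// Hypothetical extension-point interface. The default implementation
// would preserve today's FIFO behavior; adaptive LIFO and CoDel would
// be alternative extensions registered behind it.
class PendingRequest;  // stand-in for Envoy's pending-request type
using PendingRequestPtr = std::unique_ptr<PendingRequest>;

class PendingRequestQueue {
public:
  virtual ~PendingRequestQueue() = default;

  // Enqueue a request that is waiting for upstream capacity.
  virtual void push(PendingRequestPtr request) = 0;

  // Dequeue the next request to serve (FIFO, LIFO, ...).
  virtual PendingRequestPtr pop() = 0;

  // Whether the queue wants to shed load right now (e.g. CoDel).
  virtual bool shouldShed(std::chrono::steady_clock::time_point now) const = 0;

  virtual size_t size() const = 0;
};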

nezdolik commented 4 weeks ago

@balrawi-figma I can help out with the effort.