envoyproxy / nighthawk

L7 (HTTP/HTTPS/HTTP2/HTTP3) performance characterization tool
Apache License 2.0

Issues around max QPS #576

Open howardjohn opened 3 years ago

howardjohn commented 3 years ago

I would like to send requests at the maximum RPS possible. Many other tools support this mode (hey: the default behavior; fortio: -qps 0).
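For reference, these are roughly the equivalent invocations in those tools (flag spellings from memory, so double-check against each tool's help output):

```
# fortio: -qps 0 means "no rate limit"; -c sets connections, -t sets duration
fortio load -qps 0 -c 64 -t 10s http://127.0.0.1:8080

# hey: leaving out -q (the per-worker rate limit) runs as fast as possible
hey -z 10s -c 64 http://127.0.0.1:8080
```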

In nighthawk, --rps is range-checked, and an out-of-range value fails with `value must be inside range [" '\x01' ", " '\U000f4240' "]` (yay for horrible error messages from envoy; decoded, that means 1 to 1,000,000).

If I request 1M RPS, I get an actual rate of about 10k RPS. If I request 50k RPS, I get the full 50k RPS. This indicates nighthawk is throttling itself unnecessarily, while the server can handle more load.

1M RPS run logs

```
$ nighthawk --rps 1000000 --concurrency 1 --duration 10 --connections 64 --prefetch-connections --max-active-requests 1000 http://127.0.0.1:8080
[12:25:26.352783][854812][I] Starting 1 threads / event loops. Time limit: 10 seconds.
[12:25:26.352823][854812][I] Global targets: 64 connections and 1000000 calls per second.
[12:25:26.353462][854812][W] Unable to use runtime singleton for feature envoy.http.headermap.lazy_map_min_size
[12:25:36.902843][854818][I] Stopping after 10000 ms. Initiated: 103381 / Completed: 103380. (Completion rate was 10337.99689860093 per second.)
Nighthawk - A layer 7 protocol benchmarking tool.

benchmark_http_client.latency_2xx (103317 samples)
  min: 0s 000ms 034us | mean: 0s 000ms 087us | max: 0s 005ms 736us | pstdev: 0s 000ms 041us
  Percentile  Count   Value
  0.5         51659   0s 000ms 084us
  0.75        77488   0s 000ms 093us
  0.8         82654   0s 000ms 096us
  0.9         92988   0s 000ms 104us
  0.95        98152   0s 000ms 115us
  0.990625    102349  0s 000ms 163us
  0.99902344  103217  0s 000ms 594us

Queueing and connection setup latency (103318 samples)
  min: 0s 000ms 001us | mean: 0s 000ms 003us | max: 0s 025ms 735us | pstdev: 0s 000ms 080us
  Percentile  Count   Value
  0.5         51675   0s 000ms 003us
  0.75        77510   0s 000ms 004us
  0.8         82656   0s 000ms 004us
  0.9         92997   0s 000ms 005us
  0.95        98156   0s 000ms 005us
  0.990625    102350  0s 000ms 011us
  0.99902344  103218  0s 000ms 016us

Request start to response end (103317 samples)
  min: 0s 000ms 034us | mean: 0s 000ms 087us | max: 0s 005ms 735us | pstdev: 0s 000ms 041us
  Percentile  Count   Value
  0.5         51669   0s 000ms 083us
  0.75        77491   0s 000ms 093us
  0.8         82659   0s 000ms 096us
  0.9         92987   0s 000ms 104us
  0.95        98153   0s 000ms 115us
  0.990625    102349  0s 000ms 162us
  0.99902344  103217  0s 000ms 593us

Response body size in bytes (103317 samples)
  min: 0 | mean: 0.0 | max: 0 | pstdev: 0.0

Response header size in bytes (103317 samples)
  min: 58 | mean: 58.0 | max: 58 | pstdev: 0.0

Blocking. Results are skewed when significant numbers are reported here. (103318 samples)
  min: 0s 000ms 018us | mean: 0s 000ms 096us | max: 0s 005ms 778us | pstdev: 0s 000ms 042us
  Percentile  Count   Value
  0.5         51659   0s 000ms 092us
  0.75        77490   0s 000ms 104us
  0.8         82656   0s 000ms 107us
  0.9         92989   0s 000ms 116us
  0.95        98154   0s 000ms 129us
  0.990625    102350  0s 000ms 180us
  0.99902344  103218  0s 000ms 607us

Initiation to completion (103380 samples)
  min: 0s 000ms 037us | mean: 0s 000ms 108us | max: 0s 026ms 479us | pstdev: 0s 000ms 635us
  Percentile  Count   Value
  0.5         51694   0s 000ms 089us
  0.75        77539   0s 000ms 099us
  0.8         82712   0s 000ms 103us
  0.9         93042   0s 000ms 112us
  0.95        98211   0s 000ms 124us
  0.990625    102411  0s 000ms 176us
  0.99902344  103280  0s 000ms 773us

Counter                          Value    Per second
benchmark.http_2xx               103317   10331.70
benchmark.pool_overflow          63       6.30
cluster_manager.cluster_added    1        0.10
default.total_match_count        1        0.10
membership_change                1        0.10
runtime.load_success             1        0.10
runtime.override_dir_not_exists  1        0.10
upstream_cx_http1_total          64       6.40
upstream_cx_rx_bytes_total       7748775  774877.27
upstream_cx_total                64       6.40
upstream_cx_tx_bytes_total       6095762  609576.02
upstream_rq_pending_overflow     63       6.30
upstream_rq_pending_total        1        0.10
upstream_rq_total                103318   10331.80
```

50k RPS run logs

```
[12:26:36.753033][855204][I] Starting 1 threads / event loops. Time limit: 10 seconds.
[12:26:36.753072][855204][I] Global targets: 64 connections and 50000 calls per second.
[12:26:36.753695][855204][W] Unable to use runtime singleton for feature envoy.http.headermap.lazy_map_min_size
[12:26:47.303113][855210][I] Stopping after 10000 ms. Initiated: 499999 / Completed: 499996. (Completion rate was 49999.475001312494 per second.)
Nighthawk - A layer 7 protocol benchmarking tool.

benchmark_http_client.latency_2xx (499965 samples)
  min: 0s 000ms 026us | mean: 0s 000ms 169us | max: 0s 007ms 123us | pstdev: 0s 000ms 180us
  Percentile  Count   Value
  0.5         249986  0s 000ms 086us
  0.75        374976  0s 000ms 230us
  0.8         399980  0s 000ms 282us
  0.9         449975  0s 000ms 410us
  0.95        474967  0s 000ms 495us
  0.990625    495278  0s 000ms 759us
  0.99902344  499477  0s 001ms 405us

Queueing and connection setup latency (499968 samples)
  min: 0s 000ms 000us | mean: 0s 000ms 001us | max: 0s 016ms 741us | pstdev: 0s 000ms 023us
  Percentile  Count   Value
  0.5         250156  0s 000ms 001us
  0.75        375423  0s 000ms 002us
  0.8         400613  0s 000ms 002us
  0.9         450102  0s 000ms 002us
  0.95        474994  0s 000ms 002us
  0.990625    495281  0s 000ms 003us
  0.99902344  499480  0s 000ms 008us

Request start to response end (499965 samples)
  min: 0s 000ms 026us | mean: 0s 000ms 169us | max: 0s 007ms 122us | pstdev: 0s 000ms 180us
  Percentile  Count   Value
  0.5         249992  0s 000ms 086us
  0.75        374976  0s 000ms 230us
  0.8         399975  0s 000ms 281us
  0.9         449973  0s 000ms 410us
  0.95        474968  0s 000ms 495us
  0.990625    495278  0s 000ms 759us
  0.99902344  499477  0s 001ms 405us

Response body size in bytes (499965 samples)
  min: 0 | mean: 0.0 | max: 0 | pstdev: 0.0

Response header size in bytes (499965 samples)
  min: 58 | mean: 58.0 | max: 58 | pstdev: 0.0

Blocking. Results are skewed when significant numbers are reported here. (4370 samples)
  min: 0s 000ms 014us | mean: 0s 000ms 264us | max: 0s 005ms 615us | pstdev: 0s 000ms 152us
  Percentile  Count  Value
  0.5         2185   0s 000ms 253us
  0.75        3278   0s 000ms 350us
  0.8         3496   0s 000ms 357us
  0.9         3933   0s 000ms 373us
  0.95        4152   0s 000ms 386us
  0.990625    4330   0s 000ms 472us
  0.99902344  4366   0s 001ms 388us

Initiation to completion (499996 samples)
  min: 0s 000ms 029us | mean: 0s 000ms 178us | max: 0s 017ms 698us | pstdev: 0s 000ms 187us
  Percentile  Count   Value
  0.5         250004  0s 000ms 091us
  0.75        374999  0s 000ms 243us
  0.8         400002  0s 000ms 298us
  0.9         450000  0s 000ms 435us
  0.95        474998  0s 000ms 518us
  0.990625    495309  0s 000ms 771us
  0.99902344  499508  0s 001ms 424us

Counter                          Value     Per second
benchmark.http_2xx               499965    49996.37
benchmark.pool_overflow          31        3.10
cluster_manager.cluster_added    1         0.10
default.total_match_count        1         0.10
membership_change                1         0.10
runtime.load_success             1         0.10
runtime.override_dir_not_exists  1         0.10
upstream_cx_http1_total          64        6.40
upstream_cx_rx_bytes_total       37497375  3749727.91
upstream_cx_total                64        6.40
upstream_cx_tx_bytes_total       29498112  2949803.65
upstream_rq_pending_overflow     31        3.10
upstream_rq_pending_total        2         0.20
upstream_rq_total                499968    49996.67
[12:26:47.424905][855204][I] Done.
```

Both of these are from `nighthawk --rps ... --concurrency 1 --duration 10 --connections 64 --prefetch-connections`, run against a fortio server on the same machine.
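For anyone reproducing this locally, the setup is roughly the following (fortio's echo server listens on 8080 by default; the nighthawk command is the one from the 1M RPS log above, with --rps varied per run):

```
# Terminal 1: upstream under test (fortio's built-in echo server)
fortio server

# Terminal 2: nighthawk load generator, sweeping --rps
nighthawk --rps 1000000 --concurrency 1 --duration 10 --connections 64 \
  --prefetch-connections --max-active-requests 1000 http://127.0.0.1:8080
```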

Based on this, the main thing I am seeing is pool_overflow=63 for the 1M RPS run, but pool_overflow=31 for the 50k RPS run.

I was a bit surprised that even the 50k QPS run showed pool_overflow=31, despite everything else seeming fine.

It would be great if some combination of the above were addressed (a real max-RPS mode, a clearer --rps error message, and the throughput collapse at very high --rps settings).

Results of varying QPS settings:

```
DEST                   TOOL       QPS      CONS  DUR  SUCCESS  THROUGHPUT   P50     P90     P99
http://127.0.0.1:8080  nighthawk  10000    64    10   99968    9996.89qps   0.08ms  0.12ms  0.34ms
http://127.0.0.1:8080  nighthawk  50000    64    10   499966   49996.70qps  0.09ms  0.42ms  0.77ms
http://127.0.0.1:8080  nighthawk  75000    64    10   687395   68740.33qps  0.30ms  0.41ms  1.03ms
http://127.0.0.1:8080  nighthawk  100000   64    10   690584   69058.48qps  0.30ms  0.42ms  1.03ms
http://127.0.0.1:8080  nighthawk  200000   64    10   688416   68843.29qps  0.30ms  0.42ms  1.03ms
http://127.0.0.1:8080  nighthawk  500000   64    10   694161   69417.19qps  0.30ms  0.40ms  1.02ms
http://127.0.0.1:8080  nighthawk  750000   64    10   103356   10335.69qps  0.08ms  0.10ms  0.16ms
http://127.0.0.1:8080  nighthawk  1000000  64    10   103350   10335.09qps  0.08ms  0.10ms  0.16ms
```
oschaaf commented 3 years ago

With --open-loop I would expect more overflows, but in closed-loop mode we instead report the time spent blocked on resource shortages (pool queue size / connections) in a histogram. I would expect that histogram, which shows how much and how long NH is blocked, to rise when going from 50k to 1M RPS, and that does seem to be happening in the examples above.
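For comparison, this is roughly what the open-loop variant of the run above would look like (same flags as in the original report plus --open-loop, which is mentioned above; I have not verified how it interacts with the other flags here):

```
# Open loop: keep releasing requests on schedule even when no connection/stream is free,
# so shortages surface as pool_overflow counters instead of "Blocking" histogram samples.
nighthawk --open-loop --rps 1000000 --concurrency 1 --duration 10 --connections 64 \
  --prefetch-connections --max-active-requests 1000 http://127.0.0.1:8080
```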

I think throughput degrades when QPS is set extremely high because NH ends up spending most of its time processing failed release attempts, to the point where that significantly interferes with i/o processing on the event loop.

For a dedicated max-QPS feature: I'm thinking we may want to specialise there and add a separate flag. Instead of letting NH set the pace, we would let inbound responses directly trigger NH to issue a new request. I suspect this would emulate some of the other tooling, and it would keep the feature compatible with configurations that influence pacing, like --jitter. This should approach the actual max QPS a server can handle, though in some cases a slightly higher QPS might be obtainable, because in this flow we are not always putting the server to work on a new request while it is still sending us a reply.
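Purely as a sketch of what that could look like on the command line (the flag name below is invented for illustration and does not exist today):

```
# Hypothetical: no pacer at all; each completed response immediately releases the next
# request, bounded only by --connections and --max-active-requests.
nighthawk --max-rate --concurrency 1 --duration 10 --connections 64 \
  --prefetch-connections http://127.0.0.1:8080
```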

howardjohn commented 3 years ago

@oschaaf is there any way to keep it from getting blocked? It seems odd that max throughput decays once the configured rate gets too high. For example, the 1M run completed about 10331 RPS across 64 connections, so each connection only needs ~161 RPS. That gives about 6ms to send each request, which is far, far longer than the reported latencies?
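Spelling out that arithmetic with the numbers from the 1M RPS run above:

```
10331.70 rps / 64 connections ≈ 161.4 rps per connection
1 s / 161.4                   ≈ 6.2 ms available per request
reported latency              ≈ 0.087 ms mean, 0.104 ms p90
```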

BTW, this is just something I personally stumbled upon; I don't think Istio does any throughput testing.

oschaaf commented 3 years ago

Circling back to this: one useful trick may be to increase the queue depth on the client by tweaking --max-pending-requests.

That allows some wiggle room with respect to back-pressure, since the system gets a chance to queue up requests when there are no free connections/streams to put them into. On the first opportunity window, which arises when a reply comes in and triggers an i/o event, a queued request is taken from the queue and assigned to a stream/connection. Effectively this gives the request-release pacing some room to ramp back up and try to maintain the configured pace.
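A minimal sketch of that suggestion applied to the command from the original report (the value 1000 below is an arbitrary illustration, not a recommendation; it would need tuning alongside --max-active-requests):

```
# Same run as before, but allow up to 1000 requests to queue client-side
# instead of overflowing when no connection/stream is immediately free.
nighthawk --rps 1000000 --concurrency 1 --duration 10 --connections 64 \
  --prefetch-connections --max-active-requests 1000 --max-pending-requests 1000 \
  http://127.0.0.1:8080
```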