Opportunities to improve TE Platform-Plaintext benchmark

EgorBo commented 2 years ago

Platform-Plaintext TechEmpower Benchmark where dotnet shows impressive results (2nd place, up to ~12.4mln requests per second (RPS) on our PerfLab hardware with bigger network bandwidth) seems to be slightly bottlenecked by three SpanHelpers.IndexOf[Any] functions:

^ Linux-x64

e.g. IndexOfAny(val0, val1) mostly tries to find \n or \r in ASCII strings of length = 26, 49 and 151 where needed symbols usually found at positions 21, 22 and 100 (http headers)

Accept: application/json,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7
Host: 10.0.0.102:5000
Connection: keep-alive

Host: 10.0.0.102:5000
Connection: keep-alive

Connection: keep-alive

I tried to rewrite them by hands e.g. I changed some branches, removed "Duff's devices", added "two 256bit vectors per iteration" path. Here is a standalone benchmark project with test data extracted from the Platform-Plaintext benchmark.

As the result, I constantly see stable improvements around 1-3% (never slower):

| load                   | base        | my          |         |
| ---------------------- | ----------- | ----------- | ------- |
| CPU Usage (%)          |          96 |          97 |  +1.04% |
| Cores usage (%)        |       2,702 |       2,726 |  +0.89% |
| Working Set (MB)       |          37 |          37 |   0.00% |
| Private Memory (MB)    |         370 |         370 |   0.00% |
| Start Time (ms)        |           0 |           0 |         |
| First Request (ms)     |          77 |          75 |  -2.60% |
| Requests/sec           |  12,403,854 |  12,679,971 |  +2.23% |
| Requests               | 187,261,280 | 191,437,425 |  +2.23% |
| Mean latency (ms)      |        1.18 |        1.31 | +11.02% |
| Max latency (ms)       |       56.19 |       53.21 |  -5.30% |
| Bad responses          |           0 |           0 |         |
| Socket errors          |           0 |           0 |         |
| Read throughput (MB/s) |    1,495.04 |    1,525.76 |  +2.05% |
| Latency 50th (ms)      |        0.66 |        0.63 |  -3.95% |
| Latency 75th (ms)      |        0.99 |        0.95 |  -4.04% |
| Latency 90th (ms)      |        1.83 |        2.00 |  +9.29% |
| Latency 99th (ms)      |       15.86 |       15.82 |  -0.25% |

^ with PGO. As you can see from https://aka.ms/aspnet/benchmarks (17th page, "full" checked) the best results we've ever seen were 12.42M RPS so 12.68M does look like an improvement:

Same relative improvements can be observed in non-PGO mode (default).

Here is the script I used to benchmark it:

for (;;) { crank --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/scenarios/platform.benchmarks.yml --scenario plaintext --profile aspnet-citrine-lin --application.framework net7.0 --application.options.outputFiles C:\prj\WSL\x64\base\*.* --json t1.json ;; crank --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/scenarios/platform.benchmarks.yml --scenario plaintext --profile aspnet-citrine-lin --application.framework net7.0 --application.options.outputFiles C:\prj\WSL\x64\diff\*.* --json t2.json ;; crank compare t1.json t2.json }

Standalone benchmark: https://gist.github.com/EgorBo/1d059726dae285e3a1db501896e8a1bd Commit in dotnet/runtime: https://github.com/EgorBo/runtime-1/commit/b2ee6ad589e0c8d6f495876d8a1f965770243896

/cc @stephentoub @GrabYourPitchforks

dotnet-issue-labeler[bot] commented 2 years ago

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

ghost commented 2 years ago

Tagging subscribers to this area: @dotnet/area-system-buffers, @GrabYourPitchforks See info in area-owners.md if you want to be subscribed.

Issue Details

Platform-Plaintext TechEmpower Benchmark where dotnet shows impressive results ([2nd place](https://www.techempower.com/benchmarks/#section=data-r20&hw=ph&test=plaintext), ~12.4mln RPS on our PerfLab hardware) seems to be slightly bottlenecked by three SpanHelpers.IndexOf[Any] functions: ![image](https://user-images.githubusercontent.com/523221/147016152-d1da2f2b-cedc-4985-8831-a70ca5d07fe7.png) _^ Linux-x64_ e.g. `IndexOfAny` with two values mostly tries to find `\n` or `\r` in ASCII strings of length = 26, 49 and 151 where needed symbols usually found at positions 21, 22 and 100 (http headers) I [tried](https://github.com/EgorBo/runtime-1/commit/b2ee6ad589e0c8d6f495876d8a1f965770243896) to rewrite them by hands e.g. I changed some branches, removed "Duff's devices" (even for cases where needed char is found at position = 0 my impl is still faster). Here is a [standalone benchmark](https://gist.github.com/EgorBo/1d059726dae285e3a1db501896e8a1bd) project with test data extracted from the Platform-Plaintext benchmark. As the result, I constantly see stable improvements around 1-3% (never slower): ``` | load | t1 | t2 | | | ---------------------- | ----------- | ----------- | ------- | | CPU Usage (%) | 96 | 97 | +1.04% | | Cores usage (%) | 2,702 | 2,726 | +0.89% | | Working Set (MB) | 37 | 37 | 0.00% | | Private Memory (MB) | 370 | 370 | 0.00% | | Start Time (ms) | 0 | 0 | | | First Request (ms) | 77 | 75 | -2.60% | | Requests/sec | 12,403,854 | 12,679,971 | +2.23% | | Requests | 187,261,280 | 191,437,425 | +2.23% | | Mean latency (ms) | 1.18 | 1.31 | +11.02% | | Max latency (ms) | 56.19 | 53.21 | -5.30% | | Bad responses | 0 | 0 | | | Socket errors | 0 | 0 | | | Read throughput (MB/s) | 1,495.04 | 1,525.76 | +2.05% | | Latency 50th (ms) | 0.66 | 0.63 | -3.95% | | Latency 75th (ms) | 0.99 | 0.95 | -4.04% | | Latency 90th (ms) | 1.83 | 2.00 | +9.29% | | Latency 99th (ms) | 15.86 | 15.82 | -0.25% | ``` _^ with PGO. As you can see from https://aka.ms/aspnet/benchmarks (17th page, "full" checked) the best results we've seen were around 12.42M RPS Same relative improvements can be observed in non-PGO mode (default)._ Here is the script I used to benchmark it: ``` for (;;) { crank --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/scenarios/platform.benchmarks.yml --scenario plaintext --profile aspnet-citrine-lin --application.framework net7.0 --application.options.outputFiles C:\prj\WSL\x64\base\*.* --json t1.json ;; crank --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/scenarios/platform.benchmarks.yml --scenario plaintext --profile aspnet-citrine-lin --application.framework net7.0 --application.options.outputFiles C:\prj\WSL\x64\diff\*.* --json t2.json ;; crank compare t1.json t2.json } ``` Standalone benchmark: https://gist.github.com/EgorBo/1d059726dae285e3a1db501896e8a1bd Commit in dotnet/runtime: https://github.com/EgorBo/runtime-1/commit/b2ee6ad589e0c8d6f495876d8a1f965770243896 /cc @stephentoub @GrabYourPitchforks

Author:	EgorBo
Assignees:	-
Labels:	`area-System.Buffers`, `tenet-performance`, `untriaged`
Milestone:	-

danmoseley commented 2 years ago

Is it possible that this could regress workloads that don't look like this one? Did you get a chance to run the IndexOf/IndexOfAny benchmarks in dotnet/performance (not sure how good they are but they're there)

EgorBo commented 2 years ago

Is it possible that this could regress workloads that don't look like this one? Did you get a chance to run the IndexOf/IndexOfAny benchmarks in dotnet/performance (not sure how good they are but they're there)

Yes, it's always about trade-offs, but what I'm 100% sure in that we can add a "two 256bit vectors per iteration" path for arrays >= 64 elements (bytes) without hurting other cases https://gist.github.com/EgorBo/1d059726dae285e3a1db501896e8a1bd#file-faster_spanhelpers_indexof-cs-L148-L186.

so it will help us to find \r or \n in

Accept: application/json,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7\r\n

faster 🙂

EgorBo commented 2 years ago

Also, it seems there are optimization opportunities inside the caller of that IndexOfAny - ParseHeaders we can try to detect : on the go as we search for \r or \n https://github.com/dotnet/aspnetcore/blob/main/src/Servers/Kestrel/Core/src/Internal/Http/HttpParser.cs#L141-L421

Currently we do IndexOfAny(data, '\r', '\n') to get current "Header:Value"'s length and then we do IndexOfAny(data, ':', ' ', '\t') again on the same input to extract Header and Value (and validate).

Who knows maybe we can cross 13M RPS 😄

cc @davidfowl @benaadams

stephentoub commented 2 years ago

but what I'm 100% sure in that we can add a "two 256bit vectors per iteration" path

I thought we were on a path/plan to switch to using Vector128 in all of these implementations. Is that not the case, @tannergooding?

EgorBo commented 2 years ago

Is there a reason? E.g. as we recently found out with Tanner in Discord that if you want to quickly check if a value of Http Header is "Proxy-Authenticate" (36 bytes) ignoring its case - it's faster to do it via two 256bit vectors than via three 128bit ones: https://gist.github.com/EgorBo/c8e8490ddd6f9a0d5b72c413ddd81d44 on both Core i7 8700K and Ryzen 5950X. And my impression that for any string longer than 32 bytes it's better to go the AVX path.

Also, I believe all the instructions involved don't cause downlclocking (PL0 aka Power License 0) especially on newer CPUs where all avx2 instructions don't do it. SkylakeX vs Ice Lake, from https://travisdowns.github.io/blog/2020/08/19/icl-avx512-freq.html

Finding the end-line symbol in this header:

Accept: application/json,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7

is twice slower with SSE.

EgorBo commented 2 years ago

Moved to dotnet/aspnet as a PR https://github.com/dotnet/aspnetcore/pull/39216

dotnet / runtime

Opportunities to improve TE Platform-Plaintext benchmark #63059