dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
14.55k stars 4.54k forks source link

Opportunities to improve TE Platform-Plaintext benchmark #63059

Closed EgorBo closed 2 years ago

EgorBo commented 2 years ago

Platform-Plaintext TechEmpower Benchmark where dotnet shows impressive results (2nd place, up to ~12.4mln requests per second (RPS) on our PerfLab hardware with bigger network bandwidth) seems to be slightly bottlenecked by three SpanHelpers.IndexOf[Any] functions: image image

^ Linux-x64

e.g. IndexOfAny(val0, val1) mostly tries to find \n or \r in ASCII strings of length = 26, 49 and 151 where needed symbols usually found at positions 21, 22 and 100 (http headers)

Accept: application/json,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7
Host: 10.0.0.102:5000
Connection: keep-alive
Host: 10.0.0.102:5000
Connection: keep-alive
Connection: keep-alive

I tried to rewrite them by hands e.g. I changed some branches, removed "Duff's devices", added "two 256bit vectors per iteration" path. Here is a standalone benchmark project with test data extracted from the Platform-Plaintext benchmark.

As the result, I constantly see stable improvements around 1-3% (never slower):

| load                   | base        | my          |         |
| ---------------------- | ----------- | ----------- | ------- |
| CPU Usage (%)          |          96 |          97 |  +1.04% |
| Cores usage (%)        |       2,702 |       2,726 |  +0.89% |
| Working Set (MB)       |          37 |          37 |   0.00% |
| Private Memory (MB)    |         370 |         370 |   0.00% |
| Start Time (ms)        |           0 |           0 |         |
| First Request (ms)     |          77 |          75 |  -2.60% |
| Requests/sec           |  12,403,854 |  12,679,971 |  +2.23% |
| Requests               | 187,261,280 | 191,437,425 |  +2.23% |
| Mean latency (ms)      |        1.18 |        1.31 | +11.02% |
| Max latency (ms)       |       56.19 |       53.21 |  -5.30% |
| Bad responses          |           0 |           0 |         |
| Socket errors          |           0 |           0 |         |
| Read throughput (MB/s) |    1,495.04 |    1,525.76 |  +2.05% |
| Latency 50th (ms)      |        0.66 |        0.63 |  -3.95% |
| Latency 75th (ms)      |        0.99 |        0.95 |  -4.04% |
| Latency 90th (ms)      |        1.83 |        2.00 |  +9.29% |
| Latency 99th (ms)      |       15.86 |       15.82 |  -0.25% |

^ with PGO. As you can see from https://aka.ms/aspnet/benchmarks (17th page, "full" checked) the best results we've ever seen were 12.42M RPS so 12.68M does look like an improvement:

image

Same relative improvements can be observed in non-PGO mode (default).

Here is the script I used to benchmark it:

for (;;) { crank --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/scenarios/platform.benchmarks.yml --scenario plaintext --profile aspnet-citrine-lin --application.framework net7.0 --application.options.outputFiles C:\prj\WSL\x64\base\*.* --json t1.json ;; crank --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/scenarios/platform.benchmarks.yml --scenario plaintext --profile aspnet-citrine-lin --application.framework net7.0 --application.options.outputFiles C:\prj\WSL\x64\diff\*.* --json t2.json ;; crank compare t1.json t2.json }

Standalone benchmark: https://gist.github.com/EgorBo/1d059726dae285e3a1db501896e8a1bd Commit in dotnet/runtime: https://github.com/EgorBo/runtime-1/commit/b2ee6ad589e0c8d6f495876d8a1f965770243896

/cc @stephentoub @GrabYourPitchforks

dotnet-issue-labeler[bot] commented 2 years ago

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

ghost commented 2 years ago

Tagging subscribers to this area: @dotnet/area-system-buffers, @GrabYourPitchforks See info in area-owners.md if you want to be subscribed.

Issue Details
Platform-Plaintext TechEmpower Benchmark where dotnet shows impressive results ([2nd place](https://www.techempower.com/benchmarks/#section=data-r20&hw=ph&test=plaintext), ~12.4mln RPS on our PerfLab hardware) seems to be slightly bottlenecked by three SpanHelpers.IndexOf[Any] functions: ![image](https://user-images.githubusercontent.com/523221/147016152-d1da2f2b-cedc-4985-8831-a70ca5d07fe7.png) _^ Linux-x64_ e.g. `IndexOfAny` with two values mostly tries to find `\n` or `\r` in ASCII strings of length = 26, 49 and 151 where needed symbols usually found at positions 21, 22 and 100 (http headers) I [tried](https://github.com/EgorBo/runtime-1/commit/b2ee6ad589e0c8d6f495876d8a1f965770243896) to rewrite them by hands e.g. I changed some branches, removed "Duff's devices" (even for cases where needed char is found at position = 0 my impl is still faster). Here is a [standalone benchmark](https://gist.github.com/EgorBo/1d059726dae285e3a1db501896e8a1bd) project with test data extracted from the Platform-Plaintext benchmark. As the result, I constantly see stable improvements around 1-3% (never slower): ``` | load | t1 | t2 | | | ---------------------- | ----------- | ----------- | ------- | | CPU Usage (%) | 96 | 97 | +1.04% | | Cores usage (%) | 2,702 | 2,726 | +0.89% | | Working Set (MB) | 37 | 37 | 0.00% | | Private Memory (MB) | 370 | 370 | 0.00% | | Start Time (ms) | 0 | 0 | | | First Request (ms) | 77 | 75 | -2.60% | | Requests/sec | 12,403,854 | 12,679,971 | +2.23% | | Requests | 187,261,280 | 191,437,425 | +2.23% | | Mean latency (ms) | 1.18 | 1.31 | +11.02% | | Max latency (ms) | 56.19 | 53.21 | -5.30% | | Bad responses | 0 | 0 | | | Socket errors | 0 | 0 | | | Read throughput (MB/s) | 1,495.04 | 1,525.76 | +2.05% | | Latency 50th (ms) | 0.66 | 0.63 | -3.95% | | Latency 75th (ms) | 0.99 | 0.95 | -4.04% | | Latency 90th (ms) | 1.83 | 2.00 | +9.29% | | Latency 99th (ms) | 15.86 | 15.82 | -0.25% | ``` _^ with PGO. As you can see from https://aka.ms/aspnet/benchmarks (17th page, "full" checked) the best results we've seen were around 12.42M RPS Same relative improvements can be observed in non-PGO mode (default)._ Here is the script I used to benchmark it: ``` for (;;) { crank --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/scenarios/platform.benchmarks.yml --scenario plaintext --profile aspnet-citrine-lin --application.framework net7.0 --application.options.outputFiles C:\prj\WSL\x64\base\*.* --json t1.json ;; crank --config https://raw.githubusercontent.com/aspnet/Benchmarks/main/scenarios/platform.benchmarks.yml --scenario plaintext --profile aspnet-citrine-lin --application.framework net7.0 --application.options.outputFiles C:\prj\WSL\x64\diff\*.* --json t2.json ;; crank compare t1.json t2.json } ``` Standalone benchmark: https://gist.github.com/EgorBo/1d059726dae285e3a1db501896e8a1bd Commit in dotnet/runtime: https://github.com/EgorBo/runtime-1/commit/b2ee6ad589e0c8d6f495876d8a1f965770243896 /cc @stephentoub @GrabYourPitchforks
Author: EgorBo
Assignees: -
Labels: `area-System.Buffers`, `tenet-performance`, `untriaged`
Milestone: -
danmoseley commented 2 years ago

Is it possible that this could regress workloads that don't look like this one? Did you get a chance to run the IndexOf/IndexOfAny benchmarks in dotnet/performance (not sure how good they are but they're there)

EgorBo commented 2 years ago

Is it possible that this could regress workloads that don't look like this one? Did you get a chance to run the IndexOf/IndexOfAny benchmarks in dotnet/performance (not sure how good they are but they're there)

Yes, it's always about trade-offs, but what I'm 100% sure in that we can add a "two 256bit vectors per iteration" path for arrays >= 64 elements (bytes) without hurting other cases https://gist.github.com/EgorBo/1d059726dae285e3a1db501896e8a1bd#file-faster_spanhelpers_indexof-cs-L148-L186.

so it will help us to find \r or \n in

Accept: application/json,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7\r\n

faster πŸ™‚

EgorBo commented 2 years ago

Also, it seems there are optimization opportunities inside the caller of that IndexOfAny - ParseHeaders we can try to detect : on the go as we search for \r or \n https://github.com/dotnet/aspnetcore/blob/main/src/Servers/Kestrel/Core/src/Internal/Http/HttpParser.cs#L141-L421

Currently we do IndexOfAny(data, '\r', '\n') to get current "Header:Value"'s length and then we do IndexOfAny(data, ':', ' ', '\t') again on the same input to extract Header and Value (and validate).

image

Who knows maybe we can cross 13M RPS πŸ˜„

cc @davidfowl @benaadams

stephentoub commented 2 years ago

but what I'm 100% sure in that we can add a "two 256bit vectors per iteration" path

I thought we were on a path/plan to switch to using Vector128 in all of these implementations. Is that not the case, @tannergooding?

EgorBo commented 2 years ago

Is there a reason? E.g. as we recently found out with Tanner in Discord that if you want to quickly check if a value of Http Header is "Proxy-Authenticate" (36 bytes) ignoring its case - it's faster to do it via two 256bit vectors than via three 128bit ones: https://gist.github.com/EgorBo/c8e8490ddd6f9a0d5b72c413ddd81d44 on both Core i7 8700K and Ryzen 5950X. And my impression that for any string longer than 32 bytes it's better to go the AVX path.

Also, I believe all the instructions involved don't cause downlclocking (PL0 aka Power License 0) especially on newer CPUs where all avx2 instructions don't do it. image SkylakeX vs Ice Lake, from https://travisdowns.github.io/blog/2020/08/19/icl-avx512-freq.html

Finding the end-line symbol in this header:

Accept: application/json,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7

is twice slower with SSE.

EgorBo commented 2 years ago

Moved to dotnet/aspnet as a PR https://github.com/dotnet/aspnetcore/pull/39216