Closed EgorBo closed 2 years ago
Tagging subscribers to this area: @dotnet/area-system-buffers, @GrabYourPitchforks See info in area-owners.md if you want to be subscribed.
| Field | Value |
|---|---|
| Author | EgorBo |
| Assignees | - |
| Labels | `area-System.Buffers`, `tenet-performance`, `untriaged` |
| Milestone | - |
Is it possible that this could regress workloads that don't look like this one? Did you get a chance to run the IndexOf/IndexOfAny benchmarks in dotnet/performance (not sure how good they are but they're there)
Yes, it's always about trade-offs, but what I'm 100% sure of is that we can add a "two 256bit vectors per iteration" path for arrays >= 64 elements (bytes) without hurting other cases: https://gist.github.com/EgorBo/1d059726dae285e3a1db501896e8a1bd#file-faster_spanhelpers_indexof-cs-L148-L186. That would help us find `\r` or `\n` in `Accept: application/json,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7\r\n` faster 🙂
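For readers who do not want to open the gist, here is a minimal sketch of what a "two 256bit vectors per iteration" loop can look like. It is not the code from the gist or from `SpanHelpers`: the class name `IndexOfAnySketch` is hypothetical, it assumes .NET 7 or later for the cross-platform `Vector256` API, and it handles everything outside the 64-byte blocks with a plain scalar loop.

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;
using System.Runtime.Intrinsics;

// Hypothetical helper, not the SpanHelpers implementation.
static class IndexOfAnySketch
{
    public static int IndexOfAny(ReadOnlySpan<byte> span, byte value0, byte value1)
    {
        int i = 0;
        int blockSize = 2 * Vector256<byte>.Count; // 64 bytes per iteration

        // Vector path only for inputs of at least 64 bytes, as proposed above.
        if (Vector256.IsHardwareAccelerated && span.Length >= blockSize)
        {
            ref byte start = ref MemoryMarshal.GetReference(span);
            Vector256<byte> v0 = Vector256.Create(value0);
            Vector256<byte> v1 = Vector256.Create(value1);
            int lastBlock = span.Length - blockSize;

            for (; i <= lastBlock; i += blockSize)
            {
                // Load two adjacent 32-byte vectors.
                Vector256<byte> a = Vector256.LoadUnsafe(ref start, (nuint)i);
                Vector256<byte> b = Vector256.LoadUnsafe(ref start, (nuint)(i + Vector256<byte>.Count));

                // Lanes equal to either needle become 0xFF.
                Vector256<byte> ma = Vector256.Equals(a, v0) | Vector256.Equals(a, v1);
                Vector256<byte> mb = Vector256.Equals(b, v0) | Vector256.Equals(b, v1);

                // One branch per 64 bytes.
                if ((ma | mb) != Vector256<byte>.Zero)
                {
                    // Combine both masks into 64 bits; the lowest set bit is the first match.
                    ulong bits = ma.ExtractMostSignificantBits()
                               | ((ulong)mb.ExtractMostSignificantBits() << 32);
                    return i + BitOperations.TrailingZeroCount(bits);
                }
            }
        }

        // Scalar loop for short inputs and for the remaining tail.
        for (; i < span.Length; i++)
        {
            if (span[i] == value0 || span[i] == value1)
                return i;
        }

        return -1;
    }
}
```

A real implementation would also want a vectorized tail and a cheaper path for short inputs, which is where the trade-offs mentioned above come in.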
Also, it seems there are optimization opportunities inside the caller of that IndexOfAny, `ParseHeaders`: we can try to detect `:` on the go as we search for `\r` or `\n`. https://github.com/dotnet/aspnetcore/blob/main/src/Servers/Kestrel/Core/src/Internal/Http/HttpParser.cs#L141-L421

Currently we do `IndexOfAny(data, '\r', '\n')` to get the current "Header:Value"'s length, and then we do `IndexOfAny(data, ':', ' ', '\t')` again on the same input to extract `Header` and `Value` (and validate).
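A rough sketch of the idea, assuming nothing about how Kestrel would actually implement it: stop the first search at `:` as well as at the line terminators, so the header name is never scanned twice. `HeaderScanSketch` and `TrySplitHeader` are hypothetical names, and the space/tab validation that `HttpParser` performs is left out.

```csharp
using System;

// Hypothetical helper, not Kestrel's HttpParser.
static class HeaderScanSketch
{
    // Finds the ':' separator and the first line terminator of one header line.
    // Returns false if the terminator is missing or no ':' precedes it.
    public static bool TrySplitHeader(ReadOnlySpan<byte> data, out int colonIndex, out int lineEnd)
    {
        colonIndex = -1;
        lineEnd = -1;

        // The first search stops at ':' or at a line terminator, whichever comes first.
        int first = data.IndexOfAny((byte)':', (byte)'\r', (byte)'\n');
        if (first < 0 || data[first] != (byte)':')
            return false;

        colonIndex = first;

        // Only the bytes after the colon are scanned for the terminator.
        int rest = data.Slice(first + 1).IndexOfAny((byte)'\r', (byte)'\n');
        if (rest < 0)
            return false;

        lineEnd = first + 1 + rest;
        return true;
    }
}
```

For the input `Host: example.com\r\n` this yields `colonIndex == 4` and `lineEnd == 17`. Whether this beats two independent vectorized searches would have to be measured; a three-value search is not necessarily as cheap as a two-value one.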
Who knows, maybe we can cross 13M RPS 🙂
cc @davidfowl @benaadams
> but what I'm 100% sure of is that we can add a "two 256bit vectors per iteration" path
I thought we were on a path/plan to switch to using Vector128 in all of these implementations. Is that not the case, @tannergooding?
Is there a reason?
E.g. Tanner and I recently found out on Discord that if you want to quickly check whether the value of an HTTP header is "Proxy-Authenticate" (36 bytes), ignoring case, it's faster to do it with two 256-bit vectors than with three 128-bit ones: https://gist.github.com/EgorBo/c8e8490ddd6f9a0d5b72c413ddd81d44 on both Core i7 8700K and Ryzen 5950X. My impression is that for any string longer than 32 bytes it's better to go the AVX path.
Also, I believe none of the instructions involved cause downclocking (they stay in PL0, aka Power License 0), especially on newer CPUs where AVX2 instructions don't trigger it at all. SkylakeX vs Ice Lake, from https://travisdowns.github.io/blog/2020/08/19/icl-avx512-freq.html
Finding the end-of-line symbol in this header:
`Accept: application/json,text/html;q=0.9,application/xhtml+xml;q=0.9,application/xml;q=0.8,*/*;q=0.7`
is twice as slow with SSE.
Moved to dotnet/aspnetcore as a PR: https://github.com/dotnet/aspnetcore/pull/39216
The Platform-Plaintext TechEmpower benchmark, where dotnet shows impressive results (2nd place, up to ~12.4M requests per second (RPS) on our PerfLab hardware with bigger network bandwidth), seems to be slightly bottlenecked by three `SpanHelpers.IndexOf[Any]` functions:

^ Linux-x64

E.g. `IndexOfAny(val0, val1)` mostly tries to find `\n` or `\r` in ASCII strings of length 26, 49 and 151, where the needed symbols are usually found at positions 21, 22 and 100 (HTTP headers).

I tried to rewrite them by hand, e.g. I changed some branches, removed the "Duff's devices" and added a "two 256bit vectors per iteration" path. Here is a standalone benchmark project with test data extracted from the Platform-Plaintext benchmark.
As a result, I consistently see stable improvements of around 1-3% (never slower):
^ with PGO. As you can see from https://aka.ms/aspnet/benchmarks (17th page, "full" checked), the best result we've ever seen was 12.42M RPS, so 12.68M does look like an improvement:
The same relative improvements can be observed in non-PGO mode (default).
Here is the script I used to benchmark it:
Standalone benchmark: https://gist.github.com/EgorBo/1d059726dae285e3a1db501896e8a1bd
Commit in dotnet/runtime: https://github.com/EgorBo/runtime-1/commit/b2ee6ad589e0c8d6f495876d8a1f965770243896
/cc @stephentoub @GrabYourPitchforks