dotnet / aspnetcore

ASP.NET Core is a cross-platform .NET framework for building modern cloud-based web applications on Windows, Mac, or Linux.
https://asp.net
MIT License
35.16k stars 9.92k forks source link

Speed up ILEmitTrie jump table using Binary Search Tree #52812

Closed andrewjsaid closed 6 months ago

andrewjsaid commented 8 months ago

The proposal here is that we can improve ILEmitTrie by using binary search instead of the linear search we currently do for both finding the length as well as the vectorized search.

This improves the MicroBenchmarks as follows:

Implementation in https://github.com/andrewjsaid/dotnet-aspnetcore/pull/1. The PR template implied I had to first create an issue before the PR so I did so. Please let me know if design makes sense I can then point the PR to here for detailed code review.

In the linked PR you can see the changes are:

  1. Binary search for length instead of linear search
  2. Binary search for vectorized comparison instead of linear search
  3. Where possible, binary search for single character comparison instead of linear
  4. Reduce comparison of single character from 2 branches to 1
  5. In vectorized comparison when only comparing letters (which is the majority of real-world cases), I removed 14 IL operations for each iteration of 4 characters. <-- kind of the same optimization as is done in StringSearchValues
dotnet run -c Release --filter *JumpTableMultiple*Trie*  --launchCount 3

Before


BenchmarkDotNet=v0.13.0, OS=Windows 10.0.22621
AMD Ryzen 9 5950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK=9.0.100-alpha.1.23608.3
  [Host]     : .NET 9.0.0 (9.0.23.61215), X64 RyuJIT
  Job-RMGOVO : .NET 9.0.0 (9.0.23.57707), X64 RyuJIT

Server=True  Toolchain=.NET Core 9.0  LaunchCount=3  
RunStrategy=Throughput  
Method Count Mean Error StdDev Op/s
Trie 2 3.284 ns 0.0152 ns 0.0286 ns 304,474,863.5
VectorTrie 2 3.373 ns 0.0190 ns 0.0357 ns 296,466,572.2
Trie 5 3.872 ns 0.0208 ns 0.0386 ns 258,288,594.1
VectorTrie 5 4.081 ns 0.0209 ns 0.0398 ns 245,050,802.2
Trie 10 4.960 ns 0.0501 ns 0.0942 ns 201,618,909.5
VectorTrie 10 5.069 ns 0.0594 ns 0.1129 ns 197,283,921.0
Trie 25 24.888 ns 0.2683 ns 0.5104 ns 40,180,047.5
VectorTrie 25 8.783 ns 0.0418 ns 0.0796 ns 113,854,026.0
Trie 50 41.514 ns 0.3654 ns 0.6952 ns 24,088,288.4
VectorTrie 50 15.320 ns 0.1335 ns 0.2540 ns 65,273,554.0
Trie 100 162.799 ns 1.0319 ns 1.9382 ns 6,142,555.1
VectorTrie 100 38.724 ns 0.2069 ns 0.3887 ns 25,823,634.6

After


BenchmarkDotNet=v0.13.0, OS=Windows 10.0.22621
AMD Ryzen 9 5950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK=9.0.100-alpha.1.23608.3
  [Host]     : .NET 9.0.0 (9.0.23.61215), X64 RyuJIT
  Job-VILVTP : .NET 9.0.0 (9.0.23.57707), X64 RyuJIT

Server=True  Toolchain=.NET Core 9.0  LaunchCount=3  
RunStrategy=Throughput  
Method Count Mean Error StdDev Median Op/s
Trie 2 3.149 ns 0.0169 ns 0.0321 ns 3.148 ns 317,515,568.2
VectorTrie 2 3.380 ns 0.0193 ns 0.0347 ns 3.371 ns 295,849,542.1
Trie 5 3.791 ns 0.0209 ns 0.0397 ns 3.806 ns 263,756,033.7
VectorTrie 5 4.036 ns 0.0210 ns 0.0395 ns 4.026 ns 247,769,489.7
Trie 10 4.079 ns 0.0419 ns 0.0766 ns 4.054 ns 245,174,104.0
VectorTrie 10 4.197 ns 0.0210 ns 0.0400 ns 4.192 ns 238,244,013.7
Trie 25 25.089 ns 0.2536 ns 0.4826 ns 24.883 ns 39,858,055.0
VectorTrie 25 5.478 ns 0.0306 ns 0.0582 ns 5.474 ns 182,558,428.1
Trie 50 37.309 ns 0.2636 ns 0.5015 ns 37.299 ns 26,803,430.1
VectorTrie 50 8.761 ns 0.0302 ns 0.0575 ns 8.768 ns 114,138,948.1
Trie 100 60.896 ns 0.2635 ns 0.5013 ns 60.724 ns 16,421,417.5
VectorTrie 100 31.195 ns 0.1460 ns 0.2742 ns 31.206 ns 32,055,996.7
BrennanConroy commented 8 months ago

+@javiercn @JamesNK you two are familiar with routing

JamesNK commented 8 months ago

I know what the ILEmitTrie does but I've never touched it. From looking at the file history, no one has made significant changes to it since @rynowak wrote it years ago. Snaps for Ryan.

If the tests all pass and perf is good then we're halfway there. I'd like to get someone to double check the vertorized path. I don't have enough expertise to review code and offer an opinion.

What should think about:

JamesNK commented 8 months ago

@andrewjsaid Go ahead and create a PR. We should have discussion in the PR.

rynowak commented 8 months ago

There are rules about what jump table to create. There is a check that prefers a dictionary jump table over a trie above 100 items. Should this change?

I can probably provide some degree of feedback on topics like this in the PR. Unfortunately it's all from memory because I didn't work on this code much after the original version 😓

The good news is that I think the stakes probably pretty low. Most of the decisions like the switch between emit and dictionary were made empirically based on the benchmarks. So if the benchmarks change then the breakpoints should change too.

Binary search wasn't something I thought of while building the original version. Compared to the previous version of routing all of this work was in a different galaxy perf-wise 🪐 . Most of my time was spent worrying about overhead in the N == 1 case.

Is the new generated IL significantly larger? There can be a lot of trie jump tables so we need to keep its size in check

I think @JamesNK raises a good point about code size. Changing the complexity to log(N) means that the emit approach will scale to a higher breakpoint (where dictionary becomes better) than the linear version. We may not want to make this decision solely based on the throughput. In my anecdotal experience a "big" node of the routing tree has 10+ branches, not hundreds or thousands so cases that would result in big tries aren't that common.