Speed up ILEmitTrie jump table using Binary Search Tree

andrewjsaid commented 11 months ago

The proposal here is that we can improve ILEmitTrie by using binary search instead of the linear search we currently do for both finding the length as well as the vectorized search.

This improves the MicroBenchmarks as follows:

VectorTrie (Count = 25, Count = 50) is now 60% of original duration. Curiously only 80% for Count = 100
Trie (Count = 100) is now 40% of original duration. Curiously very minor / insignificant impact on other Counts

Implementation in https://github.com/andrewjsaid/dotnet-aspnetcore/pull/1. The PR template implied I had to first create an issue before the PR so I did so. Please let me know if design makes sense I can then point the PR to here for detailed code review.

In the linked PR you can see the changes are:

Binary search for length instead of linear search
Binary search for vectorized comparison instead of linear search
Where possible, binary search for single character comparison instead of linear
Reduce comparison of single character from 2 branches to 1
In vectorized comparison when only comparing letters (which is the majority of real-world cases), I removed 14 IL operations for each iteration of 4 characters. <-- kind of the same optimization as is done in StringSearchValues

dotnet run -c Release --filter *JumpTableMultiple*Trie*  --launchCount 3

Before


BenchmarkDotNet=v0.13.0, OS=Windows 10.0.22621
AMD Ryzen 9 5950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK=9.0.100-alpha.1.23608.3
  [Host]     : .NET 9.0.0 (9.0.23.61215), X64 RyuJIT
  Job-RMGOVO : .NET 9.0.0 (9.0.23.57707), X64 RyuJIT

Server=True  Toolchain=.NET Core 9.0  LaunchCount=3  
RunStrategy=Throughput

Method	Count	Mean	Error	StdDev	Op/s
Trie	2	3.284 ns	0.0152 ns	0.0286 ns	304,474,863.5
VectorTrie	2	3.373 ns	0.0190 ns	0.0357 ns	296,466,572.2
Trie	5	3.872 ns	0.0208 ns	0.0386 ns	258,288,594.1
VectorTrie	5	4.081 ns	0.0209 ns	0.0398 ns	245,050,802.2
Trie	10	4.960 ns	0.0501 ns	0.0942 ns	201,618,909.5
VectorTrie	10	5.069 ns	0.0594 ns	0.1129 ns	197,283,921.0
Trie	25	24.888 ns	0.2683 ns	0.5104 ns	40,180,047.5
VectorTrie	25	8.783 ns	0.0418 ns	0.0796 ns	113,854,026.0
Trie	50	41.514 ns	0.3654 ns	0.6952 ns	24,088,288.4
VectorTrie	50	15.320 ns	0.1335 ns	0.2540 ns	65,273,554.0
Trie	100	162.799 ns	1.0319 ns	1.9382 ns	6,142,555.1
VectorTrie	100	38.724 ns	0.2069 ns	0.3887 ns	25,823,634.6

After


BenchmarkDotNet=v0.13.0, OS=Windows 10.0.22621
AMD Ryzen 9 5950X, 1 CPU, 32 logical and 16 physical cores
.NET SDK=9.0.100-alpha.1.23608.3
  [Host]     : .NET 9.0.0 (9.0.23.61215), X64 RyuJIT
  Job-VILVTP : .NET 9.0.0 (9.0.23.57707), X64 RyuJIT

Server=True  Toolchain=.NET Core 9.0  LaunchCount=3  
RunStrategy=Throughput

Method	Count	Mean	Error	StdDev	Median	Op/s
Trie	2	3.149 ns	0.0169 ns	0.0321 ns	3.148 ns	317,515,568.2
VectorTrie	2	3.380 ns	0.0193 ns	0.0347 ns	3.371 ns	295,849,542.1
Trie	5	3.791 ns	0.0209 ns	0.0397 ns	3.806 ns	263,756,033.7
VectorTrie	5	4.036 ns	0.0210 ns	0.0395 ns	4.026 ns	247,769,489.7
Trie	10	4.079 ns	0.0419 ns	0.0766 ns	4.054 ns	245,174,104.0
VectorTrie	10	4.197 ns	0.0210 ns	0.0400 ns	4.192 ns	238,244,013.7
Trie	25	25.089 ns	0.2536 ns	0.4826 ns	24.883 ns	39,858,055.0
VectorTrie	25	5.478 ns	0.0306 ns	0.0582 ns	5.474 ns	182,558,428.1
Trie	50	37.309 ns	0.2636 ns	0.5015 ns	37.299 ns	26,803,430.1
VectorTrie	50	8.761 ns	0.0302 ns	0.0575 ns	8.768 ns	114,138,948.1
Trie	100	60.896 ns	0.2635 ns	0.5013 ns	60.724 ns	16,421,417.5
VectorTrie	100	31.195 ns	0.1460 ns	0.2742 ns	31.206 ns	32,055,996.7

BrennanConroy commented 11 months ago

+@javiercn @JamesNK you two are familiar with routing

JamesNK commented 11 months ago

I know what the ILEmitTrie does but I've never touched it. From looking at the file history, no one has made significant changes to it since @rynowak wrote it years ago. Snaps for Ryan.

If the tests all pass and perf is good then we're halfway there. I'd like to get someone to double check the vertorized path. I don't have enough expertise to review code and offer an opinion.

What should think about:

Is the new generated IL significantly larger? There can be a lot of trie jump tables so we need to keep its size in check
There are rules about what jump table to create. There is a check that prefers a dictionary jump table over a trie above 100 items. Should this change?

JamesNK commented 11 months ago

@andrewjsaid Go ahead and create a PR. We should have discussion in the PR.

rynowak commented 11 months ago

There are rules about what jump table to create. There is a check that prefers a dictionary jump table over a trie above 100 items. Should this change?

I can probably provide some degree of feedback on topics like this in the PR. Unfortunately it's all from memory because I didn't work on this code much after the original version 😓

The good news is that I think the stakes probably pretty low. Most of the decisions like the switch between emit and dictionary were made empirically based on the benchmarks. So if the benchmarks change then the breakpoints should change too.

Binary search wasn't something I thought of while building the original version. Compared to the previous version of routing all of this work was in a different galaxy perf-wise 🪐 . Most of my time was spent worrying about overhead in the N == 1 case.

Is the new generated IL significantly larger? There can be a lot of trie jump tables so we need to keep its size in check

I think @JamesNK raises a good point about code size. Changing the complexity to log(N) means that the emit approach will scale to a higher breakpoint (where dictionary becomes better) than the linear version. We may not want to make this decision solely based on the throughput. In my anecdotal experience a "big" node of the routing tree has 10+ branches, not hundreds or thousands so cases that would result in big tries aren't that common.

dotnet / aspnetcore

Speed up ILEmitTrie jump table using Binary Search Tree #52812

Before

After