Helco / zzio

Zanzarah - WIP modding tools and engine remake
MIT License

zzre: Allocation-less colliders #371

Closed Helco closed 1 month ago

Helco commented 3 months ago

As discovered in #368 (and #313), the colliders are heavily allocating components due to many uses of LINQ and generator methods. Unfortunately, it does not seem like we will get value-type generator methods in C# anytime soon, so we have to write manual enumerator structs to reduce memory allocations.
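
To illustrate the generator-versus-struct-enumerator trade-off described above, here is a minimal sketch. The names `GeneratorVersion` and `EvenValuesEnumerator` are hypothetical and not taken from zzre; the filter on even numbers only stands in for an actual intersection test.

```csharp
using System;
using System.Collections.Generic;

public static class GeneratorVersion
{
    // Generator method: the compiler rewrites this into a heap-allocated
    // iterator class, so every call allocates at least one object.
    public static IEnumerable<int> EvenValues(int[] values)
    {
        foreach (var value in values)
            if (value % 2 == 0)
                yield return value;
    }
}

// Hand-written value-type enumerator producing the same sequence.
// foreach binds to GetEnumerator/MoveNext/Current by pattern,
// so no interface dispatch and no boxing is involved.
public struct EvenValuesEnumerator
{
    private readonly int[] values;
    private int index;

    public EvenValuesEnumerator(int[] values)
    {
        this.values = values;
        index = -1;
    }

    public int Current => values[index];

    public bool MoveNext()
    {
        // Advance to the next even value, if there is one
        while (++index < values.Length)
        {
            if (values[index] % 2 == 0)
                return true;
        }
        return false;
    }

    // Returning a copy of the struct lets it be used directly in foreach
    public EvenValuesEnumerator GetEnumerator() => this;
}
```

Because both variants yield the same sequence, they can be run side by side and compared element by element, which is the kind of check proposed for the real colliders.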

For sorting intersections we might want to also look into cached lists as well as cached stacks inside the enumerators or have intersections (instead of raycasts) always write into a sorted list.

For testing we can use the TestRaycaster but it should be possible to have both implementations side-by-side and (behind a compiler flag) run them both, expecting the exact same results.

Review todos:

Final benchmark results


```
BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.5011/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
  [Host]  : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  LongRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2

Job=LongRun  IterationCount=100  LaunchCount=3
MaxIterationCount=1000  MinIterationCount=10  WarmupCount=15
```

Intersections

| Method | Mean | Error | StdDev | Median | Ratio | Gen0 | Allocated | Alloc Ratio |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| IntersectionsBaseline | 10.731 ms | 0.0162 ms | 0.0826 ms | 10.711 ms | 1.00 | 640.6250 | 6827633 B | 1.000 |
| IntersectionsList | 4.273 ms | 0.0094 ms | 0.0486 ms | 4.281 ms | 0.40 | - | 8 B | 0.000 |

Raycasts

| Method | Mean | Error | StdDev | Ratio | Gen0 | Allocated | Alloc Ratio |
|---|---:|---:|---:|---:|---:|---:|---:|
| Baseline | 33.15 ms | 0.014 ms | 0.070 ms | 1.00 | 62.5000 | 814483 B | 1.000 |
| Merged | 13.07 ms | 0.004 ms | 0.020 ms | 0.39 | - | 7 B | 0.000 |

GC Profile comparison

The GC profiler shows that TreeCollider was the main cause of per-frame allocations but also that we have quite a way to go for zero allocations per-frame.

[Image: GC profile comparison]

Helco commented 3 months ago

First benchmark results

```
BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4780/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
  [Host]  : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  LongRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2

Job=LongRun  IterationCount=100  LaunchCount=3
MaxIterationCount=1000  MinIterationCount=10  WarmupCount=15
```

| Method | Mean | Error | StdDev | Median | Ratio | Gen0 | Allocated | Alloc Ratio |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| IntersectionsGenerator | 9.125 ms | 0.0115 ms | 0.0588 ms | 9.121 ms | 1.00 | 140.6250 | 1475.41 KB | 1.00 |
| IntersectionsList | 8.780 ms | 0.0081 ms | 0.0409 ms | 8.773 ms | 0.96 | 93.7500 | 989.76 KB | 0.67 |
| IntersectionsStruct | 8.306 ms | 0.0063 ms | 0.0319 ms | 8.303 ms | 0.91 | 62.5000 | 774.8 KB | 0.53 |
| IntersectionsTaggedUnion | 8.498 ms | 0.0122 ms | 0.0633 ms | 8.486 ms | 0.93 | 62.5000 | 774.8 KB | 0.53 |

Of course not depicted in the performance benchmarks is the code quality: IntersectionsStruct has a horrible API that bleeds into all consumers.

Also the allocation-lessing is obviously not complete: the split stacks are still allocated per query and should either be fixed-size for a ridiculous tree size or pooled for amortization. For the next benchmark I will try to preserve the actual status quo as baseline, while this amortization will also be applied to a new generator-based method.

Helco commented 3 months ago

Results with amortized split stacks

```
BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4780/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
  [Host]  : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  LongRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2

Job=LongRun  IterationCount=100  LaunchCount=3
MaxIterationCount=1000  MinIterationCount=10  WarmupCount=15
```

| Method | Mean | Error | StdDev | Median | Ratio | Gen0 | Allocated | Alloc Ratio |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| IntersectionsBaseline | 8.961 ms | 0.0116 ms | 0.0598 ms | 8.951 ms | 1.00 | 62.5000 | 782732 B | 1.000 |
| IntersectionsGenerator | 8.969 ms | 0.0082 ms | 0.0427 ms | 8.970 ms | 1.00 | 62.5000 | 686644 B | 0.877 |
| IntersectionsList | 9.131 ms | 0.0085 ms | 0.0442 ms | 9.129 ms | 1.02 | 15.6250 | 277140 B | 0.354 |
| IntersectionsStruct | 8.591 ms | 0.0080 ms | 0.0414 ms | 8.590 ms | 0.96 | - | 12 B | 0.000 |
| IntersectionsTaggedUnion | 8.505 ms | 0.0113 ms | 0.0587 ms | 8.496 ms | 0.95 | - | 12 B | 0.000 |

Still a bit curious why IntersectionsList is both slower (with supposedly less branching) and allocates per intersection. The struct enumerators have an allocation, but that might be caused by the benchmark and not by the intersection query.
(Also baseline is not correct as I forgot to revert the amortization on the atomic layer)

Helco commented 3 months ago

With Baseline corrected and the power of just removing coarse intersection tests entirely (let's just not care about out-of-bounds, right?) we have no allocations for all three variants that we would expect to be allocation-free (minus amortization).

And we are still spending 20% less runtime.

```
BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4780/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
  [Host]  : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  LongRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2

Job=LongRun  IterationCount=100  LaunchCount=3
MaxIterationCount=1000  MinIterationCount=10  WarmupCount=15
```

| Method | Mean | Error | StdDev | Median | Ratio | Gen0 | Allocated | Alloc Ratio |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| IntersectionsBaseline | 10.642 ms | 0.0141 ms | 0.0731 ms | 10.624 ms | 1.00 | 640.6250 | 6830460 B | 1.000 |
| IntersectionsGenerator | 10.532 ms | 0.0092 ms | 0.0476 ms | 10.531 ms | 0.99 | 562.5000 | 5991292 B | 0.877 |
| IntersectionsList | 8.342 ms | 0.0072 ms | 0.0374 ms | 8.343 ms | 0.78 | - | 12 B | 0.000 |
| IntersectionsStruct | 8.585 ms | 0.0071 ms | 0.0366 ms | 8.585 ms | 0.81 | - | 12 B | 0.000 |
| IntersectionsTaggedUnion | 8.384 ms | 0.0112 ms | 0.0583 ms | 8.372 ms | 0.79 | - | 12 B | 0.000 |

Helco commented 2 months ago

Now we fix the baseline as a separate assembly, because I want to tackle some more shared code within zzre.core. Starting with plastering most of the math functions with AggressiveInlining | AggressiveOptimization after observing that the JITted assembly is abysmal for hot-loop methods. Then we can see that Triangle.ClosestPoint(Vector3), which is responsible for all end-stage math in most intersection queries (those using Sphere as primitive), uses a non-optimal implementation, so we replace that entirely. The new implementation apparently has some other behavior (probably in extreme or special cases) but gameplay seems to still work and the benchmark results warrant taking that risk:

```
BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4780/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
  [Host]  : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  LongRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2

Job=LongRun  IterationCount=100  LaunchCount=3
MaxIterationCount=1000  MinIterationCount=10  WarmupCount=15
```

| Method | Mean | Error | StdDev | Median | Ratio | RatioSD | Gen0 | Allocated | Alloc Ratio |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| IntersectionsBaseline | 10.713 ms | 0.0379 ms | 0.1897 ms | 10.648 ms | 1.00 | 0.02 | 640.6250 | 6827628 B | 1.000 |
| IntersectionsGenerator | 7.621 ms | 0.0401 ms | 0.2065 ms | 7.672 ms | 0.71 | 0.02 | 570.3125 | 5988646 B | 0.877 |
| IntersectionsBaselineList | 8.343 ms | 0.0142 ms | 0.0740 ms | 8.320 ms | 0.78 | 0.02 | - | 12 B | 0.000 |
| IntersectionsList | 5.170 ms | 0.0043 ms | 0.0224 ms | 5.169 ms | 0.48 | 0.01 | - | 6 B | 0.000 |
| IntersectionsStruct | 5.474 ms | 0.0190 ms | 0.0963 ms | 5.427 ms | 0.51 | 0.01 | - | 6 B | 0.000 |
| IntersectionsBaselineTaggedUnion | 8.443 ms | 0.0089 ms | 0.0461 ms | 8.440 ms | 0.79 | 0.01 | - | 12 B | 0.000 |
| IntersectionsTaggedUnion | 5.098 ms | 0.0052 ms | 0.0269 ms | 5.095 ms | 0.48 | 0.01 | - | 6 B | 0.000 |

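
For reference, the attribute combination mentioned in this comment is spelled as follows in .NET. This is only a sketch on a hypothetical helper (`MathEx.DistanceSquared` is not a zzre function). Whether the flags help depends on the method, so each change should be measured, as the benchmarks in this thread do.

```csharp
using System.Numerics;
using System.Runtime.CompilerServices;

public static class MathEx
{
    // Hypothetical hot-loop helper. AggressiveInlining asks the JIT to inline
    // the call where possible; AggressiveOptimization asks it to compile the
    // method fully optimized right away instead of going through tiering.
    [MethodImpl(MethodImplOptions.AggressiveInlining | MethodImplOptions.AggressiveOptimization)]
    public static float DistanceSquared(Vector3 a, Vector3 b)
    {
        var d = a - b;
        return Vector3.Dot(d, d);
    }
}
```
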
Helco commented 2 months ago

And now with the KD optimization

```
BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4780/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
  [Host]  : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  LongRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2

Job=LongRun  IterationCount=100  LaunchCount=3
MaxIterationCount=1000  MinIterationCount=10  WarmupCount=15
```

| Method | Mean | Error | StdDev | Median | Ratio | Gen0 | Allocated | Alloc Ratio |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| IntersectionsBaseline | 10.543 ms | 0.0182 ms | 0.0945 ms | 10.532 ms | 1.00 | 640.6250 | 6827628 B | 1.000 |
| IntersectionsGenerator | 7.447 ms | 0.0112 ms | 0.0582 ms | 7.450 ms | 0.71 | 570.3125 | 5988646 B | 0.877 |
| IntersectionsBaselineList | 8.323 ms | 0.0051 ms | 0.0266 ms | 8.324 ms | 0.79 | - | 12 B | 0.000 |
| IntersectionsList | 5.220 ms | 0.0083 ms | 0.0432 ms | 5.233 ms | 0.50 | - | 6 B | 0.000 |
| IntersectionsListKD | 3.915 ms | 0.0030 ms | 0.0156 ms | 3.914 ms | 0.37 | - | 6 B | 0.000 |
| IntersectionsStruct | 5.407 ms | 0.0051 ms | 0.0265 ms | 5.401 ms | 0.51 | - | 6 B | 0.000 |
| IntersectionsBaselineTaggedUnion | 8.474 ms | 0.0090 ms | 0.0467 ms | 8.464 ms | 0.80 | - | 12 B | 0.000 |
| IntersectionsTaggedUnion | 5.129 ms | 0.0050 ms | 0.0258 ms | 5.126 ms | 0.49 | - | 6 B | 0.000 |

Helco commented 2 months ago

Now we merge the two levels of kd-trees into a single structure, which brings just a bit of performance (getting us to exactly 3x faster) but should also simplify some API stuff, so maybe looking into the struct enumerator might be worthwhile again.

Not much but it is there

```
BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4780/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
  [Host]  : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  LongRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2

Job=LongRun  IterationCount=100  LaunchCount=3
MaxIterationCount=1000  MinIterationCount=10  WarmupCount=15
```

| Method | Mean | Error | StdDev | Median | Ratio | Gen0 | Allocated | Alloc Ratio |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| IntersectionsBaseline | 10.519 ms | 0.0143 ms | 0.0728 ms | 10.506 ms | 1.00 | 640.6250 | 6827628 B | 1.000 |
| IntersectionsGenerator | 7.438 ms | 0.0068 ms | 0.0354 ms | 7.433 ms | 0.71 | 570.3125 | 5988646 B | 0.877 |
| IntersectionsBaselineList | 8.365 ms | 0.0200 ms | 0.1042 ms | 8.338 ms | 0.80 | - | 12 B | 0.000 |
| IntersectionsList | 5.229 ms | 0.0052 ms | 0.0267 ms | 5.231 ms | 0.50 | - | 6 B | 0.000 |
| IntersectionsListKD | 3.887 ms | 0.0049 ms | 0.0256 ms | 3.884 ms | 0.37 | - | 6 B | 0.000 |
| IntersectionsListKDMerged | 3.466 ms | 0.0033 ms | 0.0170 ms | 3.464 ms | 0.33 | - | 3 B | 0.000 |
| IntersectionsStruct | 5.421 ms | 0.0054 ms | 0.0283 ms | 5.422 ms | 0.52 | - | 6 B | 0.000 |
| IntersectionsBaselineTaggedUnion | 8.458 ms | 0.0083 ms | 0.0434 ms | 8.454 ms | 0.80 | - | 12 B | 0.000 |
| IntersectionsTaggedUnion | 5.266 ms | 0.0081 ms | 0.0416 ms | 5.263 ms | 0.50 | - | 6 B | 0.000 |

Also I should probably clean up a bit: both the math optimization and the KD optimization have proven themselves and we no longer need to run them every time. Meaning: every test except the baseline ones will get KD, just the suffix is not kept.

Helco commented 2 months ago

All benchmarks should have the KD optimization. Also I checked the differences, which seem to be just between Baseline and Current due to the triangle-sphere intersection. These differences seem to point to erroneous behaviour of the old one, so I will let that slide.

And the benchmark results

```
BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4780/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
  [Host]  : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  LongRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2

Job=LongRun  IterationCount=100  LaunchCount=3
MaxIterationCount=1000  MinIterationCount=10  WarmupCount=15
```

| Method | Mean | Error | StdDev | Ratio | RatioSD | Gen0 | Allocated | Alloc Ratio |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| IntersectionsBaseline | 11.115 ms | 0.0337 ms | 0.1743 ms | 1.00 | 0.02 | 640.6250 | 6827628 B | 1.000 |
| IntersectionsList | 4.086 ms | 0.0055 ms | 0.0282 ms | 0.37 | 0.01 | - | 6 B | 0.000 |
| IntersectionsListKDMerged | 3.609 ms | 0.0067 ms | 0.0347 ms | 0.32 | 0.01 | - | 3 B | 0.000 |
| IntersectionsStruct | 4.108 ms | 0.0051 ms | 0.0261 ms | 0.37 | 0.01 | - | - | 0.000 |
| IntersectionsTaggedUnion | 4.890 ms | 0.0065 ms | 0.0335 ms | 0.44 | 0.01 | - | 6 B | 0.000 |

Helco commented 2 months ago

While writing the MergedCollider I asked myself whether the memory layout of the full split array would affect performance.

So enjoy this one-off benchmark

```
BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4780/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
  [Host]    : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  MediumRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2

Job=MediumRun  IterationCount=15  LaunchCount=2
MaxIterationCount=1000  MinIterationCount=10  WarmupCount=10
```

| Method | Mean | Error | StdDev | Median | Ratio | Gen0 | Allocated | Alloc Ratio |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| IntersectionsBaseline | 11.184 ms | 0.0465 ms | 0.0681 ms | 11.180 ms | 1.00 | 640.6250 | 6827623 B | 1.000 |
| IntersectionsList | 4.120 ms | 0.0352 ms | 0.0527 ms | 4.115 ms | 0.37 | - | 6 B | 0.000 |
| IntersectionsListKDMergedDF1 | 3.609 ms | 0.0387 ms | 0.0580 ms | 3.605 ms | 0.32 | - | 3 B | 0.000 |
| IntersectionsListKDMergedDF2 | 3.666 ms | 0.0300 ms | 0.0449 ms | 3.665 ms | 0.33 | - | 2 B | 0.000 |
| IntersectionsListKDMergedBF | 3.632 ms | 0.0382 ms | 0.0560 ms | 3.602 ms | 0.32 | - | - | 0.000 |
| IntersectionsStruct | 4.144 ms | 0.0309 ms | 0.0453 ms | 4.131 ms | 0.37 | - | 6 B | 0.000 |
| IntersectionsTaggedUnion | 5.082 ms | 0.0761 ms | 0.1091 ms | 5.082 ms | 0.45 | - | 6 B | 0.000 |

The answer: Not really, any difference here is pretty near the threshold of error... So let's go with the simplest one.

Helco commented 2 months ago

Finally the SIMD (two-split) benchmarks are in with three-split being scribbled up.

Let's see how it turned out

```
BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4894/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
  [Host]    : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  MediumRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2

Job=MediumRun  IterationCount=15  LaunchCount=2
MaxIterationCount=1000  MinIterationCount=10  WarmupCount=10
```

| Method | Mean | Error | StdDev | Ratio | Gen0 | Allocated | Alloc Ratio |
|---|---:|---:|---:|---:|---:|---:|---:|
| IntersectionsBaseline | 10.593 ms | 0.0500 ms | 0.0717 ms | 1.00 | 640.6250 | 6827628 B | 1.000 |
| IntersectionsList | 4.010 ms | 0.0196 ms | 0.0293 ms | 0.38 | - | 6 B | 0.000 |
| IntersectionsListKDMerged | 3.563 ms | 0.0170 ms | 0.0249 ms | 0.34 | - | 3 B | 0.000 |
| IntersectionsListKDMergedInty | 3.579 ms | 0.0054 ms | 0.0081 ms | 0.34 | - | 3 B | 0.000 |
| IntersectionsStruct | 4.058 ms | 0.0102 ms | 0.0152 ms | 0.38 | - | 6 B | 0.000 |
| IntersectionsTaggedUnion | 4.906 ms | 0.0348 ms | 0.0488 ms | 0.46 | - | 6 B | 0.000 |
| IntersectionsSIMD128MoreBranches | 3.472 ms | 0.0083 ms | 0.0124 ms | 0.33 | - | 3 B | 0.000 |
| IntersectionsSIMD128 | 3.805 ms | 0.0061 ms | 0.0090 ms | 0.36 | - | 3 B | 0.000 |
| IntersectionsSIMD256 | 3.772 ms | 0.0045 ms | 0.0067 ms | 0.36 | - | 3 B | 0.000 |

Oh well, this is surprisingly bad :) I probably still want to try the three-split one just for good measure, but we can already see that the branch reduction is not helpful and, if it does help performance, it is a minuscule benefit.

Helco commented 2 months ago

And here are the results for the SIMD512 three-split collider. Because we have more loop iterations I also re-added the less-branching variant for the new benchmark.

I usually look at the results only after posting them here, so I cannot tell what hides under here...

```
BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4894/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
  [Host]    : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  MediumRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2

Job=MediumRun  IterationCount=15  LaunchCount=2
MaxIterationCount=1000  MinIterationCount=10  WarmupCount=10
```

| Method | Mean | Error | StdDev | Ratio | Gen0 | Allocated | Alloc Ratio |
|---|---:|---:|---:|---:|---:|---:|---:|
| IntersectionsBaseline | 10.551 ms | 0.0273 ms | 0.0392 ms | 1.00 | 640.6250 | 6827628 B | 1.000 |
| IntersectionsList | 4.012 ms | 0.0233 ms | 0.0349 ms | 0.38 | - | 6 B | 0.000 |
| IntersectionsListKDMerged | 3.581 ms | 0.0136 ms | 0.0199 ms | 0.34 | - | 3 B | 0.000 |
| IntersectionsStruct | 4.056 ms | 0.0121 ms | 0.0182 ms | 0.38 | - | 6 B | 0.000 |
| IntersectionsTaggedUnion | 4.795 ms | 0.0142 ms | 0.0213 ms | 0.45 | - | 6 B | 0.000 |
| IntersectionsSIMD128 | 3.462 ms | 0.0080 ms | 0.0117 ms | 0.33 | - | 3 B | 0.000 |
| IntersectionsSIMD256 | 3.555 ms | 0.0157 ms | 0.0235 ms | 0.34 | - | 3 B | 0.000 |
| IntersectionsSIMD512 | 3.628 ms | 0.0098 ms | 0.0144 ms | 0.34 | - | 3 B | 0.000 |
| IntersectionsSIMD512LB | 3.850 ms | 0.0115 ms | 0.0169 ms | 0.36 | - | 3 B | 0.000 |

The answer: not very much. We again have a minimal performance benefit from the SIMD128 two-split collider, but anything higher performs worse and is naturally more complex. At this point I might scrap SIMD altogether for this use case unless I have another idea for this. If I get crazy I might try attaching Intel VTune, for example, and check whether the SIMD ones have some solvable problem.

Just as a text note without further benchmark results: I tested an SOA variant of the SIMD128 with no discernible difference in performance. VTune showed a major bottleneck to be branch mispredictions, especially in the leaf Triangle-Sphere intersection test, which I guess is to be expected (if we reasonably knew the outcome we would not have to ask this very question), so I can see no obvious fault in the algorithm at the microarchitecture level.

Helco commented 2 months ago

I am almost at the end of the Intersections method, with the winner being the MergedCollider and some of the simpler variants like List, Struct or TaggedUnion. I had yet to benchmark the latter two in the merged collider, so here are the numbers for that:

```
BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4894/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
  [Host]    : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  MediumRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2

Job=MediumRun  IterationCount=15  LaunchCount=2
MaxIterationCount=1000  MinIterationCount=10  WarmupCount=10
```

| Method | Mean | Error | StdDev | Ratio | RatioSD | Gen0 | Allocated | Alloc Ratio |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| IntersectionsBaseline | 11.121 ms | 0.0906 ms | 0.1356 ms | 1.00 | 0.02 | 640.6250 | 6827628 B | 1.000 |
| IntersectionsListKDMerged | 3.730 ms | 0.0356 ms | 0.0522 ms | 0.34 | 0.01 | - | 3 B | 0.000 |
| IntersectionsStructMerged | 3.747 ms | 0.0094 ms | 0.0140 ms | 0.34 | 0.00 | - | 3 B | 0.000 |
| IntersectionsTaggedUnionMerged | 4.044 ms | 0.0176 ms | 0.0258 ms | 0.36 | 0.00 | - | 6 B | 0.000 |

These numbers again show: simpler is better, so I will leave it at that. We can still cheat for one actual usecase in the game, where a line intersection is equivalent to a raycast. But for the other usecases (especially physics) we would need to incorporate more usecase-specific operations in order to allow for optimizations (e.g. filter by a product in order to reduce to a single-nearest-neighbor search). I am currently not inclined to do that.

I still would like to roughly benchmark raycast, making sure that it does not allocate and maybe try out a couple variants. After that I will wrap up this PR by putting the experiments into a backup branch and applying the winner variants to the productive game. Probably nice to then have a comparison of GC behavior, but that will have to wait a bit yet again.

Helco commented 2 months ago

Initial benchmark for raycasts, the results are worrying. A lot of allocations (which could be easily amortized though) and troublesome performance. I should add a benchmark with a sorted line intersection and also definitely figure out why the merged collider is so much worse than the other two. This is surprising.


```
BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4894/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
  [Host]    : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  MediumRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2

Job=MediumRun  IterationCount=15  LaunchCount=2
MaxIterationCount=1000  MinIterationCount=10  WarmupCount=10
```

| Method | Mean | Error | StdDev | Ratio | Gen0 | Allocated | Alloc Ratio |
|---|---:|---:|---:|---:|---:|---:|---:|
| Baseline | 32.58 ms | 0.111 ms | 0.159 ms | 1.00 | 62.5000 | 795.37 KB | 1.00 |
| SimpleOptimizations | 27.32 ms | 0.053 ms | 0.079 ms | 0.84 | 62.5000 | 795.35 KB | 1.00 |
| Merged | 47.89 ms | 0.102 ms | 0.153 ms | 1.47 | - | 148.5 KB | 0.19 |

Helco commented 2 months ago

Let's get unsurprised here with an easy one first:

Line intersections are not faster than ray casts

```
BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4894/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
  [Host]    : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  MediumRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2

Job=MediumRun  IterationCount=15  LaunchCount=2
MaxIterationCount=1000  MinIterationCount=10  WarmupCount=10
```

| Method | Mean | Error | StdDev | Ratio | RatioSD | Gen0 | Allocated | Alloc Ratio |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| Baseline | 33.06 ms | 0.052 ms | 0.078 ms | 1.00 | 0.00 | 62.5000 | 814462 B | 1.000 |
| LineIntersectionsWorld | 368.22 ms | 4.566 ms | 6.835 ms | 11.14 | 0.21 | - | 736 B | 0.001 |
| LineIntersectionsMerged | 231.64 ms | 1.979 ms | 2.900 ms | 7.01 | 0.09 | - | 245 B | 0.000 |
| SimpleOptimizations | 27.52 ms | 0.072 ms | 0.106 ms | 0.83 | 0.00 | 62.5000 | 814439 B | 1.000 |
| Merged | 47.58 ms | 0.119 ms | 0.178 ms | 1.44 | 0.01 | - | 152067 B | 0.187 |

This can be attributed to intersection queries having to always return all intersections, while raycasts can exit out as soon as there cannot be a closer hit.
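
The difference between the two query types can be sketched on a simplified model. All names here (`Candidate`, `Queries`) are hypothetical, and a flat list sorted front to back stands in for the real tree traversal. Each candidate has the distance at which the ray enters its bounds and an exact hit distance, with NaN marking a miss.

```csharp
using System;

// Hypothetical model: EntryDistance is where the ray enters the candidate's
// bounds, HitDistance is the exact hit distance or NaN for a miss.
public readonly record struct Candidate(float EntryDistance, float HitDistance);

public static class Queries
{
    // Collect-all query: every candidate has to be visited.
    public static int CountAllHits(ReadOnlySpan<Candidate> sortedByEntry, out int visited)
    {
        visited = 0;
        var hits = 0;
        foreach (var c in sortedByEntry)
        {
            visited++;
            if (!float.IsNaN(c.HitDistance))
                hits++;
        }
        return hits;
    }

    // Nearest-hit query: stop as soon as the remaining candidates
    // start behind the best hit found so far.
    public static float NearestHit(ReadOnlySpan<Candidate> sortedByEntry, out int visited)
    {
        visited = 0;
        var best = float.PositiveInfinity;
        foreach (var c in sortedByEntry)
        {
            if (c.EntryDistance > best)
                break; // nothing further back can be closer
            visited++;
            if (c.HitDistance < best) // false for NaN, so misses are skipped
                best = c.HitDistance;
        }
        return best;
    }
}
```

In this model the nearest-hit query touches only the candidates in front of the first confirmed hit, while the collect-all query always touches all of them.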

Helco commented 2 months ago

A one-off benchmark before I have to use a profiler again: At some point I added an SSE 4.1 version of Triangle.Barycentric but never benchmarked it (FOR SHAME!), so here is a benchmark with scalar, explicit SSE 4.1 and SIMD128 versions:

I should have benchmarked it earlier

```
BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4894/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
  [Host]  : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  LongRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2

Job=LongRun  IterationCount=100  LaunchCount=3
MaxIterationCount=1000  MinIterationCount=10  WarmupCount=15
```

| Method | Mean | Error | StdDev | Median | Ratio | RatioSD | Gen0 | Allocated | Alloc Ratio |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| Baseline | 33.18 ms | 0.020 ms | 0.101 ms | 33.17 ms | 1.00 | 0.00 | 62.5000 | 795.37 KB | 1.00 |
| SimpleOptimizations | 27.65 ms | 0.016 ms | 0.085 ms | 27.65 ms | 0.83 | 0.00 | 62.5000 | 795.35 KB | 1.00 |
| MergedScalar | 44.45 ms | 0.043 ms | 0.221 ms | 44.40 ms | 1.34 | 0.01 | - | 148.5 KB | 0.19 |
| MergedSse41 | 48.13 ms | 0.066 ms | 0.339 ms | 48.23 ms | 1.45 | 0.01 | - | 148.5 KB | 0.19 |
| MergedSIMD128 | 46.98 ms | 0.362 ms | 1.872 ms | 48.20 ms | 1.42 | 0.06 | - | 148.5 KB | 0.19 |

We also have multi-modal distributions and vastly different results in MediumRun benchmarks, so in summary I would say: no use for either implementation, just scalar should be fine.

EDIT: Another benchmark not worthy of uploading is trying to just disable the degeneration test. We can do that during merging and save the test for the raycasts, but that is a tiny improvement over the current state. Profiler comparison it is.

Helco commented 2 months ago

Also not uploading: adding MIOptions makes casting almost twice as slow. Just adding AggressiveOptimization (without inlining) is better

And here are the results for that

```
BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4894/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
  [Host]    : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  MediumRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2

Job=MediumRun  IterationCount=15  LaunchCount=2
MaxIterationCount=1000  MinIterationCount=10  WarmupCount=10
```

| Method | Mean | Error | StdDev | Median | Ratio | RatioSD | Gen0 | Allocated | Alloc Ratio |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| Baseline | 33.67 ms | 0.478 ms | 0.671 ms | 33.17 ms | 1.00 | 0.03 | 62.5000 | 795.37 KB | 1.00 |
| SimpleOptimizations | 23.11 ms | 0.051 ms | 0.075 ms | 23.09 ms | 0.69 | 0.01 | 62.5000 | 795.35 KB | 1.00 |
| Merged | 39.55 ms | 0.148 ms | 0.221 ms | 39.49 ms | 1.18 | 0.02 | - | 148.49 KB | 0.19 |

But Merged is still slower than baseline. The profiler unfortunately did not tell me much, so a bit of guesswork it is: I am working on an iterative version.

Helco commented 2 months ago

Oh well. The first iterative raycast and... at least it is parity in performance with baseline and in allocation with merged?

Overall still bad

```
BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4894/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
  [Host]    : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  MediumRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2

Job=MediumRun  IterationCount=15  LaunchCount=2
MaxIterationCount=1000  MinIterationCount=10  WarmupCount=10
```

| Method | Mean | Error | StdDev | Ratio | Gen0 | Allocated | Alloc Ratio |
|---|---:|---:|---:|---:|---:|---:|---:|
| Baseline | 33.15 ms | 0.050 ms | 0.074 ms | 1.00 | 62.5000 | 795.37 KB | 1.00 |
| SimpleOptimizations | 23.19 ms | 0.196 ms | 0.287 ms | 0.70 | 62.5000 | 795.35 KB | 1.00 |
| Merged | 39.49 ms | 0.147 ms | 0.216 ms | 1.19 | - | 148.47 KB | 0.19 |
| MergedIterative | 32.05 ms | 0.056 ms | 0.080 ms | 0.97 | - | 148.48 KB | 0.19 |

The allocations are due to the coarse check, in particular casting against a box allocates at the moment. I would still like to see better numbers for the casting itself.

Helco commented 2 months ago

Now we are getting somewhere. A new iterative variant using additional subtree-elimination reaches allllllmost the WorldCollider+Recursive Cast. That it is not faster is still beyond me.

The results

```
BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4894/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
  [Host]    : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  MediumRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2

Job=MediumRun  IterationCount=15  LaunchCount=2
MaxIterationCount=1000  MinIterationCount=10  WarmupCount=10
```

| Method | Mean | Error | StdDev | Median | Ratio | RatioSD | Gen0 | Allocated | Alloc Ratio |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| Baseline | 33.82 ms | 0.457 ms | 0.670 ms | 34.32 ms | 1.00 | 0.03 | 66.6667 | 814465 B | 1.000 |
| SimpleOptimizations | 23.03 ms | 0.043 ms | 0.063 ms | 23.03 ms | 0.68 | 0.01 | 62.5000 | 814439 B | 1.000 |
| Merged | 39.42 ms | 0.088 ms | 0.126 ms | 39.41 ms | 1.17 | 0.02 | - | 152057 B | 0.187 |
| MergedIterative | 31.72 ms | 0.073 ms | 0.109 ms | 31.70 ms | 0.94 | 0.02 | - | 46 B | 0.000 |
| MergedRW | 25.25 ms | 0.033 ms | 0.048 ms | 25.25 ms | 0.75 | 0.01 | - | 23 B | 0.000 |

Helco commented 2 months ago

I omitted the break-even and just continued. In instrumentation profiles I saw Stack<T> operations taking an unusually high percentage of the runtime, so as a test I replaced it with a StackOverSpan<T> variant that uses ArrayPool<...>.Shared as backing memory.

It worked

```
BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4894/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
  [Host]  : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  LongRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2

Job=LongRun  IterationCount=100  LaunchCount=3
MaxIterationCount=1000  MinIterationCount=10  WarmupCount=15
```

| Method | Mean | Error | StdDev | Median | Ratio | RatioSD | Gen0 | Allocated | Alloc Ratio |
|---|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| Baseline | 33.44 ms | 0.104 ms | 0.538 ms | 33.18 ms | 1.00 | 0.02 | 62.5000 | 814462 B | 1.000 |
| SimpleOptimizations | 22.98 ms | 0.031 ms | 0.159 ms | 22.94 ms | 0.69 | 0.01 | 62.5000 | 814439 B | 1.000 |
| Merged | 39.02 ms | 0.043 ms | 0.222 ms | 39.00 ms | 1.17 | 0.02 | - | 152057 B | 0.187 |
| MergedIterative | 31.64 ms | 0.132 ms | 0.690 ms | 31.43 ms | 0.95 | 0.03 | - | 46 B | 0.000 |
| MergedRWBR | 22.50 ms | 0.023 ms | 0.120 ms | 22.49 ms | 0.67 | 0.01 | - | 23 B | 0.000 |
| MergedRWBRSS | 20.46 ms | 0.016 ms | 0.084 ms | 20.46 ms | 0.61 | 0.01 | - | 34 B | 0.000 |
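
A stack over pooled memory, as described in this comment, might look roughly like the following. This is a hypothetical sketch and not the StackOverSpan<T> from zzre: the type name `PooledSpanStack<T>`, the growth policy and the explicit `Dispose` call are all assumptions.

```csharp
using System;
using System.Buffers;
using System.Runtime.CompilerServices;

// Hypothetical sketch of a stack backed by an array rented from ArrayPool<T>.Shared.
// As a ref struct it lives on the caller's stack and cannot be boxed or captured.
public ref struct PooledSpanStack<T>
{
    private T[] rented;
    private Span<T> items;
    private int count;

    public PooledSpanStack(int capacity)
    {
        // Rent may return a larger array than requested
        rented = ArrayPool<T>.Shared.Rent(Math.Max(1, capacity));
        items = rented;
        count = 0;
    }

    public int Count => count;

    public void Push(T item)
    {
        if (count == items.Length)
            Grow();
        items[count++] = item;
    }

    public bool TryPop(out T item)
    {
        if (count == 0)
        {
            item = default!;
            return false;
        }
        item = items[--count];
        return true;
    }

    private void Grow()
    {
        // Rent a bigger array, copy the live elements, give the old one back
        var larger = ArrayPool<T>.Shared.Rent(items.Length * 2);
        items[..count].CopyTo(larger);
        ReturnRented();
        rented = larger;
        items = larger;
    }

    private void ReturnRented() =>
        // Clear only if T holds references, so the pool does not keep objects alive
        ArrayPool<T>.Shared.Return(rented, RuntimeHelpers.IsReferenceOrContainsReferences<T>());

    public void Dispose()
    {
        ReturnRented();
        rented = Array.Empty<T>();
        items = default;
        count = 0;
    }
}
```

The rented array has to be returned exactly once per query, so a caller would typically wrap the traversal in try/finally and call `Dispose` in the finally block.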

I highly suspect we can also push this further. In the same profiles Nullable was also unusually high, and we can expect some additional performance by replacing the pretty ad-hoc Ray-Triangle intersection with a more standardized one (like Möller and Trumbore).

Helco commented 2 months ago

I am probably going to abandon the recursive merged as well as the naive iterative versions, so I cleaned up the list of benchmarks a bit. Also, instead of appending ever more acronyms to RW, I am going to just compare the previous benchmark results with the current changes (and baseline/simple opt).

The current changes prepare for the alternative Ray-Triangle intersection and also remove the usage of nullables. By using a NaN invariant we can also omit the comparison for misses entirely. At some point I might want to even move the intersection into the TreeCollider to have access to precomputed data without uglifying the Ray interface.
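
The NaN invariant can be illustrated in isolation. In this sketch (the `NearestHit` class is hypothetical, not zzre code) a miss is encoded as `float.NaN` instead of a `float?` without a value. Since every ordered comparison against NaN is false, the nearest-hit loop needs no separate "was it a miss?" branch.

```csharp
using System;

public static class NearestHit
{
    // Each entry is a hit distance, or float.NaN for a miss.
    // Returns the smallest hit distance, or +Infinity if nothing was hit.
    public static float Nearest(ReadOnlySpan<float> distances)
    {
        var best = float.PositiveInfinity;
        foreach (var d in distances)
        {
            // `d < best` is false when d is NaN, so misses never replace `best`
            if (d < best)
                best = d;
        }
        return best;
    }
}
```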

Another millisecond gone

```
BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4894/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
  [Host]    : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  MediumRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2

Job=MediumRun  IterationCount=15  LaunchCount=2
MaxIterationCount=1000  MinIterationCount=10  WarmupCount=10
```

| Method | Mean | Error | StdDev | Ratio | Gen0 | Allocated | Alloc Ratio |
|---|---:|---:|---:|---:|---:|---:|---:|
| Baseline | 32.93 ms | 0.098 ms | 0.143 ms | 1.00 | 62.5000 | 814462 B | 1.000 |
| SimpleOptimizations | 23.01 ms | 0.045 ms | 0.067 ms | 0.70 | 62.5000 | 814439 B | 1.000 |
| MergedRWPrevious | 20.69 ms | 0.106 ms | 0.158 ms | 0.63 | - | 34 B | 0.000 |
| MergedRWNext | 19.69 ms | 0.088 ms | 0.132 ms | 0.60 | - | 34 B | 0.000 |

Helco commented 2 months ago

Möller-Trumbore came through, we are well within 2x, even though I had to add a naive check to cull backfacing triangles from the test. The old intersection method did that without my explicit knowledge. Oh well...

The deeds

```
BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4894/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
  [Host]  : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  LongRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2

Job=LongRun  IterationCount=100  LaunchCount=3
MaxIterationCount=1000  MinIterationCount=10  WarmupCount=15
```

| Method | Mean | Error | StdDev | Median | Ratio | Gen0 | Allocated | Alloc Ratio |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| Baseline | 33.03 ms | 0.017 ms | 0.088 ms | 33.04 ms | 1.00 | 62.5000 | 814462 B | 1.000 |
| SimpleOptimizations | 23.10 ms | 0.020 ms | 0.106 ms | 23.08 ms | 0.70 | 62.5000 | 814439 B | 1.000 |
| MergedRWPrevious | 19.61 ms | 0.010 ms | 0.051 ms | 19.61 ms | 0.59 | - | 34 B | 0.000 |
| MergedRWNext | 14.97 ms | 0.011 ms | 0.057 ms | 14.97 ms | 0.45 | - | 17 B | 0.000 |

Helco commented 2 months ago

As I suspected, there was an intermediate in Möller-Trumbore that we can use to cull back-faces (it's just the sign of the determinant). Also I finally removed degenerate triangles from the merged tree, removed the dummy splits of naive section collisions and reordered the triangles to remove the map indirection.
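
For readers unfamiliar with the algorithm, here is a minimal sketch of the culling branch of Möller-Trumbore using System.Numerics. It is not the zzre implementation: the class name, the epsilon and the choice to return NaN on a miss are assumptions. Which side counts as "front" depends on the winding convention of the mesh.

```csharp
using System.Numerics;

public static class RayTriangle
{
    // Assumed tolerance for treating the ray as parallel to the triangle
    private const float Epsilon = 1e-7f;

    // Returns the distance along the ray to the hit, or NaN on a miss.
    public static float Cast(Vector3 origin, Vector3 direction, Vector3 v0, Vector3 v1, Vector3 v2)
    {
        var edge1 = v1 - v0;
        var edge2 = v2 - v0;
        var pvec = Vector3.Cross(direction, edge2);
        var det = Vector3.Dot(edge1, pvec);

        // The sign of the determinant encodes the facing:
        // det > 0 front-facing, det < 0 back-facing, det near 0 parallel.
        // One comparison therefore culls back-faces and parallel rays.
        if (det < Epsilon)
            return float.NaN;

        // u and v are barycentric coordinates scaled by det
        var tvec = origin - v0;
        var u = Vector3.Dot(tvec, pvec);
        if (u < 0f || u > det)
            return float.NaN;

        var qvec = Vector3.Cross(tvec, edge1);
        var v = Vector3.Dot(direction, qvec);
        if (v < 0f || u + v > det)
            return float.NaN;

        // Only divide once we know the ray passes through the triangle
        var t = Vector3.Dot(edge2, qvec) / det;
        return t >= 0f ? t : float.NaN; // hits behind the origin count as misses
    }
}
```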

This might be the end for optimizations

```
BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4894/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
  [Host]  : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
  LongRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2

Job=LongRun  IterationCount=100  LaunchCount=3
MaxIterationCount=1000  MinIterationCount=10  WarmupCount=15
```

| Method | Mean | Error | StdDev | Median | Ratio | Gen0 | Allocated | Alloc Ratio |
|---|---:|---:|---:|---:|---:|---:|---:|---:|
| Baseline | 33.13 ms | 0.020 ms | 0.103 ms | 33.12 ms | 1.00 | 62.5000 | 814462 B | 1.000 |
| SimpleOptimizations | 23.13 ms | 0.014 ms | 0.070 ms | 23.12 ms | 0.70 | 62.5000 | 814439 B | 1.000 |
| MergedRWPrevious | 14.85 ms | 0.015 ms | 0.076 ms | 14.88 ms | 0.45 | - | 17 B | 0.000 |
| MergedRWNext | 13.25 ms | 0.017 ms | 0.087 ms | 13.22 ms | 0.40 | - | 17 B | 0.000 |

If I want to continue, I guess more precomputation and an even faster intersection algorithm might be the way to go. But that is not a way I want to go, as I already use more memory for the merged trees.