Closed: Helco closed this 3 weeks ago
Of course not depicted in the performance benchmarks is the code quality: IntersectionsStruct has a horrible API that bleeds into all consumers.
Also the allocation reduction is obviously not complete; the split stacks are still allocated per query and should either be fixed-size (sized for a ridiculously large tree) or pooled for amortization. For the next benchmark I will try to preserve the actual status quo as baseline, while this amortization will also be applied to a new generator-based method.
Still a bit curious why IntersectionsList is both slower (with supposedly less branching) and allocates per intersection. The struct enumerator has an allocation, but that might be caused by the benchmark and not by the intersection query itself.
(Also the baseline is not correct, as I forgot to revert the amortization on the atomic layer.)
With the baseline corrected and the power of just removing coarse intersection tests entirely (let's just not care about out-of-bounds, right?), all three variants that we would expect to be allocation-free now show no allocations (minus amortization).
Now we fix the baseline as a separate assembly, because I want to tackle some more shared code within zzre.core.
Starting with plastering most of the math functions with AggressiveInlining | AggressiveOptimization after observing that the JITted assembly is abysmal for hot-loop methods.
Then we can see that Triangle.ClosestPoint(Vector3), responsible for all end-stage math in most intersection queries (which use Sphere as primitive), uses a non-optimal implementation, and we replace that entirely. The new implementation apparently has some different behavior (probably in extreme or special cases), but gameplay seems to still work.
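For context, the usual reference for this operation is the Voronoi-region algorithm from Ericson's "Real-Time Collision Detection". The sketch below is plain Python over tuples, not the actual C# Vector3 code, and is only an assumption about what such a replacement looks like:

```python
# Closest point on triangle ABC to point P, via Voronoi-region tests
# (Ericson-style). Vectors are plain 3-tuples; helper names are mine.
def sub(a, b): return (a[0]-b[0], a[1]-b[1], a[2]-b[2])
def dot(a, b): return a[0]*b[0] + a[1]*b[1] + a[2]*b[2]

def closest_point(a, b, c, p):
    ab, ac, ap = sub(b, a), sub(c, a), sub(p, a)
    d1, d2 = dot(ab, ap), dot(ac, ap)
    if d1 <= 0 and d2 <= 0:
        return a  # vertex region A
    bp = sub(p, b)
    d3, d4 = dot(ab, bp), dot(ac, bp)
    if d3 >= 0 and d4 <= d3:
        return b  # vertex region B
    vc = d1*d4 - d3*d2
    if vc <= 0 and d1 >= 0 and d3 <= 0:
        v = d1 / (d1 - d3)  # edge region AB
        return (a[0]+v*ab[0], a[1]+v*ab[1], a[2]+v*ab[2])
    cp = sub(p, c)
    d5, d6 = dot(ab, cp), dot(ac, cp)
    if d6 >= 0 and d5 <= d6:
        return c  # vertex region C
    vb = d5*d2 - d1*d6
    if vb <= 0 and d2 >= 0 and d6 <= 0:
        w = d2 / (d2 - d6)  # edge region AC
        return (a[0]+w*ac[0], a[1]+w*ac[1], a[2]+w*ac[2])
    va = d3*d6 - d5*d4
    if va <= 0 and (d4 - d3) >= 0 and (d5 - d6) >= 0:
        w = (d4 - d3) / ((d4 - d3) + (d5 - d6))  # edge region BC
        bc = sub(c, b)
        return (b[0]+w*bc[0], b[1]+w*bc[1], b[2]+w*bc[2])
    denom = 1.0 / (va + vb + vc)  # inside face region
    v, w = vb*denom, vc*denom
    return (a[0] + ab[0]*v + ac[0]*w,
            a[1] + ab[1]*v + ac[1]*w,
            a[2] + ab[2]*v + ac[2]*w)
```

The region tests branch a lot, which matters later when branch mispredictions show up in the profile.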
Now we merge the two levels of kd-trees into a single structure, which brings just a bit of performance (getting us to exactly 3x faster) but should also simplify some API stuff, so maybe looking into the struct enumerator might be worthwhile again.
Also I should probably clean up a bit: both the math optimization and the KD optimization have proven themselves, and we no longer need to run them every time. Meaning: every test except the baseline ones will get KD, just without keeping the suffix.
All benchmarks should have the KD optimization now. Also I checked the differences, which seem to exist only between Baseline and Current due to the triangle-sphere intersection. These differences seem to point to erroneous behaviour of the old one, so I will let that slide.
While writing the MergedCollider I asked myself whether the memory layout of the full split array would affect performance. The answer: not really, any difference here is pretty near the threshold of error... so let's go with the simplest one.
Finally the SIMD (two-split) benchmarks are in, with the three-split one still being scribbled up.
Oh well, this is surprisingly bad :) I probably still want to try the three-split one just for good measure, but we can already see that the branch reduction is not helpful, and if it does help performance, it is a minuscule benefit.
And here are the results for the SIMD512 three-split collider. Because we have more loop iterations, I also re-added the less-branching variant for the new benchmark.
The answer: not very much. We again have a minimal performance benefit for the SIMD128 two-split collider, but anything higher performs worse and is naturally more complex. At this point I might scrap SIMD altogether for this usecase unless I have another idea. If I get crazy I might attach Intel VTune, for example, and look whether the SIMD ones have some solvable problem.
Just as a text note without further benchmark results: I tested an SOA variant of the SIMD128 with no discernible difference in performance. VTune showed a major bottleneck to be branch mispredictions, especially in the leaf triangle-sphere intersection test, which I guess is to be expected (if we reasonably knew the outcome, we would not have to ask this very question), so I can see no obvious fault in the algorithm at the microarchitecture level.
I am almost at the end of the Intersections method, with the winners being the MergedCollider and some of the simpler variants like List, Struct or TaggedUnion. I still had to benchmark the latter two in the merged collider.
These numbers again show: simpler is better, so I will leave it at that. We can still cheat for one actual usecase in the game, where a line intersection is equivalent to a raycast. But for the other usecases (especially physics) we would need to incorporate more usecase-specific operations in order to allow for optimizations (e.g. filter by a product in order to reduce to a single-nearest-neighbor search). I am currently not inclined to do that.
I still would like to roughly benchmark raycast, making sure that it does not allocate and maybe try out a couple variants. After that I will wrap up this PR by putting the experiments into a backup branch and applying the winner variants to the productive game. Probably nice to then have a comparison of GC behavior, but that will have to wait a bit yet again.
Initial benchmark for raycasts, the results are worrying. A lot of allocations (which could be easily amortized though) and troublesome performance. I should add a benchmark with a sorted line intersection and also definitely figure out why the merged collider is so much worse than the other two. This is surprising.
BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4894/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
[Host] : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
MediumRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
Job=MediumRun IterationCount=15 LaunchCount=2
MaxIterationCount=1000 MinIterationCount=10 WarmupCount=10
| Method              | Mean     | Error    | StdDev   | Ratio | Gen0    | Allocated | Alloc Ratio |
|---------------------|---------:|---------:|---------:|------:|--------:|----------:|------------:|
| Baseline            | 32.58 ms | 0.111 ms | 0.159 ms |  1.00 | 62.5000 | 795.37 KB |        1.00 |
| SimpleOptimizations | 27.32 ms | 0.053 ms | 0.079 ms |  0.84 | 62.5000 | 795.35 KB |        1.00 |
| Merged              | 47.89 ms | 0.102 ms | 0.153 ms |  1.47 |       - | 148.5 KB  |        0.19 |
Let's get unsurprised here with an easy one first:
This can be attributed to intersection queries having to always return all intersections, while raycasts can exit out as soon as there cannot be a closer hit.
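To make that asymmetry concrete, here is a toy Python sketch (the data shape is invented, nothing from the actual collider): a raycast tracks the best hit so far and can stop once no closer candidate can follow, while an intersection query must walk everything.

```python
def first_hit(candidates):
    """candidates: (entry_distance, hit_distance_or_None) pairs,
    sorted by entry distance, as a kd traversal would produce them."""
    best = float("inf")
    for entry, hit in candidates:
        if entry > best:   # nothing closer can follow: early out
            break
        if hit is not None and hit < best:
            best = hit
    return best

def all_hits(candidates):
    # an "all intersections" query has no early out at all
    return [hit for _, hit in candidates if hit is not None]
```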
A one-off benchmark before I have to use a profiler again: at some point I added an SSE 4.1 version of Triangle.Barycentric but never benchmarked it (FOR SHAME!), so here is a benchmark with scalar, explicit SSE 4.1 and SIMD128 versions:
We also have multi-modal distributions, i.e. vastly different results across MediumRun benchmarks, so in summary I would say: no use for either vectorized implementation, plain scalar should be fine.
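For reference, the scalar variant boils down to the standard dot-product formulation of barycentric coordinates; a Python sketch over tuples (the parameter names are mine, not the C# signature):

```python
# Barycentric coordinates (u, v, w) of point p in triangle abc,
# so that p = u*a + v*b + w*c for points in the triangle's plane.
def barycentric(a, b, c, p):
    v0 = (b[0]-a[0], b[1]-a[1], b[2]-a[2])
    v1 = (c[0]-a[0], c[1]-a[1], c[2]-a[2])
    v2 = (p[0]-a[0], p[1]-a[1], p[2]-a[2])
    d00 = v0[0]*v0[0] + v0[1]*v0[1] + v0[2]*v0[2]
    d01 = v0[0]*v1[0] + v0[1]*v1[1] + v0[2]*v1[2]
    d11 = v1[0]*v1[0] + v1[1]*v1[1] + v1[2]*v1[2]
    d20 = v2[0]*v0[0] + v2[1]*v0[1] + v2[2]*v0[2]
    d21 = v2[0]*v1[0] + v2[1]*v1[1] + v2[2]*v1[2]
    denom = d00*d11 - d01*d01
    v = (d11*d20 - d01*d21) / denom
    w = (d00*d21 - d01*d20) / denom
    return (1.0 - v - w, v, w)
```

Five dot products and one division: there is simply not much here for SSE to win back against the shuffles it costs.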
EDIT: Another benchmark not worthy of uploading is trying to just disable the degeneracy test. We can do that during merging and save the test for the raycasts, but that is a tiny improvement over the current state. Profiler comparison it is.
Also not uploading: adding MIOptions makes casting almost twice as slow. Just adding AggressiveOptimization (without inlining) is better
But merged is still slower than baseline. The profiler unfortunately did not tell me much, so a bit of guesswork it is: I am working on an iterative version.
Oh well. The first iterative raycast and... at least it is at parity in performance with baseline and in allocations with merged?
The allocations are due to the coarse check, in particular casting against a box allocates at the moment. I would still like to see better numbers for the casting itself.
Now we are getting somewhere. A new iterative variant using additional subtree-elimination reaches allllllmost the WorldCollider+Recursive Cast. That it is not faster is still beyond me.
I omitted the break-even and just continued. In instrumentation profiles I saw Stack<T> operations having an unusually high percentage of the runtime, so to test this I replaced it with a StackOverSpan<T> variant that uses ArrayPool<...>.Shared as backing memory.
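The idea, sketched in Python (which has no ArrayPool, so the class below is only an illustration of a stack over preallocated, reusable memory, not the actual StackOverSpan<T>):

```python
class BufferStack:
    """Stack over a fixed, preallocated buffer: repeated traversals reuse
    the same memory instead of growing/allocating per query."""
    def __init__(self, capacity):
        self._buf = [None] * capacity  # stands in for "rented" pool memory
        self._count = 0

    def push(self, item):
        # raises IndexError past capacity, mirroring a fixed Span<T> backing
        self._buf[self._count] = item
        self._count += 1

    def pop(self):
        self._count -= 1
        return self._buf[self._count]

    def __len__(self):
        return self._count
```

The win is not the stack operations themselves but skipping Stack<T>'s growth checks and per-query allocation of the backing array.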
I highly suspect we can push this further. In the same profiles, Nullable operations also ranked unusually high, and we can expect some additional performance from replacing the pretty ad-hoc ray-triangle intersection with a more standardized one (like Möller and Trumbore).
I am probably going to abandon the recursive merged as well as the naive iterative versions, so I cleaned up the list of benchmarks a bit. Also, instead of appending ever more acronyms to RW, I am going to just compare the previous benchmark results with the current changes (and baseline/simple opt).
The current changes prepare for the alternative ray-triangle intersection and also remove the usage of nullables. By using a NaN invariant we can also omit the comparison for misses entirely. At some point I might even want to move the intersection into the TreeCollider to have access to precomputed data without uglifying the Ray interface.
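The NaN invariant in a nutshell (Python sketch; update_best is a made-up name): if the intersection test returns NaN for a miss, the best-hit update needs no separate "did we hit?" branch, because every comparison with NaN is false.

```python
def update_best(best_t, candidate_t):
    # candidate_t may be NaN (a miss); NaN < best_t is always False,
    # so misses fall through without an explicit miss check.
    return candidate_t if candidate_t < best_t else best_t
```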
Möller-Trumbore came through, we are well within 2x, even though I had to add a naive check to cull backfacing triangles from the test. The old intersection method did that without my explicit knowledge. Oh well...
As I suspected, there is an intermediate in Möller-Trumbore that we can use to cull back-faces (it's just the sign of the determinant). Also I finally removed degenerated triangles from the merged tree, removed the dummy splits of naive section collisions and reordered the triangles to remove the map indirection.
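For reference, a Python sketch of Möller-Trumbore with that sign-of-determinant cull (function names and the epsilon choice are my own, not the zzre code), combined with the NaN-for-miss convention:

```python
def cross(a, b):
    return (a[1]*b[2]-a[2]*b[1], a[2]*b[0]-a[0]*b[2], a[0]*b[1]-a[1]*b[0])
def sub(a, b): return (a[0]-b[0], a[1]-b[1], a[2]-b[2])
def dot(a, b): return a[0]*b[0] + a[1]*b[1] + a[2]*b[2]

def raycast(orig, dirn, a, b, c, eps=1e-9):
    """Ray-triangle distance t, or NaN on miss/backface (Moeller-Trumbore)."""
    e1, e2 = sub(b, a), sub(c, a)
    p = cross(dirn, e2)
    det = dot(e1, p)
    if det < eps:          # backface or parallel: culled by the sign of det
        return float("nan")
    inv = 1.0 / det
    t_vec = sub(orig, a)
    u = dot(t_vec, p) * inv
    if u < 0.0 or u > 1.0:
        return float("nan")
    q = cross(t_vec, e1)
    v = dot(dirn, q) * inv
    if v < 0.0 or u + v > 1.0:
        return float("nan")
    t = dot(e2, q) * inv
    return t if t >= 0.0 else float("nan")
```

Note the cull happens before the division, so backfacing triangles cost only a few multiply-adds.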
If I want to continue, I guess more precomputation and an even faster intersection algorithm might be the way to go. But that is not a path I want to go down; I already use more memory for the merged trees.
As discovered in #368 (and #313) the colliders are heavily allocating components due to many uses of LINQ and generator methods. Unfortunately it does not seem like we will get value-type generator methods in C# anytime soon, so we have to write manual enumerator structs to reduce memory allocations.
For sorting intersections we might also want to look into cached lists as well as cached stacks inside the enumerators, or have intersections (instead of raycasts) always write into a sorted list.
For testing we can use the TestRaycaster, but it should be possible to have both implementations side-by-side and (behind a compiler flag) run them both, expecting the exact same results.

Review todos:
- PrefixSums instead of SumSums
- StackOverSpan
- ListOverSpan, maybe PooledList (TemporaryList? FixedTemporaryList? Inline-array?)
- IEnumerable Intersections harder to call
- Triangle.ClosestPoint
- Final benchmark results
- GC Profile comparison
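On the PrefixSums todo: presumably (my reading, not stated above) this means storing running offsets instead of raw counts, so that node i's items live at [prefix[i], prefix[i+1]) in one flat array. A minimal sketch:

```python
def prefix_sums(counts):
    """Exclusive prefix sums with a trailing total: per-node counts in,
    flat-array offsets out (itertools.accumulate can do the same)."""
    out = [0]
    for c in counts:
        out.append(out[-1] + c)
    return out
```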
The GC profiler shows that TreeCollider was the main cause of per-frame allocations, but also that we have quite a way to go for zero allocations per frame.