Closed: Helco closed this 3 weeks ago
Of course not depicted in the performance benchmarks is the code quality: IntersectionsStruct has a horrible API that bleeds into all consumers.
Also the allocation reduction is obviously not complete; the split stacks are still allocated per query and should either be fixed-size (sized for a ridiculously large tree) or pooled for amortization. For the next benchmark I will try to preserve the actual status quo as baseline, while this amortization will also be applied to a new generator-based method.
Still a bit curious why IntersectionsList is both slower (with supposedly less branching) and allocates per intersection. The struct enumerator has an allocation, but that might be caused by the benchmark and not by the intersection query itself.
(Also the baseline is not correct, as I forgot to revert the amortization on the atomic layer.)
With the baseline corrected and the power of just removing coarse intersection tests entirely (let's just not care about out-of-bounds, right?), all three variants that we would expect to be allocation-free now show no allocations (minus amortization).
Now we fix the baseline as a separate assembly, because I want to tackle some more shared code within zzre.core.
Starting with plastering most of the math functions with AggressiveInlining | AggressiveOptimization after observing that the JITted assembly is abysmal for hot-loop methods.
Then we can see that Triangle.ClosestPoint(Vector3), responsible for all end-stage math in most intersection queries (which use Sphere as primitive), uses a non-optimal implementation, and we replace that entirely. The new implementation apparently has some different behavior (probably in extreme or special cases), but gameplay seems to still work.
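For context, the usual reference for this operation is the Voronoi-region algorithm from Ericson's "Real-Time Collision Detection". The sketch below is plain Python over tuples, not the actual C# Vector3 code, and is only an assumption about what such a replacement looks like:

```python
# Closest point on triangle ABC to point P, via Voronoi-region tests
# (Ericson-style). Vectors are plain 3-tuples; helper names are mine.
def sub(a, b): return (a[0]-b[0], a[1]-b[1], a[2]-b[2])
def dot(a, b): return a[0]*b[0] + a[1]*b[1] + a[2]*b[2]

def closest_point(a, b, c, p):
    ab, ac, ap = sub(b, a), sub(c, a), sub(p, a)
    d1, d2 = dot(ab, ap), dot(ac, ap)
    if d1 <= 0 and d2 <= 0:
        return a  # vertex region A
    bp = sub(p, b)
    d3, d4 = dot(ab, bp), dot(ac, bp)
    if d3 >= 0 and d4 <= d3:
        return b  # vertex region B
    vc = d1*d4 - d3*d2
    if vc <= 0 and d1 >= 0 and d3 <= 0:
        v = d1 / (d1 - d3)  # edge region AB
        return (a[0]+v*ab[0], a[1]+v*ab[1], a[2]+v*ab[2])
    cp = sub(p, c)
    d5, d6 = dot(ab, cp), dot(ac, cp)
    if d6 >= 0 and d5 <= d6:
        return c  # vertex region C
    vb = d5*d2 - d1*d6
    if vb <= 0 and d2 >= 0 and d6 <= 0:
        w = d2 / (d2 - d6)  # edge region AC
        return (a[0]+w*ac[0], a[1]+w*ac[1], a[2]+w*ac[2])
    va = d3*d6 - d5*d4
    if va <= 0 and (d4 - d3) >= 0 and (d5 - d6) >= 0:
        w = (d4 - d3) / ((d4 - d3) + (d5 - d6))  # edge region BC
        bc = sub(c, b)
        return (b[0]+w*bc[0], b[1]+w*bc[1], b[2]+w*bc[2])
    denom = 1.0 / (va + vb + vc)  # inside face region
    v, w = vb*denom, vc*denom
    return (a[0] + ab[0]*v + ac[0]*w,
            a[1] + ab[1]*v + ac[1]*w,
            a[2] + ab[2]*v + ac[2]*w)
```

The region tests branch a lot, which matters later when branch mispredictions show up in the profile.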
Now we merge the two levels of kd-trees into a single structure, which brings just a bit of performance (getting us to exactly 3x faster) but should also simplify some API stuff, so maybe looking into the struct enumerator might be worthwhile again.
Also I should probably clean up a bit: both the math optimization and the KD optimization have proven themselves, and we no longer need to run them every time. Meaning: every test except the baseline ones will get KD, just without keeping the suffix.
All benchmarks should have the KD optimization now. Also I checked the differences, which seem to exist only between Baseline and Current due to the triangle-sphere intersection. These differences seem to point to erroneous behaviour of the old one, so I will let that slide.
While writing the MergedCollider I asked myself whether the memory layout of the full split array would affect performance. The answer: not really, any difference here is pretty near the threshold of error... so let's go with the simplest one.
Finally the SIMD (two-split) benchmarks are in, with the three-split one still being scribbled up.
Oh well, this is surprisingly bad :) I probably still want to try the three-split one just for good measure, but we can already see that the branch reduction is not helpful, and if it does help performance, it is a minuscule benefit.
And here are the results for the SIMD512 three-split collider. Because we have more loop iterations, I also re-added the less-branching variant for the new benchmark.
The answer: not very much. We again have a minimal performance benefit for the SIMD128 two-split collider, but anything higher performs worse and is naturally more complex. At this point I might scrap SIMD altogether for this usecase unless I have another idea. If I get crazy I might attach Intel VTune, for example, and look whether the SIMD ones have some solvable problem.
Just as a text note without further benchmark results: I tested an SOA variant of the SIMD128 with no discernible difference in performance. VTune showed a major bottleneck to be branch mispredictions, especially in the leaf triangle-sphere intersection test, which I guess is to be expected (if we reasonably knew the outcome, we would not have to ask this very question), so I can see no obvious fault in the algorithm at the microarchitecture level.
I am almost at the end of the Intersections method, with the winners being the MergedCollider and some of the simpler variants like List, Struct or TaggedUnion. I still had to benchmark the latter two in the merged collider.
These numbers again show: simpler is better, so I will leave it at that. We can still cheat for one actual usecase in the game, where a line intersection is equivalent to a raycast. But for the other usecases (especially physics) we would need to incorporate more usecase-specific operations in order to allow for optimizations (e.g. filter by a product in order to reduce to a single-nearest-neighbor search). I am currently not inclined to do that.
I still would like to roughly benchmark raycast, making sure that it does not allocate and maybe try out a couple variants. After that I will wrap up this PR by putting the experiments into a backup branch and applying the winner variants to the productive game. Probably nice to then have a comparison of GC behavior, but that will have to wait a bit yet again.
Initial benchmark for raycasts, the results are worrying. A lot of allocations (which could be easily amortized though) and troublesome performance. I should add a benchmark with a sorted line intersection and also definitely figure out why the merged collider is so much worse than the other two. This is surprising.
BenchmarkDotNet v0.14.0, Windows 10 (10.0.19045.4894/22H2/2022Update)
Intel Core i9-10900K CPU 3.70GHz, 1 CPU, 20 logical and 10 physical cores
.NET SDK 8.0.300
[Host] : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
MediumRun : .NET 8.0.5 (8.0.524.21615), X64 RyuJIT AVX2
Job=MediumRun IterationCount=15 LaunchCount=2
MaxIterationCount=1000 MinIterationCount=10 WarmupCount=10
| Method              | Mean     | Error    | StdDev   | Ratio | Gen0    | Allocated | Alloc Ratio |
|---------------------|---------:|---------:|---------:|------:|--------:|----------:|------------:|
| Baseline            | 32.58 ms | 0.111 ms | 0.159 ms |  1.00 | 62.5000 | 795.37 KB |        1.00 |
| SimpleOptimizations | 27.32 ms | 0.053 ms | 0.079 ms |  0.84 | 62.5000 | 795.35 KB |        1.00 |
| Merged              | 47.89 ms | 0.102 ms | 0.153 ms |  1.47 |       - | 148.5 KB  |        0.19 |
Let's get unsurprised here with an easy one first:
This can be attributed to intersection queries having to always return all intersections, while raycasts can exit out as soon as there cannot be a closer hit.
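To make that asymmetry concrete, here is a toy Python sketch (the data shape is invented, nothing from the actual collider): a raycast tracks the best hit so far and can stop once no closer candidate can follow, while an intersection query must walk everything.

```python
def first_hit(candidates):
    """candidates: (entry_distance, hit_distance_or_None) pairs,
    sorted by entry distance, as a kd traversal would produce them."""
    best = float("inf")
    for entry, hit in candidates:
        if entry > best:   # nothing closer can follow: early out
            break
        if hit is not None and hit < best:
            best = hit
    return best

def all_hits(candidates):
    # an "all intersections" query has no early out at all
    return [hit for _, hit in candidates if hit is not None]
```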
A one-off benchmark before I have to use a profiler again: at some point I added an SSE 4.1 version of Triangle.Barycentric but never benchmarked it (FOR SHAME!), so here is a benchmark with scalar, explicit SSE 4.1 and SIMD128 versions:
We also have multi-modal distributions, i.e. vastly different results across MediumRun benchmarks, so in summary I would say: no use for either vectorized implementation, plain scalar should be fine.
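For reference, the scalar variant boils down to the standard dot-product formulation of barycentric coordinates; a Python sketch over tuples (the parameter names are mine, not the C# signature):

```python
# Barycentric coordinates (u, v, w) of point p in triangle abc,
# so that p = u*a + v*b + w*c for points in the triangle's plane.
def barycentric(a, b, c, p):
    v0 = (b[0]-a[0], b[1]-a[1], b[2]-a[2])
    v1 = (c[0]-a[0], c[1]-a[1], c[2]-a[2])
    v2 = (p[0]-a[0], p[1]-a[1], p[2]-a[2])
    d00 = v0[0]*v0[0] + v0[1]*v0[1] + v0[2]*v0[2]
    d01 = v0[0]*v1[0] + v0[1]*v1[1] + v0[2]*v1[2]
    d11 = v1[0]*v1[0] + v1[1]*v1[1] + v1[2]*v1[2]
    d20 = v2[0]*v0[0] + v2[1]*v0[1] + v2[2]*v0[2]
    d21 = v2[0]*v1[0] + v2[1]*v1[1] + v2[2]*v1[2]
    denom = d00*d11 - d01*d01
    v = (d11*d20 - d01*d21) / denom
    w = (d00*d21 - d01*d20) / denom
    return (1.0 - v - w, v, w)
```

Five dot products and one division: there is simply not much here for SSE to win back against the shuffles it costs.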
EDIT: Another benchmark not worthy of uploading is trying to just disable the degeneracy test. We can do that during merging and save the test for the raycasts, but that is a tiny improvement over the current state. Profiler comparison it is.
Also not uploading: adding MIOptions makes casting almost twice as slow. Just adding AggressiveOptimization (without inlining) is better
But merged is still slower than baseline. The profiler unfortunately did not tell me much, so a bit of guesswork it is: I am working on an iterative version.
Oh well. The first iterative raycast and... at least it is at parity in performance with baseline and in allocations with merged?
The allocations are due to the coarse check, in particular casting against a box allocates at the moment. I would still like to see better numbers for the casting itself.
Now we are getting somewhere. A new iterative variant using additional subtree-elimination reaches allllllmost the WorldCollider+Recursive Cast. That it is not faster is still beyond me.
I omitted the break-even and just continued. In instrumentation profiles I saw Stack<T> operations having an unusually high percentage of the runtime, so to test this I replaced it with a StackOverSpan<T> variant that uses ArrayPool<...>.Shared as backing memory.
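The idea, sketched in Python (which has no ArrayPool, so the class below is only an illustration of a stack over preallocated, reusable memory, not the actual StackOverSpan<T>):

```python
class BufferStack:
    """Stack over a fixed, preallocated buffer: repeated traversals reuse
    the same memory instead of growing/allocating per query."""
    def __init__(self, capacity):
        self._buf = [None] * capacity  # stands in for "rented" pool memory
        self._count = 0

    def push(self, item):
        # raises IndexError past capacity, mirroring a fixed Span<T> backing
        self._buf[self._count] = item
        self._count += 1

    def pop(self):
        self._count -= 1
        return self._buf[self._count]

    def __len__(self):
        return self._count
```

The win is not the stack operations themselves but skipping Stack<T>'s growth checks and per-query allocation of the backing array.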
I highly suspect we can push this further. In the same profiles, Nullable operations also ranked unusually high, and we can expect some additional performance from replacing the pretty ad-hoc ray-triangle intersection with a more standardized one (like Möller and Trumbore).
I am probably going to abandon the recursive merged as well as the naive iterative versions, so I cleaned up the list of benchmarks a bit. Also, instead of appending ever more acronyms to RW, I am going to just compare the previous benchmark results with the current changes (and baseline/simple opt).
The current changes prepare for the alternative ray-triangle intersection and also remove the usage of nullables. By using a NaN invariant we can also omit the comparison for misses entirely. At some point I might even want to move the intersection into the TreeCollider to have access to precomputed data without uglifying the Ray interface.
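The NaN invariant in a nutshell (Python sketch; update_best is a made-up name): if the intersection test returns NaN for a miss, the best-hit update needs no separate "did we hit?" branch, because every comparison with NaN is false.

```python
def update_best(best_t, candidate_t):
    # candidate_t may be NaN (a miss); NaN < best_t is always False,
    # so misses fall through without an explicit miss check.
    return candidate_t if candidate_t < best_t else best_t
```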
Möller-Trumbore came through, we are well within 2x, even though I had to add a naive check to cull backfacing triangles from the test. The old intersection method did that without my explicit knowledge. Oh well...
As I suspected, there is an intermediate in Möller-Trumbore that we can use to cull back-faces (it's just the sign of the determinant). Also I finally removed degenerated triangles from the merged tree, removed the dummy splits of naive section collisions and reordered the triangles to remove the map indirection.
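For reference, a Python sketch of Möller-Trumbore with that sign-of-determinant cull (function names and the epsilon choice are my own, not the zzre code), combined with the NaN-for-miss convention:

```python
def cross(a, b):
    return (a[1]*b[2]-a[2]*b[1], a[2]*b[0]-a[0]*b[2], a[0]*b[1]-a[1]*b[0])
def sub(a, b): return (a[0]-b[0], a[1]-b[1], a[2]-b[2])
def dot(a, b): return a[0]*b[0] + a[1]*b[1] + a[2]*b[2]

def raycast(orig, dirn, a, b, c, eps=1e-9):
    """Ray-triangle distance t, or NaN on miss/backface (Moeller-Trumbore)."""
    e1, e2 = sub(b, a), sub(c, a)
    p = cross(dirn, e2)
    det = dot(e1, p)
    if det < eps:          # backface or parallel: culled by the sign of det
        return float("nan")
    inv = 1.0 / det
    t_vec = sub(orig, a)
    u = dot(t_vec, p) * inv
    if u < 0.0 or u > 1.0:
        return float("nan")
    q = cross(t_vec, e1)
    v = dot(dirn, q) * inv
    if v < 0.0 or u + v > 1.0:
        return float("nan")
    t = dot(e2, q) * inv
    return t if t >= 0.0 else float("nan")
```

Note the cull happens before the division, so backfacing triangles cost only a few multiply-adds.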
If I want to continue, I guess more precomputation and an even faster intersection algorithm might be the way to go. But that is not a path I want to go down; I already use more memory for the merged trees.
As discovered in #368 (and #313) the colliders are heavily allocating components due to many uses of LINQ and generator methods. Unfortunately it does not seem like we will get value-type generator methods in C# anytime soon, so we have to write manual enumerator structs to reduce memory allocations.
For sorting intersections we might also want to look into cached lists as well as cached stacks inside the enumerators, or have intersections (instead of raycasts) always write into a sorted list.
For testing we can use the TestRaycaster, but it should be possible to have both implementations side-by-side and (behind a compiler flag) run them both, expecting the exact same results.

Review todos:
- PrefixSums instead of SumSums
- StackOverSpan
- ListOverSpan, maybe PooledList (TemporaryList? FixedTemporaryList? Inline-array?)
- IEnumerable Intersections harder to call
- Triangle.ClosestPoint
- Final benchmark results
- GC Profile comparison
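On the PrefixSums todo: presumably (my reading, not stated above) this means storing running offsets instead of raw counts, so that node i's items live at [prefix[i], prefix[i+1]) in one flat array. A minimal sketch:

```python
def prefix_sums(counts):
    """Exclusive prefix sums with a trailing total: per-node counts in,
    flat-array offsets out (itertools.accumulate can do the same)."""
    out = [0]
    for c in counts:
        out.append(out[-1] + c)
    return out
```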
The GC profiler shows that TreeCollider was the main cause of per-frame allocations, but also that we have quite a way to go for zero allocations per frame.