SeanCurtis-TRI opened 4 weeks ago
FYI my proof of concept branch is here: https://github.com/jwnimmer-tri/drake/commits/boxes-overlap/
Oh, one other thing I forgot to mention: if we really don't like how this looks using raw intrinsics, conceivably we could bring in https://github.com/google/highway to write it in a more normal C++ style that would compile down to efficient intrinsics (even better, with cpu-based dispatch to the best routine for that flavor). Hopefully we can skate by without it this time, but as we start writing more SIMD, I think it's probable we'll need to add highway eventually.
@SeanCurtis-TRI if possible, it would be nice to carve off the "make a googlebench" part of this to @rpoyner-tri. (Even if you need to massage the space of input OBB domains afterwards.) Maybe you two can talk offline about it.
I'm fully in favor of the carving.
need to massage the space of input OBB domains afterwards
Not sure what you mean exactly.
Since the function has a couple of early-exit paths, the average performance will be a function of the distribution (and predictability) of input OBBs. Part of making a good microbenchmark will be feeding in OBBs that have similar statistics to the ones seen in a real application.
I expect the easiest way to do that would be to take one of the representative scenarios, add some counters to the boxes overlap test and run things with the counters. Probably need to run across multiple manipulands.
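The counter instrumentation might look something like the sketch below. This is a simplified stand-in (a 1-D interval test with hypothetical names, not Drake's actual code), but it has the same early-exit shape: one static counter per return path, bumped wherever the function exits.

```cpp
#include <array>
#include <atomic>

// Hypothetical stand-in with the same early-exit shape as BoxesOverlap():
// a 1-D interval overlap test. Static counters record which return path
// fired, so a representative run reveals the branch mix.
static std::array<std::atomic<long>, 3> g_return_counts{};

bool IntervalsOverlap(double a_lo, double a_hi, double b_lo, double b_hi) {
  if (a_hi < b_lo) {     // Early exit: a lies entirely left of b.
    ++g_return_counts[0];
    return false;
  }
  if (b_hi < a_lo) {     // Early exit: b lies entirely left of a.
    ++g_return_counts[1];
    return false;
  }
  ++g_return_counts[2];  // Fall-through: the intervals overlap.
  return true;
}
```

After a representative run, dumping `g_return_counts` yields per-branch totals of exactly the kind we want to mimic in the benchmark's input distribution.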
Great. A nice starting scenario for that might be the anzu#13014 replay, which Rico is already going to re-profile soon. We can run it with some OBB case 1 / case 2 / case 3 static counters.
(But in any case, even a benchmark with only vanilla input statistics would still be a nice benefit. We can always dial in better statistics over time.)
My thought is to start with focused benchmarks for each of the cases, then collect statistics, then implement some statistical-mix benchmarks.
As I understand things so far, looks like I'll want to shuffle some common setup code out of boxes_overlap_test for use in benchmarks.
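For illustration only, here is the general shape such a benchmark could take. The real one will use googlebench; this hand-rolled sketch just shows the idea of timing the function over a pre-generated pool of inputs so that the branch mix is repeatable. Everything here (the `Box` type, the stub) is a hypothetical stand-in, not the actual `BoxesOverlap()`.

```cpp
#include <chrono>
#include <random>
#include <utility>
#include <vector>

// Hypothetical stand-in for BoxesOverlap(): axis-aligned interval tests.
struct Box { double lo[3], hi[3]; };

bool BoxesOverlapStub(const Box& a, const Box& b) {
  for (int i = 0; i < 3; ++i) {
    if (a.hi[i] < b.lo[i] || b.hi[i] < a.lo[i]) return false;  // Early exit.
  }
  return true;
}

// Times the stub over a fixed pool of box pairs. Pre-generating the pool
// keeps random-number generation out of the timed loop and makes the
// early-exit vs. fall-through mix repeatable across runs.
double MeasureNsPerCall() {
  std::mt19937 gen(0);
  std::uniform_real_distribution<double> dist(0.0, 10.0);
  std::vector<std::pair<Box, Box>> pool(1024);
  for (auto& [a, b] : pool) {
    for (int i = 0; i < 3; ++i) {
      a.lo[i] = dist(gen); a.hi[i] = a.lo[i] + 2.0;
      b.lo[i] = dist(gen); b.hi[i] = b.lo[i] + 2.0;
    }
  }
  volatile int overlaps = 0;  // volatile: keep the loop from being elided.
  const int kReps = 1000;
  const auto start = std::chrono::steady_clock::now();
  for (int rep = 0; rep < kReps; ++rep) {
    for (const auto& [a, b] : pool) overlaps = overlaps + BoxesOverlapStub(a, b);
  }
  const auto stop = std::chrono::steady_clock::now();
  const double ns = std::chrono::duration<double, std::nano>(stop - start).count();
  return ns / (static_cast<double>(kReps) * pool.size());
}
```

Once we know the real-world branch statistics, the pool's generator is the knob to turn: skew it so each return path fires with roughly the observed frequency.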
Here are some branch execution count statistics from the anzu profiling target mentioned above. The statistics collection hackery is shown at #21598.
A axis separation returns: [3944039, 8807791, 14267170]
B axis separation returns: [1487990, 2207006, 3686886]
Cross product axis separation returns [2137676, 1165955, 447086, 683011, 343790, 145649, 120081, 73198, 21485]
Overlap returns [76206736]
Good stuff! To help us grok it, could you baseline it to percentages? (I think that means summing to get the total calls, then dividing each count by that total?)
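Concretely, the normalization is just `count / total` over the sixteen numbers above; a quick sketch:

```cpp
#include <numeric>
#include <vector>

// The raw branch-return counts reported above, in order:
// A-axis (3), B-axis (3), cross-product axes (9), full overlap (1).
const std::vector<long long> kCounts = {
    3944039, 8807791, 14267170,                    // A axes
    1487990, 2207006, 3686886,                     // B axes
    2137676, 1165955, 447086, 683011, 343790,
    145649, 120081, 73198, 21485,                  // cross products
    76206736};                                     // overlap

// Returns the given count as a percentage of all recorded calls.
double PercentOf(long long count) {
  const long long total =
      std::accumulate(kCounts.begin(), kCounts.end(), 0LL);
  return 100.0 * static_cast<double>(count) / static_cast<double>(total);
}
```

By that arithmetic the full-overlap path accounts for roughly 65.8% of all calls.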
Here's a google sheet with the same data, prettied up a bit: https://docs.google.com/spreadsheets/d/1TMtvXn3JanMObBe3tXTyTXfe3FcGoWcrhA5CApF9T9w/edit?usp=sharing
Surprising (at least to me) that 65% of the queries are overlap (full execution of the whole function). Does that match others' intuition?
That's an interesting result -- looks like broad phase is doing a good job so most of the returns are overlaps. I wonder if that means it would be faster to look for overlap rather than looking for separation.
That's one possibility, yes. Is there a cheaper (fewer-flops) way to check for overlap instead of non-overlap?
Another (SIMD) tactic would be to use branchless -- unconditionally compute all 15 overlap queries into a mask, and return it directly (casting mask int to C++ bool). That might avoid some branch mispredict penalties.
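To illustrate the branchless shape (scalar here rather than SIMD, and using a simplified axis-aligned stand-in rather than the full 15-axis OBB test): every test is evaluated unconditionally, each compare contributes a bit to a separation mask, and the result falls out of the mask with no data-dependent branches.

```cpp
#include <cstdint>

// Illustrative only: a branchless axis-aligned version of the idea. Each
// axis test contributes one bit to a "separated" mask; there is no early
// exit, so there are no data-dependent branches to mispredict. A SIMD
// version would evaluate all 15 OBB axis tests the same way, in vector
// registers, and collapse the compare results with a movemask.
struct Box { double lo[3], hi[3]; };

bool BoxesOverlapBranchless(const Box& a, const Box& b) {
  std::uint32_t separated = 0;
  for (int i = 0; i < 3; ++i) {
    // The comparisons evaluate to 0/1; accumulate without branching.
    separated |= static_cast<std::uint32_t>(a.hi[i] < b.lo[i]) << (2 * i);
    separated |= static_cast<std::uint32_t>(b.hi[i] < a.lo[i]) << (2 * i + 1);
  }
  // Overlap iff no axis separated the boxes.
  return separated == 0;
}
```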
Memorialized the anzu branch used to measure here: anzu#13099
There are two pieces of information necessary to gain a greater understanding of the value of the broadphase culling: how often the primitive-primitive tests it admits are true overlaps, and whether it lets far too much through, leading to unhelpful and expensive primitive-primitive tests. That said, the efficacy of the BVH code is not at stake here; the question is merely whether, given what the code does, we can get it to do it faster. Whether or not the BVH code is doing a "good" thing is a question for a later date.
FYI @rpoyner-tri my hack branch was boxes-overlap.
Fun update.
I ran the overlap benchmark in #21596 three times: once against the BoxesOverlap() in master, once against the intrinsics-based implementation in boxes-overlap, and once against the highway version of the same (located here). The benchmark results are down below, but here's a summary of what jumped out at me:
BoxesOverlap() in master:
------------------------------------------------------------------------------------------------
Benchmark Time CPU Allocs Iterations
------------------------------------------------------------------------------------------------
BoxesOverlapBenchmark/ParallelContainedCase 129 ns 129 ns 0.125 4968178
BoxesOverlapBenchmark/PosedCase/0/0/-1 28.9 ns 28.9 ns 0.1875 24954658
BoxesOverlapBenchmark/PosedCase/1/0/-1 113 ns 113 ns 0.1875 6287359
BoxesOverlapBenchmark/PosedCase/0/1/-1 32.9 ns 32.9 ns 0.1875 21461271
BoxesOverlapBenchmark/PosedCase/1/1/-1 119 ns 119 ns 0.1875 6101520
BoxesOverlapBenchmark/PosedCase/0/2/-1 35.4 ns 35.4 ns 0.1875 19722903
BoxesOverlapBenchmark/PosedCase/1/2/-1 111 ns 111 ns 0.1875 6349643
BoxesOverlapBenchmark/PosedCase/0/0/0 52.4 ns 52.4 ns 0.1875 13089050
BoxesOverlapBenchmark/PosedCase/1/0/0 112 ns 112 ns 0.1875 6309225
BoxesOverlapBenchmark/PosedCase/0/1/0 68.7 ns 68.6 ns 0.1875 10190161
BoxesOverlapBenchmark/PosedCase/1/1/0 113 ns 113 ns 0.1875 6130558
BoxesOverlapBenchmark/PosedCase/0/2/0 83.3 ns 83.3 ns 0.1875 8344837
BoxesOverlapBenchmark/PosedCase/1/2/0 110 ns 110 ns 0.1875 6341236
BoxesOverlapBenchmark/PosedCase/0/0/1 66.7 ns 66.7 ns 0.1875 10517465
BoxesOverlapBenchmark/PosedCase/1/0/1 110 ns 110 ns 0.1875 6322427
BoxesOverlapBenchmark/PosedCase/0/1/1 81.4 ns 81.4 ns 0.1875 8648747
BoxesOverlapBenchmark/PosedCase/1/1/1 111 ns 111 ns 0.1875 6329373
BoxesOverlapBenchmark/PosedCase/0/2/1 95.7 ns 95.7 ns 0.1875 7211109
BoxesOverlapBenchmark/PosedCase/1/2/1 112 ns 112 ns 0.1875 6359837
BoxesOverlapBenchmark/PosedCase/0/0/2 81.9 ns 81.9 ns 0.1875 8431608
BoxesOverlapBenchmark/PosedCase/1/0/2 112 ns 112 ns 0.1875 6409212
BoxesOverlapBenchmark/PosedCase/0/1/2 97.9 ns 97.9 ns 0.1875 7112036
BoxesOverlapBenchmark/PosedCase/1/1/2 113 ns 113 ns 0.1875 6076795
BoxesOverlapBenchmark/PosedCase/0/2/2 113 ns 113 ns 0.1875 6211708
BoxesOverlapBenchmark/PosedCase/1/2/2 113 ns 113 ns 0.1875 6204990
Intrinsics-based implementation (boxes-overlap):
------------------------------------------------------------------------------------------------
Benchmark Time CPU Allocs Iterations
------------------------------------------------------------------------------------------------
BoxesOverlapBenchmark/ParallelContainedCase 61.4 ns 61.4 ns 0.125 9166908
BoxesOverlapBenchmark/PosedCase/0/0/-1 12.5 ns 12.5 ns 0.1875 53171712
BoxesOverlapBenchmark/PosedCase/1/0/-1 40.8 ns 40.8 ns 0.1875 17289468
BoxesOverlapBenchmark/PosedCase/0/1/-1 12.3 ns 12.3 ns 0.1875 59670400
BoxesOverlapBenchmark/PosedCase/1/1/-1 41.4 ns 41.4 ns 0.1875 16710841
BoxesOverlapBenchmark/PosedCase/0/2/-1 12.1 ns 12.1 ns 0.1875 58791224
BoxesOverlapBenchmark/PosedCase/1/2/-1 41.2 ns 41.2 ns 0.1875 17114134
BoxesOverlapBenchmark/PosedCase/0/0/0 28.2 ns 28.2 ns 0.1875 24030831
BoxesOverlapBenchmark/PosedCase/1/0/0 41.3 ns 41.3 ns 0.1875 17129613
BoxesOverlapBenchmark/PosedCase/0/1/0 32.0 ns 32.0 ns 0.1875 21884716
BoxesOverlapBenchmark/PosedCase/1/1/0 41.5 ns 41.5 ns 0.1875 15562419
BoxesOverlapBenchmark/PosedCase/0/2/0 34.9 ns 34.9 ns 0.1875 20026718
BoxesOverlapBenchmark/PosedCase/1/2/0 41.0 ns 41.0 ns 0.1875 17022138
BoxesOverlapBenchmark/PosedCase/0/0/1 31.6 ns 31.6 ns 0.1875 22007645
BoxesOverlapBenchmark/PosedCase/1/0/1 41.6 ns 41.6 ns 0.1875 16646245
BoxesOverlapBenchmark/PosedCase/0/1/1 34.9 ns 34.9 ns 0.1875 19944926
BoxesOverlapBenchmark/PosedCase/1/1/1 41.0 ns 40.8 ns 0.1875 17098019
BoxesOverlapBenchmark/PosedCase/0/2/1 37.6 ns 37.4 ns 0.1875 18830446
BoxesOverlapBenchmark/PosedCase/1/2/1 41.4 ns 41.2 ns 0.1875 16807910
BoxesOverlapBenchmark/PosedCase/0/0/2 35.0 ns 34.9 ns 0.1875 20040863
BoxesOverlapBenchmark/PosedCase/1/0/2 41.0 ns 40.8 ns 0.1875 16670585
BoxesOverlapBenchmark/PosedCase/0/1/2 37.5 ns 37.4 ns 0.1875 18546764
BoxesOverlapBenchmark/PosedCase/1/1/2 40.9 ns 40.7 ns 0.1875 17225832
BoxesOverlapBenchmark/PosedCase/0/2/2 41.0 ns 40.8 ns 0.1875 17366779
BoxesOverlapBenchmark/PosedCase/1/2/2 40.4 ns 40.3 ns 0.1875 17244243
Highway version:
------------------------------------------------------------------------------------------------
Benchmark Time CPU Allocs Iterations
------------------------------------------------------------------------------------------------
BoxesOverlapBenchmark/ParallelContainedCase 28.5 ns 28.5 ns 0.125 19737188
BoxesOverlapBenchmark/PosedCase/0/0/-1 13.6 ns 13.6 ns 0.1875 52039155
BoxesOverlapBenchmark/PosedCase/1/0/-1 29.1 ns 29.1 ns 0.1875 24285073
BoxesOverlapBenchmark/PosedCase/0/1/-1 13.7 ns 13.7 ns 0.1875 49932847
BoxesOverlapBenchmark/PosedCase/1/1/-1 30.1 ns 30.1 ns 0.1875 23355036
BoxesOverlapBenchmark/PosedCase/0/2/-1 13.6 ns 13.6 ns 0.1875 50565652
BoxesOverlapBenchmark/PosedCase/1/2/-1 28.9 ns 28.9 ns 0.1875 23586123
BoxesOverlapBenchmark/PosedCase/0/0/0 22.9 ns 22.9 ns 0.1875 30625961
BoxesOverlapBenchmark/PosedCase/1/0/0 29.0 ns 29.0 ns 0.1875 24223770
BoxesOverlapBenchmark/PosedCase/0/1/0 24.4 ns 24.4 ns 0.1875 28330168
BoxesOverlapBenchmark/PosedCase/1/1/0 29.4 ns 29.4 ns 0.1875 23944716
BoxesOverlapBenchmark/PosedCase/0/2/0 26.4 ns 26.4 ns 0.1875 26287474
BoxesOverlapBenchmark/PosedCase/1/2/0 29.1 ns 29.1 ns 0.1875 23929282
BoxesOverlapBenchmark/PosedCase/0/0/1 24.5 ns 24.5 ns 0.1875 27674972
BoxesOverlapBenchmark/PosedCase/1/0/1 29.7 ns 29.7 ns 0.1875 23606747
BoxesOverlapBenchmark/PosedCase/0/1/1 26.6 ns 26.6 ns 0.1875 26635718
BoxesOverlapBenchmark/PosedCase/1/1/1 31.0 ns 31.0 ns 0.1875 22288692
BoxesOverlapBenchmark/PosedCase/0/2/1 28.3 ns 28.3 ns 0.1875 23564967
BoxesOverlapBenchmark/PosedCase/1/2/1 30.0 ns 30.0 ns 0.1875 23136369
BoxesOverlapBenchmark/PosedCase/0/0/2 26.3 ns 26.3 ns 0.1875 26526882
BoxesOverlapBenchmark/PosedCase/1/0/2 30.2 ns 30.2 ns 0.1875 22954542
BoxesOverlapBenchmark/PosedCase/0/1/2 28.0 ns 28.0 ns 0.1875 24790100
BoxesOverlapBenchmark/PosedCase/1/1/2 30.2 ns 30.2 ns 0.1875 23085781
BoxesOverlapBenchmark/PosedCase/0/2/2 30.2 ns 30.2 ns 0.1875 23164136
BoxesOverlapBenchmark/PosedCase/1/2/2 29.4 ns 29.4 ns 0.1875 23627965
Wow! Great result.
Correction: the branch execution statistics anzu branch is now anzu#13134, superseding the prior attempt.
Is your feature request related to a problem? Please describe. Because we advocate using compliant hydroelastics as the preferred mode of contact, the tet-tet pair culling is incredibly important. We use an oriented bounding box (OBB) bounding volume hierarchy (BVH) to "quickly" cull as many tet-pair candidates as possible. As such, we test pairs of OBBs for overlap a lot.
Even if we make no algorithmic changes to how we cull tet pairs, simply speeding up the overlapping test should yield dividends.
Describe the solution you'd like The OBB overlap test lends itself well to SIMD performance improvements. (Its search for separating planes naturally partitions into parallel lumps of computation suitable for SIMD). We should offer up a SIMD implementation.
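For reference, here is a scalar sketch of the separating-axis test for two OBBs (after the standard Gottschalk-style formulation; the names are illustrative, not Drake's). The 15 candidate axes fall into three lumps -- A's face axes, B's face axes, and the nine cross products -- each a batch of independent compares, which is the structure SIMD can exploit.

```cpp
#include <cmath>

// R and t pose box B in box A's frame; a and b are the boxes' half-extents.
// Returns true iff no separating axis exists, i.e., the boxes overlap.
bool ObbsOverlap(const double R[3][3], const double t[3],
                 const double a[3], const double b[3]) {
  const double kEps = 1e-12;
  double AbsR[3][3];
  for (int i = 0; i < 3; ++i)
    for (int j = 0; j < 3; ++j)
      AbsR[i][j] = std::abs(R[i][j]) + kEps;  // Guard near-parallel edges.

  // Lump 1: A's three face axes.
  for (int i = 0; i < 3; ++i) {
    const double rb =
        b[0] * AbsR[i][0] + b[1] * AbsR[i][1] + b[2] * AbsR[i][2];
    if (std::abs(t[i]) > a[i] + rb) return false;
  }
  // Lump 2: B's three face axes.
  for (int j = 0; j < 3; ++j) {
    const double ra =
        a[0] * AbsR[0][j] + a[1] * AbsR[1][j] + a[2] * AbsR[2][j];
    const double tb =
        std::abs(t[0] * R[0][j] + t[1] * R[1][j] + t[2] * R[2][j]);
    if (tb > ra + b[j]) return false;
  }
  // Lump 3: the nine cross-product axes A_i x B_j.
  for (int i = 0; i < 3; ++i) {
    const int i1 = (i + 1) % 3, i2 = (i + 2) % 3;
    for (int j = 0; j < 3; ++j) {
      const int j1 = (j + 1) % 3, j2 = (j + 2) % 3;
      const double ra = a[i1] * AbsR[i2][j] + a[i2] * AbsR[i1][j];
      const double rb = b[j1] * AbsR[i][j2] + b[j2] * AbsR[i][j1];
      const double tl = std::abs(t[i2] * R[i1][j] - t[i1] * R[i2][j]);
      if (tl > ra + rb) return false;
    }
  }
  return true;  // No separating axis found.
}
```

Each lump's compares share no data dependencies, so they can be computed side by side in vector lanes; the branch-count statistics above tell us how often each lump is reached.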
The SIMD implementation needs the same kind of fallback that the rotation matrix multiplication has (i.e., support for when run on a system where SIMD is not available).
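One minimal compile-time shape for that fallback is sketched below (runtime dispatch is the other option; the 1-D "overlap" here is just a placeholder, and all names are hypothetical):

```cpp
// Portable fallback path; always available.
bool BoxesOverlapScalar(double half_widths_sum, double center_distance) {
  return center_distance <= half_widths_sum;
}

#if defined(__AVX2__)
// The intrinsics-based path would live here, behind the same contract as
// the scalar one. (Placeholder body; the real one would use vector ops.)
bool BoxesOverlapAvx2(double half_widths_sum, double center_distance) {
  return center_distance <= half_widths_sum;
}
#endif

// Single entry point: callers never need to know which path they got.
bool BoxesOverlap(double half_widths_sum, double center_distance) {
#if defined(__AVX2__)
  return BoxesOverlapAvx2(half_widths_sum, center_distance);
#else
  return BoxesOverlapScalar(half_widths_sum, center_distance);
#endif
}
```

Runtime cpu-based dispatch (which highway provides out of the box) would let a single binary pick the best path on the machine it lands on, at the cost of an indirect call.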
In addition, we need some supporting infrastructure, e.g. benchmarks.
Describe alternatives you've considered We could consider different bounding volume types (spheres, etc.) and different traversal algorithms (parallel, etc.). However, those are much larger endeavors with uncertain return. Generally, OBB overlap is a valuable operation and making it faster is an unfettered good. Even if we end up accelerating compliant hydro contact using a different bounding volume type, this function should still be faster.