SeanCurtis-TRI opened 4 weeks ago
FYI my proof of concept branch is here: https://github.com/jwnimmer-tri/drake/commits/boxes-overlap/
Oh, one other thing I forgot to mention: if we really don't like how this looks using raw intrinsics, conceivably we could bring in https://github.com/google/highway to write it in a more normal C++ style that would compile down to efficient intrinsics (even better, with cpu-based dispatch to the best routine for that flavor). Hopefully we can skate by without it this time, but as we start writing more SIMD, I think it's probable we'll need to add highway eventually.
@SeanCurtis-TRI if possible, it would be nice to carve off the "make a googlebench" part of this to @rpoyner-tri. (Even if you need to massage the space of input OBB domains afterwards.) Maybe you two can talk offline about it.
I'm fully in favor of the carving.
need to massage the space of input OBB domains afterwards
Not sure what you mean exactly.
Since the function has a couple of early-exit paths, the average performance will be a function of the distribution (and predictability) of input OBBs. Part of making a good microbenchmark will be feeding in OBBs that have similar statistics to the ones seen in a real application.
I expect the easiest way to do that would be to take one of the representative scenarios, add some counters to the boxes overlap test and run things with the counters. Probably need to run across multiple manipulands.
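The counter instrumentation might look something like the sketch below. This is a simplified stand-in (a 1-D interval test with hypothetical names, not Drake's actual code), but it has the same early-exit shape: one static counter per return path, bumped wherever the function exits.

```cpp
#include <array>
#include <atomic>

// Hypothetical stand-in with the same early-exit shape as BoxesOverlap():
// a 1-D interval overlap test. Static counters record which return path
// fired, so a representative run reveals the branch mix.
static std::array<std::atomic<long>, 3> g_return_counts{};

bool IntervalsOverlap(double a_lo, double a_hi, double b_lo, double b_hi) {
  if (a_hi < b_lo) {     // Early exit: a lies entirely left of b.
    ++g_return_counts[0];
    return false;
  }
  if (b_hi < a_lo) {     // Early exit: b lies entirely left of a.
    ++g_return_counts[1];
    return false;
  }
  ++g_return_counts[2];  // Fall-through: the intervals overlap.
  return true;
}
```

After a representative run, dumping `g_return_counts` yields per-branch totals of exactly the kind we want to mimic in the benchmark's input distribution.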
Great. A nice starting scenario for that might be the anzu#13014 replay, which Rico is already going to re-profile soon. We can run it with some OBB case 1 / case 2 / case 3 static counters.
(But in any case, even a benchmark with only vanilla input statistics would still be a nice benefit. We can always dial in better statistics over time.)
My thought is to start with focused benchmarks for each of the cases, then collect statistics, then implement some statistical-mix benchmarks.
As I understand things so far, looks like I'll want to shuffle some common setup code out of boxes_overlap_test for use in benchmarks.
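For illustration only, here is the general shape such a benchmark could take. The real one will use googlebench; this hand-rolled sketch just shows the idea of timing the function over a pre-generated pool of inputs so that the branch mix is repeatable. Everything here (the `Box` type, the stub) is a hypothetical stand-in, not the actual `BoxesOverlap()`.

```cpp
#include <chrono>
#include <random>
#include <utility>
#include <vector>

// Hypothetical stand-in for BoxesOverlap(): axis-aligned interval tests.
struct Box { double lo[3], hi[3]; };

bool BoxesOverlapStub(const Box& a, const Box& b) {
  for (int i = 0; i < 3; ++i) {
    if (a.hi[i] < b.lo[i] || b.hi[i] < a.lo[i]) return false;  // Early exit.
  }
  return true;
}

// Times the stub over a fixed pool of box pairs. Pre-generating the pool
// keeps random-number generation out of the timed loop and makes the
// early-exit vs. fall-through mix repeatable across runs.
double MeasureNsPerCall() {
  std::mt19937 gen(0);
  std::uniform_real_distribution<double> dist(0.0, 10.0);
  std::vector<std::pair<Box, Box>> pool(1024);
  for (auto& [a, b] : pool) {
    for (int i = 0; i < 3; ++i) {
      a.lo[i] = dist(gen); a.hi[i] = a.lo[i] + 2.0;
      b.lo[i] = dist(gen); b.hi[i] = b.lo[i] + 2.0;
    }
  }
  volatile int overlaps = 0;  // volatile: keep the loop from being elided.
  const int kReps = 1000;
  const auto start = std::chrono::steady_clock::now();
  for (int rep = 0; rep < kReps; ++rep) {
    for (const auto& [a, b] : pool) overlaps = overlaps + BoxesOverlapStub(a, b);
  }
  const auto stop = std::chrono::steady_clock::now();
  const double ns = std::chrono::duration<double, std::nano>(stop - start).count();
  return ns / (static_cast<double>(kReps) * pool.size());
}
```

Once we know the real-world branch statistics, the pool's generator is the knob to turn: skew it so each return path fires with roughly the observed frequency.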
Here are some branch execution count statistics from the anzu profiling target mentioned above. The statistics collection hackery is shown at #21598.
A axis separation returns: [3944039, 8807791, 14267170]
B axis separation returns: [1487990, 2207006, 3686886]
Cross product axis separation returns [2137676, 1165955, 447086, 683011, 343790, 145649, 120081, 73198, 21485]
Overlap returns [76206736]
Good stuff! To help us grok it, could you baseline it to percentages? (I think that means summing to get the total calls, then dividing each count by that total?)
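Concretely, the normalization is just `count / total` over the sixteen numbers above; a quick sketch:

```cpp
#include <numeric>
#include <vector>

// The raw branch-return counts reported above, in order:
// A-axis (3), B-axis (3), cross-product axes (9), full overlap (1).
const std::vector<long long> kCounts = {
    3944039, 8807791, 14267170,                    // A axes
    1487990, 2207006, 3686886,                     // B axes
    2137676, 1165955, 447086, 683011, 343790,
    145649, 120081, 73198, 21485,                  // cross products
    76206736};                                     // overlap

// Returns the given count as a percentage of all recorded calls.
double PercentOf(long long count) {
  const long long total =
      std::accumulate(kCounts.begin(), kCounts.end(), 0LL);
  return 100.0 * static_cast<double>(count) / static_cast<double>(total);
}
```

By that arithmetic the full-overlap path accounts for roughly 65.8% of all calls.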
Here's a google sheet with the same data, prettied up a bit: https://docs.google.com/spreadsheets/d/1TMtvXn3JanMObBe3tXTyTXfe3FcGoWcrhA5CApF9T9w/edit?usp=sharing
Surprising (at least to me) that 65% of the queries are overlap (full execution of the whole function). Does that match others' intuition?
That's an interesting result -- looks like broad phase is doing a good job so most of the returns are overlaps. I wonder if that means it would be faster to look for overlap rather than looking for separation.
That's one possibility, yes. Is there a cheaper (fewer-flops) way to check for overlap instead of non-overlap?
Another (SIMD) tactic would be to use branchless -- unconditionally compute all 15 overlap queries into a mask, and return it directly (casting mask int to C++ bool). That might avoid some branch mispredict penalties.
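To illustrate the branchless shape (scalar here rather than SIMD, and using a simplified axis-aligned stand-in rather than the full 15-axis OBB test): every test is evaluated unconditionally, each compare contributes a bit to a separation mask, and the result falls out of the mask with no data-dependent branches.

```cpp
#include <cstdint>

// Illustrative only: a branchless axis-aligned version of the idea. Each
// axis test contributes one bit to a "separated" mask; there is no early
// exit, so there are no data-dependent branches to mispredict. A SIMD
// version would evaluate all 15 OBB axis tests the same way, in vector
// registers, and collapse the compare results with a movemask.
struct Box { double lo[3], hi[3]; };

bool BoxesOverlapBranchless(const Box& a, const Box& b) {
  std::uint32_t separated = 0;
  for (int i = 0; i < 3; ++i) {
    // The comparisons evaluate to 0/1; accumulate without branching.
    separated |= static_cast<std::uint32_t>(a.hi[i] < b.lo[i]) << (2 * i);
    separated |= static_cast<std::uint32_t>(b.hi[i] < a.lo[i]) << (2 * i + 1);
  }
  // Overlap iff no axis separated the boxes.
  return separated == 0;
}
```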
Memorialized the anzu branch used to measure here: anzu#13099
There are two pieces of information necessary to gain a greater understanding of the value of the broadphase culling: how often the primitive-primitive tests it admits are true overlaps, and whether it lets far too much through, leading to unhelpful and expensive primitive-primitive tests. That said, the efficacy of the BVH code is not at stake here; the question is merely whether, given what the code does, we can get it to do it faster. Whether or not the BVH code is doing a "good" thing is a question for a later date.
FYI @rpoyner-tri my hack branch was boxes-overlap.
Fun update.
I ran the overlap benchmark in #21596 three times: once against the BoxesOverlap() in master, once against the intrinsics-based implementation in boxes-overlap, and once against the highway version of the same (located here). The benchmark results are down below, but here's a summary of what jumped out at me:
BoxesOverlap() in master:
------------------------------------------------------------------------------------------------
Benchmark Time CPU Allocs Iterations
------------------------------------------------------------------------------------------------
BoxesOverlapBenchmark/ParallelContainedCase 129 ns 129 ns 0.125 4968178
BoxesOverlapBenchmark/PosedCase/0/0/-1 28.9 ns 28.9 ns 0.1875 24954658
BoxesOverlapBenchmark/PosedCase/1/0/-1 113 ns 113 ns 0.1875 6287359
BoxesOverlapBenchmark/PosedCase/0/1/-1 32.9 ns 32.9 ns 0.1875 21461271
BoxesOverlapBenchmark/PosedCase/1/1/-1 119 ns 119 ns 0.1875 6101520
BoxesOverlapBenchmark/PosedCase/0/2/-1 35.4 ns 35.4 ns 0.1875 19722903
BoxesOverlapBenchmark/PosedCase/1/2/-1 111 ns 111 ns 0.1875 6349643
BoxesOverlapBenchmark/PosedCase/0/0/0 52.4 ns 52.4 ns 0.1875 13089050
BoxesOverlapBenchmark/PosedCase/1/0/0 112 ns 112 ns 0.1875 6309225
BoxesOverlapBenchmark/PosedCase/0/1/0 68.7 ns 68.6 ns 0.1875 10190161
BoxesOverlapBenchmark/PosedCase/1/1/0 113 ns 113 ns 0.1875 6130558
BoxesOverlapBenchmark/PosedCase/0/2/0 83.3 ns 83.3 ns 0.1875 8344837
BoxesOverlapBenchmark/PosedCase/1/2/0 110 ns 110 ns 0.1875 6341236
BoxesOverlapBenchmark/PosedCase/0/0/1 66.7 ns 66.7 ns 0.1875 10517465
BoxesOverlapBenchmark/PosedCase/1/0/1 110 ns 110 ns 0.1875 6322427
BoxesOverlapBenchmark/PosedCase/0/1/1 81.4 ns 81.4 ns 0.1875 8648747
BoxesOverlapBenchmark/PosedCase/1/1/1 111 ns 111 ns 0.1875 6329373
BoxesOverlapBenchmark/PosedCase/0/2/1 95.7 ns 95.7 ns 0.1875 7211109
BoxesOverlapBenchmark/PosedCase/1/2/1 112 ns 112 ns 0.1875 6359837
BoxesOverlapBenchmark/PosedCase/0/0/2 81.9 ns 81.9 ns 0.1875 8431608
BoxesOverlapBenchmark/PosedCase/1/0/2 112 ns 112 ns 0.1875 6409212
BoxesOverlapBenchmark/PosedCase/0/1/2 97.9 ns 97.9 ns 0.1875 7112036
BoxesOverlapBenchmark/PosedCase/1/1/2 113 ns 113 ns 0.1875 6076795
BoxesOverlapBenchmark/PosedCase/0/2/2 113 ns 113 ns 0.1875 6211708
BoxesOverlapBenchmark/PosedCase/1/2/2 113 ns 113 ns 0.1875 6204990
Intrinsics-based implementation (boxes-overlap):
------------------------------------------------------------------------------------------------
Benchmark Time CPU Allocs Iterations
------------------------------------------------------------------------------------------------
BoxesOverlapBenchmark/ParallelContainedCase 61.4 ns 61.4 ns 0.125 9166908
BoxesOverlapBenchmark/PosedCase/0/0/-1 12.5 ns 12.5 ns 0.1875 53171712
BoxesOverlapBenchmark/PosedCase/1/0/-1 40.8 ns 40.8 ns 0.1875 17289468
BoxesOverlapBenchmark/PosedCase/0/1/-1 12.3 ns 12.3 ns 0.1875 59670400
BoxesOverlapBenchmark/PosedCase/1/1/-1 41.4 ns 41.4 ns 0.1875 16710841
BoxesOverlapBenchmark/PosedCase/0/2/-1 12.1 ns 12.1 ns 0.1875 58791224
BoxesOverlapBenchmark/PosedCase/1/2/-1 41.2 ns 41.2 ns 0.1875 17114134
BoxesOverlapBenchmark/PosedCase/0/0/0 28.2 ns 28.2 ns 0.1875 24030831
BoxesOverlapBenchmark/PosedCase/1/0/0 41.3 ns 41.3 ns 0.1875 17129613
BoxesOverlapBenchmark/PosedCase/0/1/0 32.0 ns 32.0 ns 0.1875 21884716
BoxesOverlapBenchmark/PosedCase/1/1/0 41.5 ns 41.5 ns 0.1875 15562419
BoxesOverlapBenchmark/PosedCase/0/2/0 34.9 ns 34.9 ns 0.1875 20026718
BoxesOverlapBenchmark/PosedCase/1/2/0 41.0 ns 41.0 ns 0.1875 17022138
BoxesOverlapBenchmark/PosedCase/0/0/1 31.6 ns 31.6 ns 0.1875 22007645
BoxesOverlapBenchmark/PosedCase/1/0/1 41.6 ns 41.6 ns 0.1875 16646245
BoxesOverlapBenchmark/PosedCase/0/1/1 34.9 ns 34.9 ns 0.1875 19944926
BoxesOverlapBenchmark/PosedCase/1/1/1 41.0 ns 40.8 ns 0.1875 17098019
BoxesOverlapBenchmark/PosedCase/0/2/1 37.6 ns 37.4 ns 0.1875 18830446
BoxesOverlapBenchmark/PosedCase/1/2/1 41.4 ns 41.2 ns 0.1875 16807910
BoxesOverlapBenchmark/PosedCase/0/0/2 35.0 ns 34.9 ns 0.1875 20040863
BoxesOverlapBenchmark/PosedCase/1/0/2 41.0 ns 40.8 ns 0.1875 16670585
BoxesOverlapBenchmark/PosedCase/0/1/2 37.5 ns 37.4 ns 0.1875 18546764
BoxesOverlapBenchmark/PosedCase/1/1/2 40.9 ns 40.7 ns 0.1875 17225832
BoxesOverlapBenchmark/PosedCase/0/2/2 41.0 ns 40.8 ns 0.1875 17366779
BoxesOverlapBenchmark/PosedCase/1/2/2 40.4 ns 40.3 ns 0.1875 17244243
Highway version:
------------------------------------------------------------------------------------------------
Benchmark Time CPU Allocs Iterations
------------------------------------------------------------------------------------------------
BoxesOverlapBenchmark/ParallelContainedCase 28.5 ns 28.5 ns 0.125 19737188
BoxesOverlapBenchmark/PosedCase/0/0/-1 13.6 ns 13.6 ns 0.1875 52039155
BoxesOverlapBenchmark/PosedCase/1/0/-1 29.1 ns 29.1 ns 0.1875 24285073
BoxesOverlapBenchmark/PosedCase/0/1/-1 13.7 ns 13.7 ns 0.1875 49932847
BoxesOverlapBenchmark/PosedCase/1/1/-1 30.1 ns 30.1 ns 0.1875 23355036
BoxesOverlapBenchmark/PosedCase/0/2/-1 13.6 ns 13.6 ns 0.1875 50565652
BoxesOverlapBenchmark/PosedCase/1/2/-1 28.9 ns 28.9 ns 0.1875 23586123
BoxesOverlapBenchmark/PosedCase/0/0/0 22.9 ns 22.9 ns 0.1875 30625961
BoxesOverlapBenchmark/PosedCase/1/0/0 29.0 ns 29.0 ns 0.1875 24223770
BoxesOverlapBenchmark/PosedCase/0/1/0 24.4 ns 24.4 ns 0.1875 28330168
BoxesOverlapBenchmark/PosedCase/1/1/0 29.4 ns 29.4 ns 0.1875 23944716
BoxesOverlapBenchmark/PosedCase/0/2/0 26.4 ns 26.4 ns 0.1875 26287474
BoxesOverlapBenchmark/PosedCase/1/2/0 29.1 ns 29.1 ns 0.1875 23929282
BoxesOverlapBenchmark/PosedCase/0/0/1 24.5 ns 24.5 ns 0.1875 27674972
BoxesOverlapBenchmark/PosedCase/1/0/1 29.7 ns 29.7 ns 0.1875 23606747
BoxesOverlapBenchmark/PosedCase/0/1/1 26.6 ns 26.6 ns 0.1875 26635718
BoxesOverlapBenchmark/PosedCase/1/1/1 31.0 ns 31.0 ns 0.1875 22288692
BoxesOverlapBenchmark/PosedCase/0/2/1 28.3 ns 28.3 ns 0.1875 23564967
BoxesOverlapBenchmark/PosedCase/1/2/1 30.0 ns 30.0 ns 0.1875 23136369
BoxesOverlapBenchmark/PosedCase/0/0/2 26.3 ns 26.3 ns 0.1875 26526882
BoxesOverlapBenchmark/PosedCase/1/0/2 30.2 ns 30.2 ns 0.1875 22954542
BoxesOverlapBenchmark/PosedCase/0/1/2 28.0 ns 28.0 ns 0.1875 24790100
BoxesOverlapBenchmark/PosedCase/1/1/2 30.2 ns 30.2 ns 0.1875 23085781
BoxesOverlapBenchmark/PosedCase/0/2/2 30.2 ns 30.2 ns 0.1875 23164136
BoxesOverlapBenchmark/PosedCase/1/2/2 29.4 ns 29.4 ns 0.1875 23627965
Wow! Great result.
Correction: the branch execution statistics anzu branch is now anzu#13134, superseding the prior attempt.
Is your feature request related to a problem? Please describe. Because we advocate using compliant hydroelastics as the preferred mode of contact, the tet-tet pair culling is incredibly important. We use an oriented bounding box (OBB) bounding volume hierarchy (BVH) to "quickly" cull as many tet-pair candidates as possible. As such, we test pairs of OBBs for overlap a lot.
Even if we make no algorithmic changes to how we cull tet pairs, simply speeding up the overlapping test should yield dividends.
Describe the solution you'd like The OBB overlap test lends itself well to SIMD performance improvements. (Its search for separating planes naturally partitions into parallel lumps of computation suitable for SIMD). We should offer up a SIMD implementation.
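For reference, here is a scalar sketch of the separating-axis test for two OBBs (after the standard Gottschalk-style formulation; the names are illustrative, not Drake's). The 15 candidate axes fall into three lumps -- A's face axes, B's face axes, and the nine cross products -- each a batch of independent compares, which is the structure SIMD can exploit.

```cpp
#include <cmath>

// R and t pose box B in box A's frame; a and b are the boxes' half-extents.
// Returns true iff no separating axis exists, i.e., the boxes overlap.
bool ObbsOverlap(const double R[3][3], const double t[3],
                 const double a[3], const double b[3]) {
  const double kEps = 1e-12;
  double AbsR[3][3];
  for (int i = 0; i < 3; ++i)
    for (int j = 0; j < 3; ++j)
      AbsR[i][j] = std::abs(R[i][j]) + kEps;  // Guard near-parallel edges.

  // Lump 1: A's three face axes.
  for (int i = 0; i < 3; ++i) {
    const double rb =
        b[0] * AbsR[i][0] + b[1] * AbsR[i][1] + b[2] * AbsR[i][2];
    if (std::abs(t[i]) > a[i] + rb) return false;
  }
  // Lump 2: B's three face axes.
  for (int j = 0; j < 3; ++j) {
    const double ra =
        a[0] * AbsR[0][j] + a[1] * AbsR[1][j] + a[2] * AbsR[2][j];
    const double tb =
        std::abs(t[0] * R[0][j] + t[1] * R[1][j] + t[2] * R[2][j]);
    if (tb > ra + b[j]) return false;
  }
  // Lump 3: the nine cross-product axes A_i x B_j.
  for (int i = 0; i < 3; ++i) {
    const int i1 = (i + 1) % 3, i2 = (i + 2) % 3;
    for (int j = 0; j < 3; ++j) {
      const int j1 = (j + 1) % 3, j2 = (j + 2) % 3;
      const double ra = a[i1] * AbsR[i2][j] + a[i2] * AbsR[i1][j];
      const double rb = b[j1] * AbsR[i][j2] + b[j2] * AbsR[i][j1];
      const double tl = std::abs(t[i2] * R[i1][j] - t[i1] * R[i2][j]);
      if (tl > ra + rb) return false;
    }
  }
  return true;  // No separating axis found.
}
```

Each lump's compares share no data dependencies, so they can be computed side by side in vector lanes; the branch-count statistics above tell us how often each lump is reached.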
The SIMD implementation needs the same kind of fallback that the rotation matrix multiplication has (i.e., support for when run on a system where SIMD is not available).
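One minimal compile-time shape for that fallback is sketched below (runtime dispatch is the other option; the 1-D "overlap" here is just a placeholder, and all names are hypothetical):

```cpp
// Portable fallback path; always available.
bool BoxesOverlapScalar(double half_widths_sum, double center_distance) {
  return center_distance <= half_widths_sum;
}

#if defined(__AVX2__)
// The intrinsics-based path would live here, behind the same contract as
// the scalar one. (Placeholder body; the real one would use vector ops.)
bool BoxesOverlapAvx2(double half_widths_sum, double center_distance) {
  return center_distance <= half_widths_sum;
}
#endif

// Single entry point: callers never need to know which path they got.
bool BoxesOverlap(double half_widths_sum, double center_distance) {
#if defined(__AVX2__)
  return BoxesOverlapAvx2(half_widths_sum, center_distance);
#else
  return BoxesOverlapScalar(half_widths_sum, center_distance);
#endif
}
```

Runtime cpu-based dispatch (which highway provides out of the box) would let a single binary pick the best path on the machine it lands on, at the cost of an indirect call.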
In addition, we need some supporting infrastructure, e.g. benchmarks.
Describe alternatives you've considered We could consider different bounding volume types (spheres, etc.) and different traversal algorithms (parallel, etc.). However, those are much larger endeavors with uncertain return. Generally, OBB overlap is a valuable operation and making it faster is an unfettered good. Even if we end up accelerating compliant hydro contact using a different bounding volume type, this function should still be faster.