[math] Use highway instead of intrinsics

RobotLocomotion / drake

Model-based design and verification for robotics.

https://drake.mit.edu

Other

3.16k stars 1.24k forks source link

[math] Use highway instead of intrinsics #21571

Closed jwnimmer-tri closed 1 week ago

jwnimmer-tri commented 2 weeks ago

Related to #21526.

The proposition here is that the code more readable with this spelling, compared to the raw intrinsics. See https://google.github.io/highway/en/master/quick_reference.html#operations for the list of functions / grammar.

Down the road, highway also offers macros to handle the runtime-cpu-dispatch automatically. Not only would that let us get rid of our own dispatcher code here, but also highway's dispatcher is more advanced -- we could add support for a AVX512VL implementation without rewriting the code. (LLVM MCA suggests that the AVX512VL spelling is ~20% faster than AVX2.) There's also a possibility that we could turn on macOS SIMD as well, by only changing the configuration of the cpu dispatcher, with no changes to the bodies of these functions.

This change is

jwnimmer-tri commented 2 weeks ago

+@SeanCurtis-TRI I was feeling guilty about forcing you to write code using intrinsics. As it turns out, adding highway to the build system was really easy, and porting the code to it was mostly a quick search-replace.

FYI On the availability of this option for SIMD. If you think it's easier to read / write than the alternative, we can game plan on Monday about how to move forward.

jwnimmer-tri commented 1 week ago

+@sherm1 for feature review, please.

Per team f2f, we think Highway is probably the right way to go, especially as we write more SIMD stuff.

After this lands, I'll follow up with another PR (or two) to rewrite the CPU-dynamic dispatch here to use Highway instead our manual jump table, thus adding support for AVX512VL, hopefully macOS SIMD, etc.

~When I have time later, I'll also post a godbolt link with LLVM MCA analysis of the changes here, for comparison. (I'll just retcon this comment with the link.)~

Edit: Here's the godbolt link. It shows:

The two source codes for XinvX (toggled with an ifdef).
The assembly for our Jammy GCC version for both source codes.
The diff of the assembly.
The Clang assembly and it's MCA report for both source codes.

With Clang, the Block RThroughput estimate on Broadwell increases from 15.3 to 15.5.

sherm1 commented 1 week ago

BTW looking at the Godbolt results I see clang MCA results of 15.3 (Intel) and 12.5 (Highway). Was there an actual 15.5 result that I missed?

jwnimmer-tri commented 1 week ago

Sometimes godbolt gets confused and doesn't give the same MCA answer for the same input files. I assume it's something to do with some version pinning thing I haven't found the knob to set yet.

sherm1 commented 1 week ago

Anyway that's "Block reciprocal throughput" so smaller is better.

jwnimmer-tri commented 1 week ago

Please have a look at https://github.com/RobotLocomotion/drake/pull/21600. It keeps the "abcd" variable name lane order (low lane on the left), but cleans up the LoadMatrix3Inv function to read more clearly.

SeanCurtis-TRI commented 1 week ago

+@SeanCurtis-TRI for platform review.

jwnimmer-tri commented 1 week ago

I re-pushed here for reference, but remember this is NOT the PR we'll plan to review and merge.