Closed jwnimmer-tri closed 1 week ago
+@SeanCurtis-TRI I was feeling guilty about forcing you to write code using intrinsics. As it turns out, adding highway to the build system was really easy, and porting the code to it was mostly a quick search-replace.
FYI On the availability of this option for SIMD. If you think it's easier to read / write than the alternative, we can game plan on Monday about how to move forward.
+@sherm1 for feature review, please.
Per team f2f, we think Highway is probably the right way to go, especially as we write more SIMD stuff.
After this lands, I'll follow up with another PR (or two) to rewrite the CPU-dynamic dispatch here to use Highway instead our manual jump table, thus adding support for AVX512VL, hopefully macOS SIMD, etc.
~When I have time later, I'll also post a godbolt link with LLVM MCA analysis of the changes here, for comparison. (I'll just retcon this comment with the link.)~
Edit: Here's the godbolt link. It shows:
With Clang, the Block RThroughput estimate on Broadwell increases from 15.3 to 15.5.
BTW looking at the Godbolt results I see clang MCA results of 15.3 (Intel) and 12.5 (Highway). Was there an actual 15.5 result that I missed?
Sometimes godbolt gets confused and doesn't give the same MCA answer for the same input files. I assume it's something to do with some version pinning thing I haven't found the knob to set yet.
Anyway that's "Block reciprocal throughput" so smaller is better.
Please have a look at https://github.com/RobotLocomotion/drake/pull/21600. It keeps the "abcd" variable name lane order (low lane on the left), but cleans up the LoadMatrix3Inv
function to read more clearly.
+@SeanCurtis-TRI for platform review.
I re-pushed here for reference, but remember this is NOT the PR we'll plan to review and merge.
Related to #21526.
The proposition here is that the code more readable with this spelling, compared to the raw intrinsics. See https://google.github.io/highway/en/master/quick_reference.html#operations for the list of functions / grammar.
Down the road, highway also offers macros to handle the runtime-cpu-dispatch automatically. Not only would that let us get rid of our own dispatcher code here, but also highway's dispatcher is more advanced -- we could add support for a AVX512VL implementation without rewriting the code. (LLVM MCA suggests that the AVX512VL spelling is ~20% faster than AVX2.) There's also a possibility that we could turn on macOS SIMD as well, by only changing the configuration of the cpu dispatcher, with no changes to the bodies of these functions.
This change is