danieldk closed this 4 years ago
With fma: 12.8s, 13.7s, 12.8s
Without fma: 12.65s, 12.86s, 13.2s
(on tdz-train)
Thanks! I'll merge this, since it doesn't reduce performance and increases precision. I'll see if I can make some changes to improve pipeline use.
This adds a variant that uses the FMA intrinsic when the machine supports it. On my machine it's neither slower nor faster, but my modest i5 might be constrained by cache size and memory speed at this point ;).
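For readers who have not seen this pattern, here is a minimal sketch of "use FMA when the machine supports it". It is not the code from this PR: the thread does not show the PR's language or function names, so Rust and the names `dot_plain`, `dot_fma`, and `dot` are assumptions made for illustration only. The sketch checks for the `fma` CPU feature at runtime and falls back to the plain multiply-then-add loop otherwise.

```rust
/// Plain dot product: each step rounds twice (after the multiply, after the add).
/// Hypothetical helper, not taken from the PR.
fn dot_plain(a: &[f32], b: &[f32]) -> f32 {
    let mut acc = 0.0f32;
    for (&x, &y) in a.iter().zip(b.iter()) {
        acc += x * y;
    }
    acc
}

/// FMA variant: `mul_add` computes x * y + acc with a single rounding.
/// With the `fma` target feature enabled, the compiler can lower it to a
/// hardware fused multiply-add instruction.
/// Safety: call only after confirming the CPU supports FMA.
#[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
#[target_feature(enable = "fma")]
unsafe fn dot_fma(a: &[f32], b: &[f32]) -> f32 {
    let mut acc = 0.0f32;
    for (&x, &y) in a.iter().zip(b.iter()) {
        acc = x.mul_add(y, acc);
    }
    acc
}

/// Dispatcher: picks the FMA variant when the running machine supports it,
/// and the plain variant otherwise (including on non-x86 targets).
fn dot(a: &[f32], b: &[f32]) -> f32 {
    #[cfg(any(target_arch = "x86", target_arch = "x86_64"))]
    {
        if is_x86_feature_detected!("fma") {
            // Sound because the feature check above succeeded.
            return unsafe { dot_fma(a, b) };
        }
    }
    dot_plain(a, b)
}
```

Calling `dot(&[1.0, 2.0, 3.0], &[4.0, 5.0, 6.0])` goes through the dispatcher. The single rounding per step is the precision gain mentioned above. A scalar loop like this is also limited by the latency of the dependent accumulator chain, which fits the observation that the timings barely change.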