apple / swift-numerics

Advanced mathematical types and functions for Swift
Apache License 2.0

"Relaxed" multiply and add operations. #214

Closed stephentyrone closed 1 year ago

stephentyrone commented 2 years ago

This commit adds the following implementation hooks to the AlgebraicField protocol:

static func _relaxedAdd(_:Self, _:Self) -> Self
static func _relaxedMul(_:Self, _:Self) -> Self

These are equivalent to + and *, but have "relaxed semantics"; specifically, they license the compiler to reassociate them and to form FMA nodes, which are both significant optimizations that can easily make many common loops 8-10x faster. These transformations perturb results slightly, so they should not be enabled without care, but the results with the relaxed operations are--for most purposes--"just as good as" (and often better than) what strict operations produce. The main thing to beware of is that they are no longer portable; different compiler versions, targets, and optimization flags can produce different results.
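To illustrate why reassociation and FMA formation perturb results, here is a minimal sketch using only the standard library (it does not use the hooks added in this PR). The constant names are made up for the example. The first half shows that the grouping of a floating-point sum changes the rounded result; the second half uses the standard `addingProduct` method to show how a fused multiply-add, which rounds once, differs from a separate multiply and subtract, which round twice.

```swift
// Reassociation: floating-point addition is not associative, so regrouping
// a sum (which the relaxed operations permit) can change the rounded result.
let leftToRight: Double = (0.1 + 0.2) + 0.3
let rightToLeft: Double = 0.1 + (0.2 + 0.3)
print(leftToRight == rightToLeft)  // false

// FMA: a fused multiply-add rounds once, while a separate multiply and
// subtract round twice. Here the exact value of a*a - c is 2^-54.
let a: Double = 1 + 0x1p-27
let c: Double = 1 + 0x1p-26
let separate = a * a - c              // a*a is rounded to c first, so this is 0
let fused = (-c).addingProduct(a, a)  // -c + a*a with a single rounding
print(separate, fused)
```

In this particular case the fused result is the exact answer, which is one way relaxed operations can be more accurate than strict ones.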

These are then exposed under the Relaxed namespace as:

Relaxed.sum(a, b)
Relaxed.product(a, b)
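
A minimal usage sketch, assuming the Relaxed namespace is reachable through `import RealModule` from swift-numerics. The helper names `total` and `totalProduct` are hypothetical, and the `#else` branch is only a strict fallback so the sketch still compiles where the package is not available.

```swift
#if canImport(RealModule)
import RealModule  // assumption: the swift-numerics module that exposes Relaxed

// Reductions written with the relaxed operations; the compiler is allowed
// to reorder and vectorize these.
func total(_ values: [Float]) -> Float {
  values.reduce(0, Relaxed.sum)
}
func totalProduct(_ values: [Float]) -> Float {
  values.reduce(1, Relaxed.product)
}
#else
// Strict fallback, used only when swift-numerics is not available.
func total(_ values: [Float]) -> Float {
  values.reduce(0, +)
}
func totalProduct(_ values: [Float]) -> Float {
  values.reduce(1, *)
}
#endif

// Small integers are summed and multiplied exactly in any order, so both
// variants agree here; they may differ on general inputs.
print(total([1, 2, 3, 4]))
print(totalProduct([1, 2, 3, 4]))
```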

stephentyrone commented 2 years ago

@swift-ci test

stephentyrone commented 2 years ago

@swift-ci test

stephentyrone commented 2 years ago

Hrm, why are we using a Swift-5.3.3 Linux toolchain for testing instead of something more recent? Still, good to know--if unfortunate--that reassociate(on) is not supported there. I'll have to add a workaround and a note for that.

stephentyrone commented 2 years ago

@swift-ci test

stephentyrone commented 2 years ago

@swift-ci test

stephentyrone commented 2 years ago

@swift-ci test

stephentyrone commented 2 years ago

@swift-ci test

stephentyrone commented 1 year ago

@swift-ci test

stephentyrone commented 1 year ago

Some quick perf numbers from my M1 laptop:

repeatedly summing 1024 Floats

time using reduce(0, +): 0.091 sec
time using reduce(0, Relaxed.sum): 0.009 sec
time using vDSP.sum from Accelerate: 0.004 sec

repeated dot-product of 1024 Floats

time using reduce(0) { $0 + $1*$1 }: 0.085 sec
time using reduce(0) { Relaxed.multiplyAdd($1, $1, $0) }: 0.011 sec
time using vDSP.sumOfSquares from Accelerate: 0.005 sec

For "typical" reduction workloads like those above, we see roughly an 8-10x speedup over the strict operators, and we're within about 2x of hand-written SIMD.
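
The harness behind these numbers is not shown in the thread. As a rough sketch of how such a comparison might be timed, the following uses a hypothetical `measure` helper, a made-up iteration count, and made-up input data; it times only the strict reductions so that it runs without any dependencies.

```swift
import Foundation

// Hypothetical helper (not from the PR): runs `body` `iterations` times and
// returns the elapsed wall-clock seconds together with the last result.
func measure(iterations: Int, _ body: () -> Float) -> (seconds: Double, result: Float) {
  var result: Float = 0
  let start = Date()
  for _ in 0..<iterations { result = body() }
  return (Date().timeIntervalSince(start), result)
}

// 1024 small integer values, so every summation order gives the same exact
// answer and the strict and relaxed variants can be checked against each other.
let data: [Float] = (0..<1024).map { Float($0 % 8) }

let strictSum = measure(iterations: 1_000) { data.reduce(0, +) }
let strictSquares = measure(iterations: 1_000) { data.reduce(0) { $0 + $1 * $1 } }

print("reduce(0, +):", strictSum.seconds, "sec, result", strictSum.result)
print("sum of squares:", strictSquares.seconds, "sec, result", strictSquares.result)
```

With swift-numerics available, the relaxed variants could be timed by passing closures such as `{ data.reduce(0, Relaxed.sum) }` to the same helper. Timings are only meaningful in an optimized build, and they will vary by machine and toolchain.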