apple / swift-numerics

Advanced mathematical types and functions for Swift
Apache License 2.0

Feature request for "relaxed" floating-point semantics #216

Closed WowbaggersLiquidLunch closed 1 year ago

WowbaggersLiquidLunch commented 2 years ago
There is already #214, but I think it's worth adding a use case for it. (Also, I promised to open an issue for it.)

Currently I'm writing a parallelized K-means algorithm that can benefit from "relaxed" floating-point semantics.

In a typical K-means algorithm, there are 2 steps in each clustering iteration: assigning points/observations to clusters and updating cluster centroids. Both steps operate on (often large numbers of) independent points and clusters, which are highly parallelizable.
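
The assignment step isn't shown in the Swift code that follows, so for context, here is a minimal C sketch of it, each point going to the cluster with the nearest centroid (all names here, such as nearest_centroid, are hypothetical and just for illustration):

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

// Squared Euclidean distance between two points with `dims` coordinates.
static double squared_distance(const double *a, const double *b, size_t dims) {
  double sum = 0.0;
  for (size_t d = 0; d < dims; d++) {
    double diff = a[d] - b[d];
    sum += diff * diff;
  }
  return sum;
}

// Assignment step for one point: index of the nearest of `k` centroids,
// stored contiguously as k rows of `dims` doubles.
size_t nearest_centroid(const double *point, const double *centroids,
                        size_t k, size_t dims) {
  size_t best = 0;
  double bestDistance = INFINITY;
  for (size_t c = 0; c < k; c++) {
    double distance = squared_distance(point, &centroids[c * dims], dims);
    if (distance < bestDistance) {
      bestDistance = distance;
      best = c;
    }
  }
  return best;
}
```

Each point's assignment is independent of every other point's, which is what makes this step embarrassingly parallel.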

This is what the sequential version of the "updating cluster centroids" step roughly looks like in Swift:

// Hashable so that points can be stored in a Set.
struct Point: Hashable {
    let coordinates: [Double]
}

actor Cluster {
    var centroidCoordinates: [Double]
    var points: Set<Point>
}

extension Cluster {
    var dimensionCount: Int {
        centroidCoordinates.count
    }

    func updateCentroidCoordinates() {
        guard !points.isEmpty else { return }

        centroidCoordinates = (0..<dimensionCount).map { dimensionIndex in
            points.map { $0.coordinates[dimensionIndex] }.reduce(0, +) / Double(points.count)
        }
    }
}

The updateCentroidCoordinates function sets a cluster's centroid to the average of all its points' positions.

To parallelize this sequential algorithm, one of the best and fastest options is hardware/instruction-level parallelism such as SIMD, better still if the compiler can apply it automatically. However, because the + operation in the inner reduce(0, +) is not associative for floating-point values, the compiler cannot auto-vectorize the loop, and instead emits a serial chain of scalar additions:

.LBB20_61:
        vaddsd  xmm0, xmm0, qword ptr [r13 + 8*rdx + 32]
        vaddsd  xmm0, xmm0, qword ptr [r13 + 8*rdx + 40]
        vaddsd  xmm0, xmm0, qword ptr [r13 + 8*rdx + 48]
        vaddsd  xmm0, xmm0, qword ptr [r13 + 8*rdx + 56]
        vaddsd  xmm0, xmm0, qword ptr [r13 + 8*rdx + 64]
        vaddsd  xmm0, xmm0, qword ptr [r13 + 8*rdx + 72]
        vaddsd  xmm0, xmm0, qword ptr [r13 + 8*rdx + 80]
        vaddsd  xmm0, xmm0, qword ptr [r13 + 8*rdx + 88]

Because the standard library does not provide any public API that tells the compiler it's ok to reorder and vectorize some floating-point operations, making this code fast requires either vectorizing it by hand with SIMD types or finding a workaround that enables auto-vectorization. One workaround suggested on the forums by @stephentyrone is to use always-inlined C functions that wrap "relaxed" floating-point operations:

// myHeader.h
static inline __attribute__((always_inline))
double associative_add(double a, double b) {
#pragma clang fp reassociate(on)
  return a + b;
}
// myFile.swift
func reduce(foo: [Double]) -> Double {
    foo.reduce(0, associative_add)
}

This works, but it comes with some downsides: each user must write their own copy of the workaround, once for every floating-point type they need it for, and because it isn't written in Swift, it can be less accessible to people unfamiliar with C and Clang features. APIs for "relaxed" floating-point semantics provided by swift-numerics would solve both of these problems.

stephentyrone commented 2 years ago

Thanks for the excellent feature request!

stephentyrone commented 1 year ago

Completed with https://github.com/apple/swift-numerics/pull/214