SIMD vectorisation - Githubissues

Smaug123 commented 1 year ago

Array.max, for example, can be made more than an order of magnitude faster with vectorisation on appropriate hardware. The money shot (forgive the rather random naming, I copy-pasted some code which used to be about factorials, for example), on a machine with Vector<byte>.Count = 16:

BenchmarkDotNet v0.13.10, macOS Sonoma 14.1 (23B74) [Darwin 23.1.0]
Apple M1 Max, 1 CPU, 10 logical and 10 physical cores
.NET SDK 7.0.403
  [Host]     : .NET 7.0.13 (7.0.1323.51816), Arm64 RyuJIT AdvSIMD DEBUG
  DefaultJob : .NET 7.0.13 (7.0.1323.51816), Arm64 RyuJIT AdvSIMD

| Method             | ArraySize | Vectorized | Mean            | Error        | StdDev       |
|------------------- |---------- |----------- |----------------:|-------------:|-------------:|
| BenchmarkFactorial | 10        | False      |        13.95 ns |     0.011 ns |     0.009 ns |
| BenchmarkFactorial | 10        | True       |        14.23 ns |     0.015 ns |     0.012 ns |
| BenchmarkFactorial | 100       | False      |        71.63 ns |     0.765 ns |     0.716 ns |
| BenchmarkFactorial | 100       | True       |        19.84 ns |     0.035 ns |     0.027 ns |
| BenchmarkFactorial | 1000      | False      |       641.30 ns |     1.559 ns |     1.382 ns |
| BenchmarkFactorial | 1000      | True       |        67.48 ns |     0.125 ns |     0.098 ns |
| BenchmarkFactorial | 10000     | False      |     6,283.97 ns |    12.787 ns |    11.336 ns |
| BenchmarkFactorial | 10000     | True       |       458.34 ns |     1.151 ns |     0.899 ns |
| BenchmarkFactorial | 100000    | False      |    62,312.46 ns |    21.059 ns |    17.585 ns |
| BenchmarkFactorial | 100000    | True       |     4,435.52 ns |     7.009 ns |     5.853 ns |
| BenchmarkFactorial | 1000000   | False      |   622,888.31 ns |   415.604 ns |   368.422 ns |
| BenchmarkFactorial | 1000000   | True       |    44,163.61 ns |    42.545 ns |    35.527 ns |
| BenchmarkFactorial | 2000000   | False      | 1,245,989.22 ns | 1,678.470 ns | 1,401.599 ns |
| BenchmarkFactorial | 2000000   | True       |    87,850.09 ns |    64.740 ns |    54.061 ns |

// * Hints *
Outliers
  Bench.BenchmarkFactorial: Default -> 2 outliers were removed, 3 outliers were detected (14.98 ns, 15.04 ns, 15.14 ns)
  Bench.BenchmarkFactorial: Default -> 3 outliers were removed (15.41 ns..15.44 ns)
  Bench.BenchmarkFactorial: Default -> 3 outliers were removed (21.06 ns..21.11 ns)
  Bench.BenchmarkFactorial: Default -> 1 outlier  was  removed (647.56 ns)
  Bench.BenchmarkFactorial: Default -> 3 outliers were removed, 5 outliers were detected (69.05 ns, 69.08 ns, 69.73 ns..70.75 ns)
  Bench.BenchmarkFactorial: Default -> 1 outlier  was  removed (6.33 us)
  Bench.BenchmarkFactorial: Default -> 3 outliers were removed (465.11 ns..465.85 ns)
  Bench.BenchmarkFactorial: Default -> 2 outliers were removed (63.04 us, 63.09 us)
  Bench.BenchmarkFactorial: Default -> 2 outliers were removed (4.48 us, 4.48 us)
  Bench.BenchmarkFactorial: Default -> 1 outlier  was  removed (628.06 us)
  Bench.BenchmarkFactorial: Default -> 2 outliers were removed (44.30 us, 44.38 us)
  Bench.BenchmarkFactorial: Default -> 2 outliers were removed (1.25 ms, 1.30 ms)
  Bench.BenchmarkFactorial: Default -> 2 outliers were removed (88.18 us, 89.12 us)

// * Legends *
  ArraySize  : Value of the 'ArraySize' parameter
  Vectorized : Value of the 'Vectorized' parameter
  Mean       : Arithmetic mean of all measurements
  Error      : Half of 99.9% confidence interval
  StdDev     : Standard deviation of all measurements
  1 ns       : 1 Nanosecond (0.000000001 sec)

It's trivial to write the property-based test that asserts this agrees with Array.max; and I reran this using Array.max directly (rather than my vendored version) and got basically the same numbers.

Describe the solution you'd like

All appropriate array methods should be vectorised. There are examples in System.Linq.

One difficulty is that Vector<'a> requires 'a : (new : unit -> 'a) and 'a : struct and 'a :> System.ValueType. I do not know how to make this change in a way that doesn't break the API, because I don't know how to provide one implementation for the 'a which satisfy those requirements and one for 'a which do not. But it would be great if we could simply provide various implementations of Array.max and have the correct one be selected.

Describe alternatives you've considered

We can always not do this; people can write this themselves, or call out to System.Linq manually. It's sad if they have to do that, though.

Additional context Program:

namespace Vectorised

open BenchmarkDotNet.Attributes
open BenchmarkDotNet.Running

module Vectorised =
    let inline maxFromFSharpCore (array: _[]) =
        if isNull array then
            nullArg "array"

        if array.Length = 0 then
            invalidArg "array" "array was empty"

        let mutable acc = array.[0]

        for i = 1 to array.Length - 1 do
            let curr = array.[i]

            if curr > acc then
                acc <- curr

        acc

    open System.Numerics

    let inline max<'a when 'a : comparison and 'a : (new : unit -> 'a) and 'a : struct and 'a :> System.ValueType> (array : 'a []) : 'a =
        if isNull array then
            nullArg "array"

        let span = System.MemoryExtensions.AsSpan array
        if span.IsEmpty then
            invalidArg "array" "array was empty"

        let mutable index = 0
        let mutable acc = Unchecked.defaultof<'a>

        if (Vector.IsHardwareAccelerated && span.Length >= Vector<'a>.Count * 2) then
            let mutable maxes = Vector<'a> span
            index <- Vector<'a>.Count

            maxes <- Vector.Max(maxes, new Vector<'a>(span.Slice(index)));
            index <- index + Vector<'a>.Count

            while (index + Vector<'a>.Count <= span.Length) do
                maxes <- Vector.Max(maxes, new Vector<'a>(span.Slice(index)));
                index <- index + Vector<'a>.Count

            acc <- maxes.[0]
            for i = 1 to Vector<'a>.Count - 1 do
                if maxes.[i] > acc then
                    acc <- maxes.[i]

        else
            acc <- span.[0]
            index <- 1

        for i = index to span.Length - 1 do
            if span.[i] > acc then
                acc <- span.[i]

        acc

    let random = System.Random ()

    let arrays =
        [| 10 ; 100 ; 1_000 ; 10_000 ; 100_000 ; 1_000_000 ; 2_000_000 |]
        |> Array.map (fun i -> i, Array.zeroCreate i)
        |> Array.map (fun (index, arr) ->
            random.NextBytes arr
            index, arr
        )
        |> Map.ofArray

    type Bench () =

        [<Params(10, 100, 1000, 10_000, 100_000, 1_000_000, 2_000_000)>]
        member val ArraySize = 0 with get, set

        [<Params (true, false)>]
        member val Vectorized = false with get, set

        // Define the method to benchmark
        [<Benchmark>]
        member this.BenchmarkFactorial() =
            if this.Vectorized then
                max arrays.[this.ArraySize]
                |> ignore
            else
                maxFromFSharpCore arrays.[this.ArraySize]
                |> ignore

    [<EntryPoint>]
    let main _ =
        let summary = BenchmarkRunner.Run<Bench>()
        if System.Object.ReferenceEquals (summary, null) then
            failwith "got no summary"

        0

Smaug123 commented 1 year ago

Come to think of it, maybe we could literally just call System.Linq.Max from Array.max.

kerams commented 1 year ago

Many vectorization intrinsics are not available in netstandard2.0 or even 2.1.

Smaug123 commented 1 year ago

Many vectorization intrinsics are not available in netstandard2.0 or even 2.1.

Yep, this would have to be conditionally compiled in on the appropriate frameworks, I imagine.

tannergooding commented 1 year ago

One difficulty is that Vector<'a> requires 'a : (new : unit -> 'a) and 'a : struct and 'a :> System.ValueType. I do not know how to make this change in a way that doesn't break the API, because I don't know how to provide one implementation for the 'a which satisfy those requirements and one for 'a which do not. But it would be great if we could simply provide various implementations of Array.max and have the correct one be selected.

We relaxed this in .NET 8 and it no longer requires any constraint. This allows generic code to simply do the equivalent of if (Vector<T>.IsSupported && Vector.IsHardwareAccelerated) { ... }

That being said, we do indeed accelerate System.Linq.Max already. There are also more considerations than System.Numerics.Vector<T> and you can achieve greater performance by taking advantage of the newer System.Runtime.Intrinsics APIs. We are likewise working on providing more functions for performance vectorizable functions over spans of data. This is being done via System.Numerics.Tensors.TensorPrimitives, which is out of band and only supports float in .NET 8, but which will expand to the full set of T that Vector<T> supports into .NET 9.

-- Noting that there may be nuance for exact semantics which may or may not be compatible with F#'s existing semantics, and so the ability to accelerate using such APIs likely needs to be considered case by case.

Smaug123 commented 1 year ago

Cool - if it were purely my decision, then, I'd just replace the implementation of Array.max with a single call to System.Linq's version, since it sounds like keeping up with System.Linq will be a bunch of duplicative work.

vzarytovskii commented 1 year ago

This needs to be a language suggestion, since it's a big change to core library.

It needs to cover corner cases, and current implementation comparisons (especially around handling NaNs, existing constraints vs ones which are coming from Enumerable, and such, as they might be breaking changes), also to include a proposed design with different targets for corelib.

We don't have current plans for new target for FSharp.Core, there are too many open questions with it.

As a testing ground, a separate library with same modules, shadowing builtin ones, might work.

T-Gro commented 1 year ago

I would be in favor of the following:

Separate standalone library with new NET targets, which would provide the vectorized functions
Separate fsharp suggestions on FSharp.Core evolution w.r.t. to new APIs available in NET
If the conclusion is that a modular design is fitting (real "core" being NS, additional libs based on newer targets), the library from # 1 above would live directly in this repo and be considered part of "FSharp.Core family of split libraries"

vzarytovskii commented 1 year ago

I would be in favor of the following:

Separate standalone library with new NET targets, which would provide the vectorized functions

Separate fsharp suggestions on FSharp.Core evolution w.r.t. to new APIs available in NET

If the conclusion is that a modular design is fitting (real "core" being NS, additional libs based on newer targets), the library from # 1 above would live directly in this repo and be considered part of "FSharp.Core family of split libraries"

This should be a separate suggestion, since FSharp.Core is a very special library, separating it will require a metric tonne of work, including type forwarding, changes to compiler, etc.

@KevinRansom tried to investigate it, it turned out to be a gigantic task.

vzarytovskii commented 1 year ago

Related, in regars to changes to standard library: https://github.com/dotnet/fsharp/issues/13207

tannergooding commented 1 year ago

This needs to be a language suggestion, since it's a big change to core library.

@vzarytovskii, could you elaborate on this a bit?

I definitely understand the general consideration around ensuring using any existing BCL method doesn't change the behavior or edge case handling before F# could make the switch to using it. Hence my comment of "Noting that there may be nuance for exact semantics which may or may not be compatible with F#'s existing semantics, and so the ability to accelerate using such APIs likely needs to be considered case by case."

However, SIMD doesn't itself change the semantics of the algorithm, only the performance, and can be made to ensure all the same edge cases are handled identically to the scalar implementation. So while it's work, and something that the F# team would have to be willing to maintain. It should just be an internal implementation detail that is otherwise unobservable to the consumer of the API. So, I don't understand why the use of SIMD in general is considered a "big change".

-- FSharp.Core is already multitargeting netstandard2.0 and netstandard2.1 and so already has access to System.Numerics.Vector<T> under which it could do some basic acceleration. However, it could also reasonably multi-target net8.0 (and moving forward the latest version) and simply #if NET8_0_OR_GREATER to light up usage of the newer intrinsics (whether on Vector<T> or the newer Vector128<T> and friends).

vzarytovskii commented 1 year ago

This needs to be a language suggestion, since it's a big change to core library.

@vzarytovskii, could you elaborate on this a bit? I don't understand why the use of SIMD in general is considered a "big change".

Because generally it might be a behaviour change at runtime (around NaNs, constraints, exceptions thrown, just general behaviour), it needs to be proven otherwise, hence the need of suggestion, test matrix, detailed description and probably some prototyping. We don't wan't to accidentally break someone just by upgrading SDK (and implicitly - FSharp.Core).

However, SIMD doesn't itself change the semantics of the algorithm, only the performance

That is not necessary true, we've seen a different behaviour between our functions and what's in Enumerable in some corner cases. This needs to be investigated and described. Example: https://github.com/dotnet/fsharp/issues/13207

However, it could also reasonably multi-target net8.0 (and moving forward the latest version) and simply #if NET8_0_OR_GREATER to light up usage of the newer intrinsics (whether on Vector<T> or the newer Vector128<T> and friends).

We don't consider adding additional targets at this point, it needs to be designed and discussed separately - which targets do we support, for how long, how often do we introduce new once/discontinue old ones, do we keep FSharp.Core as a singular assembly or separate it into multiple ones, and so on.

tannergooding commented 1 year ago

That is not necessary true, we've seen a different behaviour between our functions and what's in Enumerable in some corner cases. This needs to be investigated and described. Example: https://github.com/dotnet/fsharp/issues/13207

That's a difference in implementation of two similar functions, it is not a difference caused by SIMD.

That is, Enumerable.M() and some functionally similar M() in FSharp.Core may indeed be different. In that case, F# would presumably not want to use the BCL function.

However, nothing would prevent F# from writing its own vectorized code the behaves identically to F#'s existing functionality. It wouldn't even require FSharp.Core to reference anything additional in the case of the existing netstandard2.1 target. It is truly an implementation detail.

vzarytovskii commented 1 year ago

That's a difference in implementation of two similar functions, it is not a difference caused by SIMD.

That's what I meant, sorry that was unclear - if we use existing implementations from BCL, it will be a big change from standard library perspective (using BCL code vs our own, will affect inlining for example) and might be breaking.

So that's why a concrete language/library proposal is needed, so we clear what's proposed, what will be consequences, what are the alternatives, etc.

However, nothing would prevent F# from writing its own vectorized code the behaves identically to F#'s existing functionality. It wouldn't even require FSharp.Core to reference anything additional in the case of the existing netstandard2.1 target. It is truly an implementation detail.

Even in this case it has potential breaking something if some details are forgotten, hence - needs a suggestion + design.

dotnet / fsharp

SIMD vectorisation #16230