NetFabric / NetFabric.Numerics.Tensors

A generic tensor library for .NET
https://netfabric.github.io/NetFabric.Numerics.Tensors/
MIT License

Questions #24

Open Darelbi opened 3 months ago

Darelbi commented 3 months ago

Hi! Thank you for this amazing library. However, it is not clear from the documentation whether it supports matrix/tensor multiplication.

Does it also employ thread parallelism (Parallel.For, in addition to SIMD instructions)?

If it supports tensor/matrix multiplication, it would be great for machine learning: for example, a forward pass in a neural network is just a(Wx + b), where W is the matrix of weights, x the input vector, b the bias, and a the activation function. In case it is supported already, how do I use the tensor/matrix multiplication? Thanks!

Darelbi commented 3 months ago

OK, never mind. I looked at the very clean source code, and it clearly doesn't support it.

It would be nice to have, though implementing it would not be easy. I would also add an IParallelProvider generic parameter allowing people to switch to Parallel.For when needed.
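Just to illustrate the shape of the idea (all names below are made up, nothing like this exists in the library yet):

using System;
using System.Threading.Tasks;

// Hypothetical: the caller picks the execution strategy through a generic
// parameter, so the sequential path pays no delegate or threading overhead.
interface IParallelProvider
{
    static abstract void For(int fromInclusive, int toExclusive, Action<int> body);
}

// Runs the loop body inline on the calling thread.
struct SequentialProvider : IParallelProvider
{
    public static void For(int fromInclusive, int toExclusive, Action<int> body)
    {
        for (var index = fromInclusive; index < toExclusive; index++)
            body(index);
    }
}

// Distributes the iterations across the thread pool.
struct ParallelProvider : IParallelProvider
{
    public static void For(int fromInclusive, int toExclusive, Action<int> body)
        => Parallel.For(fromInclusive, toExclusive, body);
}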

For the machine learning part, the inputs are usually processed in batches, so with inputs x1, x2, x3, x4 there is a single matrix operation a(W·(x1|x2|x3|x4) + b) = (y1|y2|y3|y4). Supporting such an operation with SIMD-optimized matrix multiplication plus Parallel.For would make NetFabric.Numerics.Tensors very appealing for machine learning. Of course, my example uses vector inputs, so W is 2D and each x is 1D, but nothing prevents using 2D images as input (making W a 3D tensor) and so on, even though inputs with three or more dimensions are rare.
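To make the operation concrete, here is a naive (non-SIMD, non-parallel) sketch of the batched forward pass; the names are illustrative only:

using System;

static class NaiveForwardPass
{
    // Computes Y = a(W·X + B) for a batch of inputs.
    // W is rows×cols, X packs one input vector per column (cols×batch),
    // B holds one bias per output row, and Y receives one output column per input.
    public static void Forward(float[,] w, float[,] x, float[] b, float[,] y, Func<float, float> a)
    {
        var rows = w.GetLength(0);
        var cols = w.GetLength(1);
        var batch = x.GetLength(1);
        for (var row = 0; row < rows; row++)
        {
            for (var column = 0; column < batch; column++)
            {
                var sum = b[row];
                for (var k = 0; k < cols; k++)
                    sum += w[row, k] * x[k, column];
                y[row, column] = a(sum); // e.g. a = v => MathF.Max(0f, v) for ReLU
            }
        }
    }
}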

Nonetheless, that is what I'm doing right now. I was searching for an open-source C# framework that does this, and this one is the closest to what I need. (I just need the matrix multiplication; I don't require other methods like singular value decomposition and so on.)

aalmada commented 2 months ago

Hi @Darelbi! I kicked off this library to streamline SIMD operations on spans. Along the way, I stumbled upon System.Numerics.Tensors, which shared some similarities but came with its own set of limitations. So, I've been refining my version to overcome these limitations and enhance both performance and functionality. I've laid a solid foundation and now I'm ready to ramp up improvements. Open to new ideas and contributions. Also intrigued by the idea of using Parallel.For.

aalmada commented 2 months ago

I experimented with adding Parallel.For but ran into an issue: spans are ref structs, so they can't be used in lambdas. There's one unsafe workaround: https://stackoverflow.com/a/66747462/861773
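The suggested approach pins the buffers and captures raw pointers instead of the spans themselves. A minimal sketch of it for a concrete element type (not the library's code; note that pointer access skips bounds checking):

using System;
using System.Threading.Tasks;

static unsafe class UnsafeParallel
{
    // Pins both buffers and copies the pointers into plain locals, because
    // neither ref structs nor fixed locals can be captured by a lambda.
    // The caller must guarantee destination is at least as long as source.
    public static void Double(ReadOnlySpan<int> source, Span<int> destination)
    {
        fixed (int* sourcePtr = source)
        fixed (int* destinationPtr = destination)
        {
            var src = sourcePtr;
            var dst = destinationPtr;
            Parallel.For(0, source.Length, index => dst[index] = 2 * src[index]);
        }
    }
}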

aalmada commented 2 months ago

Hi @Darelbi,

I've been playing around with this idea lately. Unfortunately, every attempt I've made runs into the snag that a ref struct, like Span<T>, can't reside on the heap. That makes them a no-go for lambda expressions, which all the parallelization solutions require.

Check out this prototype you can fiddle with:

using System.Numerics;

const int size = 10_000;
var source = Enumerable.Range(0, size).ToArray();
var destination = new int[size];

Tensor.Apply<int, DoubleOperator<int>>(source, destination);

Console.WriteLine("Array processing complete.");

static class Tensor
{   
    public static void Apply<T, TOperator>(ReadOnlySpan<T> source, Span<T> destination)
        where TOperator: IUnaryOperator<T, T>
        => Apply<T, T, TOperator>(source, destination);

    public static void Apply<T, TResult, TOperator>(ReadOnlySpan<T> source, Span<TResult> destination)
        where TOperator: IUnaryOperator<T, TResult>
    {      
        if(source.Length > 100)
            ParallelApply<T, TResult, TOperator>(source, destination);
        else
            Apply<T, TResult, TOperator>(source, destination, 0, source.Length);
    }

    static void ParallelApply<T, TResult, TOperator>(ReadOnlySpan<T> source, Span<TResult> destination)
        where TOperator: IUnaryOperator<T, TResult>
    {
        var availableCores = Environment.ProcessorCount;
        var size = source.Length;
        var chunkSize = size / availableCores;

        var actions = new Action[availableCores];
        for (var coreIndex = 0; coreIndex < availableCores; coreIndex++)
        {
            var startIndex = coreIndex * chunkSize;
            var endIndex = (coreIndex == availableCores - 1) 
                ? size 
                : (coreIndex + 1) * chunkSize;
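            // This is the snag: the lambda below would have to capture the spans,
            // and ref structs like Span<T> cannot be captured by a lambda,
            // so this line does not compile.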
            actions[coreIndex] = () => Apply<T, TResult, TOperator>(source, destination, startIndex, endIndex);
        }
        Parallel.Invoke(actions);
    }

    static void Apply<T, TResult, TOperator>(ReadOnlySpan<T> source, Span<TResult> destination, int startIndex, int endIndex)
        where TOperator: IUnaryOperator<T, TResult>
    {
        for (var index = startIndex; index < endIndex; index++)
        {
            destination[index] = TOperator.Invoke(source[index]);
        }
    }
}

interface IUnaryOperator<T, TResult>
{
    static abstract TResult Invoke(T x);
}

readonly struct DoubleOperator<T>
    : IUnaryOperator<T, T>
    where T: INumberBase<T>, IMultiplyOperators<T, T, T>
{
    public static T Invoke(T x) => T.CreateChecked(2) * x;
}

Attempting to pin the span, as suggested here, doesn't work with generics (taking a pointer requires an unmanaged type).

Also, giving a callback delegate a shot, as suggested here, lands us in the same heap issue (boxing).

At this point, it seems like the only way forward is to switch to Memory<T>, which means reworking the entire public API, but it may be worth it.

Here's a prototype using Memory<T>:

using System.Numerics;

const int size = 10_000;
var source = Enumerable.Range(0, size).ToArray();
var destination = new int[size];

Tensor.Apply<int, DoubleOperator<int>>(source, destination);

Console.WriteLine("Array processing complete.");

static class Tensor
{   
    public static void Apply<T, TOperator>(ReadOnlyMemory<T> source, Memory<T> destination)
        where TOperator: IUnaryOperator<T, T>
        => Apply<T, T, TOperator>(source, destination);

    public static void Apply<T, TResult, TOperator>(ReadOnlyMemory<T> source, Memory<TResult> destination)
        where TOperator: IUnaryOperator<T, TResult>
    {      
        if(source.Length > 100)
            ParallelApply<T, TResult, TOperator>(source, destination);
        else
            Apply<T, TResult, TOperator>(source.Span, destination.Span, 0, source.Length);
    }

    static void ParallelApply<T, TResult, TOperator>(ReadOnlyMemory<T> source, Memory<TResult> destination)
        where TOperator: IUnaryOperator<T, TResult>
    {
        var availableCores = Environment.ProcessorCount;
        var size = source.Length;
        var chunkSize = size / availableCores;

        var actions = new Action[availableCores];
        for (var coreIndex = 0; coreIndex < availableCores; coreIndex++)
        {
            var startIndex = coreIndex * chunkSize;
            var endIndex = (coreIndex == availableCores - 1) 
                ? size 
                : (coreIndex + 1) * chunkSize;
            actions[coreIndex] = () => Apply<T, TResult, TOperator>(source.Span, destination.Span, startIndex, endIndex);
        }
        Parallel.Invoke(actions);
    }

    static void Apply<T, TResult, TOperator>(ReadOnlySpan<T> source, Span<TResult> destination, int startIndex, int endIndex)
        where TOperator: IUnaryOperator<T, TResult>
    {
        for (var index = startIndex; index < endIndex; index++)
        {
            destination[index] = TOperator.Invoke(source[index]);
        }
    }
}

interface IUnaryOperator<T, TResult>
{
    static abstract TResult Invoke(T x);
}

readonly struct DoubleOperator<T>
    : IUnaryOperator<T, T>
    where T: INumberBase<T>, IMultiplyOperators<T, T, T>
{
    public static T Invoke(T x) => T.CreateChecked(2) * x;
}

Do you have any more suggestions?

aalmada commented 2 months ago

I made some enhancements. Now each chunk has a minimum size, to prevent spending more time managing threads than processing the data. The APIs now support both Memory<T> and Span<T>, but CPU parallelization is only available for the Memory<T> APIs. To prevent overload resolution ambiguity (an array converts implicitly to both), overloads for arrays are also required. This implies that all operations must include these overloads, which is quite a bit of work...

using System.Numerics;

const int size = 10_100;
var source = Enumerable.Range(0, size).ToArray();
var destination = new int[size];

Console.WriteLine("Array processing started.");
Tensor.Apply<int, DoubleOperator<int>>(source, destination);
Console.WriteLine("Array processing complete.");

static class Tensor
{   
    const int minChunkSize = 100;

    public static void Apply<T, TOperator>(T[] source, T[] destination)
        where TOperator: IUnaryOperator<T, T>
        => Apply<T, T, TOperator>(source.AsMemory(), destination.AsMemory());

    public static void Apply<T, TResult, TOperator>(T[] source, TResult[] destination)
        where TOperator: IUnaryOperator<T, TResult>
        => Apply<T, TResult, TOperator>(source.AsMemory(), destination.AsMemory());

    public static void Apply<T, TOperator>(ReadOnlyMemory<T> source, Memory<T> destination)
        where TOperator: IUnaryOperator<T, T>
        => Apply<T, T, TOperator>(source, destination);

    public static void Apply<T, TResult, TOperator>(ReadOnlyMemory<T> source, Memory<TResult> destination)
        where TOperator: IUnaryOperator<T, TResult>
    {      
        if(source.Length > 2 * minChunkSize)
            ParallelApply<T, TResult, TOperator>(source, destination);
        else
            Apply<T, TResult, TOperator>(source.Span, destination.Span);
    }

    static void ParallelApply<T, TResult, TOperator>(ReadOnlyMemory<T> source, Memory<TResult> destination)
        where TOperator: IUnaryOperator<T, TResult>
    {
        var availableCores = Environment.ProcessorCount;
        var size = source.Length;
        var chunkSize = int.Max(size / availableCores, minChunkSize);

        var actions = new Action[size / chunkSize];
        for (var index = 0; index < actions.Length; index++)
        {
            var start = index * chunkSize;
            var length = (index == actions.Length - 1) 
                ? size - start
                : chunkSize;

            Console.WriteLine($"Core: {index} Start: {start} Length: {length}");

            var sourceSlice = source.Slice(start, length);
            var destinationSlice = destination.Slice(start, length);
            actions[index] = () => Apply<T, TResult, TOperator>(sourceSlice.Span, destinationSlice.Span);
        }
        Console.WriteLine("Parallel processing started.");
        Parallel.Invoke(actions);
    }

    public static void Apply<T, TOperator>(ReadOnlySpan<T> source, Span<T> destination)
        where TOperator: IUnaryOperator<T, T>
        => Apply<T, T, TOperator>(source, destination);

    public static void Apply<T, TResult, TOperator>(ReadOnlySpan<T> source, Span<TResult> destination)
        where TOperator: IUnaryOperator<T, TResult>
    {
        Console.WriteLine($"Processing chunk! Source: {source.Length} Destination: {destination.Length}");
        // SIMD processing to be added here
        for (var index = 0; index < source.Length && index < destination.Length; index++)
            destination[index] = TOperator.Invoke(source[index]);
    }
}

interface IUnaryOperator<T, TResult>
{
    static abstract TResult Invoke(T x);
}

readonly struct DoubleOperator<T>
    : IUnaryOperator<T, T>
    where T: INumberBase<T>, IMultiplyOperators<T, T, T>
{
    public static T Invoke(T x) => T.CreateChecked(2) * x;
}

aalmada commented 2 months ago

I ran tests on branch #29, but the results aren't too encouraging. The multicore performance is slower, regardless of whether SIMD is employed or not. I need to explore the hardware constraints to understand this better.


BenchmarkDotNet v0.13.12, Windows 10 (10.0.19045.4353/22H2/2022Update)
Intel Core i7-7567U CPU 3.50GHz (Kaby Lake), 1 CPU, 4 logical and 2 physical cores
.NET SDK 9.0.100-preview.1.24101.2
  [Host]    : .NET 8.0.2 (8.0.224.6711), X64 RyuJIT AVX2
  Scalar    : .NET 8.0.2 (8.0.224.6711), X64 RyuJIT
  Vector128 : .NET 8.0.2 (8.0.224.6711), X64 RyuJIT AVX
  Vector256 : .NET 8.0.2 (8.0.224.6711), X64 RyuJIT AVX2
| Method           | Job       | Categories | Count | Mean         | StdDev     | Median       | Ratio         |
|------------------|-----------|------------|------:|-------------:|-----------:|-------------:|---------------|
| Baseline_Double  | Scalar    | Double     |  1000 |  1,078.60 ns |  40.114 ns |  1,060.22 ns | baseline      |
| System_Double    | Scalar    | Double     |  1000 |    406.24 ns |  13.758 ns |    399.47 ns | 2.67x faster  |
| NetFabric_Double | Scalar    | Double     |  1000 |  2,180.41 ns | 166.326 ns |  2,150.15 ns | 2.05x slower  |
| Baseline_Double  | Vector128 | Double     |  1000 |  1,042.76 ns |  20.052 ns |  1,035.44 ns | 1.04x faster  |
| System_Double    | Vector128 | Double     |  1000 |    205.04 ns |   4.889 ns |    203.97 ns | 5.28x faster  |
| NetFabric_Double | Vector128 | Double     |  1000 |  2,270.89 ns | 125.795 ns |  2,307.96 ns | 2.07x slower  |
| Baseline_Double  | Vector256 | Double     |  1000 |  1,444.38 ns | 116.838 ns |  1,456.48 ns | 1.31x slower  |
| System_Double    | Vector256 | Double     |  1000 |    152.14 ns |  11.168 ns |    149.91 ns | 7.20x faster  |
| NetFabric_Double | Vector256 | Double     |  1000 |  2,201.52 ns |  80.728 ns |  2,205.31 ns | 2.04x slower  |
| Baseline_Float   | Scalar    | Float      |  1000 |  1,209.83 ns |  61.967 ns |  1,197.02 ns | baseline      |
| System_Float     | Scalar    | Float      |  1000 |    480.37 ns |  33.768 ns |    472.45 ns | 2.54x faster  |
| NetFabric_Float  | Scalar    | Float      |  1000 |  2,359.35 ns |  93.242 ns |  2,387.54 ns | 1.96x slower  |
| Baseline_Float   | Vector128 | Float      |  1000 |    770.33 ns |  53.606 ns |    750.29 ns | 1.57x faster  |
| System_Float     | Vector128 | Float      |  1000 |    126.49 ns |   9.335 ns |    125.69 ns | 9.58x faster  |
| NetFabric_Float  | Vector128 | Float      |  1000 |  2,152.34 ns |  89.694 ns |  2,153.25 ns | 1.79x slower  |
| Baseline_Float   | Vector256 | Float      |  1000 |    762.05 ns |  79.493 ns |    753.45 ns | 1.56x faster  |
| System_Float     | Vector256 | Float      |  1000 |     67.04 ns |   1.134 ns |     66.90 ns | 18.42x faster |
| NetFabric_Float  | Vector256 | Float      |  1000 |  1,999.26 ns |  90.642 ns |  2,017.39 ns | 1.66x slower  |
| Baseline_Half    | Scalar    | Half       |  1000 | 12,504.44 ns | 286.312 ns | 12,399.19 ns | baseline      |
| System_Half      | Scalar    | Half       |  1000 | 12,231.32 ns | 120.729 ns | 12,238.40 ns | 1.02x faster  |
| NetFabric_Half   | Scalar    | Half       |  1000 |  9,433.74 ns | 867.650 ns |  9,546.42 ns | 1.35x faster  |
| Baseline_Half    | Vector128 | Half       |  1000 |  9,697.71 ns | 240.589 ns |  9,676.38 ns | 1.29x faster  |
| System_Half      | Vector128 | Half       |  1000 | 10,333.35 ns | 852.316 ns |  9,931.87 ns | 1.18x faster  |
| NetFabric_Half   | Vector128 | Half       |  1000 |  8,915.24 ns | 799.399 ns |  8,905.60 ns | 1.51x faster  |
| Baseline_Half    | Vector256 | Half       |  1000 | 10,267.79 ns | 924.079 ns |  9,858.21 ns | 1.26x faster  |
| System_Half      | Vector256 | Half       |  1000 |  9,777.72 ns |  98.069 ns |  9,765.89 ns | 1.28x faster  |
| NetFabric_Half   | Vector256 | Half       |  1000 |  9,393.03 ns | 475.270 ns |  9,403.79 ns | 1.36x faster  |
| Baseline_Int     | Scalar    | Int        |  1000 |  1,297.64 ns |  12.022 ns |  1,299.23 ns | baseline      |
| System_Int       | Scalar    | Int        |  1000 |    407.63 ns |   4.247 ns |    409.42 ns | 3.18x faster  |
| NetFabric_Int    | Scalar    | Int        |  1000 |  2,341.00 ns | 112.485 ns |  2,360.99 ns | 1.69x slower  |
| Baseline_Int     | Vector128 | Int        |  1000 |  1,353.19 ns |  75.724 ns |  1,316.32 ns | 1.05x slower  |
| System_Int       | Vector128 | Int        |  1000 |    115.52 ns |   6.332 ns |    114.52 ns | 11.38x faster |
| NetFabric_Int    | Vector128 | Int        |  1000 |  2,108.18 ns | 110.913 ns |  2,122.89 ns | 1.54x slower  |
| Baseline_Int     | Vector256 | Int        |  1000 |  1,307.51 ns |  21.841 ns |  1,305.11 ns | 1.01x slower  |
| System_Int       | Vector256 | Int        |  1000 |     64.33 ns |   1.039 ns |     64.19 ns | 20.18x faster |
| NetFabric_Int    | Vector256 | Int        |  1000 |  1,993.01 ns |  90.504 ns |  2,016.42 ns | 1.55x slower  |
| Baseline_Long    | Scalar    | Long       |  1000 |  1,045.51 ns |  18.504 ns |  1,044.03 ns | baseline      |
| System_Long      | Scalar    | Long       |  1000 |    406.87 ns |   7.117 ns |    405.92 ns | 2.57x faster  |
| NetFabric_Long   | Scalar    | Long       |  1000 |  2,256.12 ns | 163.947 ns |  2,250.57 ns | 2.18x slower  |
| Baseline_Long    | Vector128 | Long       |  1000 |  1,071.94 ns |  48.088 ns |  1,050.91 ns | 1.04x slower  |
| System_Long      | Vector128 | Long       |  1000 |    207.46 ns |   4.846 ns |    205.69 ns | 5.03x faster  |
| NetFabric_Long   | Vector128 | Long       |  1000 |  2,197.30 ns | 162.174 ns |  2,164.07 ns | 2.15x slower  |
| Baseline_Long    | Vector256 | Long       |  1000 |  1,047.96 ns |  16.598 ns |  1,042.90 ns | 1.00x slower  |
| System_Long      | Vector256 | Long       |  1000 |    123.71 ns |   0.750 ns |    123.83 ns | 8.46x faster  |
| NetFabric_Long   | Vector256 | Long       |  1000 |  2,191.66 ns | 103.227 ns |  2,201.34 ns | 2.03x slower  |
| Baseline_Short   | Scalar    | Short      |  1000 |  1,050.32 ns |  13.160 ns |  1,051.75 ns | baseline      |
| System_Short     | Scalar    | Short      |  1000 |    413.54 ns |  14.802 ns |    409.65 ns | 2.52x faster  |
| NetFabric_Short  | Scalar    | Short      |  1000 |  2,185.30 ns | 169.597 ns |  2,129.97 ns | 2.16x slower  |
| Baseline_Short   | Vector128 | Short      |  1000 |  1,042.56 ns |  10.547 ns |  1,041.07 ns | 1.01x faster  |
| System_Short     | Vector128 | Short      |  1000 |     57.45 ns |   2.324 ns |     56.74 ns | 18.53x faster |
| NetFabric_Short  | Vector128 | Short      |  1000 |  2,001.51 ns |  93.791 ns |  2,016.08 ns | 1.89x slower  |
| Baseline_Short   | Vector256 | Short      |  1000 |  1,125.94 ns |  93.649 ns |  1,092.19 ns | 1.05x slower  |
| System_Short     | Vector256 | Short      |  1000 |     39.64 ns |   3.571 ns |     38.01 ns | 26.02x faster |
| NetFabric_Short  | Vector256 | Short      |  1000 |  1,980.78 ns |  87.917 ns |  2,002.02 ns | 1.85x slower  |
Darelbi commented 2 months ago

Maybe you ran into a bandwidth limitation?

aalmada commented 2 months ago

@Darelbi I kept on researching and wrote an article explaining the current steps: https://aalmada.github.io/posts/Unleashing-parallelism/ Feedback and ideas are welcome!

Darelbi commented 2 months ago

Thanks for it, very interesting. By the way, what are the specs of your system?

aalmada commented 2 months ago

I've been testing it on multiple systems:

The benchmarks on the article are for the AMD.