NetFabric / NetFabric.Numerics.Tensors

A generic tensor library for .NET
https://netfabric.github.io/NetFabric.Numerics.Tensors/
MIT License

Questions #24

Open Darelbi opened 3 months ago

Darelbi commented 3 months ago

Hi! Thank you for this amazing library. However, it is not clear from the documentation whether it supports matrix/tensor multiplication.

Does it also employ thread parallelism (Parallel.For, in addition to SIMD instructions)?

If it supports tensor/matrix multiplication, it would be great for machine learning: for example, a forward pass in a neural network is just a(Wx + b), where W is the matrix of weights, x the input vector, b the bias, and a the activation function. In case it is supported already, how do I use the tensor/matrix multiplication? Thanks!

Darelbi commented 3 months ago

OK, never mind. I looked at the very clean source code, and it clearly doesn't support it.

It would be nice to have, though implementing it would not be easy. I would also add an IParallelProvider generic parameter allowing people to switch to Parallel.For when needed.
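Just to illustrate the shape of the idea (all names below are made up, nothing like this exists in the library yet):

using System;
using System.Threading.Tasks;

// Hypothetical: the caller picks the execution strategy through a generic
// parameter, so the sequential path pays no delegate or threading overhead.
interface IParallelProvider
{
    static abstract void For(int fromInclusive, int toExclusive, Action<int> body);
}

// Runs the loop body inline on the calling thread.
struct SequentialProvider : IParallelProvider
{
    public static void For(int fromInclusive, int toExclusive, Action<int> body)
    {
        for (var index = fromInclusive; index < toExclusive; index++)
            body(index);
    }
}

// Distributes the iterations across the thread pool.
struct ParallelProvider : IParallelProvider
{
    public static void For(int fromInclusive, int toExclusive, Action<int> body)
        => Parallel.For(fromInclusive, toExclusive, body);
}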

For the machine learning part, the inputs are usually processed in batches, so with inputs x1, x2, x3, x4 there is a single matrix operation a(W·(x1|x2|x3|x4) + b) = (y1|y2|y3|y4). Supporting such an operation with SIMD-optimized matrix multiplication plus Parallel.For would make NetFabric.Numerics.Tensors very appealing for machine learning. Of course, my example uses vector inputs, so W is 2D and each x is 1D, but nothing prevents using 2D images as input (making W a 3D tensor) and so on, even though inputs with three or more dimensions are rare.
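To make the operation concrete, here is a naive (non-SIMD, non-parallel) sketch of the batched forward pass; the names are illustrative only:

using System;

static class NaiveForwardPass
{
    // Computes Y = a(W·X + B) for a batch of inputs.
    // W is rows×cols, X packs one input vector per column (cols×batch),
    // B holds one bias per output row, and Y receives one output column per input.
    public static void Forward(float[,] w, float[,] x, float[] b, float[,] y, Func<float, float> a)
    {
        var rows = w.GetLength(0);
        var cols = w.GetLength(1);
        var batch = x.GetLength(1);
        for (var row = 0; row < rows; row++)
        {
            for (var column = 0; column < batch; column++)
            {
                var sum = b[row];
                for (var k = 0; k < cols; k++)
                    sum += w[row, k] * x[k, column];
                y[row, column] = a(sum); // e.g. a = v => MathF.Max(0f, v) for ReLU
            }
        }
    }
}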

Nonetheless, that is what I'm doing right now. I was searching for an open-source C# framework that does this, and this one is the closest to what I need. (I just need the matrix multiplication; I don't require other methods like singular value decomposition and so on.)

aalmada commented 2 months ago

Hi @Darelbi! I kicked off this library to streamline SIMD operations on spans. Along the way, I stumbled upon System.Numerics.Tensors, which shared some similarities but came with its own set of limitations. So, I've been refining my version to overcome these limitations and enhance both performance and functionality. I've laid a solid foundation and now I'm ready to ramp up improvements. Open to new ideas and contributions. Also intrigued by the idea of using Parallel.For.

aalmada commented 2 months ago

I experimented with adding Parallel.For but ran into an issue: spans are ref structs, so they can't be used in lambdas. There's one unsafe workaround: https://stackoverflow.com/a/66747462/861773
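The suggested approach pins the buffers and captures raw pointers instead of the spans themselves. A minimal sketch of it for a concrete element type (not the library's code; note that pointer access skips bounds checking):

using System;
using System.Threading.Tasks;

static unsafe class UnsafeParallel
{
    // Pins both buffers and copies the pointers into plain locals, because
    // neither ref structs nor fixed locals can be captured by a lambda.
    // The caller must guarantee destination is at least as long as source.
    public static void Double(ReadOnlySpan<int> source, Span<int> destination)
    {
        fixed (int* sourcePtr = source)
        fixed (int* destinationPtr = destination)
        {
            var src = sourcePtr;
            var dst = destinationPtr;
            Parallel.For(0, source.Length, index => dst[index] = 2 * src[index]);
        }
    }
}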

aalmada commented 2 months ago

Hi @Darelbi,

I've been playing around with this idea lately. Unfortunately, every attempt I've made runs into the snag that a ref struct, like Span<T>, can't reside on the heap. That makes them a no-go for lambda expressions, which all the parallelization solutions require.

Check out this prototype you can fiddle with:

using System.Numerics;

const int size = 10_000;
var source = Enumerable.Range(0, size).ToArray();
var destination = new int[size];

Tensor.Apply<int, DoubleOperator<int>>(source, destination);

Console.WriteLine("Array processing complete.");

static class Tensor
{   
    public static void Apply<T, TOperator>(ReadOnlySpan<T> source, Span<T> destination)
        where TOperator: IUnaryOperator<T, T>
        => Apply<T, T, TOperator>(source, destination);

    public static void Apply<T, TResult, TOperator>(ReadOnlySpan<T> source, Span<TResult> destination)
        where TOperator: IUnaryOperator<T, TResult>
    {      
        if(source.Length > 100)
            ParallelApply<T, TResult, TOperator>(source, destination);
        else
            Apply<T, TResult, TOperator>(source, destination, 0, source.Length);
    }

    static void ParallelApply<T, TResult, TOperator>(ReadOnlySpan<T> source, Span<TResult> destination)
        where TOperator: IUnaryOperator<T, TResult>
    {
        var availableCores = Environment.ProcessorCount;
        var size = source.Length;
        var chunkSize = size / availableCores;

        var actions = new Action[availableCores];
        for (var coreIndex = 0; coreIndex < availableCores; coreIndex++)
        {
            var startIndex = coreIndex * chunkSize;
            var endIndex = (coreIndex == availableCores - 1) 
                ? size 
                : (coreIndex + 1) * chunkSize;
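            // This is the snag: the lambda below would have to capture the spans,
            // and ref structs like Span<T> cannot be captured by a lambda,
            // so this line does not compile.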
            actions[coreIndex] = () => Apply<T, TResult, TOperator>(source, destination, startIndex, endIndex);
        }
        Parallel.Invoke(actions);
    }

    static void Apply<T, TResult, TOperator>(ReadOnlySpan<T> source, Span<TResult> destination, int startIndex, int endIndex)
        where TOperator: IUnaryOperator<T, TResult>
    {
        for (var index = startIndex; index < endIndex; index++)
        {
            destination[index] = TOperator.Invoke(source[index]);
        }
    }
}

interface IUnaryOperator<T, TResult>
{
    static abstract TResult Invoke(T x);
}

readonly struct DoubleOperator<T>
    : IUnaryOperator<T, T>
    where T: INumberBase<T>, IMultiplyOperators<T, T, T>
{
    public static T Invoke(T x) => T.CreateChecked(2) * x;
}

Attempting to pin the span, as suggested here, doesn't work with generics (taking a pointer requires an unmanaged type).

Also, giving a callback delegate a shot, as suggested here, lands us in the same heap issue (boxing).

At this point, it seems like the only way forward is to switch to Memory<T>, which means reworking the entire public API, but it may be worth it.

Here's a prototype using Memory<T>:

using System.Numerics;

const int size = 10_000;
var source = Enumerable.Range(0, size).ToArray();
var destination = new int[size];

Tensor.Apply<int, DoubleOperator<int>>(source, destination);

Console.WriteLine("Array processing complete.");

static class Tensor
{   
    public static void Apply<T, TOperator>(ReadOnlyMemory<T> source, Memory<T> destination)
        where TOperator: IUnaryOperator<T, T>
        => Apply<T, T, TOperator>(source, destination);

    public static void Apply<T, TResult, TOperator>(ReadOnlyMemory<T> source, Memory<TResult> destination)
        where TOperator: IUnaryOperator<T, TResult>
    {      
        if(source.Length > 100)
            ParallelApply<T, TResult, TOperator>(source, destination);
        else
            Apply<T, TResult, TOperator>(source.Span, destination.Span, 0, source.Length);
    }

    static void ParallelApply<T, TResult, TOperator>(ReadOnlyMemory<T> source, Memory<TResult> destination)
        where TOperator: IUnaryOperator<T, TResult>
    {
        var availableCores = Environment.ProcessorCount;
        var size = source.Length;
        var chunkSize = size / availableCores;

        var actions = new Action[availableCores];
        for (var coreIndex = 0; coreIndex < availableCores; coreIndex++)
        {
            var startIndex = coreIndex * chunkSize;
            var endIndex = (coreIndex == availableCores - 1) 
                ? size 
                : (coreIndex + 1) * chunkSize;
            actions[coreIndex] = () => Apply<T, TResult, TOperator>(source.Span, destination.Span, startIndex, endIndex);
        }
        Parallel.Invoke(actions);
    }

    static void Apply<T, TResult, TOperator>(ReadOnlySpan<T> source, Span<TResult> destination, int startIndex, int endIndex)
        where TOperator: IUnaryOperator<T, TResult>
    {
        for (var index = startIndex; index < endIndex; index++)
        {
            destination[index] = TOperator.Invoke(source[index]);
        }
    }
}

interface IUnaryOperator<T, TResult>
{
    static abstract TResult Invoke(T x);
}

readonly struct DoubleOperator<T>
    : IUnaryOperator<T, T>
    where T: INumberBase<T>, IMultiplyOperators<T, T, T>
{
    public static T Invoke(T x) => T.CreateChecked(2) * x;
}

Do you have any more suggestions?

aalmada commented 2 months ago

I made some enhancements. Now each chunk has a minimum size, to prevent spending more time managing threads than processing the data. The APIs now support both Memory<T> and Span<T>, but CPU parallelization is only available for the Memory<T> APIs. To prevent overload resolution ambiguity (an array converts implicitly to both), overloads for arrays are also required. This implies that all operations must include these overloads, which is quite a bit of work...

using System.Numerics;

const int size = 10_100;
var source = Enumerable.Range(0, size).ToArray();
var destination = new int[size];

Console.WriteLine("Array processing started.");
Tensor.Apply<int, DoubleOperator<int>>(source, destination);
Console.WriteLine("Array processing complete.");

static class Tensor
{   
    const int minChunkSize = 100;

    public static void Apply<T, TOperator>(T[] source, T[] destination)
        where TOperator: IUnaryOperator<T, T>
        => Apply<T, T, TOperator>(source.AsMemory(), destination.AsMemory());

    public static void Apply<T, TResult, TOperator>(T[] source, TResult[] destination)
        where TOperator: IUnaryOperator<T, TResult>
        => Apply<T, TResult, TOperator>(source.AsMemory(), destination.AsMemory());

    public static void Apply<T, TOperator>(ReadOnlyMemory<T> source, Memory<T> destination)
        where TOperator: IUnaryOperator<T, T>
        => Apply<T, T, TOperator>(source, destination);

    public static void Apply<T, TResult, TOperator>(ReadOnlyMemory<T> source, Memory<TResult> destination)
        where TOperator: IUnaryOperator<T, TResult>
    {      
        if(source.Length > 2 * minChunkSize)
            ParallelApply<T, TResult, TOperator>(source, destination);
        else
            Apply<T, TResult, TOperator>(source.Span, destination.Span);
    }

    static void ParallelApply<T, TResult, TOperator>(ReadOnlyMemory<T> source, Memory<TResult> destination)
        where TOperator: IUnaryOperator<T, TResult>
    {
        var availableCores = Environment.ProcessorCount;
        var size = source.Length;
        var chunkSize = int.Max(size / availableCores, minChunkSize);

        var actions = new Action[size / chunkSize];
        for (var index = 0; index < actions.Length; index++)
        {
            var start = index * chunkSize;
            var length = (index == actions.Length - 1) 
                ? size - start
                : chunkSize;

            Console.WriteLine($"Core: {index} Start: {start} Length: {length}");

            var sourceSlice = source.Slice(start, length);
            var destinationSlice = destination.Slice(start, length);
            actions[index] = () => Apply<T, TResult, TOperator>(sourceSlice.Span, destinationSlice.Span);
        }
        Console.WriteLine("Parallel processing started.");
        Parallel.Invoke(actions);
    }

    public static void Apply<T, TOperator>(ReadOnlySpan<T> source, Span<T> destination)
        where TOperator: IUnaryOperator<T, T>
        => Apply<T, T, TOperator>(source, destination);

    public static void Apply<T, TResult, TOperator>(ReadOnlySpan<T> source, Span<TResult> destination)
        where TOperator: IUnaryOperator<T, TResult>
    {
        Console.WriteLine($"Processing chunk! Source: {source.Length} Destination: {destination.Length}");
        // SIMD processing to be added here
        for (var index = 0; index < source.Length && index < destination.Length; index++)
            destination[index] = TOperator.Invoke(source[index]);
    }
}

interface IUnaryOperator<T, TResult>
{
    static abstract TResult Invoke(T x);
}

readonly struct DoubleOperator<T>
    : IUnaryOperator<T, T>
    where T: INumberBase<T>, IMultiplyOperators<T, T, T>
{
    public static T Invoke(T x) => T.CreateChecked(2) * x;
}

aalmada commented 2 months ago

I ran tests on branch #29, but the results aren't too encouraging. The multicore performance is slower, regardless of whether SIMD is employed or not. I need to explore the hardware constraints to understand this better.


BenchmarkDotNet v0.13.12, Windows 10 (10.0.19045.4353/22H2/2022Update)
Intel Core i7-7567U CPU 3.50GHz (Kaby Lake), 1 CPU, 4 logical and 2 physical cores
.NET SDK 9.0.100-preview.1.24101.2
  [Host]    : .NET 8.0.2 (8.0.224.6711), X64 RyuJIT AVX2
  Scalar    : .NET 8.0.2 (8.0.224.6711), X64 RyuJIT
  Vector128 : .NET 8.0.2 (8.0.224.6711), X64 RyuJIT AVX
  Vector256 : .NET 8.0.2 (8.0.224.6711), X64 RyuJIT AVX2
| Method           | Job       | Categories | Count | Mean         | StdDev     | Median       | Ratio         |
|------------------|-----------|------------|------:|-------------:|-----------:|-------------:|---------------|
| Baseline_Double  | Scalar    | Double     |  1000 |  1,078.60 ns |  40.114 ns |  1,060.22 ns | baseline      |
| System_Double    | Scalar    | Double     |  1000 |    406.24 ns |  13.758 ns |    399.47 ns | 2.67x faster  |
| NetFabric_Double | Scalar    | Double     |  1000 |  2,180.41 ns | 166.326 ns |  2,150.15 ns | 2.05x slower  |
| Baseline_Double  | Vector128 | Double     |  1000 |  1,042.76 ns |  20.052 ns |  1,035.44 ns | 1.04x faster  |
| System_Double    | Vector128 | Double     |  1000 |    205.04 ns |   4.889 ns |    203.97 ns | 5.28x faster  |
| NetFabric_Double | Vector128 | Double     |  1000 |  2,270.89 ns | 125.795 ns |  2,307.96 ns | 2.07x slower  |
| Baseline_Double  | Vector256 | Double     |  1000 |  1,444.38 ns | 116.838 ns |  1,456.48 ns | 1.31x slower  |
| System_Double    | Vector256 | Double     |  1000 |    152.14 ns |  11.168 ns |    149.91 ns | 7.20x faster  |
| NetFabric_Double | Vector256 | Double     |  1000 |  2,201.52 ns |  80.728 ns |  2,205.31 ns | 2.04x slower  |
| Baseline_Float   | Scalar    | Float      |  1000 |  1,209.83 ns |  61.967 ns |  1,197.02 ns | baseline      |
| System_Float     | Scalar    | Float      |  1000 |    480.37 ns |  33.768 ns |    472.45 ns | 2.54x faster  |
| NetFabric_Float  | Scalar    | Float      |  1000 |  2,359.35 ns |  93.242 ns |  2,387.54 ns | 1.96x slower  |
| Baseline_Float   | Vector128 | Float      |  1000 |    770.33 ns |  53.606 ns |    750.29 ns | 1.57x faster  |
| System_Float     | Vector128 | Float      |  1000 |    126.49 ns |   9.335 ns |    125.69 ns | 9.58x faster  |
| NetFabric_Float  | Vector128 | Float      |  1000 |  2,152.34 ns |  89.694 ns |  2,153.25 ns | 1.79x slower  |
| Baseline_Float   | Vector256 | Float      |  1000 |    762.05 ns |  79.493 ns |    753.45 ns | 1.56x faster  |
| System_Float     | Vector256 | Float      |  1000 |     67.04 ns |   1.134 ns |     66.90 ns | 18.42x faster |
| NetFabric_Float  | Vector256 | Float      |  1000 |  1,999.26 ns |  90.642 ns |  2,017.39 ns | 1.66x slower  |
| Baseline_Half    | Scalar    | Half       |  1000 | 12,504.44 ns | 286.312 ns | 12,399.19 ns | baseline      |
| System_Half      | Scalar    | Half       |  1000 | 12,231.32 ns | 120.729 ns | 12,238.40 ns | 1.02x faster  |
| NetFabric_Half   | Scalar    | Half       |  1000 |  9,433.74 ns | 867.650 ns |  9,546.42 ns | 1.35x faster  |
| Baseline_Half    | Vector128 | Half       |  1000 |  9,697.71 ns | 240.589 ns |  9,676.38 ns | 1.29x faster  |
| System_Half      | Vector128 | Half       |  1000 | 10,333.35 ns | 852.316 ns |  9,931.87 ns | 1.18x faster  |
| NetFabric_Half   | Vector128 | Half       |  1000 |  8,915.24 ns | 799.399 ns |  8,905.60 ns | 1.51x faster  |
| Baseline_Half    | Vector256 | Half       |  1000 | 10,267.79 ns | 924.079 ns |  9,858.21 ns | 1.26x faster  |
| System_Half      | Vector256 | Half       |  1000 |  9,777.72 ns |  98.069 ns |  9,765.89 ns | 1.28x faster  |
| NetFabric_Half   | Vector256 | Half       |  1000 |  9,393.03 ns | 475.270 ns |  9,403.79 ns | 1.36x faster  |
| Baseline_Int     | Scalar    | Int        |  1000 |  1,297.64 ns |  12.022 ns |  1,299.23 ns | baseline      |
| System_Int       | Scalar    | Int        |  1000 |    407.63 ns |   4.247 ns |    409.42 ns | 3.18x faster  |
| NetFabric_Int    | Scalar    | Int        |  1000 |  2,341.00 ns | 112.485 ns |  2,360.99 ns | 1.69x slower  |
| Baseline_Int     | Vector128 | Int        |  1000 |  1,353.19 ns |  75.724 ns |  1,316.32 ns | 1.05x slower  |
| System_Int       | Vector128 | Int        |  1000 |    115.52 ns |   6.332 ns |    114.52 ns | 11.38x faster |
| NetFabric_Int    | Vector128 | Int        |  1000 |  2,108.18 ns | 110.913 ns |  2,122.89 ns | 1.54x slower  |
| Baseline_Int     | Vector256 | Int        |  1000 |  1,307.51 ns |  21.841 ns |  1,305.11 ns | 1.01x slower  |
| System_Int       | Vector256 | Int        |  1000 |     64.33 ns |   1.039 ns |     64.19 ns | 20.18x faster |
| NetFabric_Int    | Vector256 | Int        |  1000 |  1,993.01 ns |  90.504 ns |  2,016.42 ns | 1.55x slower  |
| Baseline_Long    | Scalar    | Long       |  1000 |  1,045.51 ns |  18.504 ns |  1,044.03 ns | baseline      |
| System_Long      | Scalar    | Long       |  1000 |    406.87 ns |   7.117 ns |    405.92 ns | 2.57x faster  |
| NetFabric_Long   | Scalar    | Long       |  1000 |  2,256.12 ns | 163.947 ns |  2,250.57 ns | 2.18x slower  |
| Baseline_Long    | Vector128 | Long       |  1000 |  1,071.94 ns |  48.088 ns |  1,050.91 ns | 1.04x slower  |
| System_Long      | Vector128 | Long       |  1000 |    207.46 ns |   4.846 ns |    205.69 ns | 5.03x faster  |
| NetFabric_Long   | Vector128 | Long       |  1000 |  2,197.30 ns | 162.174 ns |  2,164.07 ns | 2.15x slower  |
| Baseline_Long    | Vector256 | Long       |  1000 |  1,047.96 ns |  16.598 ns |  1,042.90 ns | 1.00x slower  |
| System_Long      | Vector256 | Long       |  1000 |    123.71 ns |   0.750 ns |    123.83 ns | 8.46x faster  |
| NetFabric_Long   | Vector256 | Long       |  1000 |  2,191.66 ns | 103.227 ns |  2,201.34 ns | 2.03x slower  |
| Baseline_Short   | Scalar    | Short      |  1000 |  1,050.32 ns |  13.160 ns |  1,051.75 ns | baseline      |
| System_Short     | Scalar    | Short      |  1000 |    413.54 ns |  14.802 ns |    409.65 ns | 2.52x faster  |
| NetFabric_Short  | Scalar    | Short      |  1000 |  2,185.30 ns | 169.597 ns |  2,129.97 ns | 2.16x slower  |
| Baseline_Short   | Vector128 | Short      |  1000 |  1,042.56 ns |  10.547 ns |  1,041.07 ns | 1.01x faster  |
| System_Short     | Vector128 | Short      |  1000 |     57.45 ns |   2.324 ns |     56.74 ns | 18.53x faster |
| NetFabric_Short  | Vector128 | Short      |  1000 |  2,001.51 ns |  93.791 ns |  2,016.08 ns | 1.89x slower  |
| Baseline_Short   | Vector256 | Short      |  1000 |  1,125.94 ns |  93.649 ns |  1,092.19 ns | 1.05x slower  |
| System_Short     | Vector256 | Short      |  1000 |     39.64 ns |   3.571 ns |     38.01 ns | 26.02x faster |
| NetFabric_Short  | Vector256 | Short      |  1000 |  1,980.78 ns |  87.917 ns |  2,002.02 ns | 1.85x slower  |
Darelbi commented 2 months ago

Maybe you ran into a bandwidth limitation?

aalmada commented 2 months ago

@Darelbi I kept on researching and wrote an article explaining the current steps: https://aalmada.github.io/posts/Unleashing-parallelism/ Feedback and ideas are welcome!

Darelbi commented 2 months ago

Thanks for it, very interesting. By the way, what are the specs of your system?

aalmada commented 2 months ago

I've been testing it on multiple systems:

The benchmarks on the article are for the AMD.