**Darelbi** opened this issue 3 months ago
OK, never mind. I looked at the very clean source code; it clearly doesn't support it.
It would be nice to have, though implementing it would not be easy. I would also add an `IParallelProvider` generic parameter, letting people switch to a `Parallel.For` implementation when needed.
For machine learning, the inputs are usually processed in batches: given inputs x1, x2, x3, x4, the forward pass becomes a single matrix operation a(W*(x1|x2|x3|x4) + b) = (y1|y2|y3|y4). Supporting this with SIMD-optimized matrix multiplication plus `Parallel.For` would make NetFabric.Numerics.Tensors very appealing for machine learning. My example uses vector inputs, so W is 2D and each x is 1D, but nothing prevents using 2D images as input (making W a 3D tensor) and so on, even though inputs with 3 or more dimensions are rare.
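The batched forward pass described above can be written out in a few lines of plain C#. This is a hypothetical illustration only, not part of NetFabric.Numerics.Tensors; the `Forward` helper and its signature are invented for the example:

```csharp
using System;

// Hypothetical sketch of the batched forward pass a(W*X + b):
// X stacks the batch inputs x1..x4 as columns, so all outputs
// y1..y4 fall out of a single matrix product.
static double[,] Forward(double[,] W, double[,] X, double[] b, Func<double, double> a)
{
    int rows = W.GetLength(0), inner = W.GetLength(1), batch = X.GetLength(1);
    var Y = new double[rows, batch];
    for (var i = 0; i < rows; i++)
        for (var j = 0; j < batch; j++)
        {
            var sum = b[i];
            for (var k = 0; k < inner; k++)
                sum += W[i, k] * X[k, j];
            Y[i, j] = a(sum);  // apply the activation element-wise
        }
    return Y;
}

var W = new double[,] { { 1, 0 }, { 0, 1 } };   // 2x2 identity weights
var X = new double[,] { { 1, -2 }, { 3, 4 } };  // batch of two column inputs
var b = new double[] { 0, 0 };
var Y = Forward(W, X, b, x => Math.Max(0, x));  // ReLU activation
Console.WriteLine($"{Y[0, 0]} {Y[0, 1]} {Y[1, 0]} {Y[1, 1]}"); // 1 0 3 4
```

The inner two loops are exactly the part that SIMD matrix multiplication plus `Parallel.For` would accelerate.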
Nonetheless, I'm doing exactly that right now. I was searching for an open-source C# framework for it, and this one is the closest to what I need. (I only need matrix multiplication; I don't require other methods like singular value decomposition.)
Hi @Darelbi! I kicked off this library to streamline SIMD operations on spans. Along the way, I stumbled upon System.Numerics.Tensors, which shared some similarities but came with its own set of limitations. So, I've been refining my version to overcome these limitations and enhance both performance and functionality. I've laid a solid foundation and now I'm ready to ramp up improvements. Open to new ideas and contributions. Also intrigued by the idea of using Parallel.For.
I experimented with adding `Parallel.For` but ran into an issue: spans are ref structs, so they can't be used in lambdas. There's an unsafe workaround: https://stackoverflow.com/a/66747462/861773
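That unsafe workaround boils down to pinning the array and capturing a raw pointer plus a length instead of the span itself, then rebuilding a span inside each lambda. A minimal sketch of the pattern (not library code; requires `AllowUnsafeBlocks` in the project file):

```csharp
using System;
using System.Threading.Tasks;

var data = new double[1_000];

// Spans can't be captured by lambdas, but a raw pointer can.
// Pin the memory for the duration of the parallel work, then
// rebuild a Span from pointer + length inside each lambda.
unsafe
{
    fixed (double* ptr = data)
    {
        var p = ptr;              // 'fixed' locals can't be captured either; copy first
        var length = data.Length;
        Parallel.For(0, 2, chunk =>
        {
            var half = length / 2;
            var span = new Span<double>(p + chunk * half, half);
            for (var i = 0; i < span.Length; i++)
                span[i] = 1.0;    // each chunk fills its own half
        });
    }
}

Console.WriteLine(data[0] + data[999]); // 2
```

The obvious downside is that the whole hot path becomes `unsafe`, and the pin must outlive every worker, which is easy to get wrong.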
Hi @Darelbi,
I've been playing around with this idea lately. Unfortunately, every attempt I've made runs into the snag that any ref struct, like `Span<T>`, can't reside on the heap. That makes them a no-go for lambda expressions, which every parallelization approach requires.
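The snag shows up in a few lines: `Memory<T>` is an ordinary struct, so a lambda can capture it, while the equivalent `Span<T>` capture is rejected at compile time. A small standalone demo:

```csharp
using System;
using System.Threading.Tasks;

var data = new int[4];
Memory<int> memory = data;   // Memory<T> is a regular struct: heap-friendly

// Capturing Memory<T> compiles fine; the equivalent Span capture
// ("Span<int> span = data; Task.Run(() => { span[0] = 1; });") is
// rejected by the compiler because ref structs can't live on the heap.
Task.Run(() => { memory.Span[0] = 1; }).Wait();

Console.WriteLine(data[0]); // 1
```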
Check out this prototype you can fiddle with:
```csharp
using System.Numerics;

const int size = 10_000;
var source = Enumerable.Range(0, size).ToArray();
var destination = new int[size];

Tensor.Apply<int, DoubleOperator<int>>(source, destination);
Console.WriteLine("Array processing complete.");

static class Tensor
{
    public static void Apply<T, TOperator>(ReadOnlySpan<T> source, Span<T> destination)
        where TOperator : IUnaryOperator<T, T>
        => Apply<T, T, TOperator>(source, destination);

    public static void Apply<T, TResult, TOperator>(ReadOnlySpan<T> source, Span<TResult> destination)
        where TOperator : IUnaryOperator<T, TResult>
    {
        if (source.Length > 100)
            ParallelApply<T, TResult, TOperator>(source, destination);
        else
            Apply<T, TResult, TOperator>(source, destination, 0, source.Length);
    }

    static void ParallelApply<T, TResult, TOperator>(ReadOnlySpan<T> source, Span<TResult> destination)
        where TOperator : IUnaryOperator<T, TResult>
    {
        var availableCores = Environment.ProcessorCount;
        var size = source.Length;
        var chunkSize = size / availableCores;
        var actions = new Action[availableCores];
        for (var coreIndex = 0; coreIndex < availableCores; coreIndex++)
        {
            var startIndex = coreIndex * chunkSize;
            var endIndex = (coreIndex == availableCores - 1)
                ? size
                : (coreIndex + 1) * chunkSize;
            // This is where it breaks: spans are ref structs and cannot be
            // captured by a lambda, so this line does not compile.
            actions[coreIndex] = () => Apply<T, TResult, TOperator>(source, destination, startIndex, endIndex);
        }
        Parallel.Invoke(actions);
    }

    static void Apply<T, TResult, TOperator>(ReadOnlySpan<T> source, Span<TResult> destination, int startIndex, int endIndex)
        where TOperator : IUnaryOperator<T, TResult>
    {
        for (var index = startIndex; index < endIndex; index++)
            destination[index] = TOperator.Invoke(source[index]);
    }
}

interface IUnaryOperator<T, TResult>
{
    static abstract TResult Invoke(T x);
}

readonly struct DoubleOperator<T>
    : IUnaryOperator<T, T>
    where T : INumberBase<T>, IMultiplyOperators<T, T, T>
{
    public static T Invoke(T x) => T.CreateChecked(2) * x;
}
```
Attempting to pin the span, as suggested here, doesn't work with generics.
Also, giving a callback delegate a shot, as suggested here, lands us in the same heap issue (boxing).
At this point, it seems the only way forward is to switch to `Memory<T>`, which means reworking the entire public API, but it may be worth it. Here's a prototype using `Memory<T>`:
```csharp
using System.Numerics;

const int size = 10_000;
var source = Enumerable.Range(0, size).ToArray();
var destination = new int[size];

Tensor.Apply<int, DoubleOperator<int>>(source, destination);
Console.WriteLine("Array processing complete.");

static class Tensor
{
    public static void Apply<T, TOperator>(ReadOnlyMemory<T> source, Memory<T> destination)
        where TOperator : IUnaryOperator<T, T>
        => Apply<T, T, TOperator>(source, destination);

    public static void Apply<T, TResult, TOperator>(ReadOnlyMemory<T> source, Memory<TResult> destination)
        where TOperator : IUnaryOperator<T, TResult>
    {
        if (source.Length > 100)
            ParallelApply<T, TResult, TOperator>(source, destination);
        else
            Apply<T, TResult, TOperator>(source.Span, destination.Span, 0, source.Length);
    }

    static void ParallelApply<T, TResult, TOperator>(ReadOnlyMemory<T> source, Memory<TResult> destination)
        where TOperator : IUnaryOperator<T, TResult>
    {
        var availableCores = Environment.ProcessorCount;
        var size = source.Length;
        var chunkSize = size / availableCores;
        var actions = new Action[availableCores];
        for (var coreIndex = 0; coreIndex < availableCores; coreIndex++)
        {
            var startIndex = coreIndex * chunkSize;
            var endIndex = (coreIndex == availableCores - 1)
                ? size
                : (coreIndex + 1) * chunkSize;
            // Capturing Memory<T> is fine; the spans are only obtained inside the lambda.
            actions[coreIndex] = () => Apply<T, TResult, TOperator>(source.Span, destination.Span, startIndex, endIndex);
        }
        Parallel.Invoke(actions);
    }

    static void Apply<T, TResult, TOperator>(ReadOnlySpan<T> source, Span<TResult> destination, int startIndex, int endIndex)
        where TOperator : IUnaryOperator<T, TResult>
    {
        for (var index = startIndex; index < endIndex; index++)
            destination[index] = TOperator.Invoke(source[index]);
    }
}

interface IUnaryOperator<T, TResult>
{
    static abstract TResult Invoke(T x);
}

readonly struct DoubleOperator<T>
    : IUnaryOperator<T, T>
    where T : INumberBase<T>, IMultiplyOperators<T, T, T>
{
    public static T Invoke(T x) => T.CreateChecked(2) * x;
}
```
Do you have any more suggestions?
I made some enhancements. Each chunk now has a minimum size, to avoid spending more time managing threads than processing the data. The APIs now support both `Memory<T>` and `Span<T>`, but CPU parallelization is only available for the `Memory<T>` APIs. To prevent ambiguity, overloads for arrays are also required. This implies that all operations must include these overloads, which is quite a bit of work...
```csharp
using System.Numerics;

const int size = 10_100;
var source = Enumerable.Range(0, size).ToArray();
var destination = new int[size];

Console.WriteLine("Array processing started.");
Tensor.Apply<int, DoubleOperator<int>>(source, destination);
Console.WriteLine("Array processing complete.");

static class Tensor
{
    const int minChunkSize = 100;

    public static void Apply<T, TOperator>(T[] source, T[] destination)
        where TOperator : IUnaryOperator<T, T>
        => Apply<T, T, TOperator>(source.AsMemory(), destination.AsMemory());

    public static void Apply<T, TResult, TOperator>(T[] source, TResult[] destination)
        where TOperator : IUnaryOperator<T, TResult>
        => Apply<T, TResult, TOperator>(source.AsMemory(), destination.AsMemory());

    public static void Apply<T, TOperator>(ReadOnlyMemory<T> source, Memory<T> destination)
        where TOperator : IUnaryOperator<T, T>
        => Apply<T, T, TOperator>(source, destination);

    public static void Apply<T, TResult, TOperator>(ReadOnlyMemory<T> source, Memory<TResult> destination)
        where TOperator : IUnaryOperator<T, TResult>
    {
        if (source.Length > 2 * minChunkSize)
            ParallelApply<T, TResult, TOperator>(source, destination);
        else
            Apply<T, TResult, TOperator>(source.Span, destination.Span);
    }

    static void ParallelApply<T, TResult, TOperator>(ReadOnlyMemory<T> source, Memory<TResult> destination)
        where TOperator : IUnaryOperator<T, TResult>
    {
        var availableCores = Environment.ProcessorCount;
        var size = source.Length;
        var chunkSize = int.Max(size / availableCores, minChunkSize);
        var actions = new Action[size / chunkSize];
        for (var index = 0; index < actions.Length; index++)
        {
            var start = index * chunkSize;
            var length = (index == actions.Length - 1)
                ? size - start
                : chunkSize;
            Console.WriteLine($"Core: {index} Start: {start} Length: {length}");
            var sourceSlice = source.Slice(start, length);
            var destinationSlice = destination.Slice(start, length);
            actions[index] = () => Apply<T, TResult, TOperator>(sourceSlice.Span, destinationSlice.Span);
        }
        Console.WriteLine("Parallel processing started.");
        Parallel.Invoke(actions);
    }

    public static void Apply<T, TOperator>(ReadOnlySpan<T> source, Span<T> destination)
        where TOperator : IUnaryOperator<T, T>
        => Apply<T, T, TOperator>(source, destination);

    public static void Apply<T, TResult, TOperator>(ReadOnlySpan<T> source, Span<TResult> destination)
        where TOperator : IUnaryOperator<T, TResult>
    {
        Console.WriteLine($"Processing chunk! Source: {source.Length} Destination: {destination.Length}");
        // SIMD processing to be added here
        for (var index = 0; index < source.Length && index < destination.Length; index++)
            destination[index] = TOperator.Invoke(source[index]);
    }
}

interface IUnaryOperator<T, TResult>
{
    static abstract TResult Invoke(T x);
}

readonly struct DoubleOperator<T>
    : IUnaryOperator<T, T>
    where T : INumberBase<T>, IMultiplyOperators<T, T, T>
{
    public static T Invoke(T x) => T.CreateChecked(2) * x;
}
```
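The array overloads above are needed because `T[]` converts implicitly to both `ReadOnlyMemory<T>` and `ReadOnlySpan<T>`, so a call passing a plain array could match either overload and the compiler reports an ambiguity (CS0121); a dedicated `T[]` overload is an exact match and wins resolution. A tiny standalone sketch of the two conversions behind the ambiguity:

```csharp
using System;

var array = new[] { 1, 2, 3 };

// T[] converts implicitly to BOTH of these, which is exactly why a call
// like Apply(array, array) is ambiguous when only the Memory and Span
// overloads exist:
ReadOnlyMemory<int> asMemory = array;
ReadOnlySpan<int> asSpan = array;

Console.WriteLine($"{asMemory.Length} {asSpan.Length}"); // 3 3
// A dedicated Apply(T[], T[]) overload is an exact match, so overload
// resolution picks it and the ambiguity disappears.
```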
I ran tests on branch #29, but the results aren't too encouraging. The multicore performance is slower, regardless of whether SIMD is employed or not. I need to explore the hardware constraints to understand this better.
```
BenchmarkDotNet v0.13.12, Windows 10 (10.0.19045.4353/22H2/2022Update)
Intel Core i7-7567U CPU 3.50GHz (Kaby Lake), 1 CPU, 4 logical and 2 physical cores
.NET SDK 9.0.100-preview.1.24101.2
  [Host]    : .NET 8.0.2 (8.0.224.6711), X64 RyuJIT AVX2
  Scalar    : .NET 8.0.2 (8.0.224.6711), X64 RyuJIT
  Vector128 : .NET 8.0.2 (8.0.224.6711), X64 RyuJIT AVX
  Vector256 : .NET 8.0.2 (8.0.224.6711), X64 RyuJIT AVX2
```
Method | Job | Categories | Count | Mean | StdDev | Median | Ratio |
---|---|---|---|---|---|---|---|
Baseline_Double | Scalar | Double | 1000 | 1,078.60 ns | 40.114 ns | 1,060.22 ns | baseline |
System_Double | Scalar | Double | 1000 | 406.24 ns | 13.758 ns | 399.47 ns | 2.67x faster |
NetFabric_Double | Scalar | Double | 1000 | 2,180.41 ns | 166.326 ns | 2,150.15 ns | 2.05x slower |
Baseline_Double | Vector128 | Double | 1000 | 1,042.76 ns | 20.052 ns | 1,035.44 ns | 1.04x faster |
System_Double | Vector128 | Double | 1000 | 205.04 ns | 4.889 ns | 203.97 ns | 5.28x faster |
NetFabric_Double | Vector128 | Double | 1000 | 2,270.89 ns | 125.795 ns | 2,307.96 ns | 2.07x slower |
Baseline_Double | Vector256 | Double | 1000 | 1,444.38 ns | 116.838 ns | 1,456.48 ns | 1.31x slower |
System_Double | Vector256 | Double | 1000 | 152.14 ns | 11.168 ns | 149.91 ns | 7.20x faster |
NetFabric_Double | Vector256 | Double | 1000 | 2,201.52 ns | 80.728 ns | 2,205.31 ns | 2.04x slower |
Baseline_Float | Scalar | Float | 1000 | 1,209.83 ns | 61.967 ns | 1,197.02 ns | baseline |
System_Float | Scalar | Float | 1000 | 480.37 ns | 33.768 ns | 472.45 ns | 2.54x faster |
NetFabric_Float | Scalar | Float | 1000 | 2,359.35 ns | 93.242 ns | 2,387.54 ns | 1.96x slower |
Baseline_Float | Vector128 | Float | 1000 | 770.33 ns | 53.606 ns | 750.29 ns | 1.57x faster |
System_Float | Vector128 | Float | 1000 | 126.49 ns | 9.335 ns | 125.69 ns | 9.58x faster |
NetFabric_Float | Vector128 | Float | 1000 | 2,152.34 ns | 89.694 ns | 2,153.25 ns | 1.79x slower |
Baseline_Float | Vector256 | Float | 1000 | 762.05 ns | 79.493 ns | 753.45 ns | 1.56x faster |
System_Float | Vector256 | Float | 1000 | 67.04 ns | 1.134 ns | 66.90 ns | 18.42x faster |
NetFabric_Float | Vector256 | Float | 1000 | 1,999.26 ns | 90.642 ns | 2,017.39 ns | 1.66x slower |
Baseline_Half | Scalar | Half | 1000 | 12,504.44 ns | 286.312 ns | 12,399.19 ns | baseline |
System_Half | Scalar | Half | 1000 | 12,231.32 ns | 120.729 ns | 12,238.40 ns | 1.02x faster |
NetFabric_Half | Scalar | Half | 1000 | 9,433.74 ns | 867.650 ns | 9,546.42 ns | 1.35x faster |
Baseline_Half | Vector128 | Half | 1000 | 9,697.71 ns | 240.589 ns | 9,676.38 ns | 1.29x faster |
System_Half | Vector128 | Half | 1000 | 10,333.35 ns | 852.316 ns | 9,931.87 ns | 1.18x faster |
NetFabric_Half | Vector128 | Half | 1000 | 8,915.24 ns | 799.399 ns | 8,905.60 ns | 1.51x faster |
Baseline_Half | Vector256 | Half | 1000 | 10,267.79 ns | 924.079 ns | 9,858.21 ns | 1.26x faster |
System_Half | Vector256 | Half | 1000 | 9,777.72 ns | 98.069 ns | 9,765.89 ns | 1.28x faster |
NetFabric_Half | Vector256 | Half | 1000 | 9,393.03 ns | 475.270 ns | 9,403.79 ns | 1.36x faster |
Baseline_Int | Scalar | Int | 1000 | 1,297.64 ns | 12.022 ns | 1,299.23 ns | baseline |
System_Int | Scalar | Int | 1000 | 407.63 ns | 4.247 ns | 409.42 ns | 3.18x faster |
NetFabric_Int | Scalar | Int | 1000 | 2,341.00 ns | 112.485 ns | 2,360.99 ns | 1.69x slower |
Baseline_Int | Vector128 | Int | 1000 | 1,353.19 ns | 75.724 ns | 1,316.32 ns | 1.05x slower |
System_Int | Vector128 | Int | 1000 | 115.52 ns | 6.332 ns | 114.52 ns | 11.38x faster |
NetFabric_Int | Vector128 | Int | 1000 | 2,108.18 ns | 110.913 ns | 2,122.89 ns | 1.54x slower |
Baseline_Int | Vector256 | Int | 1000 | 1,307.51 ns | 21.841 ns | 1,305.11 ns | 1.01x slower |
System_Int | Vector256 | Int | 1000 | 64.33 ns | 1.039 ns | 64.19 ns | 20.18x faster |
NetFabric_Int | Vector256 | Int | 1000 | 1,993.01 ns | 90.504 ns | 2,016.42 ns | 1.55x slower |
Baseline_Long | Scalar | Long | 1000 | 1,045.51 ns | 18.504 ns | 1,044.03 ns | baseline |
System_Long | Scalar | Long | 1000 | 406.87 ns | 7.117 ns | 405.92 ns | 2.57x faster |
NetFabric_Long | Scalar | Long | 1000 | 2,256.12 ns | 163.947 ns | 2,250.57 ns | 2.18x slower |
Baseline_Long | Vector128 | Long | 1000 | 1,071.94 ns | 48.088 ns | 1,050.91 ns | 1.04x slower |
System_Long | Vector128 | Long | 1000 | 207.46 ns | 4.846 ns | 205.69 ns | 5.03x faster |
NetFabric_Long | Vector128 | Long | 1000 | 2,197.30 ns | 162.174 ns | 2,164.07 ns | 2.15x slower |
Baseline_Long | Vector256 | Long | 1000 | 1,047.96 ns | 16.598 ns | 1,042.90 ns | 1.00x slower |
System_Long | Vector256 | Long | 1000 | 123.71 ns | 0.750 ns | 123.83 ns | 8.46x faster |
NetFabric_Long | Vector256 | Long | 1000 | 2,191.66 ns | 103.227 ns | 2,201.34 ns | 2.03x slower |
Baseline_Short | Scalar | Short | 1000 | 1,050.32 ns | 13.160 ns | 1,051.75 ns | baseline |
System_Short | Scalar | Short | 1000 | 413.54 ns | 14.802 ns | 409.65 ns | 2.52x faster |
NetFabric_Short | Scalar | Short | 1000 | 2,185.30 ns | 169.597 ns | 2,129.97 ns | 2.16x slower |
Baseline_Short | Vector128 | Short | 1000 | 1,042.56 ns | 10.547 ns | 1,041.07 ns | 1.01x faster |
System_Short | Vector128 | Short | 1000 | 57.45 ns | 2.324 ns | 56.74 ns | 18.53x faster |
NetFabric_Short | Vector128 | Short | 1000 | 2,001.51 ns | 93.791 ns | 2,016.08 ns | 1.89x slower |
Baseline_Short | Vector256 | Short | 1000 | 1,125.94 ns | 93.649 ns | 1,092.19 ns | 1.05x slower |
System_Short | Vector256 | Short | 1000 | 39.64 ns | 3.571 ns | 38.01 ns | 26.02x faster |
NetFabric_Short | Vector256 | Short | 1000 | 1,980.78 ns | 87.917 ns | 2,002.02 ns | 1.85x slower |
Maybe we're running into a memory bandwidth limitation?
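One plausible reading of the numbers, though: at Count = 1000 the entire working set fits comfortably in L1 cache, and the single-threaded SIMD loop finishes in well under a microsecond, so task scheduling overhead (typically microseconds) can dwarf the work itself; bandwidth limits would only bite at much larger sizes. A back-of-envelope sketch (the 32 KB L1 size is an assumption, typical for this class of CPU):

```csharp
using System;

const int count = 1_000;                        // elements per benchmark iteration
const int bytesPerElement = 2 * sizeof(double); // one read + one write per element
const int assumedL1Bytes = 32 * 1024;           // assumed 32 KB L1 data cache

var workingSet = count * bytesPerElement;       // 16,000 bytes total traffic
Console.WriteLine($"Working set: {workingSet} B " +
    (workingSet <= assumedL1Bytes ? "(fits in L1)" : "(spills past L1)"));

// With ~150 ns of actual work (the Vector256 System_Double result),
// even a few microseconds spent scheduling tasks makes parallelism a net loss.
```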
@Darelbi I kept on researching and wrote an article explaining the current steps: https://aalmada.github.io/posts/Unleashing-parallelism/ Feedback and ideas are welcome!
Thanks for it, very interesting. By the way, what are the specs of your system?
I've been testing it on multiple systems. The benchmarks in the article are from the AMD one.
Hi! Thank you for this amazing library. However, it's not clear from the documentation whether it supports matrix/tensor multiplication.
Does it also employ thread parallelism (`Parallel.For`, in addition to SIMD instructions)?
If it supports tensor/matrix multiplication, it would be great for machine learning. For example, a forward pass in a neural network is just a(Wx+b), where W is a matrix of weights, x the input vector, b the bias, and a the activation function. If it's already supported, how do I use the tensor/matrix multiplication? Thanks!