SciSharp / NumSharp

High Performance Computation for N-D Tensors in .NET, similar API to NumPy.
https://github.com/SciSharp
Apache License 2.0
1.34k stars 187 forks source link

Extremely poor performance on sum reduce #509

Open lucdem opened 4 months ago

lucdem commented 4 months ago

Creating a large 3 dimensional array and calling sum(axis: 0) takes an enormous amount of time on my machine (11s), far more than Numpy (around 75ms) or even just a simple C# naive implementation (around 200ms).

C# Code:

using NumSharp;
using System.Diagnostics;
using System.Linq;

namespace NumSharpTest;

internal class Program
{
    static void Main(string[] args)
    {
        Console.WriteLine("NumSharp");
        NumSharp();
        Console.WriteLine("#######################");
        Console.WriteLine("Naive");
        Naive();
    }

    static void NumSharp()
    {
        var randArr = np.random.uniform(0, 1, [500, 500, 500]).astype(np.float32);

        var stopwatch = new Stopwatch();
        stopwatch.Start();
        var linesSum = randArr.sum(axis: 0);
        stopwatch.Stop();
        Console.WriteLine($"Elapsed: {stopwatch.ElapsedMilliseconds} ms");
    }

    static void Naive()
    {
        var random = new Random();
        var shape = new int[] { 500, 500, 500 };
        var randArr = new float[shape[0] * shape[1] * shape[2]];
        for (int i = 0; i < randArr.Length; i++) { randArr[i] = (float)random.NextDouble(); }

        var stopwatch = new Stopwatch();
        stopwatch.Start();
        var sum = new float[shape[1] * shape[2]];
        for (int i = 0; i < shape[0]; i++)
        {
            for (int j = 0; j < shape[1]; j++)
            {
                for (int k = 0; k < shape[2]; k++)
                {
                    sum[shape[2] * j + k] += randArr[shape[1] * shape[2] * i + shape[2] * j + k];
                }
            }
        }
        stopwatch.Stop();
        Console.WriteLine($"Elapsed: {stopwatch.ElapsedMilliseconds} ms");
    }
}

Python Code:

import numpy as np
from time import time

randArr = np.random.normal(1, 0.5, (500, 500, 500))

start = time()
sum = randArr.sum(axis=0)
end = time()

print(f'Elapsed: {(end - start) * 1000} ms')
vkribo commented 3 months ago

I profiled it and the problem seems to be in the method GetOffset in the Shape class. In the method there is a list called coords being created milions of times. I tried reimplementing it using a stackalloced array instead and it went from 6287 ms to 2315 ms. Maybe someone more familliar with the code could look into it.