SciSharp / NumSharp

High Performance Computation for N-D Tensors in .NET, similar API to NumPy.
https://github.com/SciSharp
Apache License 2.0

The new storage UnmanagedByteStorage #301

Closed Oceania2018 closed 5 years ago

Oceania2018 commented 5 years ago

Performance is an important part of NumSharp, and we've been working hard to improve it, moving from ArrayStorage to TypedArrayStorage. We are not satisfied with this and are always looking for ways to make memory allocation more efficient, so we propose using jemalloc.NET to help NumSharp manage and allocate memory. The new engine will be named UnmanagedByteStorage.

Using unmanaged memory for storage is advantageous for interoperating with other tensor libraries, ideally with zero copies.

Comments are welcome.

henon commented 5 years ago

What about the other plan to use managed memory and SIMD when it comes out with .NET Core 3?

Nucs commented 5 years ago

@henon, after looking into the System.Runtime.Intrinsics source code, it does support unmanaged memory. These are the supported types:

        internal static bool IsSupported
        {
            [MethodImpl(MethodImplOptions.AggressiveInlining)]
            get
            {
                return (typeof(T) == typeof(byte)) ||
                       (typeof(T) == typeof(sbyte)) ||
                       (typeof(T) == typeof(short)) ||
                       (typeof(T) == typeof(ushort)) ||
                       (typeof(T) == typeof(int)) ||
                       (typeof(T) == typeof(uint)) ||
                       (typeof(T) == typeof(long)) ||
                       (typeof(T) == typeof(ulong)) ||
                       (typeof(T) == typeof(float)) ||
                       (typeof(T) == typeof(double));
            }
        }

We are missing decimal, but I guess it would be missing from any use case of unmanaged memory. We are also missing char support, so we'll have to handle strings as bytes.

Intrinsics require a CPU that supports the AVX/AVX2 instruction sets. I took a look at Intel's i3-2100, which was released in 2011, and it does support AVX1, which leads me to believe we can assume every CPU supports it nowadays. A fallback can easily be handled with a runtime capability check.
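A minimal sketch of that fallback idea, using `Vector.IsHardwareAccelerated` from System.Numerics to pick the SIMD path at runtime and falling back to a scalar loop otherwise (`SumInts` is a hypothetical helper, not NumSharp API):

```csharp
using System;
using System.Numerics;

// Hypothetical helper: sum an int[] with SIMD when the hardware
// supports it, otherwise with a plain scalar loop.
static int SumInts(int[] data)
{
    int i = 0, total = 0;
    if (Vector.IsHardwareAccelerated)          // runtime capability check
    {
        var acc = Vector<int>.Zero;
        int width = Vector<int>.Count;
        for (; i <= data.Length - width; i += width)
            acc += new Vector<int>(data, i);   // vector-width partial sums
        for (int j = 0; j < width; j++)
            total += acc[j];                   // horizontal reduce
    }
    for (; i < data.Length; i++)               // scalar tail / full fallback
        total += data[i];
    return total;
}

Console.WriteLine(SumInts(new[] { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 })); // 55
```

The same pattern scales down to `Sse2.IsSupported`-style checks once the per-instruction-set intrinsics land.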

jemalloc.NET has prerequisites of the Microsoft Visual C++ Redistributable for Visual Studio 2017 and .NET Core 2, but adding a fallback for memory allocation should be fairly easy.

We need to test whether .NET Core 3 intrinsics can be used alongside jemalloc allocations. If so, I'm up for basing the default backend math ops (which haven't been touched yet) on System.Numerics.Vector&lt;T&gt;, Span&lt;T&gt;, and System.Runtime.Intrinsics.Vector*&lt;T&gt;.
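The combination should work in principle, since `Span<T>` can wrap any raw pointer and `Vector<T>` operates on spans. A sketch under that assumption, with `Marshal.AllocHGlobal` standing in for a jemalloc buffer (compile with unsafe code enabled; `DoubleUnmanaged` is a hypothetical name):

```csharp
using System;
using System.Numerics;
using System.Runtime.InteropServices;

// Sketch: fill an unmanaged buffer, double it with Vector<T> over a
// Span<int> view of the raw pointer, and copy the result out.
static unsafe int[] DoubleUnmanaged(int n)
{
    int* buf = (int*)Marshal.AllocHGlobal(n * sizeof(int));
    try
    {
        var span = new Span<int>(buf, n);          // zero-copy view
        for (int i = 0; i < n; i++) span[i] = i;

        int width = Vector<int>.Count;
        int i2 = 0;
        for (; i2 <= n - width; i2 += width)       // SIMD: width elems/step
        {
            var v = new Vector<int>(span.Slice(i2, width));
            (v + v).CopyTo(span.Slice(i2, width));
        }
        for (; i2 < n; i2++) span[i2] *= 2;        // scalar tail

        return span.ToArray();                     // copy out for inspection
    }
    finally
    {
        Marshal.FreeHGlobal((IntPtr)buf);
    }
}

Console.WriteLine(DoubleUnmanaged(16)[5]); // prints 10
```

Whether jemalloc.NET's allocator plays nicely with alignment expectations of the intrinsics is exactly what needs testing.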

hez2010 commented 5 years ago

I tried to implement an UnmanagedArray for C# and benchmarked it against C++:

Result: (benchmark result screenshot)

I found that the performance of .NET Core 3.0 was on par with C++, which impressed me a lot.

My UnmanagedArray code:

using System;
using System.Collections;
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Runtime.InteropServices;

namespace UnmanagedByteStorage
{
    [StructLayout(LayoutKind.Sequential)]
    public unsafe struct UnmanagedArray<T> : IDisposable, IEnumerable<T>, IEquatable<UnmanagedArray<T>>, ICollection<T> where T : unmanaged
    {
        internal int m_Length;
        internal T* m_Buffer;

        public int Count => m_Length;

        public bool IsReadOnly => false;

        [MethodImpl(MethodImplOptions.AggressiveOptimization)]
        public UnmanagedArray(int length)
        {
            m_Length = 0;
            m_Buffer = null;
            Allocate(length);
        }

        [MethodImpl(MethodImplOptions.AggressiveOptimization)]
        public void Allocate(int length, bool copyOldValue = false)
        {
            // widen before multiplying so large arrays don't overflow int
            var ptr = (T*)Marshal.AllocHGlobal((IntPtr)((long)length * sizeof(T)));
            if (copyOldValue)
            {
                var x = Math.Min(m_Length, length);
                for (int i = 0; i < x; i++)
                {
                    *(ptr + i) = *(m_Buffer + i);
                }
            }
            Dispose();
            m_Length = length;
            m_Buffer = ptr;
        }

        public T this[int index]
        {
            get => GetValue(index);
            set => SetValue(index, ref value);
        }

        [MethodImpl(MethodImplOptions.AggressiveOptimization)]
        public T GetValue(int index)
        {
            return *(m_Buffer + index);
        }

        [MethodImpl(MethodImplOptions.AggressiveOptimization)]
        public void SetValue(int index, ref T value)
        {
            *(m_Buffer + index) = value;
        }

        public void Dispose()
        {
            if (m_Buffer != null)
            {
                Marshal.FreeHGlobal((IntPtr)m_Buffer);
                m_Buffer = null;   // guard against double-free
                m_Length = 0;
            }
        }

        [MethodImpl(MethodImplOptions.AggressiveOptimization)]
        public bool Equals(UnmanagedArray<T> other)
        {
            return other.m_Buffer == this.m_Buffer;
        }

        [MethodImpl(MethodImplOptions.AggressiveOptimization)]
        public IEnumerator<T> GetEnumerator()
        {
            for (var i = 0; i < m_Length; i++) yield return GetValue(i);
        }

        [MethodImpl(MethodImplOptions.AggressiveOptimization)]
        IEnumerator IEnumerable.GetEnumerator()
        {
            return this.GetEnumerator();
        }

        [MethodImpl(MethodImplOptions.AggressiveOptimization)]
        public override bool Equals(object obj)
        {
            if (obj is UnmanagedArray<T> array)
            {
                return array.m_Buffer == this.m_Buffer;
            }
            return false;
        }

        [MethodImpl(MethodImplOptions.AggressiveOptimization)]
        public override int GetHashCode()
        {
            return HashCode.Combine(m_Length, (IntPtr)m_Buffer);
        }

        /// <summary>
        /// NotSupported
        /// </summary>
        /// <param name="item"></param>
        [MethodImpl(MethodImplOptions.AggressiveOptimization)]
        public void Add(T item)
        {
            throw new NotSupportedException();
        }

        /// <summary>
        /// NotSupported
        /// </summary>
        [MethodImpl(MethodImplOptions.AggressiveOptimization)]
        public void Clear()
        {
            throw new NotSupportedException();
        }

        [MethodImpl(MethodImplOptions.AggressiveOptimization)]
        public bool Contains(T item)
        {
            for (var i = 0; i < m_Length; i++)
            {
                if ((*(m_Buffer + i)).Equals(item)) return true;
            }
            return false;
        }

        [MethodImpl(MethodImplOptions.AggressiveOptimization)]
        public void CopyTo(T[] array, int arrayIndex)
        {
            for (var i = 0; i < m_Length; i++)
            {
                array[i + arrayIndex] = *(m_Buffer + i);
            }
        }

        [MethodImpl(MethodImplOptions.AggressiveOptimization)]
        public void CopyTo(UnmanagedArray<T> array, int arrayIndex)
        {
            var length = this.m_Length - arrayIndex;
            Buffer.MemoryCopy(this.m_Buffer + arrayIndex, array.m_Buffer, sizeof(T) * array.m_Length, sizeof(T) * length);
        }

        [MethodImpl(MethodImplOptions.AggressiveOptimization)]
        public void CopyTo(T* array, int arrayIndex, int lengthToCopy)
        {
            Buffer.MemoryCopy(this.m_Buffer + arrayIndex, array, sizeof(T) * lengthToCopy, sizeof(T) * lengthToCopy);
        }

        /// <summary>
        /// NotSupported
        /// </summary>
        /// <param name="item"></param>
        /// <returns></returns>
        [MethodImpl(MethodImplOptions.AggressiveOptimization)]
        public bool Remove(T item)
        {
            throw new NotSupportedException();
        }
    }
}

Test code: C++

#include <cstring>
#include <cstdio>
#include <windows.h>

int main()
{
    DWORD k = ::GetTickCount();
    auto x = new int[104857600];
    for (auto i = 0; i < 104857600; i++)
    {
        x[i] = i;
    }
    auto y = new int[104857600];
    memcpy(y, x, 104857600 * sizeof(int));
    long long sum = 0;
    for (auto i = 0; i < 104857600; i++)
    {
        sum += y[i];
    }
    printf("C++ Array: %dms, Result: %lld", ::GetTickCount() - k, sum);
    getchar();
}

C#

            var stop = new Stopwatch();
            {
                stop.Start();
                var x = new UnmanagedArray<int>(104857600);
                for (var i = 0; i < 104857600; i++)
                {
                    x[i] = i;
                }
                var y = new UnmanagedArray<int>(104857600);
                x.CopyTo(y, 0);
                long accu = 0;
                for (var i = 0; i < 104857600; i++)
                {
                    accu += y[i];
                }
                stop.Stop();
                Console.WriteLine($"Unmanaged Array: {stop.ElapsedMilliseconds}ms, Result: {accu}");
                x.Dispose();
                y.Dispose();
            }
            stop.Reset();
            {
                stop.Start();
                var x = new int[104857600];
                for (var i = 0; i < 104857600; i++)
                {
                    x[i] = i;
                }
                var y = new int[104857600];
                x.CopyTo(y, 0);
                long accu = 0;
                for (var i = 0; i < 104857600; i++)
                {
                    accu += y[i];
                }
                stop.Stop();
                Console.WriteLine($"Managed Array: {stop.ElapsedMilliseconds}ms, Result: {accu}");
            }

You need to run Test() multiple times in order to warm up the .NET Core JIT, so you get the optimized result.
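A minimal self-contained harness illustrating that warm-up pattern: run the workload several times so tiered compilation promotes the hot loops, and treat only the last run as the measurement (`Measure` and the loop counts are hypothetical, not from the benchmark above):

```csharp
using System;
using System.Diagnostics;

// Hypothetical benchmark body: fill and sum an array, report the time.
static long Measure()
{
    var sw = Stopwatch.StartNew();
    var data = new int[1_000_000];
    for (int i = 0; i < data.Length; i++) data[i] = i;
    long sum = 0;
    for (int i = 0; i < data.Length; i++) sum += data[i];
    sw.Stop();
    Console.WriteLine($"{sw.ElapsedMilliseconds}ms, Result: {sum}");
    return sum;
}

for (int run = 0; run < 3; run++)
    Measure();   // warm-up runs: discard these timings
Measure();       // measured run, taken after the JIT has optimized
```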

Also, both the C++ and C# sides can enable SIMD to speed up the calculation, which will lead to even faster results.

Nucs commented 5 years ago

This discussion has been resolved with the new release of NumSharp 0.11.0-alpha2, available on NuGet.

@hez2010 Thank you, Steve, for the template. It serves as a good starting point.