draconware-dev / SpanExtensions.Net

SpanExtensions.Net aims to make ReadOnlySpan<T> and Span<T> more accessible and overall better integrated into the .Net Ecosystem, by providing alternatives for many missing Extension Methods for Span<T>, ranging from string.Split() over Enumerable.Skip() and Enumerable.Take() to an improved ReadOnlySpan<T>.IndexOf().
MIT License
17 stars 5 forks source link

Speed improvement #15

Closed SadToBeNep closed 4 months ago

SadToBeNep commented 5 months ago

What?

I did some benchmarks on the already implemented Sum() function and tried to improve it's speed a bit (results bellow)

Why?

I did the benchmarks as part of my learning about C# code optimizations, and from that I know that the for loops can be sometimes slow

How we could improve it?

Well, there are obviously a lot of ways to do that, I think as an early development suggestion, it would be good if we used foreach instead of basic for loops even tho this isn't the biggest performance improvement, unrolling the loop is much faster, but again, for the sake of maintainability simply changing for to foreach is enough

Results of tests


BenchmarkDotNet v0.13.12, Arch Linux
Intel Core i5-8400 CPU 2.80GHz (Coffee Lake), 1 CPU, 6 logical and 6 physical cores
.NET SDK 8.0.101
  [Host]     : .NET 8.0.1 (8.0.123.58001), X64 RyuJIT AVX2
  DefaultJob : .NET 8.0.1 (8.0.123.58001), X64 RyuJIT AVX2
Method lengthOfList Mean Error StdDev Ratio RatioSD
TestUnroll 100 23.32 ns 0.115 ns 0.108 ns 0.51 0.00
TestIdomatic 100 42.61 ns 0.386 ns 0.342 ns 0.93 0.01
TestSum 100 45.62 ns 0.274 ns 0.243 ns 1.00 0.00
TestUnroll 1000 265.46 ns 1.074 ns 0.897 ns 0.69 0.01
TestIdomatic 1000 353.41 ns 2.807 ns 2.488 ns 0.92 0.01
TestSum 1000 383.83 ns 3.140 ns 2.783 ns 1.00 0.00
TestUnroll 100000 27,049.76 ns 213.029 ns 188.844 ns 0.71 0.01
TestIdomatic 100000 34,221.19 ns 370.153 ns 328.131 ns 0.90 0.01
TestSum 100000 38,184.94 ns 455.654 ns 426.219 ns 1.00 0.00
TestUnroll 1000000 278,427.73 ns 5,564.379 ns 5,204.924 ns 0.71 0.02
TestIdomatic 1000000 350,900.60 ns 6,282.049 ns 5,876.233 ns 0.89 0.02
TestSum 1000000 393,756.77 ns 7,274.756 ns 6,804.811 ns 1.00 0.00

Some explanation

Even tho the compiler does some optimization on the loop, we can achieve much better performance by implementing the loop in a way, where it's adding together source[x]+source[x+1]+source[x+3]... Using this method as we can see the improvements with really small lists of numbers(100) the loop is almost twice as fast, and after that it keeps a 20-30% faster numbers As for my suggestion of changing for to foreach the improvement are around 10% stable regardless of the size, and again maintaining the simpler solution is easier

Can try to implement it by myself if needed

dragon7307 commented 5 months ago

I'll do some benchmarks on my own, this is a feature I am working on currently, and get back to you...

dragon7307 commented 5 months ago

Can you share the code you conducted the benchmarks on? Have you implemented bound-checks?? Since I am only getting comparable performance without bound-checking, what of course is dangerous.

dragon7307 commented 4 months ago

I have been unable to achieve said performance improvements.
However ,I will improve the performance by utilising vectorizationd (SIMD!).