Performance impact when using {ReadOnly}Span<T>

aalmada commented 3 years ago

Description

I'm no low-level expert but I care much about performance. One trick I've learned with the experts is that when iterating an array, using foreach guarantees that local variables are used so that bounds checking can be removed.

I will not always want to iterate the whole array. In this case, I have to use a for loop iterating on a defined range of indices. I understand that in this case bounds checking may not be removed.

Now that Span<T> and ReadOnlySpan<T> are available, the foreach can be used instead. In this case, the bounds of the span are guaranteed to be valid and I was expecting the bounds checking to be removed, but my benchmarks show that it doesn't.

I don't know if this is by design...

The source for the benchmarks shown here is available at https://github.com/aalmada/ArrayIteration

Configuration


BenchmarkDotNet=v0.13.0, OS=Windows 10.0.19043.1081 (21H1/May2021Update)
Intel Core i7-7567U CPU 3.50GHz (Kaby Lake), 1 CPU, 4 logical and 2 physical cores
.NET SDK=6.0.100-preview.5.21302.13
  [Host]     : .NET 6.0.0 (6.0.21.30105), X64 RyuJIT
  DefaultJob : .NET 6.0.0 (6.0.21.30105), X64 RyuJIT

Regression?

It's not a regression.

Data

Read Benchmarks

To test the array iteration I wrote the following methods:


    static class Sum
    {

        public static int ForEach(int[] array)
        {
            var sum = 0;
            foreach (var item in array)
                sum += item;
            return sum;
        }

        public static int For(int[] source, int offset, int count)
        {
            var sum = 0;
            for (var index = offset; index < offset + count; index++)
                sum += source[index];
            return sum;        
        }  

        public static int ForEach(ReadOnlySpan<int> source)
        {
            var sum = 0;
            foreach (var item in source)
                sum += item;
            return sum;
        }

    }

I benchmarked using the following code:

        [Benchmark(Baseline = true)]
        public int Full_ForEach()
            => Sum.ForEach(_array);

        [Benchmark]
        public int Slice_For()
            => Sum.For(_array, 0, _array.Length);

        [Benchmark]
        public int Slice_Foreach()
            => Sum.ForEach(_array.AsSpan());

Obtained the following results:

Method	Count	Mean	Error	StdDev	Ratio	RatioSD
Full_ForEach	1000000	426.0 μs	3.26 μs	2.72 μs	baseline
Slice_For	1000000	576.2 μs	2.50 μs	2.34 μs	1.35x slower	0.01x
Slice_Foreach	1000000	559.7 μs	4.64 μs	3.88 μs	1.31x slower	0.01x

Write Benchmarks

I wrote one more benchmark to test also the array assignment. I used the following methods:


    static class Copy
    {
        public static void Array_For<T>(T[] source, int sourceOffset, T[] destination, int destinationOffset, int count)
        {
            for (var index = 0; index < count; index++)
                destination[index + destinationOffset] = source[index + sourceOffset];
        }

        public static void Span_For<T>(ReadOnlySpan<T> source, Span<T> destination)
        {
            for (var index = 0; index < source.Length; index++)
                destination[index] = source[index];
        }

    }

I benchmarked using the following code:


        [Benchmark(Baseline = true)]
        public void Array_Copy()
            => Array.Copy(_source, 0, _destination, 0, _source.Length);

        [Benchmark]
        public void Span_CopyTo()
            => _source.AsSpan().CopyTo(_destination);

        [Benchmark]
        public void Array_For()
            => Copy.Array_For(_source, 0, _destination, 0, _source.Length);

        [Benchmark]
        public void Span_For()
            => Copy.Span_For<int>(_source, _destination);

Obtained the following results:

Method	Count	Mean	Error	StdDev	Ratio	RatioSD
Array_Copy	1000000	210.9 μs	1.73 μs	1.53 μs	baseline
Span_CopyTo	1000000	211.6 μs	1.66 μs	1.55 μs	1.00x slower	0.01x
Array_For	1000000	486.9 μs	4.47 μs	4.19 μs	2.31x slower	0.02x
Span_For	1000000	713.2 μs	7.10 μs	5.93 μs	3.38x slower	0.04x

Analysis

The read benchmarks show that iterating on a slice of an array, using for on the array or foreach on the ReadOnlySpan<T>, have similar performance but 33% slower than using foreach on the full array. I assume it's the bounds checking performance hit.

For the write benchmarks, I would not expect any of the for loops to perform as well as Array.Copy() or ReadOnlySpan<T>.CopyTo(), but I wasn't expecting the for on the spans to be slower than on the arrays.

Again, I may be wrong but I was expecting bounds checking to be removed when using {ReadOnly}Span<T>. Is this possible?

I'm trying to develop a high-performance library and this means that I will need to maintain code for both arrays and {ReadOnly}Span<T>, which I was trying to avoid.

Thank you all! 👍

category:performance theme:basic-cq

dotnet-issue-labeler[bot] commented 3 years ago

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

SingleAccretion commented 3 years ago

The benchmark results are puzzling, as I see no codegen differences for the Read variants, and for the Write variant, the span version actually has one less bounds check. Could you share the disassembly (via [DisassemblyDiagnoser])?

huoyaoyuan commented 3 years ago

aalmada commented 3 years ago

@SingleAccretion I hope this helps

.NET 6.0.0 (6.0.21.30105), X64 RyuJIT

; ArrayIteration.ReadSliceBenchmarks.Full_ForEach()
       mov       rcx,[rcx+8]
       jmp       near ptr ArrayIteration.Sum.ForEach(Int32[])
; Total bytes of code 9

; ArrayIteration.Sum.ForEach(Int32[])
       xor       eax,eax
       xor       edx,edx
       mov       r8d,[rcx+8]
       test      r8d,r8d
       jle       short M01_L01
M01_L00:
       movsxd    r9,edx
       mov       r9d,[rcx+r9*4+10]
       add       eax,r9d
       inc       edx
       cmp       r8d,edx
       jg        short M01_L00
M01_L01:
       ret
; Total bytes of code 32

.NET 6.0.0 (6.0.21.30105), X64 RyuJIT

; ArrayIteration.ReadSliceBenchmarks.Slice_Foreach()
       push      rsi
       sub       rsp,30
       xor       eax,eax
       mov       [rsp+20],rax
       mov       rcx,[rcx+8]
       test      rcx,rcx
       je        short M00_L01
       lea       rax,[rcx+10]
       mov       esi,[rcx+8]
M00_L00:
       mov       [rsp+20],rax
       mov       [rsp+28],esi
       lea       rcx,[rsp+20]
       call      ArrayIteration.Sum.ForEach(System.ReadOnlySpan`1<Int32>)
       nop
       add       rsp,30
       pop       rsi
       ret
M00_L01:
       xor       eax,eax
       xor       esi,esi
       jmp       short M00_L00
; Total bytes of code 60

; ArrayIteration.Sum.ForEach(System.ReadOnlySpan`1<Int32>)
       xor       eax,eax
       mov       rdx,[rcx]
       mov       ecx,[rcx+8]
       xor       r8d,r8d
       test      ecx,ecx
       jle       short M01_L01
       nop
M01_L00:
       movsxd    r9,r8d
       mov       r9d,[rdx+r9*4]
       add       eax,r9d
       inc       r8d
       cmp       r8d,ecx
       jl        short M01_L00
M01_L01:
       ret
; Total bytes of code 35

rootflood commented 3 years ago

BenchmarkDotNet=v0.13.0, OS=Windows 10.0.19043.985 (21H1/May2021Update) .NET SDK=5.0.100 [Host] : .NET 5.0.0 (5.0.20.51904), X64 RyuJIT DefaultJob : .NET 5.0.0 (5.0.20.51904), X64 RyuJIT

Method	Count	Mean	Error	StdDev	Ratio	RatioSD	Code Size	Gen 0	Gen 1	Gen 2	Allocated
Array_ForEach	1000000	742.0 ms	12.16 ms	10.78 ms	baseline		69 B	-	-	-	560 B
Array_ForEach_Inside	1000000	1,047.6 ms	5.31 ms	4.44 ms	1.41x slower	0.02x	52 B	-	-	-	1,240 B
Array_Ptr	1000000	819.5 ms	8.88 ms	8.31 ms	1.11x slower	0.02x	134 B	-	-	-	1,336 B
Span_Foreach	1000000	1,070.1 ms	18.13 ms	15.14 ms	1.44x slower	0.03x	116 B	-	-	-	-
Span_ForEach_Inside	1000000	724.8 ms	1.74 ms	1.46 ms	1.02x faster	0.01x	65 B	-	-	-	1,240 B

        [Benchmark(Baseline = true)]
        public int Array_ForEach()
        {
            static int AC(int[] array)
            {
                var sum = 0;
                foreach (var item in array)
                    sum += item;
                return sum;
            }
            var Res = 0;
            for (int i = 0; i < 1000; i++)
                Res = AC(_array);
            return Res;
        }

        [Benchmark(Baseline = false)]
        public int Array_ForEach_Inside()
        {
            var array = _array;
            var Res = 0;
            for (int i = 0; i < 1000; i++)
            {
                foreach (var item in array)
                    Res += item;
            }
            return Res;
        }

        [Benchmark]
        public int Array_Ptr()
        {
            static unsafe int AC(int[] source)
            {
                var sum = 0;
                var End = source.Length;
                fixed (int* p = source)
                {
                    var Ptr = p;
                    var EndPtr = p + End;
                    for (; Ptr < EndPtr; Ptr++)
                        sum += *Ptr;
                }
                return sum;
            }

            var Res = 0;
            for (int i = 0; i < 1000; i++)
                Res = AC(_array);
            return Res;
        }

        [Benchmark]
        public int Span_Foreach()
        {
            static int AC(Span<int> source)
            {
                var sum = 0;
                foreach (var item in source)
                    sum += item;
                return sum;
            }
            var Span = _array.AsSpan();
            var Res = 0;
            for (int i = 0; i < 1000; i++)
                Res = AC(Span);
            return Res;
        }

        [Benchmark]
        public int Span_ForEach_Inside()
        {
            var Res = 0;
            var source = _array.AsSpan();
            for (int i = 0; i < 1000; i++)
            {
                Res = 0;
                foreach (var item in source)
                    Res += item;
            }
            return Res;
        }

I think the delay come from of giving the span to the method, i think this is because span is value type and need recreate again in transfer.

huoyaoyuan commented 3 years ago

I think the delay come from of giving the span to the method, i think this is because span is value type and need recreate again in transfer.

Span is very lightweight and shouldn't cause such difference.

also the "Array_ForEach_Inside" method is slower because of extra declaration array.

No. You are doing manual inlining. Difference caused by inlining should be discussed as another topic.

rootflood commented 3 years ago

also you can add "MethodImplOptions.AggressiveInlining" to make that faster

        public int Span_Foreach()
        {
            [MethodImpl(MethodImplOptions.AggressiveInlining)]
            static int AC(Span<int> source)
            {
                var sum = 0;
                foreach (var item in source)
                    sum += item;
                return sum;
            }
            var Span = _array.AsSpan();
            var Res = 0;
            for (int i = 0; i < 1000; i++)
                Res = AC(Span);
            return Res;
        }

Method	Count	Mean	Error	StdDev	Ratio	RatioSD	Code Size	Gen 0	Gen 1	Gen 2	Allocated
Array_ForEach	1000000	730.6 ms	1.14 ms	1.01 ms	baseline		69 B	-	-	-	1,240 B
Array_ForEach_Inside	1000000	1,073.2 ms	19.25 ms	17.06 ms	1.47x slower	0.02x	52 B	-	-	-	-
Array_Ptr	1000000	819.6 ms	7.30 ms	6.47 ms	1.12x slower	0.01x	134 B	-	-	-	1,336 B
Span_Foreach	1000000	725.7 ms	1.80 ms	1.60 ms	1.01x faster	0.00x	65 B	-	-	-	1,336 B
Span_ForEach_Inside	1000000	730.1 ms	6.68 ms	5.58 ms	1.00x faster	0.01x	65 B	-	-	-	1,336 B

huoyaoyuan commented 3 years ago

also you can add "MethodImplOptions.AggressiveInlining" to make that faster

We are NOT discussing inlining. We should follow the best practice of BDN that prevents field address being optimized to constant.

rootflood commented 3 years ago

ok friend.

dotnet / runtime