Open aalmada opened 3 years ago
I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.
The benchmark results are puzzling, as I see no codegen differences for the Read
variants, and for the Write
variant, the span version actually has one less bounds check. Could you share the disassembly (via [DisassemblyDiagnoser]
)?
@SingleAccretion I hope this helps
; ArrayIteration.ReadSliceBenchmarks.Full_ForEach()
mov rcx,[rcx+8]
jmp near ptr ArrayIteration.Sum.ForEach(Int32[])
; Total bytes of code 9
; ArrayIteration.Sum.ForEach(Int32[])
xor eax,eax
xor edx,edx
mov r8d,[rcx+8]
test r8d,r8d
jle short M01_L01
M01_L00:
movsxd r9,edx
mov r9d,[rcx+r9*4+10]
add eax,r9d
inc edx
cmp r8d,edx
jg short M01_L00
M01_L01:
ret
; Total bytes of code 32
; ArrayIteration.ReadSliceBenchmarks.Slice_Foreach()
push rsi
sub rsp,30
xor eax,eax
mov [rsp+20],rax
mov rcx,[rcx+8]
test rcx,rcx
je short M00_L01
lea rax,[rcx+10]
mov esi,[rcx+8]
M00_L00:
mov [rsp+20],rax
mov [rsp+28],esi
lea rcx,[rsp+20]
call ArrayIteration.Sum.ForEach(System.ReadOnlySpan`1<Int32>)
nop
add rsp,30
pop rsi
ret
M00_L01:
xor eax,eax
xor esi,esi
jmp short M00_L00
; Total bytes of code 60
; ArrayIteration.Sum.ForEach(System.ReadOnlySpan`1<Int32>)
xor eax,eax
mov rdx,[rcx]
mov ecx,[rcx+8]
xor r8d,r8d
test ecx,ecx
jle short M01_L01
nop
M01_L00:
movsxd r9,r8d
mov r9d,[rdx+r9*4]
add eax,r9d
inc r8d
cmp r8d,ecx
jl short M01_L00
M01_L01:
ret
; Total bytes of code 35
BenchmarkDotNet=v0.13.0, OS=Windows 10.0.19043.985 (21H1/May2021Update) .NET SDK=5.0.100 [Host] : .NET 5.0.0 (5.0.20.51904), X64 RyuJIT DefaultJob : .NET 5.0.0 (5.0.20.51904), X64 RyuJIT
Method | Count | Mean | Error | StdDev | Ratio | RatioSD | Code Size | Gen 0 | Gen 1 | Gen 2 | Allocated |
---|---|---|---|---|---|---|---|---|---|---|---|
Array_ForEach | 1000000 | 742.0 ms | 12.16 ms | 10.78 ms | baseline | 69 B | - | - | - | 560 B | |
Array_ForEach_Inside | 1000000 | 1,047.6 ms | 5.31 ms | 4.44 ms | 1.41x slower | 0.02x | 52 B | - | - | - | 1,240 B |
Array_Ptr | 1000000 | 819.5 ms | 8.88 ms | 8.31 ms | 1.11x slower | 0.02x | 134 B | - | - | - | 1,336 B |
Span_Foreach | 1000000 | 1,070.1 ms | 18.13 ms | 15.14 ms | 1.44x slower | 0.03x | 116 B | - | - | - | - |
Span_ForEach_Inside | 1000000 | 724.8 ms | 1.74 ms | 1.46 ms | 1.02x faster | 0.01x | 65 B | - | - | - | 1,240 B |
[Benchmark(Baseline = true)]
public int Array_ForEach()
{
static int AC(int[] array)
{
var sum = 0;
foreach (var item in array)
sum += item;
return sum;
}
var Res = 0;
for (int i = 0; i < 1000; i++)
Res = AC(_array);
return Res;
}
[Benchmark(Baseline = false)]
public int Array_ForEach_Inside()
{
var array = _array;
var Res = 0;
for (int i = 0; i < 1000; i++)
{
foreach (var item in array)
Res += item;
}
return Res;
}
[Benchmark]
public int Array_Ptr()
{
static unsafe int AC(int[] source)
{
var sum = 0;
var End = source.Length;
fixed (int* p = source)
{
var Ptr = p;
var EndPtr = p + End;
for (; Ptr < EndPtr; Ptr++)
sum += *Ptr;
}
return sum;
}
var Res = 0;
for (int i = 0; i < 1000; i++)
Res = AC(_array);
return Res;
}
[Benchmark]
public int Span_Foreach()
{
static int AC(Span<int> source)
{
var sum = 0;
foreach (var item in source)
sum += item;
return sum;
}
var Span = _array.AsSpan();
var Res = 0;
for (int i = 0; i < 1000; i++)
Res = AC(Span);
return Res;
}
[Benchmark]
public int Span_ForEach_Inside()
{
var Res = 0;
var source = _array.AsSpan();
for (int i = 0; i < 1000; i++)
{
Res = 0;
foreach (var item in source)
Res += item;
}
return Res;
}
I think the delay come from of giving the span to the method, i think this is because span is value type and need recreate again in transfer.
I think the delay come from of giving the span to the method, i think this is because span is value type and need recreate again in transfer.
Span is very lightweight and shouldn't cause such difference.
also the "Array_ForEach_Inside" method is slower because of extra declaration array.
No. You are doing manual inlining. Difference caused by inlining should be discussed as another topic.
also you can add "MethodImplOptions.AggressiveInlining" to make that faster
public int Span_Foreach()
{
[MethodImpl(MethodImplOptions.AggressiveInlining)]
static int AC(Span<int> source)
{
var sum = 0;
foreach (var item in source)
sum += item;
return sum;
}
var Span = _array.AsSpan();
var Res = 0;
for (int i = 0; i < 1000; i++)
Res = AC(Span);
return Res;
}
Method | Count | Mean | Error | StdDev | Ratio | RatioSD | Code Size | Gen 0 | Gen 1 | Gen 2 | Allocated |
---|---|---|---|---|---|---|---|---|---|---|---|
Array_ForEach | 1000000 | 730.6 ms | 1.14 ms | 1.01 ms | baseline | 69 B | - | - | - | 1,240 B | |
Array_ForEach_Inside | 1000000 | 1,073.2 ms | 19.25 ms | 17.06 ms | 1.47x slower | 0.02x | 52 B | - | - | - | - |
Array_Ptr | 1000000 | 819.6 ms | 7.30 ms | 6.47 ms | 1.12x slower | 0.01x | 134 B | - | - | - | 1,336 B |
Span_Foreach | 1000000 | 725.7 ms | 1.80 ms | 1.60 ms | 1.01x faster | 0.00x | 65 B | - | - | - | 1,336 B |
Span_ForEach_Inside | 1000000 | 730.1 ms | 6.68 ms | 5.58 ms | 1.00x faster | 0.01x | 65 B | - | - | - | 1,336 B |
also you can add "MethodImplOptions.AggressiveInlining" to make that faster
We are NOT discussing inlining. We should follow the best practice of BDN that prevents field address being optimized to constant.
ok friend.
Description
I'm no low-level expert but I care much about performance. One trick I've learned with the experts is that when iterating an array, using
foreach
guarantees that local variables are used so that bounds checking can be removed.I will not always want to iterate the whole array. In this case, I have to use a
for
loop iterating on a defined range of indices. I understand that in this case bounds checking may not be removed.Now that
Span<T>
andReadOnlySpan<T>
are available, theforeach
can be used instead. In this case, the bounds of the span are guaranteed to be valid and I was expecting the bounds checking to be removed, but my benchmarks show that it doesn't.I don't know if this is by design...
The source for the benchmarks shown here is available at https://github.com/aalmada/ArrayIteration
Configuration
Regression?
It's not a regression.
Data
Read Benchmarks
To test the array iteration I wrote the following methods:
I benchmarked using the following code:
Obtained the following results:
Write Benchmarks
I wrote one more benchmark to test also the array assignment. I used the following methods:
I benchmarked using the following code:
Obtained the following results:
Analysis
The read benchmarks show that iterating on a slice of an array, using
for
on the array orforeach
on theReadOnlySpan<T>
, have similar performance but 33% slower than usingforeach
on the full array. I assume it's the bounds checking performance hit.For the write benchmarks, I would not expect any of the
for
loops to perform as well asArray.Copy()
orReadOnlySpan<T>.CopyTo()
, but I wasn't expecting thefor
on the spans to be slower than on the arrays.Again, I may be wrong but I was expecting bounds checking to be removed when using
{ReadOnly}Span<T>
. Is this possible?I'm trying to develop a high-performance library and this means that I will need to maintain code for both arrays and
{ReadOnly}Span<T>
, which I was trying to avoid.Thank you all! 👍
category:performance theme:basic-cq