Open kuda1978 opened 6 months ago
Tagging subscribers to this area: @dotnet/area-system-runtime See info in area-owners.md if you want to be subscribed.
I wonder if inverting the if (value == null)
condition would change the result. Not null value should be treated as more common case. Finding null value can be delegated to SIMD path.
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch See info in area-owners.md if you want to be subscribed.
A minimal repro for the underlying issue:
[MethodImpl(MethodImplOptions.NoInlining)]
public bool Test<T>(T[] array, T value) where T : IEquatable<T>
{
return array[0].Equals(value);
}
We currently are never able to optimize this Equals if the method is not inlined into non-shared generic context. Dynamic PGO also gives up here because VM tells JIT the method should be invoked as indirect and we never optimize such calls with PGO currently. Codegen:
; Assembly listing for method Test[System.__Canon](System.__Canon[],System.__Canon):ubyte:this (Tier1)
; Emitting BLENDED_CODE for X64 with AVX512 - Windows
; Tier1 code
; optimized code
G_M40771_IG01: ;; offset=0x0000
push rsi
push rbx
sub rsp, 40
mov qword ptr [rsp+0x20], rdx
mov rbx, r8
mov rsi, r9
;; size=17 bbWeight=1 PerfScore 3.75
G_M40771_IG02: ;; offset=0x0011
mov rcx, qword ptr [rdx+0x38]
mov r11, qword ptr [rcx+0x10]
test r11, r11
je SHORT G_M40771_IG04
;; size=13 bbWeight=1 PerfScore 5.25
G_M40771_IG03: ;; offset=0x001E
jmp SHORT G_M40771_IG05
;; size=2 bbWeight=0.80 PerfScore 1.60
G_M40771_IG04: ;; offset=0x0020
mov rcx, rdx
mov rdx, 0x7FFB713E76B0 ; global ptr
call CORINFO_HELP_RUNTIMEHANDLE_METHOD
mov r11, rax
;; size=21 bbWeight=0.20 PerfScore 0.35
G_M40771_IG05: ;; offset=0x0035
cmp dword ptr [rbx+0x08], 0
jbe SHORT G_M40771_IG07
mov rcx, gword ptr [rbx+0x10]
mov rdx, rsi
call [r11]
nop
;; size=17 bbWeight=1 PerfScore 9.50
G_M40771_IG06: ;; offset=0x0046
add rsp, 40
pop rbx
pop rsi
ret
;; size=7 bbWeight=1 PerfScore 2.25
G_M40771_IG07: ;; offset=0x004D
call CORINFO_HELP_RNGCHKFAIL
int3
; Total bytes of code 83
***** BB01 [0000]
STMT00001 ( ??? ... ??? )
[000009] DAC-G+----- * STORE_LCL_VAR long V06 tmp2
[000006] --C-G+----- \--* CALL help long CORINFO_HELP_RUNTIMEHANDLE_METHOD
[000004] !----+----- arg0 in rcx +--* LCL_VAR long V01 TypeCtx
[000005] H----+-N--- arg1 in rdx \--* CNS_INT(h) long 0x7ffb7140bfe0 global ptr
***** BB01 [0000]
STMT00003 ( ??? ... ??? )
[000015] --CXG+----- * RETURN int
[000011] --CXG+----- \--* CALL ind stub int
[000010] -----+----- calli tgt \--* LCL_VAR long V06 tmp2
[000027] ---XG+----- this in rcx +--* COMMA ref
[000020] ---X-+----- | +--* BOUNDS_CHECK_Rng void
[000001] -----+----- | | +--* CNS_INT int 0
[000019] ---X-+----- | | \--* ARR_LENGTH int
[000000] -----+----- | | \--* LCL_VAR ref V02 arg1
[000028] n---G+----- | \--* IND ref
[000026] -----+----- | \--* ARR_ADDR byref ref[]
[000025] -----+----- | \--* ADD byref
[000017] -----+----- | +--* LCL_VAR ref V02 arg1
[000024] -----+----- | \--* CNS_INT long 16
[000016] -----+----- vsd cell in r11 +--* LCL_VAR long V06 tmp2
[000003] -----+----- arg2 in rdx \--* LCL_VAR ref V03 arg2
cc @AndyAyersMS
@EgorBo I think that you have the mistake in minimal repro.
It seems strange to me that after splitting the method into three, it runs twice as fast. See the method IndexOfSplitted
Simpler:
[Benchmark]
public int IndexOfNullCheck() => IndexOfNullCheck<string>(_strArr, "10000", 0, _strArr.Length);
private static int IndexOfNullCheck<T>(T[] array, T value, int startIndex, int count)
where T : IEquatable<T>
{
if (value == null)
{
return -1;
}
int endIndex = startIndex + count;
for (int i = startIndex; i < endIndex; i++)
{
if (array[i] != null && array[i]!.Equals(value)) return i;
}
return -1;
}
[Benchmark]
public int IndexOfNoNullCheck() => IndexOfNoNullCheck<string>(_strArr, "10000", 0, _strArr.Length);
private static int IndexOfNoNullCheck<T>(T[] array, T value, int startIndex, int count)
where T : IEquatable<T>
{
int endIndex = startIndex + count;
for (int i = startIndex; i < endIndex; i++)
{
if (array[i] != null && array[i]!.Equals(value)) return i;
}
return -1;
}
Method | Mean | Error | StdDev |
---|---|---|---|
IndexOfNullCheck | 22.34 us | 0.099 us | 0.092 us |
IndexOfNoNullCheck | 11.17 us | 0.108 us | 0.090 us |
Just a null check can trigger the pessimisation. I can't reproduce it under non-generic context. It actually performs better than both generic versions. @EgorBo Hopefully you can improve dasm tools under generic context so that I can investigate further.
@EgorBo Hopefully you can improve dasm tools under generic context so that I can investigate further
You can use [DisassemblyDiagnoser]
with the desired call depth to see what's going on.
It may be an artifact of measurement then. The null check affects inline decision. Both benchmarks provides very similar result with same AggressiveInlining
or NoInlining
config.
Egor is probably correct that it's about inlining and devirtualization. Manually extract the called methods and apply AggressiveInlining
:
public class IndexOfBenchmark
{
private readonly string[] _strArr = Enumerable.Range(1, 10_000).Select(n => n.ToString()).ToArray();
[Params("10000")]
public string Value { get; set; }
[Benchmark]
public int IndexOfStr() => Array.IndexOf(_strArr, Value);
[Benchmark]
public int IndexOfStrNaive()
{
for (int i = 0; i < _strArr.Length; i++)
if (_strArr[i] == Value)
return i;
return -1;
}
[Benchmark]
public int IndexOfAggressiveInlining() => ArrayIndexOf(_strArr, Value, 0, _strArr.Length);
[MethodImpl(MethodImplOptions.AggressiveInlining)]
private static unsafe int ArrayIndexOf<T>(T[] array, T value, int startIndex, int count)
where T : IEquatable<T>
{
if (array == null)
{
throw new Exception();
}
if ((uint)startIndex > (uint)array.Length)
{
throw new Exception();
}
if ((uint)count > (uint)(array.Length - startIndex))
{
throw new Exception();
}
return GenericEqualityComparer<T>(array, value, startIndex, count);
}
[MethodImpl(MethodImplOptions.AggressiveInlining)]
internal static int GenericEqualityComparer<T>(T[] array, T value, int startIndex, int count)
where T : IEquatable<T>
{
int endIndex = startIndex + count;
if (value == null)
{
for (int i = startIndex; i < endIndex; i++)
{
if (array[i] == null) return i;
}
}
else
{
for (int i = startIndex; i < endIndex; i++)
{
if (array[i] != null && array[i]!.Equals(value)) return i;
}
}
return -1;
}
}
It just performs much better:
Method | Value | Mean | Error | StdDev |
---|---|---|---|---|
IndexOfStr | 10000 | 20.592 us | 0.3979 us | 0.4737 us |
IndexOfStrNaive | 10000 | 11.106 us | 0.0736 us | 0.0689 us |
IndexOfAggressiveInlining | 10000 | 7.085 us | 0.0434 us | 0.0385 us |
Description
Array.IndexOf on string array slower in 2.5 times that naive iterraction and compare.
Configuration
Data
Benchmark result
Benchmark source
Analysis
I think, that problem in JIT optimization of method
System.Collections.Generic.GenericEqualityComparer<T>.IndexOf
In case of split this method to two methods for finding null values and not null values it runs faster.Result: