dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.26k stars 4.73k forks source link

Array.IndexOf at string[] slower that naive implementation #101595

Open kuda1978 opened 6 months ago

kuda1978 commented 6 months ago

Description

Array.IndexOf on string array slower in 2.5 times that naive iterraction and compare.

Configuration

BenchmarkDotNet v0.13.12, Windows 10 (10.0.19045.4291/22H2/2022Update)
Intel Core i5-9600K CPU 3.70GHz (Coffee Lake), 1 CPU, 6 logical and 6 physical cores
.NET SDK 8.0.300-preview.24203.14
  [Host]   : .NET 6.0.29 (6.0.2924.17105), X64 RyuJIT AVX2
  .NET 6.0 : .NET 6.0.29 (6.0.2924.17105), X64 RyuJIT AVX2
  .NET 7.0 : .NET 7.0.18 (7.0.1824.16914), X64 RyuJIT AVX2
  .NET 8.0 : .NET 8.0.2 (8.0.224.6711), X64 RyuJIT AVX2

Data

Benchmark result

| Method          | Job      | Runtime  | Mean      | Error     | StdDev    |
|---------------- |--------- |--------- |----------:|----------:|----------:|
| IndexOfStr      | .NET 6.0 | .NET 6.0 | 31.391 us | 0.1531 us | 0.1357 us |
| IndexOfStrNaive | .NET 6.0 | .NET 6.0 | 10.701 us | 0.0878 us | 0.0778 us |
| IndexOfStr      | .NET 7.0 | .NET 7.0 | 26.413 us | 0.2391 us | 0.2236 us |
| IndexOfStrNaive | .NET 7.0 | .NET 7.0 |  8.789 us | 0.1657 us | 0.1842 us |
| IndexOfStr      | .NET 8.0 | .NET 8.0 | 21.671 us | 0.1674 us | 0.1566 us |
| IndexOfStrNaive | .NET 8.0 | .NET 8.0 |  8.931 us | 0.1344 us | 0.1192 us |

Benchmark source

[SimpleJob(RuntimeMoniker.Net60)]
[SimpleJob(RuntimeMoniker.Net70)]
[SimpleJob(RuntimeMoniker.Net80)]
public class IndexOfBenchmark
{
    private readonly string[] _strArr = Enumerable.Range(1, 10_000).Select(n => n.ToString()).ToArray();

    [Benchmark]
    public int IndexOfStr() => Array.IndexOf(_strArr, "10000");

    [Benchmark]
    public int IndexOfStrNaive()
    {
        for (int i = 0; i < _strArr.Length; i++)
            if (_strArr[i] == "10000")
                return i;

        return -1;
    }
}

Analysis

I think, that problem in JIT optimization of method System.Collections.Generic.GenericEqualityComparer<T>.IndexOf In case of split this method to two methods for finding null values and not null values it runs faster.

[SimpleJob(RuntimeMoniker.Net60)]
[SimpleJob(RuntimeMoniker.Net70)]
[SimpleJob(RuntimeMoniker.Net80)]
public class IndexOfInternalBenchmark
{
    private readonly string[] _strArr = Enumerable.Range(1, 10_000).Select(n => n.ToString()).ToArray();

    [Benchmark]
    public int IndexOfOriginal() => IndexOfOriginal<string>(_strArr, "10000", 0, _strArr.Length);

    // Copy of: System.Collections.Generic.GenericEqualityComparer<T>.IndexOf
    private static int IndexOfOriginal<T>(T[] array, T value, int startIndex, int count)
        where T : IEquatable<T>
    {
        int endIndex = startIndex + count;
        if (value == null)
        {
            for (int i = startIndex; i < endIndex; i++)
            {
                if (array[i] == null) return i;
            }
        }
        else
        {
            for (int i = startIndex; i < endIndex; i++)
            {
                if (array[i] != null && array[i]!.Equals(value)) return i;
            }
        }
        return -1;
    }

    [Benchmark]
    public int IndexOfSplitted() => IndexOfSplitted<string>(_strArr, "10000", 0, _strArr.Length);

    private static int IndexOfSplitted<T>(T[] array, T value, int startIndex, int count)
        where T : IEquatable<T>
    {
        if (value != null)
        {
            return IndexOfSplittedNotNull(array, value, startIndex, count);
        }
        else
        {
            return IndexOfSplittedNull(array, value, startIndex, count);
        }
    }

    internal static int IndexOfSplittedNull<T>(T[] array, T value, int startIndex, int count)
        where T : IEquatable<T>
    {
        int endIndex = startIndex + count;
        for (int i = startIndex; i < endIndex; i++)
        {
            if (array[i] == null) return i;
        }
        return -1;
    }

    internal static int IndexOfSplittedNotNull<T>(T[] array, T value, int startIndex, int count)
        where T : IEquatable<T>
    {
        int endIndex = startIndex + count;
        for (int i = 0; i < endIndex; i++)
        {
            if (array[i] != null && array[i]!.Equals(value)) return i;
        }
        return -1;
    }
}

Result:

| Method          | Job      | Runtime  | Mean     | Error    | StdDev   |
|---------------- |--------- |--------- |---------:|---------:|---------:|
| IndexOfOriginal | .NET 6.0 | .NET 6.0 | 26.40 us | 0.096 us | 0.080 us |
| IndexOfSplitted | .NET 6.0 | .NET 6.0 | 26.41 us | 0.123 us | 0.115 us |
| IndexOfOriginal | .NET 7.0 | .NET 7.0 | 31.21 us | 0.187 us | 0.166 us |
| IndexOfSplitted | .NET 7.0 | .NET 7.0 | 10.44 us | 0.157 us | 0.147 us |
| IndexOfOriginal | .NET 8.0 | .NET 8.0 | 24.16 us | 0.130 us | 0.122 us |
| IndexOfSplitted | .NET 8.0 | .NET 8.0 | 12.28 us | 0.165 us | 0.154 us |
dotnet-policy-service[bot] commented 6 months ago

Tagging subscribers to this area: @dotnet/area-system-runtime See info in area-owners.md if you want to be subscribed.

huoyaoyuan commented 6 months ago

I wonder if inverting the if (value == null) condition would change the result. Not null value should be treated as more common case. Finding null value can be delegated to SIMD path.

dotnet-policy-service[bot] commented 6 months ago

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch See info in area-owners.md if you want to be subscribed.

EgorBo commented 6 months ago

A minimal repro for the underlying issue:

[MethodImpl(MethodImplOptions.NoInlining)]
public bool Test<T>(T[] array, T value) where T : IEquatable<T>
{
    return array[0].Equals(value);
}

We currently are never able to optimize this Equals if the method is not inlined into non-shared generic context. Dynamic PGO also gives up here because VM tells JIT the method should be invoked as indirect and we never optimize such calls with PGO currently. Codegen:

; Assembly listing for method Test[System.__Canon](System.__Canon[],System.__Canon):ubyte:this (Tier1)
; Emitting BLENDED_CODE for X64 with AVX512 - Windows
; Tier1 code
; optimized code
G_M40771_IG01:  ;; offset=0x0000
       push     rsi
       push     rbx
       sub      rsp, 40
       mov      qword ptr [rsp+0x20], rdx
       mov      rbx, r8
       mov      rsi, r9
                        ;; size=17 bbWeight=1 PerfScore 3.75
G_M40771_IG02:  ;; offset=0x0011
       mov      rcx, qword ptr [rdx+0x38]
       mov      r11, qword ptr [rcx+0x10]
       test     r11, r11
       je       SHORT G_M40771_IG04
                        ;; size=13 bbWeight=1 PerfScore 5.25
G_M40771_IG03:  ;; offset=0x001E
       jmp      SHORT G_M40771_IG05
                        ;; size=2 bbWeight=0.80 PerfScore 1.60
G_M40771_IG04:  ;; offset=0x0020
       mov      rcx, rdx
       mov      rdx, 0x7FFB713E76B0      ; global ptr
       call     CORINFO_HELP_RUNTIMEHANDLE_METHOD
       mov      r11, rax
                        ;; size=21 bbWeight=0.20 PerfScore 0.35
G_M40771_IG05:  ;; offset=0x0035
       cmp      dword ptr [rbx+0x08], 0
       jbe      SHORT G_M40771_IG07
       mov      rcx, gword ptr [rbx+0x10]
       mov      rdx, rsi
       call     [r11]
       nop      
                        ;; size=17 bbWeight=1 PerfScore 9.50
G_M40771_IG06:  ;; offset=0x0046
       add      rsp, 40
       pop      rbx
       pop      rsi
       ret      
                        ;; size=7 bbWeight=1 PerfScore 2.25
G_M40771_IG07:  ;; offset=0x004D
       call     CORINFO_HELP_RNGCHKFAIL
       int3     
; Total bytes of code 83
***** BB01 [0000]
STMT00001 ( ??? ... ??? )
               [000009] DAC-G+-----                         *  STORE_LCL_VAR long   V06 tmp2         
               [000006] --C-G+-----                         \--*  CALL help long   CORINFO_HELP_RUNTIMEHANDLE_METHOD
               [000004] !----+----- arg0 in rcx                +--*  LCL_VAR   long   V01 TypeCtx      
               [000005] H----+-N--- arg1 in rdx                \--*  CNS_INT(h) long   0x7ffb7140bfe0 global ptr

***** BB01 [0000]
STMT00003 ( ??? ... ??? )
               [000015] --CXG+-----                         *  RETURN    int   
               [000011] --CXG+-----                         \--*  CALL ind stub int   
               [000010] -----+----- calli tgt                  \--*  LCL_VAR   long   V06 tmp2         
               [000027] ---XG+----- this in rcx                +--*  COMMA     ref   
               [000020] ---X-+-----                            |  +--*  BOUNDS_CHECK_Rng void  
               [000001] -----+-----                            |  |  +--*  CNS_INT   int    0
               [000019] ---X-+-----                            |  |  \--*  ARR_LENGTH int   
               [000000] -----+-----                            |  |     \--*  LCL_VAR   ref    V02 arg1         
               [000028] n---G+-----                            |  \--*  IND       ref   
               [000026] -----+-----                            |     \--*  ARR_ADDR  byref ref[]
               [000025] -----+-----                            |        \--*  ADD       byref 
               [000017] -----+-----                            |           +--*  LCL_VAR   ref    V02 arg1         
               [000024] -----+-----                            |           \--*  CNS_INT   long   16
               [000016] -----+----- vsd cell in r11            +--*  LCL_VAR   long   V06 tmp2         
               [000003] -----+----- arg2 in rdx                \--*  LCL_VAR   ref    V03 arg2       

cc @AndyAyersMS

kuda1978 commented 6 months ago

@EgorBo I think that you have the mistake in minimal repro.

It seems strange to me that after splitting the method into three, it runs twice as fast. See the method IndexOfSplitted of second benchmark.

huoyaoyuan commented 6 months ago

Simpler:


    [Benchmark]
    public int IndexOfNullCheck() => IndexOfNullCheck<string>(_strArr, "10000", 0, _strArr.Length);

    private static int IndexOfNullCheck<T>(T[] array, T value, int startIndex, int count)
        where T : IEquatable<T>
    {
        if (value == null)
        {
            return -1;
        }

        int endIndex = startIndex + count;

        for (int i = startIndex; i < endIndex; i++)
        {
            if (array[i] != null && array[i]!.Equals(value)) return i;
        }
        return -1;
    }

    [Benchmark]
    public int IndexOfNoNullCheck() => IndexOfNoNullCheck<string>(_strArr, "10000", 0, _strArr.Length);

    private static int IndexOfNoNullCheck<T>(T[] array, T value, int startIndex, int count)
        where T : IEquatable<T>
    {
        int endIndex = startIndex + count;

        for (int i = startIndex; i < endIndex; i++)
        {
            if (array[i] != null && array[i]!.Equals(value)) return i;
        }
        return -1;
    }
Method Mean Error StdDev
IndexOfNullCheck 22.34 us 0.099 us 0.092 us
IndexOfNoNullCheck 11.17 us 0.108 us 0.090 us

Just a null check can trigger the pessimisation. I can't reproduce it under non-generic context. It actually performs better than both generic versions. @EgorBo Hopefully you can improve dasm tools under generic context so that I can investigate further.

neon-sunset commented 6 months ago

@EgorBo Hopefully you can improve dasm tools under generic context so that I can investigate further

You can use [DisassemblyDiagnoser] with the desired call depth to see what's going on.

huoyaoyuan commented 6 months ago

It may be an artifact of measurement then. The null check affects inline decision. Both benchmarks provides very similar result with same AggressiveInlining or NoInlining config.

huoyaoyuan commented 6 months ago

Egor is probably correct that it's about inlining and devirtualization. Manually extract the called methods and apply AggressiveInlining:


public class IndexOfBenchmark
{
    private readonly string[] _strArr = Enumerable.Range(1, 10_000).Select(n => n.ToString()).ToArray();

    [Params("10000")]
    public string Value { get; set; }

    [Benchmark]
    public int IndexOfStr() => Array.IndexOf(_strArr, Value);

    [Benchmark]
    public int IndexOfStrNaive()
    {
        for (int i = 0; i < _strArr.Length; i++)
            if (_strArr[i] == Value)
                return i;

        return -1;
    }

    [Benchmark]
    public int IndexOfAggressiveInlining() => ArrayIndexOf(_strArr, Value, 0, _strArr.Length);

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    private static unsafe int ArrayIndexOf<T>(T[] array, T value, int startIndex, int count)
        where T : IEquatable<T>
    {
        if (array == null)
        {
            throw new Exception();
        }

        if ((uint)startIndex > (uint)array.Length)
        {
            throw new Exception();
        }

        if ((uint)count > (uint)(array.Length - startIndex))
        {
            throw new Exception();
        }

        return GenericEqualityComparer<T>(array, value, startIndex, count);
    }

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    internal static int GenericEqualityComparer<T>(T[] array, T value, int startIndex, int count)
        where T : IEquatable<T>
    {
        int endIndex = startIndex + count;
        if (value == null)
        {
            for (int i = startIndex; i < endIndex; i++)
            {
                if (array[i] == null) return i;
            }
        }
        else
        {
            for (int i = startIndex; i < endIndex; i++)
            {
                if (array[i] != null && array[i]!.Equals(value)) return i;
            }
        }
        return -1;
    }
}

It just performs much better:

Method Value Mean Error StdDev
IndexOfStr 10000 20.592 us 0.3979 us 0.4737 us
IndexOfStrNaive 10000 11.106 us 0.0736 us 0.0689 us
IndexOfAggressiveInlining 10000 7.085 us 0.0434 us 0.0385 us