dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
14.97k stars 4.66k forks source link

Span<ReferenceType>.CopyTo is 10-20% slower than Array<ReferenceType>.CopyTo #29093

Open adamsitnik opened 5 years ago

adamsitnik commented 5 years ago

How to run the benchmarks:

git clone https://github.com/dotnet/performance.git
# if you have .NET Core 3.0 installed
dotnet run -c Release -f netcoreapp3.0 -p .\performance\src\benchmarks\micro\MicroBenchmarks.csproj --filter *CopyTo<String>*.Array *CopyTo<String>.Span --join --affinity 2
# if you don't have .NET Core 3.0 installed
py .\performance\scripts\benchmarks_ci.py -f netcoreapp3.0 --filter *CopyTo<String>*.Array *CopyTo<String>.Span --bdn-arguments="--join true --affinity 2"
Method Size Mean
Array 2048 511.1 ns
Span 2048 609.9 ns

Full docs for the new benchmarking workflow: https://github.com/dotnet/performance/blob/master/docs/benchmarking-workflow-corefx.md

category:cq theme:optimization skill-level:expert cost:medium

karelz commented 5 years ago

@adamsitnik only 1 area label please :)

adamsitnik commented 5 years ago

only 1 area label please

sure! I did not know about this convention

ahsonkhan commented 5 years ago

Array CopyTo: https://github.com/dotnet/coreclr/blob/2f4a7564beb6f8c1f4929504ee2221953d25fc39/src/classlibnative/bcltype/arraynative.cpp#L932

Span CopyTo (MemmoveNativeThreshold == 2048): https://github.com/dotnet/corefx/blob/d9dcfc5ab3c5bda9fe211699d5bde7df8c5fe6d5/src/Common/src/CoreLib/System/Buffer.cs#L451-L513

karelz commented 5 years ago

sure! I did not know about this convention

Thanks and no worries, here are our triage rules: https://github.com/dotnet/corefx/blob/master/Documentation/project-docs/issue-guide.md#triage-rules---simplified

GrabYourPitchforks commented 4 years ago

Is there still a perf difference in these scenarios today? For reference types, both Array.CopyTo and Span<T>.CopyTo should be going through the same code path now:

https://github.com/dotnet/runtime/blob/70c1e9dbebe2d737e93b7900c34c6f1dff6a2d77/src/coreclr/src/System.Private.CoreLib/src/System/Buffer.CoreCLR.cs#L44-L50

If I recall correctly the perf speedup from this code path (as opposed to copying elements one-by-one) is that we don't need to perform a volatile write + card table update on every element copy. We can instead perform the updates in chunks to take advantage of more efficient memory access patterns.

@jkotas, I was looking through the code again and noticed that the implementation of BulkMoveWithWriteBarrier ends up casting the incoming void* to an Object**, as below.

https://github.com/dotnet/runtime/blob/fcd862e06413a000f9cafa9d2f359226c60b9b42/src/coreclr/src/classlibnative/bcltype/arraynative.inl#L314-L323

Is this allowable? If the array elements aren't actually reference types - but instead they're value types that contain references - wouldn't the cast to Object** be incorrect?

jkotas commented 4 years ago

There is no functional problem with it. If you look at what InlinedSetCardsAfterBulkCopyHelper does, it casts the Object** to BYTE* or size_t.

Maybe the argument of InlinedSetCardsAfterBulkCopyHelper should be typed void* to make this less confusing.

adamsitnik commented 4 years ago

Is there still a perf difference in these scenarios today?

Yes ;(

Btw the commands from the issue description are still working:

git clone https://github.com/dotnet/performance.git
# if you have .NET Core 5.0 installed
dotnet run -c Release -f netcoreapp5.0 -p .\performance\src\benchmarks\micro\MicroBenchmarks.csproj --filter *CopyTo<String>*.Array *CopyTo<String>.Span --join --affinity 2
# if you don't have .NET Core 5.0 installed
py .\performance\scripts\benchmarks_ci.py -f netcoreapp5.0 --filter *CopyTo<String>*.Array *CopyTo<String>.Span --bdn-arguments="--join true --affinity 2"
BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18363
Intel Xeon CPU E5-1650 v4 3.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.100-preview.2.20119.13
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.11807, CoreFX 5.0.20.11807), X64 RyuJIT
  Job-EFYAKM : .NET Core 5.0.0 (CoreCLR 5.0.20.11807, CoreFX 5.0.20.11807), X64 RyuJIT

Affinity=000000000010  PowerPlanMode=00000000-0000-0000-0000-000000000000  IterationTime=250.0000 ms  
MaxIterationCount=20  MinIterationCount=15  WarmupCount=1  
Method Size Mean Error StdDev Median Min Max Gen 0 Gen 1 Gen 2 Allocated
Array 2048 509.2 ns 1.12 ns 0.87 ns 509.2 ns 507.9 ns 510.7 ns - - - -
Span 2048 622.3 ns 1.77 ns 1.38 ns 622.1 ns 620.0 ns 624.4 ns - - - -
GrabYourPitchforks commented 4 years ago

Ran some preliminary tests. I can repro this only when the test class is defined as CopyTo<T>. The issue does not repro if the test class is defined in non-generic terms; e.g. the class is defined as CopyTo and the fields are string[] instead of T[].

Possible something related to how generic instantiation is performed by the runtime. Still looking into it.

GrabYourPitchforks commented 4 years ago
; CopyTo.Array (non-generic)

00007ff9`b171b290 48894c2408      mov     qword ptr [rsp+8],rcx
00007ff9`b171b295 488b4908        mov     rcx,qword ptr [rcx+8]
00007ff9`b171b299 4c8b442408      mov     r8,qword ptr [rsp+8]
00007ff9`b171b29e 498b5018        mov     rdx,qword ptr [r8+18h]
00007ff9`b171b2a2 458b4020        mov     r8d,dword ptr [r8+20h]
00007ff9`b171b2a6 e95d6ad8ff      jmp     <Array.CopyTo(Array, Array, int)>

; CopyTo.Span (non-generic)

00007ff9`b171b400 4883ec28        sub     rsp,28h
00007ff9`b171b404 488b5108        mov     rdx,qword ptr [rcx+8]
00007ff9`b171b408 4885d2          test    rdx,rdx
00007ff9`b171b40b 7507            jne     ConsoleAppBenchmark!ConsoleAppBenchmark.CopyToRunner.Span()+0x1704 (00007ff9`b171b414)
00007ff9`b171b40d 33d2            xor     edx,edx
00007ff9`b171b40f 4533c0          xor     r8d,r8d
00007ff9`b171b412 eb19            jmp     ConsoleAppBenchmark!ConsoleAppBenchmark.CopyToRunner.Span()+0x171d (00007ff9`b171b42d)
00007ff9`b171b414 49b868c554b1f97f0000 mov r8,7FF9B154C568h
00007ff9`b171b41e 4c3902          cmp     qword ptr [rdx],r8
00007ff9`b171b421 7559            jne     ConsoleAppBenchmark!ConsoleAppBenchmark.CopyToRunner.Span()+0x176c (00007ff9`b171b47c)
00007ff9`b171b423 4c8d4210        lea     r8,[rdx+10h]
00007ff9`b171b427 8b5208          mov     edx,dword ptr [rdx+8]
00007ff9`b171b42a 4987d0          xchg    rdx,r8
00007ff9`b171b42d 488b4918        mov     rcx,qword ptr [rcx+18h]
00007ff9`b171b431 4885c9          test    rcx,rcx
00007ff9`b171b434 7506            jne     ConsoleAppBenchmark!ConsoleAppBenchmark.CopyToRunner.Span()+0x172c (00007ff9`b171b43c)
00007ff9`b171b436 33c9            xor     ecx,ecx
00007ff9`b171b438 33c0            xor     eax,eax
00007ff9`b171b43a eb19            jmp     ConsoleAppBenchmark!ConsoleAppBenchmark.CopyToRunner.Span()+0x1745 (00007ff9`b171b455)
00007ff9`b171b43c 48b868c554b1f97f0000 mov rax,7FF9B154C568h
00007ff9`b171b446 483901          cmp     qword ptr [rcx],rax
00007ff9`b171b449 7531            jne     ConsoleAppBenchmark!ConsoleAppBenchmark.CopyToRunner.Span()+0x176c (00007ff9`b171b47c)
00007ff9`b171b44b 488d4110        lea     rax,[rcx+10h]
00007ff9`b171b44f 8b4908          mov     ecx,dword ptr [rcx+8]
00007ff9`b171b452 4887c1          xchg    rax,rcx
00007ff9`b171b455 443bc0          cmp     r8d,eax
00007ff9`b171b458 7728            ja      ConsoleAppBenchmark!ConsoleAppBenchmark.CopyToRunner.Span()+0x1772 (00007ff9`b171b482)
00007ff9`b171b45a 4d63c0          movsxd  r8,r8d
00007ff9`b171b45d 49c1e003        shl     r8,3
00007ff9`b171b461 4981f800400000  cmp     r8,4000h
00007ff9`b171b468 7707            ja      ConsoleAppBenchmark!ConsoleAppBenchmark.CopyToRunner.Span()+0x1761 (00007ff9`b171b471)
00007ff9`b171b46a e80159925f      call    CoreCLR!Buffer::BulkMoveWithWriteBarrier (00007ffa`11040d70)
00007ff9`b171b46f eb05            jmp     ConsoleAppBenchmark!ConsoleAppBenchmark.CopyToRunner.Span()+0x1766 (00007ff9`b171b476)
00007ff9`b171b471 e8f2f1d8ff      call    CLRStub[MethodDescPrestub]@7ff9b14aa668 (00007ff9`b14aa668)
00007ff9`b171b476 90              nop
00007ff9`b171b477 4883c428        add     rsp,28h
00007ff9`b171b47b c3              ret

(Even though the Span code is larger than the Array code, it's basically the same as what Array.CopyTo would've done, but inlined into the caller. So the benchmark results are nearly identical.)

GrabYourPitchforks commented 4 years ago

Contrast this against the codegen for the generic CopyTo test:

; CopyTo<string>.Array

00007ff9`b6fab690 48894c2408      mov     qword ptr [rsp+8],rcx
00007ff9`b6fab695 488b4908        mov     rcx,qword ptr [rcx+8]
00007ff9`b6fab699 4c8b442408      mov     r8,qword ptr [rsp+8]
00007ff9`b6fab69e 498b5018        mov     rdx,qword ptr [r8+18h]
00007ff9`b6fab6a2 458b4020        mov     r8d,dword ptr [r8+20h]
00007ff9`b6fab6a6 e95d66d8ff      jmp     <Array.CopyTo(Array, Array, int)>

; CopyTo<string>.Span

00007ff9`b6fdbb40 57              push    rdi
00007ff9`b6fdbb41 56              push    rsi
00007ff9`b6fdbb42 4883ec68        sub     rsp,68h
00007ff9`b6fdbb46 488bf1          mov     rsi,rcx
00007ff9`b6fdbb49 488d7c2420      lea     rdi,[rsp+20h]
00007ff9`b6fdbb4e b910000000      mov     ecx,10h
00007ff9`b6fdbb53 33c0            xor     eax,eax
00007ff9`b6fdbb55 f3ab            rep stos dword ptr [rdi]
00007ff9`b6fdbb57 488bce          mov     rcx,rsi
00007ff9`b6fdbb5a 48894c2460      mov     qword ptr [rsp+60h],rcx
00007ff9`b6fdbb5f 488bf1          mov     rsi,rcx
00007ff9`b6fdbb62 488b0e          mov     rcx,qword ptr [rsi] ; classHnd
00007ff9`b6fdbb65 48ba80f407b7f97f0000 mov rdx,7FF9B707F480h  ; signature
00007ff9`b6fdbb6f e88c88825f      call    CoreCLR!JIT_GenericHandleClass (00007ffa`16804400)
00007ff9`b6fdbb74 488bf8          mov     rdi,rax
00007ff9`b6fdbb77 488bd7          mov     rdx,rdi
00007ff9`b6fdbb7a 4c8b4608        mov     r8,qword ptr [rsi+8]
00007ff9`b6fdbb7e 488d4c2440      lea     rcx,[rsp+40h]
00007ff9`b6fdbb83 e810daffff      call    <Span<__Canon>.ctor(__Canon[])>
00007ff9`b6fdbb88 488b542440      mov     rdx,qword ptr [rsp+40h]
00007ff9`b6fdbb8d 4889542450      mov     qword ptr [rsp+50h],rdx
00007ff9`b6fdbb92 8b542448        mov     edx,dword ptr [rsp+48h]
00007ff9`b6fdbb96 89542458        mov     dword ptr [rsp+58h],edx
00007ff9`b6fdbb9a 488bd7          mov     rdx,rdi
00007ff9`b6fdbb9d 4c8b4618        mov     r8,qword ptr [rsi+18h]
00007ff9`b6fdbba1 488d4c2430      lea     rcx,[rsp+30h]
00007ff9`b6fdbba6 e8edd9ffff      call    CLRStub[MethodDescPrestub]@7ff9b6fd9598 (00007ff9`b6fd9598)
00007ff9`b6fdbbab 488d4c2450      lea     rcx,[rsp+50h]
00007ff9`b6fdbbb0 488bd7          mov     rdx,rdi
00007ff9`b6fdbbb3 4c8d442420      lea     r8,[rsp+20h]
00007ff9`b6fdbbb8 488b442430      mov     rax,qword ptr [rsp+30h]
00007ff9`b6fdbbbd 498900          mov     qword ptr [r8],rax
00007ff9`b6fdbbc0 8b442438        mov     eax,dword ptr [rsp+38h]
00007ff9`b6fdbbc4 41894008        mov     dword ptr [r8+8],eax
00007ff9`b6fdbbc8 4c8d442420      lea     r8,[rsp+20h]
00007ff9`b6fdbbcd e85edaffff      call    CLRStub[MethodDescPrestub]@7ff9b6fd9630 (00007ff9`b6fd9630)
00007ff9`b6fdbbd2 90              nop
00007ff9`b6fdbbd3 4883c468        add     rsp,68h
00007ff9`b6fdbbd7 5e              pop     rsi
00007ff9`b6fdbbd8 5f              pop     rdi
00007ff9`b6fdbbd9 c3              ret

; Span<__Canon>.ctor(__Canon[])

00007ff9`b6fdba50 57              push    rdi
00007ff9`b6fdba51 56              push    rsi
00007ff9`b6fdba52 55              push    rbp
00007ff9`b6fdba53 53              push    rbx
00007ff9`b6fdba54 4883ec28        sub     rsp,28h
00007ff9`b6fdba58 c5f877          vzeroupper
00007ff9`b6fdba5b 4889542420      mov     qword ptr [rsp+20h],rdx
00007ff9`b6fdba60 488bd9          mov     rbx,rcx
00007ff9`b6fdba63 488bfa          mov     rdi,rdx
00007ff9`b6fdba66 498bf0          mov     rsi,r8
00007ff9`b6fdba69 4885f6          test    rsi,rsi
00007ff9`b6fdba6c 7511            jne     System_Private_CoreLib!System.Span`1[[System.__Canon, System.Private.CoreLib]]..ctor(System.__Canon[])+0xffffffff`a3af5ddf (00007ff9`b6fdba7f)
00007ff9`b6fdba6e c5f857c0        vxorps  xmm0,xmm0,xmm0
00007ff9`b6fdba72 c5fa7f03        vmovdqu xmmword ptr [rbx],xmm0
00007ff9`b6fdba76 4883c428        add     rsp,28h
00007ff9`b6fdba7a 5b              pop     rbx
00007ff9`b6fdba7b 5d              pop     rbp
00007ff9`b6fdba7c 5e              pop     rsi
00007ff9`b6fdba7d 5f              pop     rdi
00007ff9`b6fdba7e c3              ret
00007ff9`b6fdba7f 488bce          mov     rcx,rsi
00007ff9`b6fdba82 e889988d5f      call    CoreCLR!ObjectNative::GetClass (00007ffa`168b5310)
00007ff9`b6fdba87 488be8          mov     rbp,rax
00007ff9`b6fdba8a 488b4f30        mov     rcx,qword ptr [rdi+30h]
00007ff9`b6fdba8e 488b09          mov     rcx,qword ptr [rcx]
00007ff9`b6fdba91 488b4908        mov     rcx,qword ptr [rcx+8]
00007ff9`b6fdba95 4885c9          test    rcx,rcx
00007ff9`b6fdba98 7402            je      System_Private_CoreLib!System.Span`1[[System.__Canon, System.Private.CoreLib]]..ctor(System.__Canon[])+0xffffffff`a3af5dfc (00007ff9`b6fdba9c)
00007ff9`b6fdba9a eb15            jmp     System_Private_CoreLib!System.Span`1[[System.__Canon, System.Private.CoreLib]]..ctor(System.__Canon[])+0xffffffff`a3af5e11 (00007ff9`b6fdbab1)
00007ff9`b6fdba9c 488bcf          mov     rcx,rdi
00007ff9`b6fdba9f 48ba40060ab7f97f0000 mov rdx,7FF9B70A0640h
00007ff9`b6fdbaa9 e85289825f      call    CoreCLR!JIT_GenericHandleClass (00007ffa`16804400)
00007ff9`b6fdbaae 488bc8          mov     rcx,rax
00007ff9`b6fdbab1 e8eaed825f      call    CoreCLR!JIT_GetRuntimeType (00007ffa`1680a8a0)
00007ff9`b6fdbab6 483bc5          cmp     rax,rbp
00007ff9`b6fdbab9 7516            jne     System_Private_CoreLib!System.Span`1[[System.__Canon, System.Private.CoreLib]]..ctor(System.__Canon[])+0xffffffff`a3af5e31 (00007ff9`b6fdbad1)
00007ff9`b6fdbabb 488d4610        lea     rax,[rsi+10h]
00007ff9`b6fdbabf 488903          mov     qword ptr [rbx],rax
00007ff9`b6fdbac2 8b4608          mov     eax,dword ptr [rsi+8]
00007ff9`b6fdbac5 894308          mov     dword ptr [rbx+8],eax
00007ff9`b6fdbac8 4883c428        add     rsp,28h
00007ff9`b6fdbacc 5b              pop     rbx
00007ff9`b6fdbacd 5d              pop     rbp
00007ff9`b6fdbace 5e              pop     rsi
00007ff9`b6fdbacf 5f              pop     rdi
00007ff9`b6fdbad0 c3              ret
00007ff9`b6fdbad1 e8a2e2d9ff      call    CLRStub[MethodDescPrestub]@7ff9b6d79d78 (00007ff9`b6d79d78)
00007ff9`b6fdbad6 cc              int     3

I didn't even dump all the relevant codegen since it's so big, but a few things immediately stand out:

<< Partial stack trace showing memmove not inlined >>
00 0000009a`7337ce08 00007ff9`b6fdbc56 System_Private_CoreLib!System.Buffer.Memmove[[System.__Canon, System.Private.CoreLib]](System.__Canon ByRef, System.__Canon ByRef, UInt64)+0xffffffff`a3b81660 [C:\runtime\src\libraries\System.Private.CoreLib\src\System\Buffer.cs @ 338] 
01 0000009a`7337ce10 00007ff9`b6fdbbd2 System_Private_CoreLib!System.Span`1[[System.__Canon, System.Private.CoreLib]].CopyTo(System.Span`1<System.__Canon>)+0xffffffff`a3af5976 [C:\runtime\src\libraries\System.Private.CoreLib\src\System\Span.cs @ 363] 

/cc @EgorBo In case he has any suggestions.

jkotas commented 4 years ago

This is the well-known limitation of inlining generics into generics. @davidwrighton started working on fixing it in https://github.com/dotnet/runtime/pull/31833

AndyAyersMS commented 4 years ago

Cross-class shared methods don't compose well, and in particular in CoreCLR there are inlining restrictions. See eg #7580. For Span we worked around this for the indexer but not for the constructor.

@davidwrighton is looking to relax some of these restrictions in #31833.

EgorBo commented 4 years ago
array.GetType() != typeof(T[])

so if I understand it correctly, it's a Co(ntra)variance check. For String as T in Span<T>(T[] array) there is no way this check will be true, right? So the jit actually should be able to eliminate it, shouldn't it? (it currently doesn't)

AndyAyersMS commented 4 years ago

The jit doesn't see string here, it sees __Canon. If we inline enough we can sometimes get all this into unshared code and figure out the actual types, but not always...

AndyAyersMS commented 4 years ago

That part is potentially easier to fix.

Over in gtFoldTypeCompare for comparison of a handle to an unknown, we can see what we know about the type of the unknown, likely running it through impIsClassExact after updating that to know about some effectively sealed array types.

I think I have parts of it coded up somewhere...

EgorBo commented 4 years ago

That part is potentially easier to fix.

Over in gtFoldTypeCompare for comparison of a handle to an unknown, we can see what we know about the type of the unknown, likely running it through impIsClassExact after updating that to know about some effectively sealed array types.

I think I have parts of it coded up somewhere...

yeah I was just playing with gtFoldTypeCompare like this: image

so I guess if we can obtain the array element type of that local (we call .GetType from) and if it's sealed -- optimize? (to GT_NULLCHECK?)

EgorBo commented 4 years ago

@AndyAyersMS it seems impIsClassExact returns false for System.String[] cls (it even has a comment We are conservative on arrays here. It might be worth checking for types like int[].

UPD: now it supports arrays:


bool Compiler::impIsClassExact(CORINFO_CLASS_HANDLE classHnd)
{
    DWORD flags     = info.compCompHnd->getClassAttribs(classHnd);
    DWORD flagsMask = CORINFO_FLG_FINAL | CORINFO_FLG_MARSHAL_BYREF | CORINFO_FLG_CONTEXTFUL | CORINFO_FLG_VARIANCE;

    if ((flags & flagsMask) != CORINFO_FLG_FINAL)
    {
        return false;
    }
    if (flags & CORINFO_FLG_ARRAY)
    {
        CORINFO_CLASS_HANDLE arrayElementHandle = nullptr;
        info.compCompHnd->getChildType(classHnd, &arrayElementHandle);
        return impIsClassExact(arrayElementHandle);
    }
    return true;
}

UPD2 hm.. it seems this change alone produces jit-diffs (improvements)

AndyAyersMS commented 4 years ago

The change to impClassIsExact is not safe for int-typed arrays (despite what the comment says) since we allow equivalence between int[] and uint[] and likewise for integral types.

So we'd need to restrict the array cases somewhat, or pose the question differently.

For gtFoldTypeCompare the best way to fetch what we know about a type is via gtGetClassHandle, it can deal with a wider variety of tree types and knows how some variables are defined. Also yes we need to watch for potential exceptions....

EgorBo commented 4 years ago

@AndyAyersMS I have a draft: https://github.com/EgorBo/runtime-1/commit/0141547f2dbc0705941b9854e488b100781fc4ce

Test:

static bool Test(string[] array)
{
    return array.GetType() != typeof(string[]);
}

current codegen:

       mov      rax, 0xD1FFAB1E
       cmp      qword ptr [rcx], rax
       setne    al
       movzx    rax, al
       ret

new codegen:

       cmp      dword ptr [rcx], ecx  ;; null-check, can be eliminated in real code
       xor      eax, eax
       ret
GrabYourPitchforks commented 4 years ago

I reran the tests again today with the latest master, and I'm seeing that the Array and Span scenarios are nearly aligned in terms of performance.

Method Size Mean Error StdDev Ratio RatioSD
Array 2048 434.1 ns 6.26 ns 5.55 ns 1.00 0.00
Span 2048 453.1 ns 8.73 ns 9.34 ns 1.04 0.03

But when I dumped the codegen, it was identical to what I posted earlier at https://github.com/dotnet/runtime/issues/29093#issuecomment-590572852. So I'm not quite sure what happened here and I'm suspicious of my numbers.

EgorBo commented 4 years ago

Just a note: https://github.com/dotnet/runtime/pull/32790 should not affect CopyTo codegen yet as far as I understand - maybe with https://github.com/dotnet/runtime/pull/31833 merged.