Open adamsitnik opened 5 years ago
@adamsitnik only 1 area label please :)
only 1 area label please
sure! I did not know about this convention
Span CopyTo (MemmoveNativeThreshold
== 2048):
https://github.com/dotnet/corefx/blob/d9dcfc5ab3c5bda9fe211699d5bde7df8c5fe6d5/src/Common/src/CoreLib/System/Buffer.cs#L451-L513
sure! I did not know about this convention
Thanks and no worries, here are our triage rules: https://github.com/dotnet/corefx/blob/master/Documentation/project-docs/issue-guide.md#triage-rules---simplified
Is there still a perf difference in these scenarios today? For reference types, both Array.CopyTo
and Span<T>.CopyTo
should be going through the same code path now:
If I recall correctly the perf speedup from this code path (as opposed to copying elements one-by-one) is that we don't need to perform a volatile write + card table update on every element copy. We can instead perform the updates in chunks to take advantage of more efficient memory access patterns.
@jkotas, I was looking through the code again and noticed that the implementation of BulkMoveWithWriteBarrier
ends up casting the incoming void*
to an Object**
, as below.
Is this allowable? If the array elements aren't actually reference types - but instead they're value types that contain references - wouldn't the cast to Object**
be incorrect?
There is no functional problem with it. If you look at what InlinedSetCardsAfterBulkCopyHelper
does, it casts the Object**
to BYTE*
or size_t
.
Maybe the argument of InlinedSetCardsAfterBulkCopyHelper
should be typed void*
to make this less confusing.
Is there still a perf difference in these scenarios today?
Yes ;(
Btw the commands from the issue description are still working:
git clone https://github.com/dotnet/performance.git
# if you have .NET Core 5.0 installed
dotnet run -c Release -f netcoreapp5.0 -p .\performance\src\benchmarks\micro\MicroBenchmarks.csproj --filter *CopyTo<String>*.Array *CopyTo<String>.Span --join --affinity 2
# if you don't have .NET Core 5.0 installed
py .\performance\scripts\benchmarks_ci.py -f netcoreapp5.0 --filter *CopyTo<String>*.Array *CopyTo<String>.Span --bdn-arguments="--join true --affinity 2"
BenchmarkDotNet=v0.12.0, OS=Windows 10.0.18363
Intel Xeon CPU E5-1650 v4 3.60GHz, 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.100-preview.2.20119.13
[Host] : .NET Core 5.0.0 (CoreCLR 5.0.20.11807, CoreFX 5.0.20.11807), X64 RyuJIT
Job-EFYAKM : .NET Core 5.0.0 (CoreCLR 5.0.20.11807, CoreFX 5.0.20.11807), X64 RyuJIT
Affinity=000000000010 PowerPlanMode=00000000-0000-0000-0000-000000000000 IterationTime=250.0000 ms
MaxIterationCount=20 MinIterationCount=15 WarmupCount=1
Method | Size | Mean | Error | StdDev | Median | Min | Max | Gen 0 | Gen 1 | Gen 2 | Allocated |
---|---|---|---|---|---|---|---|---|---|---|---|
Array | 2048 | 509.2 ns | 1.12 ns | 0.87 ns | 509.2 ns | 507.9 ns | 510.7 ns | - | - | - | - |
Span | 2048 | 622.3 ns | 1.77 ns | 1.38 ns | 622.1 ns | 620.0 ns | 624.4 ns | - | - | - | - |
Ran some preliminary tests. I can repro this only when the test class is defined as CopyTo<T>
. The issue does not repro if the test class is defined in non-generic terms; e.g. the class is defined as CopyTo
and the fields are string[]
instead of T[]
.
Possible something related to how generic instantiation is performed by the runtime. Still looking into it.
; CopyTo.Array (non-generic)
00007ff9`b171b290 48894c2408 mov qword ptr [rsp+8],rcx
00007ff9`b171b295 488b4908 mov rcx,qword ptr [rcx+8]
00007ff9`b171b299 4c8b442408 mov r8,qword ptr [rsp+8]
00007ff9`b171b29e 498b5018 mov rdx,qword ptr [r8+18h]
00007ff9`b171b2a2 458b4020 mov r8d,dword ptr [r8+20h]
00007ff9`b171b2a6 e95d6ad8ff jmp <Array.CopyTo(Array, Array, int)>
; CopyTo.Span (non-generic)
00007ff9`b171b400 4883ec28 sub rsp,28h
00007ff9`b171b404 488b5108 mov rdx,qword ptr [rcx+8]
00007ff9`b171b408 4885d2 test rdx,rdx
00007ff9`b171b40b 7507 jne ConsoleAppBenchmark!ConsoleAppBenchmark.CopyToRunner.Span()+0x1704 (00007ff9`b171b414)
00007ff9`b171b40d 33d2 xor edx,edx
00007ff9`b171b40f 4533c0 xor r8d,r8d
00007ff9`b171b412 eb19 jmp ConsoleAppBenchmark!ConsoleAppBenchmark.CopyToRunner.Span()+0x171d (00007ff9`b171b42d)
00007ff9`b171b414 49b868c554b1f97f0000 mov r8,7FF9B154C568h
00007ff9`b171b41e 4c3902 cmp qword ptr [rdx],r8
00007ff9`b171b421 7559 jne ConsoleAppBenchmark!ConsoleAppBenchmark.CopyToRunner.Span()+0x176c (00007ff9`b171b47c)
00007ff9`b171b423 4c8d4210 lea r8,[rdx+10h]
00007ff9`b171b427 8b5208 mov edx,dword ptr [rdx+8]
00007ff9`b171b42a 4987d0 xchg rdx,r8
00007ff9`b171b42d 488b4918 mov rcx,qword ptr [rcx+18h]
00007ff9`b171b431 4885c9 test rcx,rcx
00007ff9`b171b434 7506 jne ConsoleAppBenchmark!ConsoleAppBenchmark.CopyToRunner.Span()+0x172c (00007ff9`b171b43c)
00007ff9`b171b436 33c9 xor ecx,ecx
00007ff9`b171b438 33c0 xor eax,eax
00007ff9`b171b43a eb19 jmp ConsoleAppBenchmark!ConsoleAppBenchmark.CopyToRunner.Span()+0x1745 (00007ff9`b171b455)
00007ff9`b171b43c 48b868c554b1f97f0000 mov rax,7FF9B154C568h
00007ff9`b171b446 483901 cmp qword ptr [rcx],rax
00007ff9`b171b449 7531 jne ConsoleAppBenchmark!ConsoleAppBenchmark.CopyToRunner.Span()+0x176c (00007ff9`b171b47c)
00007ff9`b171b44b 488d4110 lea rax,[rcx+10h]
00007ff9`b171b44f 8b4908 mov ecx,dword ptr [rcx+8]
00007ff9`b171b452 4887c1 xchg rax,rcx
00007ff9`b171b455 443bc0 cmp r8d,eax
00007ff9`b171b458 7728 ja ConsoleAppBenchmark!ConsoleAppBenchmark.CopyToRunner.Span()+0x1772 (00007ff9`b171b482)
00007ff9`b171b45a 4d63c0 movsxd r8,r8d
00007ff9`b171b45d 49c1e003 shl r8,3
00007ff9`b171b461 4981f800400000 cmp r8,4000h
00007ff9`b171b468 7707 ja ConsoleAppBenchmark!ConsoleAppBenchmark.CopyToRunner.Span()+0x1761 (00007ff9`b171b471)
00007ff9`b171b46a e80159925f call CoreCLR!Buffer::BulkMoveWithWriteBarrier (00007ffa`11040d70)
00007ff9`b171b46f eb05 jmp ConsoleAppBenchmark!ConsoleAppBenchmark.CopyToRunner.Span()+0x1766 (00007ff9`b171b476)
00007ff9`b171b471 e8f2f1d8ff call CLRStub[MethodDescPrestub]@7ff9b14aa668 (00007ff9`b14aa668)
00007ff9`b171b476 90 nop
00007ff9`b171b477 4883c428 add rsp,28h
00007ff9`b171b47b c3 ret
(Even though the Span code is larger than the Array code, it's basically the same as what Array.CopyTo
would've done, but inlined into the caller. So the benchmark results are nearly identical.)
Contrast this against the codegen for the generic CopyTo
test:
; CopyTo<string>.Array
00007ff9`b6fab690 48894c2408 mov qword ptr [rsp+8],rcx
00007ff9`b6fab695 488b4908 mov rcx,qword ptr [rcx+8]
00007ff9`b6fab699 4c8b442408 mov r8,qword ptr [rsp+8]
00007ff9`b6fab69e 498b5018 mov rdx,qword ptr [r8+18h]
00007ff9`b6fab6a2 458b4020 mov r8d,dword ptr [r8+20h]
00007ff9`b6fab6a6 e95d66d8ff jmp <Array.CopyTo(Array, Array, int)>
; CopyTo<string>.Span
00007ff9`b6fdbb40 57 push rdi
00007ff9`b6fdbb41 56 push rsi
00007ff9`b6fdbb42 4883ec68 sub rsp,68h
00007ff9`b6fdbb46 488bf1 mov rsi,rcx
00007ff9`b6fdbb49 488d7c2420 lea rdi,[rsp+20h]
00007ff9`b6fdbb4e b910000000 mov ecx,10h
00007ff9`b6fdbb53 33c0 xor eax,eax
00007ff9`b6fdbb55 f3ab rep stos dword ptr [rdi]
00007ff9`b6fdbb57 488bce mov rcx,rsi
00007ff9`b6fdbb5a 48894c2460 mov qword ptr [rsp+60h],rcx
00007ff9`b6fdbb5f 488bf1 mov rsi,rcx
00007ff9`b6fdbb62 488b0e mov rcx,qword ptr [rsi] ; classHnd
00007ff9`b6fdbb65 48ba80f407b7f97f0000 mov rdx,7FF9B707F480h ; signature
00007ff9`b6fdbb6f e88c88825f call CoreCLR!JIT_GenericHandleClass (00007ffa`16804400)
00007ff9`b6fdbb74 488bf8 mov rdi,rax
00007ff9`b6fdbb77 488bd7 mov rdx,rdi
00007ff9`b6fdbb7a 4c8b4608 mov r8,qword ptr [rsi+8]
00007ff9`b6fdbb7e 488d4c2440 lea rcx,[rsp+40h]
00007ff9`b6fdbb83 e810daffff call <Span<__Canon>.ctor(__Canon[])>
00007ff9`b6fdbb88 488b542440 mov rdx,qword ptr [rsp+40h]
00007ff9`b6fdbb8d 4889542450 mov qword ptr [rsp+50h],rdx
00007ff9`b6fdbb92 8b542448 mov edx,dword ptr [rsp+48h]
00007ff9`b6fdbb96 89542458 mov dword ptr [rsp+58h],edx
00007ff9`b6fdbb9a 488bd7 mov rdx,rdi
00007ff9`b6fdbb9d 4c8b4618 mov r8,qword ptr [rsi+18h]
00007ff9`b6fdbba1 488d4c2430 lea rcx,[rsp+30h]
00007ff9`b6fdbba6 e8edd9ffff call CLRStub[MethodDescPrestub]@7ff9b6fd9598 (00007ff9`b6fd9598)
00007ff9`b6fdbbab 488d4c2450 lea rcx,[rsp+50h]
00007ff9`b6fdbbb0 488bd7 mov rdx,rdi
00007ff9`b6fdbbb3 4c8d442420 lea r8,[rsp+20h]
00007ff9`b6fdbbb8 488b442430 mov rax,qword ptr [rsp+30h]
00007ff9`b6fdbbbd 498900 mov qword ptr [r8],rax
00007ff9`b6fdbbc0 8b442438 mov eax,dword ptr [rsp+38h]
00007ff9`b6fdbbc4 41894008 mov dword ptr [r8+8],eax
00007ff9`b6fdbbc8 4c8d442420 lea r8,[rsp+20h]
00007ff9`b6fdbbcd e85edaffff call CLRStub[MethodDescPrestub]@7ff9b6fd9630 (00007ff9`b6fd9630)
00007ff9`b6fdbbd2 90 nop
00007ff9`b6fdbbd3 4883c468 add rsp,68h
00007ff9`b6fdbbd7 5e pop rsi
00007ff9`b6fdbbd8 5f pop rdi
00007ff9`b6fdbbd9 c3 ret
; Span<__Canon>.ctor(__Canon[])
00007ff9`b6fdba50 57 push rdi
00007ff9`b6fdba51 56 push rsi
00007ff9`b6fdba52 55 push rbp
00007ff9`b6fdba53 53 push rbx
00007ff9`b6fdba54 4883ec28 sub rsp,28h
00007ff9`b6fdba58 c5f877 vzeroupper
00007ff9`b6fdba5b 4889542420 mov qword ptr [rsp+20h],rdx
00007ff9`b6fdba60 488bd9 mov rbx,rcx
00007ff9`b6fdba63 488bfa mov rdi,rdx
00007ff9`b6fdba66 498bf0 mov rsi,r8
00007ff9`b6fdba69 4885f6 test rsi,rsi
00007ff9`b6fdba6c 7511 jne System_Private_CoreLib!System.Span`1[[System.__Canon, System.Private.CoreLib]]..ctor(System.__Canon[])+0xffffffff`a3af5ddf (00007ff9`b6fdba7f)
00007ff9`b6fdba6e c5f857c0 vxorps xmm0,xmm0,xmm0
00007ff9`b6fdba72 c5fa7f03 vmovdqu xmmword ptr [rbx],xmm0
00007ff9`b6fdba76 4883c428 add rsp,28h
00007ff9`b6fdba7a 5b pop rbx
00007ff9`b6fdba7b 5d pop rbp
00007ff9`b6fdba7c 5e pop rsi
00007ff9`b6fdba7d 5f pop rdi
00007ff9`b6fdba7e c3 ret
00007ff9`b6fdba7f 488bce mov rcx,rsi
00007ff9`b6fdba82 e889988d5f call CoreCLR!ObjectNative::GetClass (00007ffa`168b5310)
00007ff9`b6fdba87 488be8 mov rbp,rax
00007ff9`b6fdba8a 488b4f30 mov rcx,qword ptr [rdi+30h]
00007ff9`b6fdba8e 488b09 mov rcx,qword ptr [rcx]
00007ff9`b6fdba91 488b4908 mov rcx,qword ptr [rcx+8]
00007ff9`b6fdba95 4885c9 test rcx,rcx
00007ff9`b6fdba98 7402 je System_Private_CoreLib!System.Span`1[[System.__Canon, System.Private.CoreLib]]..ctor(System.__Canon[])+0xffffffff`a3af5dfc (00007ff9`b6fdba9c)
00007ff9`b6fdba9a eb15 jmp System_Private_CoreLib!System.Span`1[[System.__Canon, System.Private.CoreLib]]..ctor(System.__Canon[])+0xffffffff`a3af5e11 (00007ff9`b6fdbab1)
00007ff9`b6fdba9c 488bcf mov rcx,rdi
00007ff9`b6fdba9f 48ba40060ab7f97f0000 mov rdx,7FF9B70A0640h
00007ff9`b6fdbaa9 e85289825f call CoreCLR!JIT_GenericHandleClass (00007ffa`16804400)
00007ff9`b6fdbaae 488bc8 mov rcx,rax
00007ff9`b6fdbab1 e8eaed825f call CoreCLR!JIT_GetRuntimeType (00007ffa`1680a8a0)
00007ff9`b6fdbab6 483bc5 cmp rax,rbp
00007ff9`b6fdbab9 7516 jne System_Private_CoreLib!System.Span`1[[System.__Canon, System.Private.CoreLib]]..ctor(System.__Canon[])+0xffffffff`a3af5e31 (00007ff9`b6fdbad1)
00007ff9`b6fdbabb 488d4610 lea rax,[rsi+10h]
00007ff9`b6fdbabf 488903 mov qword ptr [rbx],rax
00007ff9`b6fdbac2 8b4608 mov eax,dword ptr [rsi+8]
00007ff9`b6fdbac5 894308 mov dword ptr [rbx+8],eax
00007ff9`b6fdbac8 4883c428 add rsp,28h
00007ff9`b6fdbacc 5b pop rbx
00007ff9`b6fdbacd 5d pop rbp
00007ff9`b6fdbace 5e pop rsi
00007ff9`b6fdbacf 5f pop rdi
00007ff9`b6fdbad0 c3 ret
00007ff9`b6fdbad1 e8a2e2d9ff call CLRStub[MethodDescPrestub]@7ff9b6d79d78 (00007ff9`b6d79d78)
00007ff9`b6fdbad6 cc int 3
I didn't even dump all the relevant codegen since it's so big, but a few things immediately stand out:
Span<T>.ctor(T[])
and Buffer.Memmove(__Canon&, __Canon&, ulong)
aren't being inlined into their caller.RuntimeType
instance from the type handle, possibly due to the array.GetType() != typeof(T[])
check at the beginning of the ctor.<< Partial stack trace showing memmove not inlined >>
00 0000009a`7337ce08 00007ff9`b6fdbc56 System_Private_CoreLib!System.Buffer.Memmove[[System.__Canon, System.Private.CoreLib]](System.__Canon ByRef, System.__Canon ByRef, UInt64)+0xffffffff`a3b81660 [C:\runtime\src\libraries\System.Private.CoreLib\src\System\Buffer.cs @ 338]
01 0000009a`7337ce10 00007ff9`b6fdbbd2 System_Private_CoreLib!System.Span`1[[System.__Canon, System.Private.CoreLib]].CopyTo(System.Span`1<System.__Canon>)+0xffffffff`a3af5976 [C:\runtime\src\libraries\System.Private.CoreLib\src\System\Span.cs @ 363]
/cc @EgorBo In case he has any suggestions.
This is the well-known limitation of inlining generics into generics. @davidwrighton started working on fixing it in https://github.com/dotnet/runtime/pull/31833
Cross-class shared methods don't compose well, and in particular in CoreCLR there are inlining restrictions. See eg #7580. For Span we worked around this for the indexer but not for the constructor.
@davidwrighton is looking to relax some of these restrictions in #31833.
array.GetType() != typeof(T[])
so if I understand it correctly, it's a Co(ntra)variance check. For String
as T
in Span<T>(T[] array)
there is no way this check will be true, right? So the jit actually should be able to eliminate it, shouldn't it? (it currently doesn't)
The jit doesn't see string
here, it sees __Canon
. If we inline enough we can sometimes get all this into unshared code and figure out the actual types, but not always...
That part is potentially easier to fix.
Over in gtFoldTypeCompare
for comparison of a handle to an unknown, we can see what we know about the type of the unknown, likely running it through impIsClassExact
after updating that to know about some effectively sealed array types.
I think I have parts of it coded up somewhere...
That part is potentially easier to fix.
Over in
gtFoldTypeCompare
for comparison of a handle to an unknown, we can see what we know about the type of the unknown, likely running it throughimpIsClassExact
after updating that to know about some effectively sealed array types.I think I have parts of it coded up somewhere...
yeah I was just playing with gtFoldTypeCompare
like this:
so I guess if we can obtain the array element type of that local (we call .GetType from) and if it's sealed -- optimize? (to GT_NULLCHECK?)
@AndyAyersMS it seems impIsClassExact
returns false for System.String[]
cls (it even has a comment We are conservative on arrays here. It might be worth checking for types like int[].
UPD: now it supports arrays:
bool Compiler::impIsClassExact(CORINFO_CLASS_HANDLE classHnd)
{
DWORD flags = info.compCompHnd->getClassAttribs(classHnd);
DWORD flagsMask = CORINFO_FLG_FINAL | CORINFO_FLG_MARSHAL_BYREF | CORINFO_FLG_CONTEXTFUL | CORINFO_FLG_VARIANCE;
if ((flags & flagsMask) != CORINFO_FLG_FINAL)
{
return false;
}
if (flags & CORINFO_FLG_ARRAY)
{
CORINFO_CLASS_HANDLE arrayElementHandle = nullptr;
info.compCompHnd->getChildType(classHnd, &arrayElementHandle);
return impIsClassExact(arrayElementHandle);
}
return true;
}
UPD2 hm.. it seems this change alone produces jit-diffs (improvements)
The change to impClassIsExact
is not safe for int-typed arrays (despite what the comment says) since we allow equivalence between int[]
and uint[]
and likewise for integral types.
So we'd need to restrict the array cases somewhat, or pose the question differently.
For gtFoldTypeCompare
the best way to fetch what we know about a type is via gtGetClassHandle
, it can deal with a wider variety of tree types and knows how some variables are defined. Also yes we need to watch for potential exceptions....
@AndyAyersMS I have a draft: https://github.com/EgorBo/runtime-1/commit/0141547f2dbc0705941b9854e488b100781fc4ce
Test:
static bool Test(string[] array)
{
return array.GetType() != typeof(string[]);
}
mov rax, 0xD1FFAB1E
cmp qword ptr [rcx], rax
setne al
movzx rax, al
ret
cmp dword ptr [rcx], ecx ;; null-check, can be eliminated in real code
xor eax, eax
ret
I reran the tests again today with the latest master, and I'm seeing that the Array and Span scenarios are nearly aligned in terms of performance.
Method | Size | Mean | Error | StdDev | Ratio | RatioSD |
---|---|---|---|---|---|---|
Array | 2048 | 434.1 ns | 6.26 ns | 5.55 ns | 1.00 | 0.00 |
Span | 2048 | 453.1 ns | 8.73 ns | 9.34 ns | 1.04 | 0.03 |
But when I dumped the codegen, it was identical to what I posted earlier at https://github.com/dotnet/runtime/issues/29093#issuecomment-590572852. So I'm not quite sure what happened here and I'm suspicious of my numbers.
Just a note: https://github.com/dotnet/runtime/pull/32790 should not affect CopyTo codegen yet as far as I understand - maybe with https://github.com/dotnet/runtime/pull/31833 merged.
How to run the benchmarks:
Full docs for the new benchmarking workflow: https://github.com/dotnet/performance/blob/master/docs/benchmarking-workflow-corefx.md
category:cq theme:optimization skill-level:expert cost:medium