Closed jakobbotsch closed 1 year ago
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch See info in area-owners.md if you want to be subscribed.
Author: | jakobbotsch |
---|---|
Assignees: | - |
Labels: | `area-CodeGen-coreclr` |
Milestone: | - |
cc @dotnet/jit-contrib
The kinds of struct locals we can promote appear only as operands of GT_ASG
, GT_CALL
and GT_RETURN
, For GT_ASG
we should be able to decompose directly as part of the generalized promotion pass. That leaves GT_CALL
and GT_RETURN
.
My thinking for a while was to introduce a new node that represented the assembling of a local and its constituent fields, similar to GT_FIELD_LIST
. However, it would need an equivalent to appear on the LHS of assignments from calls, and of course downstream passes would need to be taught to handle this. Also, they would only be necessary on platforms with multi-reg args/returns.
After thinking some more I've come back to another idea which I think I will investigate. Instead of introducing a new node in HIR that persists until lowering, we introduce the equivalent of GT_PUTARG_REG
for GT_CALL
results and GT_RETURN
. That is, we would add nodes GT_GETRES_REG
that represents using one of the register results of a GT_CALL
node, and a GT_PUTRET_REG
that represents placing a value into one of the return registers. LSRA would get similar handling for these nodes as it has for GT_PUTARG_REG
today (essentially that these registers are busy until they are used (for call)/or until a GT_RETURN
is encountered).
For the generalized struct promotion pass we would then only introduce new assignments: to pass one of these struct locals to a call, or to return one of them, it will just write all the constituent fields back into the struct local and leave the local in the call argument/GT_RETURN
.
A priori this would just amount to a lot of unnecessary copying in the generated code. To get to an acceptable state, we would introduce an optimization in lowering to handle these patterns without copying. For example, after generalized struct promotion + rationalization, we would have LIR like:
STORE_LCL_FLD V00 [0..8), <some operand, usually a GT_LCL_VAR>
STORE_LCL_FLD V00 [8..16), <some operand 2>
t0 = LCL_VAR V00
CALL t0
we would do some analysis to figure out whether V00
is dead after this call (potentially even more precisely whether V00 [0..8)
/V00 [8..16)
are dead). If yes, then we would transform this into
t0 = PUTARG_REG <some operand>
t1 = PUTARG_REG <some operand 2>
CALL t0, t1 // with FIELD_LIST or however the representation is today
Similarly for call results, e.g. we would get LIR like the following for generalized promotion after rationalization:
STORE_LCL_VAR V00 (CALL abc)
STORE_LCL_VAR V01 (LCL_FLD V00 [0..8))
STORE_LCL_VAR V02 (LCL_FLD V00 [8..16))
and, as an optimization, lower it into:
CALL abc
STORE_LCL_VAR V01 (GETRES_REG rax)`
STORE_LCL_VAR V02 (GETRES_REG rdx)`
Some questions to investigate:
FIELD_LIST
transformation we do for our normal promotion)The kinds of struct locals we can promote appear only as operands of GT_ASG, GT_CALL and GT_RETURN
What about for GT_HWINTRINSIC
, particularly in the case of things like struct M2x4 { Vector4 X; Vector4 Y; }
or similar?
Same for cases like GT_INTRINSIC
which were originally calls and which may become calls again in some cases (e.g. Math.Pow
if constant folding can't happen).
What about for
GT_HWINTRINSIC
, particularly in the case of things likestruct M2x4 { Vector4 X; Vector4 Y; }
or similar?
What particular intrinsics take arbitrary struct arguments? Can they be decomposed early like ASG
would be?
There is https://github.com/dotnet/runtime/pull/80297, where we are handling an intrinsic that essentially has a "multi-reg arg" via early decomposition into a FIELD_LIST
.
Single beat me to it, that was the example I was going to give 😄
I don't see why the existing approach there wouldn't continue to work. The representation for call args here could also be GT_FIELD_LIST
but it would require very early ABI handling for the struct args that I am not a fan of.
A similar node would be needed for parameters. Generalized promotion would create IR in the start of functions to load (parts of) parameters into the promoted field locals and lowering would optimize these into some GT_GETPARAM_REG
node when possible. Then likely we would use the same source of liveness when homing parameters to figure out if we can avoid homing some of the struct parameter.
Some measurements over asp.net for block copies/inits and whether they involve promoted structs:
Copies physical -> physical: 3
Copies physical -> old: 283
Copies old -> physical: 250
Copies physical -> : 65
Copies -> physical: 268
Inits -> physical: 37
("old" means structs that are promoted by the normal mechanism)
It would be great to reuse block morphing to do the decomposition, but I'm not sure how simple that would be -- the decomposition for copies involving physically promoted structs is quite a bit more complicated.
Same measurements with old promotion disabled:
Copies physical -> physical: 162
Copies physical -> old: 0
Copies old -> physical: 0
Copies physical -> : 1332
Copies -> physical: 6034
Inits -> physical: 99
We frequently see promotion opportunities for standard C# code iterating lists via List<T>.Enumerator
, e.g.:
https://github.com/dotnet/aspnetcore/blob/8968058c9e5fdfdd1242426a03dc80609997edab/src/Servers/Kestrel/Core/src/Internal/Infrastructure/KestrelConnection.cs#L51-L54
Where the codegen for the loop ends up with the following diff:
@@ -1,35 +1,33 @@
G_M45156_IG09:
- mov rax, gword ptr [rbp-28H]
- mov rdx, gword ptr [rbp-20H]
+ mov rax, gword ptr [rbp-38H]
+ mov rdx, gword ptr [rbp-30H]
mov rcx, gword ptr [rax+08H]
call [rax+18H]System.Action`1[System.__Canon]:Invoke(System.__Canon):this
;; size=15 bbWeight=1 PerfScore 7.00
G_M45156_IG10:
- mov rcx, gword ptr [rbp-38H]
- mov esi, dword ptr [rbp-2CH]
- mov edi, dword ptr [rcx+14H]
- cmp esi, edi
+ mov rcx, rsi
+ mov r14d, dword ptr [rcx+14H]
+ cmp edi, r14d
jne SHORT G_M45156_IG14
- mov edx, dword ptr [rbp-30H]
- cmp edx, dword ptr [rcx+10H]
+ cmp ebx, dword ptr [rsi+10H]
jae SHORT G_M45156_IG15
- ;; size=22 bbWeight=2 PerfScore 20.50
+ ;; size=17 bbWeight=2 PerfScore 15.00
G_M45156_IG11:
- mov rcx, gword ptr [rcx+08H]
- mov eax, edx
- cmp eax, dword ptr [rcx+08H]
+ mov rcx, gword ptr [rsi+08H]
+ cmp ebx, dword ptr [rcx+08H]
jae SHORT G_M45156_IG08
- shl rax, 4
+ mov edx, ebx
+ shl rdx, 4
;; size=15 bbWeight=1 PerfScore 6.75
G_M45156_IG12:
- vmovdqu xmm0, xmmword ptr [rcx+rax+10H]
- vmovdqu xmmword ptr [rbp-28H], xmm0
+ vmovdqu xmm0, xmmword ptr [rcx+rdx+10H]
+ vmovdqu xmmword ptr [rbp-38H], xmm0
;; size=11 bbWeight=1 PerfScore 5.00
G_M45156_IG13:
- inc edx
- mov dword ptr [rbp-30H], edx
+ inc ebx
jmp SHORT G_M45156_IG09
- ;; size=7 bbWeight=1 PerfScore 3.25
+ ;; size=4 bbWeight=1 PerfScore 2.25
G_M45156_IG14:
- cmp esi, edi
+ cmp edi, r14d
jne SHORT G_M45156_IG07
Investigating some current causes of regressions when enabling physical promotion by default.
(edit: handled by #87265)
aspnet.run.windows.x64.checked.mch:
(edit: partially handled by #87217, rest will be handled by #87410)
(edit: not expected to be handled)
(edit: tracked by #87554)
(edit: not expected to be handled)
With the perflab runs @cincuranet set up and a query from @AndyAyersMS I can start looking at micro benchmark regressions. The following lists all benchmarks with a ratio below 0.95, indicating that they regress by more than 5%. There are 56 entries in this list (for comparison, the query for benchmarks that improve by more than 5% returns 267 results, but take it with a grain of salt as many of these are noisy). The quality columns are computed as median divided by standard deviation, so larger numbers indicate more stable benchmarks.
Notes | Benchmark | Ratio | Promotion median | Default median | Promotion quality | Default quality |
---|---|---|---|---|---|---|
Bimodal | PerfLabTests.CastingPerf.CheckObjIsInterfaceNo | 0.50201266706593484 | 62373.178950863221 | 31312.125918503676 | 4.1721551887971193 | 2.0068501317441818 |
Bimodal | PerfLabTests.CastingPerf.CheckIsInstAnyIsInterfaceYes | 0.5020356039475935 | 62373.964854866252 | 31313.95111651875 | 4.2768060071352947 | 2.0060349494210894 |
Bimodal | PerfLabTests.CastingPerf.CheckObjIsInterfaceYes | 0.50204105771866514 | 62373.932840068293 | 31314.275217100869 | 4.2853027310745686 | 2.0074065296253232 |
Bimodal | PerfLabTests.CastingPerf.CheckIsInstAnyIsInterfaceNo | 0.50216288733931325 | 62373.730079681263 | 31321.772390935719 | 4.0927570892698251 | 2.0097825475492144 |
Bimodal | PerfLabTests.LowLevelPerf.EmptyStaticFunction | 0.7374686274234068 | 2604944.2708333335 | 1921064.6759259258 | 7.1867237699720317 | 3.0824059321076045 |
Noisy | MicroBenchmarks.Serializers.Xml_ToStream<MyEventsListerViewModel>.XmlSerializer_ |
0.74302975696458551 | 660707.35294117662 | 490925.2238805971 | 4.9061924677629705 | 3.7168701617547995 |
Maybe? Need more data | System.Memory.Constructors<Byte>.ArrayAsMemory |
0.79356832092520146 | 2.3856637796570128 | 1.8931871999144854 | 6.3931791008772434 | 4.7934943297225 |
Bimodal | System.Memory.Constructors<String>.MemoryMarshalCreateSpan |
0.80252756077275078 | 1.5784818512235037 | 1.2667751897864545 | 5.5628662599349337 | 5.0711645321790915 |
Bimodal | System.Memory.Constructors<String>.MemoryMarshalCreateReadOnlySpan |
0.80496260867919744 | 1.5785205328904912 | 1.2706500060092067 | 5.7538606845191165 | 4.9616066679124549 |
Bimodal | PerfLabTests.CastingPerf2.CastingPerf.IntObj | 0.8285766871531669 | 226939.75845410625 | 188036.99324324325 | 12.390298163607353 | 9.8413406239725 |
Regression (vec cns reuse) | System.Numerics.Tests.Perf_Matrix4x4.IsIdentityBenchmark | 0.83234334609059291 | 1.2211270243560317 | 1.0163969534541484 | 10.778735229862832 | 7.8101018258309587 |
Multimodal | System.Collections.ContainsFalse<Int32>.Span(Size: 512) |
0.845401219708253 | 29126.453232893906 | 24623.539088863898 | 11.021424747595237 | 13.417315107098991 |
Maybe? Need more data | System.Collections.TryGetValueTrue<Int32, Int32>.Dictionary(Size: 512) |
0.84571433947519314 | 3795.9668312075878 | 3210.3035813244669 | 7.9431221125054137 | 6.7635448167816792 |
Bimodal | System.Tests.Perf_Enum.IsDefined_Generic_NonFlags | 0.85163797170392563 | 3.1193090478744949 | 2.6565220306495383 | 8.8673428674969017 | 8.5158225518187 |
Maybe? need more data | Devirtualization.EqualityComparer.ValueTupleCompareNoOpt | 0.85240213125413522 | 5.8012179366666068 | 4.9449705330843328 | 9.3287988994501845 | 7.4879186079626407 |
Regression (pipelining) | System.Numerics.Tests.Perf_Matrix3x2.SubtractBenchmark | 0.85729075729610349 | 1.62796924888388 | 1.3956429902304301 | 44.749404427588338 | 12.091485745364375 |
Regression (pipelining) | System.Numerics.Tests.Perf_Matrix3x2.AddOperatorBenchmark | 0.85737159681747388 | 1.6278395951425713 | 1.3956634330500965 | 63.074207379712796 | 10.42100793946469 |
Regression (pipelining) | System.Numerics.Tests.Perf_Matrix3x2.SubtractOperatorBenchmark | 0.85739598139557893 | 1.627907761982518 | 1.3957615732064814 | 38.580582580347361 | 13.237060244441501 |
Regression (pipelining) | System.Numerics.Tests.Perf_Matrix3x2.AddBenchmark | 0.85795657072580145 | 1.6266881427332425 | 1.3956277805797357 | 26.052276152907275 | 9.560646773106166 |
Bimodal | System.MathBenchmarks.MathTests.DivRemInt32 | 0.85968833841252423 | 1.5068739322565785 | 1.2954419470188046 | 12.849580314598676 | 7.9300084772432688 |
Modal, but check again (only spiked with phys prom) | PerfLabTests.CastingPerf.ObjObjrefValueType | 0.8646586924544396 | 361425.11420265783 | 312509.36666666664 | 14.971126929257249 | 15.218101244318584 |
Like above | PerfLabTests.CastingPerf.FooObjIsNull | 0.86607371151458068 | 361202.81007751933 | 312828.25833333336 | 14.968123248057182 | 15.452934201966883 |
Bimodal | PerfLabTests.LowLevelPerf.GenericGenericMethod | 0.86612762496753892 | 187101.49305555556 | 162053.77180808882 | 15.107030942049 | 11.493496309257749 |
Modal, but check again (only spiked with phys prom) | PerfLabTests.CastingPerf.ObjInt | 0.86704330919561978 | 360542.60349025979 | 312606.05203619908 | 14.650520670993343 | 14.998381025621656 |
Like above | PerfLabTests.CastingPerf.ObjFooIsObj | 0.86731406352131957 | 360339.93371212116 | 312527.89215686271 | 14.967143545401138 | 15.27578920376382 |
Like above | PerfLabTests.CastingPerf.ObjScalarValueType | 0.86780419724880409 | 360123.47537878784 | 312516.66346153844 | 14.864352722727245 | 15.405796220386524 |
Maybe? Need more data | Microsoft.Extensions.Primitives.Performance.StringValuesBenchmark.ForEach_Array | 0.86930747689206089 | 5.2903566301283576 | 4.5989465739960682 | 30.026761512990653 | 10.35044745664271 |
Bimodal regression and improvement | System.Text.RegularExpressions.Tests.Perf_Regex_Industry_RustLang_Sherlock.Count(Pattern: "Sherlock|Holmes|Watson|Irene|Adler|John|Baker", Options: NonBacktracking) | 0.870338288956494 | 1567827.0673076923 | 1364539.9271402548 | 11.60696233951157 | 14.090632090500138 |
Regression (pipelining, #87554) | System.Numerics.Tests.Perf_Matrix3x2.MultiplyByMatrixOperatorBenchmark | 0.87078330292883266 | 4.5609575925251837 | 3.9716057169374164 | 100.14896735833135 | 10.364535617674489 |
Regression (pipelijning, #87554) | System.Numerics.Tests.Perf_Matrix3x2.MultiplyByMatrixBenchmark | 0.87095941903031426 | 4.5610356119008992 | 3.9724769267177811 | 129.83253482155473 | 9.5704310251011382 |
Noisy | System.Tests.Perf_String.Trim(s: "Test") | 0.87443817495914056 | 2.5474217375187354 | 2.2275628150071256 | 8.6368738268479444 | 10.77430293074069 |
Bimodal | Benchstone.BenchI.BubbleSort.Test | 0.88167317428377268 | 13708.670704845816 | 12086.567215552373 | 17.710410860982513 | 14.8836516447514 |
Bimodal | PerfLabTests.CastingPerf.FooObjIsFoo2 | 0.888850848188488 | 434724.572368421 | 386405.30487804877 | 16.696742581172664 | 14.025034933825713 |
Multimodal | System.Collections.ContainsKeyTrue<Int32, Int32>.Dictionary(Size: 512) |
0.90009475716937359 | 3514.2079858763432 | 3163.1201836900404 | 12.024653108378379 | 8.0678159153909181 |
Maybe? Need more data | PerfLabTests.CastingPerf.CheckArrayIsArrayByVariance | 0.90067954767068859 | 2.7450413012046524 | 2.4724025575063648 | 2.7009895041042808 | 1.7932475048783776 |
Maybe? Need more data | System.Memory.ReadOnlySequence.Slice_Start(Segment: Multiple) | 0.90964038072783471 | 3.4493846916202595 | 3.1376996041622176 | 8.9030611892736591 | 7.7368085557620461 |
Maybe? Need more data | System.Buffers.Tests.ReadOnlySequenceTests.FirstTenSegments | 0.91450624664500813 | 5.0942189368202619 | 4.6586950394994213 | 13.250302455948326 | 5.93807832331802 |
Maybe? Need more data | System.Numerics.Tests.Perf_Matrix3x2.EqualityOperatorBenchmark | 0.91505574135403611 | 1.7038860172829366 | 1.5591506827276136 | 18.225174899407335 | 16.6845459834044 |
Maybe? Need more data | System.Tests.Perf_Boolean.TryParse(value: "0") | 0.91676396690683648 | 3.2874644520224661 | 3.0138289521013255 | 12.304077459929706 | 9.4382205739902538 |
Noisy | System.Text.Perf_Utf8Encoding.GetBytes(Input: Chinese) | 0.9205757115569656 | 161762.16635338342 | 148914.32139376219 | 15.828556551677019 | 15.141829429651327 |
Noisy | System.Text.RegularExpressions.Tests.Perf_Regex_Cache.IsMatch(total: 40000, unique: 7, cacheSize: 0) | 0.9227115889387062 | 59276610 | 54695215 | 10.351467756318263 | 8.71913930818749 |
Noisy | BenchmarksGame.BinaryTrees_5.RunBench | 0.92322071325312827 | 179044422.5 | 165297519.44444445 | 22.266548353098084 | 20.890177523956297 |
Noisy | System.Memory.Span<Byte>.Clear(Size: 512) |
0.92331343747602457 | 6.4115322856679935 | 5.9198539141686277 | 2.4348688680033095 | 2.7869962552546239 |
Noisy | System.Text.RegularExpressions.Tests.Perf_Regex_Cache.IsMatch_Multithreading(total: 40000, unique: 7, cacheSize: 0) | 0.9236300501927136 | 18645578.75 | 17221616.836734693 | 13.580364545045281 | 12.069896150198117 |
Maybe? Need more data | Benchstone.BenchI.XposMatrix.Test | 0.93003313539222943 | 18078.169075144509 | 16813.296267107489 | 69.731706322545591 | 25.949469557869556 |
Noisy | Benchstone.BenchI.AddArray.Test | 0.93120132537564571 | 20210.202752976191 | 18819.767589681953 | 14.050293215847208 | 15.155225360820852 |
Noisy | System.Buffers.Tests.ReadOnlySequenceTests |
0.93132417841360293 | 4.6180075165850623 | 4.3008620562914261 | 8.1717473277435051 | 6.1868419993236987 |
Bimodal regression and improvement | System.Text.RegularExpressions.Tests.Perf_Regex_Industry_RustLang_Sherlock.Count(Pattern: "(?i)Sherlock|Holmes|Watson", Options: NonBacktracking) | 0.93138653274623273 | 2780731.38576779 | 2589935.763888889 | 14.820176511579046 | 20.266737041566497 |
Noisy | System.IO.MemoryMappedFiles.Tests.Perf_MemoryMappedFile.CreateFromFile_Read(capacity: 10000000) | 0.93154070667252664 | 43590.89447463768 | 40606.692643391521 | 17.251682799323238 | 14.914235278894841 |
Noisy | System.Collections.CtorDefaultSize<String>.Stack |
0.93671812574135571 | 17.861669480703526 | 16.731349558576181 | 16.985916186371369 | 14.140773457045354 |
Bimodal | Benchstone.BenchI.Fib.Test | 0.93955184884835363 | 159694.67229199369 | 150041.42460317462 | 20.749516161852206 | 22.520675763619309 |
Maybe? Need more data | Benchstone.BenchI.IniArray.Test | 0.940509770186889 | 67189541.25 | 63192420 | 5.0790460095421039 | 5.8211886020145069 |
Bimodal | System.Memory.Span |
0.94573448370538127 | 57.653966580812074 | 54.525344317871614 | 23.807900467166885 | 21.941204410312682 |
Maybe? Need more data | System.Runtime.Intrinsics.Tests.Perf_Vector128Int.GetHashCodeBenchmark | 0.94589448437699652 | 12.750723643616887 | 12.060839166312574 | 18.816588477582815 | 17.400192778480125 |
Maybe? Need more data | System.Collections.Tests.Perf_PriorityQueue<Int32, Int32>.HeapSort(Size: 1000) | 0.94921192013119116 | 75208.203125 | 71388.5228978979 | 38.440681648109951 | 23.375784409521643 |
Bimodal | System.Collections.Tests.Perf_Dictionary.ContainsValue(Items: 3000) | 0.94928848058446713 | 4249593.2471264368 | 4034089.916666667 | 22.034137694330397 | 18.666116210466772 |
Noisy (no asm diffs) | System.Buffers.Tests.ReadOnlySequenceTests<Byte>.FirstSingleSegment |
|||||
Noisy (no asm diffs) | System.Memory.ReadOnlySequence.Slice_Repeat_StartPosition_And_EndPosition(Segment: Multiple) |
|||||
Bimodal (no asm diffs) | System.Tests.Perf_Type.op_Equality |
Promoting TYP_SIMD32
and TYP_SIMD64
fields can be very expensive if we end up creating long lifetimes that span across calls where the upper halves need to be saved/restored. For example: https://gist.github.com/jakobbotsch/e09b0e75ecfac6934ae51c8902748491
The pass does not currently have the necessary information to try to take this into account, so need to think about what to do here.
Looking at this perfscore regression:
Description
Struct promotion (a.k.a. scalar replacement of aggregates) is an optimization that replaces structs with their constituent fields, allowing those fields to be optimized as if they were normal local variables. This is a very important optimization for low-level performance oriented code that makes heavy use of structs, so it is important that it is supported well by the JIT.
Limitations
The JIT supports promotion but with the following limitations today:
This issue is about removing (some of) these limitations.
Q1 work items
Q2 work items
Future work items
CQ
GetElement
/WithElement
for SIMDsGetElement
/WithElement
(https://github.com/dotnet/runtime/issues/76928#issuecomment-1582618505)Throughput
Related issues