Open performanceautofiler[bot] opened 1 month ago
Related regressions: Linux x64: https://github.com/dotnet/perf-autofiling-issues/issues/39775, https://github.com/dotnet/perf-autofiling-issues/issues/39746 Windows x64: https://github.com/dotnet/perf-autofiling-issues/issues/39770 Linux Arm64: https://github.com/dotnet/perf-autofiling-issues/issues/39883 (Only the IterateForEach test), https://github.com/dotnet/perf-autofiling-issues/issues/40387 Windows Arm64: https://github.com/dotnet/perf-autofiling-issues/issues/41043
Tagging subscribers to this area: @dotnet/area-system-collections See info in area-owners.md if you want to be subscribed.
Likely caused by: https://github.com/dotnet/runtime/pull/106185 @jakobbotsch. Seems like this may be a necessary bug fix.
Not exactly... I will investigate for .NET 9.
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch See info in area-owners.md if you want to be subscribed.
System.Collections.IterateForEach<String>.HashSet(Size: 512)
Hot functions:
System.Collections.IterateForEach`1.HashSet
(Tier-1)
System.Collections.IterateForEach<String>.Dictionary(Size: 512)
Hot functions:
System.Collections.IterateForEach`1.Dictionary
(Tier-1)
This looks like we do a new CSE that leads to different register allocation. In the baseline we have
***** BB02 [0001]
STMT00045 ( ??? ... ??? )
N004 ( 20, 19) CSE #05 (def)[000017] --CXG------ â CALL help long CORINFO_HELP_RUNTIMEHANDLE_CLASS $402
N002 ( 3, 2) CSE #04 (def)[000015] #--X------- arg0 in rcx ââââ IND long $400
N001 ( 1, 1) [000014] !---------- â ââââ LCL_VAR ref V00 this u:1 $100
N003 ( 3, 10) [000016] H------N--- arg1 in rdx ââââ CNS_INT(h) long 0x7ffe78ac4250 global ptr $43
while in the diff we have removed this call earlier, so we end up with
STMT00045 ( ??? ... ??? )
N002 ( 3, 2) CSE #04 (def)[000015] #--X------- â IND long $400
N001 ( 1, 1) [000014] !---------- ââââ LCL_VAR ref V00 this u:1 $100
Then we see CSE kicking in for the diff, but not for the base:
-Considering CSE #04 {$143, $341} [def=51469.406974, use=100.000000, cost= 3, call]
-CSE Expression :
-N002 ( 3, 2) CSE #04 (def)[000015] #--X------- â IND long $400
-N001 ( 1, 1) [000014] !---------- ââââ LCL_VAR ref V00 this u:1 $100
-
-Aggressive CSE Promotion (103038.813948 >= 600.000000)
-cseRefCnt=103038.813948, aggressiveRefCnt=600.000000, moderateRefCnt=100.000000
-defCnt=51469.406974, useCnt=100.000000, cost=3, size=2, LiveAcrossCall
-def_cost=1, use_cost=1, extra_no_cost=2, extra_yes_cost=0
-CSE cost savings check (302.000000 >= 51569.406974) fails
-Did Not promote this CSE
+Considering CSE #04 {$143, $341} [def=51472.623061, use=51472.623061, cost= 3 ]
+CSE Expression :
+N002 ( 3, 2) CSE #04 (def)[000015] #--X------- â IND long $400
+N001 ( 1, 1) [000014] !---------- ââââ LCL_VAR ref V00 this u:1 $100
+
+Aggressive CSE Promotion (154417.869184 >= 600.000000)
+cseRefCnt=154417.869184, aggressiveRefCnt=600.000000, moderateRefCnt=100.000000
+defCnt=51472.623061, useCnt=51472.623061, cost=3, size=2
+def_cost=1, use_cost=1, extra_no_cost=4, extra_yes_cost=0
+CSE cost savings check (154421.869184 >= 102945.246123) passes
+
+Promoting CSE:
These are the defs/uses of the CSE in the diff:
Labeling the CSEs with Use/Def information
BB02 [000015] Def of CSE #04 [weight=514.73]
BB07 [000028] Use of CSE #04 [weight=513.73]
BB09 [000223] Use of CSE #04 [weight=1.00]
BB10 [000052] Def of CSE #04 [weight=0]
It turns out that both the high weight def and use can later be optimized out (they have no uses and we end up proving they are non-faulting). That happens in the baseline. But in the diff since we introduced the CSE the high-weight def ends up being live due to the low weight use.
Not really sure what we can do here, if anything. This change is a perf fix around InlineArray
, but it's quite unfortunate to take a perf regression for enumeration over dictionaries in shared generic methods as part of it... Any ideas @AndyAyersMS?
Perhaps one possibility could be to have CSE not count the weight of IND
uses if the value of those uses are not used, essentially anticipating that not doing the CSE will lead to assertion prop removing that indirection anyway. Let me try that out...
Perhaps one possibility could be to have CSE not count the weight of
IND
uses if the value of those uses are not used, essentially anticipating that not doing the CSE will lead to assertion prop removing that indirection anyway. Let me try that out...
Seems plausible, though you probably want to count the weight of the GTF_MAKE_CSE instances (they will be unused).
That change has a bit larger diffs than I'd like to take at this stage of .NET 9. And on perfscore it seems regressions outweigh improvements.
I also tried another change where we directly bash the uses to nops as part of availability, but it has basically the same diffs.
I think we should just accept this regression. I would prefer that we keep #106185 in since it is fixing questionable logic, and the way it leads to the regression here is certainly not because the baseline was actively trying to do something clever, rather it just happened to avoid this CSE by chance. I will move it to 10.0 since we can perhaps experiment with my change to account for the unused nodes in CSE heuristics in 10.0.
@EgorBot -intel -commit 15e96faf4558b017ea8df1dc28d9b2169f0badc0 vs previous --filter System.Collections.IterateForEach
Run Information
Regressions in System.Collections.IterateForEach<String>
Test Report
Repro
General Docs link: https://github.com/dotnet/performance/blob/main/docs/benchmarking-workflow-dotnet-runtime.md