Open neon-sunset opened 10 months ago
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch See info in area-owners.md if you want to be subscribed.
| Author: | neon-sunset |
|---|---|
| Assignees: | - |
| Labels: | `tenet-performance`, `area-CodeGen-coreclr`, `untriaged` |
| Milestone: | - |
We do not emit conditional select instructions inside loops today because we do not have the necessary framework in place to make a proper prediction on whether that is profitable or not. Branches can (and do in real world examples) outperform conditional select instructions due to branch prediction. See https://github.com/dotnet/runtime/issues/55364#issuecomment-1092894196 for more details.
This might be true for older ARM cores but not necessarily for the new ones. For example, Firestorm (2020) and newer cores on osx-arm64 have a reciprocal throughput for `cinc` of 0.333 cycles (about 3 per cycle). That is higher throughput than branches, which are at 0.5 for not taken (2 per cycle) and just 1 for taken (1 per cycle). This means that in this instance `cinc` should in theory be unconditionally cheaper (at equal codegen size).
https://dougallj.github.io/applecpu/measurements/firestorm/CINC_64.html
The Cortex-X1 software optimization guide likewise describes throughput at 4 conditional selects/compares per cycle vs 2 branch instructions: https://documentation-service.arm.com/static/6238b0f88804d00769e9dee6
Reading through the linked comment, the advice given by ARM may not necessarily apply (or perhaps something was lost in communication?) - given similar codegen size, branches are likely to be more profitable under the following conditions:
- Blocks under conditionals span multiple (not trivially cheap) instructions
- The loop body has only a single taken branch or two not-taken branches
- Condition rarely or never changes, or changes in a predictable pattern, so that misprediction rate is minimal
Another important consideration is the dependency chain the conditional instruction is involved in and whether or not it is on the critical path of the loop. We do not currently have something in place to evaluate that.
Therefore, there may be a performance win to be had from a simple heuristic that expands the instances where `csinc`/`cinc`/`csel`/etc. kick in.
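For illustration, the kind of loop being discussed might look like the sketch below. The class and method names are hypothetical and not taken from the issue, and which instructions the JIT actually emits for either form is not guaranteed; the point is only that both loops express the same compare -> increment, once as a branch and once as a select.

```csharp
using System;

static class CountExample
{
    // Written as a branch: the increment is skipped when the condition is false.
    public static int CountAboveBranch(ReadOnlySpan<int> values, int threshold)
    {
        int count = 0;
        foreach (int v in values)
        {
            if (v > threshold)
            {
                count++;
            }
        }
        return count;
    }

    // Written as a select: a 0 or 1 is always added, which is the shape
    // a conditional increment/select instruction could implement.
    public static int CountAboveSelect(ReadOnlySpan<int> values, int threshold)
    {
        int count = 0;
        foreach (int v in values)
        {
            count += v > threshold ? 1 : 0;
        }
        return count;
    }
}
```

If the comparison result is hard to predict (e.g. random data), the select form is the one that would be expected to benefit; with a condition that almost never changes, the branch form may well be just as fast or faster, which is why a heuristic is needed.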
Yes, performance will benefit in many cases from using conditional instructions within loops. I think getting the heuristic right is not going to be simple -- in particular around the dependency chain and around ensuring the quality of the PGO data we want to base the decisions on. And if we get the decision wrong, it can have very significant negative impact on performance (#82106 was a 40% regression in a core span function from a single misplaced if-conversion).
Note that @a74nh works for Arm and originally implemented if-conversion within RyuJIT. He also did some experiments in https://github.com/dotnet/performance/pull/3078.
That is true, thanks. In this issue there's also a note regarding not emitting `cmp, cinc` over `mov, cmp, csel`. Do you want me to submit a separate issue or does this one suffice?
Feel free to open a separate issue for it. Hopefully I can take a look soon. We should be able to emit `csinc` since #91262, so I'm not sure what goes wrong here. Even the following does not produce `csinc`:
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int RuneLength(in byte value)
    {
        int lzcnt = BitOperations.LeadingZeroCount(~((uint)value << 24));
        return lzcnt == 0 ? lzcnt + 1 : lzcnt;
    }
yet this (incorrect) rewrite does:
    [MethodImpl(MethodImplOptions.NoInlining)]
    public static int RuneLength(in byte value)
    {
        int lzcnt = BitOperations.LeadingZeroCount(~((uint)value << 24));
        return lzcnt == 0 ? lzcnt : lzcnt + 1;
    }
I think it should be a simple fix.
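To make the "(incorrect)" label concrete, here is a minimal sketch that puts the two versions side by side. The class name and the `RuneLengthOriginal`/`RuneLengthSwapped` method names are hypothetical; the bodies are the two snippets from this comment. Swapping the arms of the conditional changes the result for every input, so the second version is only useful for inspecting codegen.

```csharp
using System;
using System.Numerics;

static class RuneLengthComparison
{
    // First snippet: a leading-zero count of 0 is mapped to 1,
    // any other count is returned unchanged.
    public static int RuneLengthOriginal(in byte value)
    {
        int lzcnt = BitOperations.LeadingZeroCount(~((uint)value << 24));
        return lzcnt == 0 ? lzcnt + 1 : lzcnt;
    }

    // Second snippet (arms swapped): 0 stays 0 and every other count
    // is incremented, so the results no longer match the first snippet.
    public static int RuneLengthSwapped(in byte value)
    {
        int lzcnt = BitOperations.LeadingZeroCount(~((uint)value << 24));
        return lzcnt == 0 ? lzcnt : lzcnt + 1;
    }
}
```

For example, for the byte `0x41` the inverted, shifted value has its top bit set, so the count is 0: the first version returns 1 and the swapped one returns 0. For `0xC3` the count is 2: the first version returns 2 and the swapped one returns 3.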
@jakobbotsch I have updated the issue to reflect the possible improvement in the loop vs separate one for the cinc/csel form. Thank you for looking into this!
Won't get to this one in 9.0.
Description
It appears that for a straightforward compare -> increment, .NET emits a branch instead of `cinc` when a method containing such code is inlined inside a loop body.
Analysis
Given a method `RuneLength`:
This compiles to:
We can see that .NET emits `mov` + `cmp` + `csel` here. This isn't `cinc` but it's a good start.
However, if the method is inlined inside a loop body, the codegen is different. Consider the below method:
Instead of the pattern above, the codegen changes to `cbnz label` and `mov 1`, which is more compact but more expensive because it uses branch execution units, which have lower throughput per cycle than `csel`, which uses integer units instead (on modern ARM cores like Firestorm or Cortex-X1):
Configuration
Regression?
No
Happy New Year!