dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
14.98k stars 4.66k forks source link

Inefficiencies with IComparer<T>.Compare vs operators #81220

Open stephentoub opened 1 year ago

stephentoub commented 1 year ago

Consider these three variations:

public static bool M1(int i, int j) => i <= j;

public static bool M2<T>(T i, T j) where T : IBinaryInteger<T> => i <= j;

public static bool M3(int i, int j) => Comparer<int>.Default.Compare(i, j) <= 0;

M1 and M2 (with T==int) both produce the nice:

    L0000: xor eax, eax
    L0002: cmp ecx, edx
    L0004: setle al
    L0007: ret

but M3 produces:

    L0000: cmp ecx, edx
    L0002: jl short L0013
    L0004: cmp ecx, edx
    L0006: jg short L001a
    L0008: xor eax, eax
    L000a: test eax, eax
    L000c: setle al
    L000f: movzx eax, al
    L0012: ret
    L0013: mov eax, 0xffffffff
    L0018: jmp short L000a
    L001a: mov eax, 1
    L001f: jmp short L000a

SharpLab

It'd be helpful if the JIT were able to unravel such an IComparer<T>.Default.Compare(...) followed by comparison to 0 into the equivalent of the direct operator usage.

ghost commented 1 year ago

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch, @kunalspathak See info in area-owners.md if you want to be subscribed.

Issue Details
Consider these three variations: ```C# public static bool M1(int i, int j) => i <= j; public static bool M2(T i, T j) where T : IBinaryInteger => i <= j; public static bool M3(int i, int j) => Comparer.Default.Compare(i, j) <= 0; ``` M1 and M2 (with T==int) both produce the nice: ```asm L0000: xor eax, eax L0002: cmp ecx, edx L0004: setle al L0007: ret ``` but M3 produces: ```asm L0000: cmp ecx, edx L0002: jl short L0013 L0004: cmp ecx, edx L0006: jg short L001a L0008: xor eax, eax L000a: test eax, eax L000c: setle al L000f: movzx eax, al L0012: ret L0013: mov eax, 0xffffffff L0018: jmp short L000a L001a: mov eax, 1 L001f: jmp short L000a ``` [SharpLab](https://sharplab.io/#v2:EYLgxg9gTgpgtADwGwBYA0AXEBDAzgWwB8BiAOwFcAbS7YSmAAgBMBLXW+gWACgABABga8AjCgDcPAUOEA6AHLl8MKCzC4JfQQGUAFtigAHADK0ZAJXKkMLJRskBmaUiEAmBgGEeAbx4M/QxxFnYAgISgYAWWEAChYrBhY0BPiAKwBKBgBeAD4EhgAeTIYUjX8GX38AbQApFgwAcRhSZVVojABPAxgIADNYqzS0gF0Kv15A4WDQ8IiXfIAVbOj5hKSV9IYAdx1lRhWQBgBJACE4/XbDqxgAc2UF3Jy8wuLS/1GApwYQsMj7foxVskARtHu4IPgDPo7nEMNkZAARGA9bBUDAyMEQqGxJIbZ78DQAXyAA=) It'd be helpful if the JIT were able to unravel such an `IComparer.Default.Compare(...)` followed by comparison to 0 into the equivalent of the direct operator usage.
Author: stephentoub
Assignees: -
Labels: `area-CodeGen-coreclr`
Milestone: -
EgorBo commented 1 year ago

@AndyAyersMS I've not checked the dump yet but it looks like it was supposed to be handled by your https://github.com/dotnet/runtime/pull/72979 ? (or https://github.com/dotnet/runtime/pull/76283)

AndyAyersMS commented 1 year ago

I took a quick look and RBO does not see anything it can handle:

*************** Starting PHASE Redundant branch opts

-----------------------------------------------------------------------------------------------------------------------------------------
BBnum BBid ref try hnd preds           weight   IBC  lp [IL range]     [jump]      [EH region]         [flags]
-----------------------------------------------------------------------------------------------------------------------------------------
BB01 [0000]  1                             1    100    [000..001)-> BB03 ( cond )                     i IBC 
BB02 [0011]  1       BB01                  0.14  14    [000..001)-> BB06 (always)                     i IBC 
BB03 [0012]  1       BB01                  0.36  36    [000..001)-> BB05 ( cond )                     i IBC 
BB04 [0013]  1       BB03                  0.03   3    [000..001)-> BB06 (always)                     i IBC 
BB05 [0014]  1       BB03                  0.33  33    [000..001)                                     i IBC 
BB06 [0015]  3       BB02,BB04,BB05        1    100    [000..013)        (return)                     i IBC 
-----------------------------------------------------------------------------------------------------------------------------------------

--- Trying RBO in BB03 ---
Relop [000043] BB03 value unknown, trying inference
Can infer LE from [false] dominating GE

Dominator BB01 of BB03 has same VN operands but different relop
N003 (  3,  3) [000038] J---G--N---                         *  GE        int    $100
N001 (  1,  1) [000037] -----------                         +--*  LCL_VAR   int    V04 tmp2         u:1 (last use) $80
N002 (  1,  1) [000027] -----------                         \--*  LCL_VAR   int    V01 arg1         u:1 $81
 Redundant compare; current relop:
N003 (  5,  4) [000043] J---G--N---                         *  LE        int    $101
N001 (  3,  2) [000041] -----------                         +--*  LCL_VAR   int    V04 tmp2         u:1 (last use) $80
N002 (  1,  1) [000042] -----------                         \--*  LCL_VAR   int    V01 arg1         u:1 (last use) $81
inference failed -- will keep looking higher
No usable PhiDef VNs

optRedundantRelop in BB01; jump tree is
N004 (  5,  5) [000039] ----G------                         *  JTRUE     void   $VN.Void
N003 (  3,  3) [000038] J---G--N---                         \--*  GE        int    $100
N001 (  1,  1) [000037] -----------                            +--*  LCL_VAR   int    V04 tmp2         u:1 (last use) $80
N002 (  1,  1) [000027] -----------                            \--*  LCL_VAR   int    V01 arg1         u:1 $81
 ... checking previous tree
N003 (  5,  4) [000034] -A------R--                         *  ASG       int    $VN.Void
N002 (  3,  2) [000033] D------N---                         +--*  LCL_VAR   int    V04 tmp2         d:1 $VN.Void
N001 (  1,  1) [000002] -----------                         \--*  LCL_VAR   int    V00 arg0         u:1 $80
 -- prev tree VN is not related

--- Trying RBO in BB01 ---

RBO doesn't look at predicates in returns; even if it did we'd need to do something like:

stephentoub commented 1 year ago

For reference, here's something much closer to my actual scenario: SharpLab

I'd like to be able to use Comparer<T>.Default.Compare with an unconstrained T (but I know by construction that I'll only end up here if T implements IComparable<T>), but the code ends up being worse than with T constrained to IBinaryInteger<T> and using the operators.

#nullable disable
#pragma warning disable 0649
using System.Collections.Generic;
using System.Numerics;
using SharpLab.Runtime;

[JitGeneric(typeof(int))]
public class C1<T> where T : struct, IBinaryInteger<T>
{
    private T _max;
    private T[] _values; // sorted

    public bool Contains(T value)
    {
        if (value <= _max)
        {
            T[] values = _values;
            for (int i = 0; i < values.Length; i++)
            {
                if (value <= values[i])
                {
                    if (value < values[i])
                    {
                        break;
                    }

                    return true;
                }
            }
        }

        return false;
    }
}

[JitGeneric(typeof(int))]
public class C2<T> where T : struct
{
    private T _max;
    private T[] _values; // sorted

    public bool Contains(T value)
    {
        if (Comparer<T>.Default.Compare(value, _max) <= 0)
        {
            T[] values = _values;
            for (int i = 0; i < values.Length; i++)
            {
                if (Comparer<T>.Default.Compare(value, values[i]) <= 0)
                {
                    if (Comparer<T>.Default.Compare(value, values[i]) < 0)
                    {
                        break;
                    }

                    return true;
                }
            }
        }

        return false;
    }
}
EgorBo commented 1 year ago

@AndyAyersMS here is a simplied version with no returns as Stephen posted above ^:

static int CompareTo(int a, int b)
{
    if (a < b) return -1;
    if (a > b) return 1;
    return 0;
}

[MethodImpl(MethodImplOptions.NoInlining)]
public static bool Test(int a, int b) => CompareTo(a, b) <= 0 ? true : false;

Current codegen:

; Method Program:Test(int,int):bool
G_M37083_IG01:
G_M37083_IG02:
       cmp      ecx, edx
       jl       SHORT G_M37083_IG04
G_M37083_IG03:
       cmp      ecx, edx
       jg       SHORT G_M37083_IG06
G_M37083_IG04:
       mov      eax, 1
G_M37083_IG05:
       ret      
G_M37083_IG06:
       xor      eax, eax
G_M37083_IG07:
       ret      
; Total bytes of code: 17

47  Redundant branch opts dot

JulieLeeMSFT commented 1 year ago

cc @AndyAyersMS to look into it.

AndyAyersMS commented 1 year ago

Yeah before RBO there are 5 compares: image (20) The ones with V03 are all resolvable, but RBO doesn't know how to clean up what's left:

image (21)

Am poking at RBO to see what it would take to handle this. Abstractly, if we have a flow graph shape like the following

image (23)

where the B node is side effect free there is a chance that we can find a single predicate that gives us the same result, and so just do one compare in A and then branch to C or D. But in the case above we see there are some interposing blocks (BB02, BB05, BB06, and BB07) and when RBO runs some those blocks have assignments in them (they turn out to be dead assignments, but this may be hard for RBO to spot).

Assuming those blocks got cleaned up somewhere/somehow we'd still have the boolean algebra problem to solve: we'd have to find some relop**(X,Y) such that:

relop**(X,Y) == !relop(X,Y) || (relop(X,Y) && relop*(X,Y))

(with suitable variations when you permute the T/F labels in the graph above). Presumably we could precompute all those in a table or something?

I am going to prototype the shape recognition and matching relop operand checks (allowing for interposing blocks) and see how often we see this kind of structure. If it is common enough then perhaps we can figure out how to not get blocked by the apparent side effects.

Similar considerations apply if we have whole slew of intermediate relops, provided that there is some pair of nodes like C and D here that collectively postdominate the start node A with all the interior nodes side effect free. Those will just yield a bigger set of equations to sort through.

AndyAyersMS commented 1 year ago

Have the recognition part prototyped. Need to gather some stats on how often we see this flow pattern, but here's a peek at what it can do so far on the example above:

RBO-NEW: checking for  flow shape: dom BB01 reaches block BB04 via jump edge (via BB04)
RBO-NEW: BB04 successors are jump BB07 next BB05
RBO-NEW: BB04 linear chain endpoints are jump BB09 next BB08
RBO-NEW: matched flow shape: dom BB01 block BB04 C BB09 D BB05
DomBlock branch condition is $100:  {GE($80, $81)}
Block branch condition is $101:  {LE($80, $81)}
Combined condition (dom->block->C) is $105:  {AND($100, $101)}
Alternate condition (dom->!block->C is $82:  {NOT($100)}
Total condition (dom->(...)->C is $106:  {OR($82, $105)}

The next step would be to add boolean simplification logic to VN, to see if $106 can be expressed as a VNFunc of the two common VNs $80 and $81. In this case it all should simplify to LE($80, $81).

Already we can see that VN doesn't know how to do much here, eg NOT($100) is NOT(GE($80,$81)) which is LT($80,$81). So will probably add those capabilities independently, they might prove useful on their own anyways.

If this pattern matches often enough, and if we can do the simplifications above, and then safely reason about side effects, we can optimize.

AndyAyersMS commented 1 year ago

ASPNET screening finds about 1400 cases.

My shape matching is too permissive, but this is somewhat encouraging. Here's an organic example from System.Threading.ThreadPool:UnsafeQueueUserWorkItem:

image

Here BB14 is the dominating block, BB17 is a dominated block with a relop with the same operands, BB22 is block C and BB13 is block D. BB19 is empty, and BB18 has a neglectable side effect (dead local assign).

So it appears we could alter BB14 to branch directly to BB13 if V21 > 0 -- but say we enter BB14 with V21 < 0, in the original code we'd then flow BB17, BB18, BB19, BB22, but now we'll flow BB14, BB15, ... and possibly not even make it to BB22. So we may accumulate unwanted effects and possibly miss our target.

So we have failed to prove that BB22 and BB13 collectively postdominate BB14, and, if we're going to alter the route by which control reaches from A=BB14 to C=BB22 we need to have the same net side effect.

AndyAyersMS commented 1 year ago

If we insist on linear flow from A->B->C and A->C, we end up with about 300 cases in ASP.NET. Going to scan through a bit more widely...

Here's another organic case, this time from System.IO.FileInfo:get_Name()

image

Note that the trees in BB07 and BB08 don't have the same lexical operands, but they do in VN space:

RBO-NEW: checking for  flow shape: dom BB07 reaches block BB08 via next edge (via BB08)
RBO-NEW: BB08 successors are jump BB11 next BB09
RBO-NEW: BB08 linear chain endpoints are jump BB11 next BB09
RBO-NEW: matched flow shape: dom BB07 block BB08 C BB11 D BB09
DomBlock branch condition is $247:  {EQ($301, $40)}
Block branch condition is $249:  {GE($40, $301)}
Combined condition (dom->block->C) is $278:  {AND($249, $289)}
Alternate condition (dom->!block->C is $247:  {EQ($301, $40)}
Total condition (dom->(...)->C is $279:  {OR($247, $278)}
AndyAyersMS commented 1 year ago

I taught VN a bit about boolean logic simplification and some relop subsumption rules. Now it can reduce the Test example above to a single relop:

RBO-NEW: checking for  flow shape: dom BB01 reaches block BB04 via jump edge (via BB04)
RBO-NEW: BB04 successors are jump BB07 next BB05
RBO-NEW: BB04 linear chain endpoints are jump BB09 next BB08
RBO-NEW: matched flow shape: dom BB01 block BB04 C BB09 D BB05

DomBlock branch condition is $100:  {GE($80, $81)}
Block branch condition is $101:  {LE($80, $81)}
Combined condition (dom->block->C) is $105:  {EQ($80, $81)}
Alternate condition (dom->!block->C is $102:  {LT($80, $81)}
Total condition (dom->(...)->C is $101:  {LE($80, $81)}
C currently reached from dom via fall through (false path), so reversing to get final relop $106:  {GT($80, $81)}

From here it should be straight forward to modify the flow and the relop and implement the optimization.

To make this sound I still need to check for (lack of) side effects in the various paths. And there is one additional wrinkle now in the Test example: we trigger this new opt when visiting BB04 and at that point we haven't yet simplified the flow in BB02. So the linear flow pattern match fails. I am currently working around this by invoking RBO recursively if we fail to see simple linear flow, but a more robust solution would be to implement a more general retry, so when we optimize BB02 we realize we ought to go back and look at BB04 once more.

AndyAyersMS commented 1 year ago

I have the above bits mostly implemented (not polished) and they are doing some interesting things. Will post some examples shortly.

However they don't address the M3 case above; in that case we have: image (25)

Seems like in this case RBO (or something) can realize that if we backsubstituted the return expression to the preds each of them would evaluate to a constant, and if we then gate those with the reaching predicates, we get something that simplifies nicely.

BB2: LE(-1,0) ==> true
BB4: LE(1,0) ==> false
BB5: LE(0, 0) == >true

So return value is

       BB02                  BB03                   BB04                               BB05
( !(V04 >= V01) & true ) | ((V04 >= V01) & (!(V04 <= V01) & false) | ((V04 <= V01) & true ))

==>

(V04 < V01 ) | ((V04 >= V01) & (V04 <= V01))

==> 

(V04 <= V01)

I think with the abilities I'm adding to VN it may be able to do such reductions.

Then we could either manifest that by rewriting from BB01 (realizing that the assignments are dead) or effectively moving the whole computation down to BB06 (we would need to know somehow that V01 and V04 reach -- in this case there are no other SSA defs and they're defined on entry or in BB1 so they do).

AndyAyersMS commented 1 year ago

I think with the abilities I'm adding to VN it may be able to do such reductions.

Been prototyping this too, on top of the changes in #83859. It seems to be able to simplify, but so far I'm not seeing a lot of cases like the one above (none in asp.net, 5 in libraries.pmi). Am double-checking my screening to see if it is flawed.

On the M3 example it is able to deduce a simplified return value VN:

optReturnRelop in BB06; return value is
N003 (  8,  4) [000007] -----------                         *  LE        int    $104
N001 (  3,  2) [000056] -----------                         +--*  LCL_VAR   int    V05 tmp3         u:1 (last use) $140
N002 (  1,  1) [000006] -----------                         \--*  CNS_INT   int    0 $40
 {LE($140, $40)}... 
[interestingVN] in BB06 relop first operand VN is PhiDef for V05:1 $103
Found local PHI [000060] for V05
... substituting ($42,$40) for ($140,$40) in $104 gives $41
... substituted VN implies relop is 1 when coming from pred BB02
BB02 is a true pred
... substituting ($41,$40) for ($140,$40) in $104 gives $40
... substituted VN implies relop is 0 when coming from pred BB04
BB04 is a false pred
... substituting ($40,$40) for ($140,$40) in $104 gives $41
... substituted VN implies relop is 1 when coming from pred BB05
BB05 is a true pred
Dominator of BB06 and all its preds is BB01
will follow path where relop value is false
Walking flow back from BB06 via BB04 to BB01
At BB04 path VN is $42  {IntCns -1}
At BB03 path VN is $105 {GT($80, $81)}
At BB01 path VN is $105 {GT($80, $81)}
reached the dominating block
Return value VN for BB06 is $101 {LE($80, $81)}

The next step from here is to figure out how to leverage this information. We don't keep inverse mappings from VNs to representative trees, and even if we did the tree values might be position dependent, so this may be tricky.

One idea is to see if the dominating block's relop has operands with the required VNs. If so we could copy them into new temps and then use those temps down in BB06. Or alternatively, see if those operands might still have the same value if evaluated in BB06. Either would work in this case.

If we did, that then the stores to V05 would all become dead and hopefully we could clean up all intermediate flow and just have a single compare.

But I don't have enough examples yet where we find these sorts of cases to figure out how effective this kind of thing would be.

AndyAyersMS commented 1 year ago

But I don't have enough examples yet where we find these sorts of cases ...

I should add this sort of optimization could apply to more relops, it is a generalization of the phi-based jump threading to cover relops whose values aren't directly feeding into some GT_JTRUE. The prototype shares a lot of the same analysis so if the screening is indeed hitting some artificial limitation, fixing that may improve jump threading oo.

Basically, given a relop VN in some tree in some block, we are trying to see if the value of the VN is entirely predictable if we know which predecessor transferred control to the block. So we check the relop operand VNs to see if any of them are local PHI defs, if so we find the appropriate phi input VNs (matching with preds), substitute those into the relop VN, and see if we can simplify to a constant (0 or 1 here, since it's a relop). This either gives us 0, 1, or "can't tell" -- so by doing this we sort the preds into 3 categories: true preds, false preds, and ambiguous preds.

Phi-based jump threading then uses this info to rewire flow. But more generally we can use this info to either simplify the relop or (perhaps) decide to likewise duplicate this block to create contexts where the relop value is known.

AndyAyersMS commented 1 year ago

Also (note to self) we might want to put VNs onto flow edges indicating the conditions under which control reaches that edge, that would make this sort of path-based analysis simpler. This should be fairly cheap. Perhaps assertion prop could leverage this too, somehow?

Going further we could update the VN phi defs to capture these conditions and the rest of VN to know how to simplify over these predicated value sets -- this pushes forwards the information that I am recovering backwards above. Probably not so cheap though.

AndyAyersMS commented 1 year ago

In libraries PMI I see about 177 cases where the return VN contains a PhiDef and we can't resolve it.

In many cases the PhiDef is non local and these are perhaps not interesting for jump threading but may still be viable for general relop rewriting? This would go something like the following (assume just one phi def for now):

If there are multiple PHI defs and they don't all have the same def block then things are more complicated.

In some other cases the PhiDef is local (phi def block is block) but the PhiDef is buried deeper in the func graph. Currently we only look one level down, eg there are 100 or so cases like this one:

{NE($205= {AND($47= {IntCns 128}, $240= {PhiDef[V02.1/BB04]($204 {Phi($46= {IntCns 2}, $45= {IntCns 3})})})}, $40= {IntCns 0})} in BB04

These are cases that might help phi-based RBO too. But it is clunky to do substitutions in a multi-level VN tree. Would be nice to have a general substitution utility I suppose.

This also starts to tread on if conversion territory -- if "materialize it somehow" was creation of a GT_SELECT tree and we relied on other opts to clean all this up. In the above we are effectively forward subbing that tree into the consumer (and insisting there be a local consumer) and only doing this for booleans, and only doing this when we know it will clean up.