RyuJIT call optimization and aggressive inlining with known generic types

redknightlois commented 9 years ago

This will probably end up on the future-releases wishlist, but it is something I have been looking forward to for a long time already.

Let's say that we have this code:

        public class Executer<T> where T : ICalls
        {
            [MethodImpl(MethodImplOptions.AggressiveInlining)]
            public void Execute(T instance)
            {
                instance.Execute();
            }
        }

And we have the following instances:

        private readonly Executer<ClassCalls> _classCalls = new Executer<ClassCalls>();
        private readonly Executer<ICalls> _interfaceCalls = new Executer<ICalls>();
        private readonly Executer<SealedClassCalls> _sealedCalls = new Executer<SealedClassCalls>();

Now we would expect the call to _classCalls.Execute(x) to be compiled differently from the call to _interfaceCalls.Execute(x). Apparently that is not the case: the JIT stops at the first level even if it has the complete information to emit highly optimized code for that call site.

Now, suppose the implementation is:

        public class ClassCalls : ICalls
        {
            public static int i;

            [MethodImpl(MethodImplOptions.AggressiveInlining)]
            public void Execute()
            {
                i = 0;
                i++;
            }
        }

There is no way that the JIT would inline that code, even if for all purposes it is safe to do so.

The scenario for this pattern is pretty common in high performance code where the calls are very small, in tight loops but must be able to handle more than a single type... An example is a BitVector with variants for MemoryMappedBitVector, UnsafeBitVector, LongBitVector and so on. Operations tend to be very small and executed in very tight loops.

Today we either need a different codepath for each one, or pay the call tax.

category:cq theme:inlining skill-level:expert cost:large

mikedn commented 9 years ago

I'm not quite sure what this has to do with generics, this looks more like a devirtualization problem.

panost commented 9 years ago

If I remember correctly, if ClassCalls was a struct and not a class, the Execute method would be inlined

mikedn commented 9 years ago

@panost Yes, in the case of structs the JIT usually does devirtualization. It's practically forced to do so, making an interface call on a value type would require boxing and you'd end up calling a method on a copy of the original value.
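
The struct behaviour described above can be illustrated with a minimal sketch. The names StructCalls, StructExecuter and StructDemo are hypothetical and do not come from the gist; the counter is an instance field (unlike the static field in the original ClassCalls) so that the effect of boxing is observable.

```csharp
using System;
using System.Runtime.CompilerServices;

public interface ICalls
{
    void Execute();
}

// Hypothetical struct implementation of ICalls (not from the original gist).
public struct StructCalls : ICalls
{
    public int Count;

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public void Execute() => Count++;
}

public static class StructExecuter
{
    // Generic over T: for a value type T the runtime compiles a dedicated
    // method body, so the exact target of instance.Execute() is known.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static void Execute<T>(ref T instance) where T : ICalls => instance.Execute();

    // Takes the interface: passing a struct here boxes it, so the call
    // mutates the boxed copy and not the caller's value.
    public static void ExecuteBoxed(ICalls instance) => instance.Execute();
}

public static class StructDemo
{
    public static void Main()
    {
        var calls = new StructCalls();
        StructExecuter.Execute(ref calls);   // increments 'calls' in place
        StructExecuter.ExecuteBoxed(calls);  // increments a boxed copy only
        Console.WriteLine(calls.Count);      // prints 1
    }
}
```

Only the first call is visible in the final count, which is the "method on a copy" problem mentioned above. Whether the generic call actually gets inlined still depends on the JIT version and its heuristics.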

redknightlois commented 9 years ago

@mikedn yeah, but my benchmarks and the emitted assembly suggest that the JIT will perform some limited devirtualization if no generic type is involved. I agree that the title could probably be changed to "RyuJIT support aggressive call devirtualization over constrained generic types", or something better if we can think of one.

mikedn commented 9 years ago

> but my benchmarks and the emitted assembly suggest that the JIT will perform some limited devirtualization if no generic type is involved

Do you have some sample code?

redknightlois commented 9 years ago

@mikedn Sure. https://gist.github.com/redknightlois/5bafa47ee9835605da26

Just don't execute the naked call versions along with the generic ones (the difference in instruction count between them will skew the results, probably because I am not counting the source instructions properly and so I am not passing the right number to BenchmarkDotNet for a proper adjustment).

In there you will see that all the naked calls (sealed, unsealed and interface) have essentially the same cost. The assembly emitted for the three is identical as far as I remember. This suggests that some limited devirtualization is happening.

@CarolEidt can you provide some insight here?

mikedn commented 9 years ago

UseNakedInstanceCalls and UseNakedSealedCalls do not contain any virtual/interface calls and generate identical code. UseNakedInterfaceCalls contains an interface call that could, at least in theory, be devirtualized. All the generic variants contain interface calls similar to UseNakedInterfaceCalls. They should generate the same code as UseNakedInterfaceCalls but there's some dead code that the JIT doesn't eliminate:

 sub         rsp,28h 
 mov         r11,qword ptr [rcx+30h]  ;dead
 mov         rcx,qword ptr [rcx+18h] 
 mov         r11d,dword ptr [r11]     ;dead
 mov         r11,7FFD68470048h 
 cmp         dword ptr [rcx],ecx 
 call        qword ptr [r11] 
 nop 
 add         rsp,28h 
 ret

> In there you will see that all the naked calls (sealed, unsealed and interface) have essentially the same cost. The assembly emitted for the three is identical as far as I remember. This suggests that some limited devirtualization is happening.

I haven't measured the time but the naked interface variant certainly generates different code from the other 2 naked variants.

As for actually doing devirtualization in this case - it isn't that simple. For example, the call in UseNakedInterfaceCalls can only be devirtualized if the JIT observes that the _nakedInterfaceCalls is readonly (trivial) and initialized to an instance of ClassCalls (not that trivial as the initialization is done in the constructor, a different and unrelated method).

redknightlois commented 9 years ago

@mikedn And then I remember that I have RyuJIT disabled :)

These are the legacy JIT calls:

[Images: devirtualization 1, devirtualization 2, devirtualization 3]

There is no big timing difference between the alternatives.

EDIT: In the 64-bit version there is an indirection on the call and a "lea" operation on the r11 register that looks like padding.

Naked class call:

sub         rsp,28h  
mov         rcx,qword ptr [rcx+8]  
cmp         byte ptr [rcx],0  
call        00007FFD14B04820  
nop  
add         rsp,28h  
ret

Interface call:

sub         rsp,28h  
mov         rcx,qword ptr [rcx+10h]  
cmp         byte ptr [rcx],0  
lea         r11,[7FFD149F0040h]  
call        qword ptr [7FFD149F0040h]  
nop  
add         rsp,28h  
ret

mikedn commented 9 years ago

> These are the legacy JIT calls:

That's the 32 bit JIT, not the legacy (aka JIT64) JIT. Though on my machine JIT32 does inline the first two calls...

redknightlois commented 9 years ago

@mikedn I hate that Visual Studio sets the "Prefer 32-bit" option by default. See the edit.

mikedn commented 9 years ago

That lea is generated by the legacy JIT (JIT64) compiler, it's not generated by RyuJIT. Discussing the code generated by JIT64 isn't exactly useful.

redknightlois commented 9 years ago

@mikedn I know, that's why I said: "And then I remember that I have RyuJIT disabled :)" ... the limited devirtualization I've seen was JIT64 not RyuJIT making the whole argument moot.

mikedn commented 9 years ago

> the limited devirtualization I've seen was JIT64 not RyuJIT making the whole argument moot

But there's no kind of devirtualization going on in JIT64 either.

redknightlois commented 9 years ago

@mikedn OK, now I see what you mean. For all intents and purposes those 2 call opcodes are equivalent. The performance profile of that interface call is the same as that of the register call on every processor from Sandy Bridge onwards (and maybe a couple before). But that's an artifact introduced by my code, because I isolated the 3 calls in their own methods. When they are called one after another (even when the object is created on the line before), it can be seen that no devirtualization happens for the interface call, even though that would have been insanely safe.

However, it can be argued that devirtualization of the type:

ICalls instance = new ClassCalls();
instance.Execute();

could be done at the compiler level without much hassle. On the constrained generic types case that doesn't seem to be true.

EDIT: I would be glad even if the only devirtualization that worked was for the following code:

public class Executer<T> where T : ICalls
{
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static void Execute(T instance)
    {
        instance.Execute();
    }
}

Where the calling code would look like:

ClassCalls _nakedClassCalls = new ClassCalls();
....
Executer<ClassCalls>.Execute(_nakedClassCalls);

mikedn commented 9 years ago

> However, it can be argued that devirtualization of the type: ... could be done at the compiler level without much hassle.

Yes, that's one case where devirtualization is possible. In itself it is a rather useless case, as there's little reason to write such code to begin with (the only practical use for that kind of code is to access explicitly implemented members). But such opportunities can show up in real code as the result of inlining either the Execute call site (as happens in your case: the generic Execute gets inlined, so the call site it contains can "see" the actual type assigned to instance) or the instantiation site.

> On the constrained generic types case that doesn't seem to be true.

Generics don't play any part in this except for the fact that they introduce the interface call. For reference types your generic Execute method is no different from a non-generic method:

public class Executer
{
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static void Execute(ICalls instance)
    {
        instance.Execute();
    }
}

benaadams commented 9 years ago

Devirtualization of enumerators called via Interfaces back to structs would be nice...

AndyAyersMS commented 6 years ago

@redknightlois can you look this over and update if you still think there is anything actionable here, or close if not?

For generics instantiated over ref types we're unlikely to do devirtualization anytime soon, as the jit only sees the shared version. This might change down the road, if we somehow enabled unshared ref type instantiations or started looking into speculative devirtualization.

If the generic can get inlined into a context where the types are known then things open up a little and if the jit can put enough pieces together or see sealed types, it can do a lot of optimization.

redknightlois commented 6 years ago

@AndyAyersMS given that there are a few workarounds that could be found with sealed types and the actual solution for this is devirtualization of generic ref types I would say that the criteria for closing could be:

[ ] Are the cases where it works documented?
[ ] Are the workarounds documented?
[ ] Are the limitations documented? Or, are there open issues that cover those cases?

If all are yes, I would say that this is done.
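
As an illustration of the kind of workaround mentioned above, here is a minimal sketch of one commonly used pattern: wrapping the reference type in a struct so that the generic is instantiated over a value type and is not shared. This pattern is not spelled out in the thread; SealedCallsAdapter and AdapterDemo are hypothetical names, and SealedClassCalls is assumed to mirror ClassCalls with the sealed modifier added.

```csharp
using System;
using System.Runtime.CompilerServices;

public interface ICalls
{
    void Execute();
}

// Assumed to look like ClassCalls from the issue, but sealed.
public sealed class SealedClassCalls : ICalls
{
    public static int i;

    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public void Execute()
    {
        i = 0;
        i++;
    }
}

// Hypothetical struct adapter: Executer<SealedCallsAdapter> is an
// instantiation over a value type, so it gets its own compiled code.
public readonly struct SealedCallsAdapter : ICalls
{
    private readonly SealedClassCalls _inner;

    public SealedCallsAdapter(SealedClassCalls inner) => _inner = inner;

    // The call goes through the sealed class type, not through ICalls.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public void Execute() => _inner.Execute();
}

public class Executer<T> where T : ICalls
{
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public void Execute(T instance) => instance.Execute();
}

public static class AdapterDemo
{
    public static void Main()
    {
        var executer = new Executer<SealedCallsAdapter>();
        executer.Execute(new SealedCallsAdapter(new SealedClassCalls()));
        Console.WriteLine(SealedClassCalls.i); // prints 1
    }
}
```

The trade-off is one adapter struct per implementation; whether the resulting calls are actually inlined should be checked in the emitted assembly for the runtime in use.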

AndyAyersMS commented 3 years ago

There's no way for the jit to deduce the exact type of the instance members at jit time; all the jit knows is that the type is one of the exact instantiations of the shared type Executer`1.

If it turns out that the instance member is always just one or a handful of types, then via profiling the jit can discover which type is most likely, guess for that, and perform guarded devirtualization and subsequent inlining. This can be seen with the changes for class profiling linked above; e.g.:

; Assembly listing for method Runtime4489:UseSealedCalls():this
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; optimized using profile data
; rsp based frame
; partially interruptible
; with IBC profile data, edge weights are valid, and fgCalledCount is 15122173
; invoked as altjit
; Final local variable assignments
;
;  V00 this         [V00,T00] (  4,  4   )     ref  ->  rcx         this class-hnd
;  V01 OutArgs      [V01    ] (  1,  1   )  lclBlk (32) [rsp+0x00]   "OutgoingArgSpace"
;  V02 tmp1         [V02,T02] (  2,  4   )     ref  ->  rcx         ld-addr-op class-hnd "Inlining Arg"
;  V03 tmp2         [V03,T01] (  3,  4   )     ref  ->  rcx         "guarded devirt this temp"
;* V04 tmp3         [V04    ] (  0,  0   )     ref  ->  zero-ref    class-hnd exact "guarded devirt this exact temp"
;  V05 tmp4         [V05,T03] (  2,  4   )     ref  ->  r11         class-hnd "Inlining Arg"
;
; Lcl frame size = 40

G_M52712_IG01:              ;; offset=0000H
       4883EC28             sub      rsp, 40
                        ;; bbWeight=1    PerfScore 0.25
G_M52712_IG02:              ;; offset=0004H
       4C8B5930             mov      r11, gword ptr [rcx+48]
       488B4918             mov      rcx, gword ptr [rcx+24]
       45391B               cmp      dword ptr [r11], r11d
       49BBE8C335C4F87F0000 mov      r11, 0x7FF8C435C3E8
       4C3919               cmp      qword ptr [rcx], r11
       7517                 jne      SHORT G_M52712_IG04
       48B9ACA232C4F87F0000 mov      rcx, 0x7FF8C432A2AC
       4533DB               xor      r11d, r11d
       448919               mov      dword ptr [rcx], r11d
       FF01                 inc      dword ptr [rcx]
                        ;; bbWeight=1    PerfScore 15.75
G_M52712_IG03:              ;; offset=0030H
       4883C428             add      rsp, 40
       C3                   ret      
                        ;; bbWeight=1    PerfScore 1.25
G_M52712_IG04:              ;; offset=0035H
       49BB580505C4F87F0000 mov      r11, 0x7FF8C4050558
       48B8580505C4F87F0000 mov      rax, 0x7FF8C4050558
       FF10                 call     qword ptr [rax]ICalls:Execute():this
       EBE3                 jmp      SHORT G_M52712_IG03
                        ;; bbWeight=0    PerfScore 0.00
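
Read back into C#, the listing above has roughly the following shape. This is a hand-written sketch of what guarded devirtualization amounts to, not something the JIT emits at source level; SealedClassCalls is assumed to have the same body as ClassCalls from the issue, and OtherCalls and GuardedDevirtSketch are hypothetical names added to exercise the fallback path.

```csharp
using System;

public interface ICalls
{
    void Execute();
}

// Assumed to have the same body as ClassCalls in the issue.
public sealed class SealedClassCalls : ICalls
{
    public static int i;

    public void Execute()
    {
        i = 0;
        i++;
    }
}

// Hypothetical second implementation, used only to hit the slow path.
public sealed class OtherCalls : ICalls
{
    public static int Hits;

    public void Execute() => Hits++;
}

public static class GuardedDevirtSketch
{
    public static void Execute(ICalls instance)
    {
        // Guard: compare the runtime type against the type that profiling
        // found most likely (the method table compare in G_M52712_IG02).
        if (instance.GetType() == typeof(SealedClassCalls))
        {
            // Fast path: the body of SealedClassCalls.Execute, inlined.
            SealedClassCalls.i = 0;
            SealedClassCalls.i++;
        }
        else
        {
            // Slow path: the original interface call (G_M52712_IG04).
            instance.Execute();
        }
    }

    public static void Main()
    {
        Execute(new SealedClassCalls());
        Execute(new OtherCalls());
        Console.WriteLine($"{SealedClassCalls.i} {OtherCalls.Hits}"); // prints "1 1"
    }
}
```

The guess stays correct for any other implementation of ICalls because the fallback keeps the interface call; the profile only decides which type gets the fast path.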

In an AOT scenario without PGO, and if one can impose suitable restrictions (no reflection, etc) it might be possible for RTA or similar to deduce that only one type can possibly be assigned to the instance members.

Going to keep this open and in future, but once PGO is a bit further along may come back and close this one.

dotnet / runtime

RyuJIT call optimization and aggressive inlining with known generic types #4489