Open redknightlois opened 9 years ago
I'm not quite sure what this has to do with generics, this looks more like a devirtualization problem.
If I remember correctly, if ClassCalls was a struct and not a class, the Execute method would be inlined
@panost Yes, in the case of structs the JIT usually does devirtualization. It's practically forced to do so: making an interface call on a value type would require boxing, and you'd end up calling the method on a copy of the original value.
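To make the boxing point concrete, here is a minimal sketch. The names CounterCalls, StructDemo, RunGeneric and RunBoxed are hypothetical and do not come from the gist; ICalls is redeclared here with a parameterless Execute only so the snippet is self-contained.

```csharp
// Hypothetical types for illustration; not taken from the gist.
public interface ICalls
{
    void Execute();
}

// A mutable struct, so the effect of calling Execute on a copy is observable.
public struct CounterCalls : ICalls
{
    public int Count;
    public void Execute() => Count++;
}

public static class StructDemo
{
    // Constrained generic: for a value-type T the runtime compiles a body
    // specific to that T, so the call target is known exactly and no box
    // is created. The call operates on the caller's value through the ref.
    public static void RunGeneric<T>(ref T instance) where T : ICalls
    {
        instance.Execute();
    }

    // Interface-typed parameter: a struct argument is boxed at the call
    // site, and Execute mutates the boxed copy, not the caller's value.
    public static void RunBoxed(ICalls instance)
    {
        instance.Execute();
    }
}
```

With `var c = new CounterCalls();`, calling `StructDemo.RunGeneric(ref c)` leaves `c.Count` at 1, while a following `StructDemo.RunBoxed(c)` does not change it, because the increment lands on the box.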
@mikedn yeah, but my benchmarks and the emitted assembly suggest that the JIT will perform some limited devirtualization if no generic type is involved. I agree that the topic could probably be changed to "RyuJIT support aggressive call devirtualization over constrained generic types", or something better if we can think of one.
but my benchmarks and the emitted assembly suggest that the JIT will perform some limited devirtualization if no generic type is involved
Do you have some sample code?
@mikedn Sure. https://gist.github.com/redknightlois/5bafa47ee9835605da26
Just don't run the naked-call versions together with the generic ones: the difference in instruction count between them will skew the results (probably because I am not counting the source instructions properly, so I am not passing the right number to BenchmarkDotNet for a proper adjustment).
In there you will see that the timings for all the naked calls (sealed, unsealed and interface) are essentially the same. The assembly emitted for the 3 is identical as far as I remember. This suggests that some limited devirtualization is happening.
@CarolEidt can you provide some insight here?
UseNakedInstanceCalls and UseNakedSealedCalls do not contain any virtual/interface calls and generate identical code. UseNakedInterfaceCalls contains an interface call that could, at least in theory, be devirtualized. All the generic variants contain interface calls similar to UseNakedInterfaceCalls. They should generate the same code as UseNakedInterfaceCalls but there's some dead code that the JIT doesn't eliminate:
sub rsp,28h
mov r11,qword ptr [rcx+30h] ;dead
mov rcx,qword ptr [rcx+18h]
mov r11d,dword ptr [r11] ;dead
mov r11,7FFD68470048h
cmp dword ptr [rcx],ecx
call qword ptr [r11]
nop
add rsp,28h
ret
In there you will see that the timings for all the naked calls (sealed, unsealed and interface) are essentially the same. The assembly emitted for the 3 is identical as far as I remember. This suggests that some limited devirtualization is happening.
I haven't measured the time but the naked interface variant certainly generates different code from the other 2 naked variants.
As for actually doing devirtualization in this case - it isn't that simple. For example, the call in UseNakedInterfaceCalls can only be devirtualized if the JIT observes that the _nakedInterfaceCalls field is readonly (trivial) and initialized to an instance of ClassCalls (not that trivial, as the initialization is done in the constructor, a different and unrelated method).
@mikedn And then I remember that I have RyuJIT disabled :)
These are the legacy JIT calls:
There is no huge timing difference between the alternatives.
EDIT: In the 64-bit version there is an indirection on the call and a "lea" operation over the r11 register that looks like padding.
Naked class call:
sub rsp,28h
mov rcx,qword ptr [rcx+8]
cmp byte ptr [rcx],0
call 00007FFD14B04820
nop
add rsp,28h
ret
Interface call:
sub rsp,28h
mov rcx,qword ptr [rcx+10h]
cmp byte ptr [rcx],0
lea r11,[7FFD149F0040h]
call qword ptr [7FFD149F0040h]
nop
add rsp,28h
ret
These are the legacy JIT calls:
That's the 32 bit JIT, not the legacy (aka JIT64) JIT. Though on my machine JIT32 does inline the first two calls...
@mikedn I hate that Visual Studio sets the "Prefer 32-bit" option by default. See the edit.
That lea is generated by the legacy JIT (JIT64) compiler, it's not generated by RyuJIT. Discussing the code generated by JIT64 isn't exactly useful.
@mikedn I know, that's why I said: "And then I remember that I have RyuJIT disabled :)" ... the limited devirtualization I've seen was JIT64 not RyuJIT making the whole argument moot.
the limited devirtualization I've seen was JIT64 not RyuJIT making the whole argument moot
But there's no kind of devirtualization going on in JIT64 either.
@mikedn OK, now I see what you mean. For all intents and purposes those 2 call opcodes are equivalent. The performance profile of that interface call is the same as that of the register call on every processor from Sandy Bridge onward (and maybe a couple before it). But that's an artifact introduced by my code, because I isolated the 3 calls in their own methods. When they are called one after another (even creating the object in the line before), it can be seen that no devirtualization happens for the interface call, even though that would have been insanely safe.
However, it can be argued that devirtualization of the type:
ICalls instance = new ClassCalls();
instance.Execute();
could be done at the compiler level without much hassle. On the constrained generic types case that doesn't seem to be true.
EDIT: I would be glad even if devirtualization only worked for the following code:
public class Executer<T> where T : ICalls
{
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static void Execute(T instance)
    {
        instance.Execute();
    }
}
Where the calling code would look like:
ClassCalls _nakedClassCalls = new ClassCalls();
....
Executer<ClassCalls>.Execute(_nakedClassCalls);
However, it can be argued that devirtualization of the type: ... could be done at the compiler level without much hassle.
Yes, that's one case where devirtualization is possible. In itself it is a rather useless case, as there's little reason to write such code to begin with (the only practical use for that kind of code is to access explicitly implemented members). But such opportunities can show up in real code as the result of inlining either the Execute call site (as happens in your case: the generic Execute gets inlined, so the call site it contains can "see" the actual type assigned to instance) or the instantiation site.
On the constrained generic types case that doesn't seem to be true.
Generics don't play any part in this except for the fact that they introduce the interface call. For reference types your generic Execute method is no different from a non-generic method:
public class Executer
{
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static void Execute(ICalls instance)
    {
        instance.Execute();
    }
}
Devirtualization of enumerators called via Interfaces back to structs would be nice...
@redknightlois can you look this over and update if you still think there is anything actionable here, or close if not?
For generics instantiated over ref types we're unlikely to do devirtualization anytime soon, as the jit only sees the shared version. This might change down the road, if we somehow enabled unshared ref type instantiations or started looking into speculative devirtualization.
If the generic can get inlined into a context where the types are known then things open up a little and if the jit can put enough pieces together or see sealed types, it can do a lot of optimization.
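As a rough sketch of the kind of shape described here, consider the following. SealedCalls and Caller are hypothetical names, and Execute is given an int parameter and return value only so the result is observable. The thread does not show whether any particular runtime version actually devirtualizes this call; the sketch only shows code where the needed facts are visible to the jit after inlining.

```csharp
using System.Runtime.CompilerServices;

// Hypothetical types for illustration; not taken from the gist.
public interface ICalls
{
    int Execute(int x);
}

// Sealed: no subclass can override Execute, so knowing the type
// is enough to know the call target.
public sealed class SealedCalls : ICalls
{
    public int Execute(int x) => x + 1;
}

public static class Executer
{
    // Small wrapper containing the interface call; a candidate for inlining.
    [MethodImpl(MethodImplOptions.AggressiveInlining)]
    public static int Execute(ICalls instance, int x) => instance.Execute(x);
}

public static class Caller
{
    public static int Run(int x)
    {
        // The allocation is visible in this method, so if Executer.Execute
        // is inlined here the jit has a chance to learn that the receiver
        // is exactly SealedCalls (an assumption about jit behavior).
        var calls = new SealedCalls();
        return Executer.Execute(calls, x);
    }
}
```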
@AndyAyersMS given that there are a few workarounds that could be found with sealed types and the actual solution for this is devirtualization of generic ref types I would say that the criteria for closing could be:
If all are yes, I would say that this is done.
There's no way for the jit to deduce the exact type of the instance members at jit time; all the jit knows is that the type is one of the exact instantiations of the shared type Executer`1.
If it turns out that the instance member is always just one of a handful of types, then via profiling the jit can discover which type is most likely, guess for that, and perform guarded devirtualization and subsequent inlining. This can be seen with the changes for class profiling linked above; e.g.
; Assembly listing for method Runtime4489:UseSealedCalls():this
; Emitting BLENDED_CODE for X64 CPU with AVX - Windows
; Tier-1 compilation
; optimized code
; optimized using profile data
; rsp based frame
; partially interruptible
; with IBC profile data, edge weights are valid, and fgCalledCount is 15122173
; invoked as altjit
; Final local variable assignments
;
; V00 this [V00,T00] ( 4, 4 ) ref -> rcx this class-hnd
; V01 OutArgs [V01 ] ( 1, 1 ) lclBlk (32) [rsp+0x00] "OutgoingArgSpace"
; V02 tmp1 [V02,T02] ( 2, 4 ) ref -> rcx ld-addr-op class-hnd "Inlining Arg"
; V03 tmp2 [V03,T01] ( 3, 4 ) ref -> rcx "guarded devirt this temp"
;* V04 tmp3 [V04 ] ( 0, 0 ) ref -> zero-ref class-hnd exact "guarded devirt this exact temp"
; V05 tmp4 [V05,T03] ( 2, 4 ) ref -> r11 class-hnd "Inlining Arg"
;
; Lcl frame size = 40
G_M52712_IG01: ;; offset=0000H
4883EC28 sub rsp, 40
;; bbWeight=1 PerfScore 0.25
G_M52712_IG02: ;; offset=0004H
4C8B5930 mov r11, gword ptr [rcx+48]
488B4918 mov rcx, gword ptr [rcx+24]
45391B cmp dword ptr [r11], r11d
49BBE8C335C4F87F0000 mov r11, 0x7FF8C435C3E8
4C3919 cmp qword ptr [rcx], r11
7517 jne SHORT G_M52712_IG04
48B9ACA232C4F87F0000 mov rcx, 0x7FF8C432A2AC
4533DB xor r11d, r11d
448919 mov dword ptr [rcx], r11d
FF01 inc dword ptr [rcx]
;; bbWeight=1 PerfScore 15.75
G_M52712_IG03: ;; offset=0030H
4883C428 add rsp, 40
C3 ret
;; bbWeight=1 PerfScore 1.25
G_M52712_IG04: ;; offset=0035H
49BB580505C4F87F0000 mov r11, 0x7FF8C4050558
48B8580505C4F87F0000 mov rax, 0x7FF8C4050558
FF10 call qword ptr [rax]ICalls:Execute():this
EBE3 jmp SHORT G_M52712_IG03
;; bbWeight=0 PerfScore 0.00
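The listing above has a type check (cmp qword ptr [rcx], r11 followed by jne), an inlined fast path, and a fallback interface call in IG04. Written by hand in C#, that transformation looks roughly like the sketch below. This is an illustration of the idea, not the jit's implementation; OtherCalls, GuardedCall and Invoke are hypothetical names, and Execute is given an int parameter and return value only so the two paths can be told apart.

```csharp
// Hypothetical types for illustration; not taken from the gist.
public interface ICalls
{
    int Execute(int x);
}

public sealed class ClassCalls : ICalls
{
    public int Execute(int x) => x + 1;
}

public sealed class OtherCalls : ICalls
{
    public int Execute(int x) => x * 2;
}

public static class GuardedCall
{
    public static int Invoke(ICalls instance, int x)
    {
        // Guard: compare the exact runtime type against the guessed type
        // (the jit would pick the guess from profile data).
        if (instance.GetType() == typeof(ClassCalls))
        {
            // Fast path: the exact type is known, so this is a direct
            // call that can be inlined.
            return ((ClassCalls)instance).Execute(x);
        }

        // Slow path: the guess was wrong, fall back to interface dispatch.
        return instance.Execute(x);
    }
}
```

Both paths return the same result as a plain interface call; only the dispatch cost differs when the guess is right.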
In an AOT scenario without PGO, and if one can impose suitable restrictions (no reflection, etc) it might be possible for RTA or similar to deduce that only one type can possibly be assigned to the instance members.
Going to keep this open and in future, but once PGO is a bit further along may come back and close this one.
This will probably end up in the future releases wishlist, but it is something that I have been looking forward to for a long time already.
Let's say that we have this code:
And we have the following instances:
Now we would expect that the call for _classCalls.Execute(x) would be different than for _interfaceCalls.Execute(x). Apparently that is not the case: the JIT stops at the first level even if it has the complete information to emit highly optimized code for that call site.
Now, suppose the implementation is:
There is no way that the JIT would inline that code, even if for all purposes it is safe to do so.
The scenario for this pattern is pretty common in high performance code where the calls are very small, in tight loops but must be able to handle more than a single type... An example is a BitVector with variants for MemoryMappedBitVector, UnsafeBitVector, LongBitVector and so on. Operations tend to be very small and executed in very tight loops.
Today we either need a different codepath for each one, or pay the call tax.
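A minimal sketch of the scenario, under stated assumptions: the interface IBitVector, its members, ArrayBitVector and BitOps.CountSet are hypothetical, and LongBitVector borrows only its name from the list above. Because both implementations are classes, the generic method is compiled once as shared code, so each vector.Get(i) in the loop remains an interface call.

```csharp
// Hypothetical shapes for illustration; not taken from the original code.
public interface IBitVector
{
    int Length { get; }
    bool Get(int index);
}

public sealed class LongBitVector : IBitVector
{
    private readonly ulong _bits;
    public LongBitVector(ulong bits) { _bits = bits; }
    public int Length => 64;
    public bool Get(int index) => ((_bits >> index) & 1UL) != 0;
}

public sealed class ArrayBitVector : IBitVector
{
    private readonly bool[] _bits;
    public ArrayBitVector(bool[] bits) { _bits = bits; }
    public int Length => _bits.Length;
    public bool Get(int index) => _bits[index];
}

public static class BitOps
{
    // One generic code path for every vector type. For reference-type T
    // the body is shared, so Get is dispatched through the interface on
    // every iteration of this tight loop.
    public static int CountSet<T>(T vector) where T : IBitVector
    {
        int count = 0;
        for (int i = 0; i < vector.Length; i++)
        {
            if (vector.Get(i)) count++;
        }
        return count;
    }
}
```

The alternative mentioned above is to write a separate CountSet overload per concrete type, which avoids the interface call at the cost of duplicated code.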
category:cq theme:inlining skill-level:expert cost:large