dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.27k stars 4.73k forks source link

Possible OSX/Linux JIT Inlining Optimization Bug .Net Core 5.0.200 #49078

Closed kevinmv closed 3 years ago

kevinmv commented 3 years ago

Hello!

We are seeing what looks to be a JIT optimization bug in .Net Core 5.0.200 but only for OSX and Linux, Windows does not present any issues.

Description

We have a function ScheduleFlushSend(JobHandle dep) which takes a small struct as an input parameter:

    [StructLayout(LayoutKind.Sequential)]
    public struct JobHandle
    {
        public IntPtr JobGroup;
        public uint Version;
        //<member functions removed for brevity>
     }

Inside ScheduleFlushSend we pass along the dep JobHandle parameter to an implemented interface method ScheduleSend however depending on optimization levels the dep jobhandle is corrupted.

Release builds can be forced to work by decorating ScheduleFlushSend with [MethodImpl(MethodImplOptions.NoInlining)], or alternatively running with COMPlus_JITMinOpts=1

The crash would only happen after calling ScheduleFlushSend 30+ times, however by running with COMPlus_TieredCompilation=0 the corruption is immediate further supporting our suspicion this is a JIT optimization bug.

Printing out the values of the JobHandle members before passing / after entering functions shows what looks to be an issue where inlining the function results in the parameter being read from a memory address that is off by 8 bytes.

(in this example RPCSystem.OnUpdate calls ScheduleFlushSend which calls ScheduleSend)

RPCSystem:                  jobgroup=76 version=0
   |--ScheduleFlushSend:    jobgroup=76 version=0
      |--ScheduleSend:          jobgroup=139823864677200 version=76 

Here you can see the JobGroup field's value has shifted into the second member of the struct, and the first member is a random value (it actually looks like a valid, irrelevant, memory address)

I can send a working (forcing noinlining on ScheduleFlushSend) repo and the crashing repo (removing the forced inlining) if provided an address.

Configuration

Other information

dotnet-issue-labeler[bot] commented 3 years ago

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

kevinmv commented 3 years ago

Adding area owners @BruceForstall @dotnet/jit-contrib

BruceForstall commented 3 years ago

@kevinmv Thank you for the report, and investigation. It does smell like a codegen bug. Given that it's not reproducing in .NET 6, perhaps we just need to figure out what fixed the problem and backport the fix. Let me figure out how to get an upload from you of the repros.

In the meantime, if you have the capability to build your own version of .NET from source, and can build a Checked build, running with "export COMPlus_JitDump=ScheduleFlushSend ScheduleSend" would be be the first thing we would likely do, to see the JIT internal output for these functions. (You could upload the output as a gist if there's nothing private you can't share publicly.)

kevinmv commented 3 years ago

Thanks! I made a local build and it seems really quite informative. The Checked build does not fail immediately the first time ScheduleFlushSend is called but after ~40 invocations the method is recompiled and crashes right after. I've never read this log before but it seems like a possible bug with the ScheduleFlushSend trying to be devirtualized ScheduleSend. jit-opt-bug-out.txt

It's unclear to me if this bug has been fixed in .Net 6 or just doesn't present itself. Let me know if I can provide any more information to help.

All the best, Kev

BruceForstall commented 3 years ago

This looks like bad code generation in ScheduleFlushSend(), where it passes arguments to ScheduleSend() that don't match what ScheduleSend() expects to see.

In particular, the problem is when we think we can make a fast (optimistic) tailcall from ScheduleFlushSend() to ScheduleSend() (because ScheduleFlushSend() just returns the value of ScheduleSend() and does no further processing), but, late, we realize we can't because don't have enough available stack space to place the outgoing arguments. So we back out, and attempt to create a normal call. But, in the special case the called function returns a struct in multiple registers (in this case, the 16 byte JobHandle struct is passed in consecutive registers on Linux x64), we do a little extra processing to handle the return value. One more special-case: the call to ScheduleSend() is a virtual stub dispatch call where we previously added a special, hidden argument (the address of the VSD cell). When we process the call again (morph the call + call args), we forget about that, and think this special arg is a normal arg, so we erroneously add another special arg.

So, in summary, for a call A->B, to see this failure, we need:

  1. The call is considered a potential tailcall (by the importer)
  2. The call requires non-standard arguments that add call argument IR in fgInitArgInfo() (e.g., VSD call -- in this case, a generic interface call)
  3. We reject the tailcall in fgMorphPotentialTailCall() (e.g., not enough incoming arg stack space in A to store B's outgoing args), in this case because the first arg to ScheduleSend is a large struct)
  4. B returns a struct in multiple registers (e.g., a 16-byte struct in Linux x64 ABI)
  5. Given all these, we forget about the non-standard args we added, then add them again (in fgMorphCall)

I'm not sure why this wouldn't repro in .NET 6; the problem still seems to exist. I need to try and construct a small repro case.

kevinmv commented 3 years ago

Thanks @BruceForstall that is a great breakdown. Let me know if I can provide anymore information.

ecuzzillo commented 3 years ago

Is there some way for us to be notified or find out when this fix is backported to .NET 5?

BruceForstall commented 3 years ago

Reopening for .NET 5 port.

Is there some way for us to be notified or find out when this fix is backported to .NET 5?

The best way would be to follow this issue (I'll submit a PR with the .NET 5 port of the fix soon)

acgaudette commented 3 years ago

hey! any updates on the backport? :)

BruceForstall commented 3 years ago

hey! any updates on the backport? :)

@acgaudette It just got merged to the 5.0 branch.

acgaudette commented 3 years ago

hey! any updates on the backport? :)

@acgaudette It just got merged to the 5.0 branch.

@BruceForstall awesome. do you know when the next release might appear?

BruceForstall commented 3 years ago

The previews are monthly, as mentioned in the last release blog post here: https://devblogs.microsoft.com/dotnet/announcing-net-6-preview-2/. So... soon.

wackoisgod commented 3 years ago

@BruceForstall would that be true even for the backport ?

BruceForstall commented 3 years ago

@BruceForstall would that be true even for the backport ?

oops, sorry, got confused between releases there...

However, service patch releases for 5.0 are also release on an approximately monthly cadence, as can be seen from the release page: https://dotnet.microsoft.com/download/dotnet/5.0. One was released today, so this change will be in the next one, in about a month.