Closed kevinmv closed 3 years ago
I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.
Adding area owners @BruceForstall @dotnet/jit-contrib
@kevinmv Thank you for the report, and investigation. It does smell like a codegen bug. Given that it's not reproducing in .NET 6, perhaps we just need to figure out what fixed the problem and backport the fix. Let me figure out how to get an upload from you of the repros.
In the meantime, if you have the capability to build your own version of .NET from source, and can build a Checked build, running with "export COMPlus_JitDump=ScheduleFlushSend ScheduleSend" would be be the first thing we would likely do, to see the JIT internal output for these functions. (You could upload the output as a gist if there's nothing private you can't share publicly.)
Thanks! I made a local build and it seems really quite informative. The Checked build does not fail immediately the first time ScheduleFlushSend
is called but after ~40 invocations the method is recompiled and crashes right after. I've never read this log before but it seems like a possible bug with the ScheduleFlushSend
trying to be devirtualized ScheduleSend
.
jit-opt-bug-out.txt
It's unclear to me if this bug has been fixed in .Net 6 or just doesn't present itself. Let me know if I can provide any more information to help.
All the best, Kev
This looks like bad code generation in ScheduleFlushSend()
, where it passes arguments to ScheduleSend()
that don't match what ScheduleSend()
expects to see.
In particular, the problem is when we think we can make a fast (optimistic) tailcall from ScheduleFlushSend()
to ScheduleSend()
(because ScheduleFlushSend()
just returns the value of ScheduleSend()
and does no further processing), but, late, we realize we can't because don't have enough available stack space to place the outgoing arguments. So we back out, and attempt to create a normal call. But, in the special case the called function returns a struct in multiple registers (in this case, the 16 byte JobHandle
struct is passed in consecutive registers on Linux x64), we do a little extra processing to handle the return value. One more special-case: the call to ScheduleSend()
is a virtual stub dispatch call where we previously added a special, hidden argument (the address of the VSD cell). When we process the call again (morph the call + call args), we forget about that, and think this special arg is a normal arg, so we erroneously add another special arg.
So, in summary, for a call A->B, to see this failure, we need:
fgMorphPotentialTailCall()
(e.g., not enough incoming arg stack space in A to store B's outgoing args), in this case because the first arg to ScheduleSend
is a large struct)B
returns a struct in multiple registers (e.g., a 16-byte struct in Linux x64 ABI)fgMorphCall
)I'm not sure why this wouldn't repro in .NET 6; the problem still seems to exist. I need to try and construct a small repro case.
Thanks @BruceForstall that is a great breakdown. Let me know if I can provide anymore information.
Is there some way for us to be notified or find out when this fix is backported to .NET 5?
Reopening for .NET 5 port.
Is there some way for us to be notified or find out when this fix is backported to .NET 5?
The best way would be to follow this issue (I'll submit a PR with the .NET 5 port of the fix soon)
hey! any updates on the backport? :)
hey! any updates on the backport? :)
@acgaudette It just got merged to the 5.0 branch.
hey! any updates on the backport? :)
@acgaudette It just got merged to the 5.0 branch.
@BruceForstall awesome. do you know when the next release might appear?
The previews are monthly, as mentioned in the last release blog post here: https://devblogs.microsoft.com/dotnet/announcing-net-6-preview-2/. So... soon.
@BruceForstall would that be true even for the backport ?
@BruceForstall would that be true even for the backport ?
oops, sorry, got confused between releases there...
However, service patch releases for 5.0 are also release on an approximately monthly cadence, as can be seen from the release page: https://dotnet.microsoft.com/download/dotnet/5.0. One was released today, so this change will be in the next one, in about a month.
Hello!
We are seeing what looks to be a JIT optimization bug in .Net Core 5.0.200 but only for OSX and Linux, Windows does not present any issues.
Description
We have a function
ScheduleFlushSend(JobHandle dep)
which takes a small struct as an input parameter:Inside
ScheduleFlushSend
we pass along thedep
JobHandle parameter to an implemented interface methodScheduleSend
however depending on optimization levels thedep
jobhandle is corrupted.Release builds can be forced to work by decorating
ScheduleFlushSend
with[MethodImpl(MethodImplOptions.NoInlining)]
, or alternatively running withCOMPlus_JITMinOpts=1
The crash would only happen after calling
ScheduleFlushSend
30+ times, however by running withCOMPlus_TieredCompilation=0
the corruption is immediate further supporting our suspicion this is a JIT optimization bug.Printing out the values of the
JobHandle
members before passing / after entering functions shows what looks to be an issue where inlining the function results in the parameter being read from a memory address that is off by 8 bytes.(in this example RPCSystem.OnUpdate calls ScheduleFlushSend which calls ScheduleSend)
Here you can see the
JobGroup
field's value has shifted into the second member of the struct, and the first member is a random value (it actually looks like a valid, irrelevant, memory address)I can send a working (forcing noinlining on
ScheduleFlushSend
) repo and the crashing repo (removing the forced inlining) if provided an address.Configuration
Other information