Closed by danmoseley 1 year ago
There's another issue representing a crash in source generators on mono that's been plaguing us in CI: https://github.com/dotnet/runtime/issues/81123 . I don't know if it's the same thing.
@janvorli would you or someone on the VM team be interested in the dump here before it vanishes? At least #81123 seems fairly common and this might be related.
@danmoseley this is mono, so it would be better to point some mono folks to this. The coreclr runtime team doesn't have experience debugging mono.
@janvorli - doh - of course, brainstorm. Thanks.
@SamMonoRT there's a dump here if you want it, not sure how long before they time out.
@lambdageek - can you take an initial look
Thread 10 looks ok, actually. It's just doing some work. Thread 9 looks like it crashed:
Thread 9 (Thread 0xff7909f741c0 (LWP 34)):
#0 0x0000ff791595bd5c in __waitpid (pid=<optimized out>, stat_loc=0xff7909f6e9f0, options=<optimized out>) at ../sysdeps/unix/sysv/linux/waitpid.c:30
#1 0x0000ff79153519e8 in dump_native_stacktrace (signal=<optimized out>, mctx=<optimized out>) at /__w/1/s/src/mono/mono/mini/mini-posix.c:843
#2 mono_dump_native_crash_info (signal=<optimized out>, mctx=0xff7909f6f450, info=<optimized out>) at /__w/1/s/src/mono/mono/mini/mini-posix.c:870
#3 0x0000ff79153105c0 in mono_handle_native_crash (signal=0xff791510e1d2 "SIGSEGV", mctx=0xff7909f6f450, info=0xff7909f6f7b0) at /__w/1/s/src/mono/mono/mini/mini-exceptions.c:3005
#4 0x0000ff7915279b90 in mono_sigsegv_signal_handler_debug (_dummy=11, _info=0xff7909f6f7b0, context=0xff7909f6f830, debug_fault_addr=0x0) at /__w/1/s/src/mono/mono/mini/mini-runtime.c:3749
#5 <signal handler called>
#6 0x0000000000000000 in ?? ()
#7 0x0000ff7908de1d18 in ?? ()
#8 0x0000ff7914ad9848 in ?? ()
Backtrace stopped: previous frame inner to this frame (corrupt stack?)
and some output from mono_dump_native_crash_info
=====================================================
instruction pointer is NULL, skip dumping
=====================================================
Frames 7 and 8 look like JITed code. We'll need to try to repro locally.
I'll try to grab the coredump once I'm on my work computer.
Core dump expired.
@lambdageek the links to builds in https://github.com/dotnet/runtime/issues/81123 might contain coredumps.
@lambdageek as another data point the crashes only seem to happen on arm64
Got it to crash in a VM after a couple of hours of running the test in a loop. Not the same crash as this issue, unfortunately. An assertion failure during class setup due to - I'm guessing - a data race:
g_assert (klass == klass->supertypes [klass->idepth - 1]);
I see two threads in mono_class_setup_vtable_full for the same class. It does seem like it's possible for klass->supertypes and klass->idepth not to be set atomically.
I can't convince myself that it would cause the same kinds of failures we see in the CI jobs.
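To make the suspected race concrete, here is a minimal C11 sketch of one way a depth and a supertypes table can be made visible to other threads together. This is not Mono's code: `DemoClass`, `DemoHierarchy`, `demo_setup_supertypes` and `demo_check_invariant` are hypothetical names made up for illustration, and the real `MonoClass` layout and locking are different. The idea is only that if both values live behind one pointer that is published with a single compare-and-swap, a second thread sees either nothing or a fully built table, so a check like the assertion above cannot observe a new depth with an old table.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a class record; NOT Mono's MonoClass. */
typedef struct DemoClass DemoClass;

/* Depth and table live in one allocation so a single pointer store
 * publishes both at once. */
typedef struct {
    size_t idepth;
    DemoClass *supertypes[];  /* flexible array member, idepth entries */
} DemoHierarchy;

struct DemoClass {
    DemoClass *parent;
    _Atomic(DemoHierarchy *) hierarchy;  /* NULL until setup has run */
};

/* Build the table privately, then publish it with one compare-and-swap.
 * Returns the published table, or NULL if allocation failed. */
static DemoHierarchy *demo_setup_supertypes(DemoClass *klass)
{
    DemoHierarchy *h =
        atomic_load_explicit(&klass->hierarchy, memory_order_acquire);
    if (h != NULL)
        return h;  /* already set up, possibly by another thread */

    /* Count this class plus all of its ancestors. */
    size_t depth = 1;
    for (DemoClass *p = klass->parent; p != NULL; p = p->parent)
        depth++;

    DemoHierarchy *fresh =
        malloc(sizeof *fresh + depth * sizeof fresh->supertypes[0]);
    if (fresh == NULL)
        return NULL;
    fresh->idepth = depth;

    /* Fill from the last slot (klass itself) back to slot 0 (the root). */
    DemoClass *cur = klass;
    for (size_t i = depth; i-- > 0; cur = cur->parent)
        fresh->supertypes[i] = cur;

    DemoHierarchy *expected = NULL;
    if (atomic_compare_exchange_strong_explicit(
            &klass->hierarchy, &expected, fresh,
            memory_order_acq_rel, memory_order_acquire))
        return fresh;

    free(fresh);      /* another thread published first; use its table */
    return expected;
}

/* Same shape as the failing assertion: the last supertype is the class. */
static int demo_check_invariant(DemoClass *klass)
{
    DemoHierarchy *h = demo_setup_supertypes(klass);
    return h != NULL && h->supertypes[h->idepth - 1] == klass;
}
```

A caller would zero-initialize a `DemoClass`, set `parent`, and call `demo_check_invariant` from any number of threads. Whether the real failure comes from this kind of non-atomic publication is still a guess at this point in the thread.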
Probably the same as https://github.com/dotnet/runtime/issues/81123, which has a very similar stack trace in the Roslyn3.11 test suite. Going to try running that other test suite in a loop, too.
Seen again in an unrelated PR in main: https://github.com/dotnet/runtime/pull/83356
Libraries Test Run release mono linux arm64 Debug
This is a dupe of https://github.com/dotnet/runtime/issues/81123
https://helixre8s23ayyeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-81202-merge-135c923415104d6c85/Microsoft.Extensions.Logging.Generators.Roslyn4.0.Tests/1/console.6c3784b3.log?helixlogtype=result
net8.0-linux-Debug-arm64-Mono_release-(Ubuntu.1804.Arm64.Open)Ubuntu.1804.ArmArch.Open@mcr.microsoft.com/dotnet-buildtools/prereqs:ubuntu-18.04-helix-arm64v8
Possible stack trace for the crash
(original issue description note: this is not the problematic stack trace)
Not sure how it indicates which thread failed, but I'm guessing it's this one?
there is a core dump https://dev.azure.com/dnceng-public/public/_build/results?buildId=148909&view=ms.vss-test-web.build-test-results-tab&runId=3170018&paneView=dotnet-dnceng.dnceng-build-release-tasks.helix-test-information-tab&resultId=200309