dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.27k stars 4.73k forks source link

[NativeAOT] System.Runtime.Tests occasionally crashes on linux-arm-NativeAOT #99079

Closed VSadov closed 7 months ago

VSadov commented 8 months ago

The crash does not happen often - just once in a few CI runs. So far I was unable to get a dump or any other info for the failure. I'd need to set up a local repro.

For example seen in https://github.com/dotnet/runtime/pull/99031 :

https://[helixre8s23ayyeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-99031-merge-ad45d5f9cac945e0af/System.Runtime.Tests/1/console.a6a20f16.log?helixlogtype=result](https://helixre8s23ayyeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-99031-merge-ad45d5f9cac945e0af/System.Runtime.Tests/1/console.a6a20f16.log?helixlogtype=result)

ghost commented 8 months ago

Tagging subscribers to this area: @agocke, @MichalStrehovsky, @jkotas See info in area-owners.md if you want to be subscribed.

Issue Details
The crash does not happen often - just once in a few CI runs. So far I was unable to get a dump or any other info for the failure. I'd need to set up a local repro. For example seen in https://github.com/dotnet/runtime/pull/99031 : https://[helixre8s23ayyeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-99031-merge-ad45d5f9cac945e0af/System.Runtime.Tests/1/console.a6a20f16.log?helixlogtype=result](https://helixre8s23ayyeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-99031-merge-ad45d5f9cac945e0af/System.Runtime.Tests/1/console.a6a20f16.log?helixlogtype=result)
Author: VSadov
Assignees: -
Labels: `area-NativeAOT-coreclr`
Milestone: -
filipnavara commented 8 months ago

Preliminary notes:

VSadov commented 8 months ago

We may be missing some of the Align8 fixes done

I once saw a failure at this stack:

[SKIP] System.Collections.Concurrent.Tests.ConcurrentBagTests.DebuggerTypeProxy_Ctor_NullArgument_Throws
Process terminated. Access Violation: Attempted to read or write protected memory. This is often an indication that other memory is corrupt. The application will be terminated since this platform does not support throwing an AccessViolationException.
   at System.RuntimeExceptionHelpers.FailFast(String, Exception, String, RhFailFastReason, IntPtr, IntPtr) + 0x161
   at System.RuntimeExceptionHelpers.GetRuntimeException(ExceptionIDs) + 0x2a1
   at System.Runtime.EH.GetClasslibException(ExceptionIDs, IntPtr) + 0x27
   at System.Runtime.EH.RhThrowHwEx(UInt32, EH.ExInfo&) + 0x8d
   at System.Threading.PortableThreadPool.HillClimbing.GetWaveComponent(Double[], Int32, Double) + 0x99
   at System.Threading.PortableThreadPool.HillClimbing.Update(Int32, Double, Int32) + 0x2fb
   at System.Threading.PortableThreadPool.AdjustMaxWorkersActive() + 0x133
   at System.Threading.ThreadPoolWorkQueue.Dispatch() + 0x335
   at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart() + 0x12b
   at System.Threading.Thread.StartThread(IntPtr) + 0xd5
   at System.Threading.Thread.ThreadEntryPoint(IntPtr) + 0xf
./RunTests.sh: line 179:    86 Aborted                 (core dumped) ./System.Collections.Concurrent.Tests -notrait category=IgnoreForCI -notrait category=OuterLoop -notrait category=failing -xml testResults.xml $RSP_FILE

Not necessarily because of double[] alignment, but there are some double[] involved.

I kind of think if this system was configured to fail on misaligned accesses, it would fail in more tests. Also this looks like an AV, not alignment problem, so maybe it was something else.

EgorBo commented 8 months ago

Speaking of align8 - does NAOT respect align8 for e.g.

static readonly double[] arr = new double[100];

during preinitialization?

VSadov commented 8 months ago

Speaking of align8 - does NAOT respect align8 for e.g.

static readonly double[] arr = new double[100]; during preinitialization?

Last time I checked RequiresAlign8Flag was not set for anything so types that want 8-align could not report it. So I doubt it.

It is not a lot of changes to enable this, but it would be "a bit here, a bit there." Respecting 8-align is basically an ARM32 feature, so it was not implemented and would not be testable, even if implemented until there is ARM32 support.

VSadov commented 8 months ago
                // Currently we don't support frozen objects with special alignment requirements
                // TODO: We should also give up on arrays of doubles on 32-bit platforms.
                // (we currently never allocate them on frozen segments)
#if FEATURE_64BIT_ALIGNMENT
                if (type->RequiresAlign8)
                {
                    // Align8 objects are not supported yet
                    return null;
                }
#endif

So, nope.

EgorBo commented 8 months ago

So, nope.

The code you quoted is used to allocate objects on nongc in run-time, it's different from preinitialization that also allocates on nongc.

EgorBo commented 8 months ago

I kind of think if this system was configured to fail on misaligned accesses, it would fail in more tests. Also this looks like an AV, not alignment problem, so maybe it was something else.

Afair, some floating-point related loads/store (as well as atomicity-related) are not fixable by OS and may fail

VSadov commented 8 months ago

Right, that is not even in NAOT. Sorry.

Generally aligning frozen arrays should be doable. Just do what GC does - overallocate space and pad in front with empty objects (dummy byte[] would work too). I do not see any code attempting to do that.

VSadov commented 8 months ago

Afair, some floating-point related loads/store (as well as atomicity-related) are not fixable by OS and may fail

That what I suspected too. Perhaps OS was configured to handle misaligned accesses, but could not handle all the cases. I am not very familiar with how robust the emulation is. I'd assume it is robust, since it is in OS, but would not be too surprised if there are limitations.

VSadov commented 8 months ago

Generally aligning frozen arrays should be doable. Just do what GC does - overallocate space and pad in front with empty objects (dummy byte[] would work too). I do not see any code attempting to do that.

Actually, since it is not in the real heap, I am not sure if it needs to be parseable/walkable, so maybe just leaving gaps in front will work too. One way to find out. :-)

EgorBo commented 8 months ago

Actually, since it is not in the real heap, I am not sure if it needs to be parseable/walkable,

In CoreCLR it's still walkable in two cases: 1) Profiler API, I presume those aren't (yet?) supported by NAOT 2) There is an edge case in GC (Segment-mode) when if nongc segments somehow sneak into "in-range" GC will walk (and do mark phase) them too

EgorBo commented 8 months ago

So here is a test we can try:

using System.Runtime.CompilerServices;

public class Prog
{
    static void Main()
    {
        Test(doubles);
    }

    static readonly double[] doubles = new double[100];

    [MethodImpl(MethodImplOptions.NoInlining)]
    static double Test(double[] d)
    {
        return d[0];
    }
}

I can't check real codegen for NAOT-arm32 (perhaps, @filipnavara can check?) but on CoreCLR AltJIT I am seeing:

; Assembly listing for method Prog:Test(double[]):double (FullOpts)
G_M31521_IG01:  ;; offset=0x0000
000000      push    {r11,lr}
000004      mov     r11, sp
G_M31521_IG02:  ;; offset=0x0006
000006      movs    r3, 0
000008      ldr     r2, [r0+0x04]
00000A      cmp     r3, r2
00000C      bhs     SHORT G_M31521_IG04
00000E      vldr    d0, [r0+0x08]   ;;;;;;; <--------------------
G_M31521_IG03:  ;; offset=0x0012
000012      pop     {r11,pc}
G_M31521_IG04:  ;; offset=0x0016
000016      movw    r3, 0x7bd0
00001A      movt    r3, 0x8248
00001E      blx     r3      // CORINFO_HELP_RNGCHKFAIL
000020      bkpt    
; Total bytes of code 34

And that vldr is not fixable, e.g. see https://github.com/raspberrypi/linux/issues/3099#issuecomment-517263384

VSadov commented 8 months ago

and do mark phase

That is mildly disturbing. Yes, if there is marking, especially of byrefs, it will need to walk at least to figure where the containing object starts.

MichalPetryka commented 8 months ago

Wouldn't Align8 be also needed for 8B Exchange/CompareExchange on 32 bit platforms? EDIT: ldrexd/strexd on ARM32 do require it so it's probably broken.

filipnavara commented 8 months ago

FWIW ARMv7+ handles unaligned accesses in hardware with few notable exceptions:

Several of the test crashes point to misalignment of the double on memory load.

MichalStrehovsky commented 8 months ago

and do mark phase

That is mildly disturbing. Yes, if there is marking, especially of byrefs, it will need to walk at least to figure where the containing object starts.

I'll simply block these in #99104. It's not clear if this is the problem for these tests, but it's necessary for correctness anyway.

jkotas commented 8 months ago

https://helixre8s23ayyeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-99031-merge-ad45d5f9cac945e0af/System.Runtime.Tests/1/console.a6a20f16.log?helixlogtype=result

This collected a dump. You can get it by running:

dotnet tool install --global runfo runfo get-helix-payload -j ad45d5f9-cac9-45e0-af17-19acbf27c4bf -w System.Runtime.Tests -o c:\helix_payload\System.Runtime.Tests

The stacktrace of the crash is:

libc_2_31+0x177e6
System_Runtime!System.Globalization.CalendricalCalculationsHelper::Compute+0x1e [/_/src/libraries/System.Private.CoreLib/src/System/Globalization/CalendricalCalculationsHelper.cs @ 354]
System_Runtime!System.Globalization.CalendricalCalculationsHelper::PersianNewYearOnOrBefore+0x108 [/_/src/libraries/System.Private.CoreLib/src/System/Globalization/CalendricalCalculationsHelper.cs @ 384]
System_Runtime!System.Globalization.PersianCalendar::GetAbsoluteDatePersian+0x6a [/_/src/libraries/System.Private.CoreLib/src/System/Globalization/PersianCalendar.cs @ 70]
System_Runtime!System.Globalization.PersianCalendar::ToDateTime+0x34 [/_/src/libraries/System.Private.CoreLib/src/System/Globalization/PersianCalendar.cs @ 382]
...

Looks like it crashed somewhere in Math.Sin.

filipnavara commented 8 months ago

Looks like it crashed somewhere in Math.Sin.

That's the issue tracked in https://github.com/dotnet/runtime/issues/98795. Dump is useful.

filipnavara commented 8 months ago

I got the payload from ad45d5f9-cac9-45e0-af17-19acbf27c4bf mentioned above and run it couple of times on my Raspberry Pi. It crashes quite often. Turns out that the linker produced the long thunks inside the managed code section and there's no unwinding information available for them:

      Address: System.Runtime.Tests[0x014990f8] (System.Runtime.Tests.PT_LOAD[1].__managedcode + 56)
      Summary: System.Runtime.Tests`__ThumbV7PILongThunk_log + 8
       Module: file = "/home/navara/srt/workitems/System.Runtime.Tests/System.Runtime.Tests", arch = "arm"
       Symbol: id = {0x000003ed}, range = [0x019f90f0-0x019f90fc), name="__ThumbV7PILongThunk_log"

I'll need to figure out which linker [version] did that. In my experiments the thunks were generated in a separate section and LLD source code shows that was the intended behavior.

filipnavara commented 8 months ago

While analyzing this I found one pattern that we don't recognize in IsInProlog correctly:

System.Runtime.Tests`System.SpanHelpers__NonPackedIndexOfValueType<Int32__System.SpanHelpers_Negate`1<Int32>>
    0x4150a30 <+0>:    push.w {r4, r5, r6, r10, r11, lr}
    0x4150a34 <+3>:    add.w  r11, sp, #0x10     <--------------
    0x4150a38 <+7>:    sub.w  sp, sp, #0x778

Furthermore, some of sub sp, X and add sp, X patterns are incorrectly recognized as prolog when appearing in the middle of a method with frame register (r9).

filipnavara commented 7 months ago

I think this it resolved now.

MichalStrehovsky commented 7 months ago

Yes, i don't see crashes in the arm legs in the past week. Thank you Filip!