dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Reloc failures with NativeAOT on Apple Silicon #67232

Closed am11 closed 2 years ago

am11 commented 2 years ago

I am trying to enable NativeAOT on OSX arm64. With this patch https://github.com/dotnet/runtime/compare/main...am11:feature/nativeaot/osx-arm64 (tested with both @GOTPAGE and @PAGE assembler directives), it builds the nupkg. Consuming that package results in the following errors during the ilc step:

# with `<add key="TestSource" value="/Users/am11/projects/runtime/artifacts/packages/Release/Shipping" />`
# in NuGet.config
$ dotnet nuget locals all --clear && rm -rf obj bin && dotnet publish --use-current-runtime -v:diag ...
... snip ...
21:06:05.007   1:7>Target "IlcCompile: (TargetId:181)" in file "/Users/am11/.nuget/packages/microsoft.dotnet.ilcompiler/7.0.0-dev/build/Microsoft.NETCore.Native.targets" from project "/Users/am11/projects/naot1/naot1.csproj" (target "LinkNative" depends on it):
                   Building target "IlcCompile" completely.
                   Output file "obj/release/net7.0/osx-arm64/native/naot1.o" does not exist.
                   Task "Message" skipped, due to false condition; ($(_BuildingInCompatibleMode) != 'true') was evaluated as (true != 'true').
                   Task "Message" (TaskId:126)
                     Task Parameter:Text=Generating compatible native code. To optimize for size or speed, visit https://aka.ms/OptimizeCoreRT (TaskId:126)
                     Task Parameter:Importance=high (TaskId:126)
                     Generating compatible native code. To optimize for size or speed, visit https://aka.ms/OptimizeCoreRT (TaskId:126)
                   Done executing task "Message". (TaskId:126)
                   Task "Exec" (TaskId:127)
                     Task Parameter:Command="/Users/am11/.nuget/packages/runtime.osx-arm64.microsoft.dotnet.ilcompiler/7.0.0-dev/tools/ilc" @"obj/release/net7.0/osx-arm64/native/naot1.ilc.rsp" (TaskId:127)
                     "/Users/am11/.nuget/packages/runtime.osx-arm64.microsoft.dotnet.ilcompiler/7.0.0-dev/tools/ilc" @"obj/release/net7.0/osx-arm64/native/naot1.ilc.rsp" (TaskId:127)
                     <unknown>:0: error: ADR/ADRP relocations must be GOT relative (TaskId:127)
                     <unknown>:0: error: unknown AArch64 fixup kind! (TaskId:127)
                     <unknown>:0: error: unknown AArch64 fixup kind! (TaskId:127)
                     <unknown>:0: error: fixup value out of range (TaskId:127)
                     <unknown>:0: error: ADR/ADRP relocations must be GOT relative (TaskId:127)
                     <unknown>:0: error: unknown AArch64 fixup kind! (TaskId:127)
                     <unknown>:0: error: unknown AArch64 fixup kind! (TaskId:127)
                     <unknown>:0: error: fixup value out of range (TaskId:127)
... repeats 1000s of times ...

These errors appear somewhere after the objwriter has succeeded (https://github.com/dotnet/runtime/blob/071e772d9d3bd8b50a5380bce6214277a1e61c98/src/coreclr/tools/aot/ILCompiler.Compiler/Compiler/DependencyAnalysis/ObjectWriter.cs#L1183) and before the clang command is executed. While the ilc task does not fail, MSBuild fails on the clang step:

                 Set Property: _IgnoreLinkerWarnings=false
                   Set Property: _IgnoreLinkerWarnings=true
                   Task "Exec" (TaskId:129)
                     Task Parameter:IgnoreStandardErrorWarningFormat=True (TaskId:129)
                     Task Parameter:Command=clang "obj/release/net7.0/osx-arm64/native/naot1.o" -o "bin/release/net7.0/osx-arm64/native/naot1" /Users/am11/.nuget/packages/runtime.osx-arm64.microsoft.dotnet.ilcompiler/7.0.0-dev/sdk/libbootstrapper.a /Users/am11/.nuget/packages/runtime.osx-arm64.microsoft.dotnet.ilcompiler/7.0.0-dev/sdk/libRuntime.WorkstationGC.a /Users/am11/.nuget/packages/runtime.osx-arm64.microsoft.dotnet.ilcompiler/7.0.0-dev/framework/libSystem.Native.a /Users/am11/.nuget/packages/runtime.osx-arm64.microsoft.dotnet.ilcompiler/7.0.0-dev/framework/libSystem.Globalization.Native.a /Users/am11/.nuget/packages/runtime.osx-arm64.microsoft.dotnet.ilcompiler/7.0.0-dev/framework/libSystem.IO.Compression.Native.a /Users/am11/.nuget/packages/runtime.osx-arm64.microsoft.dotnet.ilcompiler/7.0.0-dev/framework/libSystem.Net.Security.Native.a /Users/am11/.nuget/packages/runtime.osx-arm64.microsoft.dotnet.ilcompiler/7.0.0-dev/framework/libSystem.Security.Cryptography.Native.Apple.a -g -Wl,-rpath,'@executable_path' -lstdc++ -ldl -lm -lz -licucore -framework CoreFoundation -framework Foundation -framework Security -framework GSS (TaskId:129)
                     clang "obj/release/net7.0/osx-arm64/native/naot1.o" -o "bin/release/net7.0/osx-arm64/native/naot1" /Users/am11/.nuget/packages/runtime.osx-arm64.microsoft.dotnet.ilcompiler/7.0.0-dev/sdk/libbootstrapper.a /Users/am11/.nuget/packages/runtime.osx-arm64.microsoft.dotnet.ilcompiler/7.0.0-dev/sdk/libRuntime.WorkstationGC.a /Users/am11/.nuget/packages/runtime.osx-arm64.microsoft.dotnet.ilcompiler/7.0.0-dev/framework/libSystem.Native.a /Users/am11/.nuget/packages/runtime.osx-arm64.microsoft.dotnet.ilcompiler/7.0.0-dev/framework/libSystem.Globalization.Native.a /Users/am11/.nuget/packages/runtime.osx-arm64.microsoft.dotnet.ilcompiler/7.0.0-dev/framework/libSystem.IO.Compression.Native.a /Users/am11/.nuget/packages/runtime.osx-arm64.microsoft.dotnet.ilcompiler/7.0.0-dev/framework/libSystem.Net.Security.Native.a /Users/am11/.nuget/packages/runtime.osx-arm64.microsoft.dotnet.ilcompiler/7.0.0-dev/framework/libSystem.Security.Cryptography.Native.Apple.a -g -Wl,-rpath,'@executable_path' -lstdc++ -ldl -lm -lz -licucore -framework CoreFoundation -framework Foundation -framework Security -framework GSS (TaskId:129)
                     ld: malformed __LD,__compact_unwind section, bad length file 'obj/release/net7.0/osx-arm64/native/naot1.o' (TaskId:129)
                     clang: error: linker command failed with exit code 1 (use -v to see invocation) (TaskId:129)
21:06:12.873   1:7>/Users/am11/.nuget/packages/microsoft.dotnet.ilcompiler/7.0.0-dev/build/Microsoft.NETCore.Native.targets(337,5): error MSB3073: The command "clang "obj/release/net7.0/osx-arm64/native/naot1.o" -o "bin/release/net7.0/osx-arm64/native/naot1" /Users/am11/.nuget/packages/runtime.osx-arm64.microsoft.dotnet.ilcompiler/7.0.0-dev/sdk/libbootstrapper.a /Users/am11/.nuget/packages/runtime.osx-arm64.microsoft.dotnet.ilcompiler/7.0.0-dev/sdk/libRuntime.WorkstationGC.a /Users/am11/.nuget/packages/runtime.osx-arm64.microsoft.dotnet.ilcompiler/7.0.0-dev/framework/libSystem.Native.a /Users/am11/.nuget/packages/runtime.osx-arm64.microsoft.dotnet.ilcompiler/7.0.0-dev/framework/libSystem.Globalization.Native.a /Users/am11/.nuget/packages/runtime.osx-arm64.microsoft.dotnet.ilcompiler/7.0.0-dev/framework/libSystem.IO.Compression.Native.a /Users/am11/.nuget/packages/runtime.osx-arm64.microsoft.dotnet.ilcompiler/7.0.0-dev/framework/libSystem.Net.Security.Native.a /Users/am11/.nuget/packages/runtime.osx-arm64.microsoft.dotnet.ilcompiler/7.0.0-dev/framework/libSystem.Security.Cryptography.Native.Apple.a -g -Wl,-rpath,'@executable_path' -lstdc++ -ldl -lm -lz -licucore -framework CoreFoundation -framework Foundation -framework Security -framework GSS" exited with code 1. [/Users/am11/projects/naot1/naot1.csproj]
                   Done executing task "Exec" -- FAILED. (TaskId:129)
21:06:12.873   1:7>Done building target "LinkNative" in project "naot1.csproj" -- FAILED.: (TargetId:182)

With objdump, that __LD,__compact_unwind section looks like:

Disassembly of section __LD,__compact_unwind:

00000000003b2858 <ltmp8>:
  3b2858: 40 4b 00 00   udf     #19264
  3b285c: 00 00 00 00   udf     #0
  3b2860: 74 00 00 00   udf     #116
  3b2864: 00 00 00 03   <unknown>
                ...
  3b2874: c0 4b 00 00   udf     #19392
  3b2878: 00 00 00 00   udf     #0
  3b287c: 74 00 00 00   udf     #116
  3b2880: 00 00 00 03   <unknown>
                ...
  3b2890: 40 4c 00 00   udf     #19520
  3b2894: 00 00 00 00   udf     #0
  3b2898: 74 00 00 00   udf     #116
  3b289c: 00 00 00 03   <unknown>
                ...
  3b28ac: c0 4c 00 00   udf     #19648
  3b28b0: 00 00 00 00   udf     #0
  3b28b4: 74 00 00 00   udf     #116
  3b28b8: 00 00 00 03   <unknown>
... repeats ...
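For readers following along, the dump above can be decoded by hand. The sketch below is an editorial illustration, not part of the original report: it assumes the usual 64-bit layout of a relocatable compact unwind entry (function start, length, encoding, personality, LSDA, 32 bytes in total) and the `UNWIND_ARM64_MODE_*` constants from `compact_unwind_encoding.h`. The bytes and addresses are copied from the objdump output; the names `decode_head` and `entry_addresses` are made up for the example. Whether the 28-byte stride it reports is what the linker means by "bad length" is a guess.

```python
import struct

# Assumed layout of one 64-bit entry in a relocatable __LD,__compact_unwind
# section: function start (8), length (4), encoding (4), personality (8), LSDA (8).
ENTRY = struct.Struct("<QIIQQ")

UNWIND_ARM64_MODE_MASK = 0x0F000000
UNWIND_ARM64_MODE_DWARF = 0x03000000

# Leading 16 bytes of the first entry, copied from the objdump output.
first_head = bytes.fromhex("404b0000" "00000000" "74000000" "00000003")

# Addresses at which consecutive entries start in the same dump.
entry_addresses = [0x3B2858, 0x3B2874, 0x3B2890, 0x3B28AC]


def decode_head(raw):
    """Decode function start, function length and encoding from 16 bytes."""
    start, length, encoding = struct.unpack("<QII", raw)
    return start, length, encoding


start, length, encoding = decode_head(first_head)
is_dwarf = (encoding & UNWIND_ARM64_MODE_MASK) == UNWIND_ARM64_MODE_DWARF
strides = {b - a for a, b in zip(entry_addresses, entry_addresses[1:])}

print(f"start={start:#x} length={length} encoding={encoding:#010x} dwarf={is_dwarf}")
print(f"observed stride={sorted(strides)} expected entry size={ENTRY.size}")
```

Under these assumptions, each entry says "fall back to DWARF" for a 116-byte function, and consecutive entries are spaced 28 bytes apart rather than the 32 bytes the assumed layout requires.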

am11 commented 2 years ago

Congratulations on the baby boy! 🎉

Thank you for your help and advice. I have learned multiple aspects of NativeAOT, objwriter and lldb in doing this exercise. Looking forward to continuing to hack on this. 🙂

filipnavara commented 2 years ago

The crash in https://github.com/dotnet/runtime/issues/67232#issuecomment-1086634886 (with SupportsRelativePointers = false) is caused by this branch:

https://github.com/dotnet/runtime/blob/0ce9c8c61e1bc2d7a10db5df5be6039eabe692ba/src/coreclr/nativeaot/Runtime/inc/MethodTable.inl#L93-L99

It is missing the check for SupportsRelativePointers but I am not sure how to do it in the native code. Fixing that would make it progress further.

There still seem to be some issues around unwind info and the generated DWARF debug info. The DWARF debug info is actually a separate issue in llvm/objwriter, where misordered calls in the original LLVM MC code cause the DWARF debug section to be generated with incorrect values/relocations. Instead of writing addresses relative to the __DWARF segment, it produces absolute addresses + relocation. That makes the debugging info unreadable by dwarfdump.

MichalStrehovsky commented 2 years ago

Nice find @filipnavara!

> It looks like we are not testing SupportsRelativePointers=false in the CI (CppCodegen was not ported in corert->runtimelab migration, and wasm was not ported in runtimelab->runtime migration), so it is probably broken on other targets

USE_PORTABLE_HELPERS is defined for both CPPCODEGEN and WASM, so that's why it works there - it goes straight to the "treat it as a full pointer" path.

> it is probably broken on other targets (linux-x64 etc.). It is probably better to stick with defaults of SupportsRelativePointers which is known to be working on supported targets

I'm not so afraid of that - I checked and Hello world works. I've submitted #74858 to see if everything else works. Of course getting the relative relocs working would be preferred, but if Apple killed them and anybody who needs them is just holding it wrong this is what we'll have to do anyway. It could be good enough to unblock a hello world at least.

MichalStrehovsky commented 2 years ago

> Ok, so we can define it via commandline ./build.sh -cmakeargs -DUSE_PORTABLE_HELPERS or add add_definitions(-DUSE_PORTABLE_HELPERS) in src/coreclr/nativeaot/Runtime/Full/CMakeLists.txt for temporary testing.

That would opt into a lot more behaviors than we want. Smashing it to #if 0 like I did in the pull request would be better.

If we want to keep it, we would want to introduce FEATURE_RELATIVE_RELOCS or something like that. There is going to be one more spot to deal with within the native runtime to compensate for https://github.com/dotnet/runtime/issues/67232#issuecomment-1084146826. This will be needed to catch exceptions.

jkotas commented 2 years ago

How hard is it to make the relative relocs work with ARM64_RELOC_SUBTRACTOR? The relative relocs are an important size optimization. We would want to figure out how to enable them on Apple Silicon.

filipnavara commented 2 years ago

ARM64_RELOC_SUBTRACTOR is just a kind of modifier for other reloc types. I don't think it would help here.

jkotas commented 2 years ago

Right, it says to subtract an address from where the reloc is pointing to. It should be exactly what we need here.

Here is an example of how LLVM creates relative relocs using the subtractor: https://github.com/llvm/llvm-project/blob/6c9f6812523a706c11a12e6cb4119b0cf67bbb21/lld/MachO/EhFrame.cpp#L108-L130

filipnavara commented 2 years ago

Couple of notes about the compact unwinding:

UPD: Apparently the code offsets in the compact unwind info point to completely wrong locations. Since the hex offsets look alright in the actual __compact_unwind section, I think it may be a recurrence of some relocation issue.

UPD 2: ...and in fact it turned out to be an issue with my local change.

filipnavara commented 2 years ago

I got way further (e.g. printing "Hello World" works) but exceptions failed to unwind. That's why I started poking into it.

am11 commented 2 years ago

> printing "Hello World" works

That is great news! 🎉

If you can push the changes somewhere, I can join in. 🙂

filipnavara commented 2 years ago

> If you can push the changes somewhere, I can join in.

They need to be cleaned up... I have a lot of printf-style debugging and comments going on.

filipnavara commented 2 years ago

I committed the LLVM change for unwinding to your branch (https://github.com/dotnet/llvm-project/pull/185). On the dotnet/runtime side I use my own branch (https://github.com/dotnet/runtime/compare/main...filipnavara:runtime:nativeaot-m1) but feel free to cherry pick the last commit (https://github.com/dotnet/runtime/commit/e65a65edab8718ef8bb5b9de5a17077229fbbbff) which is the only relevant one for unwinding.

Aside from that there's only a hack for SupportsRelativePointers.

My current state:

filipnavara@172-4-1-20 runtime % lldb ./artifacts/tests/coreclr/OSX.arm64.Debug/nativeaot/SmokeTests/FrameworkStrings/UseSystemResourceKeys/native/UseSystemResourceKeys
error: no such file
error: 'setsymbolserver' is not a valid command.
(lldb) target create "./artifacts/tests/coreclr/OSX.arm64.Debug/nativeaot/SmokeTests/FrameworkStrings/UseSystemResourceKeys/native/UseSystemResourceKeys"
Current executable set to '/Users/filipnavara/Projects/runtime/artifacts/tests/coreclr/OSX.arm64.Debug/nativeaot/SmokeTests/FrameworkStrings/UseSystemResourceKeys/native/UseSystemResourceKeys' (arm64).
(lldb) run
Process 85278 launched: '/Users/filipnavara/Projects/runtime/artifacts/tests/coreclr/OSX.arm64.Debug/nativeaot/SmokeTests/FrameworkStrings/UseSystemResourceKeys/native/UseSystemResourceKeys' (arm64)
Arg_NullReferenceException
Arg_NullReferenceException
Process 85278 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
    frame #0: 0x00000001000a8c7c UseSystemResourceKeys`libunwind::LocalAddressSpace::get32(this=0x00000001006926d8, addr=0) at AddressSpace.hpp:164:5
   161    }
   162    uint32_t         get32(pint_t addr) {
   163      uint32_t val;
-> 164      memcpy(&val, (void *)addr, sizeof(val));
   165      return val;
   166    }
   167    uint64_t         get64(pint_t addr) {
Target 0: (UseSystemResourceKeys) stopped.
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
  * frame #0: 0x00000001000a8c7c UseSystemResourceKeys`libunwind::LocalAddressSpace::get32(this=0x00000001006926d8, addr=0) at AddressSpace.hpp:164:5
    frame #1: 0x00000001000a6be4 UseSystemResourceKeys`libunwind::CFI_Parser<libunwind::LocalAddressSpace>::decodeFDE(addressSpace=0x00000001006926d8, fdeStart=0, fdeInfo=0x000000016fdfe450, cieInfo=0x000000016fdfe418, useCIEInfo=false) at DwarfParser.hpp:176:43
    frame #2: 0x00000001000a4ca0 UseSystemResourceKeys`libunwind::DwarfInstructions<libunwind::LocalAddressSpace, Registers_REGDISPLAY>::stepWithDwarf(addressSpace=0x00000001006926d8, pc=4297402924, fdeStart=0, registers=0x000000016fdfedd0, isSignalFrame=0x000000016fdfe4e7) at DwarfInstructions.hpp:181:7
    frame #3: 0x00000001000a3e9c UseSystemResourceKeys`DoTheStep(pc=4297402924, uwInfoSections=UnwindInfoSections @ 0x000000016fdfe908, regs=0x000000016fdfedd0) at UnwindHelpers.cpp:800:19
    frame #4: 0x00000001000a5168 UseSystemResourceKeys`UnwindHelpers::StepFrame(regs=0x000000016fdfedd0) at UnwindHelpers.cpp:833:12
    frame #5: 0x00000001000a3370 UseSystemResourceKeys`VirtualUnwind(pRegisterSet=0x000000016fdfedd0) at UnixContext.cpp:693:12
    frame #6: 0x00000001000b6fbc UseSystemResourceKeys`UnixNativeCodeManager::UnwindStackFrame(this=0x0000600000c08210, pMethodInfo=0x000000016fdfef28, pRegisterSet=0x000000016fdfedd0, ppPreviousTransitionFrame=0x000000016fdfeb00) at UnixNativeCodeManager.cpp:322:10
    frame #7: 0x000000010001ff08 UseSystemResourceKeys`StackFrameIterator::NextInternal(this=0x000000016fdfedb0) at StackFrameIterator.cpp:1436:5
    frame #8: 0x000000010001fda8 UseSystemResourceKeys`StackFrameIterator::Next(this=0x000000016fdfedb0) at StackFrameIterator.cpp:1408:5
    frame #9: 0x0000000100020a44 UseSystemResourceKeys`::RhpSfiNext(pThis=0x000000016fdfedb0, puExCollideClauseIdx=0x000000016fdfed3c, pfUnwoundReversePInvoke=0x000000016fdfece8) at StackFrameIterator.cpp:1967:12
    frame #10: 0x0000000100236be8 UseSystemResourceKeys`S_P_CoreLib_System_Runtime_StackFrameIterator__Next_1 + 40
    frame #11: 0x0000000100233e50 UseSystemResourceKeys`S_P_CoreLib_System_Runtime_EH__DispatchEx + 640
    frame #12: 0x0000000100233b18 UseSystemResourceKeys`RhThrowEx + 152
    frame #13: 0x00000001000b9ad8 UseSystemResourceKeys`NotHijacked at ExceptionHandling.S:333
    frame #14: 0x0000000100252a2c UseSystemResourceKeys`S_P_CoreLib_System_Reflection_Runtime_TypeInfos_RuntimeTypeInfo__GetInterfaceMap + 636
    frame #15: 0x00000001002d77d0 UseSystemResourceKeys`UseSystemResourceKeys_Program___Main__ + 320
    frame #16: 0x00000001003fac34 UseSystemResourceKeys`UseSystemResourceKeys__Module___MainMethodWrapper + 20
    frame #17: 0x00000001003facc0 UseSystemResourceKeys`UseSystemResourceKeys__Module___StartupCodeMain + 112
    frame #18: 0x0000000100009a04 UseSystemResourceKeys`main(argc=1, argv=0x0036cf0400e3c100) at main.cpp:205:18
    frame #19: 0x00000001a0077e38 dyld`start + 2520
(lldb)

filipnavara commented 2 years ago

USE_PORTABLE_HELPERS doesn't really help. There may be something more to fix in the stack iteration but I assume it's more likely related to the compact unwinding than the relative pointers (no info from LSDA is involved in this particular case so I don't expect the broken relocations to have an effect).

filipnavara commented 2 years ago

I pushed a commit to my branch that gets the compact unwinding to correctly decode the frames. So now we're back to the relocations... the LSDA info uses RELPTR32 relocations which don't work yet and so I get garbage in UnixNativeCodeManager::EHEnumInit when trying to get pEHInfo.

filipnavara commented 2 years ago

> Right, it says to subtract an address from where the reloc is pointing to. It should be exactly what we need here.

Makes sense. So we essentially want to emit ARM64_RELOC_UNSIGNED without the IsPCRel and then ARM64_RELOC_SUBTRACTOR bound to the section.getBeginSymbol() and offset to the fixup location.
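The arithmetic behind that pairing can be sketched as follows. This is an editorial illustration under stated assumptions, not the code that was committed to the llvm branch: it assumes the linker computes addend + address(UNSIGNED symbol) - address(SUBTRACTOR symbol), and that the addend stored in the field is the negated offset of the fixup within its section. All addresses and the names `relptr32_field` and `resolve_relptr32` are hypothetical.

```python
def relptr32_field(target_addr, section_start, fixup_offset):
    """Value the linker would store in a 32-bit self-relative field.

    Models an UNSIGNED relocation on the target paired with a SUBTRACTOR
    relocation on the section's begin symbol. The addend (-fixup_offset)
    moves the subtracted base from the section start to the field itself.
    """
    addend = -fixup_offset
    return (target_addr - section_start + addend) & 0xFFFFFFFF


def resolve_relptr32(field_addr, stored):
    """Runtime side: sign-extend the stored delta and add the field address."""
    delta = stored - (1 << 32) if stored & 0x80000000 else stored
    return field_addr + delta


# Hypothetical addresses, chosen only to exercise a backward-pointing delta.
section_start = 0x1005D0000
fixup_offset = 0x40
field_addr = section_start + fixup_offset
target = 0x1002B1860

stored = relptr32_field(target, section_start, fixup_offset)
print(hex(stored), hex(resolve_relptr32(field_addr, stored)))
```

The point of the pairing is that the stored value depends only on the distance between the field and its target, so it survives the image being loaded at a different base address.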

filipnavara commented 2 years ago

I committed code for relative relocations to the llvm branch. That gets me past the first exception handler which seems to be executed correctly:

filipnavara@172-4-1-20 runtime % lldb ./artifacts/tests/coreclr/OSX.arm64.Debug/nativeaot/SmokeTests/FrameworkStrings/UseSystemResourceKeys/native/UseSystemResourceKeys
error: no such file
error: 'setsymbolserver' is not a valid command.
(lldb) target create "./artifacts/tests/coreclr/OSX.arm64.Debug/nativeaot/SmokeTests/FrameworkStrings/UseSystemResourceKeys/native/UseSystemResourceKeys"
Current executable set to '/Users/filipnavara/Projects/runtime/artifacts/tests/coreclr/OSX.arm64.Debug/nativeaot/SmokeTests/FrameworkStrings/UseSystemResourceKeys/native/UseSystemResourceKeys' (arm64).
(lldb) run
Process 61663 launched: '/Users/filipnavara/Projects/runtime/artifacts/tests/coreclr/OSX.arm64.Debug/nativeaot/SmokeTests/FrameworkStrings/UseSystemResourceKeys/native/UseSystemResourceKeys' (arm64)
Arg_NullReferenceException
Arg_NullReferenceException
Argument_ArrayGetInterfaceMap
Argument_ArrayGetInterfaceMap
Resources in CoreLib:
Resources in reflection library:
Process 61663 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
    frame #0: 0x0000000000000000
error: memory read failed for 0x0
Target 0: (UseSystemResourceKeys) stopped.
(lldb)

Not quite sure what is going on now. I am gonna be away for a week so feel free to poke at it.

FWIW reverting back to SupportsRelativePointers = true also seems to work and fail the same way.

filipnavara commented 2 years ago

More detail on the crash above. I stepped through the code all the way up to here:

Process 80634 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = instruction step into
    frame #0: 0x00000001002d7fac UseSystemResourceKeys`UseSystemResourceKeys_Program___Main__ + 1004
UseSystemResourceKeys`UseSystemResourceKeys_Program___Main__:
->  0x1002d7fac <+1004>: ret    
    0x1002d7fb0 <+1008>: stp    x29, x30, [sp, #-0x10]!
    0x1002d7fb4 <+1012>: str    x0, [x29, #0x28]
    0x1002d7fb8 <+1016>: ldr    x0, [x29, #0x28]
Target 0: (UseSystemResourceKeys) stopped.
(lldb) 
Process 80634 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = instruction step into
    frame #0: 0x0000000000000000
error: memory read failed for 0x0

Stack corruption?

MichalStrehovsky commented 2 years ago

Very nice progress!

Yes, this looks like a stack corruption. The probably obvious ways to troubleshoot this would be to check the stack pointer at the moment of entry and before the return (to rule out SP being corrupted). And if the SP is fine, try putting a memory breakpoint on the location where the return address is kept to see who clobbers the stack slot.

We did take an exception in Program.Main so it could be things weren't restored properly and SP is off:

https://github.com/dotnet/runtime/blob/ca82565a60380bf4220255c65e493deb44314346/src/tests/nativeaot/SmokeTests/FrameworkStrings/Program.cs#L30-L38

filipnavara commented 2 years ago

It's very likely that something was not properly restored during the exception handling. It uses the compact unwinding which is a new code path. I need to get more familiar with the ARM64 instruction set again, perhaps X30 register (default ret destination) is not restored correctly... it should be fairly easy to trace with a few breakpoints in the right places.

am11 commented 2 years ago

Running SharedLibrary (with #74989) throws from libunwind in Exception ctor (or anything that uses write barriers).

% ./artifacts/tests/coreclr/OSX.arm64.Release/nativeaot/SmokeTests/SharedLibrary/SharedLibrary/native/SharedLibrary     
libunwind: stepWithCompactEncoding - invalid compact unwind encoding
zsh: abort  

Same with the Debug configuration. lldb shows:

* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=2, address=0x102463e99)
    frame #0: 0x0000000101d7ec80 SharedLibrary.dylib`RhpAssignRefArm64 at WriteBarriers.S:322
   319      ALTERNATE_ENTRY RhpAssignRefX1AVLocation
   320          stlr    x15, [x14]
   321  
-> 322          INSERT_UNCHECKED_WRITE_BARRIER_CORE x14, x15, 12
   323  
   324          ret
   325  LEAF_END RhpAssignRefArm64, _TEXT

(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=2, address=0x102463e99)
  * frame #0: 0x0000000101d7ec80 SharedLibrary.dylib`RhpAssignRefArm64 at WriteBarriers.S:322
    frame #1: 0x0000000101dd3390 SharedLibrary.dylib`S_P_CoreLib_System_Exception___ctor_0 + 48
    frame #2: 0x0000000101e3ab9c SharedLibrary.dylib`S_P_CoreLib_System_SystemException___ctor_0 + 28
    frame #3: 0x0000000101e2997c SharedLibrary.dylib`S_P_CoreLib_System_OutOfMemoryException___ctor_0 + 28
    frame #4: 0x0000000101dd443c SharedLibrary.dylib`S_P_CoreLib_System_PreallocatedOutOfMemoryException__Initialize + 44
    frame #5: 0x0000000101f79800 SharedLibrary.dylib`S_P_CoreLib_Internal_Runtime_CompilerHelpers_LibraryInitializer__InitializeLibrary + 16
    frame #6: 0x00000001020c6c8c SharedLibrary.dylib`__managed__Startup + 28
    frame #7: 0x0000000101cca36c SharedLibrary.dylib`InitializeRuntime() at main.cpp:173:5
    frame #8: 0x0000000101ce5f0c SharedLibrary.dylib`Thread::EnsureRuntimeInitialized(this=0x0000000100404260) at thread.cpp:1207:13
    frame #9: 0x0000000101ce5de4 SharedLibrary.dylib`Thread::ReversePInvokeAttachOrTrapThread(this=0x0000000100404260, pFrame=0x000000016fdff658) at thread.cpp:1169:13
    frame #10: 0x0000000101ce6328 SharedLibrary.dylib`::RhpReversePInvokeAttachOrTrapThread2(pFrame=0x000000016fdff658) at thread.cpp:1349:28
    frame #11: 0x0000000101ce645c SharedLibrary.dylib`::RhpReversePInvoke(pFrame=0x000000016fdff658) at thread.cpp:1363:5
    frame #12: 0x000000010200d64c SharedLibrary.dylib`ReturnsPrimitiveInt + 28
    frame #13: 0x0000000100003e08 SharedLibrary`main(argc=1, argv=0x000000016fdff830) at SharedLibrary.cpp:55:9
    frame #14: 0x0000000100015088 dyld`start + 516

am11 commented 2 years ago

Disabling FEATURE_USE_SOFTWARE_WRITE_WATCH_FOR_GC_HEAP fixed it and a few others.

$ src/tests/run.sh --runnativeaottests Debug
...
Time [secs] | Total | Passed | Failed | Skipped | Assembly Execution Summary
============================================================================
      2.193 |    13 |      4 |      9 |       0 | nativeaot.SmokeTests.XUnitWrapper.dll
----------------------------------------------------------------------------
      2.193 |    13 |      4 |      9 |       0 | (total)

filipnavara commented 2 years ago

> libunwind: stepWithCompactEncoding - invalid compact unwind encoding

I originally forgot to check for the compact->DWARF fallback path. I added it to my branch yesterday but it's untested.

am11 commented 2 years ago

Yes, I had all the patches + rebased on tip of the main branch: https://github.com/dotnet/runtime/compare/main...am11:runtime:m1-filip. You can use #include <external/llvm-libunwind/src/CompactUnwinder.hpp> around line 26 in UnwindHelpers.cpp to avoid copying CompactUnwinder_arm64.

filipnavara commented 2 years ago

> You can use #include <external/llvm-libunwind/src/CompactUnwinder.hpp> around line 26 in UnwindHelpers.cpp to avoid copying CompactUnwinder_arm64.

The reason I copied it is that it operates on a different register structure layout. There may be a way around that but it was easier to just take a copy and adjust it (for testing).

am11 commented 2 years ago

There is a P/Invoke failure running into OOM (though there is plenty of memory on the system):

        > /Users/adeel/projects/runtime/artifacts/tests/coreclr/OSX.arm64.Debug/nativeaot/SmokeTests/PInvoke/PInvoke/PInvoke.sh
        Expected: True
        Actual:   False
        Stack Trace:
          /Users/adeel/projects/runtime/artifacts/tests/coreclr/OSX.arm64.Debug/TestWrappers/nativeaot.SmokeTests/nativeaot.SmokeTests.XUnitWrapper.cs(842,0): at nativeaot_SmokeTests._PInvoke_PInvoke_PInvoke_._PInvoke_PInvoke_PInvoke_sh()
             at System.RuntimeMethodHandle.InvokeMethod(Object target, Void** arguments, Signature sig, Boolean isConstructor)
             at System.Reflection.MethodInvoker.Invoke(Object obj, IntPtr* args, BindingFlags invokeAttr)
        Output:
          Unhandled Exception: System.OutOfMemoryException: Insufficient memory to continue the execution of the program.
             at Internal.Runtime.Augments.RuntimeAugments.CreateThunksHeap(IntPtr) + 0x6c
             at System.Runtime.InteropServices.PInvokeMarshal.AllocateThunk(Delegate) + 0x64
             at PInvoke!<BaseAddress>+0x419cb8
             at System.Runtime.CompilerServices.ConditionalWeakTable`2.GetValueLocked(TKey, ConditionalWeakTable`2.CreateValueCallback) + 0x58
             at System.Runtime.CompilerServices.ConditionalWeakTable`2.GetValue(TKey, ConditionalWeakTable`2.CreateValueCallback) + 0x68
             at System.Runtime.InteropServices.PInvokeMarshal.GetFunctionPointerForDelegate(Delegate) + 0x180
             at System.Runtime.InteropServices.Marshal.GetFunctionPointerForDelegateInternal(Delegate) + 0x18
             at System.Runtime.InteropServices.Marshal.GetFunctionPointerForDelegate(Delegate) + 0x2c
             at System.Runtime.InteropServices.Marshal.GetFunctionPointerForDelegate[TDelegate](TDelegate) + 0x28
             at PInvokeTests.Program.ReversePInvoke_Int(Program.Delegate_Int) + 0x74
             at PInvokeTests.Program.TestDelegate() + 0x70
             at PInvokeTests.Program.Main(String[]) + 0x78
             at PInvoke!<BaseAddress>+0x424644
             at PInvoke!<BaseAddress>+0x4246d0
          /Users/adeel/projects/runtime/src/tests/Common/scripts/nativeaottest.sh: line 14: 60306 Abort trap: 6           $_DebuggerFullPath $1/native/$exename "${@:3}"

There are no debug symbols for ThunksHeap's ctor, but it's failing to get a non-zero _nextAvailableThunkPtr around here: https://github.com/dotnet/runtime/blob/9d6396deb02161f5ee47af72ccac52c2e1bae458/src/coreclr/nativeaot/Runtime.Base/src/System/Runtime/ThunkPool.cs#L98

# no debug symbol in that image by that name
(lldb) image lookup -rn S_P_CoreLib_System_Runtime_ThunksHeap___ctor

# there is a non-debug symbol if we search the image
(lldb) image lookup -rs S_P_CoreLib_System_Runtime_ThunksHeap___ctor
2 symbols match the regular expression 'S_P_CoreLib_System_Runtime_ThunksHeap___ctor' in /Users/adeel/projects/runtime/artifacts/tests/coreclr/OSX.arm64.Debug/nativeaot/SmokeTests/PInvoke/PInvoke/native/PInvoke:
        Address: PInvoke[0x00000001002b1860] (PInvoke.__TEXT.__managedcode + 1931408)
        Summary: PInvoke`S_P_CoreLib_System_Runtime_ThunksHeap___ctor
        Address: PInvoke[0x00000001005d9ec1] (PInvoke.__DATA..dotnet_eh_table + 269945)
        Summary: PInvoke`lsda0_S_P_CoreLib_System_Runtime_ThunksHeap___ctor

jkotas commented 2 years ago

> There is a P/Invoke failure running into OOM

It is likely a failure to allocate executable memory. This code needs to be changed to use Apple Silicon APIs to do that (pthread_jit_write_protect_np and MAP_JIT).

filipnavara commented 2 years ago

> > There is a P/Invoke failure running into OOM
>
> It is likely a failure to allocate executable memory. This code needs to be changed to use Apple Silicon APIs to do that (pthread_jit_write_protect_np and MAP_JIT).

Yep, you are right. I checked the theory and this should be fairly easy to fix.

filipnavara commented 2 years ago

The stack corruption is not actually corruption of the stack content. The problem is that SP points to the wrong location after exception unwinding. The problem actually happens very early on in the unwinding mechanism. When RhpThrowEx is called, the fp (x29) register points to a somewhat unexpected location for the frame-based compact unwinding information. I'll dig deeper into it...

filipnavara commented 2 years ago

So, CLR actually produces call frames that are incompatible with compact unwinding (SP = FP, instead of SP = prevFP - 16). Disabling the CompressARM64CFI code path on my branch reverts to regular DWARF CFI unwinding (with the compact unwinding tables being generated with pointers to DWARF CFI). That fixes the UseSystemResourceKeys crash (and other exception unwinding issues) and allows it to run to completion.

I will need to investigate whether the incorrect unwinding info is the result of CompressARM64CFI or the LLVM code for converting to the compact form. We can either decide to keep the long-form DWARF CFI, or add a three-line fix after the standard compact unwinding to account for managed frames.
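To make the mismatch concrete, here is a small editorial model of one frame-based unwind step. It is a sketch of one possible reading, not the fix from the branch: it assumes the standard step reads the saved FP and return address at FP and FP + 8 and sets the caller's SP to FP + 16, and that a managed frame keeps its FP/LR pair at the bottom of the frame (so SP = FP inside the method). The frame size, the addresses and the function `step_managed_frame` are hypothetical.

```python
FRAME_SIZE = 0x30            # hypothetical managed frame size
caller_sp = 0x16FDFF000      # hypothetical SP in the caller at the call site
caller_fp = 0x16FDFF040      # hypothetical caller frame pointer
return_addr = 0x1002D77D0    # hypothetical return address

# Prolog being modelled: stp x29, x30, [sp, #-FRAME_SIZE]! ; mov x29, sp
fp = caller_sp - FRAME_SIZE
memory = {fp: caller_fp, fp + 8: return_addr}


def step_frame_based(memory, fp):
    """Standard frame-based step: the FP/LR pair is assumed to sit at the
    top of the callee frame, so the caller's SP is taken to be FP + 16."""
    return memory[fp], memory[fp + 8], fp + 16


def step_managed_frame(memory, fp, frame_size):
    """Hypothetical adjustment for a frame whose FP/LR pair sits at the
    bottom of the frame: the caller's SP is FP + frame_size instead."""
    prev_fp, ret, _ = step_frame_based(memory, fp)
    return prev_fp, ret, fp + frame_size


standard = step_frame_based(memory, fp)
adjusted = step_managed_frame(memory, fp, FRAME_SIZE)
print("standard SP error:", hex(caller_sp - standard[2]))
print("adjusted SP error:", hex(caller_sp - adjusted[2]))
```

In this model both variants recover the same FP and return address; only the restored SP differs, which matches the earlier observation that the stack contents were intact while SP pointed to the wrong place.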

jkotas commented 2 years ago

Or fix the JIT to produce frames that are compatible with compact unwinding.

filipnavara commented 2 years ago

> Or fix the JIT to produce frames that are compatible with compact unwinding.

That's also an option. I started reading the design document and there are some limits on what the instruction encoding allows in terms of offsets... Overall I think there are multiple viable ways to fix it but it's relatively easy to get it working in some way now that I know what is happening.

filipnavara commented 2 years ago

> It is likely a failure to allocate executable memory.

Agreed, this looks like the case. I tried borrowing some helpers from coreclr/pal yesterday, but it has not fixed the issue (so far).

You would need to map the memory with the MAP_JIT flag (see MEM_RESERVE_EXECUTABLE in CLR code). Writes to both the data section and the thunk section have to be protected by the pthread_jit_write_protect_np call. Since the data chunk of memory is written from managed code, you would also need to expose a helper that enables/disables write protection to managed code.
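A minimal C sketch of that pattern, not the runtime's actual code: the helper names `alloc_thunk_region` and `write_thunk_region` are hypothetical, and outside Apple Silicon the sketch falls back to plain read/write memory just so it compiles elsewhere.

```c
#define _DEFAULT_SOURCE 1   /* expose MAP_ANON under strict -std= modes on glibc */
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

#if defined(__APPLE__) && defined(__aarch64__)
#include <pthread.h>                 /* pthread_jit_write_protect_np */
#include <libkern/OSCacheControl.h>  /* sys_icache_invalidate */
#define SKETCH_APPLE_ARM64 1
#else
#define SKETCH_APPLE_ARM64 0
#endif

/* Hypothetical helper: reserve a region that will be written and executed. */
static void *alloc_thunk_region(size_t size)
{
    int prot = PROT_READ | PROT_WRITE;
    int flags = MAP_PRIVATE | MAP_ANON;
#if SKETCH_APPLE_ARM64
    /* On Apple Silicon the mapping must be created with MAP_JIT to be
       usable as writable-then-executable memory. */
    prot |= PROT_EXEC;
    flags |= MAP_JIT;
#endif
    /* Elsewhere this sketch maps plain read/write memory; a real port
       would follow that platform's own W^X policy. */
    void *p = mmap(NULL, size, prot, flags, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Hypothetical helper: copy bytes into the region with write protection
   lifted around the write. */
static void write_thunk_region(void *dst, const void *src, size_t len)
{
    assert(dst != NULL && src != NULL);
#if SKETCH_APPLE_ARM64
    pthread_jit_write_protect_np(0);  /* MAP_JIT pages writable for this thread */
#endif
    memcpy(dst, src, len);
#if SKETCH_APPLE_ARM64
    pthread_jit_write_protect_np(1);  /* back to executable, not writable */
    sys_icache_invalidate(dst, len);  /* make the new bytes visible to instruction fetch */
#endif
}
```

The toggle applies to the calling thread only, so every writer, including managed code that fills in the data chunk, has to go through a wrapper like `write_thunk_region`.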

filipnavara commented 2 years ago

Reflection tests are failing because GetUnwindProcInfo doesn't understand compact unwinding yet. I have a local fix but it needs polishing before I push it.

Testing delegate targets are reflectable...
Testing virtual delegate targets are reflectable...
TestContainment
TestInterfaceMethod
TestByRefLikeTypeMethod
TestILScanner
Search current assembly
GetMethod on a non-generic type
Totally unreferenced method on a non-generic type (we should not find it)
GetMethod on a non-generic type for a generic method
Generics
Partial canonical types
Search in system assembly
Search through a forwarder
Search in mscorlib
Enum.GetValues
Enum.GetValuesAsUnderlyingType
Pattern in LINQ expressions
Other pattern in LINQ expressions
TestUnreferencedEnum
TestAttributeInheritance
TestStringConstructor
TestAssemblyAndModuleAttributes
TestAttributeExpressions
TestParameterAttributes
TestPropertyAndEventAttributes
TestNecessaryEETypeReflection
TestCreateDelegate
TestGetUninitializedObject
TestInstanceFields
TestReflectionInvoke
TestInvokeMemberParamsCornerCase
TestDefaultInterfaceInvoke
TestCovariantReturnInvoke
TestThreadStaticFields
TestByRefReturnInvoke
Process 59310 exited with status = 100 (0x00000064)

filipnavara commented 2 years ago

I pushed changes to my branch that get the PInvoke smoke test passing. The MAP_JIT protection was the easy part. Apparently TLS access was trashing some registers in the tls_get_var function, and that prevented the stubs from working correctly.

filipnavara commented 2 years ago

Looks like we don't need to change gcenv.unix.cpp now that OS_PAGE_SIZE is fixed?

Apparently we still have to, I checked. Not sure what's different from regular CoreCLR though.

filipnavara commented 2 years ago

The remaining failures seem to involve memory trashing. There may be something suspicious going on in RhpCheckedLockCmpXchg, since the trashed locations seem to be variables written by Interlocked.CompareExchange. Or possibly the GC stack scanning is missing something and the heap gets compacted with live references incorrectly discarded...

filipnavara commented 2 years ago

aaaaaargh:

.macro PREPARE_EXTERNAL_VAR Name, HelperReg
#if defined(__APPLE__)
        adrp \HelperReg, C_FUNC(\Name)@GOTPAGE
        ldr  \HelperReg, [\HelperReg, C_FUNC(\Name)@GOTPAGEOFF]
#else
        adrp \HelperReg, C_FUNC(\Name)
        add  \HelperReg, \HelperReg, :lo12:C_FUNC(\Name)
#endif
.endm

.macro PREPARE_EXTERNAL_VAR_INDIRECT Name, HelperReg
#if defined(__APPLE__)
        adrp \HelperReg, C_FUNC(\Name)@GOTPAGE
        ldr  \HelperReg, [\HelperReg, C_FUNC(\Name)@GOTPAGEOFF]
#else
        adrp \HelperReg, C_FUNC(\Name)
        ldr  \HelperReg, [\HelperReg, :lo12:C_FUNC(\Name)]
#endif
.endm

Spot the mistake.
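
One reading of the two macros, offered as an interpretation rather than a confirmed diagnosis: in the non-Apple branches, PREPARE_EXTERNAL_VAR produces the symbol's address (adrp + add) while PREPARE_EXTERNAL_VAR_INDIRECT loads the value stored there (adrp + ldr). The `__APPLE__` branches are identical, and each one loads the symbol's address from the GOT, so the indirect variant appears to be missing a final load through that address. A C analogy with hypothetical names:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an external runtime variable. */
static uintptr_t g_example_var = 0x1234;

/* Stand-in for the GOT slot the Apple branch loads from: it holds the
   variable's address, not its value. */
static uintptr_t *const got_slot = &g_example_var;

/* What PREPARE_EXTERNAL_VAR should leave in HelperReg: the address. */
static uintptr_t *external_var(void)
{
    return got_slot;              /* adrp + ldr from the GOT */
}

/* What PREPARE_EXTERNAL_VAR_INDIRECT should leave in HelperReg: the value. */
static uintptr_t external_var_indirect(void)
{
    uintptr_t *addr = got_slot;   /* adrp + ldr from the GOT */
    assert(addr != NULL);
    return *addr;                 /* the extra load through that address */
}
```

With identical Apple branches, a caller of the indirect macro would get what `external_var` returns here, treating a pointer as if it were the variable's contents.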

filipnavara commented 2 years ago

Somewhere along the way I broke the Reflection test... the weird part is that it's because of a corrupted pointer in the data section and it's already corrupted at the process start before any code is run:

Process 25618 launched: '/Users/filipnavara/Projects/runtime/artifacts/tests/coreclr/OSX.arm64.Debug/nativeaot/SmokeTests/Reflection/Reflection/native/Reflection' (arm64)
(lldb) p *(void **)0x00000001008e7ffc
(void *) $1 = 0x00000000004c3140

...and the section has no relocations (screenshot not preserved).

I am running out of ideas on what could have possibly caused that.

filipnavara commented 2 years ago

Eh, I traced back the Relocation failure to https://github.com/dotnet/llvm-project/blob/cb1c615abd1a871bd2d2a105325aaa84ee5913b5/llvm/tools/objwriter/objwriter.cpp#L257-L262 ... will try to fix it, or submit a revert of the code block to llvm-project for objwriter.

I did a rebuild and I could not reproduce it anymore...

filipnavara commented 2 years ago

With the current state of things the smoke tests sometimes pass on my machine. Other times they fail here:

* thread #172, stop reason = signal SIGABRT
  * frame #0: 0x00000001a0366224 libsystem_kernel.dylib`__pthread_kill + 8
    frame #1: 0x00000001a039ccec libsystem_pthread.dylib`pthread_kill + 288
    frame #2: 0x00000001a02d62d8 libsystem_c.dylib`abort + 180
    frame #3: 0x00000001000a6980 UnitTests`::PalHijack(hThread=0x00006000002040e0, pThreadToHijack=0x0000000123004740) at PalRedhawkUnix.cpp:1041:9
    frame #4: 0x000000010002bad0 UnitTests`Thread::Hijack(this=0x0000000123004740) at thread.cpp:616:5
    frame #5: 0x000000010002db84 UnitTests`ThreadStore::SuspendAllThreads(this=0x0000600000202d80, waitForGCEvent=true) at threadstore.cpp:278:36
    frame #6: 0x000000010001cb64 UnitTests`GCToEEInterface::SuspendEE(reason=SUSPEND_FOR_GC) at gcrhenv.cpp:659:23
    frame #7: 0x000000010004ccc4 UnitTests`WKS::GCHeap::GarbageCollectGeneration(this=0x0000600000004030, gen=0, reason=reason_alloc_soh) at gc.cpp:46900:9
    frame #8: 0x000000010004ec6c UnitTests`WKS::gc_heap::trigger_gc_for_alloc(gen_number=0, gr=reason_alloc_soh, msl=0x0000000100a2c1b8, loh_p=false, take_state=mt_try_budget) at gc.cpp:17691:14
    frame #9: 0x00000001000502e8 UnitTests`WKS::gc_heap::try_allocate_more_space(acontext=0x0000000126004dd0, size=32, flags=0, gen_number=0) at gc.cpp:17841:21
    frame #10: 0x000000010005048c UnitTests`WKS::gc_heap::allocate_more_space(acontext=0x0000000126004dd0, size=32, flags=0, alloc_generation_number=0) at gc.cpp:18320:18
    frame #11: 0x0000000100092394 UnitTests`WKS::GCHeap::Alloc(gc_alloc_context*, unsigned long, unsigned int) at gc.cpp:18351:19
    frame #12: 0x000000010009225c UnitTests`WKS::GCHeap::Alloc(this=0x0000600000004030, context=0x0000000126004dd0, size=32, flags=0) at gc.cpp:45894:34
    frame #13: 0x000000010001bd7c UnitTests`GcAllocInternal(pEEType=0x0000000100a12340, uFlags=0, numElements=0, pThread=0x0000000126004dd0) at gcrhenv.cpp:267:54
    frame #14: 0x000000010001c064 UnitTests`::RhpGcAlloc(pEEType=0x0000000100a12340, uFlags=0, numElements=0, pTransitionFrame=0x00000001705eecd0) at gcrhenv.cpp:303:12
    frame #15: 0x00000001000c3f5c UnitTests`RhpNewObject at AllocFast.S:88
    frame #16: 0x0000000100377d78 UnitTests`RhNewObject + 264
    frame #17: 0x0000000100373b64 UnitTests`S_P_CoreLib_System_Runtime_RuntimeImports__RhNewObject_0 + 36
    frame #18: 0x00000001003dff08 UnitTests`S_P_CoreLib_Internal_Runtime_ThreadStatics__AllocateThreadStaticStorageForType + 312
    frame #19: 0x00000001003dfc20 UnitTests`S_P_CoreLib_Internal_Runtime_ThreadStatics__GetThreadStaticBaseForTypeSlow + 112
    frame #20: 0x00000001003dfb84 UnitTests`S_P_CoreLib_Internal_Runtime_ThreadStatics__GetThreadStaticBaseForType + 228
    frame #21: 0x00000001002d2564 UnitTests`S_P_CoreLib_System_Threading_Thread__StartThread + 116
    frame #22: 0x00000001002d2e50 UnitTests`S_P_CoreLib_System_Threading_Thread__ThreadEntryPoint + 32
    frame #23: 0x00000001a039d06c libsystem_pthread.dylib`_pthread_start + 148

UPD: Updated PalHijack, now I get Assertion failed: (dont_restart_ee_p), function background_mark_phase, file gc.cpp, line 34647.

filipnavara commented 2 years ago

I'll probably try to clean up my branch and submit a PR soon.

filipnavara commented 2 years ago

That's GC interruption. You need to disable it in lldb with proc han -s false SIGUSR1. I do get intermittent GC-related asserts in that test, but sometimes it passes.

am11 commented 2 years ago

This assertion is failing:

* thread #87, stop reason = hit program assert
    frame #4: 0x00000001000818bc UnitTests`WKS::gc_heap::background_promote_callback(ppObject=0x000000017018eb90, sc=0x0000000170fc6a00, flags=1) at gc.cpp:35589:5
   35586        UNREFERENCED_PARAMETER(sc);
   35587        //in order to save space on the array, mark the object,
   35588        //knowing that it will be visited later
-> 35589        assert (settings.concurrent);
   35590    
   35591        THREAD_NUMBER_FROM_CONTEXT;
   35592    #ifndef MULTIPLE_HEAPS
Target 0: (UnitTests) stopped.

filipnavara commented 2 years ago

Yep, that matches what I get. Not every time though.

filipnavara commented 2 years ago

We should probably fix GC_PAGE_SIZE definition (https://github.com/dotnet/runtime/blob/31bdc77701b2f4b2f3391e990938ed8e17eb410f/src/coreclr/gc/gcpriv.h#L6226). Unfortunately I have a bit of trouble capturing the GC failures under lldb.

UPD: Updating GC_PAGE_SIZE blows up really quickly. Nobody expects the Spanish Inquisition, nor the 16 KB page size.
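
For context, macOS on Apple Silicon uses 16 KB pages, so code that assumes 4 KB rounds to the wrong boundary. A minimal sketch, not the GC's code, of asking the OS for the page size at run time and aligning against it; `os_page_size` and `align_up_to_page` are hypothetical helper names:

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: query the page size instead of assuming 4 KB. */
static size_t os_page_size(void)
{
    long sz = sysconf(_SC_PAGESIZE);
    assert(sz > 0);
    return (size_t)sz;
}

/* Round n up to a multiple of page; page must be a power of two. */
static size_t align_up_to_page(size_t n, size_t page)
{
    assert(page != 0 && (page & (page - 1)) == 0);
    return (n + page - 1) & ~(page - 1);
}
```

A 4097-byte request rounds to 8192 with 4 KB pages but to 16384 with 16 KB pages, which is the kind of mismatch a hard-coded constant hides.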

filipnavara commented 2 years ago

I finally managed to get the assert under lldb. The interesting thing is that two threads try to do GC at the same time:

  thread #73
    frame #0: 0x00000001000b9cbc UnitTests`BitStreamReader::DecodeVarLengthUnsigned(this=0x0000000170935bb0, base=8) at gcinfodecoder.h:382:13
    frame #1: 0x00000001000b967c UnitTests`GcInfoDecoder::GcInfoDecoder(this=0x0000000170935bb0, gcInfoToken=(Info = 0x0000000100878a21, Version = 2), flags=DECODE_SECURITY_OBJECT | DECODE_VARARG | DECODE_GC_LIFETIMES, breakOffset=667) at gcinfodecoder.cpp:150:29
    frame #2: 0x00000001000ba7e8 UnitTests`GcInfoDecoder::GcInfoDecoder(this=0x0000000170935bb0, gcInfoToken=(Info = 0x0000000100878a21, Version = 2), flags=DECODE_SECURITY_OBJECT | DECODE_VARARG | DECODE_GC_LIFETIMES, breakOffset=667) at gcinfodecoder.cpp:100:1
    frame #3: 0x00000001000c0fd8 UnitTests`UnixNativeCodeManager::EnumGcRefs(this=0x0000600000c04030, pMethodInfo=0x0000000170935f98, safePointAddress=0x00000001002adaec, pRegisterSet=0x0000000170935e40, hCallback=0x0000000170935ca0, isActiveStackFrame=false) at UnixNativeCodeManager.cpp:190:19
    frame #4: 0x000000010001ba44 UnitTests`RedhawkGCInterface::EnumGcRefs(pCodeManager=0x0000600000c04030, pMethodInfo=0x0000000170935f98, safePointAddress=0x00000001002adaec, pRegisterSet=0x0000000170935e40, pfnEnumCallback=0x000000010006f5b8, pvCallbackData=0x0000000170936258, isActiveStackFrame=false) at gcrhenv.cpp:377:19
    frame #5: 0x000000010002ac80 UnitTests`Thread::GcScanRootsWorker(this=0x0000000123604080, pfnEnumCallback=0x000000010006f5b8, pvCallbackData=0x0000000170936258, frameIterator=0x0000000170935e20) at thread.cpp:514:17
    frame #6: 0x000000010002a918 UnitTests`Thread::GcScanRoots(this=0x0000000123604080, pfnEnumCallback=0x000000010006f5b8, pvCallbackData=0x0000000170936258) at thread.cpp:404:5
    frame #7: 0x000000010001d7c4 UnitTests`GCToEEInterface::GcScanRoots(fn=(UnitTests`WKS::GCHeap::Promote(Object**, ScanContext*, unsigned int) at gc.cpp:45418), condemned=0, max_gen=2, sc=0x0000000170936258)(Object**, ScanContext*, unsigned int), int, int, ScanContext*) at gcrhscan.cpp:62:22
    frame #8: 0x0000000100099770 UnitTests`GCScan::GcScanRoots(fn=(UnitTests`WKS::GCHeap::Promote(Object**, ScanContext*, unsigned int) at gc.cpp:45418), condemned=0, max_gen=2, sc=0x0000000170936258)(Object**, ScanContext*, unsigned int), int, int, ScanContext*) at gcscan.cpp:152:5
    frame #9: 0x000000010005b584 UnitTests`WKS::gc_heap::mark_phase(condemned_gen_number=0, mark_only_p=NO) at gc.cpp:26389:9
    frame #10: 0x000000010005751c UnitTests`WKS::gc_heap::gc1() at gc.cpp:20977:13
    frame #11: 0x000000010006604c UnitTests`WKS::gc_heap::garbage_collect(n=0) at gc.cpp:22716:17
    frame #12: 0x000000010004c9a0 UnitTests`WKS::GCHeap::GarbageCollectGeneration(this=0x0000600000008010, gen=0, reason=reason_alloc_soh) at gc.cpp:46935:9
    frame #13: 0x000000010004e8e4 UnitTests`WKS::gc_heap::trigger_gc_for_alloc(gen_number=0, gr=reason_alloc_soh, msl=0x0000000100a2c3d0, loh_p=false, take_state=mt_try_budget) at gc.cpp:17692:14
    frame #14: 0x000000010004ff60 UnitTests`WKS::gc_heap::try_allocate_more_space(acontext=0x00000001027041a0, size=32, flags=0, gen_number=0) at gc.cpp:17842:21
    frame #15: 0x0000000100050104 UnitTests`WKS::gc_heap::allocate_more_space(acontext=0x00000001027041a0, size=32, flags=0, alloc_generation_number=0) at gc.cpp:18321:18
    frame #16: 0x0000000100092018 UnitTests`WKS::GCHeap::Alloc(gc_alloc_context*, unsigned long, unsigned int) at gc.cpp:18352:19
    frame #17: 0x0000000100091ee0 UnitTests`WKS::GCHeap::Alloc(this=0x0000600000008010, context=0x00000001027041a0, size=32, flags=0) at gc.cpp:45893:34
    frame #18: 0x000000010001b554 UnitTests`GcAllocInternal(pEEType=0x0000000100a12528, uFlags=0, numElements=0, pThread=0x00000001027041a0) at gcrhenv.cpp:267:54
    frame #19: 0x000000010001b83c UnitTests`::RhpGcAlloc(pEEType=0x0000000100a12528, uFlags=0, numElements=0, pTransitionFrame=0x0000000170936cd0) at gcrhenv.cpp:303:12
    frame #20: 0x00000001000c3c30 UnitTests`RhpNewObject at AllocFast.S:88
    frame #21: 0x00000001003493d8 UnitTests`_S_P_CoreLib_System_Runtime_RuntimeExports__RhNewObject(pEEType=0x0000000100a12528) at RuntimeExports.cs:52
    frame #22: 0x00000001003451c4 UnitTests`_S_P_CoreLib_System_Runtime_RuntimeImports__RhNewObject_0(pEEType=S_P_CoreLib_System_EETypePtr @ 0x0000000170936dd8) at RuntimeImports.cs:355
    frame #23: 0x00000001003b1568 UnitTests`_S_P_CoreLib_Internal_Runtime_ThreadStatics__AllocateThreadStaticStorageForType(typeManager=S_P_CoreLib_Internal_Runtime_TypeManagerHandle @ 0x0000000170936e68, typeTlsIndex=9) at ThreadStatics.cs:110
    frame #24: 0x00000001003b1280 UnitTests`_S_P_CoreLib_Internal_Runtime_ThreadStatics__GetThreadStaticBaseForTypeSlow(pModuleData=0x000000010093b7f0, typeTlsIndex=9) at ThreadStatics.cs:50
    frame #25: 0x00000001003b11e4 UnitTests`_S_P_CoreLib_Internal_Runtime_ThreadStatics__GetThreadStaticBaseForType(pModuleData=0x000000010093b7f0, typeTlsIndex=9) at ThreadStatics.cs:35
    frame #26: 0x00000001002a3ab4 UnitTests`_S_P_CoreLib_System_Threading_Thread__StartThread(parameter=4330574968) at Thread.NativeAot.cs:411
    frame #27: 0x00000001002a43b0 UnitTests`_S_P_CoreLib_System_Threading_Thread__ThreadEntryPoint(parameter=4330574968) at Thread.NativeAot.Unix.cs:113
    frame #28: 0x00000001a039d06c libsystem_pthread.dylib`_pthread_start + 148
  thread #74
    frame #0: 0x00000001a03615e4 libsystem_kernel.dylib`__psynch_cvwait + 8
    frame #1: 0x00000001a039d638 libsystem_pthread.dylib`_pthread_cond_wait + 1232
    frame #2: 0x00000001000a985c UnitTests`GCEvent::Impl::Wait(this=0x0000600002c04300, milliseconds=4294967295, alertable=false) at events.cpp:153:22
    frame #3: 0x00000001000a97b0 UnitTests`GCEvent::Wait(this=0x0000600000008020, timeout=4294967295, alertable=false) at events.cpp:262:20
    frame #4: 0x0000000100031df8 UnitTests`WKS::GCHeap::WaitUntilGCComplete(this=0x0000600000008010, bConsiderGCStart=false) at gcee.cpp:285:40
    frame #5: 0x000000010001b9b0 UnitTests`RedhawkGCInterface::WaitForGCCompletion() at gcrhenv.cpp:327:35
    frame #6: 0x000000010002ccc0 UnitTests`ThreadStore::AttachCurrentThread(fAcquireThreadStoreLock=true) at threadstore.cpp:131:9
    frame #7: 0x000000010002cda8 UnitTests`ThreadStore::AttachCurrentThread() at threadstore.cpp:148:5
    frame #8: 0x000000010002c0cc UnitTests`Thread::ReversePInvokeAttachOrTrapThread(this=0x00000001235041a0, pFrame=0x0000000170bf2fa8) at thread.cpp:1172:9
    frame #9: 0x000000010002c608 UnitTests`::RhpReversePInvokeAttachOrTrapThread2(pFrame=0x0000000170bf2fa8) at thread.cpp:1349:28
    frame #10: 0x000000010002c73c UnitTests`::RhpReversePInvoke(pFrame=0x0000000170bf2fa8) at thread.cpp:1363:5
    frame #11: 0x00000001002a43a4 UnitTests`_S_P_CoreLib_System_Threading_Thread__ThreadEntryPoint(parameter=4330572752) at Thread.NativeAot.Unix.cs:112
    frame #12: 0x00000001a039d06c libsystem_pthread.dylib`_pthread_start + 148
* thread #75, stop reason = hit program assert
    frame #0: 0x00000001a0366224 libsystem_kernel.dylib`__pthread_kill + 8
    frame #1: 0x00000001a039ccec libsystem_pthread.dylib`pthread_kill + 288
    frame #2: 0x00000001a02d62d8 libsystem_c.dylib`abort + 180
    frame #3: 0x00000001a02d5630 libsystem_c.dylib`__assert_rtn + 272
  * frame #4: 0x0000000100081cb4 UnitTests`WKS::gc_heap::background_promote_callback(ppObject=0x0000000170076f30, sc=0x0000000170d96a00, flags=0) at gc.cpp:35589:5
    frame #5: 0x000000010001da9c UnitTests`GcEnumObject(ppObj=0x0000000170076f30, flags=0, fnGcEnumRef=(UnitTests`WKS::gc_heap::background_promote_callback(Object**, ScanContext*, unsigned int) at gc.cpp:35585), pSc=0x0000000170d96a00)(Object**, ScanContext*, unsigned int), ScanContext*) at gcrhscan.cpp:119:9
    frame #6: 0x000000010001ba8c UnitTests`EnumGcRefsCallback(hCallback=0x0000000170d96460, pObject=0x0000000170076f30, flags=0) at gcrhenv.cpp:359:5
    frame #7: 0x00000001000c08ac UnitTests`GcInfoDecoder::ReportStackSlotToGC(this=0x0000000170d96370, spOffset=32, spBase=GC_FRAMEREG_REL, gcFlags=0, pRD=0x0000000170d96600, flags=0, pCallBack=(UnitTests`EnumGcRefsCallback(void*, void**, unsigned int) at gcrhenv.cpp:356), hCallBack=0x0000000170d96460)(void*, void**, unsigned int), void*) at gcinfodecoder.cpp:2018:5
    frame #8: 0x00000001000bf4ec UnitTests`GcInfoDecoder::ReportSlotToGC(this=0x0000000170d96370, slotDecoder=0x0000000170d95fa0, slotIndex=6, pRD=0x0000000170d96600, reportScratchSlots=true, inputFlags=0, pCallBack=(UnitTests`EnumGcRefsCallback(void*, void**, unsigned int) at gcrhenv.cpp:356), hCallBack=0x0000000170d96460)(void*, void**, unsigned int), void*) at gcinfodecoder.h:698:17
    frame #9: 0x00000001000bf5d4 UnitTests`GcInfoDecoder::ReportUntrackedSlots(this=0x0000000170d96370, slotDecoder=0x0000000170d95fa0, pRD=0x0000000170d96600, inputFlags=0, pCallBack=(UnitTests`EnumGcRefsCallback(void*, void**, unsigned int) at gcrhenv.cpp:356), hCallBack=0x0000000170d96460)(void*, void**, unsigned int), void*) at gcinfodecoder.cpp:1032:9
    frame #10: 0x00000001000bce08 UnitTests`GcInfoDecoder::EnumerateLiveSlots(this=0x0000000170d96370, pRD=0x0000000170d96600, reportScratchSlots=false, inputFlags=0, pCallBack=(UnitTests`EnumGcRefsCallback(void*, void**, unsigned int) at gcrhenv.cpp:356), hCallBack=0x0000000170d96460)(void*, void**, unsigned int), void*) at gcinfodecoder.cpp:981:9
    frame #11: 0x00000001000c1068 UnitTests`UnixNativeCodeManager::EnumGcRefs(this=0x0000600000c04030, pMethodInfo=0x0000000170d96758, safePointAddress=0x00000001002a3b64, pRegisterSet=0x0000000170d96600, hCallback=0x0000000170d96460, isActiveStackFrame=false) at UnixNativeCodeManager.cpp:206:18
    frame #12: 0x000000010001ba44 UnitTests`RedhawkGCInterface::EnumGcRefs(pCodeManager=0x0000600000c04030, pMethodInfo=0x0000000170d96758, safePointAddress=0x00000001002a3b64, pRegisterSet=0x0000000170d96600, pfnEnumCallback=0x0000000100081c5c, pvCallbackData=0x0000000170d96a00, isActiveStackFrame=false) at gcrhenv.cpp:377:19
    frame #13: 0x000000010002ac80 UnitTests`Thread::GcScanRootsWorker(this=0x0000000102404d90, pfnEnumCallback=0x0000000100081c5c, pvCallbackData=0x0000000170d96a00, frameIterator=0x0000000170d965e0) at thread.cpp:514:17
    frame #14: 0x000000010002a918 UnitTests`Thread::GcScanRoots(this=0x0000000102404d90, pfnEnumCallback=0x0000000100081c5c, pvCallbackData=0x0000000170d96a00) at thread.cpp:404:5
    frame #15: 0x000000010001d7c4 UnitTests`GCToEEInterface::GcScanRoots(fn=(UnitTests`WKS::gc_heap::background_promote_callback(Object**, ScanContext*, unsigned int) at gc.cpp:35585), condemned=2, max_gen=2, sc=0x0000000170d96a00)(Object**, ScanContext*, unsigned int), int, int, ScanContext*) at gcrhscan.cpp:62:22
    frame #16: 0x0000000100099770 UnitTests`GCScan::GcScanRoots(fn=(UnitTests`WKS::gc_heap::background_promote_callback(Object**, ScanContext*, unsigned int) at gc.cpp:35585), condemned=2, max_gen=2, sc=0x0000000170d96a00)(Object**, ScanContext*, unsigned int), int, int, ScanContext*) at gcscan.cpp:152:5
    frame #17: 0x0000000100058bc8 UnitTests`WKS::gc_heap::background_mark_phase() at gc.cpp:34578:5
    frame #18: 0x00000001000574e0 UnitTests`WKS::gc_heap::gc1() at gc.cpp:20968:13
    frame #19: 0x0000000100081048 UnitTests`WKS::gc_heap::bgc_thread_function() at gc.cpp:35926:9
    frame #20: 0x0000000100080ec0 UnitTests`WKS::gc_heap::bgc_thread_stub(arg=0x0000000000000000) at gc.cpp:33887:5
    frame #21: 0x000000010001d63c UnitTests`GCToEEInterface::CreateThread(this=0x0000000170936680, argument=0x0000000170936680)(void*), void*, bool, char const*)::$_0::operator()(void*) const at gcrhenv.cpp:1234:9
    frame #22: 0x000000010001d540 UnitTests`GCToEEInterface::CreateThread(argument=0x0000000170936680)(void*), void*, bool, char const*)::$_0::__invoke(void*) at gcrhenv.cpp:1211:23
    frame #23: 0x00000001a039d06c libsystem_pthread.dylib`_pthread_start + 148

jkotas commented 2 years ago

You may be seeing #75298

VSadov commented 2 years ago

assert (settings.concurrent); is indeed https://github.com/dotnet/runtime/pull/75298

filipnavara commented 2 years ago

Confirmed, with https://github.com/dotnet/runtime/pull/75298 I no longer see the crash.

am11 commented 2 years ago

The .NET 8 installer for SDK 8.0.100-alpha.1.22464.43 is available, and console, classlib and mvc (C# and F#) apps seem to be working fine on M1 when published with:

dotnet8 publish -c release --use-current-runtime -p:'PublishAot=true;StripSymbols=true'