dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
14.6k stars 4.56k forks source link

[NativeAOT] Intermittent crash accessing thread statics on Windows Arm64 #104500

Open jkotas opened 1 week ago

jkotas commented 1 week ago

Hit in https://dev.azure.com/dnceng-public/public/_build/results?buildId=731138&view=ms.vss-test-web.build-test-results-tab&runId=18392824&resultId=183933&paneView=dotnet-dnceng.dnceng-build-release-tasks.helix-test-information-tab

[SKIP] System.Net.Sockets.Tests.CreateSocket.CtorAndAccept_SocketNotKeptAliveViaInheritance
Process terminated. Access Violation: Attempted to read or write protected memory. This is often an indication that other memory is corrupt. The application will be terminated since this platform does not support throwing an AccessViolationException.
   at System.RuntimeExceptionHelpers.FailFast(String, Exception, String, RhFailFastReason, IntPtr, IntPtr) + 0x1e0
   at System.RuntimeExceptionHelpers.GetRuntimeException(ExceptionIDs) + 0x158
   at System.Runtime.EH.GetClasslibException(ExceptionIDs, IntPtr) + 0x34
   at System.Runtime.EH.RhThrowHwEx(UInt32, EH.ExInfo&) + 0xb8
   at System.Net.Sockets.Socket.DoBind(EndPoint endPointSnapshot, SocketAddress socketAddress) + 0x11c
   at System.Net.Sockets.Socket.Bind(EndPoint localEP) + 0xc8
   at System.Net.Sockets.Tests.DualModeBind.<>c.<Socket_BindV4IPEndPoint_Throws>b__0_0() + 0x78

We crash in the inlined read of PInvokeMarshal.t_lastError:

00007ff7`03d7209c f9402e41 ldr         x1,[xpr,#0x58]
00007ff7`03d720a0 d000d0c0 adrp        x0,_tls_index
00007ff7`03d720a4 9135e000 add         x0,x0,#0xD78
00007ff7`03d720a8 b9400000 ldr         w0,[x0]
00007ff7`03d720ac f8605820 ldr         x0,[x1,w0 uxtw #3]
00007ff7`03d720b0 91400021 add         x1,x1,#0,lsl #0xC
00007ff7`03d720b4 91001021 add         x1,x1,#4
00007ff7`03d720b8 8b010000 add         x0,x0,x1
00007ff7`03d720bc f9400016 ldr         x22,[x0]<------ AccessViolationException here
jkotas commented 1 week ago

Regression likely introduced by #104282

The most likely explanation of this crash is that the GC info is not right for the TLS access. @kunalspathak Could you please take a look?

cc @VSadov

VSadov commented 1 week ago

GC info is likely ok, the change is not introducing much in that sense, compared to other platforms. @kunalspathak was looking at some possible issues with relocations though. Those are a bit different pattern from what we have seen so far.

VSadov commented 1 week ago

May be related to. https://github.com/dotnet/runtime/issues/104441

kunalspathak commented 1 week ago

May be related to. #104441

104441 is related to superpmi tool itself and not the crash. It might be that I am missing recording gcinfo for the unique pattern I generated. Will take a look.

kunalspathak commented 6 days ago

While investigating this, I realized that the relocation type that was getting said has a typo. I was using the one meant for "ldr" of adrp/ldr pair instead of the one meant for "add" for adrp/add pair. That prompted me to be related to this problem, because chances are the relocations wouldn't have happened correctly and hence the native memory address getting loaded here was not getting calculated correctly. However, with that fixed too, it didn't address this problem. Currently, I have a very limited windows/arm64 machine on which I can repro (by downloading the artifacts), but not incremental builds, etc. to test the changes. I have TTD traces for the failure, but need to look closely at why the native memory that is getting populated is wrong.

Edit: So I don't think the gcinfo is wrong as @VSadov mentioned, but something to do the offset after relocation are not getting calculated properly.