Closed: tactical-drone closed this issue 2 years ago
I can reproduce this. It looks like we get an AV and then hit a stack overflow in EH while trying to handle that AV. The top frames of the call stack, showing the repeating segment, are:
0:034> kn30
# Child-SP RetAddr Call Site
00 000000da`233c3e80 00007ff9`c5911dc8 coreclr!FixupDispatcherContext+0x3c [C:\dev\dotnet\runtime\src\coreclr\vm\exceptionhandling.cpp @ 5514]
01 (Inline Function) --------`-------- coreclr!FixupDispatcherContext+0x10 [C:\dev\dotnet\runtime\src\coreclr\vm\exceptionhandling.cpp @ 5642]
02 000000da`233c43c0 00007ffa`155b209f coreclr!FixContextHandler+0x44 [C:\dev\dotnet\runtime\src\coreclr\vm\exceptionhandling.cpp @ 5768]
03 000000da`233c4400 00007ffa`15561454 ntdll!RtlpExecuteHandlerForException+0xf
04 000000da`233c4430 00007ffa`155b0bce ntdll!RtlDispatchException+0x244
05 000000da`233c4b40 00007ffa`132c4fd9 ntdll!KiUserExceptionDispatch+0x2e
06 000000da`233c5250 00007ff9`c577d799 KERNELBASE!RaiseException+0x69
07 000000da`233c5330 00007ff9`c577d7cb coreclr!NakedThrowHelper2+0x9
08 000000da`233c5360 00007ff9`c577d7d5 coreclr!NakedThrowHelper_RspAligned+0x1e
09 000000da`233c5888 00007ff9`66927030 coreclr!NakedThrowHelper_FixRsp+0x5
0a 000000da`233c5890 00007ff9`b8f9f5dc 0x00007ff9`66927030
0b 000000da`233c58c0 00007ff9`c577c803 System_Private_CoreLib+0x1ef5dc
0c 000000da`233c5930 00007ff9`c56e11cb coreclr!CallDescrWorkerInternal+0x83
0d (Inline Function) --------`-------- coreclr!CallDescrWorkerWithHandler+0x56 [C:\dev\dotnet\runtime\src\coreclr\vm\callhelpers.cpp @ 67]
0e 000000da`233c5970 00007ff9`c5818c3b coreclr!MethodDescCallSite::CallTargetWorker+0x247 [C:\dev\dotnet\runtime\src\coreclr\vm\callhelpers.cpp @ 570]
0f (Inline Function) --------`-------- coreclr!MethodDescCallSite::Call_RetSTRINGREF+0x17 [C:\dev\dotnet\runtime\src\coreclr\vm\callhelpers.h @ 462]
10 000000da`233c5aa0 00007ff9`c5818c7a coreclr!GetExceptionMessage+0x9b [C:\dev\dotnet\runtime\src\coreclr\vm\excep.cpp @ 277]
11 000000da`233c5bb0 00007ff9`c57b8bff coreclr!GetExceptionMessage+0xe [C:\dev\dotnet\runtime\src\coreclr\vm\excep.cpp @ 248]
12 000000da`233c5be0 00007ffa`155b209f coreclr!ProcessCLRException+0x172cff [C:\dev\dotnet\runtime\src\coreclr\vm\exceptionhandling.cpp @ 1042]
13 000000da`233c5e00 00007ffa`15561454 ntdll!RtlpExecuteHandlerForException+0xf
14 000000da`233c5e30 00007ffa`155b0bce ntdll!RtlDispatchException+0x244
15 000000da`233c6540 00007ffa`132c4fd9 ntdll!KiUserExceptionDispatch+0x2e
16 000000da`233c6c50 00007ff9`c577d799 KERNELBASE!RaiseException+0x69
17 000000da`233c6d30 00007ff9`c577d7cb coreclr!NakedThrowHelper2+0x9
18 000000da`233c6d60 00007ff9`c577d7d5 coreclr!NakedThrowHelper_RspAligned+0x1e
19 000000da`233c7288 00007ff9`66927030 coreclr!NakedThrowHelper_FixRsp+0x5
1a 000000da`233c7290 00007ff9`b8f9f5dc 0x00007ff9`66927030
1b 000000da`233c72c0 00007ff9`c577c803 System_Private_CoreLib+0x1ef5dc
1c 000000da`233c7330 00007ff9`c56e11cb coreclr!CallDescrWorkerInternal+0x83
1d (Inline Function) --------`-------- coreclr!CallDescrWorkerWithHandler+0x56 [C:\dev\dotnet\runtime\src\coreclr\vm\callhelpers.cpp @ 67]
1e 000000da`233c7370 00007ff9`c5818c3b coreclr!MethodDescCallSite::CallTargetWorker+0x247 [C:\dev\dotnet\runtime\src\coreclr\vm\callhelpers.cpp @ 570]
1f (Inline Function) --------`-------- coreclr!MethodDescCallSite::Call_RetSTRINGREF+0x17 [C:\dev\dotnet\runtime\src\coreclr\vm\callhelpers.h @ 462]
20 000000da`233c74a0 00007ff9`c5818c7a coreclr!GetExceptionMessage+0x9b [C:\dev\dotnet\runtime\src\coreclr\vm\excep.cpp @ 277]
21 000000da`233c75b0 00007ff9`c57b8bff coreclr!GetExceptionMessage+0xe [C:\dev\dotnet\runtime\src\coreclr\vm\excep.cpp @ 248]
22 000000da`233c75e0 00007ffa`155b209f coreclr!ProcessCLRException+0x172cff [C:\dev\dotnet\runtime\src\coreclr\vm\exceptionhandling.cpp @ 1042]
23 000000da`233c7800 00007ffa`15561454 ntdll!RtlpExecuteHandlerForException+0xf
24 000000da`233c7830 00007ffa`155b0bce ntdll!RtlDispatchException+0x244
25 000000da`233c7f40 00007ffa`132c4fd9 ntdll!KiUserExceptionDispatch+0x2e
26 000000da`233c8650 00007ff9`c577d799 KERNELBASE!RaiseException+0x69
27 000000da`233c8730 00007ff9`c577d7cb coreclr!NakedThrowHelper2+0x9
28 000000da`233c8760 00007ff9`c577d7d5 coreclr!NakedThrowHelper_RspAligned+0x1e
29 000000da`233c8c88 00007ff9`66927030 coreclr!NakedThrowHelper_FixRsp+0x5
2a 000000da`233c8c90 00007ff9`b8f9f5dc 0x00007ff9`66927030
2b 000000da`233c8cc0 00007ff9`c577c803 System_Private_CoreLib+0x1ef5dc
2c 000000da`233c8d30 00007ff9`c56e11cb coreclr!CallDescrWorkerInternal+0x83
2d (Inline Function) --------`-------- coreclr!CallDescrWorkerWithHandler+0x56 [C:\dev\dotnet\runtime\src\coreclr\vm\callhelpers.cpp @ 67]
2e 000000da`233c8d70 00007ff9`c5818c3b coreclr!MethodDescCallSite::CallTargetWorker+0x247 [C:\dev\dotnet\runtime\src\coreclr\vm\callhelpers.cpp @ 570]
2f (Inline Function) --------`-------- coreclr!MethodDescCallSite::Call_RetSTRINGREF+0x17 [C:\dev\dotnet\runtime\src\coreclr\vm\callhelpers.h @ 462]
The actual crash itself happens inside RuntimeType.get_Cache, in the inlined call to GCHandle.InternalGet.
https://github.com/dotnet/runtime/blob/621cd59436cb29cab4b1162409ae0947c4bd780d/src/coreclr/System.Private.CoreLib/src/System/RuntimeType.CoreCLR.cs#L2401-L2417
Note that the shape of GCHandle.InternalGet is the following:
https://github.com/dotnet/runtime/blob/621cd59436cb29cab4b1162409ae0947c4bd780d/src/coreclr/System.Private.CoreLib/src/System/Runtime/InteropServices/GCHandle.CoreCLR.cs#L16-L23
This made me strongly suspect it is related to #68739. However, we actually crash during the inlined InternalGet call when dereferencing the field:
00007ff9`66927020 56 push rsi
00007ff9`66927021 4883ec20 sub rsp,20h
00007ff9`66927025 8bf2 mov esi,edx
00007ff9`66927027 488b4110 mov rax,qword ptr [rcx+10h]
00007ff9`6692702b 4885c0 test rax,rax
00007ff9`6692702e 740a je 00007ff9`6692703a
>>> 00007ff9`66927030 488b00 mov rax,qword ptr [rax]
00007ff9`66927033 4885c0 test rax,rax
00007ff9`66927036 7402 je 00007ff9`6692703a
00007ff9`66927038 eb06 jmp 00007ff9`66927040
00007ff9`6692703a ff1560082aff call qword ptr [00007ff9`65bc78a0] (System.RuntimeType.InitializeCache(), mdToken: 0000000006000726)
00007ff9`66927040 488bc8 mov rcx,rax
00007ff9`66927043 8bd6 mov edx,esi
00007ff9`66927045 3909 cmp dword ptr [rcx],ecx
00007ff9`66927047 ff15c3a765ff call qword ptr [00007ff9`65f81810] (System.RuntimeType+RuntimeTypeCache.GetName(System.TypeNameKind), mdToken: 00000000060007C2)
00007ff9`6692704d 90 nop
00007ff9`6692704e 4883c420 add rsp,20h
00007ff9`66927052 5e pop rsi
00007ff9`66927053 c3 ret
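For readability, here is a hedged C-style sketch of what the disassembly above is doing (illustrative layout and names, not runtime source): m_cache holds a GCHandle, i.e. a pointer to a pointer, so reading the cache is a double dereference, and the faulting instruction is the second load, which dereferences the handle value itself.

```cpp
#include <cassert>
#include <cstdint>

// Sketch only: field offsets mirror the !DumpObj output below, but this is
// not the actual System.RuntimeType layout or runtime code.
struct FakeRuntimeType {
    void*    m_keepalive; // +0x08 in the dump
    intptr_t m_cache;     // +0x10 -- the field holding the bogus value
    intptr_t m_handle;    // +0x18
};

void* TryGetCache(FakeRuntimeType* t) {
    intptr_t handle = t->m_cache;      // mov rax, qword ptr [rcx+10h]
    if (handle != 0) {
        void* cache = *(void**)handle; // mov rax, qword ptr [rax] <- AV here when handle is garbage
        if (cache != nullptr)
            return cache;
    }
    return nullptr; // the real code falls through to RuntimeType.InitializeCache()
}
```

With a corrupted m_cache such as 0x96997, the non-null check passes and the second load faults.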
rcx is the following object:
0:034> !DumpObj 1e1`2fce6100
Name: System.RuntimeType
MethodTable: 00007ff965bb85d8
EEClass: 00007ff965bab748
Tracked Type: false
Size: 40(0x28) bytes
Type Name: System.AccessViolationException
Type MT: 00007ff966fbdaa0
File: C:\dev\dotnet\investigations\zero\zero.sync\bin\Release\net7.0\win-x64\publish\System.Private.CoreLib.dll
Fields:
MT Field Offset Type VT Attr Value Name
00007ff9664cbab0 4000303 100 ...Reflection.Binder 0 static 000001e11a577120 s_defaultBinder
00007ff965bbe630 4000304 9aa System.Char 1 static 2e Delimiter
00007ff965e9a490 4000305 108 System.Type[] 0 static 000001e11a42bc48 EmptyTypes
00007ff965a99280 4000306 110 System.Object 0 static 000001e11a42bc60 Missing
00007ff965f90fb8 4000307 118 ...tion.MemberFilter 0 static 000001e11a42bc78 FilterAttribute
00007ff965f90fb8 4000308 120 ...tion.MemberFilter 0 static 000001e11a42bcb8 FilterName
00007ff965f90fb8 4000309 128 ...tion.MemberFilter 0 static 000001e11a42bd10 FilterNameIgnoreCase
00007ff965a99280 40002a8 8 System.Object 0 instance 0000000000000000 m_keepalive
00007ff965c294b0 40002a9 10 System.IntPtr 1 instance 0000000000096997 m_cache <- not a valid pointer
00007ff965c294b0 40002aa 18 System.IntPtr 1 instance 00007FF966FBDAA0 m_handle
00007ff965bb85d8 40002ab c8 System.RuntimeType 0 static 000001e11a418948 ValueType
00007ff965bb85d8 40002ac d0 System.RuntimeType 0 static 000001e11a418920 ObjectType
00007ff965bb85d8 40002ad d8 System.RuntimeType 0 static 000001e11a400580 StringType
0000000000000000 40002af e0 System.OleAutBinder 0 static 0000000000000000 s_ForwardCallBinder
Notice that m_cache looks corrupted. So now I am not totally sure. Still, I will take a closer look at the GC info here to make sure.
(edit) The GC info looks fine to me.
When I run this under a debugger I see quite a few different crashes that all point to some heap corruption. It also seems to crash much more quickly than when not run under a debugger.
It looks a lot like allocations are being returned without being zeroed. In one example run we crashed immediately when trying to throw an exception:
0:000> !clrstack -f
OS Thread Id: 0xf5c0 (0)
Child SP IP Call Site
000000213D37CE70 00007FFA132C4FD9 KERNELBASE!RaiseException + 105
000000213D37CF50 00007FF9D84E9BF6 coreclr!RaiseTheExceptionInternalOnly + 678 at C:\dev\dotnet\runtime\src\coreclr\vm\excep.cpp:2805
000000213D37D080 00007FF9D85BF229 coreclr!IL_Throw + 185 at C:\dev\dotnet\runtime\src\coreclr\vm\jithelpers.cpp:4014
000000213D37D0E8 [HelperMethodFrame: 000000213d37d0e8]
000000213D37D1E0 00007ff9793f4994 BouncyCastle.Crypto.dll!Org.BouncyCastle.Math.EC.Rfc8032.Ed25519.ScalarMultBaseEncoded(Byte[], Byte[], Int32) + 148
000000213D37D230 00007ff97940bd99 BouncyCastle.Crypto.dll!Org.BouncyCastle.Math.EC.Rfc8032.Ed25519.ImplSign(Byte[], Int32, Byte[], Byte, Byte[], Int32, Int32, Byte[], Int32) + 329
000000213D37D2E0 00007ff97940bc29 BouncyCastle.Crypto.dll!Org.BouncyCastle.Math.EC.Rfc8032.Ed25519.Sign(Byte[], Int32, Byte[], Int32, Int32, Byte[], Int32) + 105
The managed function that creates the exception looks like:
0:000> !U /d 00007ff9793f4994
Normal JIT generated code
Org.BouncyCastle.Math.EC.Rfc8032.Ed25519.ScalarMultBaseEncoded(Byte[], Byte[], Int32)
ilAddr is 000001B0A42A6A28 pImport is 000001F8B79FF540
Begin 00007FF9793F4900, size 9b
00007ff9`793f4900 55 push rbp
00007ff9`793f4901 4883ec40 sub rsp,40h
00007ff9`793f4905 488d6c2440 lea rbp,[rsp+40h]
00007ff9`793f490a c5d857e4 vxorps xmm4,xmm4,xmm4
00007ff9`793f490e c5fa7f65e8 vmovdqu xmmword ptr [rbp-18h],xmm4
00007ff9`793f4913 33c0 xor eax,eax
00007ff9`793f4915 488945f8 mov qword ptr [rbp-8],rax
00007ff9`793f4919 48894d10 mov qword ptr [rbp+10h],rcx
00007ff9`793f491d 48895518 mov qword ptr [rbp+18h],rdx
00007ff9`793f4921 44894520 mov dword ptr [rbp+20h],r8d
00007ff9`793f4925 48b9a0c44779f97f0000 mov rcx,7FF97947C4A0h (MT: Org.BouncyCastle.Math.EC.Rfc8032.Ed25519+PointAccum)
00007ff9`793f492f e85c861e5f call coreclr!JIT_TrialAllocSFastMP_InlineGetThread (00007ff9`d85dcf90)
00007ff9`793f4934 488945f0 mov qword ptr [rbp-10h],rax
00007ff9`793f4938 488b4df0 mov rcx,qword ptr [rbp-10h]
00007ff9`793f493c ff15de6d0900 call qword ptr [00007ff9`7948b720] (Org.BouncyCastle.Math.EC.Rfc8032.Ed25519+PointAccum..ctor(), mdToken: 0000000006004504)
00007ff9`793f4942 488b4df0 mov rcx,qword ptr [rbp-10h]
00007ff9`793f4946 48894df8 mov qword ptr [rbp-8],rcx
00007ff9`793f494a 488b4d10 mov rcx,qword ptr [rbp+10h]
00007ff9`793f494e 488b55f8 mov rdx,qword ptr [rbp-8]
00007ff9`793f4952 ff15a06bc9ff call qword ptr [00007ff9`7908b4f8] (Org.BouncyCastle.Math.EC.Rfc8032.Ed25519.ScalarMultBase(Byte[], PointAccum), mdToken: 0000000006000BE1)
00007ff9`793f4958 488b4df8 mov rcx,qword ptr [rbp-8]
00007ff9`793f495c 488b5518 mov rdx,qword ptr [rbp+18h]
00007ff9`793f4960 448b4520 mov r8d,dword ptr [rbp+20h]
00007ff9`793f4964 ff154668c9ff call qword ptr [00007ff9`7908b1b0] (Org.BouncyCastle.Math.EC.Rfc8032.Ed25519.EncodePoint(PointAccum, Byte[], Int32), mdToken: 0000000006000BBE)
00007ff9`793f496a 85c0 test eax,eax
00007ff9`793f496c 7526 jne 00007ff9`793f4994
00007ff9`793f496e 48b9c0a5cf78f97f0000 mov rcx,7FF978CFA5C0h (MT: System.InvalidOperationException)
00007ff9`793f4978 e813861e5f call coreclr!JIT_TrialAllocSFastMP_InlineGetThread (00007ff9`d85dcf90)
00007ff9`793f497d 488945e8 mov qword ptr [rbp-18h],rax
00007ff9`793f4981 488b4de8 mov rcx,qword ptr [rbp-18h]
00007ff9`793f4985 ff1525d490ff call qword ptr [00007ff9`78d01db0] (System.InvalidOperationException..ctor(), mdToken: 000000000600156B)
00007ff9`793f498b 488b4de8 mov rcx,qword ptr [rbp-18h]
00007ff9`793f498f e8dca71c5f call coreclr!IL_Throw (00007ff9`d85bf170)
>>> 00007ff9`793f4994 90 nop
00007ff9`793f4995 4883c440 add rsp,40h
00007ff9`793f4999 5d pop rbp
00007ff9`793f499a c3 ret
And the object does not look to have been zeroed:
0:000> !dumpobj 170`1d1d3708
Name: System.InvalidOperationException
MethodTable: 00007ff978cfa5c0
EEClass: 00007ff978ce5ac0
Tracked Type: false
Size: 128(0x80) bytes
File: C:\dev\dotnet\investigations\zero\zero.sync\bin\Release\net7.0\win-x64\publish\System.Private.CoreLib.dll
Fields:
MT Field Offset Type VT Attr Value Name
00007ff978ddc328 400023f 8 ...ection.MethodBase 0 instance dddddddddddddddd _exceptionMethod
00007ff978aa2ce0 4000240 10 System.String 0 instance 000001701d1d37d8 _message
00007ff978aa6af8 4000241 18 ...tions.IDictionary 0 instance 0000000000000000 _data
00007ff978aa4688 4000242 20 System.Exception 0 instance 0000000000000000 _innerException
00007ff978aa2ce0 4000243 28 System.String 0 instance dddddddddddddddd _helpURL
00007ff978cf2380 4000244 30 System.Byte[] 0 instance 000001701d1d3a08 _stackTrace
00007ff978cf2380 4000245 38 System.Byte[] 0 instance 0000000000000000 _watsonBuckets
00007ff978aa2ce0 4000246 40 System.String 0 instance 0000000000000000 _stackTraceString
00007ff978aa2ce0 4000247 48 System.String 0 instance dddddddddddddddd _remoteStackTraceString
00007ff978a1a808 4000248 50 System.Object[] 0 instance 0000000000000000 _dynamicMethods
00007ff978aa2ce0 4000249 58 System.String 0 instance 0000000000000000 _source
00007ff978a8a828 400024a 60 System.UIntPtr 1 instance 00007FF9793F4994 _ipForWatsonBuckets
00007ff978a894b0 400024b 68 System.IntPtr 1 instance DDDDDDDDDDDDDDDD _xptrs
00007ff978a49e10 400024c 70 System.Int32 1 instance -532462766 _xcode
00007ff978a49e10 400024d 74 System.Int32 1 instance -2146233079 _HResult
0:000> !verifyheap
object 000001701d1d3708: bad member DDDDDDDDDDDDDDDD at 000001701D1D3710
Last good object: 000001701D1D36C8.
cc @dotnet/gc
Could this crash be the same? It happened in a build whose PR doesn't have any related changes.
Fatal error. Internal CLR error. (0x80131506)
at System.Collections.Generic.IEnumerable`1[[System.__Canon, System.Private.CoreLib, Version=7.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]].GetEnumerator()
at Mono.Linker.Steps.MarkStep.ProcessVirtualMethod(Mono.Cecil.MethodDefinition)
at Mono.Linker.Steps.MarkStep.ProcessVirtualMethods()
at Mono.Linker.Steps.MarkStep.ProcessPrimaryQueue()
at Mono.Linker.Steps.MarkStep.Process()
at Mono.Linker.Steps.MarkStep.Process(Mono.Linker.LinkContext)
at Mono.Linker.Pipeline.ProcessStep(Mono.Linker.LinkContext, Mono.Linker.Steps.IStep)
at Mono.Linker.Pipeline.Process(Mono.Linker.LinkContext)
at Mono.Linker.Driver.Run(Mono.Linker.ILogger, Boolean)
at Mono.Linker.Driver.Main(System.String[])
I hacked a part of code that hooks into the scheduler & thread pool. Specifically this line:
It contains a primed InvalidOperationException, ready to be used over and over. I just did some research, and what I did there seems illegal. Let me take it out and see what happens. So far so good.
Thanks for that!
So initializing the exception as a static in a struct is causing it to be not initialized correctly?
> So initializing the exception as a static in a struct is causing it to be not initialized correctly?
No. I think throwing the same instance is the problem. Since removing it the error has gone away. Also the error started around the time I injected that code. I can't tell for sure because those exceptions were triggering at different rates at that time.
@tactical-drone @mangod9 The codegen above indicates that the "failing" code is constructing and immediately throwing an exception. It should not be possible for such an object to contain a field with an invalid GC ref. I still think this is a GC or runtime bug, but it might be related to something EH ends up doing because of that same exception instance.
Yea, I am taking a wild guess here. If it is not the exception then great!
@janvorli as FYI
@dotnet/gc when I uncomment the verify_mem_cleared call at the end of gc_heap::adjust_limit_clr (thanks @cshung), I hit the assertion under the debugger very quickly:
0:043> k
# Child-SP RetAddr Call Site
00 000000e5`ec57a6f8 00007ff9`45d3cad1 KERNELBASE!wil::details::DebugBreak+0x2
01 (Inline Function) --------`-------- coreclr!WKS::FATAL_GC_ERROR+0x5 [C:\dev\dotnet\runtime3\src\coreclr\gc\gcpriv.h @ 27]
02 (Inline Function) --------`-------- coreclr!WKS::verify_mem_cleared+0x4a [C:\dev\dotnet\runtime3\src\coreclr\gc\gc.cpp @ 14501]
03 000000e5`ec57a700 00007ff9`45d66705 coreclr!WKS::gc_heap::adjust_limit_clr+0x591 [C:\dev\dotnet\runtime3\src\coreclr\gc\gc.cpp @ 15491]
04 (Inline Function) --------`-------- coreclr!WKS::gc_heap::a_fit_free_list_p+0x3fd [C:\dev\dotnet\runtime3\src\coreclr\gc\gc.cpp @ 15898]
05 000000e5`ec57a780 00007ff9`45d3ee8b coreclr!WKS::gc_heap::soh_try_fit+0x435 [C:\dev\dotnet\runtime3\src\coreclr\gc\gc.cpp @ 16405]
06 000000e5`ec57a820 00007ff9`45d68bee coreclr!WKS::gc_heap::allocate_soh+0x28b [C:\dev\dotnet\runtime3\src\coreclr\gc\gc.cpp @ 16537]
07 000000e5`ec57a8c0 00007ff9`45d36e21 coreclr!WKS::gc_heap::try_allocate_more_space+0x33e [C:\dev\dotnet\runtime3\src\coreclr\gc\gc.cpp @ 17489]
08 (Inline Function) --------`-------- coreclr!WKS::gc_heap::allocate_more_space+0x11 [C:\dev\dotnet\runtime3\src\coreclr\gc\gc.cpp @ 17961]
09 (Inline Function) --------`-------- coreclr!WKS::gc_heap::allocate+0x60 [C:\dev\dotnet\runtime3\src\coreclr\gc\gc.cpp @ 17992]
0a 000000e5`ec57a940 00007ff9`459d1d49 coreclr!WKS::GCHeap::Alloc+0x301 [C:\dev\dotnet\runtime3\src\coreclr\gc\gc.cpp @ 44969]
0b 000000e5`ec57aa30 00007ff9`459d3fef coreclr!Alloc+0x1e9 [C:\dev\dotnet\runtime3\src\coreclr\vm\gchelpers.cpp @ 226]
0c 000000e5`ec57ab20 00007ff9`45a365cc coreclr!AllocateSzArray+0x1ef [C:\dev\dotnet\runtime3\src\coreclr\vm\gchelpers.cpp @ 478]
0d 000000e5`ec57ac40 00007ff8`e8688e35 coreclr!JIT_NewArr1+0x2bc [C:\dev\dotnet\runtime3\src\coreclr\vm\jithelpers.cpp @ 2625]
0:043> !clrstack -f
OS Thread Id: 0xf5f4 (43)
Child SP IP Call Site
000000E5EC57A6F8 00007FFA1335D532 KERNELBASE!wil::details::DebugBreak + 2
000000E5EC57A700 00007FF945D3CAD1 coreclr!WKS::gc_heap::adjust_limit_clr + 1425 at C:\dev\dotnet\runtime3\src\coreclr\gc\gc.cpp:15491
000000E5EC57A780 00007FF945D66705 coreclr!WKS::gc_heap::soh_try_fit + 1077 at C:\dev\dotnet\runtime3\src\coreclr\gc\gc.cpp:16405
000000E5EC57A820 00007FF945D3EE8B coreclr!WKS::gc_heap::allocate_soh + 651 at C:\dev\dotnet\runtime3\src\coreclr\gc\gc.cpp:16537
000000E5EC57A8C0 00007FF945D68BEE coreclr!WKS::gc_heap::try_allocate_more_space + 830 at C:\dev\dotnet\runtime3\src\coreclr\gc\gc.cpp:17489
000000E5EC57A940 00007FF945D36E21 coreclr!WKS::GCHeap::Alloc + 769 at C:\dev\dotnet\runtime3\src\coreclr\gc\gc.cpp:44969
000000E5EC57AA30 00007FF9459D1D49 coreclr!Alloc + 489 at C:\dev\dotnet\runtime3\src\coreclr\vm\gchelpers.cpp:226
000000E5EC57AB20 00007FF9459D3FEF coreclr!AllocateSzArray + 495 at C:\dev\dotnet\runtime3\src\coreclr\vm\gchelpers.cpp:478
000000E5EC57AC40 00007FF945A365CC coreclr!JIT_NewArr1 + 700 at C:\dev\dotnet\runtime3\src\coreclr\vm\jithelpers.cpp:2625
000000E5EC57AD88 [HelperMethodFrame: 000000e5ec57ad88]
000000E5EC57AEB0 00007ff8e8688e35 BouncyCastle.Crypto.dll!Org.BouncyCastle.Math.EC.Rfc8032.Ed25519.PointDouble(PointAccum) + 85
000000E5EC57AF20 00007ff8e862cf2d BouncyCastle.Crypto.dll!Org.BouncyCastle.Math.EC.Rfc8032.Ed25519.ScalarMultStrausVar(UInt32[], UInt32[], PointAffine, PointAccum) + 461
000000E5EC57AFE0 00007ff8e862bfad BouncyCastle.Crypto.dll!Org.BouncyCastle.Math.EC.Rfc8032.Ed25519.ImplVerify(Byte[], Int32, Byte[], Int32, Byte[], Byte, Byte[], Int32, Int32) + 653
000000E5EC57B090 00007ff8e862bcf9 BouncyCastle.Crypto.dll!Org.BouncyCastle.Math.EC.Rfc8032.Ed25519.Verify(Byte[], Int32, Byte[], Int32, Byte[], Int32, Int32) + 105
000000E5EC57B100 00007ff8e862bc75 zero.cocoon.dll!zero.cocoon.identity.CcDesignation.Verify(Byte[], Int32, Int32, Byte[], Int32, Byte[], Int32) + 69 [C:\dev\dotnet\investigations\zero\zero.cocoon\identity\CcDesignation.cs @ 93]
000000E5EC57B150 00007ff8e862587c zero.cocoon.dll!zero.cocoon.models.CcDiscoveries+d__16.MoveNext() + 4908 [C:\dev\dotnet\investigations\zero\zero.cocoon\models\CcDiscoveries.cs @ 288]
000000E5EC57BB40 00007ff8e86244b6 System.Private.CoreLib.dll!System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[[zero.cocoon.models.CcDiscoveries+d__16, zero.cocoon]](d__16 ByRef) + 102
000000E5EC57BBA0 00007ff8e862442c System.Private.CoreLib.dll!System.Runtime.CompilerServices.AsyncValueTaskMethodBuilder`1[[zero.core.patterns.bushings.contracts.IoJobMeta+JobState, zero.core]].Start[[zero.cocoon.models.CcDiscoveries+d__16, zero.cocoon]](d__16 ByRef) + 28
000000E5EC57BBD0 00007ff8e86243cd zero.cocoon.dll!zero.cocoon.models.CcDiscoveries.ConsumeAsync() + 125
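Conceptually, the check that fires is a scan asserting that the memory handed out for an allocation context is all zero before the allocator returns it to managed code. A minimal sketch (illustrative only, not the actual gc.cpp implementation):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Walk a memory range and report whether every byte is zero. The real
// verify_mem_cleared-style check asserts (FATAL_GC_ERROR) on the first
// non-zero byte instead of returning a bool.
bool IsMemCleared(const void* start, size_t size) {
    const uint8_t* p = static_cast<const uint8_t*>(start);
    for (size_t i = 0; i < size; i++) {
        if (p[i] != 0)
            return false;
    }
    return true;
}
```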
Note that when cloning the repro case you may want to use an older commit as the change discussed above seems to have an effect on the issue. I reproduced it on https://github.com/tactical-drone/zero/commit/6218976a9ed1d22fec5075417c8eda2a5618343c.
Tagging subscribers to this area: @dotnet/gc. See info in area-owners.md if you want to be subscribed.

| Author: | tactical-drone |
|---|---|
| Assignees: | - |
| Labels: | `area-GC-coreclr`, `area-VM-coreclr` |
| Milestone: | 7.0.0 |
If the problem goes away after deleting the exception singleton (https://github.com/tactical-drone/zero/blob/760d42656136c29e0e219c757e7558ddbe18385c/zero.core/patterns/semaphore/core/IoManualResetValueTaskSourceCore.cs#L20), it is most likely a race condition in how the stacktrace is saved into the exception.
Are you able to see the same InvalidOperationException being thrown and caught on multiple threads in parallel before the crash?
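The hazard being described can be sketched in a few lines (illustrative C++ with hypothetical names, not runtime code): a single shared exception instance whose per-throw state lives in one field, so whichever thread throws last wins and every other handler observes its data.

```cpp
#include <cassert>
#include <string>

// One shared "exception" object reused for every throw: each throw
// overwrites the trace captured by the previous one.
struct SharedException {
    std::string stack_trace;
};

SharedException g_shared; // the singleton

void ThrowFrom(const char* site) {
    g_shared.stack_trace = site; // stomps whatever the previous thrower stored
}
```

With real threads the two writes also race without synchronization; this sequential version just shows the clobbering deterministically.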
@jkotas based on @cshung's advice, @jakobbotsch has enabled a piece of code in the GC that verifies that the memory given out for the allocation context is zeroed, and the assert in that check is firing. So it seems to indicate the problem of uninitialized exception object fields is in the GC. While the stack trace would likely get garbled if that exception instance is thrown from multiple threads, it should not influence fields like _exceptionMethod or _helpUrl, I believe.
Keep in mind that the exception "singleton" was on a generic struct, so there are multiple of them. But the ones that were thrown in this case were only of the <bool> kind. The others never threw.
@jkotas Yes, they might throw in parallel, including inside places like CompilerServices.AsyncTaskMethodBuilder.
> So it seems to indicate the problem of uninitialized exception object fields is in GC
Or it can indicate a buffer overrun or incorrectly protected OBJECTREF in the VM.
Clone & compile https://github.com/tactical-drone/zero and run it.
What is the exact command to run the project?
> What is the exact command to run the project?
I published the zero.sync project self-contained, copied the checked bits in there, and just ran zero.sync.exe.
On the latest revision I don't see those errors happening anymore. There are still occasionally strange crashes, this time something like: Fatal error. System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt. (but this one is rare)
But to be fair, I do stress the runtime on purpose with hybrid attachments to core parts of the runtime that might be causing the problems. If you press loadTest in the prompt, each drone will send any new state it sees to its neighbors as fast as possible, load-balanced using the .NET scheduler & the latest async/await tech. Pretty impressive what .NET is capable of these days.
When you are finished and want to perform the hardcore test that .NET fails every time, enter zero and wait for the teardown to complete. Then mash gc a couple of times and verify that the memory usage went from 8 GB back down to (currently) 400 MB. This 400 MB used to be 1.9 GB not so long ago, so the .NET GC is getting better and better.
The crash does not repro for me (Intel Core i7, 32GB, 8 logical cores). If anybody can share a full dump (w/ symbols if it is using private coreclr build), I will be happy to take a look.
I have looked at several dumps from @jakobbotsch. I agree that this looks like a bug in the GC. I do not see any pattern that would suggest a GC reporting hole.
cc @dotnet/gc
From what we have seen so far, the issue repros well on AMD machines. It is hard/impossible to repro on Intel machines.
@cshung could you please try to repro this on our amd machine if it's not obvious from the dumps why this occurs?
> From what we have seen so far, the issue repros well on AMD machines. It is hard/impossible to repro on Intel machines.
I like this theory. AFAIK I still think that Interlocked.CompareExchange on my machine sometimes does not work. I have program state where, if there is a race condition between two of these calls, the return value gets messed up in strange ways. It's like the winning caller is told it won, but its value is not set on the losing side. On the losing side the outcome is that some phantom blocker (maybe a previous CAS from 1.7 seconds ago, according to telemetry) has suddenly appeared that could not have been there just a moment ago when CompilerServices made the call to OnComplete. The new blocker cannot be queued; therefore, catastrophic failure. What should have happened: the sentinel should have been detected (instead of the phantom blocker), the queue canceled, and the continuation executed. Both racers are caught as they happen, and the state suggests there must have been a dormant third racer in the area whose whereabouts I cannot explain, but whose state changes must be jammed in there also. Or my logic is bad.

Naturally this is pure conjecture, because the state I see makes no sense; not even wild theories can explain it. History dictates that the problem is PEBKAC, but sometimes I wonder. The things I am trying to achieve with these instructions are complex and hard to debug, but I am convinced I have been consistently plagued by some strange phenomenon with this instruction on AMD. What is hard to believe is that sometimes it works for 20 million iterations and sometimes it fails after 300; the range is large. (And AMD, because it has that memory bridge between the two dies that could be causing this.)

In another part of this chain, even slightly changing this function to an equivalent one with different IL made my cluster test bog down after 300 adjuncts had been added for clustering. So to be clear: my .NET scheduler outcome is critically dependent on the outcome of this CAS. Something subtle happening here is eventually bricking my scheduler's queues, which depend on this CAS selecting the correct cores.

AFAIK this code should lead to the same outcomes. This means there are timing issues around this Interlocked.CompareExchange instruction. Either I am using it wrong, or it is broken. I dunno.
// Version 1: re-reads val at the bottom of the loop body
long latch = val;
while (latch + 1 > cap || Interlocked.CompareExchange(ref val, latch + 1, latch) != latch)
{
    if (latch + 1 > cap)
        return cap;
    latch = val;
}

// Version 2: supposedly equivalent, but re-reads val in the loop condition itself
long latch;
while ((latch = val) + 1 > cap || Interlocked.CompareExchange(ref val, latch + 1, latch) != latch)
{
    if (latch + 1 > cap)
        return cap;
}
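For comparison, here is a minimal sketch of the same capped-increment pattern written with C++ std::atomic (illustrative names, not the original code); both C# variants above should be behaviorally equivalent to this loop:

```cpp
#include <atomic>
#include <cassert>

// Capped atomic increment: returns the new value, or cap once the counter
// is at the limit. compare_exchange_weak reloads `latch` with the observed
// value on failure, so the loop re-reads automatically on contention.
long CappedIncrement(std::atomic<long>& val, long cap) {
    long latch = val.load();
    for (;;) {
        if (latch + 1 > cap)
            return cap;
        if (val.compare_exchange_weak(latch, latch + 1))
            return latch + 1;
    }
}
```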
It's possible there are two separate bugs here:
1) The bug I have shared dumps for, which happens with the debugger attached and hits the GC assert in verify_mem_cleared.
2) Without a debugger attached I see a more GC-hole-looking issue (an m_alignpad == 0 assertion failure) that does not repro nearly as consistently.
@jakobbotsch Good point. The reason I mention these CAS issues is that while I was redesigning my semaphore over and over, it seemed like these parts had the biggest effect on the program totally crashing in these strange ways. The CAS instructions lead to ValueTasks that break, throwing the exceptions above at high rates and causing the original issue. But in a way you are correct: these are separate issues.

Naturally I don't think these bugs will occur if you use .NET like a civilized individual. These errors do not occur when I disable my scheduler, and my scheduler depends on 2 CAS instructions to work. The rest of the concurrency issues are solved by using runtime tech instead of my own contraptions, unless the glue provided in ZeroCore is fundamentally wrong. I spent a lot of time coming up with that design (using runtime tech instead of another contraption), so I would be surprised but sad if there was a fundamental issue in the way I multiplex ValueTaskSourceCores.
To be clear: I don't want to send you guys on a wild goose chase when I mess with custom schedulers. Obviously any crash reported is a good thing. The least I can do is tell you how to disable my custom bits so that you can better discern whether you are chasing my bugs.
> Without debugger attached I see a more GC-hole looking issue (m_alignpad == 0 assertion failure) that does not repro nearly as consistently.
Was the assert in verify_mem_cleared still enabled when you encountered these m_alignpad crashes?
> Was the assert in verify_mem_cleared still enabled when you encountered these m_alignpad crashes?
Yes.
We have tracked down the first issue above to what seems to be a WinDbg bug where the upper halves of ymm registers become corrupted while the GC is clearing memory. I was able to create a small C++ repro case where executing a sequence of WinDbg breakpoint/stepping commands results in a memset call that fills a buffer with garbage. In this repro we execute a vinsertf128 ymm0, ymm0, xmm0, 1 instruction with xmm0 = 0, yet ymm0 is non-zero after executing the instruction.

Interestingly, on Intel CPUs the case does not repro and after stepping ymm0 is correctly 0, so there is potential for this to be a hardware bug as well. I have sent the repro to the WinDbg team to have a closer look.
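To make the expected semantics concrete: vinsertf128 dst, src1, src2, 1 copies the low 128 bits from src1 and replaces the high 128 bits with src2, so with src2 = 0 the upper half of the result must be zero regardless of prior register contents. A scalar C++ model of that behavior (a sketch of the architectural spec, deliberately not using intrinsics):

```cpp
#include <array>
#include <cassert>
#include <cstdint>

using Ymm = std::array<uint64_t, 4>; // four 64-bit lanes = 256 bits
using Xmm = std::array<uint64_t, 2>; // two 64-bit lanes = 128 bits

// Model of `vinsertf128 dst, src1, src2, 1`: low half from src1,
// high half replaced by src2.
Ymm InsertF128Hi(const Ymm& src1, const Xmm& src2) {
    return { src1[0], src1[1], src2[0], src2[1] };
}
```

The debugger-stepped run above violates exactly this: with the xmm source zero, the high lanes of the result still came out non-zero.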
I believe that @janvorli will investigate the second bug, which is likely the real bug reported here.
@jakobbotsch Excellent news.
W.r.t. the second problem: not sure what I am doing wrong, but I now have 3 separate code paths where my CAS fails under pressure tests. I believe one of these failures is in provably correct code because it is runtime tech. The reason I struggle is that the pressure needs to be high for the bug to appear, so many times I think I have fixed the problem when instead I have just made the performance worse.
The current status quo is that I see old CAS values coming out of nowhere, jamming my state. Not sure if they are stuck in the scheduler, the CPU, or my broken logic. Maybe I should just run this code in a Linux VM and see what happens.
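For what it's worth, "old values coming out of nowhere" is the classic symptom of an ABA hazard. A hypothetical single-threaded illustration (not the actual scheduler code) of how a CAS can succeed against stale state:

```csharp
using System;
using System.Threading;

class AbaDemo
{
    static object _slot = new object();

    static void Main()
    {
        var a = _slot;
        var seen = Volatile.Read(ref _slot);           // "thread 1" reads A

        Interlocked.Exchange(ref _slot, new object()); // "thread 2": A -> B
        Interlocked.Exchange(ref _slot, a);            // "thread 2": B -> A

        // The CAS only compares references; it cannot see the A -> B -> A churn.
        var won = ReferenceEquals(
            Interlocked.CompareExchange(ref _slot, new object(), seen), seen);

        Console.WriteLine(won); // True: the CAS "succeeds" against stale state
    }
}
```

If the slot can be recycled back to an identical reference between the read and the exchange, the CAS winner may be operating on state that has since been torn down and rebuilt.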
Interestingly, if I lower the node count to more civilized values such as 10, the pressure test results are pristine. You would expect catastrophic failure there too: because of the lower contention the nodes run smoother and faster, and dotnet is telling me it is doing 200 000 ops per second. So if there was a logic bug, you would expect the entire chain to collapse instantly at these speeds. It does not.
With higher values such as 50 nodes it will run for a while, then suddenly my CPU flatlines and the trace comes up with this bug again. I get this bug every now and again and I have no idea what it means. It just says that the runtime started spamming HandleNonSuccessAndDebuggerNotification. I normally get rid of it by perturbing my CAS instructions with supplementary logic until it stops. The first time I saw this spam was with my first semaphore, which used entirely my own contraptions instead of runtime tech. It almost feels like once you reach a certain performance threshold (under load) it just bricks like that.
I can repro the m_alignPad assert (like Jakob has reported) when running the repro with checked and debug builds of the runtime. It repros in almost every run for me and the call stack is always the same. Here is the native top of the stack:
# Child-SP RetAddr Call Site
00 000000f0`35ebc110 00007ff9`901e6f0f KERNELBASE!RaiseFailFastException+0x152
01 000000f0`35ebc6f0 00007ff9`901e7cf3 coreclr!TerminateOnAssert+0x1f [G:\git\runtime\src\coreclr\utilcode\debug.cpp @ 202]
02 000000f0`35ebc720 00007ff9`901e6fb1 coreclr!_DbgBreakCheck+0x3c3 [G:\git\runtime\src\coreclr\utilcode\debug.cpp @ 414]
03 000000f0`35ebd9c0 00007ff9`901e7651 coreclr!_DbgBreakCheckNoThrow+0x91 [G:\git\runtime\src\coreclr\utilcode\debug.cpp @ 433]
04 000000f0`35ebdb10 00007ff9`8fac1024 coreclr!DbgAssertDialog+0x341 [G:\git\runtime\src\coreclr\utilcode\debug.cpp @ 594]
05 000000f0`35ebdd50 00007ff9`8fa15cfe coreclr!ObjHeader::IllegalAlignPad+0x64 [G:\git\runtime\src\coreclr\vm\syncblk.cpp @ 2952]
06 000000f0`35ebdd90 00007ff9`8fac83b7 coreclr!ObjHeader::GetBits+0x1e [G:\git\runtime\src\coreclr\vm\syncblk.h @ 1548]
07 000000f0`35ebddc0 00007ff9`8fa1b531 coreclr!ObjHeader::Validate+0x37 [G:\git\runtime\src\coreclr\vm\syncblk.cpp @ 2042]
08 000000f0`35ebde80 00007ff9`8fa1af14 coreclr!Object::ValidateInner+0x5c1 [G:\git\runtime\src\coreclr\vm\object.cpp @ 581]
09 000000f0`35ebe0c0 00007ff9`8fa13425 coreclr!Object::Validate+0xd4 [G:\git\runtime\src\coreclr\vm\object.cpp @ 498]
0a 000000f0`35ebe130 00007ff9`8fcf166a coreclr!OBJECTREF::OBJECTREF+0xd5 [G:\git\runtime\src\coreclr\vm\object.cpp @ 1122]
0b 000000f0`35ebe1a0 00007ff9`337ddc04 coreclr!JIT_VirtualFunctionPointer+0xda [G:\git\runtime\src\coreclr\vm\jithelpers.cpp @ 3477]
The corruption is quite strange. The object at frame 9 is of type System.Object and the m_alignPad of the following object that we assert on contains a valid managed reference.
0:166> !lno 0x00000261`9865e528
Before: 000002619865e500 40 (0x28) zero.core.patterns.queue.enumerator.IoQEnumerator`1[[zero.core.patterns.bushings.IoSink`1[[zero.core.feat.models.protobuffer.CcProtocMessage`2[[Zero.Models.Protobuf.CcWhisperMsg, zero.cocoon],[zero.cocoon.models.batches.CcGossipBatch, zero.cocoon]], zero.core.feat]], zero.core]]
Current: 000002619865e528 24 (0x18) System.Object
After: 000002619865e568 24 (0x18) zero.core.patterns.semaphore.core.IoManualResetValueTaskSource`1[[System.Boolean, System.Private.CoreLib]]
Heap local consistency not confirmed.
This is the memory at the object address:
dq 00000261`9865e528
00000261`9865e528 00007ff9`31159710 00000000`00000000
00000261`9865e538 00000261`9864b620 00000261`a620c9f8
00000261`9865e548 00000000`00000000 00000000`00000000
00000261`9865e558 00000000`00000000 00000000`00000000
Here is what the reference that is in the m_alignPad location points to:
0:166> !do 00000261`9864b620
Name: zero.cocoon.CcDrone
MethodTable: 00007ff93186a9c0
EEClass: 00007ff93187bc88
Tracked Type: false
Size: 296(0x128) bytes
File: D:\test\dotnet\zero\zero.sync\bin\Release\net7.0\win-x64\publish\zero.cocoon.dll
Fields:
MT Field Offset Type VT Attr Value Name
00007ff93132c670 400008c 38 System.Int64 1 instance 15213 _zId
00007ff9313bd6f0 400008d 8 System.String 0 instance 0000026175800218 <Description>k__BackingField
00007ff931698db8 400008e 10 ...ns.misc.IIoNanite 0 instance 000002619863cb88 <ZeroedFrom>k__BackingField
00007ff9313bd6f0 400008f 18 System.String 0 instance 00000261a61f4180 <ZeroReason>k__BackingField
00007ff9313264b0 4000090 48 System.Int32 1 instance 1200607328 TearDownTime
00007ff9313264b0 4000091 4c System.Int32 1 instance 0 CascadeTime
00007ff93132c670 4000092 40 System.Int64 1 instance 1200602109 UpTime
00007ff9313264b0 4000093 50 System.Int32 1 instance 0 _zeroPrimed
00007ff9313264b0 4000094 54 System.Int32 1 instance 1 _zeroed
00007ff9313264b0 4000095 58 System.Int32 1 instance 1 _zeroedSec
00007ff932776810 4000096 20 ...oSub, zero.core]] 0 instance 000002619864b898 _zeroHive
00007ff9327770f0 4000097 28 ...nite, zero.core]] 0 instance 0000026198652390 _zeroHiveMind
00007ff9313264b0 4000098 5c System.Int32 1 instance 1 _concurrencyLevel
00007ff9322c81d0 400009b 30 ...lationTokenSource 0 instance 000002619864b868 AsyncTasks
00007ff93177dbe8 400008a 28 NLog.ILogger 0 static 000002617577f058 _logger
00007ff93132c670 400008b 130 System.Int64 1 static 18954 _uidSeed
00007ff93132c670 4000099 138 System.Int64 1 static 2056 _zCount
00007ff932700608 400009a 30 ...Private.CoreLib]] 0 static 000002617577ea08 ZeroRoot
00007ff9326d47c8 40000d8 60 ...Private.CoreLib]] 0 instance 000002619863cb88 <Source>k__BackingField
00007ff9312c4a28 40000d9 cc System.Boolean 1 instance 0 <ZeroRecoveryEnabled>k__BackingField
0000000000000000 40000da 68 0 instance 000002619865a7b0 _queue
0000000000000000 40000db 70 0 instance 000002619865e388 <JobHeap>k__BackingField
00007ff933186f48 40000dc 78 ...Private.CoreLib]] 0 instance 000002619865e568 _zeroSync
00007ff9313bd6f0 40000dd 80 System.String 0 instance 000002619864b768 _description
00007ff93177dbe8 40000de 88 NLog.ILogger 0 instance 000002618308e460 _logger
00007ff9312c4a28 40000df cd System.Boolean 1 instance 0 <IsArbitrating>k__BackingField
0000000000000000 40000e0 90 0 instance 0000000000000000 _previousJobFragment
00007ff93132c670 40000e1 98 System.Int64 1 instance 0 _eventCounter
00007ff9313264b0 40000e2 a0 System.Int32 1 instance 128 parm_max_q_size
00007ff9313264b0 40000e3 a4 System.Int32 1 instance 2000 parm_min_uptime_ms
00007ff9313264b0 40000e4 a8 System.Int32 1 instance 1000000 parm_stats_mod_count
00007ff9313264b0 40000e5 ac System.Int32 1 instance 10 parm_min_frame_time
00007ff9313264b0 40000e6 b0 System.Int32 1 instance 1000 parm_error_timeout
00007ff9313264b0 40000e7 b4 System.Int32 1 instance 250 parm_conduit_spin_up_wait_time
00007ff9313264b0 40000e8 b8 System.Int32 1 instance 2 parm_max_consumer_threads
00007ff9313264b0 40000e9 bc System.Int32 1 instance 1000 parm_producer_consumer_throttle_delay
00007ff9313264b0 40000ea c0 System.Int32 1 instance 2500 parm_min_failed_production_time
00007ff9313264b0 40000eb c4 System.Int32 1 instance 5000 parm_consumer_wait_for_producer_timeout
00007ff9313264b0 40000ec c8 System.Int32 1 instance 1000 parm_producer_start_retry_time
00007ff9317d6138 4000133 d0 NLog.Logger 0 instance 00000261831152c0 _logger
00007ff93169b300 4000134 d8 ...Private.CoreLib]] 0 instance 000002617c90b348 Node
00007ff9317d6138 40000f7 e0 NLog.Logger 0 instance 0000026187e6d540 _logger
00007ff9313bd6f0 40000f8 e8 System.String 0 instance 00000261a620c598 _description
00007ff932c04c98 40000f9 f0 ...utopeer.CcAdjunct 0 instance 00000261870e3cf8 _adjunct
00007ff9313bd6f0 40000fa f8 System.String 0 instance 00000261870e3c98 _key
00007ff9312c4a28 40000fb 11c System.Boolean 1 instance 1 AccountingBit
00007ff9312c4a28 40000fc 11d System.Boolean 1 instance 1 _assimulated
00007ff9313264b0 40000fd 118 System.Int32 1 instance 1 parm_insane_checks_delay_s
00007ff931626398 40000fe 100 System.Byte[] 0 instance 000002619864b748 _vb
00007ff93172a9d8 40000ff 108 ...obuf.CcWhisperMsg 0 instance 0000026198679c18 _m
00007ff933a37b50 4000100 110 ...Private.CoreLib]] 0 instance 0000026198679e30 _sendBuf
And here is a history of that object:
SUCCESS: GCHist structures initialized
0:166> !histobj 00000261`9864b620
GCCount Object Roots
---------------------------------------------------------
19 000002619864B620 000002A20A332040,
18 000002619B325540
17 000002619B325540
16 000002619B325540
15 000002619B325540
14 000002619B325540
13 000002619B325540
12 000002619B325540
11 000002619B325540
10 000002619B325540
9 000002619B325540
6 000002619B325540
0:166> !histroot 000002A20A332040
GCCount Value MT Promoted? Notes
---------------------------------------------------------
19 000002619864B620 00007FF93186A9C0 no
...
0:166> !histobjfind 000002619864B620
GCCount Object Message
---------------------------------------------------------
19 000002619864B620 Relocation NEWVALUE for root 000002A20A332040
...
And one more piece of the puzzle: the heap verification shows a heap corruption.
0:166> !verifyheap
object 00000261986415f8: bad member 000002619865E540 at 0000026198641608
Last good object: 00000261986414E0.
Could not request method table data for object 000002619865E540 (MethodTable: 00000261A620C9F8).
Last good object: 000002619865E528.
That corresponds to the corruption shown above. The 000002619865E528 is the System.Object before the corrupted m_alignPad and 000002619865E540 is an address right after that align pad. The 00000261A620C9F8 is also a valid object reference, so both the m_alignPad and the following memory location were corrupted with valid references.
0:166> !do 00000261A620C9F8
Name: System.String
MethodTable: 00007ff9313bd6f0
EEClass: 00007ff931460898
Tracked Type: false
Size: 86(0x56) bytes
File: D:\test\dotnet\zero\zero.sync\bin\Release\net7.0\win-x64\publish\System.Private.CoreLib.dll
String: ZERO: ZeroManagedAsync: teardown
Fields:
MT Field Offset Type VT Attr Value Name
00007ff9313264b0 4000303 8 System.Int32 1 instance 32 _stringLength
00007ff9312cc430 4000304 c System.Char 1 instance 5a _firstChar
00007ff9313bd6f0 4000302 f8 System.String 0 static 0000026175800218 Empty
However, this reference has no GC root.
Thanks to @jakobbotsch's hint, I have found that the System.Object instance at 000002619865E528 (before the corrupted m_alignPad) is the object that was set as _sentinel in the IoZeroQ via Unsafe.As<T>(new object()).
0:000> !do 000002619865E3F0
Name: zero.core.patterns.queue.IoZeroQ`1[[zero.core.patterns.bushings.IoSink`1[[zero.core.feat.models.protobuffer.CcProtocMessage`2[[Zero.Models.Protobuf.CcWhisperMsg, zero.cocoon],[zero.cocoon.models.batches.CcGossipBatch, zero.cocoon]], zero.core.feat]], zero.core]]
MethodTable: 00007ff933a39488
EEClass: 00007ff9328746c8
Tracked Type: false
Size: 176(0xb0) bytes
File: D:\test\dotnet\zero\zero.sync\bin\Release\net7.0\win-x64\publish\zero.core.dll
Fields:
MT Field Offset Type VT Attr Value Name
00007ff93132c670 4000069 68 System.Int64 1 instance 0 _head
00007ff93132c670 400006a 70 System.Int64 1 instance 2 _hwm
00007ff9313bd6f0 400006b 8 System.String 0 instance 0000026175800218 _description
0000000000000000 400006c 10 SZARRAY 0 instance 000002619865e4b8 _storage
00007ff93186be90 400006d 18 System.__Canon[] 0 instance 000002619865e4d8 _fastStorage
00007ff931159808 400006e 20 System.__Canon 0 instance 000002619865e528 _sentinel
00007ff931159710 400006f 28 System.Object 0 instance 000002619865e4a0 _syncRoot
00007ff9328846d0 4000070 30 ...Private.CoreLib]] 0 instance 000002619865e500 _curEnumerator
00007ff9326df9d0 4000071 38 ...ZeroSemaphoreSlim 0 instance 0000000000000000 _fanSync
00007ff932886e80 4000072 40 ...Private.CoreLib]] 0 instance 0000000000000000 _zeroSync
00007ff9326df9d0 4000073 48 ...ZeroSemaphoreSlim 0 instance 0000000000000000 _balanceSync
0000000000000000 4000074 50 SZARRAY 0 instance 0000000000000000 _fanSyncs
0000000000000000 4000075 58 SZARRAY 0 instance 0000000000000000 _balanceSyncs
0000000000000000 4000076 60 SZARRAY 0 instance 0000000000000000 _zeroSyncs
00007ff93132c670 4000077 78 System.Int64 1 instance 1 _tail
00007ff9313264b0 4000078 80 System.Int32 1 instance 1 _zeroed
00007ff9313264b0 4000079 84 System.Int32 1 instance 2 _capacity
00007ff9313264b0 400007a 88 System.Int32 1 instance 0 _virility
00007ff9313264b0 400007b 8c System.Int32 1 instance 0 _blockingConsumers
00007ff9313264b0 400007c 90 System.Int32 1 instance 0 _sharingConsumers
00007ff9313264b0 400007d 94 System.Int32 1 instance 0 _pumpingConsumers
00007ff9313264b0 400007e 98 System.Int32 1 instance 0 _primedForScale
00007ff9312c4a28 400007f a0 System.Boolean 1 instance 0 _autoScale
00007ff9312c4a28 4000080 a1 System.Boolean 1 instance 0 _blockingCollection
00007ff9313264b0 4000081 9c System.Int32 1 instance 1 _count
Looking at the generic argument of the IoZeroQ, which is the type of the sentinel (IoSink<T>), and looking at the sources, I can see that the two references that corrupted the m_alignPad and the following qword in memory are of the types of the first reference fields in the IoSink. The string is the ZeroReason member, and the value of the string I've dumped above matches that (ZERO: ZeroManagedAsync: teardown).
In other words, somehow the _sentinel, which is an unsafely cast System.Object instance, is used as the IoSink, writing over the end of the real object.
I use that sentinel as an instance of the type that I am queueing, to manage concurrency. The sentinel is only inserted into the Q buffer by the CAS instructions that I complain about. I am not sure if it was strictly necessary to be there; I was trying to fix the CAS by making the other side detect the sentinel instead of issuing another compare-and-swap that could jam.
If I take out the sentinel the code becomes massively unstable, which should not be the case. This is a lead.
@tactical-drone It is never safe to reinterpret an object reference of one type as an object reference of another type where the same cast would not succeed. It will cause runtime instability issues, like you see in this issue.
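A minimal sketch of the safe shape this can take (the ObjectSlotQueue name and layout are made up for illustration, not the actual IoZeroQ code): keep the slots and the sentinel typed as object, CAS on object references, and only perform a checked cast when handing a real T back out.

```csharp
using System;
using System.Threading;

// Illustrative sketch only; not the actual IoZeroQ implementation.
sealed class ObjectSlotQueue<T> where T : class
{
    // The sentinel stays a plain object; it is never viewed as a T.
    private static readonly object Sentinel = new object();
    private readonly object[] _slots = new object[4];

    public bool TryPut(int i, T item) =>
        Interlocked.CompareExchange(ref _slots[i], item, null) == null;

    public bool TryTake(int i, out T value)
    {
        value = null;
        var seen = Volatile.Read(ref _slots[i]);
        if (seen == null || ReferenceEquals(seen, Sentinel))
            return false;

        // CAS the sentinel in; only the winner gets to hand the item out.
        if (!ReferenceEquals(Interlocked.CompareExchange(ref _slots[i], Sentinel, seen), seen))
            return false;

        value = (T)seen;                      // checked cast: safe by construction
        Volatile.Write(ref _slots[i], null);  // free the slot again
        return true;
    }
}
```

Because the sentinel never carries the static type T, no code path can ever treat it as a T, so it can never be disposed, enumerated, or have fields written through it.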
I have applied the following patch that eliminates the use of Unsafe.As:
@@ -50,8 +50,8 @@ namespace zero.core.patterns.queue
//TODO: tuning
capacity = Math.Max(4, capacity);
_hwm = _capacity = 1;
- _storage = new T[32][];
- _storage[0] = _fastStorage = new T[_capacity];
+ _storage = new object[32][];
+ _storage[0] = _fastStorage = new object[_capacity];
var v = IoMath.Log2((ulong)capacity) - 1;
var scaled = false;
@@ -66,8 +66,8 @@ namespace zero.core.patterns.queue
else
{
_hwm = _capacity = capacity;
- _storage = new T[1][];
- _storage[0] = _fastStorage = new T[_capacity];
+ _storage = new object[1][];
+ _storage[0] = _fastStorage = new object[_capacity];
}
if (_blockingCollection)
@@ -83,7 +83,7 @@ namespace zero.core.patterns.queue
_curEnumerator = new IoQEnumerator<T>(this);
- _sentinel = Unsafe.As<T>(new object());
+ _sentinel = new object();
}
#region packed
@@ -92,10 +92,10 @@ namespace zero.core.patterns.queue
private readonly string _description;
- private readonly T[][] _storage;
- private readonly T[] _fastStorage;
+ private readonly object[][] _storage;
+ private readonly object[] _fastStorage;
- private readonly T _sentinel;
+ private readonly object _sentinel;
private readonly object _syncRoot = new();
private volatile IoQEnumerator<T> _curEnumerator;
@@ -161,11 +161,11 @@ namespace zero.core.patterns.queue
Debug.Assert(idx >= 0);
//if (!IsAutoScaling) return Volatile.Read(ref _fastStorage[idx % _capacity]);
- if (!IsAutoScaling) return _fastStorage[idx % _capacity];
+ if (!IsAutoScaling) return (T)_fastStorage[idx % _capacity];
idx %= Capacity;
var i = IoMath.Log2(unchecked((ulong)idx + 1));
- return _storage[i][idx - ((1 << i) - 1)];
+ return (T)_storage[i][idx - ((1 << i) - 1)];
}
[MethodImpl(MethodImplOptions.AggressiveInlining)]
protected set
@@ -207,7 +207,7 @@ namespace zero.core.patterns.queue
if (_primedForScale == 1 && Head <= cap2 && Tail <= cap2 && (Tail > Head || Tail == Head && _count < cap2) && Interlocked.CompareExchange(ref _primedForScale, 2, 1) == 1 || force)
{
var hwm = 1 << (_virility + 1);
- _storage[_virility + 1] = new T[hwm];
+ _storage[_virility + 1] = new object[hwm];
Interlocked.Add(ref _hwm, hwm);
Interlocked.Increment(ref _virility);
Interlocked.Exchange(ref _primedForScale, 0);
@@ -226,7 +226,7 @@ namespace zero.core.patterns.queue
/// <param name="compare">The compare value</param>
/// <returns>The previous value</returns>
[MethodImpl(MethodImplOptions.AggressiveInlining)]
- private T CompareExchange(long idx, T value, T compare)
+ private object CompareExchange(long idx, object value, object compare)
{
if (!IsAutoScaling) return Interlocked.CompareExchange(ref _fastStorage[idx % _capacity], value, compare);
@@ -401,7 +401,7 @@ namespace zero.core.patterns.queue
T latch;
while ((head = Head) >= Tail || (latch = this[head]) == _sentinel || latch == null || head != Head ||
- (slot = CompareExchange(head, _sentinel, latch)) != latch)
+ (slot = (T)CompareExchange(head, _sentinel, latch)) != latch)
{
if (_count == 0 || Zeroed)
{
@@ -418,7 +418,7 @@ namespace zero.core.patterns.queue
this[head] = null;//restore the sentinel
Interlocked.Decrement(ref _count);
Interlocked.Increment(ref _head);
return true;
}
catch (Exception e)
When I apply this patch, I see exceptions like the following:
2022/05/27 13:30:43.597|ERROR[ IoZeroQ`1 ( 317)]: TryDequeue failed! System.Object ChkCastAny(Void*, System.Object) System.InvalidCastException: Unable to cast object of type 'System.Object' to type 'IoZNode[zero.core.patterns.bushings.IoSink`1[zero.core.feat.models.protobuffer.CcProtocMessage`2[Zero.Models.Protobuf.chroniton,zero.cocoon.models.batches.CcDiscoveryBatch]]]'.
at zero.core.patterns.queue.IoZeroQ`1.TryDequeue(T& slot) in C:\dev\dotnet\investigations\zero\zero.core\patterns\queue\IoZeroQ.cs:line 431
I've also noticed that there are a couple of CompareExchange usages that use the _sentinel as a source value. There might be a subtle race that causes the sentinel to leak out somehow and then get used as if it were the IoSink type.
And one more detail: in the dump I've taken at the crash, I can actually see the sentinel stored somewhere outside of the IoZeroQ:
HandleTable:
0000026170CF8238 (strong handle)
-> 000002619C9DDEA0 System.Object[]
-> 00000261A620CB10 System.Threading.Tasks.Task`1[[System.Threading.Tasks.Task, System.Private.CoreLib]]
-> 00000261A620CAD0 System.ValueTuple`6[[zero.core.patterns.misc.IoNanoprobe, zero.core],[System.Func`2[[zero.core.patterns.misc.IoNanoprobe, zero.core],[System.Threading.Tasks.ValueTask, System.Private.CoreLib]], System.Private.CoreLib],[zero.core.patterns.misc.IoNanoprobe, zero.core],[System.String, System.Private.CoreLib],[System.String, System.Private.CoreLib],[System.Int32, System.Private.CoreLib]]
-> 000002619865E528 System.Object
This looks like a static variable holding a Task over IoNanoprobe state, but containing the reference to the System.Object:
0:000> !do 00000261A620CAD0
Name: System.ValueTuple`6[[zero.core.patterns.misc.IoNanoprobe, zero.core],[System.Func`2[[zero.core.patterns.misc.IoNanoprobe, zero.core],[System.Threading.Tasks.ValueTask, System.Private.CoreLib]], System.Private.CoreLib],[zero.core.patterns.misc.IoNanoprobe, zero.core],[System.String, System.Private.CoreLib],[System.String, System.Private.CoreLib],[System.Int32, System.Private.CoreLib]]
MethodTable: 00007ff933600c60
EEClass: 00007ff932eefd48
Tracked Type: false
Size: 64(0x40) bytes
File: D:\test\dotnet\zero\zero.sync\bin\Release\net7.0\win-x64\publish\System.Private.CoreLib.dll
Fields:
MT Field Offset Type VT Attr Value Name
00007ff931159808 400071f 8 System.__Canon 0 instance 000002619865e528 Item1
00007ff931159808 4000720 10 System.__Canon 0 instance 000002617cc788e0 Item2
00007ff931159808 4000721 18 System.__Canon 0 instance 000002619865e528 Item3
00007ff931159808 4000722 20 System.__Canon 0 instance 0000026183601840 Item4
00007ff931159808 4000723 28 System.__Canon 0 instance 000002617cb15908 Item5
00007ff9313264b0 4000724 30 System.Int32 1 instance 302 Item6
I can see that setting the string to ZERO: ZeroManagedAsync: teardown happens in IoNanoprobe.DisposeAsync, invoked by IoNanoprobe.Dispose. So it looks like the code has ended up calling dispose on our object.
@jakobbotsch Sure thing! I was totally hacking there. I was hoping Unsafe.As could function like a reinterpret_cast<> on a pointer value where I never dereference the pointer. I need that sentinel, or I must pass one in.
Your solution of casting to object should work. I might have an internal bug that is masked by this sentinel; your cast errors are an indication of this. Let me try that patch.
I am a bit confused now. I can't get CAS to work on objects. It throws Message = "Attempted to access an element as a type incompatible with the array."
> I am a bit confused now. I can't get CAS to work on objects. It throws Message = "Attempted to access an element as a type incompatible with the array."
You probably ended up with array covariance by storing a T[] in _storage[i]. Make sure those inner arrays are also object[] and not T[].
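That exception is exactly what covariance produces: the runtime type-checks a covariant array when you take a ref to an element, which is what Interlocked.CompareExchange needs. A small standalone illustration:

```csharp
using System;
using System.Threading;

class CovarianceDemo
{
    static void Main()
    {
        // Covariance lets a string[] flow into an object[] variable...
        object[] storage = new string[2];
        storage[0] = "ok"; // fine: a string matches the real element type

        try
        {
            // ...but CompareExchange takes a ref to the element, and taking a
            // ref requires the array's element type to be exactly object.
            Interlocked.CompareExchange(ref storage[1], "still a string", null);
        }
        catch (ArrayTypeMismatchException e)
        {
            Console.WriteLine(e.GetType().Name); // ArrayTypeMismatchException
        }
    }
}
```

Note it throws even when the value being stored is a string: the ref itself is ill-typed, so allocating the inner arrays as object[] is the only fix.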
Oops! I missed an entire part of the patch. Your patch passes smoke tests; I don't get those cast errors. Let me run bigger tests.
I thought that fixed it. All my tests pass and everything runs very smoothly now; the liveness under load looks better too. So smoothly, in fact, that I end up back at the HandleNonSuccessAndDebuggerNotification spam again, and when I go look at that function in the runtime I've got nothing.
Everything just works, then suddenly, out of nowhere, the CPU flatlines with HandleNonSuccessAndDebuggerNotification spamming (which won't replicate in Debug mode). I get no errors in the logs, nothing. A complete instant lockup.
Let me see if I can check in progress.
Just an extra bit of detail that concludes how we got to the corruption. The problem was triggered by the ZeroManagedAsync function calling DisposeAsync on the sentinel. I've added this to IoNanoprobe.DisposeAsync:
public async ValueTask DisposeAsync(IIoNanite @from, string reason, [CallerFilePath] string filePath = null, [CallerMemberName] string methodName = null, [CallerLineNumber] int lineNumber = default)
{
if (GetType() == typeof(object))
{
Environment.FailFast("DisposeAsync called on System.Object!");
}
The call stack displayed by the FailFast showed that ZeroManagedAsync was the caller. It seemed a bit strange, as the caller appears to check whether the reference it calls DisposeAsync on implements IIoNanite:
if (item is IIoNanite ioNanite)
{
if (!ioNanite.Zeroed())
await ioNanite.DisposeAsync((IIoNanite)nanite, string.Empty).FastPath();
}
But I guess the compiler somehow eliminated the check, since the static type of the item local is known to be one that matches the condition (except that in our case it was the System.Object sentinel).
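A standalone sketch of how a statically known check can get folded away (hypothetical type names; the sketch never dereferences fields through the bogus reference, which would be undefined, it only compares references and reads the method table):

```csharp
using System;
using System.Runtime.CompilerServices;

sealed class Probe { }

class FoldedCheckDemo
{
    // The static type of p is already Probe, so `p is Probe` compiles down
    // to a simple null check; no runtime type test remains to fail.
    static bool LooksLikeProbe(Probe p) => p is Probe;

    static void Main()
    {
        // Unsafe.As performs no conversion: the reference still points at a
        // plain System.Object, only the static type claims it is a Probe.
        var fake = Unsafe.As<Probe>(new object());

        Console.WriteLine(LooksLikeProbe(fake)); // True: the folded check passes
        Console.WriteLine(fake.GetType());       // System.Object: the real type
    }
}
```

So a guard whose static type already satisfies the test gives no protection against a reference smuggled in with Unsafe.As; only a check against the actual runtime type (GetType, or a pattern match on an object-typed variable) would catch the sentinel.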
@janvorli Great stuff! Indeed, that is a bug. Dispose should not be called on the sentinel, but it surely must pass that check, otherwise Unsafe.As is not doing what it should. I must correct that somehow. I hope this is not going to be used as an example of why you don't put things like Unsafe.As into the framework. I love the fact that we have that kind of power in C#. Too bad this time it really was Unsafe.
For now I have removed the sentinel entirely anyway, as I was only using it to jam the CAS timing.
@tactical-drone I assume we can close this issue now, right?
@janvorli Thanks! Good luck!
Description
This started happening regularly and somewhat reproducibly since .NET 7 Preview 5, a couple of versions ago. My next-level scheduler & async/await optimizations were also taking effect, so I cannot be sure. It could just be caused by the increased load achieved, rather than something in the runtime.
zero.cocoon.models.CcDiscoveries+<ConsumeAsync>d__16.MoveNext() is the part that reads and parses delivered msgs from the UDP backend with protobuf and tries to verify signatures. The signature verification fails. The rate at which this function is called is high, hence eventually the error happens. So the error itself is very low probability, but the cluster makes it more likely.
Reproduction Steps
Clone & compile https://github.com/tactical-drone/zero and run it.
Cluster bootstrap tests regularly fail with that error (1/5) chance.
Expected behavior
Profit.
Actual behavior
Randomly, every 2nd or 3rd run, I get this error in RELEASE mode. I have never seen this error happen in DEBUG.
What is awesome is that it is very reproducible. Borg technology is designed to push the runtime to its limits.
Regression?
This worked in .NET 7 Preview 4.
Known Workarounds
None
Configuration
CPU: AMD 3900X, RAM: 64GB, 7.0.0-preview.5.22266.11
Other information
I have no idea. The cluster test needs 10GB of memory in RELEASE mode, 20GB in DEBUG.
PointDouble
Does a tonne of small byte-array allocations and does some basic math on them, littered with bit shifts.