dotnet/runtime

Fatal error. Internal CLR error. (0x80131506) #69653

Closed tactical-drone closed 2 years ago

tactical-drone commented 2 years ago

Description

 at Org.BouncyCastle.Math.EC.Rfc8032.Ed25519.PointDouble(PointAccum)
   at Org.BouncyCastle.Math.EC.Rfc8032.Ed25519.ScalarMultStrausVar(UInt32[], UInt32[], PointAffine, PointAccum)
   at Org.BouncyCastle.Math.EC.Rfc8032.Ed25519.ImplVerify(Byte[], Int32, Byte[], Int32, Byte[], Byte, Byte[], Int32, Int32)
   at zero.cocoon.models.CcDiscoveries+<ConsumeAsync>d__16.MoveNext()

This started happening regularly and somewhat reproducibly since .NET 7 Preview 5, a couple of versions ago. My next-level scheduler & async/await optimizations were also taking effect at the time, so I cannot be sure. It could just be caused by the increased load achieved, rather than by something in the runtime.

zero.cocoon.models.CcDiscoveries+<ConsumeAsync>d__16.MoveNext() is the part that reads and parses messages delivered from the UDP backend with protobuf and tries to verify their signatures. The signature verification fails. The rate at which this function is called is high, so eventually the error happens: the error itself is very low probability, but the cluster makes it more likely.
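
For context, the verification at the bottom of that path is a plain BouncyCastle call; a minimal sketch of the hot path (sig, publicKey, and message are assumed buffer names):

using Org.BouncyCastle.Math.EC.Rfc8032;

// Called once per parsed UDP message; at this call rate, even a very
// low-probability runtime fault eventually lands inside Verify.
bool ok = Ed25519.Verify(sig, 0, publicKey, 0, message, 0, message.Length);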

Reproduction Steps

Clone & compile https://github.com/tactical-drone/zero and run it.

Cluster bootstrap tests regularly fail with that error (roughly a 1-in-5 chance).

Expected behavior

Profit.

Actual behavior

Randomly, every 2nd or 3rd run, I get this error in RELEASE mode. I have never seen this error happen in DEBUG.

What is awesome is that it is very reproducible. Borg technology is designed to push the runtime to the limits.

Regression?

This worked in .NET 7 Preview 4.

Known Workarounds

None

Configuration

CPU: AMD 3900X, RAM: 64 GB, runtime: 7.0.0-preview.5.22266.11

Other information

I have no idea. The cluster test needs 10 GB of memory in RELEASE mode, 20 GB in DEBUG.

PointDouble does a ton of small byte-array allocations and performs some basic math on them, littered with bit shifts.

dotnet-issue-labeler[bot] commented 2 years ago

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

jakobbotsch commented 2 years ago

I can reproduce this. It looks like we get an AV and then hit a stack overflow in EH while trying to handle that AV. The top frames of the call stack, showing the repeating segment, are:

0:034> kn30
 # Child-SP          RetAddr               Call Site
00 000000da`233c3e80 00007ff9`c5911dc8     coreclr!FixupDispatcherContext+0x3c [C:\dev\dotnet\runtime\src\coreclr\vm\exceptionhandling.cpp @ 5514] 
01 (Inline Function) --------`--------     coreclr!FixupDispatcherContext+0x10 [C:\dev\dotnet\runtime\src\coreclr\vm\exceptionhandling.cpp @ 5642] 
02 000000da`233c43c0 00007ffa`155b209f     coreclr!FixContextHandler+0x44 [C:\dev\dotnet\runtime\src\coreclr\vm\exceptionhandling.cpp @ 5768] 
03 000000da`233c4400 00007ffa`15561454     ntdll!RtlpExecuteHandlerForException+0xf
04 000000da`233c4430 00007ffa`155b0bce     ntdll!RtlDispatchException+0x244
05 000000da`233c4b40 00007ffa`132c4fd9     ntdll!KiUserExceptionDispatch+0x2e
06 000000da`233c5250 00007ff9`c577d799     KERNELBASE!RaiseException+0x69
07 000000da`233c5330 00007ff9`c577d7cb     coreclr!NakedThrowHelper2+0x9
08 000000da`233c5360 00007ff9`c577d7d5     coreclr!NakedThrowHelper_RspAligned+0x1e
09 000000da`233c5888 00007ff9`66927030     coreclr!NakedThrowHelper_FixRsp+0x5
0a 000000da`233c5890 00007ff9`b8f9f5dc     0x00007ff9`66927030
0b 000000da`233c58c0 00007ff9`c577c803     System_Private_CoreLib+0x1ef5dc
0c 000000da`233c5930 00007ff9`c56e11cb     coreclr!CallDescrWorkerInternal+0x83
0d (Inline Function) --------`--------     coreclr!CallDescrWorkerWithHandler+0x56 [C:\dev\dotnet\runtime\src\coreclr\vm\callhelpers.cpp @ 67] 
0e 000000da`233c5970 00007ff9`c5818c3b     coreclr!MethodDescCallSite::CallTargetWorker+0x247 [C:\dev\dotnet\runtime\src\coreclr\vm\callhelpers.cpp @ 570] 
0f (Inline Function) --------`--------     coreclr!MethodDescCallSite::Call_RetSTRINGREF+0x17 [C:\dev\dotnet\runtime\src\coreclr\vm\callhelpers.h @ 462] 
10 000000da`233c5aa0 00007ff9`c5818c7a     coreclr!GetExceptionMessage+0x9b [C:\dev\dotnet\runtime\src\coreclr\vm\excep.cpp @ 277] 
11 000000da`233c5bb0 00007ff9`c57b8bff     coreclr!GetExceptionMessage+0xe [C:\dev\dotnet\runtime\src\coreclr\vm\excep.cpp @ 248] 
12 000000da`233c5be0 00007ffa`155b209f     coreclr!ProcessCLRException+0x172cff [C:\dev\dotnet\runtime\src\coreclr\vm\exceptionhandling.cpp @ 1042] 
13 000000da`233c5e00 00007ffa`15561454     ntdll!RtlpExecuteHandlerForException+0xf
14 000000da`233c5e30 00007ffa`155b0bce     ntdll!RtlDispatchException+0x244
15 000000da`233c6540 00007ffa`132c4fd9     ntdll!KiUserExceptionDispatch+0x2e
16 000000da`233c6c50 00007ff9`c577d799     KERNELBASE!RaiseException+0x69
17 000000da`233c6d30 00007ff9`c577d7cb     coreclr!NakedThrowHelper2+0x9
18 000000da`233c6d60 00007ff9`c577d7d5     coreclr!NakedThrowHelper_RspAligned+0x1e
19 000000da`233c7288 00007ff9`66927030     coreclr!NakedThrowHelper_FixRsp+0x5
1a 000000da`233c7290 00007ff9`b8f9f5dc     0x00007ff9`66927030
1b 000000da`233c72c0 00007ff9`c577c803     System_Private_CoreLib+0x1ef5dc
1c 000000da`233c7330 00007ff9`c56e11cb     coreclr!CallDescrWorkerInternal+0x83
1d (Inline Function) --------`--------     coreclr!CallDescrWorkerWithHandler+0x56 [C:\dev\dotnet\runtime\src\coreclr\vm\callhelpers.cpp @ 67] 
1e 000000da`233c7370 00007ff9`c5818c3b     coreclr!MethodDescCallSite::CallTargetWorker+0x247 [C:\dev\dotnet\runtime\src\coreclr\vm\callhelpers.cpp @ 570] 
1f (Inline Function) --------`--------     coreclr!MethodDescCallSite::Call_RetSTRINGREF+0x17 [C:\dev\dotnet\runtime\src\coreclr\vm\callhelpers.h @ 462] 
20 000000da`233c74a0 00007ff9`c5818c7a     coreclr!GetExceptionMessage+0x9b [C:\dev\dotnet\runtime\src\coreclr\vm\excep.cpp @ 277] 
21 000000da`233c75b0 00007ff9`c57b8bff     coreclr!GetExceptionMessage+0xe [C:\dev\dotnet\runtime\src\coreclr\vm\excep.cpp @ 248] 
22 000000da`233c75e0 00007ffa`155b209f     coreclr!ProcessCLRException+0x172cff [C:\dev\dotnet\runtime\src\coreclr\vm\exceptionhandling.cpp @ 1042] 
23 000000da`233c7800 00007ffa`15561454     ntdll!RtlpExecuteHandlerForException+0xf
24 000000da`233c7830 00007ffa`155b0bce     ntdll!RtlDispatchException+0x244
25 000000da`233c7f40 00007ffa`132c4fd9     ntdll!KiUserExceptionDispatch+0x2e
26 000000da`233c8650 00007ff9`c577d799     KERNELBASE!RaiseException+0x69
27 000000da`233c8730 00007ff9`c577d7cb     coreclr!NakedThrowHelper2+0x9
28 000000da`233c8760 00007ff9`c577d7d5     coreclr!NakedThrowHelper_RspAligned+0x1e
29 000000da`233c8c88 00007ff9`66927030     coreclr!NakedThrowHelper_FixRsp+0x5
2a 000000da`233c8c90 00007ff9`b8f9f5dc     0x00007ff9`66927030
2b 000000da`233c8cc0 00007ff9`c577c803     System_Private_CoreLib+0x1ef5dc
2c 000000da`233c8d30 00007ff9`c56e11cb     coreclr!CallDescrWorkerInternal+0x83
2d (Inline Function) --------`--------     coreclr!CallDescrWorkerWithHandler+0x56 [C:\dev\dotnet\runtime\src\coreclr\vm\callhelpers.cpp @ 67] 
2e 000000da`233c8d70 00007ff9`c5818c3b     coreclr!MethodDescCallSite::CallTargetWorker+0x247 [C:\dev\dotnet\runtime\src\coreclr\vm\callhelpers.cpp @ 570] 
2f (Inline Function) --------`--------     coreclr!MethodDescCallSite::Call_RetSTRINGREF+0x17 [C:\dev\dotnet\runtime\src\coreclr\vm\callhelpers.h @ 462] 

The actual crash itself happens inside RuntimeType.get_Cache in the inlined call to GCHandle.InternalGet. https://github.com/dotnet/runtime/blob/621cd59436cb29cab4b1162409ae0947c4bd780d/src/coreclr/System.Private.CoreLib/src/System/RuntimeType.CoreCLR.cs#L2401-L2417

Note the shape of GCHandle.InternalGet is the following: https://github.com/dotnet/runtime/blob/621cd59436cb29cab4b1162409ae0947c4bd780d/src/coreclr/System.Private.CoreLib/src/System/Runtime/InteropServices/GCHandle.CoreCLR.cs#L16-L23

This made me strongly suspect it is related to #68739. However, we actually crash during the inlined InternalGet call when dereferencing the field:

00007ff9`66927020 56              push    rsi
00007ff9`66927021 4883ec20        sub     rsp,20h
00007ff9`66927025 8bf2            mov     esi,edx
00007ff9`66927027 488b4110        mov     rax,qword ptr [rcx+10h]
00007ff9`6692702b 4885c0          test    rax,rax
00007ff9`6692702e 740a            je      00007ff9`6692703a
>>> 00007ff9`66927030 488b00          mov     rax,qword ptr [rax]
00007ff9`66927033 4885c0          test    rax,rax
00007ff9`66927036 7402            je      00007ff9`6692703a
00007ff9`66927038 eb06            jmp     00007ff9`66927040
00007ff9`6692703a ff1560082aff    call    qword ptr [00007ff9`65bc78a0] (System.RuntimeType.InitializeCache(), mdToken: 0000000006000726)
00007ff9`66927040 488bc8          mov     rcx,rax
00007ff9`66927043 8bd6            mov     edx,esi
00007ff9`66927045 3909            cmp     dword ptr [rcx],ecx
00007ff9`66927047 ff15c3a765ff    call    qword ptr [00007ff9`65f81810] (System.RuntimeType+RuntimeTypeCache.GetName(System.TypeNameKind), mdToken: 00000000060007C2)
00007ff9`6692704d 90              nop
00007ff9`6692704e 4883c420        add     rsp,20h
00007ff9`66927052 5e              pop     rsi
00007ff9`66927053 c3              ret
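
Read back into C#-like pseudocode, the inlined sequence above boils down to roughly this (a paraphrase of the codegen, not the actual CoreLib source):

// rcx = the RuntimeType 'this'; [rcx+10h] is the m_cache IntPtr handle
// if (m_cache != IntPtr.Zero)
// {
//     object cache = *(object*)m_cache;   // mov rax,[rax]  <- the faulting load:
//                                         // here m_cache holds 0x96997, not a valid handle
//     if (cache != null) goto haveCache;
// }
// cache = InitializeCache();
// haveCache:
// return cache.GetName(kind);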

rcx is the following object:

0:034> !DumpObj 1e1`2fce6100
Name:        System.RuntimeType
MethodTable: 00007ff965bb85d8
EEClass:     00007ff965bab748
Tracked Type: false
Size:        40(0x28) bytes
Type Name:   System.AccessViolationException
Type MT:     00007ff966fbdaa0
File:        C:\dev\dotnet\investigations\zero\zero.sync\bin\Release\net7.0\win-x64\publish\System.Private.CoreLib.dll
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
00007ff9664cbab0  4000303      100 ...Reflection.Binder  0   static 000001e11a577120 s_defaultBinder
00007ff965bbe630  4000304      9aa          System.Char  1   static               2e Delimiter
00007ff965e9a490  4000305      108        System.Type[]  0   static 000001e11a42bc48 EmptyTypes
00007ff965a99280  4000306      110        System.Object  0   static 000001e11a42bc60 Missing
00007ff965f90fb8  4000307      118 ...tion.MemberFilter  0   static 000001e11a42bc78 FilterAttribute
00007ff965f90fb8  4000308      120 ...tion.MemberFilter  0   static 000001e11a42bcb8 FilterName
00007ff965f90fb8  4000309      128 ...tion.MemberFilter  0   static 000001e11a42bd10 FilterNameIgnoreCase
00007ff965a99280  40002a8        8        System.Object  0 instance 0000000000000000 m_keepalive
00007ff965c294b0  40002a9       10        System.IntPtr  1 instance 0000000000096997 m_cache  <- not a valid pointer
00007ff965c294b0  40002aa       18        System.IntPtr  1 instance 00007FF966FBDAA0 m_handle
00007ff965bb85d8  40002ab       c8   System.RuntimeType  0   static 000001e11a418948 ValueType
00007ff965bb85d8  40002ac       d0   System.RuntimeType  0   static 000001e11a418920 ObjectType
00007ff965bb85d8  40002ad       d8   System.RuntimeType  0   static 000001e11a400580 StringType
0000000000000000  40002af       e0  System.OleAutBinder  0   static 0000000000000000 s_ForwardCallBinder

Notice that m_cache looks corrupted. So now I am not totally sure. Still, I will look closer at the GC info here to make sure. (edit) The GC info looks fine to me.

jakobbotsch commented 2 years ago

When I run this under a debugger I see quite a few different crashes that all point to some heap corruption. It also seems to crash much more quickly than when not run under a debugger.

jakobbotsch commented 2 years ago

It looks a lot like allocations are being returned without being zeroed. In one example run we crashed immediately when trying to throw an exception:

0:000> !clrstack -f
OS Thread Id: 0xf5c0 (0)
        Child SP               IP Call Site
000000213D37CE70 00007FFA132C4FD9 KERNELBASE!RaiseException + 105
000000213D37CF50 00007FF9D84E9BF6 coreclr!RaiseTheExceptionInternalOnly + 678 at C:\dev\dotnet\runtime\src\coreclr\vm\excep.cpp:2805
000000213D37D080 00007FF9D85BF229 coreclr!IL_Throw + 185 at C:\dev\dotnet\runtime\src\coreclr\vm\jithelpers.cpp:4014
000000213D37D0E8                  [HelperMethodFrame: 000000213d37d0e8] 
000000213D37D1E0 00007ff9793f4994 BouncyCastle.Crypto.dll!Org.BouncyCastle.Math.EC.Rfc8032.Ed25519.ScalarMultBaseEncoded(Byte[], Byte[], Int32) + 148
000000213D37D230 00007ff97940bd99 BouncyCastle.Crypto.dll!Org.BouncyCastle.Math.EC.Rfc8032.Ed25519.ImplSign(Byte[], Int32, Byte[], Byte, Byte[], Int32, Int32, Byte[], Int32) + 329
000000213D37D2E0 00007ff97940bc29 BouncyCastle.Crypto.dll!Org.BouncyCastle.Math.EC.Rfc8032.Ed25519.Sign(Byte[], Int32, Byte[], Int32, Int32, Byte[], Int32) + 105

The managed function that creates the exception looks like:

0:000> !U /d 00007ff9793f4994
Normal JIT generated code
Org.BouncyCastle.Math.EC.Rfc8032.Ed25519.ScalarMultBaseEncoded(Byte[], Byte[], Int32)
ilAddr is 000001B0A42A6A28 pImport is 000001F8B79FF540
Begin 00007FF9793F4900, size 9b
00007ff9`793f4900 55              push    rbp
00007ff9`793f4901 4883ec40        sub     rsp,40h
00007ff9`793f4905 488d6c2440      lea     rbp,[rsp+40h]
00007ff9`793f490a c5d857e4        vxorps  xmm4,xmm4,xmm4
00007ff9`793f490e c5fa7f65e8      vmovdqu xmmword ptr [rbp-18h],xmm4
00007ff9`793f4913 33c0            xor     eax,eax
00007ff9`793f4915 488945f8        mov     qword ptr [rbp-8],rax
00007ff9`793f4919 48894d10        mov     qword ptr [rbp+10h],rcx
00007ff9`793f491d 48895518        mov     qword ptr [rbp+18h],rdx
00007ff9`793f4921 44894520        mov     dword ptr [rbp+20h],r8d
00007ff9`793f4925 48b9a0c44779f97f0000 mov rcx,7FF97947C4A0h (MT: Org.BouncyCastle.Math.EC.Rfc8032.Ed25519+PointAccum)
00007ff9`793f492f e85c861e5f      call    coreclr!JIT_TrialAllocSFastMP_InlineGetThread (00007ff9`d85dcf90)
00007ff9`793f4934 488945f0        mov     qword ptr [rbp-10h],rax
00007ff9`793f4938 488b4df0        mov     rcx,qword ptr [rbp-10h]
00007ff9`793f493c ff15de6d0900    call    qword ptr [00007ff9`7948b720] (Org.BouncyCastle.Math.EC.Rfc8032.Ed25519+PointAccum..ctor(), mdToken: 0000000006004504)
00007ff9`793f4942 488b4df0        mov     rcx,qword ptr [rbp-10h]
00007ff9`793f4946 48894df8        mov     qword ptr [rbp-8],rcx
00007ff9`793f494a 488b4d10        mov     rcx,qword ptr [rbp+10h]
00007ff9`793f494e 488b55f8        mov     rdx,qword ptr [rbp-8]
00007ff9`793f4952 ff15a06bc9ff    call    qword ptr [00007ff9`7908b4f8] (Org.BouncyCastle.Math.EC.Rfc8032.Ed25519.ScalarMultBase(Byte[], PointAccum), mdToken: 0000000006000BE1)
00007ff9`793f4958 488b4df8        mov     rcx,qword ptr [rbp-8]
00007ff9`793f495c 488b5518        mov     rdx,qword ptr [rbp+18h]
00007ff9`793f4960 448b4520        mov     r8d,dword ptr [rbp+20h]
00007ff9`793f4964 ff154668c9ff    call    qword ptr [00007ff9`7908b1b0] (Org.BouncyCastle.Math.EC.Rfc8032.Ed25519.EncodePoint(PointAccum, Byte[], Int32), mdToken: 0000000006000BBE)
00007ff9`793f496a 85c0            test    eax,eax
00007ff9`793f496c 7526            jne     00007ff9`793f4994
00007ff9`793f496e 48b9c0a5cf78f97f0000 mov rcx,7FF978CFA5C0h (MT: System.InvalidOperationException)
00007ff9`793f4978 e813861e5f      call    coreclr!JIT_TrialAllocSFastMP_InlineGetThread (00007ff9`d85dcf90)
00007ff9`793f497d 488945e8        mov     qword ptr [rbp-18h],rax
00007ff9`793f4981 488b4de8        mov     rcx,qword ptr [rbp-18h]
00007ff9`793f4985 ff1525d490ff    call    qword ptr [00007ff9`78d01db0] (System.InvalidOperationException..ctor(), mdToken: 000000000600156B)
00007ff9`793f498b 488b4de8        mov     rcx,qword ptr [rbp-18h]
00007ff9`793f498f e8dca71c5f      call    coreclr!IL_Throw (00007ff9`d85bf170)
>>> 00007ff9`793f4994 90              nop
00007ff9`793f4995 4883c440        add     rsp,40h
00007ff9`793f4999 5d              pop     rbp
00007ff9`793f499a c3              ret

And the object does not look to have been zeroed:

0:000> !dumpobj 170`1d1d3708
Name:        System.InvalidOperationException
MethodTable: 00007ff978cfa5c0
EEClass:     00007ff978ce5ac0
Tracked Type: false
Size:        128(0x80) bytes
File:        C:\dev\dotnet\investigations\zero\zero.sync\bin\Release\net7.0\win-x64\publish\System.Private.CoreLib.dll
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
00007ff978ddc328  400023f        8 ...ection.MethodBase  0 instance dddddddddddddddd _exceptionMethod
00007ff978aa2ce0  4000240       10        System.String  0 instance 000001701d1d37d8 _message
00007ff978aa6af8  4000241       18 ...tions.IDictionary  0 instance 0000000000000000 _data
00007ff978aa4688  4000242       20     System.Exception  0 instance 0000000000000000 _innerException
00007ff978aa2ce0  4000243       28        System.String  0 instance dddddddddddddddd _helpURL
00007ff978cf2380  4000244       30        System.Byte[]  0 instance 000001701d1d3a08 _stackTrace
00007ff978cf2380  4000245       38        System.Byte[]  0 instance 0000000000000000 _watsonBuckets
00007ff978aa2ce0  4000246       40        System.String  0 instance 0000000000000000 _stackTraceString
00007ff978aa2ce0  4000247       48        System.String  0 instance dddddddddddddddd _remoteStackTraceString
00007ff978a1a808  4000248       50      System.Object[]  0 instance 0000000000000000 _dynamicMethods
00007ff978aa2ce0  4000249       58        System.String  0 instance 0000000000000000 _source
00007ff978a8a828  400024a       60       System.UIntPtr  1 instance 00007FF9793F4994 _ipForWatsonBuckets
00007ff978a894b0  400024b       68        System.IntPtr  1 instance DDDDDDDDDDDDDDDD _xptrs
00007ff978a49e10  400024c       70         System.Int32  1 instance       -532462766 _xcode
00007ff978a49e10  400024d       74         System.Int32  1 instance      -2146233079 _HResult
0:000> !verifyheap
object 000001701d1d3708: bad member DDDDDDDDDDDDDDDD at 000001701D1D3710
Last good object: 000001701D1D36C8.
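
The CLR guarantees that fresh allocations are zero-initialized, which is what makes those dddddddddddddddd values diagnostic: they are leftover fill, not state any constructor wrote. A crude managed-side probe of that guarantee would look something like this (illustrative sketch only, not the tooling used here):

class AllocZeroCheck
{
    static void Main()
    {
        // Hammer the allocator with small, short-lived arrays, much like the
        // Ed25519 hot path does, and assert that every new array arrives zeroed.
        for (long i = 0; ; i++)
        {
            var a = new byte[32];
            foreach (var b in a)
                if (b != 0)
                    throw new System.InvalidOperationException($"non-zeroed allocation at iteration {i}");
        }
    }
}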

cc @dotnet/gc

radical commented 2 years ago

Could this crash be the same? See this build. The PR doesn't have any related changes.

  Fatal error. Internal CLR error. (0x80131506)
     at System.Collections.Generic.IEnumerable`1[[System.__Canon, System.Private.CoreLib, Version=7.0.0.0, Culture=neutral, PublicKeyToken=7cec85d7bea7798e]].GetEnumerator()
     at Mono.Linker.Steps.MarkStep.ProcessVirtualMethod(Mono.Cecil.MethodDefinition)
     at Mono.Linker.Steps.MarkStep.ProcessVirtualMethods()
     at Mono.Linker.Steps.MarkStep.ProcessPrimaryQueue()
     at Mono.Linker.Steps.MarkStep.Process()
     at Mono.Linker.Steps.MarkStep.Process(Mono.Linker.LinkContext)
     at Mono.Linker.Pipeline.ProcessStep(Mono.Linker.LinkContext, Mono.Linker.Steps.IStep)
     at Mono.Linker.Pipeline.Process(Mono.Linker.LinkContext)
     at Mono.Linker.Driver.Run(Mono.Linker.ILogger, Boolean)
     at Mono.Linker.Driver.Main(System.String[])
tactical-drone commented 2 years ago

And the object does not look to have been zeroed: […]

I hacked a part of the code that hooks into the scheduler & thread pool. Specifically, this line:

It contains a primed InvalidOperationException, ready to be used over and over. I just did some research, and what I did there seems illegal. Let me take it out and see what happens. So far so good.
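
A minimal sketch of the pattern in question (type and member names hypothetical): one pre-built exception shared and re-thrown across threads, versus constructing a fresh instance per throw.

class ValueTaskSourceSketch
{
    // Anti-pattern: a "primed" exception reused over and over. Every throw
    // mutates the shared instance's internal state (_stackTrace,
    // _ipForWatsonBuckets, ...), so concurrent throwers race on one object.
    private static readonly InvalidOperationException s_primed =
        new("multiple continuations registered");

    void FailFastPath() => throw s_primed;

    // Safer: pay for a fresh allocation on the (already exceptional) path.
    void FailSafely() =>
        throw new InvalidOperationException("multiple continuations registered");
}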

Thanks for that!

mangod9 commented 2 years ago

So initializing the exception as a static in a struct is causing it to be not initialized correctly?

tactical-drone commented 2 years ago

So initializing the exception as a static in a struct is causing it to be not initialized correctly?

No. I think throwing the same instance is the problem. Since removing it, the error has gone away. Also, the error started around the time I injected that code. I can't tell for sure, because those exceptions were triggering at different rates at that time.

jakobbotsch commented 2 years ago

@tactical-drone @mangod9 The codegen above indicates that the "failing" code is constructing and immediately throwing an exception. It should not be possible for such an object to contain a field with an invalid GC ref. I still think this is a GC or runtime bug, but it might be related to something EH ends up doing because of that same exception instance.

tactical-drone commented 2 years ago

Yea, I am taking a wild guess here. If it is not the exception then great!

mangod9 commented 2 years ago

@janvorli as FYI

jakobbotsch commented 2 years ago

@dotnet/gc when I uncomment the verify_mem_cleared call at the end of gc_heap::adjust_limit_clr (thanks @cshung) I hit the assertion under the debugger very quickly:

0:043> k
 # Child-SP          RetAddr               Call Site
00 000000e5`ec57a6f8 00007ff9`45d3cad1     KERNELBASE!wil::details::DebugBreak+0x2
01 (Inline Function) --------`--------     coreclr!WKS::FATAL_GC_ERROR+0x5 [C:\dev\dotnet\runtime3\src\coreclr\gc\gcpriv.h @ 27] 
02 (Inline Function) --------`--------     coreclr!WKS::verify_mem_cleared+0x4a [C:\dev\dotnet\runtime3\src\coreclr\gc\gc.cpp @ 14501] 
03 000000e5`ec57a700 00007ff9`45d66705     coreclr!WKS::gc_heap::adjust_limit_clr+0x591 [C:\dev\dotnet\runtime3\src\coreclr\gc\gc.cpp @ 15491] 
04 (Inline Function) --------`--------     coreclr!WKS::gc_heap::a_fit_free_list_p+0x3fd [C:\dev\dotnet\runtime3\src\coreclr\gc\gc.cpp @ 15898] 
05 000000e5`ec57a780 00007ff9`45d3ee8b     coreclr!WKS::gc_heap::soh_try_fit+0x435 [C:\dev\dotnet\runtime3\src\coreclr\gc\gc.cpp @ 16405] 
06 000000e5`ec57a820 00007ff9`45d68bee     coreclr!WKS::gc_heap::allocate_soh+0x28b [C:\dev\dotnet\runtime3\src\coreclr\gc\gc.cpp @ 16537] 
07 000000e5`ec57a8c0 00007ff9`45d36e21     coreclr!WKS::gc_heap::try_allocate_more_space+0x33e [C:\dev\dotnet\runtime3\src\coreclr\gc\gc.cpp @ 17489] 
08 (Inline Function) --------`--------     coreclr!WKS::gc_heap::allocate_more_space+0x11 [C:\dev\dotnet\runtime3\src\coreclr\gc\gc.cpp @ 17961] 
09 (Inline Function) --------`--------     coreclr!WKS::gc_heap::allocate+0x60 [C:\dev\dotnet\runtime3\src\coreclr\gc\gc.cpp @ 17992] 
0a 000000e5`ec57a940 00007ff9`459d1d49     coreclr!WKS::GCHeap::Alloc+0x301 [C:\dev\dotnet\runtime3\src\coreclr\gc\gc.cpp @ 44969] 
0b 000000e5`ec57aa30 00007ff9`459d3fef     coreclr!Alloc+0x1e9 [C:\dev\dotnet\runtime3\src\coreclr\vm\gchelpers.cpp @ 226] 
0c 000000e5`ec57ab20 00007ff9`45a365cc     coreclr!AllocateSzArray+0x1ef [C:\dev\dotnet\runtime3\src\coreclr\vm\gchelpers.cpp @ 478] 
0d 000000e5`ec57ac40 00007ff8`e8688e35     coreclr!JIT_NewArr1+0x2bc [C:\dev\dotnet\runtime3\src\coreclr\vm\jithelpers.cpp @ 2625] 
0:043> !clrstack -f
OS Thread Id: 0xf5f4 (43)
        Child SP               IP Call Site
000000E5EC57A6F8 00007FFA1335D532 KERNELBASE!wil::details::DebugBreak + 2
000000E5EC57A700 00007FF945D3CAD1 coreclr!WKS::gc_heap::adjust_limit_clr + 1425 at C:\dev\dotnet\runtime3\src\coreclr\gc\gc.cpp:15491
000000E5EC57A780 00007FF945D66705 coreclr!WKS::gc_heap::soh_try_fit + 1077 at C:\dev\dotnet\runtime3\src\coreclr\gc\gc.cpp:16405
000000E5EC57A820 00007FF945D3EE8B coreclr!WKS::gc_heap::allocate_soh + 651 at C:\dev\dotnet\runtime3\src\coreclr\gc\gc.cpp:16537
000000E5EC57A8C0 00007FF945D68BEE coreclr!WKS::gc_heap::try_allocate_more_space + 830 at C:\dev\dotnet\runtime3\src\coreclr\gc\gc.cpp:17489
000000E5EC57A940 00007FF945D36E21 coreclr!WKS::GCHeap::Alloc + 769 at C:\dev\dotnet\runtime3\src\coreclr\gc\gc.cpp:44969
000000E5EC57AA30 00007FF9459D1D49 coreclr!Alloc + 489 at C:\dev\dotnet\runtime3\src\coreclr\vm\gchelpers.cpp:226
000000E5EC57AB20 00007FF9459D3FEF coreclr!AllocateSzArray + 495 at C:\dev\dotnet\runtime3\src\coreclr\vm\gchelpers.cpp:478
000000E5EC57AC40 00007FF945A365CC coreclr!JIT_NewArr1 + 700 at C:\dev\dotnet\runtime3\src\coreclr\vm\jithelpers.cpp:2625
000000E5EC57AD88                  [HelperMethodFrame: 000000e5ec57ad88] 
000000E5EC57AEB0 00007ff8e8688e35 BouncyCastle.Crypto.dll!Org.BouncyCastle.Math.EC.Rfc8032.Ed25519.PointDouble(PointAccum) + 85
000000E5EC57AF20 00007ff8e862cf2d BouncyCastle.Crypto.dll!Org.BouncyCastle.Math.EC.Rfc8032.Ed25519.ScalarMultStrausVar(UInt32[], UInt32[], PointAffine, PointAccum) + 461
000000E5EC57AFE0 00007ff8e862bfad BouncyCastle.Crypto.dll!Org.BouncyCastle.Math.EC.Rfc8032.Ed25519.ImplVerify(Byte[], Int32, Byte[], Int32, Byte[], Byte, Byte[], Int32, Int32) + 653
000000E5EC57B090 00007ff8e862bcf9 BouncyCastle.Crypto.dll!Org.BouncyCastle.Math.EC.Rfc8032.Ed25519.Verify(Byte[], Int32, Byte[], Int32, Byte[], Int32, Int32) + 105
000000E5EC57B100 00007ff8e862bc75 zero.cocoon.dll!zero.cocoon.identity.CcDesignation.Verify(Byte[], Int32, Int32, Byte[], Int32, Byte[], Int32) + 69 [C:\dev\dotnet\investigations\zero\zero.cocoon\identity\CcDesignation.cs @ 93]
000000E5EC57B150 00007ff8e862587c zero.cocoon.dll!zero.cocoon.models.CcDiscoveries+<ConsumeAsync>d__16.MoveNext() + 4908 [C:\dev\dotnet\investigations\zero\zero.cocoon\models\CcDiscoveries.cs @ 288]
000000E5EC57BB40 00007ff8e86244b6 System.Private.CoreLib.dll!System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[[zero.cocoon.models.CcDiscoveries+<ConsumeAsync>d__16, zero.cocoon]](<ConsumeAsync>d__16 ByRef) + 102
000000E5EC57BBA0 00007ff8e862442c System.Private.CoreLib.dll!System.Runtime.CompilerServices.AsyncValueTaskMethodBuilder`1[[zero.core.patterns.bushings.contracts.IoJobMeta+JobState, zero.core]].Start[[zero.cocoon.models.CcDiscoveries+<ConsumeAsync>d__16, zero.cocoon]](<ConsumeAsync>d__16 ByRef) + 28
000000E5EC57BBD0 00007ff8e86243cd zero.cocoon.dll!zero.cocoon.models.CcDiscoveries.ConsumeAsync() + 125

Note that when cloning the repro case, you may want to use an older commit, as the change discussed above seems to have an effect on the issue. I reproduced it on https://github.com/tactical-drone/zero/commit/6218976a9ed1d22fec5075417c8eda2a5618343c.

ghost commented 2 years ago

Tagging subscribers to this area: @dotnet/gc See info in area-owners.md if you want to be subscribed.

Author: tactical-drone
Assignees: -
Labels: `area-GC-coreclr`, `area-VM-coreclr`
Milestone: 7.0.0
jkotas commented 2 years ago

If the problem goes away after deleting the exception singleton (https://github.com/tactical-drone/zero/blob/760d42656136c29e0e219c757e7558ddbe18385c/zero.core/patterns/semaphore/core/IoManualResetValueTaskSourceCore.cs#L20), it is most likely a race condition in how the stack trace is saved into the exception.

Are you able to see the same InvalidOperationException being thrown and caught on multiple threads in parallel before the crash?

janvorli commented 2 years ago

@jkotas based on @cshung's advice, @jakobbotsch has enabled a piece of code in the GC that verifies that the memory given out for the allocation context is zeroed, and the assert in that check is firing. So it seems to indicate that the problem of uninitialized exception object fields is in the GC. While the stack trace would likely get garbled if that exception instance were thrown from multiple threads, it should not influence fields like _exceptionMethod or _helpURL, I believe.

tactical-drone commented 2 years ago

Keep in mind that the exception "singleton" was on a generic struct, so there are multiple of them (see the sketch below). But the ones that were thrown in this case were only of the <bool> kind. The others never threw.

@jkotas Yes, they might throw in parallel, and inside places like CompilerServices.AsyncTaskMethodBuilder.
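
For readers puzzled by "multiple of them": a static field on a generic type exists once per constructed type, so every instantiation carries its own primed exception. A sketch (type and field names hypothetical):

struct Core<T>
{
    // One instance per closed generic type: Core<bool>.Primed and
    // Core<int>.Primed are distinct objects.
    public static readonly InvalidOperationException Primed = new("primed");
}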

jkotas commented 2 years ago

So it seems to indicate that the problem of uninitialized exception object fields is in the GC

Or it can indicate a buffer overrun or an incorrectly protected OBJECTREF in the VM.

jkotas commented 2 years ago

Clone & compile https://github.com/tactical-drone/zero and run it.

What is the exact command to run the project?

jakobbotsch commented 2 years ago

What is the exact command to run the project?

I published the zero.sync project self-contained, copied the checked bits in there, and just ran zero.sync.exe.

tactical-drone commented 2 years ago

On the latest revision I don't see those errors happening anymore. There are still occasional strange crashes, this time something like Fatal error. System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt. (But this one is rare.)

But to be fair, I do stress the runtime on purpose with hybrid attachments to core parts of the runtime that might be causing the problems. If you press loadTest in the prompt, each drone will send any new state it sees to its neighbors as fast as possible, load-balanced using the .NET scheduler & the latest async/await tech. Pretty impressive what .NET is capable of these days.

When you are finished and want to perform the hardcore test that .NET fails every time, enter zero and wait for the teardown to complete. Then mash gc a couple of times and verify that the memory usage went from 8 GB back down to (currently) 400 MB. This 400 MB used to be 1.9 GB not so long ago, so the .NET GC is getting better and better.

jkotas commented 2 years ago

The crash does not repro for me (Intel Core i7, 32 GB, 8 logical cores). If anybody can share a full dump (w/ symbols if it is using a private coreclr build), I will be happy to take a look.

jkotas commented 2 years ago

I have looked at several dumps from @jakobbotsch. I agree that this looks like a bug in the GC. I do not see any pattern that would suggest a GC reporting hole.

cc @dotnet/gc

From what we have seen so far, the issue repros well on AMD machines. It is hard/impossible to repro on Intel machines.

Maoni0 commented 2 years ago

@cshung could you please try to repro this on our AMD machine if it's not obvious from the dumps why this occurs?

tactical-drone commented 2 years ago

From what we have seen so far, the issue repros well on AMD machines. It is hard/impossible to repro on Intel machines.

I like this theory. AFAIK I still think that Interlocked.CompareExchange on my machine does not work sometimes. I have program state where, if there is a race condition between two of these calls, the return value gets messed up in strange ways. It's like the winning caller is told it won, but its value is not set on the losing side. On the losing side, the outcome is that some phantom blocker (maybe a previous CAS from 1.7 seconds ago, according to telemetry) has suddenly appeared that could not have been there just a moment ago when CompilerServices made the call to OnComplete. The new blocker cannot be queued; therefore, catastrophic failure. What should have happened: the sentinel should have been detected (instead of a phantom blocker), the queue canceled & the continuation executed. Therefore, both racers are caught as they happen, and the state suggests that there must have been a dormant 3rd racer in the area whose whereabouts I cannot explain, but whose state changes must be jammed in there also. Or my logic is bad.

Naturally, this is pure conjecture, because the state I see makes no sense. Not even wild theories can explain it. History dictates that the problem is PEBKAC, though, but sometimes I wonder. The things I am trying to achieve with these instructions are complex and hard to debug, but I am convinced I have been consistently plagued by some strange phenomenon with this instruction on AMD. What is hard to believe is that sometimes it works for 20 million iterations, sometimes it fails after 300. The range is large. (And AMD, because it has that memory bridge between the two dies that could be causing this.)

At another part of this chain, even slightly changing this function to an equivalent one with different IL would make my cluster test bog down after 300 adjuncts had been added for clustering. So to be clear, my .NET scheduler's outcome is critically dependent on the outcome of this CAS. Something subtle happening here is eventually bricking my scheduler's queues, which depend on the correct cores being selected by this CAS.

AFAIK this code should lead to the same outcomes. This means that there are timing issues around this Interlocked.CompareExchange instruction. Either I am using it wrong, or it is broken. I dunno.

Fails:

long latch = val;
while (latch + 1 > cap || Interlocked.CompareExchange(ref val, latch + 1, latch) != latch)
{
    if (latch + 1 > cap)
        return cap;
    latch = val;
}

Succeeds:

long latch;
while ((latch = val) + 1 > cap || Interlocked.CompareExchange(ref val, latch + 1, latch) != latch)
{
    if (latch + 1 > cap)
        return cap;
}
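
For comparison, a more conventional shape for this bounded increment makes the fresh read of val explicit at the top of each attempt (a sketch under the same assumed val/cap fields, using System.Threading; not a claimed fix):

static long IncrementCapped(ref long val, long cap)
{
    while (true)
    {
        long latch = Volatile.Read(ref val);   // explicit fresh read per attempt
        if (latch + 1 > cap)
            return cap;                        // capped, as in the snippets above
        if (Interlocked.CompareExchange(ref val, latch + 1, latch) == latch)
            return latch + 1;                  // CAS won; the increment happened
    }
}
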
jakobbotsch commented 2 years ago

It's possible there are two separate bugs here. 1) The bug I have shared dumps for, which happens with the debugger attached and hits the GC assert in verify_mem_cleared. 2) Without the debugger attached, I see a more GC-hole-looking issue (an m_alignpad == 0 assertion failure) that does not repro nearly as consistently.

tactical-drone commented 2 years ago

@jakobbotsch Good point. The reason I mention these CAS issues is that, while I was redesigning my semaphore over and over again, these parts seemed to have the biggest effect on the program totally crashing in these strange ways. The CAS instructions lead to ValueTasks that break, throwing the exceptions above at high rates and causing the original issue. But in a way you are correct: these are separate issues.

Naturally, I don't think these bugs will occur if you use .NET like a civilized individual. These errors do not occur when I disable my scheduler, and my scheduler depends on 2 CAS instructions to work. The rest of the concurrency issues are solved by using runtime tech instead of my own contraptions, unless the glue provided in ZeroCore is fundamentally wrong. I spent a lot of time coming up with that design (using runtime tech instead of another contraption), so I would be surprised but sad if there was a fundamental issue in the way I multiplex ValueTaskSourceCores.

To be clear, I don't want to send you guys on a wild goose chase when I mess with custom schedulers. Obviously, any crash reported is a good thing. The least I can do is tell you how to disable my custom bits so that you can better discern whether you are chasing my bugs.

jkotas commented 2 years ago

Without the debugger attached, I see a more GC-hole-looking issue (an m_alignpad == 0 assertion failure) that does not repro nearly as consistently.

Was the assert in verify_mem_cleared still enabled when you encountered these m_alignpad crashes?

jakobbotsch commented 2 years ago

Was the assert in verify_mem_cleared still enabled when you encountered these m_alignpad crashes?

Yes.

jakobbotsch commented 2 years ago

We have tracked down the first issue above to what seems to be a WinDbg bug where the upper halves of ymm registers become corrupted while the GC is clearing memory. I was able to create a small C++ repro case in which executing a sequence of WinDbg breakpoint/stepping commands results in a memset call that fills a buffer with garbage. In this repro we execute a vinsertf128 ymm0, ymm0, xmm0, 1 instruction with xmm0 = 0, yet ymm0 is non-zero after executing the instruction.

Interestingly, on Intel CPUs the case does not repro, and after stepping ymm0 is correctly 0, so there is potential for this to be a hardware bug as well. I have sent the repro to the WinDbg team to have a closer look.
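
For what it's worth, the same lane insert can be expressed from C# via hardware intrinsics, since Avx.InsertVector128 compiles down to vinsertf128; a managed analogue of the check (an illustration, not the actual C++ repro):

using System;
using System.Runtime.Intrinsics;
using System.Runtime.Intrinsics.X86;

class YmmCheck
{
    static void Main()
    {
        if (!Avx.IsSupported) return;
        Vector128<float> xmm = Vector128<float>.Zero;  // xmm0 = 0
        Vector256<float> ymm = Avx.InsertVector128(Vector256<float>.Zero, xmm, 1);
        Console.WriteLine(ymm);                        // must print all zeros
    }
}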

I believe that @janvorli will investigate the second bug, which is likely the real bug reported here.

tactical-drone commented 2 years ago

@jakobbotsch Excellent news.

W.r.t. the second problem: not sure what I am doing wrong, but I have now had 3 separate code paths where my CAS fails if I apply pressure tests. I believe one of these failures is proven to be correct code, because it is runtime tech. The reason I struggle is that the pressure needs to be high for the bug to appear, so many times I think I fixed the problem when instead I made the performance worse.

The current status quo is: I see old CAS instructions coming out of nowhere, jamming my state. Not sure if they are stuck in the scheduler, the CPU, or my broken logic. Maybe I should just run this code in a Linux VM and see what happens.

Interestingly, if I lower the node count to more civilized values such as 10, the pressure test results are pristine. You would expect catastrophic failure there too, because with less contention the nodes run smoother and faster. dotnet is telling me it is doing 200,000 ops per second. So if there were a logic bug, you would expect the entire chain to collapse instantly at these speeds. It does not.

With higher values, such as 50 nodes, it will run for a while, then suddenly my CPU flatlines and the trace comes up with this bug again. I get this bug every now and again and I have no idea what it means. It just says that the runtime started spamming HandleNonSuccessAndDebuggerNotification. I normally get rid of it by perturbing my CAS instructions with supplementary logic until it stops. The first time I saw this spam was with my first semaphore, which used entirely my own contraptions instead of runtime tech. It almost feels like once you reach a certain performance threshold (under load) it just bricks like that.

janvorli commented 2 years ago

I can repro the m_alignPad assert (like Jakob has reported) when running the repro with a checked/debug build of the runtime. It repros in almost every run for me, and the call stack is always the same. Here is the native top of the stack:

 # Child-SP          RetAddr           Call Site
00 000000f0`35ebc110 00007ff9`901e6f0f KERNELBASE!RaiseFailFastException+0x152
01 000000f0`35ebc6f0 00007ff9`901e7cf3 coreclr!TerminateOnAssert+0x1f [G:\git\runtime\src\coreclr\utilcode\debug.cpp @ 202] 
02 000000f0`35ebc720 00007ff9`901e6fb1 coreclr!_DbgBreakCheck+0x3c3 [G:\git\runtime\src\coreclr\utilcode\debug.cpp @ 414] 
03 000000f0`35ebd9c0 00007ff9`901e7651 coreclr!_DbgBreakCheckNoThrow+0x91 [G:\git\runtime\src\coreclr\utilcode\debug.cpp @ 433] 
04 000000f0`35ebdb10 00007ff9`8fac1024 coreclr!DbgAssertDialog+0x341 [G:\git\runtime\src\coreclr\utilcode\debug.cpp @ 594] 
05 000000f0`35ebdd50 00007ff9`8fa15cfe coreclr!ObjHeader::IllegalAlignPad+0x64 [G:\git\runtime\src\coreclr\vm\syncblk.cpp @ 2952] 
06 000000f0`35ebdd90 00007ff9`8fac83b7 coreclr!ObjHeader::GetBits+0x1e [G:\git\runtime\src\coreclr\vm\syncblk.h @ 1548] 
07 000000f0`35ebddc0 00007ff9`8fa1b531 coreclr!ObjHeader::Validate+0x37 [G:\git\runtime\src\coreclr\vm\syncblk.cpp @ 2042] 
08 000000f0`35ebde80 00007ff9`8fa1af14 coreclr!Object::ValidateInner+0x5c1 [G:\git\runtime\src\coreclr\vm\object.cpp @ 581] 
09 000000f0`35ebe0c0 00007ff9`8fa13425 coreclr!Object::Validate+0xd4 [G:\git\runtime\src\coreclr\vm\object.cpp @ 498] 
0a 000000f0`35ebe130 00007ff9`8fcf166a coreclr!OBJECTREF::OBJECTREF+0xd5 [G:\git\runtime\src\coreclr\vm\object.cpp @ 1122] 
0b 000000f0`35ebe1a0 00007ff9`337ddc04 coreclr!JIT_VirtualFunctionPointer+0xda [G:\git\runtime\src\coreclr\vm\jithelpers.cpp @ 3477] 

The corruption is quite strange. The object at frame 9 is of type System.Object, and the m_alignPad of the following object, which we assert on, contains a valid managed reference.

0:166> !lno 0x00000261`9865e528
Before:  000002619865e500           40 (0x28)   zero.core.patterns.queue.enumerator.IoQEnumerator`1[[zero.core.patterns.bushings.IoSink`1[[zero.core.feat.models.protobuffer.CcProtocMessage`2[[Zero.Models.Protobuf.CcWhisperMsg, zero.cocoon],[zero.cocoon.models.batches.CcGossipBatch, zero.cocoon]], zero.core.feat]], zero.core]]
Current: 000002619865e528           24 (0x18)   System.Object
After:   000002619865e568           24 (0x18)   zero.core.patterns.semaphore.core.IoManualResetValueTaskSource`1[[System.Boolean, System.Private.CoreLib]]
Heap local consistency not confirmed.

This is the memory at the object address:

dq 00000261`9865e528  
00000261`9865e528  00007ff9`31159710 00000000`00000000
00000261`9865e538  00000261`9864b620 00000261`a620c9f8
00000261`9865e548  00000000`00000000 00000000`00000000
00000261`9865e558  00000000`00000000 00000000`00000000

Here is what the reference that is in the m_alignPad location points to:

0:166> !do 00000261`9864b620
Name:        zero.cocoon.CcDrone
MethodTable: 00007ff93186a9c0
EEClass:     00007ff93187bc88
Tracked Type: false
Size:        296(0x128) bytes
File:        D:\test\dotnet\zero\zero.sync\bin\Release\net7.0\win-x64\publish\zero.cocoon.dll
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
00007ff93132c670  400008c       38         System.Int64  1 instance            15213 _zId
00007ff9313bd6f0  400008d        8        System.String  0 instance 0000026175800218 <Description>k__BackingField
00007ff931698db8  400008e       10 ...ns.misc.IIoNanite  0 instance 000002619863cb88 <ZeroedFrom>k__BackingField
00007ff9313bd6f0  400008f       18        System.String  0 instance 00000261a61f4180 <ZeroReason>k__BackingField
00007ff9313264b0  4000090       48         System.Int32  1 instance       1200607328 TearDownTime
00007ff9313264b0  4000091       4c         System.Int32  1 instance                0 CascadeTime
00007ff93132c670  4000092       40         System.Int64  1 instance       1200602109 UpTime
00007ff9313264b0  4000093       50         System.Int32  1 instance                0 _zeroPrimed
00007ff9313264b0  4000094       54         System.Int32  1 instance                1 _zeroed
00007ff9313264b0  4000095       58         System.Int32  1 instance                1 _zeroedSec
00007ff932776810  4000096       20 ...oSub, zero.core]]  0 instance 000002619864b898 _zeroHive
00007ff9327770f0  4000097       28 ...nite, zero.core]]  0 instance 0000026198652390 _zeroHiveMind
00007ff9313264b0  4000098       5c         System.Int32  1 instance                1 _concurrencyLevel
00007ff9322c81d0  400009b       30 ...lationTokenSource  0 instance 000002619864b868 AsyncTasks
00007ff93177dbe8  400008a       28         NLog.ILogger  0   static 000002617577f058 _logger
00007ff93132c670  400008b      130         System.Int64  1   static            18954 _uidSeed
00007ff93132c670  4000099      138         System.Int64  1   static             2056 _zCount
00007ff932700608  400009a       30 ...Private.CoreLib]]  0   static 000002617577ea08 ZeroRoot
00007ff9326d47c8  40000d8       60 ...Private.CoreLib]]  0 instance 000002619863cb88 <Source>k__BackingField
00007ff9312c4a28  40000d9       cc       System.Boolean  1 instance                0 <ZeroRecoveryEnabled>k__BackingField
0000000000000000  40000da       68                       0 instance 000002619865a7b0 _queue
0000000000000000  40000db       70                       0 instance 000002619865e388 <JobHeap>k__BackingField
00007ff933186f48  40000dc       78 ...Private.CoreLib]]  0 instance 000002619865e568 _zeroSync
00007ff9313bd6f0  40000dd       80        System.String  0 instance 000002619864b768 _description
00007ff93177dbe8  40000de       88         NLog.ILogger  0 instance 000002618308e460 _logger
00007ff9312c4a28  40000df       cd       System.Boolean  1 instance                0 <IsArbitrating>k__BackingField
0000000000000000  40000e0       90                       0 instance 0000000000000000 _previousJobFragment
00007ff93132c670  40000e1       98         System.Int64  1 instance                0 _eventCounter
00007ff9313264b0  40000e2       a0         System.Int32  1 instance              128 parm_max_q_size
00007ff9313264b0  40000e3       a4         System.Int32  1 instance             2000 parm_min_uptime_ms
00007ff9313264b0  40000e4       a8         System.Int32  1 instance          1000000 parm_stats_mod_count
00007ff9313264b0  40000e5       ac         System.Int32  1 instance               10 parm_min_frame_time
00007ff9313264b0  40000e6       b0         System.Int32  1 instance             1000 parm_error_timeout
00007ff9313264b0  40000e7       b4         System.Int32  1 instance              250 parm_conduit_spin_up_wait_time
00007ff9313264b0  40000e8       b8         System.Int32  1 instance                2 parm_max_consumer_threads
00007ff9313264b0  40000e9       bc         System.Int32  1 instance             1000 parm_producer_consumer_throttle_delay
00007ff9313264b0  40000ea       c0         System.Int32  1 instance             2500 parm_min_failed_production_time
00007ff9313264b0  40000eb       c4         System.Int32  1 instance             5000 parm_consumer_wait_for_producer_timeout
00007ff9313264b0  40000ec       c8         System.Int32  1 instance             1000 parm_producer_start_retry_time
00007ff9317d6138  4000133       d0          NLog.Logger  0 instance 00000261831152c0 _logger
00007ff93169b300  4000134       d8 ...Private.CoreLib]]  0 instance 000002617c90b348 Node
00007ff9317d6138  40000f7       e0          NLog.Logger  0 instance 0000026187e6d540 _logger
00007ff9313bd6f0  40000f8       e8        System.String  0 instance 00000261a620c598 _description
00007ff932c04c98  40000f9       f0 ...utopeer.CcAdjunct  0 instance 00000261870e3cf8 _adjunct
00007ff9313bd6f0  40000fa       f8        System.String  0 instance 00000261870e3c98 _key
00007ff9312c4a28  40000fb      11c       System.Boolean  1 instance                1 AccountingBit
00007ff9312c4a28  40000fc      11d       System.Boolean  1 instance                1 _assimulated
00007ff9313264b0  40000fd      118         System.Int32  1 instance                1 parm_insane_checks_delay_s
00007ff931626398  40000fe      100        System.Byte[]  0 instance 000002619864b748 _vb
00007ff93172a9d8  40000ff      108 ...obuf.CcWhisperMsg  0 instance 0000026198679c18 _m
00007ff933a37b50  4000100      110 ...Private.CoreLib]]  0 instance 0000026198679e30 _sendBuf

And here is a history of that object:

SUCCESS: GCHist structures initialized
0:166> !histobj 00000261`9864b620
 GCCount           Object                                    Roots
---------------------------------------------------------
      19 000002619864B620 000002A20A332040, 
      18 000002619B325540 
      17 000002619B325540 
      16 000002619B325540 
      15 000002619B325540 
      14 000002619B325540 
      13 000002619B325540 
      12 000002619B325540 
      11 000002619B325540 
      10 000002619B325540 
       9 000002619B325540 
       6 000002619B325540 
0:166> !histroot 000002A20A332040
 GCCount            Value               MT Promoted?                Notes
---------------------------------------------------------
      19 000002619864B620 00007FF93186A9C0        no 
...
0:166> !histobjfind 000002619864B620 
 GCCount           Object                                  Message
---------------------------------------------------------
      19 000002619864B620 Relocation NEWVALUE for root 000002A20A332040
...
janvorli commented 2 years ago

And one more piece of the puzzle: the heap verification shows a heap corruption.

0:166> !verifyheap
object 00000261986415f8: bad member 000002619865E540 at 0000026198641608
Last good object: 00000261986414E0.
Could not request method table data for object 000002619865E540 (MethodTable: 00000261A620C9F8).
Last good object: 000002619865E528.

That corresponds to the corruption shown above. 000002619865E528 is the System.Object before the corrupted m_alignPad, and 000002619865E540 is the address right after that align pad. 00000261A620C9F8 is also a valid object reference, so both the m_alignPad and the following memory location were corrupted with valid references.

0:166> !do 00000261A620C9F8
Name:        System.String
MethodTable: 00007ff9313bd6f0
EEClass:     00007ff931460898
Tracked Type: false
Size:        86(0x56) bytes
File:        D:\test\dotnet\zero\zero.sync\bin\Release\net7.0\win-x64\publish\System.Private.CoreLib.dll
String:      ZERO: ZeroManagedAsync: teardown
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
00007ff9313264b0  4000303        8         System.Int32  1 instance               32 _stringLength
00007ff9312cc430  4000304        c          System.Char  1 instance               5a _firstChar
00007ff9313bd6f0  4000302       f8        System.String  0   static 0000026175800218 Empty

However, this reference has no GC root.

janvorli commented 2 years ago

Thanks to @jakobbotsch's hint, I have found that the System.Object instance at 000002619865E528 (before the corrupted m_alignPad) is the object that was set as _sentinel in the IoZeroQ instance using Unsafe.As<T>(new object()).

0:000> !do 000002619865E3F0
Name:        zero.core.patterns.queue.IoZeroQ`1[[zero.core.patterns.bushings.IoSink`1[[zero.core.feat.models.protobuffer.CcProtocMessage`2[[Zero.Models.Protobuf.CcWhisperMsg, zero.cocoon],[zero.cocoon.models.batches.CcGossipBatch, zero.cocoon]], zero.core.feat]], zero.core]]
MethodTable: 00007ff933a39488
EEClass:     00007ff9328746c8
Tracked Type: false
Size:        176(0xb0) bytes
File:        D:\test\dotnet\zero\zero.sync\bin\Release\net7.0\win-x64\publish\zero.core.dll
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
00007ff93132c670  4000069       68         System.Int64  1 instance                0 _head
00007ff93132c670  400006a       70         System.Int64  1 instance                2 _hwm
00007ff9313bd6f0  400006b        8        System.String  0 instance 0000026175800218 _description
0000000000000000  400006c       10              SZARRAY  0 instance 000002619865e4b8 _storage
00007ff93186be90  400006d       18     System.__Canon[]  0 instance 000002619865e4d8 _fastStorage
00007ff931159808  400006e       20       System.__Canon  0 instance 000002619865e528 _sentinel
00007ff931159710  400006f       28        System.Object  0 instance 000002619865e4a0 _syncRoot
00007ff9328846d0  4000070       30 ...Private.CoreLib]]  0 instance 000002619865e500 _curEnumerator
00007ff9326df9d0  4000071       38 ...ZeroSemaphoreSlim  0 instance 0000000000000000 _fanSync
00007ff932886e80  4000072       40 ...Private.CoreLib]]  0 instance 0000000000000000 _zeroSync
00007ff9326df9d0  4000073       48 ...ZeroSemaphoreSlim  0 instance 0000000000000000 _balanceSync
0000000000000000  4000074       50              SZARRAY  0 instance 0000000000000000 _fanSyncs
0000000000000000  4000075       58              SZARRAY  0 instance 0000000000000000 _balanceSyncs
0000000000000000  4000076       60              SZARRAY  0 instance 0000000000000000 _zeroSyncs
00007ff93132c670  4000077       78         System.Int64  1 instance                1 _tail
00007ff9313264b0  4000078       80         System.Int32  1 instance                1 _zeroed
00007ff9313264b0  4000079       84         System.Int32  1 instance                2 _capacity
00007ff9313264b0  400007a       88         System.Int32  1 instance                0 _virility
00007ff9313264b0  400007b       8c         System.Int32  1 instance                0 _blockingConsumers
00007ff9313264b0  400007c       90         System.Int32  1 instance                0 _sharingConsumers
00007ff9313264b0  400007d       94         System.Int32  1 instance                0 _pumpingConsumers
00007ff9313264b0  400007e       98         System.Int32  1 instance                0 _primedForScale
00007ff9312c4a28  400007f       a0       System.Boolean  1 instance                0 _autoScale
00007ff9312c4a28  4000080       a1       System.Boolean  1 instance                0 _blockingCollection
00007ff9313264b0  4000081       9c         System.Int32  1 instance                1 _count

Looking at the generic argument of the IoZeroQ, which is the type of the sentinel (IoSink<T>), and looking at the sources, I can see that the two references that corrupted the m_alignPad and the following qword in memory have the types of the first reference fields in IoSink. The string is the ZeroReason member, and the value of the string I dumped above matches that (ZERO: ZeroManagedAsync: teardown).

janvorli commented 2 years ago

In other words, somehow the _sentinel, which is an unsafely cast System.Object instance, is used as an IoSink, writing over the end of the real object.
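
A small sketch of why such a cast corrupts neighboring objects (types hypothetical, mirroring object vs. IoSink):

using System.Runtime.CompilerServices;

class Small { }                                   // like 'new object()': header only
class Big   { public object A; public string R; } // like IoSink<T>: real reference fields

static class SentinelBug
{
    static void Corrupt()
    {
        // Unsafe.As performs no type check; 'bogus' is really a Small instance.
        Big bogus = Unsafe.As<Big>(new Small());
        bogus.A = new object(); // writes past the end of the Small allocation,
        bogus.R = "teardown";   // landing in the next object's header (m_alignPad) and payload
    }
}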

tactical-drone commented 2 years ago

I use that sentinel as an instance of the type that I am queueing, to manage concurrency. That sentinel is only inserted into the Q buffer using these CAS instructions that I complain about. I am not sure it was strictly necessary for it to be there; I was trying to fix the CAS by making the other side detect the sentinel instead of issuing more compare-and-swap spam that could jam.

If I take out the sentinel, the code becomes massively unstable, which should not be the case. This is a lead.

jakobbotsch commented 2 years ago

@tactical-drone It is never safe to reinterpret an object reference of one type as an object reference of another type where the same cast would not succeed. It will cause runtime instability issues, like you see in this issue.

I have applied the following patch that eliminates the use of Unsafe.As:

@@ -50,8 +50,8 @@ namespace zero.core.patterns.queue
                 //TODO: tuning
                 capacity = Math.Max(4, capacity);
                 _hwm = _capacity = 1;
-                _storage = new T[32][];
-                _storage[0] = _fastStorage = new T[_capacity];
+                _storage = new object[32][];
+                _storage[0] = _fastStorage = new object[_capacity];

                 var v = IoMath.Log2((ulong)capacity) - 1;
                 var scaled = false;
@@ -66,8 +66,8 @@ namespace zero.core.patterns.queue
             else
             {
                 _hwm = _capacity = capacity;
-                _storage = new T[1][];
-                _storage[0] = _fastStorage = new T[_capacity];
+                _storage = new object[1][];
+                _storage[0] = _fastStorage = new object[_capacity];
             }

             if (_blockingCollection)
@@ -83,7 +83,7 @@ namespace zero.core.patterns.queue

             _curEnumerator = new IoQEnumerator<T>(this);

-            _sentinel = Unsafe.As<T>(new object());
+            _sentinel = new object();
         }

         #region packed
@@ -92,10 +92,10 @@ namespace zero.core.patterns.queue

         private readonly string _description;

-        private readonly T[][] _storage;
-        private readonly T[] _fastStorage;
+        private readonly object[][] _storage;
+        private readonly object[] _fastStorage;

-        private readonly T _sentinel;
+        private readonly object _sentinel;
         private readonly object _syncRoot = new();

         private volatile IoQEnumerator<T> _curEnumerator;
@@ -161,11 +161,11 @@ namespace zero.core.patterns.queue
                 Debug.Assert(idx >= 0);

                 //if (!IsAutoScaling) return Volatile.Read(ref _fastStorage[idx % _capacity]);
-                if (!IsAutoScaling) return _fastStorage[idx % _capacity];
+                if (!IsAutoScaling) return (T)_fastStorage[idx % _capacity];

                 idx %= Capacity;
                 var i = IoMath.Log2(unchecked((ulong)idx + 1));
-                return _storage[i][idx - ((1 << i) - 1)];
+                return (T)_storage[i][idx - ((1 << i) - 1)];
             }
             [MethodImpl(MethodImplOptions.AggressiveInlining)]
             protected set
@@ -207,7 +207,7 @@ namespace zero.core.patterns.queue
                 if (_primedForScale == 1 && Head <= cap2 && Tail <= cap2 && (Tail > Head || Tail == Head && _count < cap2) && Interlocked.CompareExchange(ref _primedForScale, 2, 1) == 1 || force)
                 {
                     var hwm = 1 << (_virility + 1);
-                    _storage[_virility + 1] = new T[hwm];
+                    _storage[_virility + 1] = new object[hwm];
                     Interlocked.Add(ref _hwm, hwm);
                     Interlocked.Increment(ref _virility);
                     Interlocked.Exchange(ref _primedForScale, 0);
@@ -226,7 +226,7 @@ namespace zero.core.patterns.queue
         /// <param name="compare">The compare value</param>
         /// <returns>The previous value</returns>
         [MethodImpl(MethodImplOptions.AggressiveInlining)]
-        private T CompareExchange(long idx, T value, T compare)
+        private object CompareExchange(long idx, object value, object compare)
         {
             if (!IsAutoScaling) return Interlocked.CompareExchange(ref _fastStorage[idx % _capacity], value, compare);

@@ -401,7 +401,7 @@ namespace zero.core.patterns.queue
                 T latch;

                 while ((head = Head) >= Tail || (latch = this[head]) == _sentinel || latch == null || head != Head ||
-                       (slot = CompareExchange(head, _sentinel, latch)) != latch) 
+                       (slot = (T)CompareExchange(head, _sentinel, latch)) != latch) 
                 {
                     if (_count == 0 || Zeroed)
                     {
@@ -418,7 +418,7 @@ namespace zero.core.patterns.queue
                 this[head] = null;//restore the sentinel
                 Interlocked.Decrement(ref _count);
                 Interlocked.Increment(ref _head);

                 return true;
             }
             catch (Exception e)

When I apply this patch, I see exceptions like the following:

2022/05/27 13:30:43.597|ERROR[   IoZeroQ`1 ( 317)]: TryDequeue failed! System.Object ChkCastAny(Void*, System.Object) System.InvalidCastException: Unable to cast object of type 'System.Object' to type 'IoZNode[zero.core.patterns.bushings.IoSink`1[zero.core.feat.models.protobuffer.CcProtocMessage`2[Zero.Models.Protobuf.chroniton,zero.cocoon.models.batches.CcDiscoveryBatch]]]'.
   at zero.core.patterns.queue.IoZeroQ`1.TryDequeue(T& slot) in C:\dev\dotnet\investigations\zero\zero.core\patterns\queue\IoZeroQ.cs:line 431
janvorli commented 2 years ago

I've also noticed that there are a couple of CompareExchange usages that use the _sentinel as a source value. There might be a subtle race that causes the sentinel to leak out somehow and then get used as if it were the IoSink type.
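
One hypothetical interleaving (illustrative names, not the actual IoZeroQ members) that would let the sentinel escape: a consumer claims a slot by swapping the sentinel in, and a racing reader observes the slot inside that window without re-checking for the sentinel.

using System.Threading;

class SlotSketch<T> where T : class
{
    private readonly object _sentinel = new object();
    private readonly object[] _slots = new object[1];

    // Thread A: claim the slot by swapping the sentinel in, then clear it.
    public T TryClaim()
    {
        object latch = Volatile.Read(ref _slots[0]);
        if (latch == null || ReferenceEquals(latch, _sentinel))
            return null; // empty, or another thread is mid-claim
        if (Interlocked.CompareExchange(ref _slots[0], _sentinel, latch) != latch)
            return null; // lost the race
        // Window: until the next line runs, _slots[0] holds the sentinel.
        Volatile.Write(ref _slots[0], null);
        return (T)latch;
    }

    // Thread B: without the ReferenceEquals check, a read that lands in the
    // window above hands the bare System.Object out as a T.
    public T TryPeek()
    {
        object raw = Volatile.Read(ref _slots[0]);
        return raw == null || ReferenceEquals(raw, _sentinel) ? null : (T)raw;
    }
}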

janvorli commented 2 years ago

And one more detail: in the dump I took at the crash, I can actually see the sentinel stored somewhere outside of the IoZeroQ:

HandleTable:
    0000026170CF8238 (strong handle)
    -> 000002619C9DDEA0 System.Object[]
    -> 00000261A620CB10 System.Threading.Tasks.Task`1[[System.Threading.Tasks.Task, System.Private.CoreLib]]
    -> 00000261A620CAD0 System.ValueTuple`6[[zero.core.patterns.misc.IoNanoprobe, zero.core],[System.Func`2[[zero.core.patterns.misc.IoNanoprobe, zero.core],[System.Threading.Tasks.ValueTask, System.Private.CoreLib]], System.Private.CoreLib],[zero.core.patterns.misc.IoNanoprobe, zero.core],[System.String, System.Private.CoreLib],[System.String, System.Private.CoreLib],[System.Int32, System.Private.CoreLib]]
    -> 000002619865E528 System.Object

This looks like a static variable holding a Task. The ValueTuple most likely holds the captured locals; the first member is declared as IoNanoprobe, but it contains the reference to the System.Object:

0:000> !do 00000261A620CAD0 
Name:        System.ValueTuple`6[[zero.core.patterns.misc.IoNanoprobe, zero.core],[System.Func`2[[zero.core.patterns.misc.IoNanoprobe, zero.core],[System.Threading.Tasks.ValueTask, System.Private.CoreLib]], System.Private.CoreLib],[zero.core.patterns.misc.IoNanoprobe, zero.core],[System.String, System.Private.CoreLib],[System.String, System.Private.CoreLib],[System.Int32, System.Private.CoreLib]]
MethodTable: 00007ff933600c60
EEClass:     00007ff932eefd48
Tracked Type: false
Size:        64(0x40) bytes
File:        D:\test\dotnet\zero\zero.sync\bin\Release\net7.0\win-x64\publish\System.Private.CoreLib.dll
Fields:
              MT    Field   Offset                 Type VT     Attr            Value Name
00007ff931159808  400071f        8       System.__Canon  0 instance 000002619865e528 Item1
00007ff931159808  4000720       10       System.__Canon  0 instance 000002617cc788e0 Item2
00007ff931159808  4000721       18       System.__Canon  0 instance 000002619865e528 Item3
00007ff931159808  4000722       20       System.__Canon  0 instance 0000026183601840 Item4
00007ff931159808  4000723       28       System.__Canon  0 instance 000002617cb15908 Item5
00007ff9313264b0  4000724       30         System.Int32  1 instance              302 Item6
janvorli commented 2 years ago

I can see that setting the string to ZERO: ZeroManagedAsync: teardown happens in IoNanoprobe.DisposeAsync, invoked by IoNanoprobe.Dispose. So it looks like the code ended up calling dispose on our System.Object sentinel.

tactical-drone commented 2 years ago

@jakobbotsch Sure thing! I was totally hacking there. I was hoping Unsafe.As could function like a reinterpret_cast<> on a pointer value that I never dereference. I need that sentinel, or I must pass one in.

Your solution of casting to object should work? I might have an internal bug that is masked by this sentinel; your cast errors are an indication of that. Let me try that patch.

tactical-drone commented 2 years ago

I am a bit confused now. I can't get CAS to work on objects. It throws: "Attempted to access an element as a type incompatible with the array."

jakobbotsch commented 2 years ago

I am a bit confused now. I can't get CAS to work on objects. It throws: "Attempted to access an element as a type incompatible with the array."

You probably ended up with array covariance by storing T[] in _storage[i]. Make sure those inner arrays are also object[] and not T[].
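
A minimal standalone sketch of the covariance trap:

using System;

class CovarianceDemo
{
    static void Main()
    {
        object[] slots = new string[4]; // legal: arrays are covariant
        slots[0] = "ok";                // fine, the element really is a string

        try
        {
            slots[1] = new object();    // the per-store type check fails
        }
        catch (ArrayTypeMismatchException)
        {
            // "Attempted to access an element as a type incompatible
            // with the array."
        }

        object[] safe = new object[4];  // allocated as object[]
        safe[0] = new object();         // fine: any reference may be stored
    }
}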

tactical-drone commented 2 years ago

Oops! I missed an entire part of the patch. Your patch passes the smoke tests; I don't get those cast errors anymore. Let me run bigger tests.

tactical-drone commented 2 years ago

I thought that fixed it. All my tests pass and everything runs very smoothly now; the liveness under load looks better too. So smoothly, in fact, that I end up back at the HandleNonSuccessAndDebuggerNotification spam again. When I go look at that function in the runtime, I have got nothing.

Everything just works, then suddenly, out of nowhere, the CPU flatlines with HandleNonSuccessAndDebuggerNotification spamming (which won't replicate in Debug mode). I get no errors in the logs, nothing. A complete, instant lockup.

Let me see if I can check in my progress.

janvorli commented 2 years ago

Just an extra bit of detail that completes the picture of how we got to the corruption. The problem was triggered by the ZeroManagedAsync function calling DisposeAsync on the sentinel. I've added this to IoNanoprobe.DisposeAsync:

        public async ValueTask DisposeAsync(IIoNanite @from, string reason, [CallerFilePath] string filePath = null, [CallerMemberName] string methodName = null, [CallerLineNumber] int lineNumber = default)
        {
            if (GetType() == typeof(object))
            {
                Environment.FailFast("DisposeAsync called on System.Object!");
            }

The call stack displayed by the FailFast showed that ZeroManagedAsync was the caller. It seemed a bit strange, as the caller appears to check whether the reference it calls DisposeAsync on implements IIoNanite:

if (item is IIoNanite ioNanite)
{
    if (!ioNanite.Zeroed())
        await ioNanite.DisposeAsync((IIoNanite)nanite, string.Empty).FastPath();
}

But I guess the compiler eliminated the check, since the static type of the item local is known to be one that satisfies the condition (except that in our case, the local actually held the System.Object sentinel).
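
A sketch with stand-in types (not the real ones from zero.core) of how the check disappears: when the static type of item already implements the interface, the C# compiler lowers the pattern to a plain null check, so no method-table test survives to catch a forged reference.

interface IIoNaniteLike { bool Zeroed(); }

class IoSinkLike : IIoNaniteLike // the element type implements the interface
{
    public bool Zeroed() => false;
}

class Guard
{
    public static void Check(IoSinkLike item)
    {
        // Statically always true for non-null 'item', so this compiles down
        // to 'if (item != null)'. A reinterpreted System.Object passes the
        // null check and goes straight into an interface dispatch through
        // the wrong method table.
        if (item is IIoNaniteLike ioNanite)
        {
            ioNanite.Zeroed();
        }
    }
}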

tactical-drone commented 2 years ago

@janvorli Great stuff! Indeed, that is a bug. Dispose should not be called on the sentinel, but the sentinel certainly passes that check; otherwise Unsafe.As would not be doing what it should. I must correct that somehow. I hope this is not going to be used as an example of why you don't put things like Unsafe.As into the framework; I love the fact that we have that kind of power in C#. Too bad this time it really was unsafe.

For now I have removed the sentinel entirely anyway, since I was only using it to jam the CAS timing.
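
A sentinel-free claim could look like this sketch (assumed names, not my actual code): CAS the slot straight from the observed value to null, so nothing that is not a real T ever enters the buffer and typed storage stays safe.

using System.Threading;

class NoSentinel<T> where T : class
{
    private readonly T[] _slots;

    public NoSentinel(int capacity) => _slots = new T[capacity];

    // Claim by swapping the observed item directly for null; only real T
    // instances are ever stored, so the typed T[] never needs reinterpreting.
    public bool TryTake(int idx, out T item)
    {
        T latch = Volatile.Read(ref _slots[idx]);
        if (latch != null &&
            Interlocked.CompareExchange(ref _slots[idx], null, latch) == latch)
        {
            item = latch;
            return true;
        }

        item = null;
        return false;
    }
}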

janvorli commented 2 years ago

@tactical-drone I assume we can close this issue now, right?

tactical-drone commented 2 years ago

@janvorli Thanks! Good luck!