Fatal CLR exception in .NET 5.0.0-RC2 high async/await load application

hoyosjs commented 4 years ago

I spoke to soon. Ignore the previous post.

For some bizarre reason I am getting this error today, where the code I am running has been the same, mostly and working for a while now. Bizarre.

I don't have the debugger attached, I am in (console) RELEASE mode running massive parallel async/await load consuming many chips.

Running .net 5 rc2 and possibly forgetting to set $env:DOTNET_SYSTEM_THREADING_POOLASYNCVALUETASKS = 1 which I have been extensively using and playing around with. I left it off, maybe that is what was different. App running again, but this setting has not done anything like this before.

This is all the debug info I could find:

Console stack that was dumped this time:

Fatal error. Internal CLR error. (0x80131506)
   at Org.BouncyCastle.Math.EC.Rfc8032.Ed25519.CheckPoint(Int32[], Int32[])
   at Org.BouncyCastle.Math.EC.Rfc8032.Ed25519.EncodePoint(PointAccum, Byte[], Int32)
   at Org.BouncyCastle.Math.EC.Rfc8032.Ed25519.ImplVerify(Byte[], Int32, Byte[], Int32, Byte[], Byte, Byte[], Int32, Int32)
   at zero.cocoon.models.CcSubspaceMessage+<ConsumeAsync>d__28.MoveNext()
   at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[[zero.cocoon.models.CcSubspaceMessage+<ConsumeAsync>d__28, zero.cocoon, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null]](<ConsumeAsync>d__28 ByRef)
 ...

clr dotnet-dump stack clr threads with exceptions on them `(threadid 486)`:

Unhandled exception: Microsoft.Diagnostics.DebugServices.DiagnosticsException: Invalid thread index: 486 ---> System.ArgumentOutOfRangeException: Specified argument was out of the range of valid values.
Parameter name: index
   at System.Linq.Enumerable.ElementAt[TSource](IEnumerable`1 source, Int32 index)
   at Microsoft.Diagnostics.DebugServices.ThreadService.Microsoft.Diagnostics.DebugServices.IThreadService.GetThreadInfoFromIndex(Int32 threadIndex) in /_/src/Microsoft.Diagnostics.DebugServices/ThreadService.cs:line 285
   --- End of inner exception stack trace ---
   at Microsoft.Diagnostics.Repl.CommandProcessor.Handler.Invoke(MethodInfo methodInfo, InvocationContext context) in /_/src/Microsoft.Diagnostics.Repl/Command/CommandProcessor.cs:line 297
   at Microsoft.Diagnostics.Repl.CommandProcessor.Handler.System.CommandLine.Invocation.ICommandHandler.InvokeAsync(InvocationContext context) in /_/src/Microsoft.Diagnostics.Repl/Command/CommandProcessor.cs:line 256
--- End of stack trace from previous location where exception was thrown ---
   at System.CommandLine.Invocation.InvocationPipeline.<>c__DisplayClass4_0.<<BuildInvocationChain>b__0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.CommandLine.Invocation.InvocationExtensions.<>c.<<UseParseErrorReporting>b__22_0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.CommandLine.Invocation.InvocationExtensions.<>c.<<UseSuggestDirective>b__7_0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.CommandLine.Invocation.InvocationExtensions.<>c.<<UseParseDirective>b__6_0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.CommandLine.Invocation.InvocationExtensions.<>c.<<UseHelp>b__20_0>d.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.CommandLine.Invocation.InvocationExtensions.<>c__DisplayClass5_0.<<UseExceptionHandler>b__0>d.MoveNext()
>

Originally posted by @tactical-drone in https://github.com/dotnet/runtime/issues/13590#issuecomment-713302666

ghost commented 4 years ago

Tagging subscribers to this area: @tommcdon See info in area-owners.md if you want to be subscribed.

hoyosjs commented 4 years ago

@tactical-drone the issue you were describing was really different. However, the data you provided was not . The second one caused dotnet-dump to fail, which is a completely different issue altogether (@mikem8361 fyi). However, the frames that I see there only tell me that some check in Org.BouncyCastle.Math.EC.Rfc8032.Ed25519.CheckPoint threw an exception. Is there anyway you can share the dump you captured? Or to share a repro with us? There's not much information on this issue to do an investigation with as it stands right now. I see it was doing an operation here https://github.com/bcgit/bc-csharp/blob/5bd4c8c70f80f1e7ead8e3c73459b78eb93d0ef7/crypto/src/math/ec/rfc8032/Ed25519.cs#L120 and then something threw a critical exception that shouldn't get to the user. I am not sure what to classify this as, @jkotas would this be VM?

mikem8361 commented 4 years ago

The dotnet-dump setthread issue is that you need to add the --tid option or use the thread "index" (0 - N) from the threads command for this thread.

setthread --tid 486

tactical-drone commented 4 years ago

This commit had produced the error

To reproduce, just checkout the branch net5 and run on rc2. But you do need a beefy PC on those test parameters. I run my tests on a 3950x with 64Gb ram, which is effectively a mini datacenter. Just set the total nodes to something more manageable like 500 or something. Bump the value until your memory runs out. 2000 nodes requires 10Gb ram. Wait for the error to occur. (All the program does is spawn a bunch of TCP/IP clients that connect to each other and then continuously exchange test bit between each other, 2000 nodes cause 30K connections so this starts to add up)

With regards to the failure. I haven't seen it since. It occurred on a fresh boot (a boot that felt unstable and was rebooted not long after). I am 100% sure $env:DOTNET_SYSTEM_THREADING_POOLASYNCVALUETASKS = 1 was not set. I don't think it has anything to do with this setting, but that is the only change that I made (apart from these commits a0b10c9f, af69df5) that fixed the issue.

The reason I initially thought a0b10c9f was the fix, is because no other changes I brought could cause a CLR crash. The very first crash looked to be caused because of a TaskCanceledException coming from zerosemaphore. So I thought I understood the cause and made a0b10c9f. But I spoke to soon, and the subsequent test failed with the same CLR error and this time with accompanying stack trace, this time not a TaskCanceledException, but something in Bouncy castle. This made it look like a new random CLR error not caused by me. (CLR can only be brought down by unhandled exceptions in certain types of closures).

I then thought wtf (started googling), set $env:DOTNET_SYSTEM_THREADING_POOLASYNCVALUETASKS = 1 and never saw the issue again. I am still not sure what exactly happened or why things are back to normal. I just got that CLR error back to back out of nowhere and then gone. My other IDE (which is not vscode or vs2019 but uses dotnet.exe to build) sometimes screws up the build after a small core lib refactor. So I am not sure if the subsequent run's CLR error was caused by something like this? I have seen ~3/2000 .net5 strange builds where part of the previous code refactor did not show, but then doing a full rebuild in vs2019 fixes the missing refactor. Could be a race condition between that IDE's build and save all files logic. Who knows?

In any case, I am not using vs2019 or it's debugger since from RC1 those became massively unstable. Before RC1 I could regularly debug my program from vs2019 preview, but after rc1 only very small test parameters work. Otherwise the debug session just hangs. It hangs regardless actually.

hoyosjs commented 4 years ago

In any case, I am not using vs2019 or it's debugger since from RC1 those became massively unstable. Before RC1 I could regularly debug my program from vs2019 preview, but after rc1 only very small test parameters work. Otherwise the debug session just hangs. It hangs regardless actually.

Do you have any feedback on this? It is concerning and I am wondering it it was us who caused such instability. There was a bug that had to be fixed not too long ago (https://github.com/dotnet/runtime/commit/1744896828815234d5c4244a3575301b3dc8f164) that would explain this behavior. This was part of the RC2 release.

tactical-drone commented 4 years ago

@hoyosjs RC2 has exactly the same debugger issue as RC1.

I would say fairly regularly a debug session would just lock up when I run less that 30 nodes. The lockup would happen within 20 seconds of the debug session starting, or NEVER.

Between 30 ~ 100 nodes this fairly regularly becomes a certainty.

Failure type 1: Small test

Sometimes the debug session just locks up. This means, the IDE's debug area with CPU and Memory becomes unresponsive and frozen. A stop debug and "close the purple cmd window" and rerun normally mends the issue. It feels like the closing of that window increases the probability of the next run working and not closing it almost guarantees it failing. I could be wrong.

Failure type2: Progressively larger test

This one can go two ways. A total lockup as in failure type 1, or, a "lockup" but the IDE's CPU and Memory region still works, showing no CPU activity but the Memory region shows ~20Mb/s consistent linear growth (with no GCs) in memory usage that looks like a runaway process in the CLR.

tactical-drone commented 4 years ago

Another issue I am having since RC1 is the following:

I am testing memory leaks, this consists of a longish run in RELEASE mode and then a teardown cycle that ends with a manual GC.Collect.

I then go to vs2019 attach the debugger to the process and take a memory snapshot. During that heap snapshot calculation (which is a fat one that sometimes shows a terminate dialog but completes) the debugger sometimes crashes taking the process with it.

This did not happen pre RC1 with even larger HEAP snapshot calculations.

Further more, the debug became equally unstable in vs2019 as in Rider since RC1. They both exhibit exactly the same issues. So I don't think it has something to do with the IDE itself.

Something else I have seen since RC1 is, with these super large tests if they sometimes work I see strange things in the HEAP after teardown. Some of my objects are still floating around, but the numbers are not right. For example I spawned 2K nodes, and the HEAP shows that there is exactly ONE still in the heap after teardown. That just does not make sense. I have been doing this leak debugging for a while now and since RC1 the large tests produce snapshots that I cannot understand. In smaller (that are still large) runs everything works.

hoyosjs commented 4 years ago

@dotnet/dotnet-diag as the last couple points seem worth investigating

sdmaclea commented 4 years ago

strange things in the HEAP after teardown I have been doing this leak debugging for a while now and since RC1 the large tests produce snapshots that I cannot understand

Could this be related to the addition of the pinned object heap?

/cc @Maoni0

mangod9 commented 4 years ago

Is there a repro for this?

tactical-drone commented 4 years ago

The .net5 Release seems to have fixed these issues I had.

I can now debug again, like I could before.

Good job men!

hoyosjs commented 4 years ago

Glad to hear the issues have gone away so far. I'll close this, but feel free to ping again if the behavior surfaces again.

dotnet / runtime