dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/

Intermittent hang/deadlock in .net core 5.0.400 while debugging #58471

Closed - jeff-simeon closed this issue 2 years ago

jeff-simeon commented 2 years ago

Description

This is a duplicate of https://github.com/dotnet/runtime/issues/42375 as far as symptoms and behavior go, but I am still encountering the exact same symptoms on 5.0.400. I can reproduce this on macOS and Windows.

When debugging, our dev team encounters sporadic hangs (about 30% of the time). There does not seem to be any specific reproducible pattern of when in the program execution the hang occurs. When it happens, the diagnostics logger stops updating

image

and I cannot break or terminate the program:

image image

If I try to dotnet trace collect on a hung process, dotnet trace hangs as well.

image
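
For reference, the trace attempt is a command of roughly this shape - a sketch, with 12345 standing in for the hung process's id:

:: pid 12345 is illustrative
dotnet trace collect --process-id 12345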

I have tried taking and analyzing a memory dump using wpr as described here (https://stackoverflow.com/questions/68746658/net-core-application-cpu-hang#comment121755761_68746658), but I have not been able to find anything meaningful.

Configuration

Reproduced on 5.0.400 on macOS and Windows, in both Visual Studio and Rider.

Regression?

This issue seems to have started when we upgraded from netcoreapp3.1 to net5.0

Other information

The amount of logging and the number of asynchronous operations seem to affect how prevalent the issue is. For example, turning down the log level makes the issue happen about 20% of the time instead of 30% of the time.

dotnet-issue-labeler[bot] commented 2 years ago

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

ghost commented 2 years ago

Tagging subscribers to this area: @tommcdon See info in area-owners.md if you want to be subscribed.

hoyosjs commented 2 years ago

Is there any way you can share a dump from when everything seems stuck (of either the debugger or the target process), or an ETW trace with CPU stacks? If you go the dump route, we'll likely need one dump of MSVSMON and one of your app.
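
For the ETW option, something along these lines works - a sketch using the built-in WPR "CPU" profile (the output file name is just an example):

wpr -start CPU
:: ...reproduce the hang and let it sit for ~30 seconds...
wpr -stop hang-cpu.etl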

jeff-simeon commented 2 years ago

@hoyosjs - sure - how would you like me to deliver it to you? Do you have a preference of Mac vs Windows, Rider vs Visual Studio?

hoyosjs commented 2 years ago

You can open a feedback ticket on https://developercommunity.visualstudio.com/ and attach them there if possible (that ensures proper deletion of customer data), then post the link here. Windows with VS would probably be the easiest for us to examine.

jeff-simeon commented 2 years ago

@hoyosjs - all set - https://developercommunity.visualstudio.com/t/intermittent-hangdeadlock-in-net-core-50400-while/1519798

jeff-simeon commented 2 years ago

Any update @hoyosjs ?

jeff-simeon commented 2 years ago

@hoyosjs bump...

hoyosjs commented 2 years ago

Sorry @jeff-simeon, things are building up around the .NET 6 release and with the long weekend I haven't had a chance to get to this yet. I'll try my best to sink some time into the analysis this week and come back to you.

jeff-simeon commented 2 years ago

thanks @hoyosjs - this is a really painful experience for us

jeff-simeon commented 2 years ago

@hoyosjs any luck? thanks very much

hoyosjs commented 2 years ago

Hey @jeff-simeon. I've started taking a look; it took most of today. I still don't have a good sense of why it's locked up. I can see there were three threads sending thread-startup notifications; each sends its event over to the debugger side, and then all threads stop and wait for the debugger to signal that it's OK to continue. The runtime never got the event to continue, so there isn't much more to learn on that side. I still need to look at the debugger side, and I'm thinking about how to do that. Given that it repros on both macOS and Windows, and in both Rider and VS, I highly doubt the bug is in those layers (assuming they're all the same bug). I just need to find a way to see what's happening without having a repro.

What I see:

jeff-simeon commented 2 years ago

Thanks @hoyosjs

Is there anything I can do to help? Or anything I can do to narrow down the cause at runtime while debugging?

we really would love to get this resolved.

jeff-simeon commented 2 years ago

@hoyosjs - any thoughts?

hoyosjs commented 2 years ago

Sorry @jeff-simeon; I took a look the other day, but since then I haven't been able to spend much time on it, as .NET 6 is about to ship and it has required my full attention. Without a repro I expect progress on this one to be slow. Not sure how big your project is, but does it only happen in one project, or in anything you try? The only thing I can think of that could help without a repro is logging.

jeff-simeon commented 2 years ago

This is the core software for our product, so it puts us in an untenable situation.

This happens randomly about 30% to 50% of the time we debug.

Should we downgrade to a version of .NET that works? It would be a shame to determine that .NET 5.0 isn’t suitable for use.

hoyosjs commented 2 years ago

I understand it's painful, and I'm sorry it got to this point. 30-50% is definitely REALLY high; I am surprised we haven't gotten any other reports of something like this. Looping in @dotnet/dotnet-diag to see if someone has more time to check this one. The problem is that most threads are in Thread::WaitSuspendEventsHelper waiting on m_DebugSuspendEvent. The only place where we set it is ReleaseFromSuspension, called from SysResumeFromDebug. The only way to end up there is DB_IPCE_CONTINUE or DB_IPCE_DETACH_FROM_PROCESS being received on the DebuggerRCThread. That thread didn't report any messages failing to get passed down.

jeff-simeon commented 2 years ago

We really appreciate any help you can provide.

I am surprised that this isn't more widespread as well. I have to think it has to do with the highly asynchronous nature of our software. It processes many hundreds of HTTP calls to remote APIs in parallel and tracks thousands of TaskCompletionSources to orchestrate the concurrency. The debugger hang seems to happen when the process is spawning off all these tasks and then WhenAll'ing on them.

I’m not sure if that helps shed any light on where we can look.

davmason commented 2 years ago

@hoyosjs do we leave the debugger logging enabled in release builds? It seems like logs from both the left and right sides would be helpful here.

hoyosjs commented 2 years ago

I'd say that's the only thing that could really help. I haven't gone fishing yet for which stress log statements could help us, or whether a private build will be needed.

davmason commented 2 years ago

I took a quick look; we have more stress log statements than I expected, and they might be enough.

@jeff-simeon could you collect more data? We're probably going to have to iterate a bit on exactly what data to collect; usually we like to have a local repro we can debug ourselves. The logging should be enough to get to the bottom of this, but we may have to go back and forth, and potentially even ask you to run a privately built runtime, before we have the right set of data.

For right now, we should start with the StressLog data, because it doesn't require any private builds. StressLog is an in-memory circular buffer we use to collect additional data for hard-to-debug scenarios. Setting the following environment variables before startup of the debuggee process (testhost.exe) will tell it to create as large a log as possible and collect only debugger events:

set COMPlus_StressLog=1
set COMPlus_LogFacility=8200
set COMPlus_LogLevel=10
set COMPlus_StressLogSize=80000
set COMPlus_TotalStressLogSize=8000000

Then, with all of these set, run your repro, and once it hangs, collect another dump and send it to us. We can extract the logging with SOS, our debugger extension. Depending on how curious you are, you're free to look at the logs yourself by loading the dump in WinDbg and running the !dumplog command; it will create a StressLog.txt file in that directory with the StressLog contents.
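
If you want to pull the log yourself, the rough sequence is something like this - a sketch that assumes the dump is open in WinDbg and that dotnet-sos has been installed once so WinDbg can load SOS for .NET Core dumps:

:: one-time setup (sketch; exact tool versions don't matter much)
dotnet tool install --global dotnet-sos
dotnet-sos install
:: then, in WinDbg with the hang dump open, run:
::   !dumplog
:: which writes StressLog.txt to the current directory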

There should be lots of CORDB statements:

63bc  13.513426000 : `CORDB`              D::SendIPCEvent DB_IPCE_SYNC_COMPLETE to outofproc appD 0x0,
63bc  13.513425400 : `CORDB`              GetIPCEventSendBuffer called in SendSyncCompleteIPCEvent
63bc  13.513425100 : `CORDB`              D::SSCIPCE: sync complete.
63bc  13.513424900 : `CORDB`              D::SC: suspension complete
63bc  13.513424700 : `CORDB`              DRCT::ML:: wait set empty after sweep.
1954  13.513412400 : `CORDB`              D::SSCIPCE: Calling IsRCThreadReady()
1954  13.513198100 : `CORDB`              D::TART: Trapping all Runtime threads.

jeff-simeon commented 2 years ago

Thanks - I checked out the dmp file in windbg but still don't see any red flags that I can recognize. I'd be very interested in learning from what you find.

Here is the new ticket link: https://developercommunity.visualstudio.com/t/intermittent-hangdeadlock-in-net-core-50400-while-1/1533767

davmason commented 2 years ago

@jeff-simeon I don't see the dump attached to the new issue. It's entirely possible I'm doing something wrong, can you verify if it succeeded in uploading so I can figure out if the problem is on my end?

jeff-simeon commented 2 years ago

...I see it's not there as well... it seems like something is broken with the site.

I tried to upload again, this time by adding a comment with an attachment. Please let me know if you see it now.

Thanks.

davmason commented 2 years ago

Thanks, I can see the file is uploaded now. We will take a look; I'm not sure exactly when, but hopefully early this week.

jeff-simeon commented 2 years ago

Thanks very much @davmason

hoyosjs commented 2 years ago

Just wanted to let you know @jeff-simeon that I haven't forgotten about this. I am looking at your new dump. I don't have the debugger side of the log, but the in-process control is saying what I suspected: the total time reported in your app is ~410 seconds, of which 300 seconds is spent failing to suspend the runtime. The thing is, I am trying to see why we can't suspend the runtime. As I said last time, there are threads that are just trapped trying to pulse the GC. To do so they wait for the debugger event to be set, but that only gets toggled in ReleaseFromSuspension; in other words, essentially we are waiting for the debugger to tell us to continue. I'll continue taking a look.

jeff-simeon commented 2 years ago

Thanks very much @hoyosjs - let me know if I can do anything to assist.

jeff-simeon commented 2 years ago

@hoyosjs - do you want me to capture msvsmon as well?

hoyosjs commented 2 years ago

@jeff-simeon that would be really helpful. To do this, set the environment variables that were mentioned before, like:

set COMPlus_StressLog=1
set COMPlus_RSStressLog=1
set COMPlus_LogFacility=8200
set COMPlus_LogLevel=10
set COMPlus_StressLogSize=80000
set COMPlus_TotalStressLogSize=8000000

before opening VS from the command line (probably something like devenv <solution>); then you can capture dumps of MSVSMON and your test process.
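
One way to grab the dumps once it's stuck is Sysinternals procdump - a sketch; the process names below assume the usual testhost.exe and msvsmon.exe, so adjust if yours differ:

:: full-memory dumps of both sides of the hang
procdump -ma testhost.exe testhost.dmp
procdump -ma msvsmon.exe msvsmon.dmp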

Another thing - this process seems to be in a weird state. Is it possible for you - in case it's needed - to use a special build of the runtime with more logging?

And thanks for working with us on this.

jeff-simeon commented 2 years ago

Sure - and thank you for your help @hoyosjs

Just uploaded new dumps of both testhost.exe and msvsmon.exe.

We'd be happy to reproduce on a special build of the runtime if it will help.

hoyosjs commented 2 years ago

Hey @jeff-simeon. Did you upload them to https://developercommunity.visualstudio.com/t/intermittent-hangdeadlock-in-net-core-50400-while-1/1533767 ? I can't see any recent uploads there.

jeff-simeon commented 2 years ago

@hoyosjs - yes I uploaded a file "dumps.zip" with both dmp files a few days ago. Do you not see it?

I just uploaded them again as a new comment now...

jeff-simeon commented 2 years ago

Any update @hoyosjs ?

Thanks.

jeff-simeon commented 2 years ago

@davmason @hoyosjs any updates? We would really appreciate your assistance resolving or working around this bug. It's forcing our development team to consider downgrading to an earlier version of .NET.

hoyosjs commented 2 years ago

I am looking at it. Missed the ping. I still don't have an answer, but if I can find something I'll ping this thread back. Sorry it's taking so long.

jeff-simeon commented 2 years ago

Thanks so much

ascott18 commented 2 years ago

I also experience this issue, usually a couple times a day, with both .NET 5 and .NET 6-rc1, across a few different apps, all ASP.NET Core apps, all on Windows.

Is there anything I can provide to help figure this out?

jeff-simeon commented 2 years ago

If it helps, this seems like it is probably a GC-related bug. Whenever it hangs, there is always a GC event at that time (see the yellow dot)

image

and another

image

This happens somewhere between 10 and 20 times per day, per developer. @hoyosjs @davmason - is there any progress on this?

hoyosjs commented 2 years ago

There's been some good progress. I don't have a solution yet, but I have something that might be a workaround - please let me know if it doesn't work. Set the environment variable DOTNET_TieredCompilation to 0. You can do this just for debugging by right-clicking the project, choosing Properties, selecting the Debug tab, and setting the env var there. The other possibility is to add a PropertyGroup with <TieredCompilation Condition="'$(Configuration)' == 'debug'">false</TieredCompilation> to the project or props files for the projects that are showing this issue.
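
If it's easier, you can also apply the environment-variable route by setting it in a command prompt and launching VS from there, so everything it spawns inherits it - a sketch, with the solution name as a placeholder:

:: disable tiered compilation for this debug session only
set DOTNET_TieredCompilation=0
devenv MySolution.sln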

jeff-simeon commented 2 years ago

Thanks I’ll try that. In the property group example it looks like you’re not setting the property to any actual value - is that correct?

hoyosjs commented 2 years ago

Sorry, it's meant to be false

jeff-simeon commented 2 years ago

0 or false? It looks like you updated the example to be false.

hoyosjs commented 2 years ago

Yeah, false. I made a mistake. The environment variable is 0, the property is false.

jeff-simeon commented 2 years ago

Unfortunately, the workaround doesn't seem to work. I added the property to our Directory.Build.props file (screenshot), but it is still hanging (screenshot).

hoyosjs commented 2 years ago

Thanks for getting back so quickly @jeff-simeon. Sorry this has dragged on for so long - I know this must be painful for your team. I'm surprised the suggested workaround didn't work. From your dumps and logs I was able to get a lot more information, which you can read further down in this comment if you are interested. I thought that the interaction between the debugger and tiered compilation was the root cause of your issue, and I still think we can provide a good workaround while we get the fix into a servicing release. Meanwhile, would you mind collecting a dump with the suggested workaround applied once you reach the unresponsive state? I can take a look at it in the next couple of days to see if there's another edge case I'm not accounting for, or if it's a different bug, so that we can get this solved for you ASAP.

Investigation

The debugger received an event and is waiting for the runtime to stop all threads. The runtime side (your app) is waiting for all threads to reach suspension. All of your application dumps show the same state: all threads are in what we call preemptive mode - where managed code can't run - which is one of the requirements for suspension. There's another requirement, though, that one of your threads fails to meet in all three dumps: InForbidSuspendForDebuggerRegion is true for that thread, and that's blocking the whole thing from moving forward. The thread entered this state in StopAndDeleteAllCallCountingStubs:

https://github.com/dotnet/runtime/blob/6d98f6dcc6e924a265f591f2e4d69f2e84d04807/src/coreclr/src/vm/callcounting.cpp#L955-L960

I suspect that when it entered the holder it didn't have TS_DebugSuspendPending in the thread state, so it acquired the lock and set the forbid-suspend-for-debugger region on the thread. Then it tries to suspend the runtime. By this time the debugger had likely already started enumerating threads and marking them for suspension (in the dump, I can see the thread marked for debugger suspension). The debugger is unable to succeed in SweepThreadsForDebug and therefore never sends the sync-complete event. The thing is, InForbidSuspendForDebuggerRegion will never be set back to false until we exit StopAndDeleteAllCallCountingStubs, and that will never succeed because the debugger already started the suspension.

@kouvel, do you expect the workaround to have worked? I'll start an email thread to discuss possible solutions to this beyond just a workaround.

hoyosjs commented 2 years ago

@jeff-simeon, also, if possible, can you please set up the logging with

set COMPlus_StressLog=1
set COMPlus_RSStressLog=1
set COMPlus_LogFacility=8240
set COMPlus_LogLevel=10
set COMPlus_StressLogSize=80000
set COMPlus_TotalStressLogSize=8000000

in case you have the time to collect this? We are working on the definitive solution, and there are some events there that might help us see why it is happening in your case, in terms of timing. I appreciate all the help here to improve this. I really hope we can make this right so that you can get your productivity back without needing to roll back, and so that you have an upgrade story going forward.

jeff-simeon commented 2 years ago

Sorry @hoyosjs - been out of pocket for a conference the last few days - will do this as soon as I'm back

hoyosjs commented 2 years ago

No worries - I just want to make sure this doesn't fall through the cracks.

jeff-simeon commented 2 years ago

@hoyosjs - I just uploaded new dump files. The developercommunity site seems to function only sometimes, so I'm including screenshots here. Please let me know if you see them.

image

Also, I might be crazy, but I think I just noticed a pattern. It seems like the problem occurs when I context switch and bring a new window to the foreground in front of Visual Studio while the program or tests are running.

I must have run our acceptance tests in the debugger 10+ times today with no issue and then I brought up a browser while running and the issue suddenly reproduced. I ran again, brought up a browser, and the issue reproduced again. Then I left VS focused while running and the issue did not occur. I repeated this pattern 4 or 5 more times with the same result.

Thanks