dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

Intermittent hang/deadlock in .net core 5.0.400 while debugging #58471

Closed jeff-simeon closed 2 years ago

jeff-simeon commented 3 years ago

Description

This is a duplicate of https://github.com/dotnet/runtime/issues/42375 as far as symptoms and behavior go, but I am still encountering the exact same symptoms on 5.0.400. I can reproduce this on macOS and Windows.

When debugging, our dev team encounters sporadic hangs (about 30% of the time). There does not seem to be any reproducible pattern for where in program execution the hang occurs. When it happens, the diagnostics logger stops updating

![image](https://user-images.githubusercontent.com/87453236/131579797-4b7d6272-c255-4a91-aebb-de65676b611a.png)

and I cannot break or terminate the program:

![image](https://user-images.githubusercontent.com/87453236/131579812-5c2a8327-a295-4d11-a8be-7086551f4bc9.png) ![image](https://user-images.githubusercontent.com/87453236/131579820-1a1dd1af-e124-404f-8c75-ca5282a98b4a.png)

If I try to `dotnet trace collect` on a hung process, `dotnet trace` hangs as well.

![image](https://user-images.githubusercontent.com/87453236/131581529-b21b7cc3-405d-4f60-8df7-0936c6a1ffd7.png)

I have tried taking and analyzing a memory dump using wpr as described [here](https://stackoverflow.com/questions/68746658/net-core-application-cpu-hang#comment121755761_68746658), but I have not been able to find anything meaningful.
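When `dotnet trace` hangs along with the target, it can still be worth trying to capture a full process dump for offline analysis. A sketch of one approach using the standard diagnostic CLI tools (`<pid>` is a placeholder; note that `dotnet-dump` may also block if the runtime's diagnostic channel is wedged, in which case an OS-level tool such as Sysinternals procdump on Windows captures a dump without the runtime's cooperation):

```shell
# One-time install of the diagnostic global tools.
dotnet tool install --global dotnet-trace
dotnet tool install --global dotnet-dump

# List candidate .NET processes to find the hung one's PID.
dotnet-trace ps

# Try a full dump; unlike a trace session, this is a one-shot capture.
dotnet-dump collect --process-id <pid> --type Full

# Windows fallback that does not need runtime cooperation:
#   procdump -ma <pid>
```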

Configuration

Reproduced on 5.0.400 on macOS and Windows, in both Visual Studio and Rider.

Regression?

This issue seems to have started when we upgraded from netcoreapp3.1 to net5.0

Other information

The amount of logging and the number of asynchronous operations seem to affect how prevalent the issue is. For example, turning down the log level makes the issue happen about 20% of the time instead of 30%.

hoyosjs commented 2 years ago

That's... an interesting observation. If you have the Locals window open, maybe that could cause a funceval. And no, somehow I still don't see them :( The last comment I see is from 9/29.

jeff-simeon commented 2 years ago

Gotcha - @hoyosjs, since it seems like the developercommunity site doesn't function, is there another way you'd suggest I transmit them? I can put them on OneDrive and email a link if you'd like

hoyosjs commented 2 years ago

DevCommunity is preferred because of GDPR compliance and such. If you feel that there's no sensitive data and you feel like you are getting pain from communication, you can get my email from my profile. I'll update this thread with results for the visibility of the community, unless there's private information that comes from it.

jeff-simeon commented 2 years ago

@hoyosjs - It does look like my comment and upload posted.

hoyosjs commented 2 years ago

Ah, I see. You uploaded them to the old issue. You opened a first thread, then a second one. I closed the older thread and redirected it to the newer one, as it had more files.

For future reference, the one that is still open that I was looking at was this.

I've downloaded them and can see they contain logs. I'll check them in the next couple of days and get back to you.

jeff-simeon commented 2 years ago

thanks @hoyosjs

hoyosjs commented 2 years ago

@jeff-simeon We continued looking at this and have a new theory of what's causing the deadlock. I also saw that the suggested workaround, although properly applied, didn't help in the way I thought it would; I'm sorry about that.

jeff-simeon commented 2 years ago

All good @hoyosjs

if you have any other workarounds you can suggest we would greatly appreciate it

joakimriedel commented 2 years ago

@jeff-simeon

> Also, I might be crazy, but I think I just noticed a pattern. It seems like the problem occurs when I context switch and bring a new window to the foreground in front of Visual Studio while the program or tests are running.
>
> I must have run our acceptance tests in the debugger 10+ times today with no issue and then I brought up a browser while running and the issue suddenly reproduced. I ran again, brought up a browser, and the issue reproduced again. Then I left VS focused while running and the issue did not occur. I repeated this pattern 4 or 5 more times with the same result.

You're not crazy, I've noticed the same pattern. Sent some new dumps and I think there's some progress being made on this issue, let's hope for a resolution soon.

jeff-simeon commented 2 years ago

hi @hoyosjs - any update on a workaround/resolution?

hoyosjs commented 2 years ago

(Sorry - this seems to have stuck in my outbox limbo. That's what I get for trying to reply to GitHub on my email.)

Hey @jeff-simeon. I think we might have an idea of what is causing this issue. While it might take a while for me to make sure I am on the right trail, there's something that might help as a workaround, and it will definitely be easier for you to confirm whether it helps than anything I can do on my side.

I was talking to @davmason and he realized that my suggestion to disable tiering was not complete. There's a feature in the profiler that uses the same mechanism, and I believe it plays a part in the issue you're seeing. So in addition to setting DOTNET_TieredCompilation=0/COMPLUS_TieredCompilation=0, you should set COMPlus_ProfApi_RejitOnAttach=0 and see if it helps.
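For anyone wanting to try this, a sketch of setting the variables in a shell before launching the app (the COMPlus_ prefix is the one .NET 5 recognizes; the DOTNET_ prefix was added as an alias in .NET 6):

```shell
# Disable tiered compilation and the profiler's ReJIT-on-attach feature
# for the debuggee, then launch it from the same shell so it inherits them.
export COMPlus_TieredCompilation=0
export COMPlus_ProfApi_RejitOnAttach=0
dotnet run
```

When launching from Visual Studio or Rider instead, the same variables can be set in the project's launchSettings.json under `environmentVariables`, or in the IDE's run configuration.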

amandal1810 commented 2 years ago

@hoyosjs I had already tried this (setting both COMPlus_TieredCompilation and COMPlus_ProfApi_RejitOnAttach to 0 in system environment variables) after seeing this comment. But it did not solve the issue. :(

amandal1810 commented 2 years ago

Ok! I was fumbling around and I think I fixed the issue, but I cannot confirm exactly what fixed it. Maybe someone else can try what I did and confirm whether it works. Note that I am using Visual Studio 2022 17.0.1.

After the fix:

What I did:

Output of `dotnet --info`:

.NET SDK (reflecting any global.json):
 Version:   6.0.100
 Commit:    9e8b04bbff

Runtime Environment:
 OS Name:     Windows
 OS Version:  10.0.22000
 OS Platform: Windows
 RID:         win10-x64
 Base Path:   C:\Program Files\dotnet\sdk\6.0.100\

Host (useful for support):
  Version: 6.0.0
  Commit:  4822e3c3aa

.NET SDKs installed:
  3.1.415 [C:\Program Files\dotnet\sdk]
  5.0.403 [C:\Program Files\dotnet\sdk]
  6.0.100 [C:\Program Files\dotnet\sdk]

.NET runtimes installed:
  Microsoft.AspNetCore.App 3.1.21 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 5.0.12 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.AspNetCore.App 6.0.0 [C:\Program Files\dotnet\shared\Microsoft.AspNetCore.App]
  Microsoft.NETCore.App 3.1.21 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 5.0.12 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.NETCore.App 6.0.0 [C:\Program Files\dotnet\shared\Microsoft.NETCore.App]
  Microsoft.WindowsDesktop.App 3.1.21 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]
  Microsoft.WindowsDesktop.App 5.0.12 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]
  Microsoft.WindowsDesktop.App 6.0.0 [C:\Program Files\dotnet\shared\Microsoft.WindowsDesktop.App]

To install additional .NET runtimes or SDKs:
  https://aka.ms/dotnet-download

Tagging the other open issue here for reference

jeff-simeon commented 2 years ago

@hoyosjs - confirmed this is not resolving the issue for us. What is the status here, please?

noahfalk commented 2 years ago

Hi @jeff-simeon, I am a coworker of @hoyosjs. He has been out on vacation for the Christmas holidays, but now that I am back from my own holiday vacation I'm going to fill in for him and help get this moving. I assisted with some of the earlier investigation, so I think I am already mostly up to speed on this. My understanding so far is that:

Dump 1 (thread 17):

02 0000006f`0467f570 00007ffc`dffba656     coreclr!ThreadSuspend::SuspendEE+0x228 [D:\workspace\_work\1\s\src\coreclr\src\vm\threadsuspend.cpp @ 6097] 
03 0000006f`0467f710 00007ffc`dfe5f3d9     coreclr!CallCountingManager::StopAndDeleteAllCallCountingStubs+0xa9182 [D:\workspace\_work\1\s\src\coreclr\src\vm\callcounting.cpp @ 960] 

Dump 2 (thread 30):

07 00000006`6b49f600 00007ffc`cd57a656     coreclr!ThreadSuspend::SuspendEE+0x449 [D:\workspace\_work\1\s\src\coreclr\src\vm\threadsuspend.cpp @ 6236] 
08 00000006`6b49f7a0 00007ffc`cd41f3d9     coreclr!CallCountingManager::StopAndDeleteAllCallCountingStubs+0xa9182 [D:\workspace\_work\1\s\src\coreclr\src\vm\callcounting.cpp @ 960] 

Dump 3 (thread 17):

03 000000ed`b157f670 00007ffe`57baa656     coreclr!ThreadSuspend::SuspendEE+0x283 [D:\workspace\_work\1\s\src\coreclr\src\vm\threadsuspend.cpp @ 6144] 
04 000000ed`b157f810 00007ffe`57a4f3d9     coreclr!CallCountingManager::StopAndDeleteAllCallCountingStubs+0xa9182 [D:\workspace\_work\1\s\src\coreclr\src\vm\callcounting.cpp @ 960] 

Dump 4 (thread 27):

06 00000004`f1778030 00007ffc`85992b0a     coreclr!CrstBase::Enter+0x5a [D:\workspace\_work\1\s\src\coreclr\src\vm\crst.cpp @ 330] 
07 (Inline Function) --------`--------     coreclr!CrstBase::AcquireLock+0x5 [D:\workspace\_work\1\s\src\coreclr\src\vm\crst.h @ 187] 
08 (Inline Function) --------`--------     coreclr!CrstBase::CrstAndForbidSuspendForDebuggerHolder::{ctor}+0x5db [D:\workspace\_work\1\s\src\coreclr\src\vm\crst.cpp @ 819] 
09 (Inline Function) --------`--------     coreclr!MethodDescBackpatchInfoTracker::ConditionalLockHolderForGCCoop::{ctor}+0x5db [D:\workspace\_work\1\s\src\coreclr\src\vm\methoddescbackpatchinfo.h @ 134] 
0a 00000004`f1778060 00007ffc`85991f6c     coreclr!CodeVersionManager::PublishVersionableCodeIfNecessary+0x8ba [D:\workspace\_work\1\s\src\coreclr\src\vm\codeversion.cpp @ 1762] 

In the meantime I am working on a fix for the portions of the bug we do understand from the dumps you already provided. However, the fact that disabling both tiered compilation and ReJIT didn't help suggests our understanding of the issue is incomplete, and anything I do to fix the part we do understand isn't going to be sufficient to fully solve this for you.
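For readers following along, the stacks above fit the shape of a classic lock-ordering deadlock: one thread begins an EE suspension while needing a lock that a second thread already holds, and that second thread cannot proceed until the suspension completes. A minimal illustration of the pattern (plain Python, not CoreCLR code; the lock names are hypothetical stand-ins for the runtime's internal locks):

```python
import threading
import time

# Stand-ins for two runtime-internal locks (names are illustrative only).
suspension_lock = threading.Lock()
code_version_lock = threading.Lock()

def suspending_thread():
    # Roughly like a thread driving an EE suspension: it holds the
    # suspension lock and then needs the code-versioning lock.
    with suspension_lock:
        time.sleep(0.1)
        code_version_lock.acquire()  # blocks forever: other thread holds it

def publishing_thread():
    # Roughly like a thread publishing versionable code: it holds the
    # code-versioning lock and then needs the suspension lock.
    with code_version_lock:
        time.sleep(0.1)
        suspension_lock.acquire()    # blocks forever: other thread holds it

t1 = threading.Thread(target=suspending_thread, daemon=True)
t2 = threading.Thread(target=publishing_thread, daemon=True)
t1.start()
t2.start()
t1.join(timeout=1.0)
t2.join(timeout=1.0)
if t1.is_alive() and t2.is_alive():
    print("deadlock: each thread is blocked waiting for the other's lock")
```

A real fix has to impose a consistent lock order or make one side back off before blocking; the sketch only shows why both threads stop making progress at once.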

Next steps:

jeff-simeon commented 2 years ago

Sorry for the delayed reply @noahfalk. Ultimately, we decided to move to Rider on macOS for development, along with AVD VMs where Windows is strictly required. While expensive, the cost of the hardware is nominal compared to the productivity lost or the effort of downgrading to an earlier version of dotnet.

I still would like to help get you the information you need, but it will take some time to get a new dev environment set up where I can reproduce.

noahfalk commented 2 years ago

No worries on the timing at all @jeff-simeon, and sorry that it came to a new hardware purchase just to avoid this issue :( I certainly appreciate any time you choose to spend helping diagnose the issue, whenever that is.

ghost commented 2 years ago

Tagging subscribers to this area: @tommcdon See info in area-owners.md if you want to be subscribed.

Author: jeff-simeon
Assignees: -
Labels: `area-Diagnostics-coreclr`
Milestone: 7.0.0

tommcdon commented 2 years ago

Thanks to @kouvel, https://github.com/dotnet/runtime/pull/67160 should have fixed the issue in 7.0. @noahfalk is working on a 6.0-servicing version of the fix.

noahfalk commented 2 years ago

The fix @kouvel made so far addresses the issues caused by TieredCompilation and RejitOnAttach. Some of the folks on this thread said that was sufficient to resolve the issue for them, but others said they could still reproduce deadlocks after those two features were disabled. We did identify a likely third culprit that is theorized to produce a similar-looking deadlock, but it hasn't been fixed yet.

kouvel commented 2 years ago

I've looked over a couple of options for the issue theorized to remain after TieredCompilation and RejitOnAttach are disabled, though it's not yet clear what is actually causing that deadlock. There is a promising option for the theorized issue, with more still to look at.

tommcdon commented 2 years ago

Closing via https://github.com/dotnet/runtime/pull/69121