Closed davkean closed 4 years ago
Looks like the project is changing during close, we're sending on those changes to Roslyn - which is causing the project to close again due to pumping created a value. @lifengl thoughts on this? We probably shouldn't be sending on stuff to Roslyn during close, but any idea why it's been closed twice?
Okay, looked at the dump - the project isn't changing during close I read the stack wrong. This is what is happening:
1) Close is being called 2) This drains critical tasks 3) This calls LanguageServiceHost, which in turn calls Roslyn 4) Roslyn calls Lazy<>.Value which pumps (!) 5) This causes Solution.Close to be called 6) Which attempts to close again (wrapped in ExecuteSynchronously), which blocks
@Lifengl I'm not entirely sure why 6) blocks, I kinda assumed a reentrancy into http://index/?query=ProjectNode.Close&rightProject=Microsoft.VisualStudio.ProjectSystem.VS.Implementation&file=Package%5CNodes%5CProjectNode.cs&line=1236?
Anyway not sure what to do here - we should definitely avoid doing any work on Unload and I'll submit a PR to do an upfront check inside of a LoadProjectAsync - but there's still a race after that check.
I downloaded all the artifacts (in case the roslyn pdbs matter), and put them at \\mlangfs1\public\kevinpi\project-hang-archive.zip
not sure, I guess Solution.Close in the 6th step will still wait critical tasks to be drained. I wonder that we should prevent the message loop reentry caused by the Lazy here. Reentry can easily cause dead lock if anything on the top waits the bottom side to finish.
@lifengl How do you suggest we avoid reentrancy? Basically any wait or lazy init is going to pump...
VS threading library uses some similar to @Pilchie pointed out.
@lifengl At what point do you want to avoid reentrancy? It sounds like Close itself should be doing it.
Pumping messages and COM reentrancy can introduce some odd dead lock issues, because it can be a very random call made into the system at a random time. So if we can prevent that to happen, it reduces chances to run into dead locks. JTF library was created for that reason.
@lifengl Not sure I understand - where in the code should we stop pumping? And how?
The reason Lazy<> can pump message is it calls Monitor.Enter, which controls whether the thread need pumps message or not is the current SynchronizationContext (so it never pumps message inside a thread pool thread). To suppress the message pump, you need set the current SynchronizationContext and recovers after that. But we generally don't want to do that, when we run random other code. So, maybe the safer bet is not to use Lazy<> here.
You can check NoMessagePumpSynchronizationContext usage inside VS threading library to see how it suppresses message pumps (around lock() {}). I think that is what @Pilchie pointed out as well. Andrew is expert on this.
I don't think a statement of "don't call code that pumps" is valid - CPS is an open system and any code can run. Even an innocuous bug fix could cause us or Roslyn to start pumping here.
I'm also not sure we're the right place to prevent pumping - NuGet runs in this code path as well, what if it starts pumping?
Note: this issue is affecting reliability of Roslyn Jenkins runs (since VS is deadlocking on shutdown). This is clearly a fairly unpleasant pain point for the Roslyn dev team.
@davkean, i guess you are right. If the project.close is coming from a COM call in the background thread, it is more a problem the background thread code violates thread rules, than the message pump itself. I would like to take a look the dump file of this issue. Can you tell me where to find it?
I uploaded a dump here: https://drive.google.com/file/d/0ByISdSEZ09PBZWVaM2dOWThTR1U/view?usp=sharing
(dave thinks its the same issue).
Thanks, I took a look the dump. Somehow, our windbg extension didn't work for that dump, (maybe need update the extension?) Anyway, what I found:
1, I don't think solution.Close has been reentered, it is just called during a message pump for a local 'lock', which is very nasty issue. But the call is unrelated to any of the product routine, it is just initialized by thread 30, which is an automation test driving thread:
10 1ea1e598 10909c86 EnvDTE!DomainNeutralILStubClass.IL_STUB_CLRtoCOM(Boolean)+0x96 *** ERROR: Module load completed but symbols could not be loaded for Microsoft.VisualStudio.IntegrationTest.Utilities.dll 11 1ea1e5e8 7320ea56 Microsoft_VisualStudio_IntegrationTest_Utilities!Microsoft.VisualStudio.IntegrationTest.Utilities.InProcess.SolutionExplorer_InProc.CleanUpOpenSolution()+0x16e
Basically, this code violate VS thread rule. It must use JTF library to switch to the UI thread before making any COM calls, without doing that, it can call inside any random time inside the UI thread. Since it is inside an automation test code, I would consider this to be a defect of tests, not a product bug.
2, the solution code will end up to wait critical background tasks to finish. It has no idea that one task is actually running in the call stack below it, so it will never have a chance to finish (until solution is closed), so there is no reason we think we can add some logic inside solution close logic to prevent that. In this case, it is Microsoft.VisualStudio.ProjectSystem.LanguageServices.LanguageServiceHost+<>c__DisplayClass24_0.
Frankly, there is no reason to wait language service to finish all work before closing the project. The extra work to maintain the language service provides little value to the user, because they will be thrown away anyway. The reason the code registers itself to the critical task list is because it sometime accesses the project model, and it doesn't want the model becomes invalid at a random time, so it can crash. But the service does have a long waiting queue. I guess the code can cancel it quickly by listening project unloading event, or registers itself only when a request is being processed. But I guess it doesn't help for this bug. You can register the cancellation token, so the task project.close waits on can finish quickly, but that means the project can become invalid in the middle of your language code (which is now running on the UI thread) running. That is exactly what you want to prevent at the first place.
Anyway, put into simple words: there is nothing we can do in the solution close logic to help you get out of this. You can either push the language service code to run in the background thread (instead of switching to UI thread), and eliminates the chance of reentry problem. Or hope COM calls to follow threading rules, and treat this is a test defect.
This test ran into a deadlock while closing the project: https://github.com/dotnet/roslyn/pull/19863.