Closed cshung closed 4 years ago
Thanks @janvorli for the investigation, he pointed out that problem is the code is doing something inappropriate according to the contracts. In particular, UnlockedLoaderHeap::UnlockedAllocMem indicated it is a GC_NOTRIGGER region yet later on we did a GCX_MAYBE_COOP_NO_THREAD_BROKEN later on and caused grief.
Later on we discussed around and identified 4 possible solutions:
Option 1 is very tedious, if not impossible. EtmDummy.h indicates we have about 400 different call sites for WriteEvent.
Option 2 is also tedious, there were attempts trying to detect whether or not StackWalk is safe. It is hard to do in general, and is also error prone.
Option 3 is interesting - so far the only offender I am seeing (in my simple stack) is ReadyToRunInfo::GetMethodDescForEntryPoint trying to do a hash table lookup, and for reason that I don't understand, it enters into cooperative mode, and screw us up. It could be the only case, or it could a tip of the iceberg, I have no idea for now. If we could make stack walking never need to enter cooperative mode, we can ensure a robust implementation of EventPipe. The great thing about this approach is that there is one big component to analyze, but not 400 call sites.
Option 4 is interesting too, if we use a different stack walker, then we are break free from the constraint of the current stack walker. It also allow the EventPipe traces to get native frames for analysis. It will likely not hit the time we need for .NET Core 3, and there might be other reasons that I don't know to introduce a native stack walker into the runtime.
@jkotas @fadimounir - for the ready to run use case @AaronRobinsonMSFT - for what might that RCW reason be. @sywhang @noahfalk @jorive @vancem @tommcdon
@cshung This is a part of the code I have not had a chance to fully process. There are a lot of dragons here with respect to the RCW cache with the STA and GC. If you look through the hash.cpp
there are a myriad of places that mention the RCWCache scenario is broken. I couldn't really answer the question only to say that when we detect an RCW is being created on an STA we take the opportunity to clean up unused STA objects in the RCWCache. This means we could induce a GC that could be the reason, but that is speculation.
for reason that I don't understand, it enters into cooperative mode
m_entryPointToMethodDescMap
is lock-free hash table. It uses the GC mode for synchronization. The comment in front of class HashMap
explains how it works.
here were attempts trying to detect whether or not StackWalk is safe.
Yep, it was stress-test driven development. We have been running stack walk stress for months and every time we hit a crash, assert or hang in it, we have tried to fix by improving the detection of when it is not safe to do a stackwalk.
so far the only offender I am seeing (in my simple stack)
There are number of these lock-free hash tables and other complex code reachable from stack walk. If you run the tests long enough, you should eventually see all more of them.
The great thing about this approach is that there is one big component to analyze
It is about as tedious as Option 2. You have one big component to analyze for both Option 2 and Option 3. The difference between Option 2 and Option 3 is that Option 2 adds conditions around entrypoint to this big component that try to detect when it is dangerous to enter it; and Option 3 is changing the implementation of this component to eliminate the reasons when it is dangerous to enter it.
I think Option 3 is the cleanest, but it is pretty non-trivial.
Certainly from a design perpective, making something very primtive like stack walking NOT have any interactions with higher level concepts like the GC (or GC mode), would not nice, so I am not averse to that. It sounds like it is hard enough that we shoudl be considering alternatives.
I wanted to mention that the number of events in the runtime that would benefit from having a stack associated with the event is relatively small. Indeed we have this inventory, it is the ClrEtwAllMeta.lst file (all events with 'nostack' don't have stacks (the sense of the flag really should have been reversed, stack is much less common and you should have to opt into it).
In particular the one that caused the bug above, was added with the idea that it would be useful to track memory the loader allocated on your behalf but it has actually never been used. I woudl be OK if we simply remove the event (thus fixing this paritcular instance).
If we go the route of picking and choosing sites to have stack walks, we should really have some enforcement by the contract system (there should be a different call contact for those that do stack walks from those that do not, so we get failures (this assumes we do runs on debug CLRs with events truned on, I am assuming this is the case).
Finally I do note that we have a stackwalker for ETW that was GC_NOTRIGGER (see GetCurrentTheadsCallStack in eventtrace.cpp), but I think this is windows only, and frankly I would like to get rid of it, so best not rely on it.
I'm not sure how well ClrEtwAllMeta.lst is expressing our intent? There are only 6 events that explicitly opt in to having stacks, two of which are fusion (irrelevant), one is LoaderHeapAllocation (which we just decided wasn't valuable) and the other three are GarbageCollectionPrivate. However after that there is a sizable swath of events that aren't marked as 'stack' or 'nostack' and its unclear what the intent is for those.
I think a simple route could be declaring that the only runtime events which will get stacks from EventPipe are the ones explicitly marked 'stack', which is now just three relevant events. This reduces the auditing burden dramatically. If anyone else wants to add more we can fault them in as they are determined to be useful and safe. If we find ourselves doing that too frequently we could step back and re-evaluate the approach. How does that sound?
Longer term, Option 3 seems the best, but for the foreseeable future, I suspect the approach of only capturing stacks for a defined set of "safe" events would be acceptable.
@cshung please reference PR and close
Symptom:
When debugging a test case with an aggresive printing breakpoint set, the debuggee deadlocks.
Repro (x86, debug)
Run the test case in dotnet/runtime#12180 with WinDBG attached and set this aggressively logging breakpoint
bm CoreCLR!EventPipeBufferList::InsertTail "k;g"
. It will stop and deadlock.Analysis
It appears that the EE suspension triggered in thread 6 will not complete because thread 9 is waiting on the LoaderHeap lock while it is held by thread 0 which is already suspended.
@janvorli, can you please take a look?
Stacks