Closed Korporal closed 2 weeks ago
@dotnet/dotnet-diag
Hi @Korporal ! Thanks for the feedback. What you suggest seems plausible to build at first glance (though maybe not easy), and we've already got some debug APIs which allow a similar analysis for debuggers: https://docs.microsoft.com/en-us/dotnet/framework/unmanaged-api/debugging/icordebugthread4-getblockingobjects-method https://docs.microsoft.com/en-us/dotnet/framework/unmanaged-api/debugging/icordebugheapvalue3-getthreadowningmonitorlock-method https://docs.microsoft.com/en-us/dotnet/framework/unmanaged-api/debugging/icordebugheapvalue3-getmonitoreventwaitlist-method
A few other thoughts I have offhand: a) Its not clear to me how many .NET customers would prioritize this type of functionality? This issue may get some additional feedback to help and then UserVoice is also a good way to vote on things you think are interesting and see if others in the community feel the same way. b) There are many other synchronization primitives that customers could use from managed code so this feature would be limited to only the deadlocks that are produced via cyclic acquisition of monitor lock. I worry a bit that we'd do lots of work only to discover most of the issues encountered in real world apps weren't simple enough to be detected, but happy to keep an open mind if the community feedback suggests otherwise.
@gregg-miskelly - Do you happen to recall if VS still has its support for monitor lock deadlock detection and any feedback customers have given us on its effectiveness?
@noahfalk - Yes there are other primitives and initially I was considering scenarios involving only the lock
keyword (rudimentary Monitor locks). I can envisage a general purpose API though that could be leveraged (under the hood) for all of these primitives because the problem is the same in every case.
That API would maintain state date for each thread and each "lock" that has been acquired and is being waited for. Then before proceeding to wait in any way (on a primitive) the algorithm would analyze the state data to ascertain if that wait would be a deadlock - never able to terminate.
The only justification of course is that throwing an exception when this is detected is better than waiting and thus freezing (the relevant threads) in the AppDomain
.
I wonder what overhead this would cause. Taking an uncontrnded lock should be very fast and this would add some book keeping to that path.
I can envisage a general purpose API though that could be leveraged (under the hood) for all of these primitives because the problem is the same in every case. That API would maintain state date for each thread and each "lock" that has been acquired and is being waited for. Then before proceeding to wait in any way (on a primitive) the algorithm would analyze the state data to ascertain if that wait would be a deadlock - never able to terminate.
It's not clear how such an API would work without actual cooperation from synchronization primitives. You're probably end up with races between API reporting and actual blocking.
Would it be possible to make this also work for async
deadlocks? I suspect that many people are switching from blocking lock
s to async locks using e.g. SemaphoreSlim
. Even though there is no thread being blocked, you can still get a deadlock.
I understand that not having a thread associated with a synchronization primitive makes this harder, but I also think it might be useful.
An example:
var semaphore1 = new SemaphoreSlim(1, 1);
var semaphore2 = new SemaphoreSlim(1, 1);
async Task TaskA()
{
await semaphore1.WaitAsync();
await Task.Delay(500);
await semaphore2.WaitAsync();
semaphore2.Release();
semaphore1.Release();
}
async Task TaskB()
{
await semaphore2.WaitAsync();
await Task.Delay(500);
await semaphore1.WaitAsync();
semaphore1.Release();
semaphore2.Release();
}
await Task.WhenAll(TaskA(), TaskB());
@danmosemsft
I wonder what overhead this would cause. Taking an uncontrnded lock should be very fast and this would add some book keeping to that path.
Yes it will carry a cost and you're right in the case of acquiring an uncontested lock it would significantly raise that cost (though still very small overall). This is why I suggested making an AppDomain
option perhaps with a method SetDeadlockDetection(bool)
or something.
@mikedn - I haven't thought about the technical details yet and agree it is not something to be done without serious analysis the API's "database" would itself require some form of synchronization too, ideally though this could be a very briefly held spin lock.
I have done something like this before in the late 80's - on Stratus fault tolerant minis (running VOS a derivative of Multics) and used this to detect deadlock scenarios when acquiring (what are termed in VOS on 68000+ family) "lockwords" it worked but that was a simpler scenario in some ways (didn't have the variety of primitives and VOS schedules processes not threads).
@svick - That's a good test case you outline there! I don't think that the fact it's running async
has any bearing though and the same algorithm would work, after all even in that async
code we still have a thread trying to acquire a lock
(it's just a worker thread or perhaps IO completion thread - I think).
Actually the more I look at that code the more I now wonder if it is actually doing what it might seem to be doing but I'm not familiar with WaitAsync
so I suspect what I'm concerned about has already been ruled out by the designers.
@Korporal You still have a thread, but the issue is that that thread is not the owner of the lock.
For example, the four acquire attempts in my example (in the form of WaitAsync()
) could happen on four different threads. Or they could all happen on a single thread. And in both cases, you would still get deadlock.
@svick - I'm not familiar with this async
usage for these locks it looks rather interesting, I guess you're correct in that all that matters (for deadlock analysis) is the data about all currently held locks and current pending requests to acquire locks.
But how does that code ensure that Thread_1
(executing TaskA
) doesn't first get semaphore1
and then next get semaphore2
in TaskB
- since any thread could in principle execute the code? Thread_2
might next try to acquire semaphore2
in TaskA
and wait and there'd be no deadlock at this point.
I'm sure its solid and my unfamiliarity with this is my problem here!
I haven't thought about the technical details yet and agree it is not something to be done without serious analysis the API's "database" would itself require some form of synchronization too, ideally though this could be a very briefly held spin lock.
To be clear - I wasn't expecting exact implementation details, I was just wondering if something like this is possible without the cooperation of synchronization primitives. Maybe I do not understand how such an API might look like. Something like DeadlockDetection.NotifyBeginWait(object)
and DeadlockDetection.NotifyEndWait(object)
?
@mikedn
To be clear - I wasn't expecting exact implementation details, I was just wondering if something like this is possible without the cooperation of synchronization primitives. Maybe I do not understand how such an API might look like. Something like DeadlockDetection.NotifyBeginWait(object) and DeadlockDetection.NotifyEndWait(object)?
In principle I think it would be but it would be much more attractive to emebed the calls to the API inside existing synchronization classes (like Monitor). But one could write a set of replacement methods for these and have consumers use these instead.
For example a new static class SafeMonitor could be crafted that provides the same methods as Monitor and wraps calls to Monitor but also includes calls to the API.
Deadlock detection could be asynchronous for performance reasons. There could be a timer that checks all blocked threads every 5 seconds. SQL Server does it that way to keep deadlock detection fast. When deadlocks are detected the timer frequency is increased.
Deadlock detection would make certain kinds of apps much safer. A deadlock in production usually does not come alone. The thread pool is quickly overwhelmed and the application is down.
@mikedn - Yes, when I mentioned "an API" I wasn't trying to imply it would be directly called by developers but would for the basis for existing code, I totally agree that classes like Monitor and Mutex would be the ideal places to locate these calls.
adding @kouvel who owns threading primitives
@GSPP
Deadlock detection could be asynchronous for performance reasons. There could be a timer that checks all blocked threads every 5 seconds. SQL Server does it that way to keep deadlock detection fast. When deadlocks are detected the timer frequency is increased.
Deadlock detection would make certain kinds of apps much safer. A deadlock in production usually does not come alone. The thread pool is quickly overwhelmed and the application is down.
I dont' see why that is preferable to simpy doing the check at the point a thread attempts to acquire or wait for a lock.
I can envisage a method WouldCauseDeadlock(args...)
that simply returns a bool. There are potential failure windows though because getting false back at one instant only represents the situation at that instant.
Of course there are deeper things to consider because primitives like Mutex are actually OS kernel objects that exist outside of AppDomains and processes.
But for the C# lock
operation this is just a "wrapper" around a Win32 CriticalSection primitive which is confined to a single process (and no doubt is private to an AppDomain within the CLR).
This is why I initially would focus simply on lock
because that is the most common syncrhonization mechanism used by most C# developers ( <- a claim that I have no data for !!).
WouldCauseDeadlock
would need to run some graph algorithm to detect cycles. That can be very expensive. The time based approach reduces the checking cost to nearly zero.
Maybe there could be an extensive deadlock detector that any synchronization primitive can register with. Then, all the other classes in the framework as well as user types can participate.
@GSPP - OK yes in a situation where lots of locks are being acquired/released very frequently then I agree a periodic check could have performance advantages.
As I think about this a thread attempting to get a lock could only enter a deadlock if another thread has already acquired that lock, therefore an initial (fast) check is "Does another thread already own this lock?" if the answer is no then we're done no need to check any further but if the answer is yes, then further analysis is required.
You're right though, a periodic scan is a great way to do this. Though the (worker) thread that performs the check would need to be able to "wake" the relevant waiting thread and let it experience an exception.
Also if we allow a thread to begin to wait and cause a deadlock, there is nothing to stop additional thread also trying to get locks etc so what I mean is the worker thread may need to jolt several waiters.
This can probably be done more easily with synchronization primitives for which there is only one owner thread that is tracked by the runtime:
I'm thinking an algorithm could be something like this before starting to wait on a compatible lock type:
It seems like it could just be an O(N) search along the wait chain, such that we find the cycle when it happens. If we only do the detection periodically then it would have to search a larger graph. The waiting path is a slow path anyway, so unless the wait chain is very long, it probably would not affect perf too much. To further reduce the perf impact we may also choose to wait for some time first and then trigger deadlock detection before waiting again.
When a deadlock occurs, the app may not necessarily be completely stalled. A buggy component may have triggered a deadlock between two threads, but the rest of the app may still continue. Regarding throwing an exception:
I'm thinking raising an event might be better:
CC @vancem
For async locking maybe having a separate lock type that assigns lock ownership to a task rather than a thread, such that the task may run on any thread, and then the rest could be done similarly?
If it is event, it does not need to be built into the runtime. It can be independent deadlock detection part that is processing the lock contention events and it can use all sorts of customizable policies.
The algorithm is probably a bit more complicated than above, as a lock's owner thread may change when there isn't actually a deadlock and can lead to falsely identifying it as a deadlock. Probably would have to repeat the detection twice and confirm that the cycle is the same.
If an event is raised for waits (such as T0 is waiting for a lock held by T1, maybe even delayed) the above can happen concurrently, for instance T1 may release its lock and it may be acquired by T2, and now T0 is actually waiting for a lock held by T2. We'd then have to update the info by raising an event again but based on acquires, this could be expensive. We could maybe expose APIs to allow running the algorithm above by an independent component.
If the wait chain keeps changing, the detection may have to give up at some point, unless we synchronize it somehow to keep the system stable
I think it is worth thought to making nasty programming issues easier. Ideally we try to keep runtime involvement to a minimum by providing primitives that could be used to build more sophisticated things.
I also note that deadlock detection would not be my on my 'high priority' list because frankly the hang is pretty obvious and does contain the state needed to debug the problem. Thus I am more inclined to suggest a tool that makes the diagnosis of this easy (e.g. a dump analyzer).
Other interesting things that seem more interesting
Still. if it is easy enough maybe deadlock detection is worth it (small gain, but also small cost).
Maybe an alternative could be to provide an API that would do the deadlock detection asynchronously as suggested above and let the app decide when to invoke it. Debugger support should probably be the first step.
Would it be possible to make this also work for
async
deadlocks? I suspect that many people are switching from blockinglock
s to async locks using e.g.SemaphoreSlim
. Even though there is no thread being blocked, you can still get a deadlock.I understand that not having a thread associated with a synchronization primitive makes this harder, but I also think it might be useful.
An example:
var semaphore1 = new SemaphoreSlim(1, 1); var semaphore2 = new SemaphoreSlim(1, 1); async Task TaskA() { await semaphore1.WaitAsync(); await Task.Delay(500); await semaphore2.WaitAsync(); semaphore2.Release(); semaphore1.Release(); } async Task TaskB() { await semaphore2.WaitAsync(); await Task.Delay(500); await semaphore1.WaitAsync(); semaphore1.Release(); semaphore2.Release(); } await Task.WhenAll(TaskA(), TaskB());
At BingAds team, We wrote a wrapper for SemaphoreSlim for this exact reason, it detect deadlock in async code awaiting semaphore.WaitAsync(). I would be happy to share this code if it can be of use to other people. Only problem is it only detect cycle in thread and task waiting on instances of this wrapper.
if we have cycle on task waiting on task completion this will not be detected. EX: 1: "TaskA" call semaphore1.WaitAsync(); 2: "TaskA" start a task "TaskB". 2: "taskB" call await semaphore1.WaitAsync(); 3: "TaskA" call await taskB or taskB.Result
At runtime, we know "TaskA" own the lock and that TaskB try to acquire it, but we don't know a cycle exist because we don't know that the owner is blocked on TaskB completing.
Due to lack of recent activity, this issue has been marked as a candidate for backlog cleanup. It will be closed if no further activity occurs within 14 more days. Any new comment (by anyone, not necessarily the author) will undo this process.
This process is part of our issue cleanup automation.
This issue will now be closed since it had been marked no-recent-activity
but received no further activity in the past 14 days. It is still possible to reopen or comment on the issue, but please note that the issue will be locked if it remains inactive for another 30 days.
Given that the
lock
keyword simply "wraps" a critical section and critical sections are confined to a single process it strikes me as possible - in principle - for the CLR to be able to perform a kind of deadlock detection.It could do this when any thread attempts to acquire a
lock
by examining if the calling thread currently owns any otherlocks
and then for each of those examine if any of these are being waited for by any other thread.It could then abort the attempted lockwith some specific
DeadlockException
or something.Because critical sections are process wide it does beg the question in my mind, about if it is possible for two
AppDomains
(in the same process) to ever share alock
, I suspect this isn't ordinarily something one would design though.The proof of concept could just be a small API that wraps the Monitor class and maintains the data structures necessary to track
lock
ownership by threads. The API would manage a static tree/list (that is itself thread safe) that records which threads currently hold whichlocks
. Given the potential performance impact this could be something that is enabled/disabled at runtime at theAppDomain
level.The algorithm is then to deny a request from thread X to acquire a
lock
if doing so would lead to a hang because some other thread is trying to acquire alock
already held by thread X. The key observation here being that any thread that haslocks
held and is waiting for some otherlock
- by definition - cannot release thelocks
it already holds.The
exception
that would be thrown could contain metadata describing why a deadlock was detected which is far better than seeing anAppDomain
simply freeze.The ultimate benefit is not that deadlocks would be eliminated but that they could be detected at the instant they arise and used during testing or when attempting to reproduce a deadlock that has been reported in an app or class library.
Thoughts?