Proposal - Incorporate deadlock detection into the CLR.

Korporal commented 6 years ago

Given that the lockkeyword simply "wraps" a critical section and critical sections are confined to a single process it strikes me as possible - in principle - for the CLR to be able to perform a kind of deadlock detection.

It could do this when any thread attempts to acquire a lockby examining if the calling thread currently owns any other locksand then for each of those examine if any of these are being waited for by any other thread.

It could then abort the attempted lockwith some specific DeadlockException or something.

Because critical sections are process wide it does beg the question in my mind, about if it is possible for two AppDomains(in the same process) to ever share a lock, I suspect this isn't ordinarily something one would design though.

The proof of concept could just be a small API that wraps the Monitor class and maintains the data structures necessary to track lockownership by threads. The API would manage a static tree/list (that is itself thread safe) that records which threads currently hold which locks. Given the potential performance impact this could be something that is enabled/disabled at runtime at the AppDomainlevel.

The algorithm is then to deny a request from thread X to acquire a lockif doing so would lead to a hang because some other thread is trying to acquire a lockalready held by thread X. The key observation here being that any thread that has locksheld and is waiting for some other lock- by definition - cannot release the locksit already holds.

The exceptionthat would be thrown could contain metadata describing why a deadlock was detected which is far better than seeing an AppDomainsimply freeze.

The ultimate benefit is not that deadlocks would be eliminated but that they could be detected at the instant they arise and used during testing or when attempting to reproduce a deadlock that has been reported in an app or class library.

Thoughts?

RussKeldorph commented 6 years ago

@dotnet/dotnet-diag

noahfalk commented 6 years ago

Hi @Korporal ! Thanks for the feedback. What you suggest seems plausible to build at first glance (though maybe not easy), and we've already got some debug APIs which allow a similar analysis for debuggers: https://docs.microsoft.com/en-us/dotnet/framework/unmanaged-api/debugging/icordebugthread4-getblockingobjects-method https://docs.microsoft.com/en-us/dotnet/framework/unmanaged-api/debugging/icordebugheapvalue3-getthreadowningmonitorlock-method https://docs.microsoft.com/en-us/dotnet/framework/unmanaged-api/debugging/icordebugheapvalue3-getmonitoreventwaitlist-method

A few other thoughts I have offhand: a) Its not clear to me how many .NET customers would prioritize this type of functionality? This issue may get some additional feedback to help and then UserVoice is also a good way to vote on things you think are interesting and see if others in the community feel the same way. b) There are many other synchronization primitives that customers could use from managed code so this feature would be limited to only the deadlocks that are produced via cyclic acquisition of monitor lock. I worry a bit that we'd do lots of work only to discover most of the issues encountered in real world apps weren't simple enough to be detected, but happy to keep an open mind if the community feedback suggests otherwise.

@gregg-miskelly - Do you happen to recall if VS still has its support for monitor lock deadlock detection and any feedback customers have given us on its effectiveness?

Korporal commented 6 years ago

@noahfalk - Yes there are other primitives and initially I was considering scenarios involving only the lock keyword (rudimentary Monitor locks). I can envisage a general purpose API though that could be leveraged (under the hood) for all of these primitives because the problem is the same in every case.

That API would maintain state date for each thread and each "lock" that has been acquired and is being waited for. Then before proceeding to wait in any way (on a primitive) the algorithm would analyze the state data to ascertain if that wait would be a deadlock - never able to terminate.

The only justification of course is that throwing an exception when this is detected is better than waiting and thus freezing (the relevant threads) in the AppDomain.

danmoseley commented 6 years ago

I wonder what overhead this would cause. Taking an uncontrnded lock should be very fast and this would add some book keeping to that path.

mikedn commented 6 years ago

I can envisage a general purpose API though that could be leveraged (under the hood) for all of these primitives because the problem is the same in every case. That API would maintain state date for each thread and each "lock" that has been acquired and is being waited for. Then before proceeding to wait in any way (on a primitive) the algorithm would analyze the state data to ascertain if that wait would be a deadlock - never able to terminate.

It's not clear how such an API would work without actual cooperation from synchronization primitives. You're probably end up with races between API reporting and actual blocking.

svick commented 6 years ago

Would it be possible to make this also work for async deadlocks? I suspect that many people are switching from blocking locks to async locks using e.g. SemaphoreSlim. Even though there is no thread being blocked, you can still get a deadlock.

I understand that not having a thread associated with a synchronization primitive makes this harder, but I also think it might be useful.

An example:

var semaphore1 = new SemaphoreSlim(1, 1);
var semaphore2 = new SemaphoreSlim(1, 1);

async Task TaskA()
{
    await semaphore1.WaitAsync();

    await Task.Delay(500);

    await semaphore2.WaitAsync();

    semaphore2.Release();

    semaphore1.Release();
}

async Task TaskB()
{
    await semaphore2.WaitAsync();

    await Task.Delay(500);

    await semaphore1.WaitAsync();

    semaphore1.Release();

    semaphore2.Release();
}

await Task.WhenAll(TaskA(), TaskB());

Korporal commented 6 years ago

@danmosemsft

I wonder what overhead this would cause. Taking an uncontrnded lock should be very fast and this would add some book keeping to that path.

Yes it will carry a cost and you're right in the case of acquiring an uncontested lock it would significantly raise that cost (though still very small overall). This is why I suggested making an AppDomainoption perhaps with a method SetDeadlockDetection(bool)or something.

Korporal commented 6 years ago

@mikedn - I haven't thought about the technical details yet and agree it is not something to be done without serious analysis the API's "database" would itself require some form of synchronization too, ideally though this could be a very briefly held spin lock.

I have done something like this before in the late 80's - on Stratus fault tolerant minis (running VOS a derivative of Multics) and used this to detect deadlock scenarios when acquiring (what are termed in VOS on 68000+ family) "lockwords" it worked but that was a simpler scenario in some ways (didn't have the variety of primitives and VOS schedules processes not threads).

Korporal commented 6 years ago

@svick - That's a good test case you outline there! I don't think that the fact it's running asynchas any bearing though and the same algorithm would work, after all even in that asynccode we still have a thread trying to acquire a lock(it's just a worker thread or perhaps IO completion thread - I think).

Actually the more I look at that code the more I now wonder if it is actually doing what it might seem to be doing but I'm not familiar with WaitAsyncso I suspect what I'm concerned about has already been ruled out by the designers.

svick commented 6 years ago

@Korporal You still have a thread, but the issue is that that thread is not the owner of the lock.

For example, the four acquire attempts in my example (in the form of WaitAsync()) could happen on four different threads. Or they could all happen on a single thread. And in both cases, you would still get deadlock.

Korporal commented 6 years ago

@svick - I'm not familiar with this asyncusage for these locks it looks rather interesting, I guess you're correct in that all that matters (for deadlock analysis) is the data about all currently held locks and current pending requests to acquire locks.

But how does that code ensure that Thread_1 (executing TaskA) doesn't first get semaphore1and then next get semaphore2in TaskB- since any thread could in principle execute the code? Thread_2might next try to acquire semaphore2in TaskA and wait and there'd be no deadlock at this point.

I'm sure its solid and my unfamiliarity with this is my problem here!

mikedn commented 6 years ago

I haven't thought about the technical details yet and agree it is not something to be done without serious analysis the API's "database" would itself require some form of synchronization too, ideally though this could be a very briefly held spin lock.

To be clear - I wasn't expecting exact implementation details, I was just wondering if something like this is possible without the cooperation of synchronization primitives. Maybe I do not understand how such an API might look like. Something like DeadlockDetection.NotifyBeginWait(object) and DeadlockDetection.NotifyEndWait(object)?

Korporal commented 6 years ago

@mikedn

To be clear - I wasn't expecting exact implementation details, I was just wondering if something like this is possible without the cooperation of synchronization primitives. Maybe I do not understand how such an API might look like. Something like DeadlockDetection.NotifyBeginWait(object) and DeadlockDetection.NotifyEndWait(object)?

In principle I think it would be but it would be much more attractive to emebed the calls to the API inside existing synchronization classes (like Monitor). But one could write a set of replacement methods for these and have consumers use these instead.

For example a new static class SafeMonitor could be crafted that provides the same methods as Monitor and wraps calls to Monitor but also includes calls to the API.

GSPP commented 6 years ago

Deadlock detection could be asynchronous for performance reasons. There could be a timer that checks all blocked threads every 5 seconds. SQL Server does it that way to keep deadlock detection fast. When deadlocks are detected the timer frequency is increased.

Deadlock detection would make certain kinds of apps much safer. A deadlock in production usually does not come alone. The thread pool is quickly overwhelmed and the application is down.

Korporal commented 6 years ago

@mikedn - Yes, when I mentioned "an API" I wasn't trying to imply it would be directly called by developers but would for the basis for existing code, I totally agree that classes like Monitor and Mutex would be the ideal places to locate these calls.

danmoseley commented 6 years ago

adding @kouvel who owns threading primitives

Korporal commented 6 years ago

@GSPP

Deadlock detection could be asynchronous for performance reasons. There could be a timer that checks all blocked threads every 5 seconds. SQL Server does it that way to keep deadlock detection fast. When deadlocks are detected the timer frequency is increased.

Deadlock detection would make certain kinds of apps much safer. A deadlock in production usually does not come alone. The thread pool is quickly overwhelmed and the application is down.

I dont' see why that is preferable to simpy doing the check at the point a thread attempts to acquire or wait for a lock.

I can envisage a method WouldCauseDeadlock(args...)that simply returns a bool. There are potential failure windows though because getting false back at one instant only represents the situation at that instant.

Of course there are deeper things to consider because primitives like Mutex are actually OS kernel objects that exist outside of AppDomains and processes.

But for the C# lockoperation this is just a "wrapper" around a Win32 CriticalSection primitive which is confined to a single process (and no doubt is private to an AppDomain within the CLR).

This is why I initially would focus simply on lockbecause that is the most common syncrhonization mechanism used by most C# developers ( <- a claim that I have no data for !!).

GSPP commented 6 years ago

WouldCauseDeadlock would need to run some graph algorithm to detect cycles. That can be very expensive. The time based approach reduces the checking cost to nearly zero.

Maybe there could be an extensive deadlock detector that any synchronization primitive can register with. Then, all the other classes in the framework as well as user types can participate.

Korporal commented 6 years ago

@GSPP - OK yes in a situation where lots of locks are being acquired/released very frequently then I agree a periodic check could have performance advantages.

As I think about this a thread attempting to get a lock could only enter a deadlock if another thread has already acquired that lock, therefore an initial (fast) check is "Does another thread already own this lock?" if the answer is no then we're done no need to check any further but if the answer is yes, then further analysis is required.

You're right though, a periodic scan is a great way to do this. Though the (worker) thread that performs the check would need to be able to "wake" the relevant waiting thread and let it experience an exception.

Also if we allow a thread to begin to wait and cause a deadlock, there is nothing to stop additional thread also trying to get locks etc so what I mean is the worker thread may need to jolt several waiters.

kouvel commented 6 years ago

This can probably be done more easily with synchronization primitives for which there is only one owner thread that is tracked by the runtime:

Monitor is possible, probably a good place to start
Mutex's owner is not tracked by the runtime on Windows (and can't be tracked accurately), maybe if there's a way to query that, but still with WaitHandles threads can wait for multiple of them and it would be more complicated
Semaphore doesn't have thread affinity
Maybe for ReaderWriterLockSlim

I'm thinking an algorithm could be something like this before starting to wait on a compatible lock type:

If the wait has a timeout, it's not a deadlock
If there are no other compatible locks owned by this thread, it's not a deadlock
D = current thread, T0 = owner thread of the lock that would be waited upon
Loop:
- If T0 is not waiting on a compatible lock type, it's not a deadlock
- T1 = owner thread of the lock T0 is waiting upon
- If T1 == D, it's a deadlock
- T0 = T1

It seems like it could just be an O(N) search along the wait chain, such that we find the cycle when it happens. If we only do the detection periodically then it would have to search a larger graph. The waiting path is a slow path anyway, so unless the wait chain is very long, it probably would not affect perf too much. To further reduce the perf impact we may also choose to wait for some time first and then trigger deadlock detection before waiting again.

When a deadlock occurs, the app may not necessarily be completely stalled. A buggy component may have triggered a deadlock between two threads, but the rest of the app may still continue. Regarding throwing an exception:

This might be a bit intrusive and for an app to detect deadlock it would need to be aware of this and catch the exception all over the place to prevent the process from going down as a result
Some components involved may not be deadlock-aware and it may be difficult to determine which component triggered the deadlock (maybe from the stack trace)
Detection would need to be perfect, can't be based on heuristic

I'm thinking raising an event might be better:

The event could include the set of threads involved in the deadlock, and maybe also an exception for the stack trace
The app could choose how to handle it (log it, raise another event, identify component by thread and stop using it, interrupt one of the threads and try to continue, fail-fast, etc.)
Could consider heuristic-based deadlock detection in the future to notify of situations where threads are blocked for a long time but it may not necessarily be a deadlock

CC @vancem

kouvel commented 6 years ago

For async locking maybe having a separate lock type that assigns lock ownership to a task rather than a thread, such that the task may run on any thread, and then the rest could be done similarly?

jkotas commented 6 years ago

If it is event, it does not need to be built into the runtime. It can be independent deadlock detection part that is processing the lock contention events and it can use all sorts of customizable policies.

kouvel commented 6 years ago

The algorithm is probably a bit more complicated than above, as a lock's owner thread may change when there isn't actually a deadlock and can lead to falsely identifying it as a deadlock. Probably would have to repeat the detection twice and confirm that the cycle is the same.

If an event is raised for waits (such as T0 is waiting for a lock held by T1, maybe even delayed) the above can happen concurrently, for instance T1 may release its lock and it may be acquired by T2, and now T0 is actually waiting for a lock held by T2. We'd then have to update the info by raising an event again but based on acquires, this could be expensive. We could maybe expose APIs to allow running the algorithm above by an independent component.

kouvel commented 6 years ago

If the wait chain keeps changing, the detection may have to give up at some point, unless we synchronize it somehow to keep the system stable

vancem commented 6 years ago

@stephentoub

I think it is worth thought to making nasty programming issues easier. Ideally we try to keep runtime involvement to a minimum by providing primitives that could be used to build more sophisticated things.

I also note that deadlock detection would not be my on my 'high priority' list because frankly the hang is pretty obvious and does contain the state needed to debug the problem. Thus I am more inclined to suggest a tool that makes the diagnosis of this easy (e.g. a dump analyzer).

Other interesting things that seem more interesting

detecting when you access Mutable multi-thread state (that is you are MISSING a lock)
detecting when you have lock ordering problems (that is you COULD have a deadlock, you just have not hit the right case yet).

Still. if it is easy enough maybe deadlock detection is worth it (small gain, but also small cost).

kouvel commented 6 years ago

Maybe an alternative could be to provide an API that would do the deadlock detection asynchronously as suggested above and let the app decide when to invoke it. Debugger support should probably be the first step.

maximecaron commented 5 years ago

Would it be possible to make this also work for async deadlocks? I suspect that many people are switching from blocking locks to async locks using e.g. SemaphoreSlim. Even though there is no thread being blocked, you can still get a deadlock.

I understand that not having a thread associated with a synchronization primitive makes this harder, but I also think it might be useful.

An example:
var semaphore1 = new SemaphoreSlim(1, 1);
var semaphore2 = new SemaphoreSlim(1, 1);

async Task TaskA()
{
    await semaphore1.WaitAsync();

    await Task.Delay(500);

    await semaphore2.WaitAsync();

    semaphore2.Release();

    semaphore1.Release();
}

async Task TaskB()
{
    await semaphore2.WaitAsync();

    await Task.Delay(500);

    await semaphore1.WaitAsync();

    semaphore1.Release();

    semaphore2.Release();
}

await Task.WhenAll(TaskA(), TaskB());

At BingAds team, We wrote a wrapper for SemaphoreSlim for this exact reason, it detect deadlock in async code awaiting semaphore.WaitAsync(). I would be happy to share this code if it can be of use to other people. Only problem is it only detect cycle in thread and task waiting on instances of this wrapper.

if we have cycle on task waiting on task completion this will not be detected. EX: 1: "TaskA" call semaphore1.WaitAsync(); 2: "TaskA" start a task "TaskB". 2: "taskB" call await semaphore1.WaitAsync(); 3: "TaskA" call await taskB or taskB.Result

At runtime, we know "TaskA" own the lock and that TaskB try to acquire it, but we don't know a cycle exist because we don't know that the owner is blocked on TaskB completing.

dotnet-policy-service[bot] commented 1 month ago

Due to lack of recent activity, this issue has been marked as a candidate for backlog cleanup. It will be closed if no further activity occurs within 14 more days. Any new comment (by anyone, not necessarily the author) will undo this process.

This process is part of our issue cleanup automation.

dotnet-policy-service[bot] commented 2 weeks ago

This issue will now be closed since it had been marked no-recent-activity but received no further activity in the past 14 days. It is still possible to reopen or comment on the issue, but please note that the issue will be locked if it remains inactive for another 30 days.

dotnet / runtime

Proposal - Incorporate deadlock detection into the CLR. #10335