jjmason commented 4 years ago

The Problem

When using transactions in production, @jsteinich and I have encountered deadlocks that are difficult to reproduce and predict. While the ultimate solution is to order accesses and structure data to prevent these situations altogether, we would like to be able to do two things to keep our systems stable and push toward that goal:

Quickly (~1 sec) abort and retry the request, preferably knowing that it's a deadlock and being able to tune our retry logic appropriately. Most deadlocks are resolved by retry - if we can do these retries relatively quickly the impact on player experience is minimized.
Diagnose the deadlock by examining Silo logs. Currently we have a transaction id and a request message, which leaves a lot of guesswork in determining the cause of the deadlock. It would be nice to see which grains are being locked for all transactions involved in the deadlock.

While reducing the transaction break-lock timeout would enable (1), in production we see a small number of transactions take a very long time (> 5 sec). Aborting and retrying these transactions after a short time interval won't help, and in many cases will make things worse.

Our Proposal

We'd like to add an (optional) runtime deadlock detection system to orleans transactions. This would work something like the following:

Each lock group has two deadlines - one to check for deadlocks, and one to abort the transaction normally. When the first deadline is hit, we run the deadlock detector, and abort with a special "deadlock" status if one is found. The deadlock detection would also log the transactions and grains that are part of the cycle.
If deadlock detection is enabled, each ReaderWriterLock sends information about waits and locks to a per silo service.
If the deadlock check deadline is reached for a particular lock group, we request a deadlock check for the transactions that are part of that group.
- As an optimization, we could first check for deadlocks in the per silo state.
- If no local deadlocks are found, we send a message containing the suspected transaction(s) to a single grain. This grain requests relevant lock/wait information for each silo (essentially, relevant means 'reachable from the suspected transaction id, via either locks or waits') from the local tracking services.
- With this information, we use cycle detection to identify deadlocks and trigger aborts.
- If deadlock detection is disabled, we never send info to the local tracking service or request deadlock checks, so the overhead in this case is minimal.

Before we start building this, we'd like the Orleans team's take on the proposal in general and advice on a few specific problems.

The current prototype we're working with uses simple locking to handle concurrency in the per silo service - it might be easier/better to use message passing by working with the orleans runtime.
Requesting information from per silo services is tricky - the IManagementGrain.SendControlCommandToProvider trick works in our prototype, but we'd like to use more appropriate systems if they exist.

jjmason commented 4 years ago

@sergeybykov @jason-bragg any thoughts on this? We have a quick and dirty implementation working, but it needs some changes to Orleans code to get it done.

ReubenBond commented 4 years ago

I'm interested to see what the code + required changes look like. We can certainly add something more fit-for-purpose than SendControlCommandToProvider.

jjmason commented 4 years ago

Great, we'll put together a rough PR.

ReubenBond commented 4 years ago

Thanks @jjmason - a rough PR largely as fuel for discussion (at least to begin with) would be great - so there's no need to make it clean/mergable quality

jason-bragg commented 4 years ago

Posting notes from previous thinking on general orleans deadlock problem from 2012 (before transactions). Transactions adds more complexity, so this isn't directly about this issue, but posting for context. Note, since this was from 2012, it may not all still be accurate.

tagging @gabikliot, as he was, to my understanding, central to much of this investigation at the time.

2012 Deadlock detection notes

Let's break the problem of deadlock detection into 3 sub problems: 
    1. Detecting a cycle of requests within the same transaction (the same client request)
    2. Detecting a cycle of size 2 between independent requests
    3. Detecting a cycle of size larger than 2 between independent requests
For each sub-problem and each solution, desirable properties would be:
    1. No false positives  - if we declare  a deadlock, there is indeed a deadlock
    2. No false negatives -  if we did not declare a deadlock, there was not any (we did not miss any deadlock)
Since all our methods are based on runtime execution (and not code analysis), we can not be sure anyway that the code is fully deadlock free. That means that occasional false negatives are actually allowed. Simply put, since we can't guarantee that the service is deadlock free anyway, it is not so bad if we also mistakenly did not notice some deadlock that indeed occurred. Such deadlock will  not be harmful any way, since we have timeouts to break deadlock cycles.
However, we should try to minimize false negatives.

    1. Detecting a cycle of requests within the same transaction (the same client request)
        a. We need to keep the full causal request chain in a message header, and when the request arrives check to see if the target activation is not already part of the chain (processing one of the requests in the chain before delivering the new request).
            1) A cheap version of this would be to detect cycles of length 2 -- A calls B calls A -- by looking at the sender of a new request when it is received at non-reentrant grain G and checking to see if G has a pending request waiting on a response from that sender. You could do the same, actually, before sending, except that you won't know whether the target grain is reentrant or not.
        b. Properties: The solution does not have neither false positives nor false negatives.
        c. Cost:
            1) The solution does not have any remote cost - no additional remote msgs.
            2) Additional call chain id header in every msg.  List<Tuple<ActivationId, InterfaceId, MethodId>>. 
                a) Need the actual activation id to make sure this activation is part of the deadlock. Also assume we can find grain based on activation id and figure out it is Reentrant or not.
                b) Need interface id and method id for reporting back to the caller. Assume can get method name from interface id and method id on the client. Not sure.
                c) We already track call history in a string format and store in metadata, so will just need to store in a header.
            3) An ability to know which request an activation currently executes when a new request arrives.
                a) In the new pipeline architecture, we will have a layer responsible for reentrancy and isolation of requests and interaction with the scheduler, so this layer would know that for sure, for no additional cost.
                b) In the current architecture, Catalog knows the currently executing request (Message Running)
        d. Implementation:
            1) Dispatcher looks in the arrived message before  queuing it due to request non-reentrancy (must be done in the Dispatcher before it queues it and not in InsideGrainClient.Invoke() since the former is too late)
            2) Extracts the current target (activationId, methodId, interfaceId) from the arrived msg.
            3) Extracts the call history chain from msg.
            4) Checks for deadlock:
                a) if yes, replies with DeadlockException.
                b) If no,  allows to queue. Still need to make sure the current invocation data also propagate to the next msg.
            5) InsideGrainClient before invoke:
                a) Adds the current call site to the call chain up until now (that just extracted from the msg).
                b) Stores the new call chain in the RequestContext
                    a) Should be kept inaccessible to the app code. We should NEVER allow the app from iterating over the RequestContext values.
                c) RequestContext is already now being automatically saved in the msg upon send.
                    a) GrainClient.CreateMessage -> RequestContext.ExportToMessage(message);
        e. An alternative solution that does not track the whole call chain and only uses  unique tx id is WRONG.
            i. For this problem we just need to allocate a unique tx id on every new client request coming into the system and every time it is queued to execution into an activation and an activation is busy with previous request, we should check what request it is currently executing and if the target grain is non reentrant and if they have the same tx id, we know for sure that it is a deadlock.
            ii. (Alan) This algorithm gives false positives in "diamond" calling patterns. Client calls A, A calls B and C in parallel, B calls D, C calls D while D is processing B's request. There's no deadlock, but a new request for D has arrived for the same transaction that D is currently processing.
        f. (Alan) I think this is the most important case to detect, since it is generally 100% reproducible, while the other cases below are timing-dependent.
        g. Need to pass a call chain of activation IDs, interface ID and method ID, and detect loops on the receiver side

    2. Detecting a cycle of size 2 between independent requests:
        a. Activation a1 sent a request r1 to activation a2, and at the same time activation a2 sent a request r2 to activation a1. Both requests have different tx ids.
        b. First, I would like to make a simplifying assumption about ordering:
            i. Lets assume that any 2 msgs sent between 2 activations are FIFO ordered. This includes also ordering of responses with requests. 
            ii. So for example if a2 sent a response for r1 to a1 and later on it sent a request r2 to a1, the response to r1 will arrive to a1 before r2.
            iii. This assumption (requirement) would simplify the solution design and make it possible to have no false positives  and no false negatives .
        c. There are 2 places we could implement the algorithm:
            i. Test for deadlocks when the request arrives to an activation 
            ii. Test for deadlocks when the request timeouts in the callback data in the issuing activation
        d. Solution A: When the request arrives to an activation
            i. When request r2 arrives to a1 we need to check that there are no outstanding requests from a1 to a2.
            ii. If there is such an outstanding request, we know for sure there is a deadlock.
                1) Due to the assumption of FIFO, if when r2 arrives to a1 and there is an outstanding r1, it means the response for r1 has not arrived yet, which means it has not been sent yet (FIFO between responses and requests) and it means when a2 sent r2 to a1, a2 did not receive (did not start processing) r1 yet.
                2) This is only true if both a1 and a2 are non-reentrant grains. If a2 is reentrant, there is no deadlock, since a2 will be able to process r1 in parallel with waiting on r2 reply from a1, thus a2 will be able to respond to r1 as well and unblock a1.
            iii. Properties: The solution does not have neither false positives nor false negatives.
            iv. Cost:
                1) The solution does not have any remote cost - no additional remote msgs.
                2) An ability to know which request an activation has issued when a new request arrives.
                    a) I think we don't track this info right now.  CallbackData collection is for all requests from this silo, keyed by correlation id. So we would need to remember, per activation, which request it issued. This should not impose any big runtime cost, since in the general case an activation would not issue a lot of requests. 
                3) An ability to know if the issued request was issued to non-reentrant or reentrant grain.
                    a) GrainTypeManager holds this info. Might need to refactor, but should be doable. 
                    b) One problem: if interface has multiple impl. classes, which one the msg will go to?
        e. Solution B: When the request times out test for deadlocks
            i. When request r1 from a1 to a2 times out, test locally if we have any pending request from a2 to a1.
            ii. This solution is actually identical to Solution A (test on request arrival) - it will detect exactly the same set of deadlocks, but later instead of earlier.
        f. If there is no FIFO between requests and responses:
            i. Now it could be that a2 sent a response for r1 and later on a2 sent r2, but r2 arrived before r1 response.
            ii. Therefore, if we declare deadlock we might have false positives.
            iii. In such a case it is better to do deadlock detection upon timeout and not upon request arrival
                1) It will lower the chance for false positives, but it will not eliminate them completely. But in this case, we fail the request anyway, so instead of failing with timeout, we will fail with PotentialDeadlock exc.
            iv. (Alan) It still may be better to check on arrival, accepting the relatively rare false positive in favor of early failure. False positives should be quite rare because the processing pipeline for sending a reply is shorter than that for sending a request, so it would be highly unusual (although certainly not impossible) for a request to move ahead of a reply in the message stream.
                1) I think Alan is right. In this particular case of Response we might have FIFO. Response is send by InsideGrainClient as part of request post invoke step - in the same turn as invoke. And it goes in the same turn into Transport.SendMessage(message) which calls OutboundQueue.SendMessage(msg); -> AsynchQueueAgent.QueueRequest(). Need to check, but we might be doing FIFO along this route in the non-failure case even now.

    3. Detecting a cycle of size larger than 2 between independent requests:
        a. I think it is possible to show (and prove) that for a cycle of 3 independent request, it is not be  possible to devise a deadlock detection algorithm that uses only local information available at silos upon request generation or arrival.
        b. Imagine 3 activations on 3 silos, a1 sending request r1 to a2, a2 sending r2 to a3, and a3 sending r3 to a1. All requests are sent at the same time, so when one is issued from one silo, this silo did not receive the request to this silo.
            i. It means, when the request arrives it has nothing in it about any request destined to its originating silo.
            ii. So r3 arriving at a1 does not know about r2 destined to a3. 
            iii. a1's silo also does not know about r2 that originated at a2.
            iv. So it seems it is not possible to detect the deadlock only based on a local information or on requests.
        c. So there is a need to send additional messages deadlock detection:
            i. For example, responses with some partial information available in the silos (these responses are actually not app responses, but rejection msgs after timeout, which include some deadlock relevant info).
            ii. Or some sort of cycle detection algorithm that will "walk" the cycle in some cases.
        d. At any event, any such algorithm is not local and will involve additional messaging.
We can therefore delay such an algorithm for future and for now on only deal with the first 2 cases.

jjmason commented 4 years ago

Hey all, I've added a sketch of the implementation that we can start discussion in PR #6253.

Let me know what you think of the overall strategy.

jjmason commented 4 years ago

Hi @ReubenBond @sergeybykov, any thoughts on this?

ReubenBond commented 4 years ago

Apologies for the very long delay, @jjmason. I am in favor of the approach you've taken.

One question is: if system performance degrades, would this change result in a spike in load which could degrade performance further?

@jason-bragg, do you have any thoughts on this design & associated PR?

KevinCathcart commented 4 years ago

One concern I see is that the implementation shown in the pull request will miss some deadlocks that only involve transaction locks, because it only queries the other silos once, but they don’t return their entire lock graph, but only a subset based on parameters passed in.

Assume that we have five resources, named A-E. And further, each resource happens to be on a separate silo, which we will also call A-E, after the resource they host. Now assume that we have five transactions ongoing, each of which has an exclusive lock on one of the five resources, and is waiting on the lock for the alphabetically next resource, with wraparound, (so the transaction holding E is waiting on A). Obviously that is a classic deadlock.

Now when the deadlock check deadline for Resource A happens, it will ask the checking grain, which for arguments sake, we will assume lives on some sixth silo, so it will have no local lock information, and will rely on the remote silos for info. So it calls out to all of them, asking about locks or waits that involve resource A, or the transaction that holds it. Only silos A and B will have this information, as the other silos know nothing about that resource, or the transaction holding it. But data from those two silos is not enough to detect this deadlock.

For that, it is necessary to examine the responses, and requery, the silos including any new transactions learned about, until some steady state is reached, or a deadlock can be detected in the partial data collected so far. Not really too big a problem, but will make this at least somewhat more expensive than it otherwise is.

jjmason commented 4 years ago

Both of these are good points.

I think that Reuben's question about performance could be addressed with a basic rate limit on deadlock detection - for example, only N potential deadlocks could be analyzed at once. This means that we would miss some actual deadlocks, but in the scenario where the system is under load and it's probably better to let those deadlocks reach the normal timeout than to add more load.

Collecting information from multiple silos is certainly a challenge, and might be prohibitively expensive. We could limit how far down a chain we go looking for deadlocks and probably still see a satisfactorily low false negative rate, but that's something we'll have to assess with more data.

One alternative to repeated queries and steady-state tracking might be to have each silo send requests to silos hosting grains that are involved in locks it's aware of, along with the address of the grain running deadlock detection, and have each silo in the chain report its lock information to the deadlock detector. This would require tracking which requests each silo has responded to, since all cross-silo deadlocks involve a "cycle" of silos. Does this seem like it might offer performance advantages over repeated queries from the deadlock detector to all silos?

I have some time to work on this over the next couple weeks - would it make sense to add work in progress to this PR?

jjmason commented 4 years ago

@ReubenBond I've got the PR in a more production ready place. It (probably) needs some more work, but I'd be interested to get the team's thoughts on it.

In particular, I'd like some advice on testing (it's naturally difficult to setup consistent test scenarios), and overall correctness.

ghost commented 2 years ago

We've moved this issue to the Backlog. This means that it is not going to be worked on for the coming release. We review items in the backlog at the end of each milestone/release and depending on the team's priority we may reconsider this issue for the following milestone.

KSemenenko commented 2 years ago

this is a good one

dotnet / orleans

Adding Transactional Deadlock Detection #6219

The Problem

Our Proposal