dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.37k stars 4.75k forks source link

Deadlock when competing threads are locking on multiple objects #81199

Closed eiriktsarpalis closed 1 year ago

eiriktsarpalis commented 1 year ago

Description

I might be missing something obvious, but I encountered an interesting issue when attempting to implement synchronized graph traversal using locks.

Reproduction Steps

On my machine, the following console app deadlocks ~1 out of 4 times:

using System.Diagnostics;

Node[] graph;

//// uncommenting causes deadlocks to stop appearing.
//graph = GraphHelpers.CreateRandomGraph(size: 10, seed: 42);
//GraphHelpers.TraverseSequential(graph);

graph = GraphHelpers.CreateRandomGraph(size: 10, seed: 42);
GraphHelpers.TraverseParallel(graph);

// Models a simple directed multigraph
[DebuggerDisplay("#{Id}")]
public class Node
{
    public int Id { get; init; }
    public object LockObject { get; } = new object();
    public NodeState State { get; set; }
    public List<Node> Children { get; } = new();
}

public enum NodeState { NotTraversed, Traversing, Traversed };

public static class GraphHelpers
{
    // Creates a pseudorandom directed multigraph
    public static Node[] CreateRandomGraph(int size, int? seed = null)
    {
        Random random = seed is null ? Random.Shared : new Random(seed.Value);
        Node[] nodes = Enumerable.Range(0, size).Select(i => new Node { Id = i }).ToArray();
        foreach (Node node in nodes)
        {
            int childCount = random.Next(0, size);
            for (int i = 0; i < childCount; i++)
            {
                node.Children.Add(nodes[random.Next(0, size)]);
            }
        }

        return nodes;
    }

    public static void Traverse(Node node)
    {
        if (node.State is NodeState.Traversed)
        {
            return;
        }

        lock (node.LockObject)
        {
            switch (node.State)
            {
                case NodeState.Traversed:
                    break;

                case NodeState.Traversing: 
                    // can only be hit when traversal reaches a cycle.
                    break;

                default:
                    node.State = NodeState.Traversing;

                    foreach (var child in node.Children)
                        Traverse(child);

                    node.State = NodeState.Traversed;
                    break;
            }
        }
    }

    public static void TraverseSequential(Node[] nodes)
    {
        foreach (Node node in nodes)
            Traverse(node);

        Console.WriteLine("Sequential traversal done.");
    }

    public static void TraverseParallel(Node[] nodes)
    {
        Parallel.ForEach(nodes, Traverse);
        Console.WriteLine("Parallel traversal done.");
    }
}

Expected behavior

Should eventually complete traversal of the entire graph.

Actual behavior

Application deadlocks. Increasing the graph size increases the probability of this happening.

Regression?

No response

Known Workarounds

  1. Using a single lock object for the entire graph, i.e. something like
    private static readonly object _lockObj = new();
    public static void Traverse(Node node)
    {
        lock (_lockObj)
        {
           /* same recursive code as before */
        }
    }

    appears to fix the issue.

  2. Performing a sequential traversal before running a parallel traversal appears to fix the issue, even though it is acting on a completely different graph instance.

Configuration

Running .NET 7/8 on Windows x64.

Other information

No response

dotnet-issue-labeler[bot] commented 1 year ago

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

ghost commented 1 year ago

Tagging subscribers to this area: @mangod9 See info in area-owners.md if you want to be subscribed.

Issue Details
### Description I might be missing something obvious, but I encountered an interesting issue when attempting to implement synchronized graph traversal using locks. ### Reproduction Steps On my machine, the following console app deadlocks ~1 out of 4 times: ```C# using System.Diagnostics; Node[] graph; //// uncommenting causes deadlocks to stop appearing. //graph = GraphHelpers.CreateRandomGraph(size: 10, seed: 42); //GraphHelpers.TraverseSequential(graph); graph = GraphHelpers.CreateRandomGraph(size: 10, seed: 42); GraphHelpers.TraverseParallel(graph); // Models a simple directed multigraph [DebuggerDisplay("#{Id}")] public class Node { public int Id { get; init; } public object LockObject { get; } = new object(); public NodeState State { get; set; } public List Children { get; } = new(); } public enum NodeState { NotTraversed, Traversing, Traversed }; public static class GraphHelpers { // Creates a pseudorandom directed multigraph public static Node[] CreateRandomGraph(int size, int? seed = null) { Random random = seed is null ? Random.Shared : new Random(seed.Value); Node[] nodes = Enumerable.Range(0, size).Select(i => new Node { Id = i }).ToArray(); foreach (Node node in nodes) { int childCount = random.Next(0, size); for (int i = 0; i < childCount; i++) { node.Children.Add(nodes[random.Next(0, size)]); } } return nodes; } public static void Traverse(Node node) { if (node.State is NodeState.Traversed) { return; } lock (node.LockObject) { switch (node.State) { case NodeState.Traversed: break; case NodeState.Traversing: // can only be hit when traversal reaches a cycle. break; default: node.State = NodeState.Traversing; foreach (var child in node.Children) Traverse(child); node.State = NodeState.Traversed; break; } } } public static void TraverseSequential(Node[] nodes) { foreach (Node node in nodes) Traverse(node); Console.WriteLine("Sequential traversal done."); } public static void TraverseParallel(Node[] nodes) { Parallel.ForEach(nodes, Traverse); Console.WriteLine("Parallel traversal done."); } } ``` ### Expected behavior Should eventually complete traversal of the entire graph. ### Actual behavior Application deadlocks. Increasing the graph size increases the probability of this happening. ### Regression? _No response_ ### Known Workarounds 1. Using a single lock object for the entire graph, i.e. something like ```C# private static readonly object _lockObj = new(); public static void Traverse(Node node) { lock (_lockObj) { /* same recursive code as before */ } } ``` appears to fix the issue. 2. Performing a sequential traversal before running a parallel traversal appears to fix the issue, even though it is acting on a completely different graph instance. ### Configuration Running .NET 7/8 on Windows x64. ### Other information _No response_
Author: eiriktsarpalis
Assignees: -
Labels: `area-System.Threading`
Milestone: -
stephentoub commented 1 year ago

This is a classic lock inversion in the repro. Thread 1 locks object A and then tries to acquire the lock for object B, and Thread 2 locks object B and then tries to acquire the lock for A. If code can end up taking multiple locks, you need to ensure the locks are always taken in the same order, e.g. lock leveling, lock hierarchy, etc., or you need to be prepared to deal with the inability to acquire a lock (e.g. TryEnter instead of Enter).

Note that the Parallel Stacks window in VS can highlight the culprits in such deadlocks:

image
eiriktsarpalis commented 1 year ago

Thanks, that makes perfect sense. What's not yet clear to me is why preceding this with a sequential traversal prevents the deadlock from occurring. Maybe it's just dumb luck that it hasn't happened to me yet, but could the first operation somehow impact the interleavings of the second one?

Note that the Parallel Stacks window in VS can highlight the culprits in such deadlocks:

TIL, that's pretty awesome 🤯

stephentoub commented 1 year ago

could the first operation somehow impact the interleavings of the second one?

Yes. For example the first iteration will incur overheads of JIT'ing on each new method invoked, changing timing.

theodorzoulias commented 1 year ago

Dining philosophers problem :-)

eiriktsarpalis commented 1 year ago

If code can end up taking multiple locks, you need to ensure the locks are always taken in the same order, e.g. lock leveling, lock hierarchy, etc.

I'm struggling to come up with a lock ordering scheme when arbitrary graphs are involved. I should clarify that traversal needs to be depth-first so you couldn't sort the visited nodes based on some order. Seems taking a global lock on the graph is the only feasible option.

stephentoub commented 1 year ago

Why do you need to hold the node's lock while processing all of the nodes reachable from it?

What work exactly are you trying to parallelize?

And if the traversal must be purely depth-first, why is it acceptable to effectively jump to a random node to continue processing there (which is what happens when you Parallel.ForEach and have multiple threads processing different parts of the graph in parallel)?

eiriktsarpalis commented 1 year ago

Long story short, it concerns STJ's JsonTypeInfo<T> "configuration" routine:

https://github.com/dotnet/runtime/blob/26b58b99aeef79c48b8e29463a1e6f1855174e1b/src/libraries/System.Text.Json/src/System/Text/Json/Serialization/Metadata/JsonTypeInfo.cs#L596-L608

This method was not designed with thread-safety in mind, however we do expose JsonTypeInfo<T> instances as singletons in the source generator. This means that any thread could attempt to use (and configure) arbitrary nodes in the same global type graph instance. I fixed a number of races in .NET 7 by putting locks on the configure methods, however the deadlock does not currently manifest because child nodes are resolved and configured lazily.

This changes with a new feature I've been working on, which requires eager traversal of the full type graph at configuration time. I think it should be possible to remove the need for locking altogether, but that would require comprehensive refactoring of how the Configure method works. For the moment I'm looking for more targeted changes (using one lock for the entire type graph works).