Closed · eiriktsarpalis closed this issue 1 year ago
I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.
Tagging subscribers to this area: @mangod9. See info in area-owners.md if you want to be subscribed.
| Author: | eiriktsarpalis |
|---|---|
| Assignees: | - |
| Labels: | `area-System.Threading` |
| Milestone: | - |
This is a classic lock inversion in the repro. Thread 1 locks object A and then tries to acquire the lock for object B, and Thread 2 locks object B and then tries to acquire the lock for A. If code can end up taking multiple locks, you need to ensure the locks are always taken in the same order, e.g. lock leveling, lock hierarchy, etc., or you need to be prepared to deal with the inability to acquire a lock (e.g. TryEnter instead of Enter).
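For illustration, here is a minimal sketch (not from the issue; the names and the ordering rule are hypothetical) of the inversion and of the two mitigations mentioned, consistent lock ordering and `Monitor.TryEnter` with back-off:

```csharp
using System;
using System.Threading;

class LockOrderingSketch
{
    static readonly object A = new();
    static readonly object B = new();

    // Deadlock-prone: thread 1 takes A then B while thread 2 takes B then A.
    static void Thread1() { lock (A) { Thread.Sleep(10); lock (B) { } } }
    static void Thread2() { lock (B) { Thread.Sleep(10); lock (A) { } } }

    // Mitigation 1: every caller acquires the two locks in the same order
    // (derived from something stable, e.g. a node id).
    static void LockBothInOrder(object first, object second, Action body)
    {
        lock (first)
            lock (second)
                body();
    }

    // Mitigation 2: be prepared to fail to acquire the second lock and back off.
    static bool TryLockBoth(object first, object second, Action body)
    {
        lock (first)
        {
            if (!Monitor.TryEnter(second, 100)) // timeout in milliseconds
                return false; // caller can release `first` and retry later

            try { body(); }
            finally { Monitor.Exit(second); }
            return true;
        }
    }
}
```

With the first mitigation, "the same order" has to come from something both threads agree on (an id, a level, a hierarchy), which is exactly the lock-leveling question discussed below.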
Note that the Parallel Stacks window in VS can highlight the culprits in such deadlocks:
Thanks, that makes perfect sense. What's not yet clear to me is why preceding this with a sequential traversal prevents the deadlock from occurring. Maybe it's just dumb luck that it hasn't happened to me yet, but could the first operation somehow impact the interleavings of the second one?
> Note that the Parallel Stacks window in VS can highlight the culprits in such deadlocks:
TIL, that's pretty awesome 🤯
> could the first operation somehow impact the interleavings of the second one?
Yes. For example, the first iteration will incur the overhead of JIT'ing each new method invoked, which changes the timing.
> If code can end up taking multiple locks, you need to ensure the locks are always taken in the same order, e.g. lock leveling, lock hierarchy, etc.
I'm struggling to come up with a lock ordering scheme when arbitrary graphs are involved. I should clarify that the traversal needs to be depth-first, so the visited nodes couldn't simply be sorted into some canonical order. It seems like taking a global lock on the graph is the only feasible option.
Why do you need to hold the node's lock while processing all of the nodes reachable from it?
What work exactly are you trying to parallelize?
And if the traversal must be purely depth-first, why is it acceptable to effectively jump to a random node to continue processing there (which is what happens when you Parallel.ForEach and have multiple threads processing different parts of the graph in parallel)?
Long story short, it concerns STJ's `JsonTypeInfo<T>` "configuration" routine:
This method was not designed with thread safety in mind; however, we do expose `JsonTypeInfo<T>` instances as singletons in the source generator. This means that any thread could attempt to use (and configure) arbitrary nodes in the same global type graph instance. I fixed a number of races in .NET 7 by putting locks on the configure methods; however, the deadlock does not currently manifest because child nodes are resolved and configured lazily.
This changes with a new feature I've been working on, which requires eager traversal of the full type graph at configuration time. I think it should be possible to remove the need for locking altogether, but that would require comprehensive refactoring of how the Configure method works. For the moment I'm looking for more targeted changes (using one lock for the entire type graph works).
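As a rough sketch of that last option (one lock shared by the whole graph), using hypothetical `GraphNode`/`Configure` names rather than the actual `JsonTypeInfo<T>` internals:

```csharp
using System.Collections.Generic;

class GraphNode
{
    public List<GraphNode> Children { get; } = new();
    public bool IsConfigured { get; private set; }

    // One lock shared by every node in the graph: concurrent callers serialize
    // on the whole traversal, so no lock inversion between nodes is possible.
    private static readonly object s_graphLock = new();

    public void Configure()
    {
        lock (s_graphLock)
        {
            ConfigureCore();
        }
    }

    private void ConfigureCore()
    {
        if (IsConfigured)
            return;

        // Mark before recursing so cycles in the graph terminate.
        IsConfigured = true;

        foreach (GraphNode child in Children)
            child.ConfigureCore(); // depth-first, already under the global lock
    }
}
```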
Description
I might be missing something obvious, but I encountered an interesting issue when attempting to implement synchronized graph traversal using locks.
Reproduction Steps
On my machine, the following console app deadlocks ~1 out of 4 times:
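The original snippet isn't reproduced here; based on the discussion above, it has roughly this shape (all names hypothetical: per-node locks held while recursing, driven by `Parallel.ForEach`):

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

var random = new Random(42);
Node[] nodes = Enumerable.Range(0, 1000).Select(_ => new Node()).ToArray();

// Wire up a random graph; cycles are expected.
foreach (Node node in nodes)
    for (int i = 0; i < 5; i++)
        node.Children.Add(nodes[random.Next(nodes.Length)]);

// Each worker starts a depth-first walk from a different node while holding
// every visited node's lock; two workers that reach the same pair of nodes
// in opposite orders deadlock.
Parallel.ForEach(nodes, node => node.Process(new HashSet<Node>()));
Console.WriteLine("Traversal completed.");

class Node
{
    private readonly object _lock = new();
    public List<Node> Children { get; } = new();

    public void Process(HashSet<Node> visited)
    {
        if (!visited.Add(this))
            return;

        lock (_lock) // per-node lock held while descending into children
        {
            foreach (Node child in Children)
                child.Process(visited);
        }
    }
}
```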
Expected behavior
Should eventually complete traversal of the entire graph.
Actual behavior
Application deadlocks. Increasing the graph size increases the probability of this happening.
Regression?
No response
Known Workarounds
Preceding the parallel traversal with a sequential traversal of the graph appears to fix the issue.
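Continuing the hypothetical reconstruction from the repro section, the workaround amounts to something like:

```csharp
// Single-threaded traversal first; per the discussion above, this most likely
// just perturbs timing (e.g. JIT warm-up) rather than removing the inversion.
foreach (Node node in nodes)
    node.Process(new HashSet<Node>());

// The subsequent parallel traversal has not been observed to deadlock.
Parallel.ForEach(nodes, node => node.Process(new HashSet<Node>()));
```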
Configuration
Running .NET 7/8 on Windows x64.
Other information
No response