dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.2k stars 4.72k forks source link

[mono][loader] Calling assembly resolve while holding the loader lock can lead to deadlocks #51864

Open lambdageek opened 3 years ago

lambdageek commented 3 years ago

The original issue is https://github.com/mono/mono/issues/21032 Adding a copy here to make sure we do this work for .NET 6, too. The assembly loading algorithm is different, but the same situation is possible


Sample stack trace from VS Mac https://gist.github.com/Therzok/df71c056b0dabc8a2f3ffb6f1c264488

Thread 8 calls mono_class_create_from_typedef which then needs to get the parent type and calls mono_class_get_checked... etc... until we need to load an assembly, and when we can't find it, we call the managed assembly resolver event from mono_domain_assembly_postload_search.

The managed code tries to lock some managed data structure, but the monitor is already locked, so it waits.

Meanwhile other threads that want to do any type creation are deadlocked because they can't take the loader lock.

ghost commented 3 years ago

Tagging subscribers to this area: @CoffeeFlux See info in area-owners.md if you want to be subscribed.

Issue Details
The original issue is https://github.com/mono/mono/issues/21032 Adding a copy here to make sure we do this work for .NET 6, too. The assembly loading algorithm is different, but the same situation is possible --- Sample stack trace from VS Mac https://gist.github.com/Therzok/df71c056b0dabc8a2f3ffb6f1c264488 Thread 8 calls `mono_class_create_from_typedef` which then needs to get the parent type and calls `mono_class_get_checked`... etc... until we need to load an assembly, and when we can't find it, we call the managed assembly resolver event from `mono_domain_assembly_postload_search`. The managed code tries to lock some managed data structure, but the monitor is already locked, so it waits. Meanwhile other threads that want to do any type creation are deadlocked because they can't take the loader lock.
Author: lambdageek
Assignees: lambdageek
Labels: `area-VM-meta-mono`
Milestone: 6.0.0
lambdageek commented 3 years ago

Repro https://github.com/lambdageek/repro-gh-21032

(to run with mono on dotnet/runtime, need to cd X && dotnet publish --self-contained -r osx-x64 and then replace libcoreclr and System.Private.CoreLib by ones from a build of mono from dotnet/runtime)

lambdageek commented 3 years ago

This is hard https://github.com/mono/mono/issues/21032#issuecomment-827274328

It's essentially equivalent to having class loader levels - in particular being able to talk, outside of the loader lock, about a class without its parent being set.

lambdageek commented 3 years ago

One other, more targeted approach that is worth considering is to have some way for class creating and class initialization to return a "I need to load an assembly, please load it and retry" value. That means that instead of always just calling the loader and failing the class (setting MonoClass:has_failure) if it doesn't succeed, we have a flag on a class that says "I depend on something", and checking for that in many places in class loading.

lambdageek commented 2 years ago

Another stack trace with the same issue https://github.com/dotnet/runtime/issues/62419 - sporadic hangs in the System.Text.RegularExpressions.Tests testsuite in Release mode with interpreter.

lambdageek commented 1 year ago

Another idea (particularly for mono_class_create_from_typedef) is to pre-trigger assembly loading. We know we will need to look at the parent type and the interfaces of the class we want to instantiate. So we can go look at the corresponding metadata and figure out if we need to access other assemblies and trigger the assembly loading before we take the loader lock in mono_class_create_from_typedef before we create the initial MonoClass* object.

I'm not 100% sure if this is workable for generics (the parent might be a generic instance which mentions the type itself) but it's not too different, in principle from how we form the MonoImageSet* in order to properly parent a generic instance.

I'm also not sure if there's a way to do this efficiently without slowing down app loading.


Another option (particularly for #93686) is to postpone firing managed post-load callbacks until we are out of the loader lock. (Although this also might not be workable: it's a recursive lock, so we're not really sure how long we're postponing)

lambdageek commented 1 year ago

Another idea (particularly for mono_class_create_from_typedef) is to pre-trigger assembly loading. We know we will need to look at the parent type and the interfaces of the class we want to instantiate. So we can go look at the corresponding metadata and figure out if we need to access other assemblies and trigger the assembly loading before we take the loader lock in mono_class_create_from_typedef before we create the initial MonoClass* object.

So I started prototyping this, and I think I see the place where I will run into problems. The issue is that when we're parsing a generic instance type from metadata, we're already creating MonoTypes for all the type arguments (and the MonoType for a reference type already contains a MonoClass* - which means that the if the preloader wants to parse a generic inst it will call right back into class loading). (And that might be fine - except that we need to avoid cycles, so we need to pass the preloader's "visited typedefs" set through the metadata parser - and I was hoping to avoid adding a thread-local set).

So one way to work around this is to generalize the metadata parser to do some kind of callback/visitor mechanism (or just copypasta the metadata parser).

I'm going to try to prototype this a little bit more just to see if I can get something working, but I'm not really happy with this design.


The right solution still feels like having some kind of a "this MonoClassDef has been barebones reserved, but all we know is its parent MonoImage and type definition token" class loader level. That way we can use the normal parser (which would put classes into this state initially and then we would take the global lock when we go to resolve the parents at which point we wouldn't need to call back into managed at all)