Open uweigand opened 11 months ago
It seems to me the root cause of the deadlock is that
mono_class_create_from_typedef
holdsmono_loader_lock
across function calls that may trigger invoking managed code (the assembly load hooks) which might do anything - this doesn't seem a good idea.
Good catch!
Related to / duplicate of https://github.com/dotnet/runtime/issues/51864
This is hard
@uweigand this will be quite tricky to fix and I think even if we put something together, it is highly unlikely that we will backport the fix in upstream net8.0 as it's bound to be risky (any time we make big changes to class-init.c
we uncover issues with the change).
I would encourage, to whatever extent possible, to fix this in user code by rearranging the calling code to trigger assembly loading early. (I realize in this case I'm proposing that you change MSBuild... sorry.)
I did reproduce the hang issue on dotnet 8 on different Redhat beaker machines with different RHEL distros.
I was able to reproduce for around 10 or 12 hang situations and in each of the case, when I attached gdb and looked at the backtrace information, I see the same backtrace as attached in the file..
I did work with Ulrich and local team in IBM, and Ulrich helped us to figure out that its the exact same issue as this one.
I have attached the logfile with all the details of all the process and threads backtrace.. dotnet_8_hang_issue_gdb_all_proc_info.txt
@Vishwanatha-HD has now found an instance of this deadlock in the .NET 6 source-build. Unfortunately, while one of the involved locks is also the mono_loader_lock
, the other involved lock is a different one; not a Microsoft.Build.BackEnd.Logging.LoggingService._lockObject
, but rather an _lock
object inside the implementation of ArrayPool<byte>.Shared
in the runtime.
Again, we have a deadlock between two of msbuild's threads, this time a dedicated event logging thread and some thread pool thread.
The event logging thread
Microsoft.Build.BackEnd.Logging.LoggingService:RouteBuildEvent
System.IO.BinaryWriter:Write (string)
callArrayPool<byte>.Shared.Rent
and returned via ArrayPool<byte>.Shared.Return
Shared.Return
stores the returned buffer for re-use in a per-thread pool, which on first invocation is initialized by InitializeTlsBucketsAndTrimming
ConditionalWeakTable::Add
on its _allTlsBuckets
variableAdd
TAKES an internal _lock
objectmono_jit_compile_method
, which TAKES the mono_loader_lock
, where we deadlockThe thread pool thread
Microsoft.Build.BackEnd.AssemblyTaskFactory:InitializeFactory
Microsoft.Build.Shared.TypeLoader:Load
Microsoft.Build.Shared.TypeLoader/AssemblyInfoToLoadedTypes:ScanAssemblyForPublicTypes
Microsoft.Build.Shared.MSBuildLoadContext
and then inspects it via System.Reflection.RuntimeAssembly:GetExportedTypes
mono_class_create_from_typedef
mono_loader_lock
mono_class_get_checked
, which ends up calling mono_alc_invoke_resolve_using_load
Microsoft.Build.Shared.MSBuildLoadContext:Load
installed aboveSystem.IO.File:Exists
Interop.Sys.LStat
ValueUtf8Converter
to perform string marshallingArray<byte>.Shared
buffer, and the dispose operation calls Return
on itShared.Return
calls InitializeTlsBucketsAndTrimming
_allTlsBuckets
via ConditionalWeakTable::Add
and deadlocksHi All, I have attached a txt file with the complete backtrace information of all the processes and threads involved in reproducing the .NET6 source build hang issue. dotnet-6-source-build-hang-process-thread-bt-info.txt
@uweigand overall how frequent are these deadlocks?
I tried to prototype one idea for a fix (essentially avoid loader callbacks into managed code while holding the global loader lock by pre-loading any assemblies mentioned by a a MonoClass before taking the lock), and it was very invasive, was likely to result in overall performance degradation and also never quite converged on a correct solution (I could still see the loader lock held when loading assemblies - and I ran out of time on the experiment to fix them all).
I'm going to think about some alternate approaches during .NET 9, but I'm not certain there is any reasonable approach here.
mono_jit_compile_method
specifically might be somewhat fixable:
// FIXME Locking here is somewhat historical due to mono_register_jit_icall_wrapper taking loader lock.
// atomic_compare_exchange should suffice.
mono_loader_lock ();
mono_jit_lock ();
if (!callinfo->wrapper) {
callinfo->wrapper = p;
}
mono_jit_unlock ();
mono_loader_unlock ();
This will likely just shift around the specific circumstances of the deadlock, however. It wont' fix the fundamental problem.
I've now done a bit of code-flow analysis to try and figure out the places where Mono invokes a C# routine while holding the loader lock. The following list contains functions that take the loader lock, and lists all calls to subroutines while holding the lock that potentially invoke a C# routine on some path. This may not be 100% accurate (there may be false positives as in many cases, the invocation of the C# routine is conditional, and it might happen that at this call site the condition is never true; and there may be false negatives in case I missed part of the CFG due to indirect calls or macro magic), but it does show that this problem might be more widespread that I originally thought.
I've done the same analysis on the code after applying the "ready levels" patch https://github.com/dotnet/runtime/pull/95927, but unfortunately this doesn't look significantly different - even when accounting for the fact that some of the recursive class loader calls under the lock no longer load the parent classes, most of these functions still have other code paths where C# is invoked.
mono/metadata/class-init.c:mono_class_setup_fields
--> mono_class_layout_fields
mono/metadata/class-init.c:mono_class_create_from_typedef[_at_level]
--> mono_metadata_load_generic_params (LEVEL: no longer under loader lock, but now under image lock)
--> mono_metadata_lookup_generic_class (LEVEL: no longer under loader lock, but now under image lock)
--> mono_class_get_checked
--> mono_class_inflate_generic_class_checked
--> mono_class_setup_mono_type
--> fix_gclass_incomplete_instantiation
--> mono_class_create_from_typedef
--> mono_metadata_interfaces_from_typedef_full
--> mono_class_find_enum_basetype
--> mono_class_from_mono_type_internal
--> class_has_isbyreflike_attribute
--> class_has_inlinearray_attribute
mono/metadata/class-init.c:mono_class_create_generic_inst[_at_level]
--> check_valid_generic_inst_arguments (LEVEL: additionally under lock)
--> mono_type_get_full_name (LEVEL: additionally under lock)
--> mono_class_get_nullable_param_internal
--> mono_generic_class_setup_parent
--> class_loaded
mono/metadata/class-init.c:mono_class_create_ptr_at_level:
--> class_loaded (LEVEL: additionally under lock)
mono/metadata/class-init.c:mono_class_layout_fields
--> mono_class_full_name
mono/metadata/class-init.c:mono_class_setup_interface_id
--> mono_class_setup_interface_id_nolock --> mono_get_unique_iid
mono/metadata/class-setup-vtable.c:mono_class_setup_interface_offsets_internal
--> mono_class_setup_interface_id_nolock
--> mono_class_get_implemented_interfaces
--> mono_type_get_full_name
--> mono_class_setup_count_virtual_methods
mono/metadata/object.c:mono_vtable_build_imt_slot
--> build_imt_slot
mono/metadata/object.c:mono_method_add_generic_virtual_invocation
--> mini_get_imt_trampoline (via callbacks.get_imt_trampoline)
--> mini_get_vtable_trampoline (via callbacks.get_vtable_trampoline)
--> mono_arch_build_imt_trampoline (via mono_get_runtime_callbacks ()->build_imt_trampoline)
mono/metadata/object.c:allocate_collectible_static_fields
--> mono_class_get_fields_internal
--> allocate_loader_alloc_slot
--> mono_class_from_mono_type_internal
--> mono_object_new_pinned
mono/metadata/object.c:mono_class_create_runtime_vtable
--> mono_class_init_internal
--> mono_class_setup_vtable
--> mono_class_check_vtable_constraints
--> mono_class_has_finalizer
--> mono_class_data_size
--> alloc_vtable
--> mono_mem_manager_get_loader_alloc
--> mono_class_compute_gc_descriptor
--> allocate_collectible_static_fields
--> compute_class_bitmap
--> mono_class_get_fields_internal
--> field_is_special_static
--> mono_class_from_mono_type_internal
--> mono_type_size
--> mono_field_get_data
--> mono_class_value_size
--> mono_type_get_object_handle
mono/metadata/sre.c:reflection_setup_class_hierarchy
--> mono_class_from_mono_type_internal
--> mono_class_setup_mono_type
mono/metadata/sre.c:reflection_setup_internal_class_internal
--> reflection_init_generic_class
--> reflection_setup_internal_class
--> mono_reflection_type_handle_mono_type
--> mono_class_setup_vtable_general
--> mono_class_setup_mono_type
--> mono_class_from_mono_type_internal
mono/metadata/sre.c:reflection_methodbuilder_to_mono_method
--> mono_reflection_type_get_handle
--> mono_metadata_type_dup
--> mono_get_shared_generic_inst
--> mono_marshal_spec_from_builder
mono/metadata/sre.c:ctorbuilder_to_mono_method
--> mono_reflection_methodbuilder_from_ctor_builder
--> ctor_builder_to_signature_raw
mono/metadata/sre.c:methodbuilder_to_mono_method
--> method_builder_to_signature
mono/metadata/sre.c:ves_icall_TypeBuilder_create_runtime_class
--> mono_class_setup_mono_type
--> ensure_runtime_vtable
--> reflection_setup_internal_class
--> mono_class_from_mono_type_internal
--> typebuilder_setup_fields
--> remove_instantiations_of_and_ensure_contents
mono/metadata/sre.c:reflection_create_dynamic_method
--> mono_gc_reference_queue_add_internal
mono/metadata/reflection.c:mono_type_get_object_checked /*FIXME mono_class_init_internal and mono_class_vtable acquire it*/
--> mono_type_normalize
--> mono_type_get_full_name
--> mono_weak_hash_table_insert
--> mono_mem_manager_get_loader_alloc
--> mono_object_new_pinned
--> mono_type_get_object_checked (!!! recursive call under lock !!!)
--> mono_mem_manager_init_reflection_hashes (recursive!)
--> mono_class_from_mono_type_internal (recursive!)
mono/metadata/reflection.c:mono_reflection_bind_generic_parameters
--> mono_reflection_type_handle_mono_type
--> mono_class_from_mono_type_internal
mono/metadata/metadata.c:mono_metadata_get_canonical_generic_inst // Hashtable key equal func can take loader lock
--> mono_metadata_generic_inst_equal
--> mono_metadata_type_dup
mono/metadata/metadata.c:mono_metadata_get_canonical_aggregate_modifiers // Hashtable key equal func can take loader lock
--> aggregate_modifiers_equal
--> mono_metadata_type_dup
mono/metadata/metadata.c:mono_metadata_lookup_generic_class // Hashtable key equal func can take loader lock
--> mono_generic_class_equal (via mono_conc_hashtable_insert)
mono/mini/mini.c:mono_jit_compile_method_inner
--> mono_method_is_generic_sharable
--> mono_method_desc_full_match
--> mono_method_get_full_name
--> mono_emit_jit_map
mono/mini/mini-generic-sharing.c:mono_class_get_runtime_generic_context_template
--> mono_class_get_runtime_generic_context_template (!!! recursive call under lock !!!)
--> class_get_rgctx_template_oti
mono/mini/mini-llvm.c:mono_llvm_emit_method
--> mono_aot_get_mangled_method_name
--> mono_aot_get_method_name
--> mono_method_full_name
--> emit_method_inner
--> get_callee
mono/mini/aot-runtime.c:init_amodule_got
--> decode_patches
--> mini_llvm_init_method
--> mono_resolve_patch_target
mono/mini/aot-runtime.c:load_container_amodule
--> load_aot_module
mono/component/hot_reload.c:hot_reload_get_or_add_ginst_update_info
--> recompute_ginst_update_info
mono/component/debugger-engine.c:mono_de_add_pending_breakpoints
--> mono_jit_search_all_backends_for_jit_info
--> insert_breakpoint
mono/component/debugger-engine.c:mono_de_set_breakpoint
--> set_bp_in_method
--> mono_jit_search_all_backends_for_jit_info
--> mono_debugger_log_add_bp
mono/component/debugger-engine.c:mono_de_clear_breakpoint
--> mono_debugger_log_remove_bp
mono/component/debugger-engine.c:mono_de_clear_breakpoints_for_domain
--> remove_breakpoint
mono/component/debugger-engine.c:mono_de_process_breakpoint
--> mono_find_prev_seq_point_for_native_offset
--> no_seq_points_found
--> mono_debugger_log_bp_hit
--> mono_de_frame_async_id
--> mono_de_ss_update
--> mono_de_ss_start
mono/component/debugger-engine.c:mono_de_ss_start
--> mono_find_prev_seq_point_for_native_offset
--> mono_de_start_single_stepping
mono/component/debugger-agent.c:suspend_vm
--> mono_de_start_single_stepping
--> notify_thread
mono/component/debugger-agent.c:process_frame
--> calc_il_offset
--> mono_method_full_name
--> mono_marshal_method_from_wrapper
mono/component/debugger-agent.c:send_types_for_domain
--> emit_type_load
mono/component/debugger-agent.c:event_requests_cleanup
--> mono_de_clear_breakpoint
mono/component/debugger-agent.c:clear_event_request
--> mono_de_clear_breakpoint
--> mono_de_cancel_ss
mono/component/debugger-agent.c:clear_event_requests_for_assembly
--> clear_event_request
--> ss_clear_for_assembly
mono/component/debugger-agent.c:clear_types_for_assembly
--> type_comes_from_assembly
mono/component/debugger-agent.c:dispose_vm
--> clear_event_request
mono/component/debugger-agent.c:vm_commands
--> clear_event_request
--> get_types_for_source_file
--> get_types
mono/component/debugger-agent.c:event_commands
--> emit_appdomain_load
--> send_assemblies_for_domain
--> emit_thread_start
--> send_types_for_domain
--> clear_event_request
--> mono_de_clear_breakpoint
As an aside, it seems unclear what exactly the loader lock is protecting - many of these uses of the lock should probably use some other serialization mechanism (if any is needed at all) ...
We've seen a deadlock in msbuild when performing a source-build of rc2 natively on s390x (using the Mono runtime). In a nutshell, the deadlock occurs due to a lock-order inversion between the Mono JIT's
mono_loader_lock
and the managedMicrosoft.Build.BackEnd.Logging.LoggingService._lockObject
.Specifically, we have two involved threads, the main thread and a thread pool thread.
The main thread runs
NuGet.Build.Tasks.Console.Program:Main
which callsNuGet.Build.Tasks.Console.MSBuildStaticGraphRestore:LoadProjects
which callsMicrosoft.Build.Evaluation.ProjectCollection:.ctor
which callsMicrosoft.Build.BackEnd.Logging.LoggingService:RegisterLogger
which TAKESMicrosoft.Build.BackEnd.Logging.LoggingService._lockObject
and callsMicrosoft.Build.Evaluation.ProjectCollection/ReusableLogger:RegisterForEvents
Microsoft.Build.BackEnd.Logging.EventSourceSink:add_BuildStarted
gets invokedmono_class_create_from_typedef
which TAKESmono_loader_lock
The thread pool thread runs
NuGet.Build.Tasks.Console.ConsoleLoggingQueue:Process
which callsNuGet.Build.Tasks.ConsoleOutLogMessage:ToJson
which callsSystem.Collections.Concurrent.ConcurrentDictionary'2<TKey_REF, TValue_REF>:GetOrAdd
Newtonsoft.Json.Serialization.DefaultContractResolver:CreateContract
gets invokedmono_class_create_from_typedef
which TAKESmono_loader_lock
mono_metadata_interfaces_from_typedef_full
, which loads for the first timeSystem.Linq.Expressions.dll
Microsoft.Build.BackEnd.Components.RequestBuilder.AssemblyLoadsTracker:CurrentDomainOnAssemblyLoad
which callsMicrosoft.Build.BackEnd.Logging.LoggingService:ProcessLoggingEvent
which TAKESMicrosoft.Build.BackEnd.Logging.LoggingService._lockObject
It seems to me the root cause of the deadlock is that
mono_class_create_from_typedef
holdsmono_loader_lock
across function calls that may trigger invoking managed code (the assembly load hooks) which might do anything - this doesn't seem a good idea.CC - @directhex @lambdageek @vargaz @akoeplinger FYI - @giritrivedi @alhad-deshpande @janani66 @omajid @tmds