Rare failures can occur when SegmentResortChains races with coreclr!Thread construction #107467

Closed ChrisAhna closed 1 month ago

ChrisAhna commented 1 month ago

Description

Over the years, failures reported by a few different high-scale services have revealed that all versions of CLR/CoreCLR contain a latent race condition which can cause failures (e.g., infinite loops). The failure arises when a call to SegmentResortChains happens to "line up perfectly" with both a) a racing unmanaged thread entering the runtime for the first time (and thus running the Thread ctor) and b) specific handle table state conditions which cause the handle allocations done in the Thread ctor to "contend perfectly" with the data structures being processed by SegmentResortChains.

See the "Reproduction Steps" and "Other information" sections below for additional information.

It is likely that this problem has existed in .NET "forever". That said, increasing cloud scale is (slowly) increasing the real-world failure rate (e.g., in a recent case, a service was seeing this failure several times per week). It would be terrific to eliminate this problem once and for all before it becomes any more prevalent.

Reproduction Steps

To date, all customer-reported failures have specifically been infinite loops in SegmentAllocHandlesFromTypeChain which occurred during a preemptive-mode HndCreateHandle call generated by the Thread ctor (specifically during the preemptive-mode execution of this ctor which occurs when an unmanaged thread enters the runtime for the first time).
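
For context, the path by which such a thread reaches this code looks roughly like the following (a sketch from memory; treat the exact frames as approximate rather than authoritative):

coreclr!HndCreateHandle
coreclr!CreateGlobalShortWeakHandle (and, just after, CreateGlobalStrongHandle)
coreclr!Thread::Thread
coreclr!SetupThread
[native-to-managed transition stub (e.g. a reverse P/Invoke thunk) on an unmanaged thread entering the runtime for the first time]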

Having Thread ctor execution "line up perfectly" with SegmentResortChains in the required manner is hard to hit: in addition to the timing being correct, the handle table state needs to meet a number of additional conditions at the same moment. This can happen occasionally at global scale, but (despite many days of effort) I haven't found a way to make the race condition fire in bounded time on a repro machine.

That said, I have been able to reproduce the race condition (and the specific customer-reported infinite loop) by having an unmanaged thread make a high rate of direct preemptive-mode calls to HndCreateHandle/HndDestroyHandle while another thread triggers repeated GCs.
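
As a rough sketch of the shape of that repro (the real repro app linked below does considerably more work to get the handle table into the required state; the handle table instance, handle type constant, and GC-triggering call used here are assumptions about how such a harness could be wired up, not an exact excerpt):

// Sketch only: one unmanaged thread churns preemptive-mode handle slots while
// another thread forces repeated foreground GCs.
void HandleChurnThreadProc(HHANDLETABLE hTable)
{
    while (true)
    {
        // A null-initialized slot allocation is the same preemptive-mode-legal
        // operation that the coreclr!Thread ctor performs.
        OBJECTHANDLE h = HndCreateHandle(hTable, HNDTYPE_DEFAULT, NULL);
        HndDestroyHandle(hTable, HNDTYPE_DEFAULT, h);
    }
}

void GcTriggerThreadProc()
{
    while (true)
    {
        // Force Gen1-or-larger foreground GCs so that SegmentResortChains keeps
        // running on this thread without the handle table lock.
        GCHeapUtilities::GetGCHeap()->GarbageCollect(1);
    }
}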

The linked informational repro app sources contain a number of additional details on the specific approach that is being used (including the techniques used to ensure that the handle table state meets all of the required conditions).

While this repro approach is fragile and awkward, it has made it possible to confirm that the customer-reported infinite loop is reproducible on .NET 6, .NET 8, and current dotnet/runtime (with both Checked and Release builds of the runtime). In all experiments so far (across all of these product versions), the repro approach reliably triggers an infinite loop within the first several minutes of execution, always in either coreclr!SegmentAllocHandlesFromTypeChain or coreclr!SegmentRemoveFreeBlocks.

Expected behavior

The runtime remains stable regardless of how Gen1-or-larger foreground GCs happen to interleave with preemptive-mode activity and general handle table activity elsewhere in the process.

Actual behavior

The runtime has always made some number of HndCreateHandle/HndDestroyHandle calls in preemptive mode (including at least the null handle table slot allocations done in the coreclr!Thread ctor), and these preemptive-mode operations can currently malfunction (e.g., get stuck in infinite loops) if they happen to "line up perfectly" with a coreclr!SegmentResortChains call made during a Gen1-or-larger foreground GC on a different thread.

In all customer reports to date, the visible malfunction was specifically an infinite loop in SegmentAllocHandlesFromTypeChain.

In experiments with the repro system discussed above (one thread makes back-to-back preemptive-mode HndCreateHandle/HndDestroyHandle calls while other threads trigger Gen1-or-larger foreground GCs), the visible malfunctions observed so far have always been infinite loops in either coreclr!SegmentAllocHandlesFromTypeChain or coreclr!SegmentRemoveFreeBlocks. (It is certainly possible that the fundamental race condition can cause additional bad outcomes beyond the "noisy" infinite loops that have been concretely observed so far.)

Regression?

Not a regression. (This problem exists in .NET Framework and all versions of .NET Core.)

Known Workarounds

There don't seem to be any practical workarounds.

(Reducing the number of times the coreclr!Thread ctor runs over the life of the process reduces the chance of triggering the problem. But making that happen, if it is possible at all, commonly requires a "scary" change: a system that currently does its work by starting and ending many dedicated from-scratch threads over time has to be converted to do the same work by reusing members of a longer-lived pool of threads, as sketched below. So "helping the chances" in this way may be practical in some cases, but isn't practical in general.)
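
To make the shape of that conversion concrete, here is a minimal, purely illustrative sketch using standard-library threads (nothing in it is specific to the runtime): instead of creating a fresh thread per work item, a small pool of long-lived workers is created once, so the coreclr!Thread ctor runs at most once per worker rather than once per work item.

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class WorkerPool {
    std::vector<std::thread> workers_;
    std::queue<std::function<void()>> work_;
    std::mutex m_;
    std::condition_variable cv_;
    bool stopping_ = false;
public:
    explicit WorkerPool(size_t n) {
        for (size_t i = 0; i < n; ++i)
            workers_.emplace_back([this] {
                for (;;) {
                    std::function<void()> job;
                    {
                        std::unique_lock<std::mutex> lk(m_);
                        cv_.wait(lk, [this] { return stopping_ || !work_.empty(); });
                        if (stopping_ && work_.empty()) return;
                        job = std::move(work_.front());
                        work_.pop();
                    }
                    job(); // e.g., a call that enters the runtime
                }
            });
    }
    void Post(std::function<void()> job) {
        { std::lock_guard<std::mutex> lk(m_); work_.push(std::move(job)); }
        cv_.notify_one();
    }
    ~WorkerPool() {
        { std::lock_guard<std::mutex> lk(m_); stopping_ = true; }
        cv_.notify_all();
        for (auto& t : workers_) t.join();
    }
};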

Configuration

Customers have reported the problem in Windows x64 .NET Framework configurations.

I have reproduced the problem in Windows x64 configurations across all versions of .NET (i.e., .NET Framework, .NET 6, .NET 8, and current dotnet/runtime).

I believe that the problem exists across all CoreCLR configurations (i.e., the two sides of the fundamental race condition both appear to be in code that is common across all possible CoreCLR builds).

Other information

There are two sides to the fundamental latent race condition.

First, across all versions of CoreCLR, the coreclr!Thread ctor allocates two handle table slots:

// Both handle table slots are allocated with an initial value of NULL:
m_ExposedObject = CreateGlobalShortWeakHandle(NULL);
m_StrongHndToExposedObject = CreateGlobalStrongHandle(NULL);

Since the Thread is not yet fully created at this point (e.g., it is not yet in the ThreadStore), the thread is by definition effectively still in preemptive mode. Consistent with this, handle table slot allocations (i.e., HndCreateHandle operations) are allowed to occur in preemptive mode as long as the allocated slot is initially set to null (as opposed to being bound to a non-null object reference).
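
As an illustration of that rule (the GCX_COOP/StoreObjectInHandle pairing below reflects my understanding of how a non-null reference would normally be written, and is not part of the Thread ctor path itself):

// Allocating a slot whose initial value is NULL is legal in preemptive mode:
OBJECTHANDLE handle = CreateGlobalShortWeakHandle(NULL);

// Writing an actual object reference into the slot is a different matter and
// is done in cooperative mode, for example:
{
    GCX_COOP();
    StoreObjectInHandle(handle, objRef);
}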

Second, across all versions of CoreCLR, calls to SegmentResortChains always happen during GC and always occur along the following stack:

coreclr!SegmentResortChains
coreclr!StandardSegmentIterator
...

During foreground GCs, the handle table lock is not held at the time of this call (consistent with the broader model where foreground GC handle table scanning acquires the lock only when needed, whereas background GC handle table scanning holds the lock at all times except when it is temporarily released in xxxTableScanQueuedBlocksAsync).

If one of the allocations in the coreclr!Thread ctor cannot be satisfied by the handle table cache, it will acquire the handle table lock and call through to functions like coreclr!SegmentAllocHandlesFromTypeChain.
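
The approximate shape of that path is shown below (frame names are from my reading of the handle table cache code and should be treated as approximate):

coreclr!SegmentAllocHandlesFromTypeChain
coreclr!SegmentAllocHandles
coreclr!TableAllocBulkHandles            (runs with the handle table lock held)
coreclr!TableCacheMissOnAlloc            (acquires the handle table lock on a cache miss)
coreclr!TableAllocSingleHandleFromCache
coreclr!HndCreateHandle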

Since the coreclr!Thread ctor is running in preemptive mode, a foreground GC on another thread can call coreclr!SegmentResortChains at the same time. Because that path does not acquire the handle table lock, the result can be a) the GC thread running coreclr!SegmentResortChains while b) another thread (which holds the handle table lock) is concurrently processing the same data structures (e.g., in coreclr!SegmentAllocHandlesFromTypeChain).

SegmentResortChains is not hardened against the possibility of racing modifications to the handle table. In other words, it takes a number of actions that are really only safe if the handle table lock is held. (In the linked informational repro app sources, see the #CORE_FAILURE_SEQUENCE mark in the NativeDll.cpp file for details on how a race between SegmentResortChains and SegmentAllocHandlesFromTypeChain can put SegmentAllocHandlesFromTypeChain into the kind of infinite loop that customers have reported.)
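
To make the mechanism easier to picture, the following is a deliberately simplified, self-contained model of this class of failure. It is not CoreCLR code: the "walker" stands in for the lock-holding allocation loop (a la SegmentAllocHandlesFromTypeChain) and the unsynchronized re-link stands in for SegmentResortChains. The re-link is done up front so the broken outcome is deterministic; in the runtime the two genuinely race.

#include <array>
#include <atomic>
#include <chrono>
#include <cstdio>
#include <thread>

struct Segment {
    std::atomic<Segment*> next{nullptr};
};

int main() {
    std::array<Segment, 4> segs;
    for (int i = 0; i < 4; ++i)
        segs[i].next = &segs[(i + 1) % 4];   // initial chain: 0 -> 1 -> 2 -> 3 -> 0

    // What the unsynchronized re-sort can do while a walker is mid-traversal:
    // rebuild the cycle without the walker's starting segment (1 -> 2 -> 3 -> 1).
    segs[3].next = &segs[1];

    // The allocation loop's termination condition boils down to "stop when the
    // walk comes back around to where it started" (the "found a usable segment"
    // exit is omitted here). With segment 0 spliced out of the cycle, that
    // condition can never become true again.
    std::atomic<bool> give_up{false};
    std::thread watchdog([&] {
        std::this_thread::sleep_for(std::chrono::seconds(2));
        give_up = true;                      // only here so the demo terminates
    });

    Segment* start = &segs[0];
    Segment* cur = start;
    do {
        cur = cur->next.load();
    } while (cur != start && !give_up.load());

    watchdog.join();
    std::puts(give_up ? "walker was stuck circling 1 -> 2 -> 3" : "walker terminated normally");
}

The real failure requires much more specific handle table state, but the core hazard is the same: the re-sort rewires links that a concurrent, lock-holding walker relies on for its termination condition.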

Fixing this latent problem requires eliminating the race condition, and therefore requires modifying at least one side of it so that it can no longer participate in this kind of failure (e.g., by adding synchronization to that side, by eliminating that side entirely, etc.).

dotnet-policy-service[bot] commented 1 month ago

Tagging subscribers to this area: @mangod9
See info in area-owners.md if you want to be subscribed.