dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.26k stars 4.73k forks source link

Compiler crashes when failing to release Mutex #53420

Open ViktorHofer opened 3 years ago

ViktorHofer commented 3 years ago

Error Message

Fill the error message using step by step known issues guidance.

{
  "ErrorMessage": "ReleaseMutex failed. WaitOne Id:",
  "BuildRetry": true,
  "ErrorPattern": "",
  "ExcludeConsoleLog": true
}

Example error:

/__w/1/s/.packages/microsoft.net.compilers.toolset/3.10.0-4.21273.6/tasks/netcoreapp3.1/Microsoft.CSharp.Core.targets(71,5): error MSB3883: Unexpected exception:  [/__w/1/s/src/tasks/installer.tasks/installer.tasks.csproj]
/__w/1/s/.packages/microsoft.net.compilers.toolset/3.10.0-4.21273.6/tasks/netcoreapp3.1/Microsoft.CSharp.Core.targets(71,5): error : One or more errors occurred. (ReleaseMutex failed. WaitOne Id: 14 Release Id: 14) [/__w/1/s/src/tasks/installer.tasks/installer.tasks.csproj]

Build: https://dev.azure.com/dnceng/public/_build/results?buildId=1161146&view=logs&j=1a1e7658-955a-5bfe-8410-74fa4a1a1fbf&t=9528cbd6-a8dd-5006-9c48-38938c12a3af

Configuration: CoreCLR Product Build Linux arm64 release

cc @jkotas @jaredpar @stephentoub

Known issue validation

Build: :mag_right: Result validation: :warning: Provided build not found. Provide a valid build in the "Build: :mag_right:" line. Validation performed at: 9/15/2023 5:21:46 PM UTC

Report

Build Definition Step Name Console log Pull Request
851653 dotnet/runtime Build product Log
838678 dotnet/runtime Restore and Build Product Log dotnet/runtime#108761
829620 dotnet/runtime Build product Log dotnet/runtime#108614

Summary

24-Hour Hit Count 7-Day Hit Count 1-Month Count
0 0 3
dotnet-issue-labeler[bot] commented 3 years ago

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

jkotas commented 3 years ago

Is the exception that it failed with written in some log?

ghost commented 3 years ago

Tagging subscribers to this area: @cston See info in area-owners.md if you want to be subscribed.

Issue Details
``` /__w/1/s/.packages/microsoft.net.compilers.toolset/3.10.0-4.21273.6/tasks/netcoreapp3.1/Microsoft.CSharp.Core.targets(71,5): error MSB3883: Unexpected exception: [/__w/1/s/src/tasks/installer.tasks/installer.tasks.csproj] /__w/1/s/.packages/microsoft.net.compilers.toolset/3.10.0-4.21273.6/tasks/netcoreapp3.1/Microsoft.CSharp.Core.targets(71,5): error : One or more errors occurred. (ReleaseMutex failed. WaitOne Id: 14 Release Id: 14) [/__w/1/s/src/tasks/installer.tasks/installer.tasks.csproj] ``` Build: https://dev.azure.com/dnceng/public/_build/results?buildId=1161146&view=logs&j=1a1e7658-955a-5bfe-8410-74fa4a1a1fbf&t=9528cbd6-a8dd-5006-9c48-38938c12a3af Configuration: `CoreCLR Product Build Linux arm64 release` cc @jkotas @jaredpar @stephentoub
Author: ViktorHofer
Assignees: -
Labels: `area-Microsoft.CSharp`, `bug`, `untriaged`
Milestone: 6.0.0
ericstj commented 3 years ago

Doesn't appear to be, this is all we have in the binlog https://artprodcus3.artifacts.visualstudio.com/Ab55de4ed-4b5a-4215-a8e4-0a0a5f71e7d8/9ee6d478-d288-47f7-aacc-f6e6d082ae6d/_apis/artifact/cGlwZWxpbmVhcnRpZmFjdDovL2RuY2VuZy9wcm9qZWN0SWQvOWVlNmQ0NzgtZDI4OC00N2Y3LWFhY2MtZjZlNmQwODJhZTZkL2J1aWxkSWQvMTE2MTE0Ni9hcnRpZmFjdE5hbWUvQnVpbGRMb2dzX0NvcmVDTFJfTGludXhfYXJtNjRfcmVsZWFzZQ2/content?format=file&subPath=%2FRelease%2FBuild.binlog

Essentially the same two lines Viktor shared.

hoyosjs commented 3 years ago

So far I can only find that one instance: https://runfo.azurewebsites.net/search/timelines/?q=started%3A%7E7+definition%3Aruntime+message%3A%22%5EReleaseMutex+failed.%22

jaredpar commented 3 years ago

That crash is coming from this line

https://sourceroslyn.io/#Microsoft.Build.Tasks.CodeAnalysis/BuildServerConnection.cs,239

When communicating between MSBuild nodes and the Compiler server we use a Mutex as a part of the initial handshake. The code here is crashing because when we release the Mutex via Dispose we're getting an ApplicationException.

Our code considers this fatal because the docs say that ApplicationException will only be thrown when the thread calling Dispose does not own the Mutex. Basically coding error on our part that is significant. That is also why we log the thread IDs in the message here. You'll notice though in this case that the thread IDs are the same though which means the same thread both opened and released the Mutex which should be safe (based on our understanding).

What other circumstances can cause an ApplicationException to happen here?

@RikkiGibson, @chsienki, @cston

cston commented 3 years ago

in this case that the thread IDs are the same though which means the same thread both opened and released the Mutex which should be safe

Could the Mutex have been disposed on another thread?

jaredpar commented 3 years ago

In our case unlikely. This Mutex instance is created and managed on a per-thread basis. It's a local that isn't shared amongst threads and the message confirms that there isn't any tricky await code involved that causes us to inadvertently switch threads.

The one issue to keep in mind is that we have an abstraction over Mutex here to deal with the difference implementation between the server on the Mono and .NET runtime. Hence the Dispose call is really a wrapper over:

        public void Dispose()
        {
            if (IsDisposed)
                return;
            IsDisposed = true;

            try
            {
                if (IsLocked)
                    Mutex.ReleaseMutex();
            }
            finally
            {
                Mutex.Dispose();
                IsLocked = false;
            }
        }

One possibility is we are incorrectly setting IsLocked which would lead to this exception being thrown.

jkotas commented 3 years ago

Failed again in #53590 . I think it is most likely Linux arm64 specific bug in CoreCLR PAL.

jkotas commented 3 years ago

Current runfo query shows a hit on x64 too, so it is not arm64 specific.

danmoseley commented 3 years ago

Again, but in Mono Android Arm64 release build, using 3.1.

2021-06-25T16:42:48.0342786Z  /__w/1/s/build.sh -ci -arch x64 -os Android -s mono+libs+host+packs+mono.mscordbi -c Release /p:OfficialBuildId=20210625.7 /p:CrossBuild=false  
2021-06-25T16:42:48.0343494Z ========================== Starting Command Output ===========================
2021-06-25T16:42:48.0364929Z [command]/bin/bash --noprofile --norc /__w/_temp/455af3b1-60d8-49d1-88ab-abbd6629e02b.sh
2021-06-25T16:42:48.0702381Z __DistroRid: android-x64
2021-06-25T16:42:48.2260529Z /__w/1/s/.dotnet/sdk/6.0.100-preview.4.21255.9/MSBuild.dll /nologo -logger:/__w/1/s/.packages/microsoft.dotnet.arcade.sdk/6.0.0-beta.21311.3/tools/netcoreapp3.1/Microsoft.DotNet.ArcadeLogging.dll -maxcpucount /m -verbosity:m /v:minimal /bl:/__w/1/s/artifacts/log/Release/Build.binlog /clp:Summary /nr:false /p:TreatWarningsAsErrors=true /p:ContinuousIntegrationBuild=true /p:Configuration=Release /p:RepoRoot=/__w/1/s/ /p:Restore=true /p:Build=true /p:Rebuild=false /p:Test=false /p:Pack=false /p:IntegrationTest=false /p:PerformanceTest=false /p:Sign=false /p:Publish=false /p:TargetOS=Android /p:Subset=mono+libs+host+packs+mono.mscordbi /p:TargetArchitecture=x64 /p:BuildArchitecture=x64 /p:CMakeArgs= /p:OfficialBuildId=20210625.7 /p:CrossBuild=false /warnaserror /__w/1/s/.packages/microsoft.dotnet.arcade.sdk/6.0.0-beta.21311.3/tools/Build.proj
...
2021-06-25T16:45:35.4955064Z /__w/1/s/.packages/microsoft.net.compilers.toolset/3.10.0-4.21273.6/tasks/netcoreapp3.1/Microsoft.CSharp.Core.targets(71,5): error MSB3883: Unexpected exception:  [/__w/1/s/src/tasks/Crossgen2Tasks/Crossgen2Tasks.csproj]
2021-06-25T16:45:35.5037748Z ##[error].packages/microsoft.net.compilers.toolset/3.10.0-4.21273.6/tasks/netcoreapp3.1/Microsoft.CSharp.Core.targets(71,5): error MSB3883: (NETCORE_ENGINEERING_TELEMETRY=Build) Unexpected exception: 
2021-06-25T16:45:35.5051214Z /__w/1/s/.packages/microsoft.net.compilers.toolset/3.10.0-4.21273.6/tasks/netcoreapp3.1/Microsoft.CSharp.Core.targets(71,5): error : One or more errors occurred. (ReleaseMutex failed. WaitOne Id: 14 Release Id: 14) [/__w/1/s/src/tasks/Crossgen2Tasks/Crossgen2Tasks.csproj]
2021-06-25T16:45:35.5056497Z ##[error].packages/microsoft.net.compilers.toolset/3.10.0-4.21273.6/tasks/netcoreapp3.1/Microsoft.CSharp.Core.targets(71,5): error : (NETCORE_ENGINEERING_TELEMETRY=Build) One or more errors occurred. (ReleaseMutex failed. WaitOne Id: 14 Release Id: 14)
2021-06-25T16:45:35.5059398Z /__w/1/s/.packages/microsoft.net.compilers.toolset/3.10.0-4.21273.6/tasks/netcoreapp3.1/Microsoft.CSharp.Core.targets(71,5): error MSB3883: Unexpected exception:  [/__w/1/s/src/tasks/AotCompilerTask/MonoAOTCompiler.csproj]
2021-06-25T16:45:35.5063602Z ##[error].packages/microsoft.net.compilers.toolset/3.10.0-4.21273.6/tasks/netcoreapp3.1/Microsoft.CSharp.Core.targets(71,5): error MSB3883: (NETCORE_ENGINEERING_TELEMETRY=Build) Unexpected exception: 
2021-06-25T16:45:35.5089855Z /__w/1/s/.packages/microsoft.net.compilers.toolset/3.10.0-4.21273.6/tasks/netcoreapp3.1/Microsoft.CSharp.Core.targets(71,5): error : One or more errors occurred. (ReleaseMutex failed. WaitOne Id: 14 Release Id: 14) [/__w/1/s/src/tasks/AotCompilerTask/MonoAOTCompiler.csproj]
2021-06-25T16:45:35.5094289Z ##[error].packages/microsoft.net.compilers.toolset/3.10.0-4.21273.6/tasks/netcoreapp3.1/Microsoft.CSharp.Core.targets(71,5): error : (NETCORE_ENGINEERING_TELEMETRY=Build) One or more errors occurred. (ReleaseMutex failed. WaitOne Id: 14 Release Id: 14)
janvorli commented 3 years ago

@jaredpar I would like to look into this issue - is there a reasonable way to repro it locally?

jaredpar commented 3 years ago

I haven't been able to reproduce this locally. It only shows up sporadically in customer / CI reports.

mangod9 commented 3 years ago

I am moving this to 7 since its not actionable at the moment without a repro. We can move it back if there is one available

Madajevas commented 2 years ago

+1. This happens quite often when building multiple containers with docker compose (each using mcr.microsoft.com/dotnet/sdk:5.0.403-alpine3.14-amd64). Buildkit enabled.

mangod9 commented 2 years ago

Hey @Madajevas, do you have any repro steps or a docker file you could share?

Madajevas commented 2 years ago

Unfortunately what I have cannot be shared. Will look which containers fails most often and maybe it will be possible to create some minimal repo without corporate code

mangod9 commented 2 years ago

ok thanks, also would be curious to know if it repros with 6.0 as well?

Madajevas commented 2 years ago

Not using 6.0 right now. Not planning to switch until Q1. Just a wild idea, haven't analyzed code and I'm probably wrong. Could this mutex be shared then docker builds multiple images in parallel and due to some race condition one build disposes mutex and then another one tries to do the same and fails? Not sure if we experienced this on windows but CentOS 7 build server crashes at leas once a day if build happens multiple times. Nothing of the sort if single image is being built. Since this happens in docker build no message of inner (actual) exception is visible.

jaredpar commented 2 years ago

Could this mutex be shared then docker builds multiple images in parallel and due to some race condition one build disposes mutex and then another one tries to do the same and fails?

From the perspective of the Mutex API that should not be a problem. The only issue there is that we make sure that we do the create / dispose from the same thread. It's possible that this is rubbing up against an implementation detail of Mutex though.

mikeymikeymikeymikey commented 2 years ago

This happens to us every once in a while in our CI. .NET SDK 6.0.100

danmoseley commented 2 years ago

@kouvel may have thoughts

am11 commented 2 years ago

Looks like this condition is failing: https://github.com/dotnet/runtime/blob/57bfe474518ab5b7cfe6bf7424a79ce3af9d6657/src/libraries/System.Private.CoreLib/src/System/Threading/WaitSubsystem.WaitableObject.Unix.cs#L713 If current thread was the owning thread, then the only other explanation is waitableobject was already signaled.

kouvel commented 2 years ago

@am11 I don't think that would be it since that implementation does not support named mutexes.

I didn't see anything obvious that would cause a ReleaseMutex to fail with ApplicationException. Another reason for throwing that exception is if the handle does not represent a mutex for some reason, though that would be unlikely.

Maybe it would be useful to add some more info to the exception message, like maybe:

Not sure it would help much but may at least rule something out.

Noticed something else, though it probably wouldn't lead to this issue. This line should probably catch AbandonedMutexException and set IsLocked to true in that case, instead of catching it in the caller, otherwise IsLocked would remain false upon exception and the thread would abandon the mutex again.

am11 commented 2 years ago

to fail with ApplicationException

Do we know where is this coming from? (if not WaitSubsystem.WaitableObject.Unix.cs#714)

kouvel commented 2 years ago

Do we know where is this coming from? (if not WaitSubsystem.WaitableObject.Unix.cs#714)

It's probably from here: https://github.com/dotnet/runtime/blob/dfbae37e91c4744822018dde10cbd414c661c0b8/src/libraries/System.Private.CoreLib/src/System/Threading/Mutex.Windows.cs#L92

After getting an exception here: https://github.com/dotnet/runtime/blob/dfbae37e91c4744822018dde10cbd414c661c0b8/src/coreclr/pal/src/synchobj/mutex.cpp#L457

am11 commented 2 years ago

Hm, original report was for Linux arm64, so I was tracing Unix partials (which calls into the waitsubsystem).

kouvel commented 2 years ago

Mutex.Unix.cs is only used with the WaitSubsystem implementation, on CoreCLR the Unix implementation (confusingly) uses Mutex.Windows.cs, the p/invoke goes into the PAL, which implements Windows-like APIs such as ReleaseMutex.

mangod9 commented 1 year ago

this is an old issue, is this something which needs to be looked at ?

kouvel commented 1 year ago

Closing for now, as there haven't been recent reports and there isn't sufficient info available to diagnose the issue. Issue https://github.com/dotnet/runtime/issues/89090 covers adding more info about system call errors to named mutex exceptions.

ericstj commented 1 year ago

I see that this was just hit on my validation build: https://dev.azure.com/dnceng-public/public/_build/results?buildId=407963&view=logs&j=1131d339-d842-59c0-dcc0-73e7a31acc11&t=9c8d0f40-0c4c-53c5-f33c-9e8e40de8387

Also noticed that MSBuild is tracking this here: https://github.com/dotnet/msbuild/issues/7601

Somfic commented 9 months ago

Hey, this just hit on my ci/cd, (macos) https://github.com/Somfic/vla/actions/runs/7751958359/job/21140666679?pr=74

jaredpar commented 9 months ago

From @Somfic log

ReleaseMutex failed. WaitOne Id: 16 Release Id: 16

Confirms that the calls are indeed happening on the same thread id.

AaronRobinsonMSFT commented 7 months ago

I'm seeing this while building on my macOS M2.

kouvel commented 7 months ago

I'm seeing this while building on my macOS M2.

There should be better diagnostic info in .NET 9 including info from system call errors (https://github.com/dotnet/runtime/pull/92603). Can you share more details about the failure?

AaronRobinsonMSFT commented 7 months ago

@kouvel Not really. The error message was basically the same as this issue. Let me see if I can reproduce it.

kouvel commented 7 months ago

Ah right I forgot the extra info is not tracked for ReleaseMutex. I'll also try to repro.

mangod9 commented 2 months ago

Assume this is a long standing issue and not specific to 9. @kouvel, can we move to 10?