Open ViktorHofer opened 3 years ago
I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.
Is the exception that it failed with written in some log?
Tagging subscribers to this area: @cston See info in area-owners.md if you want to be subscribed.
Author: | ViktorHofer |
---|---|
Assignees: | - |
Labels: | `area-Microsoft.CSharp`, `bug`, `untriaged` |
Milestone: | 6.0.0 |
Doesn't appear to be, this is all we have in the binlog https://artprodcus3.artifacts.visualstudio.com/Ab55de4ed-4b5a-4215-a8e4-0a0a5f71e7d8/9ee6d478-d288-47f7-aacc-f6e6d082ae6d/_apis/artifact/cGlwZWxpbmVhcnRpZmFjdDovL2RuY2VuZy9wcm9qZWN0SWQvOWVlNmQ0NzgtZDI4OC00N2Y3LWFhY2MtZjZlNmQwODJhZTZkL2J1aWxkSWQvMTE2MTE0Ni9hcnRpZmFjdE5hbWUvQnVpbGRMb2dzX0NvcmVDTFJfTGludXhfYXJtNjRfcmVsZWFzZQ2/content?format=file&subPath=%2FRelease%2FBuild.binlog
Essentially the same two lines Viktor shared.
So far I can only find that one instance: https://runfo.azurewebsites.net/search/timelines/?q=started%3A%7E7+definition%3Aruntime+message%3A%22%5EReleaseMutex+failed.%22
That crash is coming from this line
https://sourceroslyn.io/#Microsoft.Build.Tasks.CodeAnalysis/BuildServerConnection.cs,239
When communicating between MSBuild nodes and the Compiler server we use a Mutex
as a part of the initial handshake. The code here is crashing because when we release the Mutex
via Dispose
we're getting an ApplicationException
.
Our code considers this fatal because the docs say that ApplicationException
will only be thrown when the thread calling Dispose
does not own the Mutex
. Basically coding error on our part that is significant. That is also why we log the thread IDs in the message here. You'll notice though in this case that the thread IDs are the same though which means the same thread both opened and released the Mutex
which should be safe (based on our understanding).
What other circumstances can cause an ApplicationException
to happen here?
@RikkiGibson, @chsienki, @cston
in this case that the thread IDs are the same though which means the same thread both opened and released the
Mutex
which should be safe
Could the Mutex
have been disposed on another thread?
In our case unlikely. This Mutex
instance is created and managed on a per-thread basis. It's a local that isn't shared amongst threads and the message confirms that there isn't any tricky await
code involved that causes us to inadvertently switch threads.
The one issue to keep in mind is that we have an abstraction over Mutex
here to deal with the difference implementation between the server on the Mono and .NET runtime. Hence the Dispose
call is really a wrapper over:
public void Dispose()
{
if (IsDisposed)
return;
IsDisposed = true;
try
{
if (IsLocked)
Mutex.ReleaseMutex();
}
finally
{
Mutex.Dispose();
IsLocked = false;
}
}
One possibility is we are incorrectly setting IsLocked
which would lead to this exception being thrown.
Failed again in #53590 . I think it is most likely Linux arm64 specific bug in CoreCLR PAL.
Current runfo query shows a hit on x64 too, so it is not arm64 specific.
Again, but in Mono Android Arm64 release build, using 3.1.
2021-06-25T16:42:48.0342786Z /__w/1/s/build.sh -ci -arch x64 -os Android -s mono+libs+host+packs+mono.mscordbi -c Release /p:OfficialBuildId=20210625.7 /p:CrossBuild=false
2021-06-25T16:42:48.0343494Z ========================== Starting Command Output ===========================
2021-06-25T16:42:48.0364929Z [command]/bin/bash --noprofile --norc /__w/_temp/455af3b1-60d8-49d1-88ab-abbd6629e02b.sh
2021-06-25T16:42:48.0702381Z __DistroRid: android-x64
2021-06-25T16:42:48.2260529Z /__w/1/s/.dotnet/sdk/6.0.100-preview.4.21255.9/MSBuild.dll /nologo -logger:/__w/1/s/.packages/microsoft.dotnet.arcade.sdk/6.0.0-beta.21311.3/tools/netcoreapp3.1/Microsoft.DotNet.ArcadeLogging.dll -maxcpucount /m -verbosity:m /v:minimal /bl:/__w/1/s/artifacts/log/Release/Build.binlog /clp:Summary /nr:false /p:TreatWarningsAsErrors=true /p:ContinuousIntegrationBuild=true /p:Configuration=Release /p:RepoRoot=/__w/1/s/ /p:Restore=true /p:Build=true /p:Rebuild=false /p:Test=false /p:Pack=false /p:IntegrationTest=false /p:PerformanceTest=false /p:Sign=false /p:Publish=false /p:TargetOS=Android /p:Subset=mono+libs+host+packs+mono.mscordbi /p:TargetArchitecture=x64 /p:BuildArchitecture=x64 /p:CMakeArgs= /p:OfficialBuildId=20210625.7 /p:CrossBuild=false /warnaserror /__w/1/s/.packages/microsoft.dotnet.arcade.sdk/6.0.0-beta.21311.3/tools/Build.proj
...
2021-06-25T16:45:35.4955064Z /__w/1/s/.packages/microsoft.net.compilers.toolset/3.10.0-4.21273.6/tasks/netcoreapp3.1/Microsoft.CSharp.Core.targets(71,5): error MSB3883: Unexpected exception: [/__w/1/s/src/tasks/Crossgen2Tasks/Crossgen2Tasks.csproj]
2021-06-25T16:45:35.5037748Z ##[error].packages/microsoft.net.compilers.toolset/3.10.0-4.21273.6/tasks/netcoreapp3.1/Microsoft.CSharp.Core.targets(71,5): error MSB3883: (NETCORE_ENGINEERING_TELEMETRY=Build) Unexpected exception:
2021-06-25T16:45:35.5051214Z /__w/1/s/.packages/microsoft.net.compilers.toolset/3.10.0-4.21273.6/tasks/netcoreapp3.1/Microsoft.CSharp.Core.targets(71,5): error : One or more errors occurred. (ReleaseMutex failed. WaitOne Id: 14 Release Id: 14) [/__w/1/s/src/tasks/Crossgen2Tasks/Crossgen2Tasks.csproj]
2021-06-25T16:45:35.5056497Z ##[error].packages/microsoft.net.compilers.toolset/3.10.0-4.21273.6/tasks/netcoreapp3.1/Microsoft.CSharp.Core.targets(71,5): error : (NETCORE_ENGINEERING_TELEMETRY=Build) One or more errors occurred. (ReleaseMutex failed. WaitOne Id: 14 Release Id: 14)
2021-06-25T16:45:35.5059398Z /__w/1/s/.packages/microsoft.net.compilers.toolset/3.10.0-4.21273.6/tasks/netcoreapp3.1/Microsoft.CSharp.Core.targets(71,5): error MSB3883: Unexpected exception: [/__w/1/s/src/tasks/AotCompilerTask/MonoAOTCompiler.csproj]
2021-06-25T16:45:35.5063602Z ##[error].packages/microsoft.net.compilers.toolset/3.10.0-4.21273.6/tasks/netcoreapp3.1/Microsoft.CSharp.Core.targets(71,5): error MSB3883: (NETCORE_ENGINEERING_TELEMETRY=Build) Unexpected exception:
2021-06-25T16:45:35.5089855Z /__w/1/s/.packages/microsoft.net.compilers.toolset/3.10.0-4.21273.6/tasks/netcoreapp3.1/Microsoft.CSharp.Core.targets(71,5): error : One or more errors occurred. (ReleaseMutex failed. WaitOne Id: 14 Release Id: 14) [/__w/1/s/src/tasks/AotCompilerTask/MonoAOTCompiler.csproj]
2021-06-25T16:45:35.5094289Z ##[error].packages/microsoft.net.compilers.toolset/3.10.0-4.21273.6/tasks/netcoreapp3.1/Microsoft.CSharp.Core.targets(71,5): error : (NETCORE_ENGINEERING_TELEMETRY=Build) One or more errors occurred. (ReleaseMutex failed. WaitOne Id: 14 Release Id: 14)
@jaredpar I would like to look into this issue - is there a reasonable way to repro it locally?
I haven't been able to reproduce this locally. It only shows up sporadically in customer / CI reports.
I am moving this to 7 since its not actionable at the moment without a repro. We can move it back if there is one available
+1. This happens quite often when building multiple containers with docker compose (each using mcr.microsoft.com/dotnet/sdk:5.0.403-alpine3.14-amd64). Buildkit enabled.
Hey @Madajevas, do you have any repro steps or a docker file you could share?
Unfortunately what I have cannot be shared. Will look which containers fails most often and maybe it will be possible to create some minimal repo without corporate code
ok thanks, also would be curious to know if it repros with 6.0 as well?
Not using 6.0 right now. Not planning to switch until Q1. Just a wild idea, haven't analyzed code and I'm probably wrong. Could this mutex be shared then docker builds multiple images in parallel and due to some race condition one build disposes mutex and then another one tries to do the same and fails? Not sure if we experienced this on windows but CentOS 7 build server crashes at leas once a day if build happens multiple times. Nothing of the sort if single image is being built. Since this happens in docker build no message of inner (actual) exception is visible.
Could this mutex be shared then docker builds multiple images in parallel and due to some race condition one build disposes mutex and then another one tries to do the same and fails?
From the perspective of the Mutex
API that should not be a problem. The only issue there is that we make sure that we do the create / dispose from the same thread. It's possible that this is rubbing up against an implementation detail of Mutex
though.
This happens to us every once in a while in our CI. .NET SDK 6.0.100
@kouvel may have thoughts
Looks like this condition is failing: https://github.com/dotnet/runtime/blob/57bfe474518ab5b7cfe6bf7424a79ce3af9d6657/src/libraries/System.Private.CoreLib/src/System/Threading/WaitSubsystem.WaitableObject.Unix.cs#L713 If current thread was the owning thread, then the only other explanation is waitableobject was already signaled.
@am11 I don't think that would be it since that implementation does not support named mutexes.
I didn't see anything obvious that would cause a ReleaseMutex
to fail with ApplicationException
. Another reason for throwing that exception is if the handle does not represent a mutex for some reason, though that would be unlikely.
Maybe it would be useful to add some more info to the exception message, like maybe:
TryLock
Not sure it would help much but may at least rule something out.
Noticed something else, though it probably wouldn't lead to this issue. This line should probably catch AbandonedMutexException
and set IsLocked
to true in that case, instead of catching it in the caller, otherwise IsLocked
would remain false upon exception and the thread would abandon the mutex again.
to fail with
ApplicationException
Do we know where is this coming from? (if not WaitSubsystem.WaitableObject.Unix.cs#714)
Do we know where is this coming from? (if not WaitSubsystem.WaitableObject.Unix.cs#714)
It's probably from here: https://github.com/dotnet/runtime/blob/dfbae37e91c4744822018dde10cbd414c661c0b8/src/libraries/System.Private.CoreLib/src/System/Threading/Mutex.Windows.cs#L92
After getting an exception here: https://github.com/dotnet/runtime/blob/dfbae37e91c4744822018dde10cbd414c661c0b8/src/coreclr/pal/src/synchobj/mutex.cpp#L457
Hm, original report was for Linux arm64, so I was tracing Unix partials (which calls into the waitsubsystem).
Mutex.Unix.cs
is only used with the WaitSubsystem
implementation, on CoreCLR the Unix implementation (confusingly) uses Mutex.Windows.cs
, the p/invoke goes into the PAL, which implements Windows-like APIs such as ReleaseMutex
.
this is an old issue, is this something which needs to be looked at ?
Closing for now, as there haven't been recent reports and there isn't sufficient info available to diagnose the issue. Issue https://github.com/dotnet/runtime/issues/89090 covers adding more info about system call errors to named mutex exceptions.
I see that this was just hit on my validation build: https://dev.azure.com/dnceng-public/public/_build/results?buildId=407963&view=logs&j=1131d339-d842-59c0-dcc0-73e7a31acc11&t=9c8d0f40-0c4c-53c5-f33c-9e8e40de8387
Also noticed that MSBuild is tracking this here: https://github.com/dotnet/msbuild/issues/7601
Hey, this just hit on my ci/cd, (macos) https://github.com/Somfic/vla/actions/runs/7751958359/job/21140666679?pr=74
From @Somfic log
ReleaseMutex failed. WaitOne Id: 16 Release Id: 16
Confirms that the calls are indeed happening on the same thread id.
I'm seeing this while building on my macOS M2.
I'm seeing this while building on my macOS M2.
There should be better diagnostic info in .NET 9 including info from system call errors (https://github.com/dotnet/runtime/pull/92603). Can you share more details about the failure?
@kouvel Not really. The error message was basically the same as this issue. Let me see if I can reproduce it.
Ah right I forgot the extra info is not tracked for ReleaseMutex. I'll also try to repro.
Assume this is a long standing issue and not specific to 9. @kouvel, can we move to 10?
Error Message
Fill the error message using step by step known issues guidance.
Example error:
Build: https://dev.azure.com/dnceng/public/_build/results?buildId=1161146&view=logs&j=1a1e7658-955a-5bfe-8410-74fa4a1a1fbf&t=9528cbd6-a8dd-5006-9c48-38938c12a3af
Configuration:
CoreCLR Product Build Linux arm64 release
cc @jkotas @jaredpar @stephentoub
Known issue validation
Build: :mag_right: Result validation: :warning: Provided build not found. Provide a valid build in the "Build: :mag_right:" line. Validation performed at: 9/15/2023 5:21:46 PM UTC
Report
Summary