dotnet / sdk

Core functionality needed to create .NET Core projects, that is shared between Visual Studio and CLI
https://dot.net/core
MIT License

[Bug]: MSBuild crashes on parallel build using docker #43750

Open pchalamet opened 2 months ago

pchalamet commented 2 months ago

Issue Description

Random crashes in MSBuild when trying to parallelize .NET builds in Docker:

Steps to Reproduce

docker run --rm --net=host --name 8C91EC60706A76686DEE83F23CE80DD78D48E014DCCB1F7389F9F5EF9D9BFF09 -v /var/run/docker.sock:/var/run/docker.sock -v /Users/pierre/.terrabuild/home/containers:/root -v /Users/pierre/.terrabuild/home/tmp:/tmp -v /Users/pierre/src/MagnusOpera/terrabuild/terrabuild/src:/terrabuild -w /terrabuild/Terrabuild.Common --entrypoint dotnet -e DOTNET_CLI_TELEMETRY_OPTOUT -e DOTNET_NOLOGO -e DOTNET_SKIP_FIRST_TIME_EXPERIENCE mcr.microsoft.com/dotnet/sdk:8.0.302 build --no-dependencies --configuration Debug

9/27/2024 11:51:49 AM ERR System.IO.IOException: The system cannot open the device or file specified. : 'NuGet-Migrations'
9/27/2024 11:51:49 AM ERR    at System.Threading.Mutex.CreateMutexCore(Boolean initiallyOwned, String name, Boolean& createdNew)
9/27/2024 11:51:49 AM ERR    at System.Threading.Mutex..ctor(Boolean initiallyOwned, String name)
9/27/2024 11:51:49 AM ERR    at NuGet.Common.Migrations.MigrationRunner.Run(String migrationsDirectory)
9/27/2024 11:51:49 AM ERR    at Microsoft.DotNet.Configurer.DotnetFirstTimeUseConfigurer.Configure()
9/27/2024 11:51:49 AM ERR    at Microsoft.DotNet.Cli.Program.ConfigureDotNetForFirstTimeUse(IFirstTimeUseNoticeSentinel firstTimeUseNoticeSentinel, IAspNetCertificateSentinel aspNetCertificateSentinel, IFileSentinel toolPathSentinel, Boolean isDotnetBeingInvokedFromNativeInstaller, DotnetFirstRunConfiguration dotnetFirstRunConfiguration, IEnvironmentProvider environmentProvider, Dictionary`2 performanceMeasurements)
9/27/2024 11:51:49 AM ERR    at Microsoft.DotNet.Cli.Program.ProcessArgs(String[] args, TimeSpan startupTime, ITelemetry telemetryClient)
9/27/2024 11:51:49 AM ERR    at Microsoft.DotNet.Cli.Program.Main(String[] args)
9/27/2024 11:51:49 AM ERR
9/27/2024 11:51:49 AM OUT
docker run --rm --net=host --name DAB5E60C96ACE37A01B06B64DFD9CD55E4ED14F2C614AA512BA51291FD95266E -v /var/run/docker.sock:/var/run/docker.sock -v /Users/pierre/.terrabuild/home/containers:/root -v /Users/pierre/.terrabuild/home/tmp:/tmp -v /Users/pierre/src/MagnusOpera/terrabuild/terrabuild/src:/terrabuild -w /terrabuild/Terrabuild.PubSub --entrypoint dotnet -e DOTNET_CLI_TELEMETRY_OPTOUT -e DOTNET_NOLOGO -e DOTNET_SKIP_FIRST_TIME_EXPERIENCE mcr.microsoft.com/dotnet/sdk:8.0.302 build --no-dependencies --configuration Debug

9/27/2024 11:51:49 AM ERR System.ApplicationException: Object synchronization method was called from an unsynchronized block of code.
9/27/2024 11:51:49 AM ERR    at System.Threading.Mutex.ReleaseMutex()
9/27/2024 11:51:49 AM ERR    at NuGet.Common.Migrations.MigrationRunner.Run(String migrationsDirectory)
9/27/2024 11:51:49 AM ERR    at Microsoft.DotNet.Configurer.DotnetFirstTimeUseConfigurer.Configure()
9/27/2024 11:51:49 AM ERR    at Microsoft.DotNet.Cli.Program.ConfigureDotNetForFirstTimeUse(IFirstTimeUseNoticeSentinel firstTimeUseNoticeSentinel, IAspNetCertificateSentinel aspNetCertificateSentinel, IFileSentinel toolPathSentinel, Boolean isDotnetBeingInvokedFromNativeInstaller, DotnetFirstRunConfiguration dotnetFirstRunConfiguration, IEnvironmentProvider environmentProvider, Dictionary`2 performanceMeasurements)
9/27/2024 11:51:49 AM ERR    at Microsoft.DotNet.Cli.Program.ProcessArgs(String[] args, TimeSpan startupTime, ITelemetry telemetryClient)
9/27/2024 11:51:49 AM ERR    at Microsoft.DotNet.Cli.Program.Main(String[] args)
9/27/2024 11:51:49 AM ERR
9/27/2024 11:51:49 AM OUT

Expected Behavior

Crashes are random. I expect this to always work.

Actual Behavior

Exception thrown. See above.

Analysis

Call stacks are provided for analysis.

Versions & Configurations

I have these in my .bashrc (they are passed through to Docker):

export DOTNET_SKIP_FIRST_TIME_EXPERIENCE=true
export DOTNET_NOLOGO=true
export DOTNET_CLI_TELEMETRY_OPTOUT=true
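A note on how these reach the container: when `docker run` is given `-e NAME` without a value, Docker copies the value from the client's environment, and if the variable is not exported there it is simply not set in the container. The sketch below is one way to make that explicit. The helper name `docker_env_args` is hypothetical, and it assumes the values contain no whitespace (true for the three settings above).

```shell
# Build explicit "-e NAME=value" arguments for docker run.
# Fails loudly if a variable is not exported on the host,
# instead of letting Docker silently drop it.
docker_env_args() {
  args=""
  for name in "$@"; do
    # printenv exits non-zero when the variable is not in the environment
    if ! value=$(printenv "$name"); then
      echo "error: $name is not exported on the host" >&2
      return 1
    fi
    args="$args -e $name=$value"
  done
  printf '%s\n' "${args# }"
}

# Possible usage (values must not contain spaces, since $env_args is word-split):
#   env_args=$(docker_env_args DOTNET_CLI_TELEMETRY_OPTOUT DOTNET_NOLOGO DOTNET_SKIP_FIRST_TIME_EXPERIENCE) || exit 1
#   docker run --rm $env_args --entrypoint dotnet mcr.microsoft.com/dotnet/sdk:8.0.302 build
```

This only rules out a missing variable as a factor. It does not by itself change how the SDK behaves on first run.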

This happens on both an Intel Mac and an Arm Mac (both on macOS Sequoia 15, but it was already crashing on previous versions). Both machines crash in the same way and at the same rate.

The .NET SDK version (8.0.302) is specified on the docker command line.

For sources (to reproduce the build), use this: https://github.com/MagnusOpera/Terrabuild/tree/44ce393db4e8ad891cf072389c7a2023096bc44f

baronfel commented 2 months ago

This is happening in the .NET SDK itself before MSBuild is ever invoked, so I will move it to the SDK repo.

baronfel commented 2 months ago

cc @zivkan / @nkolev92 - the NuGet migrations that were added a while back are erroring inconsistently for this user. What are the runtime requirements of the migrations in terms of file permissions, etc?

nkolev92 commented 2 months ago

@jeffkl is our hotseat, but @kartheekp-ms was involved in the original migration code.

jeffkl commented 2 months ago

The code is trying to create a global named Mutex: https://github.com/NuGet/NuGet.Client/blob/308d5595190925cf406135f565b69db9ba4860a1/src/NuGet.Core/NuGet.Common/Migrations/MigrationRunner.cs#L35

Perhaps related to https://github.com/dotnet/runtime/issues/9987 ?

kartheekp-ms commented 2 months ago

There are several issues in the runtime repo related to this error.

pchalamet commented 2 months ago

I've tried chmod'ing the tmp folder as advised (777, and also 1777 with the sticky bit), but this does not change anything.

It feels like a CreateMutex misbehavior in .NET when used across several Docker instances. When I start all those Docker instances, I mount /tmp to a global host directory (which is chmod'ed 1777), as well as the home folder (~). The goal is to amortize the initialization cost and allow all instances to hit the global NuGet cache.

But if synchronization is broken, at least for init, I guess it's broken for concurrent cache access as well. I've tried to think about it and about the way CreateMutex works. I ended up passing --pid=host and --ipc=host to docker, which definitely makes sense when considering such a primitive.

This led to a drastic amount of error on x64. On Arm I can still observe the error System.IO.IOException: The system cannot open the device or file specified. : 'NuGet-Migrations'

Do you have internal guidance at Microsoft on how to allow shared memory across multiple .NET runtime Docker instances? That looks like the crux of the problem.

At least I see:

Is there something else needed to make it work reliably?


Docker parameters are now:

docker run --rm --net=host --name DAB5E60C96ACE37A01B06B64DFD9CD55E4ED14F2C614AA512BA51291FD95266E --pid=host --ipc=host -v /var/run/docker.sock:/var/run/docker.sock -v /Users/pct/.terrabuild/home/containers:/root -v /Users/pct/.terrabuild/home/tmp:/tmp -v /Users/pct/src/MagnusOpera/terrabuild/terrabuild/src:/terrabuild -w /terrabuild/Terrabuild.PubSub --entrypoint dotnet mcr.microsoft.com/dotnet/sdk:8.0.302 build --no-dependencies --configuration Debug
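A possible workaround, untested against this repro, is to avoid relying on the in-container named mutex for the first run and serialize a warm-up step on the host instead. The sketch below uses an atomic `mkdir` as the lock. The helper name `run_serialized` and the lock path are hypothetical, and it assumes the failing code path is only hit while the shared home directory is still uninitialized.

```shell
# Run a command while holding a host-side lock implemented with mkdir.
# mkdir either creates the directory or fails, so only one caller at a
# time gets past the loop. Stale locks (e.g. after a kill) are not handled.
run_serialized() {
  lock_dir=$1
  shift
  until mkdir "$lock_dir" 2>/dev/null; do
    sleep 1
  done
  status=0
  "$@" || status=$?
  # Always release the lock, even when the command failed
  rmdir "$lock_dir"
  return "$status"
}

# Possible usage: build one project alone first so that first-run
# initialization happens once, then start the parallel builds as before.
#   run_serialized "$HOME/.terrabuild/home/first-run.lock" \
#     docker run --rm -v "$HOME/.terrabuild/home/containers:/root" \
#       -v "$HOME/.terrabuild/home/tmp:/tmp" \
#       -v "$HOME/src/MagnusOpera/terrabuild/terrabuild/src:/terrabuild" \
#       -w /terrabuild/Terrabuild.Common --entrypoint dotnet \
#       mcr.microsoft.com/dotnet/sdk:8.0.302 build --no-dependencies --configuration Debug
```

Commands wrapped this way run one at a time, so this only makes sense for a short warm-up step, not for every build.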