Open pchalamet opened 2 months ago
This is happening in the .NET SDK itself before MSBuild is ever invoked, so I will move it to the SDK repo.
cc @zivkan / @nkolev92 - the NuGet migrations that were added a while back are erroring inconsistently for this user. What are the runtime requirements of the migrations in terms of file permissions, etc?
@jeffkl is our hotseat, but @kartheekp-ms was involved in the original migration code.
The code is trying to create a global named Mutex: https://github.com/NuGet/NuGet.Client/blob/308d5595190925cf406135f565b69db9ba4860a1/src/NuGet.Core/NuGet.Common/Migrations/MigrationRunner.cs#L35
Perhaps related to https://github.com/dotnet/runtime/issues/9987 ?
There are several issues in the runtime repo related to this error.
I've tried chmod'ing the tmp folder as advised (777, and also 1777 with the sticky bit), but this does not change anything.
It feels like CreateMutex misbehaves in .NET when used across several Docker instances. When I start all those Docker instances, I mount /tmp to a global host directory (chmod'ed to 1777), as well as the home folder (~). The goal is to amortize the initialization cost and allow all instances to hit the global NuGet cache.
But if synchronization is broken, at least for init, I guess it's also broken for concurrent cache access. I've tried to think about it and the way CreateMutex works. I've ended up passing --pid=host and --ipc=host to Docker, which definitely makes sense when considering such a primitive.
This led to a drastic reduction in errors on x64. On Arm I can still observe the error System.IO.IOException: The system cannot open the device or file specified. : 'NuGet-Migrations'
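One way to check whether --pid=host and --ipc=host actually took effect is to compare the namespace identifiers that the kernel reports inside each container. The following is a minimal sketch, assuming a Linux container with /proc mounted; the function name `show_namespaces` is made up for illustration and is not part of the SDK or NuGet. Containers that share a namespace should print the same identifier for it.

```shell
#!/bin/sh
# Hypothetical diagnostic helper (not part of the .NET SDK or NuGet).
# Prints the IPC and PID namespace identifiers of the current process,
# one per line, or "<name>: unavailable" when /proc does not expose them.
show_namespaces() {
    for ns in ipc pid; do
        # On Linux the link target looks like "ipc:[4026531839]".
        target=$(readlink "/proc/self/ns/$ns" 2>/dev/null) || target=""
        if [ -n "$target" ]; then
            printf '%s\n' "$target"
        else
            printf '%s: unavailable\n' "$ns"
        fi
    done
}
```

Running `show_namespaces` in two containers started with the same flags and comparing the output shows whether they really share the IPC and PID namespaces. On macOS, Docker runs containers inside a Linux VM, so "host" here refers to that VM rather than to the Mac itself.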
Do you have internal guidance at Microsoft on how to allow shared memory for multiple .NET runtime Docker instances? That looks like the crux of the problem.
At least I see:
- --pid=host
- --ipc=host
- mounting /tmp and ~ to dedicated shared volumes + chmod 1777 on shared volumes

Is there something else to make it work reliably?
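The setup listed above can be sketched as a small shell helper. This is only an illustration under assumptions: the function names `prepare_shared_dir` and `docker_build_cmd`, the `SHARED_ROOT` default, and the `/work` mount point are hypothetical, while the Docker flags and the image tag are the ones quoted in this thread. The command is printed rather than executed so it can be reviewed first.

```shell
#!/bin/sh
# Hypothetical helper names; only the docker flags and image tag come from this thread.

# Root of the host directories shared by all containers (assumed location).
SHARED_ROOT="${SHARED_ROOT:-$HOME/.terrabuild/home}"

# Create a shared directory and make it world-writable with the sticky bit,
# like /tmp on a regular Linux host.
prepare_shared_dir() {
    mkdir -p "$1" && chmod 1777 "$1"
}

# Print (do not run) the docker command for one project directory.
# Note: arguments are joined with spaces, so paths containing spaces
# would need quoting before the output is pasted into a shell.
docker_build_cmd() {
    project_dir="$1"
    printf '%s ' \
        docker run --rm \
        --pid=host --ipc=host \
        -v "$SHARED_ROOT/containers:/root" \
        -v "$SHARED_ROOT/tmp:/tmp" \
        -v "$project_dir:/work" -w /work \
        --entrypoint dotnet mcr.microsoft.com/dotnet/sdk:8.0.302 \
        build --no-dependencies --configuration Debug
    printf '\n'
}
```

A typical use would be to call `prepare_shared_dir "$SHARED_ROOT/tmp"` and `prepare_shared_dir "$SHARED_ROOT/containers"` once, then `docker_build_cmd /path/to/project` per build. Whether this is sufficient for the named mutex to work reliably is exactly the open question here.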
Docker parameters are now:
docker run --rm --net=host --name DAB5E60C96ACE37A01B06B64DFD9CD55E4ED14F2C614AA512BA51291FD95266E --pid=host --ipc=host -v /var/run/docker.sock:/var/run/docker.sock -v /Users/pct/.terrabuild/home/containers:/root -v /Users/pct/.terrabuild/home/tmp:/tmp -v /Users/pct/src/MagnusOpera/terrabuild/terrabuild/src:/terrabuild -w /terrabuild/Terrabuild.PubSub --entrypoint dotnet mcr.microsoft.com/dotnet/sdk:8.0.302 build --no-dependencies --configuration Debug
Issue Description
Random crashes in MSBuild when trying to parallelize .NET builds in Docker:
Steps to Reproduce
Expected Behavior
Crashes are random. I expect this to always work.
Actual Behavior
Exception thrown. See above.
Analysis
Call stacks are provided for analysis.
Versions & Configurations
I have this in my .bashrc (as they are passed to Docker):
Also running on Intel Mac and Arm Mac (both on Sequoia 15, but it was crashing with previous versions). It crashes the same way on both machines, at the same rate.
The .NET SDK version (8.0.302) is specified on the Docker command line.
For sources (to reproduce the build), use this: https://github.com/MagnusOpera/Terrabuild/tree/44ce393db4e8ad891cf072389c7a2023096bc44f