dotnet / msbuild

The Microsoft Build Engine (MSBuild) is the build platform for .NET and Visual Studio.
https://docs.microsoft.com/visualstudio/msbuild/msbuild
MIT License
5.22k stars 1.35k forks source link

[Bug]: File copy is no longer retried, causing builds to randomly fail #9169

Closed Nezz closed 7 months ago

Nezz commented 1 year ago

Issue Description

We started receiving random build failures caused by failing file copy operations:

C:\Program Files\dotnet\sdk\7.0.400\Microsoft.Common.CurrentVersion.targets(4879,5): error MSB3021: Unable to copy file "C:\Program Files\Unity\Hub\Editor\2021.3.26f1\Editor\Data\Managed\UnityEngine\UnityEngine.UnityCurlModule.dll" to "Temp\Bin\Debug\Unity.Rider.Editor\UnityEngine.UnityCurlModule.dll". Access to the path 'C:\jenkins\workspace\Temp\Bin\Debug\Unity.Rider.Editor\UnityEngine.UnityCurlModule.dll' is denied. [C:\jenkins\workspace\\.csproj]

After investigating we found that this is caused by a regression between .NET SDK 7.0.306 (MSBuild version 17.6.8+c70978d4d for .NET) and 7.0.400 (MSBuild version 17.7.1+971bf70db for .NET). The older version worked around this by doing a retry:

C:\Program Files\dotnet\sdk\7.0.306\Microsoft.Common.CurrentVersion.targets(4862,5): warning MSB3026: Could not copy "C:\Program Files\Unity\Hub\Editor\2021.3.26f1\Editor\Data\Managed\UnityEngine\UnityEditor.SceneTemplateModule.xml" to "Temp\Bin\Debug\Unity.Rider.Editor\UnityEditor.SceneTemplateModule.xml". Beginning retry 1 in 1000ms. The process cannot access the file 'C:\jenkins\workspace\Temp\Bin\Debug\Unity.Rider.Editor\UnityEditor.SceneTemplateModule.xml' because it is being used by another process. [C:\jenkins\workspace\\.csproj]

The new version fails right away if there is already a retry queued:

[2023-08-24T08:22:18.186Z] C:\Program Files\dotnet\sdk\7.0.400\Microsoft.Common.CurrentVersion.targets(4879,5): warning MSB3026: Could not copy "C:\Program Files\Unity\Hub\Editor\2021.3.26f1\Editor\Data\Managed\UnityEngine\UnityEditor.CoreModule.xml" to "Temp\Bin\Debug\Unity.Rider.Editor\UnityEditor.CoreModule.xml". Beginning retry 1 in 1000ms. The process cannot access the file 'C:\jenkins\workspace\Temp\Bin\Debug\Unity.Rider.Editor\UnityEditor.CoreModule.xml' because it is being used by another process.  [C:\jenkins\workspace\redacted2.csproj]
[2023-08-24T08:22:18.186Z] C:\Program Files\dotnet\sdk\7.0.400\Microsoft.Common.CurrentVersion.targets(4879,5): error MSB3021: Unable to copy file "C:\Program Files\Unity\Hub\Editor\2021.3.26f1\Editor\Data\Managed\UnityEngine\UnityEditor.CoreModule.xml" to "Temp\Bin\Debug\Unity.Rider.Editor\UnityEditor.CoreModule.xml". Access to the path 'C:\jenkins\workspace\Temp\Bin\Debug\Unity.Rider.Editor\UnityEditor.CoreModule.xml' is denied. [C:\jenkins\workspace\redacted1.csproj]

Note that in the logs above the two lines are logged for different projects.

Steps to Reproduce

Build a large project that references many of the same files. Exact details or minimal repro project is not provided because root cause has been identified above.

Expected Behavior

File copy operations should be retried

Actual Behavior

File copy operations are not retried and fail

Analysis

Relevant logs:

[2023-08-24T08:22:07.879Z] MSBuild version 17.7.1+971bf70db for .NET
[2023-08-24T08:22:08.081Z] Build started 8/24/2023 8:22:08 AM.
[2023-08-24T08:22:08.384Z]      1>Project "C:\jenkins\workspace\redacted.sln" on node 1 (Clean target(s)).
[2023-08-24T08:22:08.384Z]      1>ValidateSolutionConfiguration:
[2023-08-24T08:22:08.384Z]          Building solution configuration "Debug|Any CPU".
[2023-08-24T08:22:08.585Z]      1>Project "C:\jenkins\workspace\redacted.sln" (1) is building "C:\jenkins\workspace\redacted.csproj" (2) on node 1 (Clean target(s)).
[2023-08-24T08:22:08.586Z]      2>CoreClean:
[2023-08-24T08:22:08.586Z]          Creating directory "obj\Debug\".
[2023-08-24T08:22:08.687Z]      2>Done Building Project "C:\jenkins\workspace\redacted.csproj" (Clean target(s)).
[2023-08-24T08:22:09.846Z]      1>Done Building Project "C:\jenkins\workspace\redacted.sln" (Clean target(s)).
[2023-08-24T08:22:09.846Z] 
[2023-08-24T08:22:09.846Z] Build succeeded.
[2023-08-24T08:22:09.846Z]     0 Warning(s)
[2023-08-24T08:22:09.846Z]     0 Error(s)
[2023-08-24T08:22:09.846Z] 
[2023-08-24T08:22:09.846Z] Time Elapsed 00:00:01.80
[2023-08-24T08:22:10.249Z] SonarScanner for MSBuild 5.13
[2023-08-24T08:22:10.249Z] Using the .NET Core version of the Scanner for MSBuild
...
[2023-08-24T08:22:18.185Z] C:\Program Files\dotnet\sdk\7.0.400\Microsoft.Common.CurrentVersion.targets(4879,5): warning MSB3026: Could not copy "C:\Program Files\Unity\Hub\Editor\2021.3.26f1\Editor\Data\Managed\UnityEngine\UnityEngine.UnityWebRequestModule.dll" to "Temp\Bin\Debug\Unity.Rider.Editor\UnityEngine.UnityWebRequestModule.dll". Beginning retry 1 in 1000ms. The process cannot access the file 'C:\jenkins\workspace\Temp\Bin\Debug\Unity.Rider.Editor\UnityEngine.UnityWebRequestModule.dll' because it is being used by another process.  [C:\jenkins\workspace\redacted2.csproj]
[2023-08-24T08:22:18.185Z] C:\Program Files\dotnet\sdk\7.0.400\Microsoft.Common.CurrentVersion.targets(4879,5): warning MSB3026: Could not copy "C:\Program Files\Unity\Hub\Editor\2021.3.26f1\Editor\Data\Managed\UnityEngine\UnityEngine.PhysicsModule.xml" to "Temp\Bin\Debug\Unity.Rider.Editor\UnityEngine.PhysicsModule.xml". Beginning retry 1 in 1000ms. The process cannot access the file 'C:\jenkins\workspace\Temp\Bin\Debug\Unity.Rider.Editor\UnityEngine.PhysicsModule.xml' because it is being used by another process.  [C:\jenkins\workspace\redacted1.csproj]
[2023-08-24T08:22:18.186Z] CSC : warning CS8021: No value for RuntimeMetadataVersion found. No assembly containing System.Object was found nor was a value for RuntimeMetadataVersion specified through options. [C:\jenkins\workspace\Yousician.Analytics.Yap.Tests.csproj]
[2023-08-24T08:22:18.186Z]   Sonar: (UniRx.csproj) Project processed successfully
[2023-08-24T08:22:18.186Z] C:\Program Files\dotnet\sdk\7.0.400\Microsoft.Common.CurrentVersion.targets(4879,5): warning MSB3026: Could not copy "C:\Program Files\Unity\Hub\Editor\2021.3.26f1\Editor\Data\Managed\UnityEngine\UnityEditor.CoreModule.xml" to "Temp\Bin\Debug\Unity.Rider.Editor\UnityEditor.CoreModule.xml". Beginning retry 1 in 1000ms. The process cannot access the file 'C:\jenkins\workspace\Temp\Bin\Debug\Unity.Rider.Editor\UnityEditor.CoreModule.xml' because it is being used by another process.  [C:\jenkins\workspace\redacted2.csproj]
[2023-08-24T08:22:18.186Z] CSC : warning CS8021: No value for RuntimeMetadataVersion found. No assembly containing System.Object was found nor was a value for RuntimeMetadataVersion specified through options. [C:\jenkins\workspace\Yousician.Networking.Tests.csproj]
[2023-08-24T08:22:18.186Z] CSC : warning CS2008: No source files specified. [C:\jenkins\workspace\redacted.csproj]
[2023-08-24T08:22:18.186Z] C:\Program Files\dotnet\sdk\7.0.400\Microsoft.Common.CurrentVersion.targets(4879,5): error MSB3021: Unable to copy file "C:\Program Files\Unity\Hub\Editor\2021.3.26f1\Editor\Data\Managed\UnityEngine\UnityEditor.CoreModule.xml" to "Temp\Bin\Debug\Unity.Rider.Editor\UnityEditor.CoreModule.xml". Access to the path 'C:\jenkins\workspace\Temp\Bin\Debug\Unity.Rider.Editor\UnityEditor.CoreModule.xml' is denied. [C:\jenkins\workspace\redacted1.csproj]
...
[2023-08-24T08:22:54.586Z] C:\Program Files\dotnet\sdk\7.0.400\Microsoft.Common.CurrentVersion.targets(4879,5): error MSB3021: Unable to copy file "C:\Program Files\Unity\Hub\Editor\2021.3.26f1\Editor\Data\Managed\UnityEngine\UnityEditor.CoreModule.xml" to "Temp\Bin\Debug\Unity.Rider.Editor\UnityEditor.CoreModule.xml". Access to the path 'C:\jenkins\workspace\Temp\Bin\Debug\Unity.Rider.Editor\UnityEditor.CoreModule.xml' is denied. [C:\jenkins\workspace\redacted1.csproj]
[2023-08-24T08:22:54.586Z]     12614 Warning(s)
[2023-08-24T08:22:54.586Z]     1 Error(s)
[2023-08-24T08:22:54.586Z] 
[2023-08-24T08:22:54.586Z] Time Elapsed 00:00:42.18

Versions & Configurations

MSBuild version 17.7.1+971bf70db for .NET

AR-May commented 1 year ago

Team triage: @GangWang01 could you try to reproduce the bug?

KalleOlaviNiemitalo commented 1 year ago

There is a similar report in https://github.com/dotnet/msbuild/discussions/9190.

KalleOlaviNiemitalo commented 1 year ago

The log looks like each file got ERROR_SHARING_VIOLATION initially but ERROR_ACCESS_DENIED on the retry. src/Tasks/Copy.cs does not normally retry after ERROR_ACCESS_DENIED but this can be changed via the MSBUILDALWAYSRETRY environment variable. I don't see any obvious change in this logic between 17.6.8+c70978d4d and 17.7.1+971bf70db, though.

RobSwDev commented 1 year ago

Our builds started failing for the same reason on 9th August - the day after the 7.0.4 sdk was released. Which seemed too neat to be a coincidence.

However, looks like our build agents didn't upgrade to 7.0.4 until a couple of days later - we'd already had loads of failures by then. So it's not that simple - at least for us.

Could this be due to some other Microsoft environmental change around the same time?

Nezz commented 1 year ago

It could also be a windows update. The base virtual machine image we use with 7.0.400 was newer than the one we had with 7.0.306.

KalleOlaviNiemitalo commented 1 year ago

Between the virtual machine images, is there any difference in the file system minifilters listed by fltmc? (I expect not, but it's quick to check.)

Nezz commented 1 year ago

Problematic VM:

09:45:00  Filter Name                     Num Instances    Altitude    Frame
09:45:00  ------------------------------  -------------  ------------  -----
09:45:00  storqosflt                              0       244000         0
09:45:00  wcifs                                   0       189900         0
09:45:00  CldFlt                                  0       180451         0
09:45:00  FileCrypt                               0       141100         0
09:45:00  luafv                                   1       135000         0
09:45:00  npsvctrig                               1        46000         0
09:45:00  Wof                                     1        40700         0

Good VM:

10:02:44  Filter Name                     Num Instances    Altitude    Frame
10:02:44  ------------------------------  -------------  ------------  -----
10:02:44  storqosflt                              0       244000         0
10:02:44  wcifs                                   0       189900         0
10:02:44  CldFlt                                  0       180451         0
10:02:44  FileCrypt                               0       141100         0
10:02:44  luafv                                   1       135000         0
10:02:44  npsvctrig                               1        46000         0
10:02:44  Wof                                     1        40700         0

Looks like no difference?

Nezz commented 1 year ago

Not sure if it's relevant, but we call dotnet build with the -nodereuse:false parameter because on Windows we had issues with lingering dotnet processes that can leave files locked. We don't need that parameter on macOS or Linux.

GangWang01 commented 1 year ago

I couldn't repro the issue. But I did some tries to understand what happened. As KalleOlaviNiemitalo mentioned

The log looks like each file got ERROR_SHARING_VIOLATION initially but ERROR_ACCESS_DENIED on the retry. src/Tasks/Copy.cs does not normally retry after ERROR_ACCESS_DENIED but this can be changed via the MSBUILDALWAYSRETRY environment variable.

it was ERROR_ACCESS_DENIED on the retry broke the retry referring to https://github.com/dotnet/msbuild/blob/3c910ba83fc9dbd8e12f50dddc8c381404f928c4/src/Tasks/Copy.cs#L827-L843.

With this simple App.zip, I tried

I guess in this issue overwriting the same target file by multiple copy tasks got the file's ACL being initialized and next copy task happened to overwrite the file, then access denied occurred and broke the retry. Not sure about this.

Nezz commented 1 year ago

I'm not sure what causes it, but export MSBUILDALWAYSRETRY=1 resolved the issue completely.

GangWang01 commented 1 year ago

@Nezz Glad MSBUILDALWAYSRETRY=1resolved the issue.

We still need your help to understand what causes the file locked or access denied and what needs to improve from msbuild if any. Can you use Process Monitor to filter out which process gets the target file of copy task locked or its ACL changed while reproducing this issue? If possible, it’s better to provide the process monitor log as well as build binary log (see how to Providing MSBuild Binary Logs for investigation). Thank you!

Tips about Process Monitor: procmon

YuliiaKovalova commented 11 months ago

Hi @Nezz, Could you provide the requested information? Thank you!

Nezz commented 11 months ago

Sadly these issues happen on AWS EC2 instances that are terminate as soon as the job finishes running and there is no remote desktop access. Is there a way to obtain this information via the command line?

rokonec commented 8 months ago

I am not sure if there is a follow up with this issue. We have workaround, and without repro or process monitor log we don't know how to detect this transient file access issue. We can consider MSBUILDALWAYSRETRY as default behavior but it will make build with wrong user entered path took longer before fail. I recommend to lower priority of this to P3. @rainersigwald @YuliiaKovalova do you agree?

YuliiaKovalova commented 7 months ago

Close it as a low priority bug. It can be reconsidered in the future.