Closed: Nezz closed this issue 7 months ago.
Team triage: @GangWang01 could you try to reproduce the bug?
There is a similar report in https://github.com/dotnet/msbuild/discussions/9190.
The log looks like each file got ERROR_SHARING_VIOLATION initially but ERROR_ACCESS_DENIED on the retry. src/Tasks/Copy.cs does not normally retry after ERROR_ACCESS_DENIED, but this can be changed via the MSBUILDALWAYSRETRY environment variable. I don't see any obvious change in this logic between 17.6.8+c70978d4d and 17.7.1+971bf70db, though.
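Roughly, the gating being described looks like this (a simplified C# sketch, not the actual Copy.cs code; the HResult values are the standard Win32 error codes):

```csharp
using System;

static class CopyRetryGateSketch
{
    // Standard Win32 error codes as they surface through Exception.HResult.
    const int ErrorSharingViolation = unchecked((int)0x80070020); // ERROR_SHARING_VIOLATION (32)
    const int ErrorAccessDenied     = unchecked((int)0x80070005); // ERROR_ACCESS_DENIED (5)

    // MSBUILDALWAYSRETRY being set widens the retry policy.
    static readonly bool AlwaysRetry =
        Environment.GetEnvironmentVariable("MSBUILDALWAYSRETRY") != null;

    // Decide whether a failed copy attempt is worth retrying.
    static bool IsRetriable(Exception e) => e.HResult switch
    {
        ErrorSharingViolation => true,        // file transiently held open: retry
        ErrorAccessDenied     => AlwaysRetry, // normally fatal unless opted in
        _                     => false,
    };
}
```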
Our builds started failing for the same reason on 9th August, the day after the 7.0.400 SDK was released, which seemed too neat to be a coincidence.
However, it looks like our build agents didn't upgrade to 7.0.400 until a couple of days later, and we'd already had loads of failures by then. So it's not that simple, at least for us.
Could this be due to some other Microsoft environmental change around the same time?
It could also be a Windows update. The base virtual machine image we use with 7.0.400 was newer than the one we had with 7.0.306.
Between the virtual machine images, is there any difference in the file system minifilters listed by `fltmc`? (I expect not, but it's quick to check.)
Problematic VM:
09:45:00 Filter Name Num Instances Altitude Frame
09:45:00 ------------------------------ ------------- ------------ -----
09:45:00 storqosflt 0 244000 0
09:45:00 wcifs 0 189900 0
09:45:00 CldFlt 0 180451 0
09:45:00 FileCrypt 0 141100 0
09:45:00 luafv 1 135000 0
09:45:00 npsvctrig 1 46000 0
09:45:00 Wof 1 40700 0
Good VM:
10:02:44 Filter Name Num Instances Altitude Frame
10:02:44 ------------------------------ ------------- ------------ -----
10:02:44 storqosflt 0 244000 0
10:02:44 wcifs 0 189900 0
10:02:44 CldFlt 0 180451 0
10:02:44 FileCrypt 0 141100 0
10:02:44 luafv 1 135000 0
10:02:44 npsvctrig 1 46000 0
10:02:44 Wof 1 40700 0
Looks like no difference?
Not sure if it's relevant, but we call dotnet build with the `-nodereuse:false` parameter because on Windows we had issues with lingering dotnet processes that can leave files locked. We don't need that parameter on macOS or Linux.
I couldn't reproduce the issue, but I did try a few things to understand what happened. As KalleOlaviNiemitalo mentioned:
> The log looks like each file got ERROR_SHARING_VIOLATION initially but ERROR_ACCESS_DENIED on the retry. src/Tasks/Copy.cs does not normally retry after ERROR_ACCESS_DENIED, but this can be changed via the MSBUILDALWAYSRETRY environment variable.
It was the ERROR_ACCESS_DENIED on the retry that broke the retry loop; see https://github.com/dotnet/msbuild/blob/3c910ba83fc9dbd8e12f50dddc8c381404f928c4/src/Tasks/Copy.cs#L827-L843.
With this simple App.zip, I tried two scenarios (a sketch of both follows the list):
1. Lock the target file: the copy with retries worked well. See binlog RetryCopying.zip.
2. Set the target file read-only, or change its ACL to deny access: there was no copy with retries; it failed immediately with access denied. See binlog AccessDenied.zip.
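A minimal self-contained C# sketch of those two failure modes (paths are hypothetical, and the read-only attribute stands in for the ACL-deny case here):

```csharp
using System;
using System.IO;

class CopyFailureModes
{
    static void Main()
    {
        // Hypothetical paths for illustration only.
        string source = Path.Combine(Path.GetTempPath(), "source.dll");
        string target = Path.Combine(Path.GetTempPath(), "target.dll");
        File.WriteAllText(source, "new content");
        File.WriteAllText(target, "old content");

        // Scenario 1: hold the target open with no sharing. File.Copy fails
        // with an IOException whose HResult is ERROR_SHARING_VIOLATION
        // (0x80070020) - the case Copy.cs retries.
        using (new FileStream(target, FileMode.Open, FileAccess.Read, FileShare.None))
        {
            try { File.Copy(source, target, overwrite: true); }
            catch (IOException e) { Console.WriteLine($"Sharing: 0x{e.HResult:X8}"); }
        }

        // Scenario 2: mark the target read-only. File.Copy now throws
        // UnauthorizedAccessException (ERROR_ACCESS_DENIED, 0x80070005),
        // which Copy.cs does not retry unless MSBUILDALWAYSRETRY is set.
        File.SetAttributes(target, FileAttributes.ReadOnly);
        try { File.Copy(source, target, overwrite: true); }
        catch (UnauthorizedAccessException e) { Console.WriteLine($"Denied: 0x{e.HResult:X8}"); }
        File.SetAttributes(target, FileAttributes.Normal); // cleanup
    }
}
```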
My guess is that in this issue, multiple Copy tasks overwriting the same target file caused the file's ACL to be reinitialized, and when the next Copy task happened to overwrite the file, access denied occurred and broke the retry. I'm not sure about this, though.
I'm not sure what causes it, but `export MSBUILDALWAYSRETRY=1` resolved the issue completely.
@Nezz Glad `MSBUILDALWAYSRETRY=1` resolved the issue.
We still need your help to understand what causes the file to be locked or access denied, and what, if anything, needs to improve in MSBuild. Can you use Process Monitor to find out which process locks the target file of the Copy task, or changes its ACL, while reproducing this issue? If possible, please provide the Process Monitor log as well as the build binary log (see Providing MSBuild Binary Logs for investigation). Thank you!
Hi @Nezz, could you provide the requested information? Thank you!
Sadly these issues happen on AWS EC2 instances that are terminated as soon as the job finishes running, and there is no remote desktop access. Is there a way to obtain this information via the command line?
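Process Monitor does support unattended capture from the command line, so something like the following batch sketch could be wrapped around the build step (assuming Sysinternals procmon.exe is present on the agent; paths are examples):

```bat
rem Start capturing to a backing file before the build.
procmon.exe /AcceptEula /Quiet /Minimized /BackingFile C:\logs\copytask.pml

rem ... run the build here ...

rem Stop the capture and flush the log once the build finishes.
procmon.exe /Terminate
```

The resulting .pml file could then be uploaded as a build artifact before the instance terminates.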
I am not sure whether there is any follow-up on this issue. We have a workaround, and without a repro or a Process Monitor log we don't know how to detect this transient file access issue. We could consider making MSBUILDALWAYSRETRY the default behavior, but that would make a build with a wrongly entered path take longer before failing. I recommend lowering the priority of this to P3. @rainersigwald @YuliiaKovalova do you agree?
Closing this as a low-priority bug. It can be reconsidered in the future.
Issue Description
We started receiving random build failures caused by failing file copy operations:
After investigating we found that this is caused by a regression between .NET SDK 7.0.306 (MSBuild version 17.6.8+c70978d4d for .NET) and 7.0.400 (MSBuild version 17.7.1+971bf70db for .NET). The older version worked around this by doing a retry:
The new version fails right away if there is already a retry queued:
Note that in the logs above the two lines are logged for different projects.
Steps to Reproduce
Build a large project that references many of the same files. Exact details and a minimal repro project are not provided because the root cause has been identified above.
Expected Behavior
File copy operations should be retried
Actual Behavior
File copy operations are not retried and fail
Analysis
Relevant logs:
Versions & Configurations
MSBuild version 17.7.1+971bf70db for .NET