dotnet / msbuild

The Microsoft Build Engine (MSBuild) is the build platform for .NET and Visual Studio.
https://docs.microsoft.com/visualstudio/msbuild/msbuild
MIT License
5.17k stars 1.34k forks source link

[Infra] SampleAnalyzerIntegrationTest randomly timeouting in CI #10036

Open JanKrivanek opened 2 months ago

JanKrivanek commented 2 months ago

Context

System.TimeoutException : Test failed due to timeout: process 12042 is active for more than 30 sec.

   at Microsoft.Build.UnitTests.Shared.RunnerUtilities.RunProcessAndGetOutput(String process, String parameters, Boolean& successfulExit, Boolean shellExecute, ITestOutputHelper outputHelper) in /_/src/UnitTests.Shared/RunnerUtilities.cs:line 139
   at Microsoft.Build.UnitTests.Shared.RunnerUtilities.ExecBootstrapedMSBuild(String msbuildParameters, Boolean& successfulExit, Boolean shellExecute, ITestOutputHelper outputHelper) in /_/src/UnitTests.Shared/RunnerUtilities.cs:line 67
   at Microsoft.Build.BuildCheck.UnitTests.EndToEndTests.SampleAnalyzerIntegrationTest(Boolean buildInOutOfProcessNode, Boolean analysisRequested) in /home/vsts/work/1/s/src/BuildCheck.UnitTests/EndToEndTests.cs:line 122
   at System.RuntimeMethodHandle.InvokeMethod(Object target, Void** arguments, Signature sig, Boolean isConstructor)
   at System.Reflection.MethodBaseInvoker.InvokeDirectByRefWithFewArgs(Object obj, Span`1 copyOfArgs, BindingFlags invokeAttr)

Sample run: https://dev.azure.com/dnceng-public/public/_build/results?buildId=648080&view=ms.vss-test-web.build-test-results-tab&runId=16013174&resultId=100001&paneView=debug

YuliiaKovalova commented 2 months ago

discussed offline: the execution time can be extended for handling it. Please track it for some time if this approach solves the problem

GangWang01 commented 2 months ago

After observing 9 runs that failed with this test, the failure only happened to the following test cases. From the output logs attached inline, it looks like within 30 seconds the build check analyzer hadn't finished.

Checking with the output log Microsoft.Build.BuildCheck.UnitTests_net8.0_x64.log of fully passed test run, I found there were over 1000 warnings when the analysis was requested. But the warnings except BC0101 couldn't be reproduced locally. This may explain it looks like within 30 seconds the build check analyzer hadn't finished. No idea why CI run has so many extra warnings.

YuliiaKovalova commented 2 months ago

@JanKrivanek , can this fix help with the test performance? https://github.com/dotnet/msbuild/pull/10084

JanKrivanek commented 2 months ago

Only slightly - it slashes only about a third of currently detected slowdown of the analyzers. Though the detected slowdown was in range of about 14% - so it's puzzling that the process runs for over 30 second. It would be interesting to see if it fails with about the same or significantly decreased ratio if we increase the timeout to say - 120 seconds. That might bisect between the slowdown and deadlock theory

JanKrivanek commented 1 month ago

Another incidence (after the timeout increasing): https://dev.azure.com/dnceng-public/public/_build/results?buildId=681320&view=ms.vss-test-web.build-test-results-tab&runId=16909114&resultId=100003&paneView=debug

JanKrivanek commented 1 month ago

Might be totally red herring:

The timeouting cases seem to have TerminalLogger used (notice the control sequences):

   Microsoft.Build.BuildCheck.UnitTests: [Long Running Test] 'Microsoft.Build.BuildCheck.UnitTests.EndToEndTests.SampleAnalyzerIntegrationTest', Elapsed: 00:01:02
Executing [/home/vsts/work/1/s/.dotnet/dotnet /home/vsts/work/1/s/artifacts/bin/bootstrap/net8.0/MSBuild/MSBuild.dll FooBar.csproj /m:1 -nr:False -restore -analyze]
==== OUTPUT ====
    Microsoft.Build.BuildCheck.UnitTests.EndToEndTests.SampleAnalyzerIntegrationTest(buildInOutOfProcessNode: True, analysisRequested: True) [FAIL]
      System.TimeoutException : Test failed due to timeout: process 11430 is active for more than 120 sec.

Sample full log: TIMEOUT-Microsoft.Build.BuildCheck.UnitTests_net8.0_x64.log

Wherease I do not see those appear in individual test cases for the success cases - sample log: OK-Microsoft.Build.BuildCheck.UnitTests_net8.0_x64.log

Regardless of it being red herring or not - it is strange that TerminalLogger is used in CI. I believe @AR-May was looking into somthing similar

JaynieBai commented 1 month ago

@JanKrivanek and @maridematte I repro this issue with the command "dotnet test ./artifacts/bin/Microsoft.Build.BuildCheck.UnitTests/Debug/net8.0/Microsoft.Build.BuildCheck.UnitTests.dll --filter "FullyQualifiedName~SampleAnalyzerIntegrationTest" randomly " in linux machine.

Added -v:d with process command to output more info and found the build is stuck in target GetCopyToPublishDirectoryItems

Image

This is build log.txt.

JanKrivanek commented 1 month ago

Perfect!

Can you collect 2 more cases to see if there is any pattern? Plus appart from the diag log it might be helpfull to have binlog as well.

Other than that I unfortunately do not have any idea now about what can bw causing this

JaynieBai commented 1 month ago

Perfect!

Can you collect 2 more cases to see if there is any pattern? Plus appart from the diag log it might be helpfull to have binlog as well.

Other than that I unfortunately do not have any idea now about what can bw causing this

Here is log with -v:diag diaglog.txt

The binlog is not completed when failed. True.binlog.txt

Image

JanKrivanek commented 1 month ago

Yeah - the binlog being cut aburptly is fine (and expected as test is killed after timeout) - the timeline from binlog will be interresting - it should show where was the time spent. At it'd be nice to see 2 or 3 cases - to be able to compare if the excesive time is spent in the same part of the build or if it is random.

Since it is Linux, we unfortunately cannot use ETW :-/ Maybe @AR-May or @ladipro have some top of the head ideas of what to focus on beyond the timing in binlog.

JaynieBai commented 1 month ago

Yeah - the binlog being cut aburptly is fine (and expected as test is killed after timeout) - the timeline from binlog will be interresting - it should show where was the time spent. At it'd be nice to see 2 or 3 cases - to be able to compare if the excesive time is spent in the same part of the build or if it is random.

Since it is Linux, we unfortunately cannot use ETW :-/ Maybe @AR-May or @ladipro have some top of the head ideas of what to focus on beyond the timing in binlog.

I compare the logs of several failed cases, it's always the same part of the build. And unfortunately, there is no timeline shown in the binlog True.binlog.txt

JanKrivanek commented 1 month ago

From offline discussion

Observations:

Recommendation on further steps:

As a side not - the issue happens during restore. So it might get resolved by skipping the restore - https://github.com/dotnet/msbuild/issues/9747. But it still would be very valuable to understand why is it happening

GangWang01 commented 22 hours ago

Though this issue might be resolved in the fix above, I had some information about trying to debug the test running on linux remotely from VS on Windows. I tried the following ways. But they failed with different errors. I was not able to debug this test running on Linux.

YuliiaKovalova commented 22 hours ago

Though this issue might be resolved in the fix above, I had some information about trying to debug the test running on linux remotely from VS on Windows. I tried the following ways. But they failed with different errors. I was not able to debug this test running on Linux.

  • Debug the test in WSL Remote debugging in WSL got the error MSB4062: The "CheckForImplicitPackageReferenceOverrides" task could not be loaded from the assembly /mnt/d/WS/fork/msbuild/artifacts/bin/bootstrap/net8.0/MSBuild/Sdks/Microsoft.NET.Sdk/targets/../tools/net8.0/Microsoft.NET.Build.Tasks.dll. An attempt was made to load a program with an incorrect format. See TestLog.txt for more details. After my investigation, I found the assembly Microsoft.NET.Build.Tasks.dll existed. But this assembly from clean build on Windows is not compatible to run in WSL due to its architecture x64, while the architecture of it from clean build on Linux is 64797.
  • Attach the debugger to the child process started by the test running on Linux I could attach the debugger to the started process(dotnet ) by the test on Linux for a while. There is was no call stack displayed and after a while the error below popped up. No Windows application event log was found either... image

@GangWang01 , thank you for analysis. I was able to catch it on Windows locally, after adding a new build-in rule. Using Parallel.ForEach in this case was causing deadlock, according to the callstack. So it has to be OK now. Could you please reenable to tests in a separate PR now? I included it in my changes, but the review can take time :)