dotnet / sdk

Core functionality needed to create .NET Core projects, that is shared between Visual Studio and CLI

dotnet test hangs with .NET9 RC1/RC2 in CI build on Ubuntu #43432

Open gimlichael opened 1 month ago

gimlichael commented 1 month ago

Describe the bug

In some cases, dotnet test hangs during or after a build with .NET9 RC1.

Prior to .NET9 RC1, this worked without a problem - for previous versions as well.

Full "log" can be found here (as I was unsure which team to put the issue to): https://github.com/dotnet/core/discussions/9496#discussioncomment-10628380

To Reproduce

Run a CI build in GitHub Actions using Ubuntu and this repo: https://github.com/gimlichael/Cuemon. The main branch will fail on these two projects:

If you use this branch instead, https://github.com/gimlichael/Cuemon/tree/v9.0.0/net9rc1-gha-troubleshoot, all .NET9 RC1 updates have been reverted to .NET9 preview 7, and the two test projects mentioned above run fine.

Exceptions (if any)

No exceptions, but the test will hang - and, for me, time out after 15 minutes (because I don't like the built-in timeout of roughly 6 hours).

Further technical details

After thorough investigation over the last two days, my conclusion is that something has changed for dotnet test with .NET9 RC1.

Oddly enough, it works fine in a Windows 2022 CI build.

Locally, using Windows 11, Docker, and WSL (Ubuntu), it works as expected as well.

Workaround for .NET9 RC1 and RC2

Segregate dotnet build and dotnet test into two separate steps.

NOK:

dotnet test --configuration Debug --verbosity normal --logger trx --results-directory $RUNNER_TEMP/TestResults --collect:"XPlat Code Coverage;Format=opencover" -p:CoverletOutputFormat=opencover -p:UseSourceLink=true -p:SkipSignAssembly=true

OK:

dotnet build --configuration Debug --verbosity normal -p:SkipSignAssembly=true

dotnet test --no-build --configuration Debug --verbosity normal --logger trx --results-directory $RUNNER_TEMP/TestResults --collect:"XPlat Code Coverage;Format=opencover" --blame-hang-timeout 10m --blame-hang-dump-type mini -p:CoverletOutputFormat=opencover -p:UseSourceLink=true
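
For context, this is roughly how the two segregated steps look in a GitHub Actions workflow; a sketch only - the step names and surrounding checkout/setup steps are assumptions, not copied from the actual workflow:

steps:
  # Build as its own step.
  - name: Build
    run: dotnet build --configuration Debug --verbosity normal -p:SkipSignAssembly=true
  # Then test with --no-build so dotnet test does not trigger another build.
  - name: Test
    run: >-
      dotnet test --no-build --configuration Debug --verbosity normal
      --logger trx --results-directory $RUNNER_TEMP/TestResults
      --collect:"XPlat Code Coverage;Format=opencover"
      --blame-hang-timeout 10m --blame-hang-dump-type mini
      -p:CoverletOutputFormat=opencover -p:UseSourceLink=true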
nohwnd commented 1 month ago

@gimlichael would it be possible to create a PR that shows the problem and produces a binary log in your repository please?

gimlichael commented 1 month ago

@nohwnd I will try to put aside some time to fulfill your request; hopefully before the end of this week.

gimlichael commented 1 month ago

@nohwnd i have created this branch for you: https://github.com/gimlichael/Cuemon/tree/v9.0.0/dotnet-sdk-issue-43432

I did notice (looking at the source associated with dotnet test and the timeframe in which the bug occurred) that you guys are using a lot of .GetAwaiter().GetResult() ... I think this might be an issue in a CI environment with a low-spec CPU.
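
For illustration only (this is not code from the SDK itself), the sync-over-async pattern being referred to looks roughly like this; on a low-core CI machine, blocking thread-pool threads this way can contribute to starvation and hangs:

using System;
using System.Threading.Tasks;

public static class SyncOverAsyncExample
{
    public static void Main()
    {
        // Blocks a thread-pool thread until the task completes; the task may itself
        // need a thread-pool thread to finish, which is risky on machines with few cores.
        var result = DoWorkAsync().GetAwaiter().GetResult();
        Console.WriteLine(result);
    }

    private static async Task<string> DoWorkAsync()
    {
        await Task.Delay(100); // stand-in for real async work
        return "done";
    }
}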

Anyway, feel free to clone it and run it ... it is still only Ubuntu that fails; Windows runs fine.

I did enable the -bl build switch, but I don't see any binary logs added to the GitHub artifacts. If I go into the cog-wheel, I do see an option to download the log archive, but after 20 MB or so, an error occurs on GitHub's side.

https://github.com/gimlichael/Cuemon/actions/runs/11000415914/job/30542914518

gimlichael commented 1 month ago

@nohwnd did you have a chance to look into the issue?

nohwnd commented 1 month ago

I just tried to do it, but checking out the branch and running dotnet build fails with many errors like this:

S:\c\Cuemon\test\Cuemon.AspNetCore.Mvc.FunctionalTests\Filters\Diagnostics\FaultDescriptorFilterTest.cs(1342,21): error CS0012: The type 'Disposable' is defined in an assembly that is not referenced. You must add a reference to assembly 'Cuemon.Core, Version=9.0.0.0, Culture=neutral, PublicKeyToken=9f6823cab47d945f'.

Build failed with 5525 error(s) and 4 warning(s) in 13,8s

I don't see a contributing.md or any other description of the build steps, and your CI seems to do the same thing I did. What do I need to do to build, please?

gimlichael commented 1 month ago

@nohwnd thank you for getting back.

I just verified your discovery and realized that the branch I used had too many experimental changes for the forthcoming v9.0.0 release, which is why I replaced the branch in question with one from a little earlier in the development process.

I have verified that:

Now works as expected.

I also verified that the bug is still present.

Sorry for the inconvenience; and thank you for hinting I should have a look at my community health pages while I am at it 🙂

nohwnd commented 1 month ago

I've updated to the latest, and I don't know what I am doing wrong, but I still see the same errors. I also tried deleting the repo and cloning again, but that did not help either.

Are all the changes you wanted to push pushed?

S:\c\Cuemon [v9.0.0/dotnet-sdk-issue-43432]> git log --oneline
bd3caa34 (HEAD -> v9.0.0/dotnet-sdk-issue-43432, origin/main, origin/HEAD, main) V9.0.0/housekeeping for extensions globalization (#90)
d5f284b9 (tag: v9.0.0-preview.9) V9.0.0/move replace-line-endings to xunit (#89)
e1941c3b (tag: v9.0.0-preview.8) V9.0.0/finalize newtonsoft json migration (#88)
aa3bfa62 V9.0.0/finalize swashbuckle aspnetcore migration (#87)
5e308b4f V9.0.0/finalize asp versioning migration (#86)
f2f6db4a Bump Backport.System.Threading.Lock to 2.0.3 (#85)
b0cac8b9 (tag: v9.0.0-preview.7) V9.0.0/finalize yamldotnet migration (#84)
73f14cd6 (tag: v9.0.0-preview.6) V9.0.0/finalize xunit migration (#83)
87ca1047 V9.0.0/net9rc1 housekeeping (#82)
cf9609de (tag: v9.0.0-preview.5) V9.0.0/xunit testoutputhelperaccessor (#78)
19af986a V9.0.0/lock refactoring (#77)
51c275e4 (tag: v9.0.0-preview.4) V9.0.0/hotfix, housekeeping and dependency bump (#76)
450b22ef (tag: v9.0.0-preview.3) V9.0.0/System.Threading.Lock (#73)
a539fa16 V9.0.0/housekeeping and CI fix (#75)
11c206ac (tag: v9.0.0-preview.2) V9.0.0/problem details (#74)
e251523d (tag: v9.0.0-preview.1) V9.0.0/major (#72)
ad2cd5c1 :pencil2: updated changelog
nohwnd commented 1 month ago

I see you target both net9 and net8. I am on net9-rc2, but I also tried downgrading to net9-preview7 and it has the same problem. The errors also don't seem to be warnings promoted to errors.

gimlichael commented 1 month ago

I have now double- and triple-checked the branch.

Expectations to meet:

If these TFMs are in place, the build will succeed.

I was able to locate a typo in my post from before - DO REMOVE -b (as this will check out a new branch from main) ..
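
To illustrate the difference (this is not the original set of steps, just the git behavior described above):

# With -b, git creates a NEW local branch from the current HEAD (typically main),
# so the repro changes are never checked out.
git checkout -b v9.0.0/dotnet-sdk-issue-43432

# Without -b, the existing remote branch is checked out.
git fetch origin
git checkout v9.0.0/dotnet-sdk-issue-43432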

Here are the adjusted steps:

Sorry for the typo .. 😪

gimlichael commented 1 month ago

@nohwnd do let me know, despite the setbacks, whether you were able to proceed with the bug hunting 🙂

nohwnd commented 1 month ago

I was able to make it run, but I don't see any hangs yet.

gimlichael commented 1 month ago

Keep in mind that I cannot provoke the error in my local development environment. It only occurs in GitHub Actions using an Ubuntu runner.

Consider running this flow; it is tweaked to only include the two failing projects.

McMlok commented 4 weeks ago

Hello,

I have the same problem, but on a different CI. We are using JenkinsX on an on-premises K8s (1.29) cluster. We have our own tailored build images based on Ubuntu 22.04. When I update the .NET SDK to 9.0 RC1, the pipeline gets stuck on the test step. As in @gimlichael's case, I run dotnet test with the --no-build parameter.

I tried these SDK versions:

9.0.100-preview.7.24407.12 - OK
9.0.100-rc.1.24452.12 - NOK
9.0.100-rc.2.24474.11 - NOK

First, I suspected a breaking change with the terminal logger, but adding MSBUILDTERMINALLOGGER=off to the image does not help.

P.S. I cannot share logs from our internal project, but if I am able to create a minimal repro I will publish it.

gimlichael commented 4 weeks ago

@nohwnd - any update on this issue? My workaround works, but I still think you guys should fix the issue introduced sometime between preview 7 and RC1. For your convenience, maybe this can help get you on track: https://github.com/dotnet/sdk/commits/main/src/Cli/dotnet/commands/dotnet-test?since=2024-08-13&until=2024-09-10

McMlok commented 4 weeks ago

I did a few experiments and it looks like the problem is related to logging.

First I tried blame-hang, in case the problem was in the tests, but without luck:

dotnet test --configuration Release --verbosity n --no-build --no-restore --blame-hang --blame-hang-dump-type full --blame-hang-timeout 1m

Then I tried:

dotnet test --configuration Release --verbosity q --no-build --no-restore

And the build on the CI agent passed successfully.

gimlichael commented 4 weeks ago

> I did a few experiments and it looks like the problem is related to logging.
>
> First I tried blame-hang, in case the problem was in the tests, but without luck:
>
> dotnet test --configuration Release --verbosity n --no-build --no-restore --blame-hang --blame-hang-dump-type full --blame-hang-timeout 1m
>
> Then I tried:
>
> dotnet test --configuration Release --verbosity q --no-build --no-restore
>
> And the build on the CI agent passed successfully.

Interesting - especially because, prior to this bug, I always used a verbosity level of quiet. Only after the bug surfaced did I change the verbosity to normal, detailed, and diagnostic.

GitHub Actions DOES NOT like the diagnostic log level .. haha .. you cannot even download the logs afterwards due to their size.

That said - and as you have discovered - it might still be related to logging. Let's hope the SDK team finds a solution 🤞🏼

gimlichael commented 2 weeks ago

@nohwnd - any updates? It fails with both RC1 and RC2. Last working build was Preview 7.

nohwnd commented 2 weeks ago

I am sorry but there are no updates. This is on my work list, but unfortunately not on the top.

gimlichael commented 2 weeks ago

> I am sorry but there are no updates. This is on my work list, but unfortunately not on the top.

Fair - thank you for the honest response. Fingers crossed it will be solved with a patch after the .NET 9 launch next month 🤞🏼

fabianoliver commented 6 days ago

> I am sorry but there are no updates. This is on my work list, but unfortunately not on the top.

Thank you for the update!

We've run into the same issue with RC2, also on Jenkins on an on-prem cluster.

Until this is fixed, it will categorically prevent us from moving over to .NET 9 once it is released - I'd imagine a good number of enterprise users might be in the same boat, though it is hard to judge whether it affects just a small fraction of CI build setups or is a wider problem. Either way, thanks for looking into this!

GOATS2K commented 3 days ago

.NET 9 RC2 - dotnet test running on a self-hosted Act runner, running in a Docker container using an Ubuntu image, seems to hang on one host but runs OK on another.

It hangs when I run my tests using this command (works fine in .NET 8):

dotnet test --no-build --verbosity normal

And it crashes when I run it with this command:

dotnet test --configuration Release --verbosity normal --no-build --no-restore

with the following log

Build started 11/04/2024 19:13:23.
     1>Project "/workspace/Starlight/Starlight/Starlight.sln" on node 1 (VSTest target(s)).
     1>ValidateSolutionConfiguration:
         Building solution configuration "Release|Any CPU".
     1>Project "/workspace/Starlight/Starlight/Starlight.sln" (1) is building "/workspace/Starlight/Starlight/tests/Starlight.Services.Tests/Starlight.Services.Tests.csproj" (6) on node 7 (VSTest target(s)).
     6>_VSTestConsole:
         MSB4181: The "VSTestTask" task returned false but did not log an error.
     6>Done Building Project "/workspace/Starlight/Starlight/tests/Starlight.Services.Tests/Starlight.Services.Tests.csproj" (VSTest target(s)) -- FAILED.
     1>Project "/workspace/Starlight/Starlight/Starlight.sln" (1) is building "/workspace/Starlight/Starlight/tests/Starlight.Updater.Tests/Starlight.Updater.Tests.csproj" (8) on node 2 (VSTest target(s)).
     8>_VSTestConsole:
         MSB4181: The "VSTestTask" task returned false but did not log an error.
     8>Done Building Project "/workspace/Starlight/Starlight/tests/Starlight.Updater.Tests/Starlight.Updater.Tests.csproj" (VSTest target(s)) -- FAILED.
     1>Project "/workspace/Starlight/Starlight/Starlight.sln" (1) is building "/workspace/Starlight/Starlight/tests/Starlight.Models.Tests/Starlight.Models.Tests.csproj" (9) on node 3 (VSTest target(s)).
     9>_VSTestConsole:
         MSB4181: The "VSTestTask" task returned false but did not log an error.
     9>Done Building Project "/workspace/Starlight/Starlight/tests/Starlight.Models.Tests/Starlight.Models.Tests.csproj" (VSTest target(s)) -- FAILED.
     1>Done Building Project "/workspace/Starlight/Starlight/Starlight.sln" (VSTest target(s)) -- FAILED.
Build FAILED.
    0 Warning(s)
    0 Error(s)
Time Elapsed 00:00:01.24

Changing the log level to quiet does not help.

Finally, if I just try to run dotnet test, the test runner only discovers a single test out of many.

I cannot reproduce the issue locally on my macOS machine.

nohwnd commented 2 days ago

Finally getting back to this. If someone has a better local repro, a binary log, or diagnostic logs that they can share, please let me know.

nohwnd commented 2 days ago

I was able to download the original logs from Cuemon (from the GitHub Action; they show that the VSTest task is running, but that is about it).

2024-09-23T18:48:38.1132167Z   1:19>Done building target "ShowCallOfVSTestTaskWithParameter" in project "Cuemon.AspNetCore.Mvc.FunctionalTests.csproj".
2024-09-23T18:48:38.1132353Z        Done executing task "CallTarget".
2024-09-23T18:48:38.1133033Z        Using "VSTestTask" task from assembly "/usr/share/dotnet/sdk/9.0.100-rc.1.24452.12/Microsoft.TestPlatform.Build.dll".
2024-09-23T18:48:38.1133182Z        Task "VSTestTask"
2024-09-23T18:48:38.1133375Z          Task Parameter:VSTestBlameHang=True
2024-09-23T18:48:38.1133584Z          Task Parameter:VSTestVerbosity=detailed
2024-09-23T18:48:38.1134227Z          Task Parameter:VSTestTraceDataCollectorDirectoryPath=/home/runner/.nuget/packages/microsoft.codecoverage/17.11.1/build/netstandard2.0/
2024-09-23T18:48:38.1134573Z          Task Parameter:VSTestResultsDirectory=/home/runner/work/_temp/TestResults
2024-09-23T18:48:38.1135189Z          Task Parameter:VSTestTestAdapterPath=/home/runner/.nuget/packages/coverlet.collector/6.0.2/build/netstandard2.0/
2024-09-23T18:48:38.1135410Z          Task Parameter:VSTestBlameHangTimeout=10m
2024-09-23T18:48:38.1135922Z          Task Parameter:VSTestConsolePath=/usr/share/dotnet/sdk/9.0.100-rc.1.24452.12/vstest.console.dll
2024-09-23T18:48:38.1136191Z          Task Parameter:VSTestFramework=.NETCoreApp,Version=v9.0
2024-09-23T18:48:38.1136448Z          Task Parameter:VSTestArtifactsProcessingMode=collect
2024-09-23T18:48:38.1136627Z          Task Parameter:VSTestLogger=trx
2024-09-23T18:48:38.1136935Z          Task Parameter:VSTestCollect=XPlat Code Coverage;Format=opencover
2024-09-23T18:48:38.1137390Z          Task Parameter:VSTestSessionCorrelationId=2566_e2458d0a-316a-4f7c-bc36-809795f7b744
2024-09-23T18:48:38.1138235Z          Task Parameter:TestFileFullPath=/home/runner/work/Cuemon/Cuemon/test/Cuemon.AspNetCore.Mvc.FunctionalTests/bin/Debug/net9.0/Cuemon.AspNetCore.Mvc.FunctionalTests.dll
2024-09-23T18:48:38.1138458Z          Task Parameter:VSTestBlameHangDumpType=mini
2024-09-23T19:00:29.2866221Z ##[error]The operation was canceled.
2024-09-23T19:00:29.2965128Z Post job cleanup.
2024-09-23T19:00:29.3022727Z Post job cleanup.
2024-09-23T19:00:29.3078135Z Post job cleanup.
2024-09-23T19:00:29.3133174Z Post job cleanup.
2024-09-23T19:00:29.4096216Z [command]/usr/bin/git version
2024-09-23T19:00:29.4130361Z git version 2.46.0
2024-09-23T19:00:29.4173260Z Temporarily overriding HOME='/home/runner/work/_temp/a32d568f-9d17-48cb-af35-577639717d18' before making global git config changes
2024-09-23T19:00:29.4175549Z Adding repository directory to the temporary git global config as a safe directory
2024-09-23T19:00:29.4179724Z [command]/usr/bin/git config --global --add safe.directory /home/runner/work/Cuemon/Cuemon
2024-09-23T19:00:29.4214951Z [command]/usr/bin/git config --local --name-only --get-regexp core\.sshCommand
2024-09-23T19:00:29.4245963Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'core\.sshCommand' && git config --local --unset-all 'core.sshCommand' || :"
2024-09-23T19:00:29.4907938Z [command]/usr/bin/git config --local --name-only --get-regexp http\.https\:\/\/github\.com\/\.extraheader
2024-09-23T19:00:29.4909542Z http.https://github.com/.extraheader
2024-09-23T19:00:29.4913024Z [command]/usr/bin/git config --local --unset-all http.https://github.com/.extraheader
2024-09-23T19:00:29.4917309Z [command]/usr/bin/git submodule foreach --recursive sh -c "git config --local --name-only --get-regexp 'http\.https\:\/\/github\.com\/\.extraheader' && git config --local --unset-all 'http.https://github.com/.extraheader' || :"
2024-09-23T19:00:29.5160414Z Cleaning up orphan processes
2024-09-23T19:00:29.5462333Z Terminate orphan process: pid (2566) (dotnet)
2024-09-23T19:00:29.5511293Z Terminate orphan process: pid (2596) (dotnet)
2024-09-23T19:00:29.5581880Z Terminate orphan process: pid (2597) (dotnet)
2024-09-23T19:00:29.5638717Z Terminate orphan process: pid (2598) (dotnet)
2024-09-23T19:00:29.5692699Z Terminate orphan process: pid (2885) (dotnet)
2024-09-23T19:00:29.5781556Z Terminate orphan process: pid (2886) (dotnet)

So pretty much: the tests are running, some work is being done, and then the parent kills it.

Will try to run it on GH to see if I can get the diag logs.

Youssef1313 commented 2 days ago

This is probably the same as https://github.com/microsoft/vstest/issues/5091 ? cc @nohwnd

nohwnd commented 2 days ago

I am seeing the abort when I have build logging set to detailed, but not when I set it to normal.

gimlichael commented 2 days ago

> I am seeing the abort when I have build logging set to detailed, but not when I set it to normal.

This is interesting - and on par with what @McMlok is experiencing. However, prior to this bug, my default verbosity level was set to quiet, so there might be some different use cases that tie into logging.

GOATS2K commented 2 days ago

@nohwnd I can probably produce some more detailed logs today - if you tell me the exact dotnet test command you'd want me to run.

> I was able to download the original logs from Cuemon (from the GitHub Action; they show that the VSTest task is running, but that is about it).

I also saw this with diagnostic logging yesterday; it says the VSTest task is running, but it is hung.

nohwnd commented 2 days ago

Just add --diag:logs/log.txt to your command and it will produce logs in a logs folder in the current directory. Or you can specify any other relative or absolute path. Please grab all the logs in the folder, not just the one called log.txt.
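
For example, appended to the test command used earlier in this thread (the log path is just an illustration):

dotnet test --no-build --configuration Debug --verbosity normal --logger trx --results-directory $RUNNER_TEMP/TestResults --diag:logs/log.txt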

nohwnd commented 2 days ago

> This is probably the same as microsoft/vstest#5091 ? cc @nohwnd

It might or might not be. Hard to say without a consistent repro :/

gimlichael commented 2 days ago

@nohwnd I did yet another CI run ... same result with the log level set to normal.

The tests complete on Windows, but do not complete on Ubuntu.

Compared to my workaround, two additional things to mention:

I hope this helps you further. Since you are at Microsoft, maybe you can get the full view here: https://github.com/gimlichael/Cuemon/actions/runs/11697884725

logs_30547850241.zip

I did use the -bl switch, but it does not look like it is part of the zip file.

nohwnd commented 2 days ago

Thanks for the logs. I don't see any diag logs in there, and without them I am pretty blind. I tried yesterday to get them, but they did not upload; I must have got the paths wrong. I tried to do that on my fork (I am syncing your repro branch to my main): https://github.com/nohwnd/Cuemon/actions/runs/11691191041

gimlichael commented 2 days ago

> Thanks for the logs. I don't see any diag logs in there, and without them I am pretty blind. I tried yesterday to get them, but they did not upload; I must have got the paths wrong. I tried to do that on my fork (I am syncing your repro branch to my main): https://github.com/nohwnd/Cuemon/actions/runs/11691191041

I can try to enable --diag:/home/runner/work/_temp/TestResults/logs/log.txt, but I am afraid I will be unable to download the results from GitHub (for instance, detailed logs cannot be downloaded due to size - another bug on the GitHub platform).

Update: the --diag switch did not appear to provide any additional content.

nohwnd commented 1 day ago

You should be able to upload them as artifacts. I also had better luck clicking the cog wheel and selecting "see raw logs", which opens the whole giant log in the browser; it takes some time to load, but then it searches just fine.

Your upload needs to be set to continue after the previous step fails, btw. (Mine was not, of course, so I don't have the logs yet.)

gimlichael commented 1 day ago

> You should be able to upload them as artifacts. I also had better luck clicking the cog wheel and selecting "see raw logs", which opens the whole giant log in the browser; it takes some time to load, but then it searches just fine.
>
> Your upload needs to be set to continue after the previous step fails, btw. (Mine was not, of course, so I don't have the logs yet.)

I have tweaked it further, though I doubt it will make any difference, as I already set fail-fast: false (hence, everything is reported as a success). I will download the full log as suggested and zip it.

job-logs.zip

gimlichael commented 1 day ago

After a little fiddling around, it looks like I need to use always(), as per https://docs.github.com/en/actions/writing-workflows/choosing-what-your-workflow-does/evaluate-expressions-in-workflows-and-actions#always .. I leave it here for others to be inspired by.
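
For anyone else wiring this up, a minimal sketch of such an upload step; the step name, artifact name, and path are placeholders, not copied from the actual workflow:

- name: Upload diag logs
  if: always()   # run even if the preceding test step failed or timed out
  uses: actions/upload-artifact@v4
  with:
    name: diag-logs
    path: ${{ runner.temp }}/TestResults/logs/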

Here are the logs you requested: logs.zip

nohwnd commented 1 day ago

Thanks for the tip, I was struggling with that as well.