Lombiq / Open-Source-Orchard-Core-Extensions

Looking for some useful Orchard Core extensions? Here's a bundle solution of all of Lombiq's open-source Orchard Core extensions (modules and themes). Clone and try them out!
https://lombiq.com
BSD 3-Clause "New" or "Revised" License

Remove `build-create-binary-log` and `dotnet-test-process-timeout` configurations from OSOCE workflows (OSOE-834) #736

dministro opened 5 months ago

dministro commented 5 months ago

After https://github.com/Lombiq/UI-Testing-Toolbox/issues/228, remove the build-create-binary-log and dotnet-test-process-timeout configurations from OSOCE workflows.

Jira issue

Piedone commented 5 months ago

As per https://github.com/Lombiq/UI-Testing-Toolbox/issues/228#issuecomment-2033024965:

Let's keep the current configuration for a while (1-2 months).

Then we can also remove it from Lombiq Tenants Core.

Piedone commented 4 months ago

I again see random, but frequent Windows hangs, e.g. the two runs here: https://github.com/Lombiq/Open-Source-Orchard-Core-Extensions/actions/runs/8976025812?pr=746. These hang while only telling me that a PS command is hanging, which is the `dotnet test` one.


Then this same code worked here and here, while getting stuck here and here (I deliberately disabled hang timeouts for these two to see what happens).

Do you perhaps have any suggestions?

dministro commented 4 months ago

Twice in a row; that was really rare, I can say.

As I see in the dump, `AsyncStreamReader.ReadBufferAsync` is waiting to read something from the stdout or stderr of the `dotnet` process. The `dotnet` process itself was not dumped, so possibly it had finished before the timeout expired.
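The pattern described here can be reproduced in isolation: a reader on a redirected stream blocks until the child closes its end of the pipe. A minimal PowerShell sketch of that pattern (assuming `pwsh` is on the PATH; this is not the UI Testing Toolbox's actual code):

```powershell
# Start a child process with its stdout redirected to us.
$psi = [System.Diagnostics.ProcessStartInfo]::new()
$psi.FileName = 'pwsh'
$psi.Arguments = '-NoProfile -Command "Write-Output hello"'
$psi.RedirectStandardOutput = $true
$psi.UseShellExecute = $false

$process = [System.Diagnostics.Process]::Start($psi)

# ReadToEnd() blocks until the child's stdout is closed. If a grandchild
# process inherits the stdout handle, this read can keep blocking even after
# the direct child has exited, which would match the state seen in the dump.
$output = $process.StandardOutput.ReadToEnd()
$process.WaitForExit()
```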

Either the `dotnet` process had already exited, or the `::warning::The process $($process.Id) didn't exit in $Timeout seconds.` message was not logged, for an unknown reason.
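For context, the quoted warning comes from a wait-with-timeout pattern along these lines (hypothetical function name; the real `StartProcessAndWaitForExit` helper in Lombiq's GitHub Actions scripts may differ):

```powershell
function Wait-ProcessOrWarn([System.Diagnostics.Process] $process, [int] $Timeout)
{
    # WaitForExit with a millisecond argument returns $false when the timeout
    # expires before the process exits.
    if (-not $process.WaitForExit($Timeout * 1000))
    {
        # GitHub Actions picks this "workflow command" up from stdout and
        # renders it as a warning annotation.
        Write-Output "::warning::The process $($process.Id) didn't exit in $Timeout seconds."
    }
}
```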

I don't know if it's possible, but maybe this is caused by a special or control character coming out earlier on the given stream.

I'll look into the logs and dumps if it happens again. Please report similar failures under this issue.

Piedone commented 4 months ago

Thank you for looking into this! How do you get to see what you show in your first screenshot? This is all I see when opening the dump (Debug with Mixed):


I've done three test runs with hang timeouts: the first one passed, the other two hung, see here and here. The hang dump shows the same.

These are for some new code (see https://github.com/Lombiq/Open-Source-Orchard-Core-Extensions/pull/746) that might cause new issues too, though I don't see why it would. And there's no output from the hanging test run.

But! If I skip `TimeoutTests`, then I get 4 passes in a row (two runs here, then here and here). @sarahelsaig, does this ring a bell? Something in the timeout config seems to get tests stuck.

dministro commented 4 months ago

You can turn on the Tasks window by selecting the Debug/Windows/Tasks menu item.

Once you have the Tasks window open, double-click the line selected in the screenshot.

The Threads, Watch, and Call Stack windows also come in handy in similar situations.

I'll look at the logs tomorrow at the latest.

Piedone commented 4 months ago

Ah, thanks!

Looking at these, there's a bunch of Application Insights references, which seems quite strange (however, there's no stack trace beyond a depth of 1, so I don't see where those are actually coming from). In any case, I removed our Application Insights module from being enabled during tests. This resulted in two passes (here and here) and two hangs (here and here). These still include AI references in the same way. So, it's not our AI, but something by GitHub.

I also did some test runs with reruns of what we have in dev (here and here), and they passed (though the second one failed with an "ENOENT: no such file or directory..." error; that's a problem with the artifact upload).

Piedone commented 3 months ago

BTW it's always `Lombiq.Tests.UI.Samples` that hangs. And there's no output, though I thought we'd have that now with streaming test outputs. Locally I can see the same behavior when running `Invoke-SolutionTests`: no output from a test project until the next one starts.

Having the test run's output show up in the console output would help immensely to pinpoint where this goes wrong.

Piedone commented 3 months ago

That may be an xUnit/dotnet test issue; I've opened https://github.com/xunit/xunit/discussions/2925.

Piedone commented 3 months ago

That's something xUnit needs to add: https://github.com/xunit/visualstudio.xunit/issues/408.

Piedone commented 3 months ago

I added a debug log file that gets updated immediately on every `ITestOutputHelper` write in the FailureDumps, and observed a hang here: https://github.com/Lombiq/Open-Source-Orchard-Core-Extensions/actions/runs/9033609153. That log and the workflow output both show that the tests in `Lombiq.OSOCE.Tests.UI` finished (49 completed, as they should), but we still get a test failure:

```
2024-05-10T14:42:10.5686978Z Results File: C:\a\Open-Source-Orchard-Core-Extensions\Open-Source-Orchard-Core-Extensions\test\Lombiq.OSOCE.Tests.UI\TestResults\test-results.trx
2024-05-10T14:42:10.5687128Z 
2024-05-10T14:42:10.5688680Z Test Run Successful.
2024-05-10T14:42:10.5689829Z      Passed: 49
2024-05-10T14:42:10.5691093Z Total tests: 49
2024-05-10T14:42:10.5692141Z  Total time: 3.5414 Minutes
2024-05-10T14:42:10.5713126Z Test failed: test\Lombiq.OSOCE.Tests.UI\Lombiq.OSOCE.Tests.UI.csproj
2024-05-10T14:42:10.6026845Z ##[error]Process completed with exit code 1.
2024-05-10T14:42:10.6151953Z ##[group]Run Merge-FailureDumps -Directory "."
2024-05-10T14:42:10.6152088Z Merge-FailureDumps -Directory "."
2024-05-10T14:42:10.6183772Z shell: C:\Program Files\PowerShell\7\pwsh.EXE -command ". '{0}'"
2024-05-10T14:42:10.6183861Z env:
2024-05-10T14:42:10.6184153Z   CHECKOUT_TOKEN: ***
2024-05-10T14:42:10.6184278Z   DOTNET_SKIP_FIRST_TIME_EXPERIENCE: true
2024-05-10T14:42:10.6184386Z   DOTNET_CLI_TELEMETRY_OPTOUT: true
2024-05-10T14:42:10.6184494Z   DOTNET_ROOT: C:\Program Files\dotnet
2024-05-10T14:42:10.6184592Z   DOTNET_SDK_TO_USE: 8.0.204
2024-05-10T14:42:10.6184869Z   Lombiq_Tests_UI__OrchardCoreUITestExecutorConfiguration__MaxParallelTests: 0
2024-05-10T14:42:10.6184946Z ##[endgroup]
2024-05-10T14:42:10.8788979Z 
2024-05-10T14:42:10.8795785Z     Directory: C:\a\Open-Source-Orchard-Core-Extensions\Open-Source-Orchard-Core-Extensions
2024-05-10T14:42:10.8796453Z 
2024-05-10T14:42:10.8797334Z Mode                 LastWriteTime         Length Name
2024-05-10T14:42:10.8797934Z ----                 -------------         ------ ----
2024-05-10T14:42:10.8944776Z d----           5/10/2024  2:42 PM                FailureDumps
```

This is with `dotnet-test-process-timeout: 600000` and `blame-hang-timeout` being empty. It looks like some kind of premature task kill to me. Also, `Write-Output` in `StartProcessAndWaitForExit` won't surface anything, since in PS a function's output goes to the success pipeline (where the caller can capture it) instead of the console. I've fixed that with `| Out-Host`. Then something strange happened, as you can see under this build.
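The `Write-Output` vs. `Out-Host` difference is easy to reproduce in isolation: a function's pipeline output can be captured (and thus swallowed) by its caller, while `Out-Host` writes straight to the console host. A minimal sketch with hypothetical function names:

```powershell
function Get-StatusViaPipeline
{
    # Goes to the success output stream; the caller can capture or swallow it.
    Write-Output 'progress message'
}

function Get-StatusViaHost
{
    # Bypasses the pipeline and writes directly to the console host, so it
    # shows up even when the function's output is captured or redirected.
    'progress message' | Out-Host
}

# Capturing the pipeline output hides it from the console:
$captured = Get-StatusViaPipeline   # Nothing printed; $captured holds the text.
Get-StatusViaHost | Out-Null        # Still prints, because Out-Host ignores the capture.
```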

Worth noting that previously, I've seen different behaviors with actual hangs, where a test project was started but didn't finish. Not this time though.

Piedone commented 3 months ago

It was a PS syntax issue: https://github.com/Lombiq/GitHub-Actions/pull/346/files#diff-c9181b664fb498c4bf7694492ba4c67f5c9fe31a8b86be446ca2a80b0fdde3c6L206. After that and adjusting the timeouts, it seems OK, e.g.: https://github.com/Lombiq/Open-Source-Orchard-Core-Extensions/actions/runs/9055050994/attempts/1. Also, randomly the whole build can be horribly slow but still succeed, despite the test process being killed: https://github.com/Lombiq/Open-Source-Orchard-Core-Extensions/actions/runs/9055173471/attempts/1. So, this looks good now!

Piedone commented 3 months ago

I think the reason hangs became more frequent, also in general, is that we switched from GitRunners to GitHub Windows runners after the former shut down. And the performance of these varies wildly, with even the build taking between 5 and 10 minutes. So, if by chance we get a worse runner, then tests will be slower too.

dministro commented 3 months ago

> I think the reason hangs became more frequent, also in general, is that we switched from GitRunners to GitHub Windows runners after the former shut down. And the performance of these varies wildly, with even the build taking between 5 and 10 minutes. So, if by chance we get a worse runner, then tests will be slower too.

This is what I tried to point out here and here.

Piedone commented 3 months ago

OK then!