In certain cases UI test hangs indefinitely instead of fails (OSOE-880)

sarahelsaig commented 1 month ago

I first experienced this on OCC during the OC 2.0 update, so I assumed it was somehow related to OC 2.0. But now it has happened with TDEAL as well. (I still don't know what made OCC hang; for TDEAL, I know it's just the same visual verification Chrome footer thing we had on OSOCE and elsewhere.)

Piedone commented 1 month ago

Did you try configuring dotnet-test-process-timeout so the timeout is handled for the test process instead of the whole workflow? (Or even the per-test timeouts.)

sarahelsaig commented 1 month ago

> Related/same? https://github.com/Lombiq/UI-Testing-Toolbox/issues/228 (and https://github.com/Lombiq/Open-Source-Orchard-Core-Extensions/issues/736).

The examples I mentioned happened on Ubuntu and restarting did not fix them, so I don't think #228 is related.

> Did you try configuring dotnet-test-process-timeout so the timeout is handled for the test process instead of the whole workflow? (Or even the per-test timeouts.)

Thanks, I will try that.

Piedone commented 1 month ago

Despite the title of that issue, this has happened many times under Ubuntu too (though at first it seemed to be Windows-only). But yeah, that issue is about random hangs, not consistent ones. That looks more like an app-specific issue.

sarahelsaig commented 1 month ago

Adding dotnet-test-process-timeout magically fixed the problem in OCC (https://github.com/OrchardCMS/OrchardCore.Commerce/pull/454). What does that mean?

Piedone commented 1 month ago

This can fix the issue if control gets to this line:

https://github.com/Lombiq/GitHub-Actions/blob/813f2ed0586dce428250d502377e34c17884f2b7/.github/actions/test-dotnet/Invoke-SolutionTests.ps1#L215

Because with this, if the tests complete but the process hangs, the run can still succeed. However, the telltale message is not in the output, so this didn't run, and the previous run you linked wasn't hanging after the test run had completed (i.e. after all tests had produced their outputs) but somewhere before that.
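To make the idea concrete, here is a minimal Python sketch of a process-level timeout. It is not the logic of Invoke-SolutionTests.ps1, only an illustration under assumptions: the `COMPLETION_MARKER` string and the `run_with_process_timeout` helper are hypothetical names, and a real runner would look for whatever summary its test tool actually prints.

```python
import subprocess
import sys

# Hypothetical marker: stands in for whatever line the test tool prints
# once every test has reported its result.
COMPLETION_MARKER = "Test run finished"


def run_with_process_timeout(command, timeout_seconds):
    """Run a command and return (succeeded, output).

    If the process exits on its own, success follows the exit code.
    If it is still running after the timeout, it is killed, and the run
    counts as successful only if the completion marker was already printed.
    """
    process = subprocess.Popen(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )
    try:
        output, _ = process.communicate(timeout=timeout_seconds)
        return process.returncode == 0, output
    except subprocess.TimeoutExpired:
        # The process hung: stop it, then collect what it printed so far.
        process.kill()
        output, _ = process.communicate()
        return COMPLETION_MARKER in output, output
```

A hang after the marker was printed is treated as a pass, while a hang before it is treated as a failure whose partial output is still available for inspection. For example, `run_with_process_timeout([sys.executable, "-c", "import time; time.sleep(60)"], 2)` returns a failed result after about two seconds instead of blocking for a minute.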

BTW there are a huge number of exceptions in the workflow output, I suggest checking these out, e.g.:

2024-07-10T00:14:22.5934506Z  2024-07-10 00:13:55.5118|Default|00-0b8735f889ef785eacdbe7443da03d71-387064e610ec3783-00||Microsoft.AspNetCore.Diagnostics.DeveloperExceptionPageMiddleware|WARN|The response has already started, the error page middleware will not be executed. 
2024-07-10T00:14:22.5939726Z  2024-07-10 00:13:55.5118|Default|00-8b46da17ed6f9ef1aabca0f86ea1be6c-c7c3e103a1df7f90-00||Microsoft.AspNetCore.Diagnostics.DeveloperExceptionPageMiddleware|ERROR|An unhandled exception has occurred while executing the request. System.InvalidOperationException: Two concurrent threads have been detected accessing the same ISession instance from: 
...

And a lot more, even Shouldly ones, so this really should've failed (even though it completed).
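For sifting through output like this, a small Python sketch along these lines may help. The field layout (timestamps, `Default`, trace ID, an empty field, logger name, level, message, separated by `|`) is an assumption inferred from the two sample lines above, and `parse_log_line` and `count_by_level` are hypothetical helper names.

```python
from collections import Counter


def parse_log_line(line):
    """Split one pipe-separated log line into logger, level and message.

    Returns None for lines that do not have the assumed seven fields,
    such as stack trace continuation lines.
    """
    # Limit the split so that pipes inside the message are preserved.
    parts = line.strip().split("|", 6)
    if len(parts) < 7:
        return None
    return {"logger": parts[4], "level": parts[5], "message": parts[6].strip()}


def count_by_level(lines):
    """Count parsed log entries per level (e.g. WARN, ERROR)."""
    counts = Counter()
    for line in lines:
        entry = parse_log_line(line)
        if entry is not None:
            counts[entry["level"]] += 1
    return counts
```

Feeding the downloaded workflow log through `count_by_level` gives a quick view of how many `ERROR` entries a run produced, and filtering on `entry["level"] == "ERROR"` lists the messages worth checking first.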

sarahelsaig commented 1 month ago

> Don't question it.

I would never. All praises to the Omnissiah!!

> BTW there are a huge number of exceptions in the workflow output, I suggest checking these out, e.g.:

Yes, that's why I said that the test hangs instead of failing. (BTW on TDEAL it only hangs with the dev build that uses the standard runner; in PRs, if a test fails, the run correctly stops.) This dotnet-test-process-timeout is really useful, because now I can see the errors (unlike previously) and address them.

Piedone commented 1 month ago

I see. Perhaps this is not actually a hang, then? Rather, it retries a lot of tests, which is slow, and it just times out? With 6 hours in TDEAL that would be extreme, but not impossible (on the slow 2-core HDD default runner; OCC, like all public repos, uses 4-core SSD runners by default).

sarahelsaig commented 1 month ago

I'm 100% certain it's actually a hang. With TDEAL I know that only the visual verification test failed. Just one test. The same run on buildjet only took 12.5 minutes. So if it took up to an hour with the standard runner I could stomach that, but 6 hours is not possible.

Piedone commented 1 month ago

Then I guess it can hang due to some threads deadlocking with just the two cores. We've seen issues like this before, and an ASP.NET Core sync-over-async issue, still unfixed (linked somewhere in the other issue I linked), can cause this.

Piedone commented 1 month ago

Nothing else to do here then, though?

Piedone commented 1 month ago

As part of NEST-501 I also experienced this: the DotNest UI tests didn't produce any output here and the workflow timed out after an hour. After adding dotnet-test-process-timeout: 600000 I could see the actual failing test.

sarahelsaig commented 1 month ago

So you also had problems with security scanning. In OCC as well, after filtering out the expected error testing, all other error logs were from the full security scan saying "InvalidOperationException System.InvalidOperationException: Two concurrent threads have been detected accessing the same ISession instance". I've removed the dotnet-test-process-timeout and temporarily disabled the test, and now the run passed (with no |ERROR| in the log) in 8 minutes. I think the security test is accidentally stress testing the runner or YesSql's thread safety by starting many requests (nearly) concurrently.

Piedone commented 1 month ago

It's expected for the security scan to start concurrent requests, though the goal is not stress testing and I believe the rate can be adjusted if necessary. However, concurrent requests mustn't result in such a YesSql exception: that only happens if two threads use the same ISession, not if they simply access the DB independently. This is something to avoid. Each request uses its own ISession, so concurrent requests in themselves shouldn't cause this (unless some singleton service keeps using the same ISession across multiple requests, which should be avoided too).
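The difference between per-request and shared sessions can be illustrated with a toy Python model. This is only a sketch: the `Session` class below is a hypothetical stand-in that detects overlapping use, not YesSql's ISession, and `run_two_overlapping_requests` is an illustrative helper that forces two "requests" to overlap deterministically.

```python
import threading


class Session:
    """Toy stand-in for a session that must not be used by two threads at once."""

    def __init__(self):
        self._guard = threading.Lock()
        self._busy = False

    def __enter__(self):
        with self._guard:
            if self._busy:
                raise RuntimeError("Two concurrent threads detected on the same session")
            self._busy = True
        return self

    def __exit__(self, *exc_info):
        with self._guard:
            self._busy = False
        return False


def run_two_overlapping_requests(session_factory):
    """Run two overlapping 'requests' and return the errors they raised."""
    errors = []
    first_inside = threading.Event()
    second_done = threading.Event()

    def first_request():
        with session_factory():
            first_inside.set()
            # Stay inside the session until the second request has finished.
            second_done.wait(timeout=5)

    def second_request():
        # Start only once the first request is already using its session.
        first_inside.wait(timeout=5)
        try:
            with session_factory():
                pass
        except RuntimeError as error:
            errors.append(str(error))
        finally:
            second_done.set()

    threads = [threading.Thread(target=first_request), threading.Thread(target=second_request)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return errors
```

Calling `run_two_overlapping_requests(Session)` gives each request its own session and returns no errors, while `shared = Session(); run_two_overlapping_requests(lambda: shared)` mimics a singleton holding on to one session and returns one error. Concurrency alone is harmless in this model; the shared instance is what triggers the failure.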

Piedone commented 1 month ago

Under NEST-501 I discovered that security scanning maxing out the CPU (and maybe also RAM) of the runner can cause dotnet test to hang. You can try https://github.com/Lombiq/GitHub-Actions/pull/370 to see CPU and RAM metrics of runs and try to correlate that with issues you encounter (as a run with an oversaturated CPU can do all kinds of funny things).

I still don't think there's a general issue here, though maybe some documentation is needed.

Piedone commented 1 month ago

And if there is a general issue, we should get back to https://github.com/Lombiq/UI-Testing-Toolbox/issues/228.

Piedone commented 3 weeks ago

So, closing, since there's no new general issue.

Lombiq / UI-Testing-Toolbox

In certain cases UI test hangs indefinitely instead of fails (OSOE-880) #387