dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License

[gcstress] HardwareIntrinsics timing out on osx-arm64 and linux-arm #78323

Open jakobbotsch opened 1 year ago

jakobbotsch commented 1 year ago

The hw intrinsics tests seem to time out on the aforementioned platforms after #74886. Test run: https://dev.azure.com/dnceng-public/public/_build/results?buildId=81893&view=ms.vss-test-web.build-test-results-tab&runId=1708824&resultId=220189&paneView=debug

Maybe the stripe count needs to be increased @davidwrighton?

dotnet-issue-labeler[bot] commented 1 year ago

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

ghost commented 1 year ago

Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch See info in area-owners.md if you want to be subscribed.

Issue Details
Author: jakobbotsch
Assignees: davidwrighton
Labels: `GCStress`, `area-CodeGen-coreclr`, `untriaged`
Milestone: -

davidwrighton commented 1 year ago

@jakobbotsch From looking at the logs, this appears to be caused by some sort of actual deadlock, not a test failure. In particular, the test makes good progress for a short period of time and then stops making progress. I don't have easy access to the appropriate hardware to test this, but I think this needs to be investigated as a product failure rather than handled by just increasing the parallelism of the testing.

One thing to be aware of is that in the past, before my change, most hardware intrinsic tests were never run under GCStress. (The tests would mostly be skipped during GCStress execution.)

BruceForstall commented 1 year ago

Another set of failures: https://dev.azure.com/dnceng-public/public/_build/results?buildId=94375&view=ms.vss-test-web.build-test-results-tab

Note that these are GCStress=3 failures, so they are generally due to the VM, not the JIT (or are timeouts due to GCStress=3 being very slow).

> One thing to be aware of is that in the past, before my change, most hardware intrinsic tests were never run under GCStress. (The tests would mostly be skipped during GCStress execution.)

Perhaps we should once again disable them under GCStress.

We currently have GCStressIncompatible. Perhaps we should also have GCStressIncompatible_3/GCStressIncompatible_C for more granularity. Actually, this should already be possible when using the xunit attributes. See https://github.com/dotnet/runtime/blob/main/docs/workflow/ci/disabling-tests.md and https://github.com/dotnet/arcade/blob/main/src/Microsoft.DotNet.XUnitExtensions/src/RuntimeTestModes.cs.
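For reference, a minimal sketch of how the existing `GCStressIncompatible` marker is typically set in a test project file. This is an illustrative fragment, not taken from the hardware intrinsics projects themselves, and it assumes the property is declared in the test's own `.csproj`. The finer-grained `GCStressIncompatible_3`/`GCStressIncompatible_C` variants proposed above are not shown, since nothing in this thread confirms they exist.

```xml
<!-- Illustrative fragment of a test .csproj (assumed placement). -->
<PropertyGroup>
  <!-- Marks the whole test project as incompatible with GCStress,
       so it is expected to be skipped in GCStress runs. -->
  <GCStressIncompatible>true</GCStressIncompatible>
</PropertyGroup>
```

Because this property applies to the entire project, it cannot distinguish GCStress=3 from GCStress=C; that per-mode granularity is what the xunit attributes mentioned above are meant to provide.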

jakobbotsch commented 1 year ago

I don't really see what David described above. From the log, good progress is indeed being made, but we hit the 4-hour Helix timeout specified here, after which the Helix job gets killed. I will disable it under GCStress again.