dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
14.53k stars 4.54k forks source link

.NET 9 Arm32 tests are randomly failing from PowerShell hang #104106

Open mthalman opened 1 week ago

mthalman commented 1 week ago

There have been random test jobs failing with either one of two issues: a timeout ~or a HttpRequestException~. I've only seen this happening in the Noble and Bookworm jobs, not Alpine.

This first popped up with this PR: https://github.com/dotnet/dotnet-docker/pull/5587. But I can't imagine it was the cause. The most recent .NET change prior to that was this: https://github.com/dotnet/dotnet-docker/pull/5584.

I've only seen this in public builds but there haven't been many internal builds yet to determine if it's only limited to public.

Example timeout build

[!NOTE] The rest of the information below is not relevant to this particular issue. Keeping it here because the conversation below references it. But this error is not a pervasive error like the timeout issue is.

Example HttpRequestException build

Microsoft.DotNet.Docker.Tests.SdkImageTests.VerifyBlazorWasmScenario

Executing: docker run --name 9.0-app-run-133634530996233443 -d -t -p 8080 9.0-publish_fx_dependent-133634530130912981 dotnet run --urls [http://*:8080](http://%2A:8080/)
 80fef330b2b77c6817b3479302abfbcbcfdf3eff5d94d6ff58798d5ed3ff452a
 Execution Elapsed Time: 00:00:01.4260173
 Executing: docker inspect -f "{{range .NetworkSettings.Networks}}{{.IPAddress}}{{end}}" 9.0-app-run-133634530996233443
 172.17.0.3
 Execution Elapsed Time: 00:00:00.0211357
 Request to http://172.17.0.3:8080/ failed - retrying: System.Net.Http.HttpRequestException: Connection refused (172.17.0.3:8080)
  ---> System.Net.Sockets.SocketException (111): Connection refused
    at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.CreateException(SocketError error, Boolean forAsyncThrow)
    at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.ConnectAsync(Socket socket)
    at System.Net.Sockets.Socket.ConnectAsync(EndPoint remoteEP, CancellationToken cancellationToken)
    at System.Net.Http.HttpConnectionPool.ConnectToTcpHostAsync(String host, Int32 port, HttpRequestMessage initialRequest, Boolean async, CancellationToken cancellationToken)
    at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine)
    at System.Net.Http.HttpConnectionPool.ConnectToTcpHostAsync(String host, Int32 port, HttpRequestMessage initialRequest, Boolean async, CancellationToken cancellationToken)
    at System.Net.Http.HttpConnectionPool.ConnectAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
    at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine)
    at System.Net.Http.HttpConnectionPool.ConnectAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
    at System.Net.Http.HttpConnectionPool.CreateHttp11ConnectionAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
    at System.Runtime.CompilerServices.AsyncMethodBuilderCore.Start[TStateMachine](TStateMachine& stateMachine)
    at System.Net.Http.HttpConnectionPool.CreateHttp11ConnectionAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
    at System.Net.Http.HttpConnectionPool.AddHttp11ConnectionAsync(QueueItem queueItem)
    at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.ExecutionContextCallback(Object s)
    at System.Threading.ExecutionContext.RunFromThreadPoolDispatchLoop(Thread threadPoolThread, ExecutionContext executionContext, ContextCallback callback, Object state)
    at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.MoveNext(Thread threadPoolThread)
    at System.Runtime.CompilerServices.AsyncTaskMethodBuilder`1.AsyncStateMachineBox`1.ExecuteFromThreadPool(Thread threadPoolThread)
    at System.Threading.ThreadPoolWorkQueue.Dispatch()
    at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()
 --- End of stack trace from previous location ---
    at System.Net.Http.HttpConnectionPool.ConnectToTcpHostAsync(String host, Int32 port, HttpRequestMessage initialRequest, Boolean async, CancellationToken cancellationToken)
    --- End of inner exception stack trace ---
    at System.Net.Http.HttpConnectionPool.ConnectToTcpHostAsync(String host, Int32 port, HttpRequestMessage initialRequest, Boolean async, CancellationToken cancellationToken)
    at System.Net.Http.HttpConnectionPool.ConnectAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
    at System.Net.Http.HttpConnectionPool.CreateHttp11ConnectionAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
    at System.Net.Http.HttpConnectionPool.AddHttp11ConnectionAsync(QueueItem queueItem)
    at System.Threading.Tasks.TaskCompletionSourceWithCancellation`1.WaitWithCancellationAsync(CancellationToken cancellationToken)
    at System.Net.Http.HttpConnectionPool.SendWithVersionDetectionAndRetryAsync(HttpRequestMessage request, Boolean async, Boolean doRequestAuth, CancellationToken cancellationToken)
    at System.Net.Http.RedirectHandler.SendAsync(HttpRequestMessage request, Boolean async, CancellationToken cancellationToken)
    at System.Net.Http.HttpClient.<SendAsync>g__Core|83_0(HttpRequestMessage request, HttpCompletionOption completionOption, CancellationTokenSource cts, Boolean disposeCts, CancellationTokenSource pendingRequestsCts, CancellationToken originalCancellationToken)
    at Microsoft.DotNet.Docker.Tests.WebScenario.GetHttpResponseFromContainerAsync(String containerName, DockerHelper dockerHelper, ITestOutputHelper outputHelper, Int32 containerPort, String pathAndQuery, Action`1 validateCallback, AuthenticationHeaderValue authorizationHeader) in /repo/dotnet-docker/tests/Microsoft.DotNet.Docker.Tests/TestScenarios/WebScenario.cs:line 96
mthalman commented 1 week ago

Confirmed that I can still repro with this changed reverted: https://github.com/dotnet/dotnet-docker/pull/5587.

mthalman commented 1 week ago

I've also tried reverting https://github.com/dotnet/dotnet-docker/pull/5584 and the issue still repros.

richlander commented 1 week ago

@wfurt @sbomer

wfurt commented 1 week ago

could this be coincident with some other infrastructure changes? The connect is handled by kernel so it seems like network infrastructure problem to me. is this IP related to Docker anyhow? Looks suspicious to me Connection refused (172.17.0.3:8080)

mthalman commented 1 week ago

The IP address in question is the IP address of the aspnet container being tested. In other words, the app container isn't responding.

mthalman commented 1 week ago

I've attempted to isolate which set of tests was causing this: runtime, aspnet, sdk, etc. I ran jobs that tested each of those sets separately... and they all passed. 🤷

wfurt commented 1 week ago

It could be timing as well e.g. the over is still starting on slow platform. I don't know if there is any synchronization for that.

mthalman commented 1 week ago

It attempts to get a response from the container, retrying 5 times with a 2 second delay for each retry. This test code hasn't changed in years. If it was a machine thing, I would expect to see it in other .NET versions and in the arm64 jobs (since arm32 and arm64 use the same machines). But it's very specific to 9.0 arm32.

mthalman commented 1 week ago

Confirming this same behavior on internal builds. Example build (internal link)

mthalman commented 6 days ago

I think the BlazorWasmScenario test failure that I originally posted about isn't really the issue here. I can't remember how many times I saw that it occur (maybe it was only once). But the real issue is the hang that occurs while running the tests, causing a timeout. That is very prevalent. Probably > 50% of the jobs fail with a timeout.

I added some logging to the tests and was able to identify a test that hangs in two separate jobs: SdkImageTests.VerifyPowerShellScenario_NonDefaultUser. In both cases, that is the first attempt to run PowerShell in the job. Example build

There hasn't been a new drop of PowerShell for 9.0 though in quite a while. This is the last one: https://github.com/dotnet/dotnet-docker/pull/5506. We would have seen this earlier if it was solely a PowerShell thing. My only guess is that it's related to the interaction between PowerShell and a new drop of .NET 9. @adaggarwal - are you aware of any behavior like this? To recap, it seems that execution of PowerShell sporadically hangs when running in an Arm32 Debian/Ubuntu container environment.

mthalman commented 6 days ago

I just realized the reason this only occurs for Debian and Ubuntu and not Alpine or Azure Linux. For Alpine, we don't have PowerShell installed since PowerShell doesn't release binaries for Arm32 linux-musl (this is just further evidence that the issue is related to PowerShell). And for Azure Linux, we don't have any Arm32 images at all.

mthalman commented 4 days ago

Repro Steps

  1. Get a Linux Arm64/Arm32 VM. I created one with Azure DevTest Lab which did the trick. (I could not get this to repro with Apple Silicon. It would run into a QEMU error and crash. That behavior was common across all .NET versions and not specific to this issue.)
  2. Install Docker if it's not already installed.
  3. sudo docker pull mcr.microsoft.com/dotnet/nightly/sdk:9.0.100-preview.6-bookworm-slim-arm32v7
  4. sudo docker run --rm -it mcr.microsoft.com/dotnet/nightly/sdk:9.0.100-preview.6-bookworm-slim-arm32v7
  5. You're now in the shell of the container.
  6. pwsh (You've hit the repro if this hangs. If it doesn't hang and gets to the PowerShell prompt, run exit to close PowerShell and then run pwsh again. Continue until it hangs.)
mthalman commented 4 days ago

I attempted to collect a dump using dotnet-dump but the collect command would just hang. Instead, I used the createdump tool and collected both a minidump and full dump. The minidump is included here: coredump-mini.zip. The full dump is too large to attach so I can provide that offline if necessary.

dotnet-policy-service[bot] commented 4 days ago

Tagging subscribers to this area: @vitek-karas, @agocke, @vsadov See info in area-owners.md if you want to be subscribed.

mthalman commented 4 days ago

I've moved this to the runtime repo to investigate further.

dotnet-policy-service[bot] commented 13 hours ago

Tagging subscribers to this area: @mangod9 See info in area-owners.md if you want to be subscribed.

elinor-fung commented 13 hours ago

Looks like the main thread is in System.RuntimeType+BoxCache.GetBoxInfo I'm not set up to grab the native side right now, but it is a QCall into the runtime added in https://github.com/dotnet/runtime/pull/101137. @jkoritzinsky would you be able to take a look?

> clrstack
OS Thread Id: 0x9 (0)
Child SP       IP Call Site
FFE120F4 ffe12080 [InlinedCallFrame: ffe120f4] System.RuntimeType+BoxCache.GetBoxInfo(System.Runtime.CompilerServices.QCallTypeHandle, System.Object (Void*)*, Void**, Int32*, UInt32*)
FFE120F4 e310f340 [InlinedCallFrame: ffe120f4] System.RuntimeType+BoxCache.GetBoxInfo(System.Runtime.CompilerServices.QCallTypeHandle, System.Object (Void*)*, Void**, Int32*, UInt32*)
FFE120D8 E310F340 System.RuntimeType+BoxCache..ctor(System.RuntimeType) [/_/src/coreclr/System.Private.CoreLib/src/System/RuntimeType.BoxCache.cs @ 54]
FFE12158 E310F274 System.RuntimeType+BoxCache.Create(System.RuntimeType) [/_/src/coreclr/System.Private.CoreLib/src/System/RuntimeType.BoxCache.cs @ 17]
FFE12168 F64912A8 System.RuntimeType+IGenericCacheEntry`1[[System.__Canon, System.Private.CoreLib]].CreateAndCache(System.RuntimeType)
FFE121D8 EFFC37F6 System.Runtime.CompilerServices.RuntimeHelpers.Box(Byte ByRef, System.RuntimeTypeHandle) [/_/src/coreclr/System.Private.CoreLib/src/System/Runtime/CompilerServices/RuntimeHelpers.CoreCLR.cs @ 425]
FFE121F0 EFE0102C System.Enum.InternalBoxEnum(System.RuntimeTypeHandle, Int64)
FFE12208 EFDFE8E2 System.Enum.TryParse(System.Type, System.ReadOnlySpan`1<Char>, Boolean, Boolean, System.Object ByRef) [/_/src/libraries/System.Private.CoreLib/src/System/Enum.cs @ 772]
FFE12240 EFDFE71A System.Enum.Parse(System.Type, System.String, Boolean) [/_/src/libraries/System.Private.CoreLib/src/System/Enum.cs @ 585]
FFE12260 EFDFE692 System.Enum.Parse(System.Type, System.String) [/_/src/libraries/System.Private.CoreLib/src/System/Enum.cs @ 551]
FFE12268 E396D4F6 System.Management.Automation.LanguagePrimitives.ConvertStringToEnum(System.Object, System.Type, Boolean, System.Management.Automation.PSObject, System.IFormatProvider, System.Management.Automation.Runspaces.TypeTable)
jkoritzinsky commented 13 hours ago

I believe I actually just fixed this with https://github.com/dotnet/runtime/pull/104126. @mthalman can you test out with a nightly build or do you need to wait until we have a preview with that fix?