mthalman opened this issue 1 week ago
Confirmed that I can still repro with this change reverted: https://github.com/dotnet/dotnet-docker/pull/5587.
I've also tried reverting https://github.com/dotnet/dotnet-docker/pull/5584 and the issue still repros.
@wfurt @sbomer
Could this be coincident with some other infrastructure change? The connect is handled by the kernel, so it seems like a network infrastructure problem to me.
Is this IP related to Docker in any way? It looks suspicious to me: `Connection refused (172.17.0.3:8080)`
The IP address in question is the IP address of the aspnet container being tested. In other words, the app container isn't responding.
I've attempted to isolate which set of tests was causing this: runtime, aspnet, sdk, etc. I ran jobs that tested each of those sets separately... and they all passed. 🤷
It could be timing as well, e.g. the server is still starting on a slow platform. I don't know if there is any synchronization for that.
It attempts to get a response from the container, retrying 5 times with a 2 second delay between retries. This test code hasn't changed in years. If it were a machine issue, I would expect to see it in other .NET versions and in the arm64 jobs (since arm32 and arm64 use the same machines). But it's very specific to 9.0 arm32.
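For context, the check is roughly equivalent to the following sketch (illustrative only; the class name, endpoint URL, and exact structure are placeholders, not the actual test code):

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

// Rough sketch of the verification loop described above: poll the app
// container's endpoint, retrying up to 5 times with a 2 second delay
// between attempts before letting the HttpRequestException surface.
class ContainerEndpointCheck
{
    static async Task Main()
    {
        using var client = new HttpClient();
        const int maxRetries = 5;

        for (int attempt = 1; ; attempt++)
        {
            try
            {
                HttpResponseMessage response = await client.GetAsync("http://172.17.0.3:8080/");
                Console.WriteLine($"Attempt {attempt}: {(int)response.StatusCode}");
                return;
            }
            catch (HttpRequestException) when (attempt < maxRetries)
            {
                await Task.Delay(TimeSpan.FromSeconds(2));
            }
        }
    }
}
```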
Confirming this same behavior on internal builds. Example build (internal link)
I think the `BlazorWasmScenario` test failure that I originally posted about isn't really the issue here. I can't remember how many times I saw it occur (maybe it was only once). But the real issue is the hang that occurs while running the tests, causing a timeout. That is very prevalent: probably > 50% of the jobs fail with a timeout.
I added some logging to the tests and was able to identify a test that hangs in two separate jobs: `SdkImageTests.VerifyPowerShellScenario_NonDefaultUser`. In both cases, that is the first attempt to run PowerShell in the job. Example build
There hasn't been a new drop of PowerShell for 9.0 in quite a while, though. This is the last one: https://github.com/dotnet/dotnet-docker/pull/5506. We would have seen this earlier if it were solely a PowerShell issue. My only guess is that it's related to the interaction between PowerShell and a new drop of .NET 9. @adaggarwal - are you aware of any behavior like this? To recap, it seems that execution of PowerShell sporadically hangs when running in an Arm32 Debian/Ubuntu container environment.
I just realized the reason this only occurs for Debian and Ubuntu and not Alpine or Azure Linux. For Alpine, we don't have PowerShell installed since PowerShell doesn't release binaries for Arm32 linux-musl (this is just further evidence that the issue is related to PowerShell). And for Azure Linux, we don't have any Arm32 images at all.
sudo docker pull mcr.microsoft.com/dotnet/nightly/sdk:9.0.100-preview.6-bookworm-slim-arm32v7
sudo docker run --rm -it mcr.microsoft.com/dotnet/nightly/sdk:9.0.100-preview.6-bookworm-slim-arm32v7
pwsh
(You've hit the repro if this hangs. If it doesn't hang and gets to the PowerShell prompt, run `exit` to close PowerShell and then run `pwsh` again. Continue until it hangs.)

I attempted to collect a dump using `dotnet-dump`, but the `collect` command would just hang. Instead, I used the `createdump` tool and collected both a minidump and a full dump. The minidump is included here: coredump-mini.zip. The full dump is too large to attach, so I can provide that offline if necessary.
Tagging subscribers to this area: @vitek-karas, @agocke, @vsadov. See info in area-owners.md if you want to be subscribed.
I've moved this to the runtime repo to investigate further.
Tagging subscribers to this area: @mangod9. See info in area-owners.md if you want to be subscribed.
Looks like the main thread is in `System.RuntimeType+BoxCache.GetBoxInfo`. I'm not set up to grab the native side right now, but it is a QCall into the runtime added in https://github.com/dotnet/runtime/pull/101137. @jkoritzinsky, would you be able to take a look?
> clrstack
OS Thread Id: 0x9 (0)
Child SP IP Call Site
FFE120F4 ffe12080 [InlinedCallFrame: ffe120f4] System.RuntimeType+BoxCache.GetBoxInfo(System.Runtime.CompilerServices.QCallTypeHandle, System.Object (Void*)*, Void**, Int32*, UInt32*)
FFE120F4 e310f340 [InlinedCallFrame: ffe120f4] System.RuntimeType+BoxCache.GetBoxInfo(System.Runtime.CompilerServices.QCallTypeHandle, System.Object (Void*)*, Void**, Int32*, UInt32*)
FFE120D8 E310F340 System.RuntimeType+BoxCache..ctor(System.RuntimeType) [/_/src/coreclr/System.Private.CoreLib/src/System/RuntimeType.BoxCache.cs @ 54]
FFE12158 E310F274 System.RuntimeType+BoxCache.Create(System.RuntimeType) [/_/src/coreclr/System.Private.CoreLib/src/System/RuntimeType.BoxCache.cs @ 17]
FFE12168 F64912A8 System.RuntimeType+IGenericCacheEntry`1[[System.__Canon, System.Private.CoreLib]].CreateAndCache(System.RuntimeType)
FFE121D8 EFFC37F6 System.Runtime.CompilerServices.RuntimeHelpers.Box(Byte ByRef, System.RuntimeTypeHandle) [/_/src/coreclr/System.Private.CoreLib/src/System/Runtime/CompilerServices/RuntimeHelpers.CoreCLR.cs @ 425]
FFE121F0 EFE0102C System.Enum.InternalBoxEnum(System.RuntimeTypeHandle, Int64)
FFE12208 EFDFE8E2 System.Enum.TryParse(System.Type, System.ReadOnlySpan`1<Char>, Boolean, Boolean, System.Object ByRef) [/_/src/libraries/System.Private.CoreLib/src/System/Enum.cs @ 772]
FFE12240 EFDFE71A System.Enum.Parse(System.Type, System.String, Boolean) [/_/src/libraries/System.Private.CoreLib/src/System/Enum.cs @ 585]
FFE12260 EFDFE692 System.Enum.Parse(System.Type, System.String) [/_/src/libraries/System.Private.CoreLib/src/System/Enum.cs @ 551]
FFE12268 E396D4F6 System.Management.Automation.LanguagePrimitives.ConvertStringToEnum(System.Object, System.Type, Boolean, System.Management.Automation.PSObject, System.IFormatProvider, System.Management.Automation.Runspaces.TypeTable)
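For reference, the managed portion of that path can be exercised with something as simple as the sketch below (based on the stack above; PowerShell reaches it through LanguagePrimitives.ConvertStringToEnum, and whether it actually hangs depends on hitting the underlying runtime issue on arm32). The enum and class names here are made up for illustration:

```csharp
using System;

// Minimal sketch of the managed call path from the stack trace above:
// the non-generic Enum.Parse boxes the parsed value, which goes through
// RuntimeHelpers.Box and, on first use for a given type, builds the
// RuntimeType BoxCache via the GetBoxInfo QCall.
class EnumParseBoxing
{
    enum SampleEnum { None, Some }   // hypothetical enum for illustration

    static void Main()
    {
        Type enumType = typeof(SampleEnum);

        // Enum.Parse(Type, string) -> TryParse -> InternalBoxEnum -> RuntimeHelpers.Box
        object boxed = Enum.Parse(enumType, "Some");
        Console.WriteLine(boxed);
    }
}
```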
I believe I actually just fixed this with https://github.com/dotnet/runtime/pull/104126. @mthalman can you test out with a nightly build or do you need to wait until we have a preview with that fix?
There have been random test jobs failing with one of two issues: a timeout ~or an `HttpRequestException`~. I've only seen this happening in the Noble and Bookworm jobs, not Alpine.

This first popped up with this PR: https://github.com/dotnet/dotnet-docker/pull/5587. But I can't imagine it was the cause. The most recent .NET change prior to that was this: https://github.com/dotnet/dotnet-docker/pull/5584.
I've only seen this in public builds, but there haven't been many internal builds yet to determine whether it's limited to public.
Example timeout build
Example `HttpRequestException` build: Microsoft.DotNet.Docker.Tests.SdkImageTests.VerifyBlazorWasmScenario