dotnet / runtime


[linux-arm64] Random and rare runtime crash System.ArgumentOutOfRangeException (System.Net.Sockets) #72365

Closed NQ-Brewir closed 1 year ago

NQ-Brewir commented 2 years ago

Description

Random and rare crashes with this exception:

Unhandled exception. System.ArgumentOutOfRangeException: Specified argument was out of the range of valid values. (Parameter 'state')
  at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.InvokeContinuation(Action`1 continuation, Object state, Boolean forceAsync, Boolean requiresExecutionContextFlow)
  at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.OnCompleted(SocketAsyncEventArgs _)
  at System.Net.Sockets.SocketAsyncEngine.System.Threading.IThreadPoolWorkItem.Execute()
  at System.Threading.ThreadPoolWorkQueue.Dispatch()
  at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()

It seems to happen only in applications under load.

Exit signal: Abort (6)

Reproduction Steps

We don't have a reproduction yet. We probably need to heavily stress the network. It seems to be a race condition.

Expected behavior

don't crash the runtime when we are using sockets...

Actual behavior

random and rare crashes of the runtime

Regression?

No response

Known Workarounds

No response

Configuration

Dotnet runtime version: 6.0.6
OS: GNU/Linux Debian 11 Bullseye
CPU: ARM64 Graviton 2 (AWS)
We are using Orleans with this application.

Other information

Follow-up of https://github.com/dotnet/runtime/issues/70486. We triple-checked all usages of ValueTask and then removed them all, to be sure this time that the crash is not caused by a ValueTask being awaited twice.
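For readers unfamiliar with the "ValueTask awaited twice" concern mentioned above, here is a minimal sketch of the supported consumption patterns. `ReceiveAsync` is a hypothetical stand-in for a socket operation that returns a `ValueTask<int>`; it is not code from the affected application.

```csharp
using System;
using System.Threading.Tasks;

static class ValueTaskUsage
{
    // Hypothetical stand-in for an I/O call that returns a ValueTask<int>.
    static async ValueTask<int> ReceiveAsync()
    {
        await Task.Yield();
        return 42;
    }

    static async Task Main()
    {
        // Supported: consume the ValueTask exactly once.
        int n = await ReceiveAsync();

        // Supported: convert to a Task first when the result is needed
        // more than once; a Task may be awaited repeatedly.
        Task<int> shared = ReceiveAsync().AsTask();
        int a = await shared;
        int b = await shared;

        // Not supported in general: awaiting the same ValueTask instance
        // twice, because the backing object may already have been reused.
        // ValueTask<int> vt = ReceiveAsync();
        // await vt;
        // await vt;

        Console.WriteLine($"{n} {a} {b}");
    }
}
```

Converting with `AsTask()` costs an allocation, so it is usually reserved for the cases where a result really has to be observed more than once.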

ghost commented 2 years ago

Tagging subscribers to this area: @dotnet/ncl See info in area-owners.md if you want to be subscribed.

Author: NQ-Brewir
Assignees: -
Labels: `area-System.Net.Sockets`, `untriaged`
Milestone: -

karelz commented 2 years ago

@NQ-Brewir are you working on getting a repro, or some more actionable information? In its current state, the bug is not actionable for us; the same arguments apply as in https://github.com/dotnet/runtime/issues/70486#issuecomment-1155443313. Moreover, you are the only customer hitting the problem so far.

I would recommend closing the issue until there is actionable information.

ghost commented 2 years ago

This issue has been marked needs-author-action and may be missing some important information.

ghost commented 2 years ago

This issue has been automatically marked no-recent-activity because it has not had any activity for 14 days. It will be closed if no further activity occurs within 14 more days. Any new comment (by anyone, not necessarily the author) will remove no-recent-activity.

paulquinn commented 2 years ago

I'm also getting the same error. It looks like it's also appearing here: https://github.com/aws/aws-lambda-dotnet/issues/1244.

My config:
Dotnet runtime version: v7.0.100-preview.7 (ARM64)
OS: macOS 12.5 (Monterey)
CPU: ARM64 Apple Silicon M1 Max

This happens (again, intermittently) when I'm running/debugging a few microservices (on Kestrel - was unsure if this was a Kestrel issue, but saw this reported here).

One additional piece of info is that in the framework method that throws the exception:

throw new ArgumentOutOfRangeException(GetArgumentName(argument));

The argument parameter is always (int) 40.

Just like the linked AWS issue, none of my exception handlers seem to be catching the error.

Any ideas on next steps? I can't seem to isolate the exception for a repro...

paulquinn commented 2 years ago

I decided to run exactly the same code in the same way on a Windows machine:

Dotnet runtime version: v7.0.100-preview.7 (x64)
OS: Windows 11 (22H2)
CPU: AMD Ryzen 9 3900X 12-Core Processor

...it's been running now for 24hr without error. I'll keep it running, but I'd normally get that ^ exception thrown within a couple of hours on ARM64/macOS - so more of a platform issue?

NQ-Brewir commented 2 years ago

The issue is still happening, but much less often since we removed all ValueTask usage from our codebase. We are still not able to create a clear repro, and the problem seems totally random. In any case, managed code should not crash like that for this kind of problem.

NQ-Brewir commented 2 years ago

Hello, I still have this problem, even though we removed every usage of ValueTask we had. We have no real way to find a clear repro of this issue, but it is quite problematic as our code is running in a production environment. Is it possible to add more info to the exception context to better track where this issue could come from? Regards

am11 commented 2 years ago

@NQ-Brewir, could you try catching the unhandled exception via the AppDomain event handler and dump the full exception object to the logger (with the inner exception)? Note that it can get too noisy and costly in the production environment, so you may want to filter which exception object to dump.

The call stack in the top post resembles the lower part of the exception @BrennanConroy logged here: https://github.com/aspnet/SignalR/pull/1703#issuecomment-377129969. I'm not sure if it is the same (mysterious) issue. If it is, then going by SignalR's call stack, the inner exception of the ArgumentOutOfRangeException seems to be an InvalidOperationException coming from ThrowMultipleContinuationsException() under high concurrency: https://github.com/dotnet/runtime/blob/3d74b00659fec817506e2888f87936518556e01c/src/libraries/System.Net.Sockets/src/System/Net/Sockets/Socket.Tasks.cs#L1266-L1283

Why it happens more frequently on Unix arm64 than on other platforms is unclear.
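A minimal sketch of the AppDomain handler suggested above, assuming a console-style entry point. The class name `CrashLogging` and the filter on `ArgumentOutOfRangeException` are illustrative choices, not taken from the thread; a real application would likely write to its own logger rather than standard error.

```csharp
using System;

static class CrashLogging
{
    static void Main()
    {
        // Register as early as possible so the handler is in place before
        // any worker thread can throw.
        AppDomain.CurrentDomain.UnhandledException += (sender, e) =>
        {
            // ExceptionObject is typed as object; filter to the exception
            // of interest to keep production logs manageable.
            if (e.ExceptionObject is ArgumentOutOfRangeException ex)
            {
                // Exception.ToString() includes the message, the stack trace
                // and any inner exceptions.
                Console.Error.WriteLine(
                    $"Unhandled (terminating={e.IsTerminating}): {ex}");
            }
        };

        // ... start the application here ...
    }
}
```

The handler only observes the exception; it does not prevent the process from terminating.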

BruceForstall commented 2 years ago

I saw what looks like this issue in a CI pipeline run on osx/arm64:

Unhandled exception. System.ArgumentOutOfRangeException: Specified argument was out of the range of valid values. (Parameter 'state')
   at System.Threading.ThreadPool.<>c.<.cctor>b__78_0(Object state)
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.InvokeContinuation(Action`1 continuation, Object state, Boolean forceAsync, Boolean requiresExecutionContextFlow)
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.OnCompleted(SocketAsyncEventArgs _)
   at System.Net.Sockets.SocketAsyncEngine.System.Threading.IThreadPoolWorkItem.Execute()
   at System.Threading.ThreadPoolWorkQueue.Dispatch()
   at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()
   at System.Threading.Thread.StartCallback()

https://dev.azure.com/dnceng-public/public/_build/results?buildId=42417&view=ms.vss-test-web.build-test-results-tab&runId=857504&resultId=111076&paneView=dotnet-dnceng.dnceng-build-release-tasks.helix-test-information-tab

note: it looks like a crash dump was created

antonfirsov commented 2 years ago

This can be a bug, we should investigate.

> note: it looks like a crash dump was created

Unfortunately, I see the following:

DumpFileToLarge: The dump /cores/core.84631 is 4671242240 bytes, which is larger than the supported 1610612736.0, so was not uploaded.

@dotnet/dnceng any chance the limit can be increased?

@paulquinn it's been a while, but any chance you can produce a dump for us?

MattGal commented 2 years ago

> https://github.com/orgs/dotnet/teams/dnceng any chance the limit can be increased?

Unfortunately this limitation comes about from our support of on-premises machines; these tend to cost us lots of time and money uploading large dumps which are often ignored despite this time/financial cost.

If you need to check out a machine with the same specifications as the test one, that can likely be arranged, we'd just need to know the specific queue that this work item ran on (or have its full log linked, etc)

antonfirsov commented 2 years ago

I just noticed I missed the start of the conversation and the fact that this is technically a duplicate of #70486. It might be worth keeping it open because of the number of reports we see.

wfurt commented 2 years ago

We really should compress the dumps @MattGal. They are often full of zeros and we can probably get a 10:1 gain.

MattGal commented 2 years ago

> We really should compress the dumps @MattGal. They are often full of zeros and we can probably get a 10:1 gain.

This was discussed in https://github.com/dotnet/dnceng/issues/1219, feel free to reopen it and make your case.

NQ-Brewir commented 1 year ago

@am11 I tried logging more info using the AppDomain event handler, but the exception does not seem to go through it.

NQ-Brewir commented 1 year ago

We had to migrate back to amd64 for other reasons, and the server is not crashing anymore. This issue is thus really due to ARM, and not to any wrongly used ValueTask.

stephentoub commented 1 year ago

> the server is not crashing anymore

Thanks for the update. I'll close this and we can reopen if it reoccurs and we're able to get more information for debugging.

karelz commented 1 year ago

Problem in .NET identified after all - duplicate of #84407

Fixed in 7.0.7 in PR #84641 and in 6.0.18 in PR #84432. Main (8.0) is not affected - see description in #84432.