Closed NQ-Brewir closed 1 year ago
Tagging subscribers to this area: @dotnet/ncl See info in area-owners.md if you want to be subscribed.
| Author: | NQ-Brewir |
|---|---|
| Assignees: | - |
| Labels: | `area-System.Net.Sockets`, `untriaged` |
| Milestone: | - |
@NQ-Brewir are you working on getting a repro, or some more actionable information? In its current state, the bug is not actionable for us -- the same arguments apply as in https://github.com/dotnet/runtime/issues/70486#issuecomment-1155443313 Moreover, you are the only customer hitting the problem so far.
I would recommend closing the issue until there is actionable info.
This issue has been marked `needs-author-action` and may be missing some important information.
This issue has been automatically marked `no-recent-activity` because it has not had any activity for 14 days. It will be closed if no further activity occurs within 14 more days. Any new comment (by anyone, not necessarily the author) will remove `no-recent-activity`.
I'm also getting the same error. It looks like it's also appearing here: https://github.com/aws/aws-lambda-dotnet/issues/1244.
My config:
Dotnet runtime version: v7.0.100-preview.7 (ARM64)
OS: macOS 12.5 (Monterey)
CPU: ARM64 Apple Silicon M1 Max
This happens (again, intermittently) when I'm running/debugging a few microservices (on Kestrel - was unsure if this was a Kestrel issue, but saw this reported here).
One additional piece of info: in the framework method that throws the exception,

```csharp
throw new ArgumentOutOfRangeException(GetArgumentName(argument));
```

the `argument` parameter is always `(int)40`.
Just like the linked AWS issue, none of my exception handlers seem to be catching the error.
Any ideas on next steps? I can't seem to isolate the exception for a repro...
I decided to run exactly the same code in the same way on a Windows machine:
Dotnet runtime version: v7.0.100-preview.7 (x64)
OS: Windows 11 (22H2)
CPU: AMD Ryzen 9 3900X 12-Core Processor
...it's been running now for 24hr without error. I'll keep it running, but I'd normally get that ^ exception thrown within a couple of hours on ARM64/macOS - so more of a platform issue?
The issue is still happening, but far less often since we removed all ValueTask usage from our codebase. We are still not able to create a clear repro, and the problem seems totally random anyway. Managed code should not crash like that for this kind of problem.
Hello, I still have this problem, even though we removed all the ValueTask usages we had. We have no real way to find a clear repro of this issue, but it is quite problematic as our code is running in a production environment. Is it possible to add more info to the context to better track where this issue could come from? Regards
@NQ-Brewir, could you try catching the unhandled exception via the AppDomain event handler and dump the full exception object to the logger (with the inner exception)? Note that it can get too noisy and costly in the production environment, so you may want to filter which exception object to dump.
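A minimal sketch of such a handler, assuming console logging (the filter on `ArgumentOutOfRangeException` is just one way to limit noise in production; adapt it to your own logger):

```csharp
using System;

static class CrashLogger
{
    public static void Register()
    {
        // Fires for exceptions that are about to take down the process.
        AppDomain.CurrentDomain.UnhandledException += (sender, e) =>
        {
            // Only dump the exception type being chased, to keep the log volume down.
            if (e.ExceptionObject is ArgumentOutOfRangeException ex)
            {
                // ToString() includes the message, the full inner exception chain,
                // and the stack trace.
                Console.Error.WriteLine($"[unhandled] IsTerminating={e.IsTerminating}");
                Console.Error.WriteLine(ex.ToString());
            }
        };
    }
}
```

Call `CrashLogger.Register()` once at startup, before any socket work begins.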
The call stack in the top post resembles the lower part of the exception @BrennanConroy logged here: https://github.com/aspnet/SignalR/pull/1703#issuecomment-377129969. I'm not sure if it is the same (mysterious) issue. If it is, then going by SignalR's call stack, the inner exception of ArgumentOutOfRangeException seems to be an InvalidOperationException coming from ThrowMultipleContinuationsException() under high concurrency: https://github.com/dotnet/runtime/blob/3d74b00659fec817506e2888f87936518556e01c/src/libraries/System.Net.Sockets/src/System/Net/Sockets/Socket.Tasks.cs#L1266-L1283 Why it happens more frequently on unix arm64 than on other platforms is unclear.
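For background on that guard: an `IValueTaskSource`-backed `ValueTask` supports only a single continuation, and registering a second one while the operation is still pending throws `InvalidOperationException`. A minimal illustration using `ManualResetValueTaskSourceCore` (this is not the Socket internals themselves, and `SingleAwaiterDemo` is a made-up name for the sketch):

```csharp
using System;
using System.Threading.Tasks;
using System.Threading.Tasks.Sources;

class SingleAwaiterDemo : IValueTaskSource<int>
{
    private ManualResetValueTaskSourceCore<int> _core;

    public ValueTask<int> WaitAsync() => new ValueTask<int>(this, _core.Version);

    public int GetResult(short token) => _core.GetResult(token);
    public ValueTaskSourceStatus GetStatus(short token) => _core.GetStatus(token);
    public void OnCompleted(Action<object?> continuation, object? state, short token,
                            ValueTaskSourceOnCompletedFlags flags)
        => _core.OnCompleted(continuation, state, token, flags);

    static void Main()
    {
        var source = new SingleAwaiterDemo();
        ValueTask<int> pending = source.WaitAsync();

        // The first continuation registers fine, as a normal await would.
        pending.GetAwaiter().OnCompleted(() => { });
        try
        {
            // A second continuation on the same pending source is rejected.
            pending.GetAwaiter().OnCompleted(() => { });
        }
        catch (InvalidOperationException)
        {
            Console.WriteLine("second continuation rejected");
        }
    }
}
```

The Socket path hits the same invariant, but there the second registration appears to happen through a race rather than an obvious double-await.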
I saw what looks like this issue in a CI pipeline run on osx/arm64:
```
Unhandled exception. System.ArgumentOutOfRangeException: Specified argument was out of the range of valid values. (Parameter 'state')
   at System.Threading.ThreadPool.<>c.<.cctor>b__78_0(Object state)
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.InvokeContinuation(Action`1 continuation, Object state, Boolean forceAsync, Boolean requiresExecutionContextFlow)
   at System.Net.Sockets.Socket.AwaitableSocketAsyncEventArgs.OnCompleted(SocketAsyncEventArgs _)
   at System.Net.Sockets.SocketAsyncEngine.System.Threading.IThreadPoolWorkItem.Execute()
   at System.Threading.ThreadPoolWorkQueue.Dispatch()
   at System.Threading.PortableThreadPool.WorkerThread.WorkerThreadStart()
   at System.Threading.Thread.StartCallback()
```
note: it looks like a crash dump was created
This could be a bug; we should investigate.
> note: it looks like a crash dump was created
Unfortunately, I see the following:
```
DumpFileToLarge: The dump /cores/core.84631 is 4671242240 bytes, which is larger than the supported 1610612736.0, so was not uploaded.
```
@dotnet/dnceng any chance the limit can be increased?
@paulquinn it's been a while, but any chance you can produce a dump for us?
> https://github.com/orgs/dotnet/teams/dnceng any chance the limit can be increased?
Unfortunately this limitation comes about from our support of on-premises machines; these tend to cost us lots of time and money uploading large dumps which are often ignored despite this time/financial cost.
If you need to check out a machine with the same specifications as the test one, that can likely be arranged, we'd just need to know the specific queue that this work item ran on (or have its full log linked, etc)
I just noticed I missed the start of the conversation and the fact that this is technically a duplicate of #70486. It might be worth keeping it open because of the number of reports we see.
We really should compress the dumps @MattGal. They are often full of zeros and we can probably get a 10:1 gain.
> We really should compress the dumps @MattGal. They are often full of zeros and we can probably get a 10:1 gain.
This was discussed in https://github.com/dotnet/dnceng/issues/1219, feel free to reopen it and make your case.
@am11 I tried logging more info using the AppDomain event handler, but the exception does not seem to go through it.
We had to migrate back to amd64 due to some other reasons, and the server is not crashing anymore. This issue is thus really ARM-specific, and not due to any misused ValueTask.
> the server is not crashing anymore
Thanks for the update. I'll close this and we can reopen if it reoccurs and we're able to get more information for debugging.
Problem in .NET identified after all - duplicate of #84407
Fixed in 7.0.7 in PR #84641 and in 6.0.18 in PR #84432. Main (8.0) is not affected - see description in #84432.
Description
Random and rare crashes with this exception:
It seems to happen only on loaded applications.
Exit signal: Abort (6)
Reproduction Steps
We don't have any reproduction yet. We probably need to heavily stress the network. It seems to be a race condition.
Expected behavior
Don't crash the runtime when we are using sockets.
Actual behavior
Random and rare crashes of the runtime.
Regression?
No response
Known Workarounds
No response
Configuration
Dotnet runtime version: 6.0.6
OS: GNU/Linux Debian 11 Bullseye
CPU: ARM64 Graviton 2 (AWS)

We are using Orleans with this application.
Other information
Follow-up of https://github.com/dotnet/runtime/issues/70486. We triple-checked all usages of ValueTask and removed them all, just to be sure that this time it is not some ValueTask awaited twice.