Closed noahfalk closed 2 years ago
Tagging subscribers to this area: @JulieLeeMSFT, @jakobbotsch See info in area-owners.md if you want to be subscribed.
Author: | noahfalk |
---|---|
Assignees: | - |
Labels: | `area-CodeGen-coreclr` |
Milestone: | - |
Judging by the stacktrace and the job itself it's mono-interp
Tagging subscribers to this area: @brzvlad See info in area-owners.md if you want to be subscribed.
Author: | noahfalk |
---|---|
Assignees: | - |
Labels: | `untriaged`, `area-Codegen-Interpreter-mono` |
Milestone: | - |
I pasted stacks here. It appears that the mail code, or underlying networking code, is attempting to use a file descriptor of -1, which I assume is invalid. https://github.com/dotnet/runtime/issues/72818#issuecomment-1194893236
Tagging subscribers to this area: @dotnet/ncl See info in area-owners.md if you want to be subscribed.
Author: | noahfalk |
---|---|
Assignees: | - |
Labels: | `area-System.Net`, `blocking-clean-ci`, `untriaged` |
Milestone: | - |
cc: @tmds
seems like we do closing magic on invalid socket ... (stack bellow is copy from @danmoseley post)
top of stack looks like
#10 0x00007f4c7b4ab5cd in ToFileDescriptor (fd=-1) at /__w/1/s/src/native/libs/Common/pal_utilities.h:86
#11 0x00007f4c7b4abf95 in SystemNative_FcntlGetFD (fd=-1) at /__w/1/s/src/native/libs/System.Native/pal_io.c:611
...
at <unknown> <0xffffffff>
at Fcntl:<GetFD>g____PInvoke|5_0 <0x00020>
at Fcntl:GetFD <0x00020>
at System.Net.Sockets.SafeSocketHandle:TryUnblockSocket <0x0003a>
at System.Net.Sockets.SafeSocketHandle:CloseAsIs <0x000f4>
at System.Net.Sockets.Socket:Dispose <0x00426>
at System.Net.Sockets.Socket:Dispose <0x000a4>
at System.Net.Sockets.Socket:Close <0x00098>
at System.Net.Sockets.TcpClient:Dispose <0x0011c>
at System.Net.Sockets.TcpClient:Dispose <0x0001a>
at System.Net.Mail.SmtpConnection:ShutdownConnection <0x00184>
at System.Net.Mail.SmtpConnection:Abort <0x00012>
at System.Net.Mail.SmtpTransport:Abort <0x00078>
at System.Net.Mail.SmtpClient:Abort <0x0001c>
at System.Net.Mail.SmtpClient:SendAsyncCancel <0x00088>
at <>c:<SendMailAsync>b__84_1 <0x0001c>
at System.Threading.CancellationTokenSource:Invoke <0x00042>
or
#9 0x00007fbc510f8102 in __GI___assert_fail (assertion=0x7fbc4dcdc8a6 "0 <= fd && fd < sysconf(_SC_OPEN_MAX)", file=0x7fbc4dcddbb1 "/__w/1/s/src/native/libs/Common/pal_utilities.h", line=86, function=0x7fbc4dcdd527 "int ToFileDescriptor(intptr_t)") at assert.c:101
#10 0x00007fbc4dceb7cd in ToFileDescriptor (fd=-1) at /__w/1/s/src/native/libs/Common/pal_utilities.h:86
#11 0x00007fbc4dcebd8e in SystemNative_SetLingerOption (socket=-1, option=0x7fbc45d84d18) at /__w/1/s/src/native/libs/System.Native/pal_networking.c:1278
...
at <unknown> <0xffffffff>
at Sys:<SetLingerOption>g____PInvoke|34_0 <0x00024>
at Sys:SetLingerOption <0x00068>
at System.Net.Sockets.SocketPal:SetLingerOption <0x00092>
at System.Net.Sockets.Socket:SetLingerOption <0x00022>
at System.Net.Sockets.Socket:SetSocketOption <0x00248>
at System.Net.Sockets.Socket:set_LingerState <0x00024>
at System.Net.Sockets.TcpClient:set_LingerState <0x00022>
at System.Net.Mail.SmtpConnection:ShutdownConnection <0x000f0>
at System.Net.Mail.SmtpConnection:Abort <0x00012>
at System.Net.Mail.SmtpTransport:Abort <0x00078>
at System.Net.Mail.SmtpClient:Abort <0x0001c>
at System.Net.Mail.SmtpClient:SendAsyncCancel <0x00088>
at <>c:<SendMailAsync>b__84_1 <0x0001c>
at System.Threading.CancellationTokenSource:Invoke <0x00042>
Tagging subscribers to this area: @dotnet/ncl See info in area-owners.md if you want to be subscribed.
Author: | noahfalk |
---|---|
Assignees: | - |
Labels: | `area-System.Net.Sockets`, `blocking-clean-ci`, `untriaged` |
Milestone: | - |
@rzikm can you please check how often it happens? Thanks!
Very often, 98 hits in the last 14 days. Curiously, none of these are on main
Aside from some authentication changes, #70046 would be biggest suspect. It may not be root cause as the assert is in Sockets. I tried to reproduce it (main on Linux) but I did not get hit. We can probably look at some of the Linux/Windows core files to see what particular tests are running.
If it is happening that often perhaps we should disable the test for now to avoid noise in CI ... @rzikm thoughts?
I think it was the same test we disabled in #73452 (SendMailAsync_CanBeCanceled_CancellationToken)
Update: We are re-enabling the test in main via PR #74545 with changes that will hopefully give us more info on the root cause.
BadExit may be misleading in queries...
This is not event PR but main
----- start Thu Aug 25 12:40:46 UTC 2022 =============== To repro directly: =====================================================
pushd .
/root/helix/work/correlation/dotnet exec --runtimeconfig System.Net.Mail.Functional.Tests.runtimeconfig.json --depsfile System.Net.Mail.Functional.Tests.deps.json xunit.console.dll System.Net.Mail.Functional.Tests.dll -xml testResults.xml -nologo -nocolor -trait category=OuterLoop -notrait category=IgnoreForCI -notrait category=failing
popd
===========================================================================================================
/root/helix/work/workitem/e /root/helix/work/workitem/e
./RunTests.sh: line 168: /root/helix/work/correlation/dotnet: No such file or directory
/root/helix/work/workitem/e
@wfurt I found out that you need to specifically filter for 134 and 139 exit codes to get true process crashes
ok. I updated my query. My original one was based on the runfo
issues. So far so good.
Not blocking CI as the test was disabled in #73452
I font understand the comment @karelz. The suspected test was enabled again in https://github.com/dotnet/runtime/pull/74545 I did not see occurrences since. (still may come)
Sure, but right now it is not blocking CI
Status: After re-enabling the test, we got some hits on PRs. @rzikm has actionable dump link.
(lldb) clrstack -a
OS Thread Id: 0xfe8 (1)
Child SP IP Call Site
00007F1D4AAD3008 00007f5e86cf9387 [InlinedCallFrame: 00007f1d4aad3008] Interop+Sys.<SetLingerOption>g____PInvoke|34_0(IntPtr, LingerOption*)
00007F1D4AAD3008 00007f5e08a331ed [InlinedCallFrame: 00007f1d4aad3008] Interop+Sys.<SetLingerOption>g____PInvoke|34_0(IntPtr, LingerOption*)
00007F1D4AAD3000 00007F5E08A331ED ILStubClass.IL_STUB_PInvoke(IntPtr, LingerOption*)
PARAMETERS:
<no data>
<no data>
00007F1D4AAD30D0 00007F5E08D61E09 Interop+Sys.SetLingerOption(System.Runtime.InteropServices.SafeHandle, LingerOption*) [/_/src/libraries/System.Net.Sockets/src/Microsoft.Interop.LibraryImportGenerator/Microsoft.Interop.LibraryImportGenerator/LibraryImports.g.cs @ 772]
PARAMETERS:
socket (0x00007F1D4AAD3118) = 0x00007f1df583ad88
option (0x00007F1D4AAD3110) = 0x00007f1d4aad3188
LOCALS:
0x00007F1D4AAD3108 = 0xffffffffffffffff
0x00007F1D4AAD3104 = 0xffffffff00007f5e
0x00007F1D4AAD30F8 = 0x0000000000000001
0x00007F1D4AAD30F4 = 0x0000000000000000
0x00007F1D4AAD30F0 = 0x0000000000000000
00007F1D4AAD3130 00007F5E08D61CF0 System.Net.Sockets.SocketPal.SetLingerOption(System.Net.Sockets.SafeSocketHandle, System.Net.Sockets.LingerOption) [/_/src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketPal.Unix.cs @ 1534]
PARAMETERS:
handle (0x00007F1D4AAD3198) = 0x00007f1df583ad88
optionValue (0x00007F1D4AAD3190) = 0x00007f1df584a790
LOCALS:
0x00007F1D4AAD3188 = 0x0000000000000001
0x00007F1D4AAD3184 = 0x0000000100000000
0x00007F1D4AAD3178 = 0x0000000000000001
0x00007F1D4AAD3174 = 0x0000000100007f1d
when I dump the SafeHandle, it has reasonable value...
=============================================================================
(lldb) dumpobj 0x7f1df583ad88
Name: System.Net.Sockets.SafeSocketHandle
MethodTable: 00007f5e08a797f8
EEClass: 00007f5e08a82d00
Size: 80(0x50) bytes
File: /mnt/work/B1FC09CD/p/shared/Microsoft.NETCore.App/8.0.0/System.Net.Sockets.dll
Fields:
MT Field Offset Type VT Attr Value Name
00007f5e07bc9a80 4001127 8 System.IntPtr 1 instance 00000000000000A3 handle
00007f5e07b3f0f0 4001128 10 System.Int32 1 instance 8 _state
00007f5e07b3bbf0 4001129 14 System.Boolean 1 instance 1 _ownsHandle
00007f5e07b3bbf0 400112a 15 System.Boolean 1 instance 1 _fullyInitialized
00007f5e08a78c20 4000106 20 System.Int32 1 instance -559038737 _closeSocketResult
00007f5e08a78c20 4000107 24 System.Int32 1 instance -559038737 _closeSocketLinger
00007f5e07b3f0f0 4000108 28 System.Int32 1 instance 0 _closeSocketThread
00007f5e07b3f0f0 4000109 2c System.Int32 1 instance 0 _closeSocketTick
00007f5e07b3f0f0 400010a 30 System.Int32 1 instance 0 _ownClose
00007f5e07b3bbf0 400010b 3c System.Boolean 1 instance 1 <OwnsHandle>k__BackingField
00007f5e07b3bbf0 400010c 3d System.Boolean 1 instance 0 _released
00007f5e07b3bbf0 400010d 3e System.Boolean 1 instance 0 _hasShutdownSend
00007f5e07b3f0f0 400010e 34 System.Int32 1 instance -1 _receiveTimeout
00007f5e07b3f0f0 400010f 38 System.Int32 1 instance -1 _sendTimeout
00007f5e07b3bbf0 4000110 3f System.Boolean 1 instance 0 _nonBlocking
00007f5e08aae4b0 4000111 18 ...ocketAsyncContext 0 instance 00007f1df583aef8 _asyncContext
00007f5e08a796a0 4000112 16 System.Int16 1 instance 2 _trackedOptions
00007f5e07b3bbf0 4000113 40 System.Boolean 1 instance 0 <LastConnectFailed>k__BackingField
00007f5e07b3bbf0 4000114 41 System.Boolean 1 instance 1 <DualMode>k__BackingField
00007f5e07b3bbf0 4000115 42 System.Boolean 1 instance 0 <ExposedHandleOrUntrackedConfiguration>k__BackingField
00007f5e07b3bbf0 4000116 43 System.Boolean 1 instance 0 <PreferInlineCompletions>k__BackingField
00007f5e07b3bbf0 4000117 44 System.Boolean 1 instance 1 <IsSocket>k__BackingField
00007f5e07b3bbf0 4000118 45 System.Boolean 1 instance 0 <IsDisconnected>k__BackingField
any idea how that becomes invalid @jkotas ?
I could not confirm the -1
above but the first threads looks suspicious
(lldb) thread select 1
* thread #1, name = 'dotnet', stop reason = signal SIGABRT
frame #0: 0x00007f5e86cf9387
-> 0x7f5e86cf9387: cmpq $-0x1000, %rax ; imm = 0xF000
0x7f5e86cf938d: ja 0x7f5e86cf93ad
0x7f5e86cf938f: rep retq
0x7f5e86cf9391: nopl (%rax)
(lldb) bt
* thread #1, name = 'dotnet', stop reason = signal SIGABRT
* frame #0: 0x00007f5e86cf9387
since this was last reported on macOS, it seems unlikely related to #73972.
@AaronRobinsonMSFT Could you please take a look?
The SafeHandle.handle
is a good value (00000000000000A3), but it somehow turned into -1 by the time it got into the P/Invoke.
since this was last reported on macOS
I can't seem to reproduce this on an M1. I will try Linux-x64 next.
We spent days with @rzikm to reproduce it without any luck @AaronRobinsonMSFT. We really only have some dumps from CI.
I think, this is the root cause of this issue. Because it's the only place that we can set the handle to -1 temporarily (via CreateSocket
) outside of the Socket
constructor. In this case, we're trying to close the socket (or set linger option) while trying to replace the handle.
SafeSocketHandle oldHandle = _handle;
SafeSocketHandle newHandle;
SocketError errorCode = SocketPal.CreateSocket(_addressFamily, _socketType, _protocolType, out newHandle);
_handle = newHandle;
Something like this should fix it.
I can't quite see how that fixes it. It's an out parameter, so how is what you wrote different to the existing code?
The problem is race condition, actually. At the beginning of the CreateSocket
function we have something like this:
Which means we're replacing the current Socket
instance's handle with default constructed SafeSocketHandle
, at this point we have a chance to hit -1
as handle's file descriptor if we're running code in multi-thread/async environment.
In the same function we have another line to update file descriptor:
Edit: I deleted the wrong information
Ah, I didn't realize it's multithreaded.
Where is the code in SendAsync
@liveans? I only saw some replacement during connect. I think this can still be a problem in cancellation & cleanup kicks in on thread pool. I would be probably worth of trying proposed change.
Where is the code in
SendAsync
@liveans? I only saw some replacement during connect. I think this can still be a problem in cancellation & cleanup kicks in on thread pool. I would be probably worth of trying proposed change.
Yesterday evening we were discussing it with @antonfirsov as well, after that we noticed that I mistracked the path and ReplaceHandle
is not using in SendAsync
path (my bad), it's using via SendPacketsAsync
path, but I'm still thinking the proposed change can fix this issue, because in the whole code this is the only place that we're changing the handle without using constructor.
I mistracked the path and ReplaceHandle is not using in SendAsync path
I haven't looked at the code, but I think there probably is such a path. SmptClient SendAsync establishes the connection, and connect calls ReplaceHandle on Unix to try multiple IP addresses.
The fix in https://github.com/dotnet/runtime/pull/70046 (previously mentioned by @wfurt) was about making SendAsync keep using an open connection, see https://github.com/dotnet/runtime/issues/49340#issuecomment-1141867717. So that may have triggered the issue.
It's definitely possible we seeing a race between connect replacing the handle, and SmtpClient Abort observing this half-initialized handle.
I mistracked the path and ReplaceHandle is not using in SendAsync path
I haven't looked at the code, but I think there probably is such a path.
We should double check it then, thanks for correcting me!
It's definitely possible we seeing a race between connect replacing the handle, and SmtpClient Abort observing this half-initialized handle.
Proposed fix worth to try then.
So far it has not been reported by external customers. The reports came in only from our CI. It is a race condition when we close socket in parallel during its creation (being created in a batch of socket handles via array overload) -- it is a rare stress scenario with very small time window for it to happen. Impact on customer (on release builds without asserts) -- memory leak of 1 handle.
Not worth servicing 7.0.x, until we get reports from customers.
@karelz @wfurt FYI this failure happened again today in 7.0. Based on the last comment, I won't reopen the issue, but I am pasting all the information here so this gets linked with the affected PR, and to preserve history.
Libraries Test Run release mono Linux x64 Debug
Description
System.Net.Mail.Functional.Tests are failing with this assert in CI:
https://helixre107v0xdeko0k025g8.blob.core.windows.net/dotnet-runtime-refs-pull-72664-merge-3f079befb6de4fac81/System.Net.Mail.Functional.Tests/1/console.08ced1c9.log?helixlogtype=result
Reproduction Steps
Example CI build: https://dev.azure.com/dnceng/public/_build/results?buildId=1897299&view=ms.vss-test-web.build-test-results-tab
Expected behavior
Test doesn't fail in CI
Actual behavior
Test does fail in CI, see description.
Regression?
Unknown
Known Workarounds
Unknown
Configuration
Linux Debug x64 Mono Interpreter
Other information
No response
Report
Summary