BruceForstall closed this issue 4 years ago.
This looks similar to the test failure that #39129 fixed. The dumps generated on macOS don't have the native bits in them, so I can't do much until I can repro this locally. Once #39858 is in, these failures should provide more insight into what's happening here. I'll see if running under stress modes causes this to happen more frequently.
Managed to catch it locally and collect a dump. I've got it under a debugger and will be picking away at it. It isn't the same issue as #38156 (which was fixed with #39129). The stress log doesn't show the "too many open files" error, and there aren't any orphaned FDs lying around. The symptom is the same, but the frequency and determinism of hitting it are lower. It is somehow exacerbated by the outer loop tests, so I'm inclined to think it might be a timing issue. I'll post updates in this issue as I investigate.
CC @tommcdon
A couple updates on this investigation:

- The issue doesn't appear to be in the System.Net.Sockets implementation. I wrote a simple reverse server in C++ that does the socket, bind, listen, accept, read, shutdown, close, and unlink calls synchronously, and the hang still eventually happens (a sketch of that loop appears below, after the man page excerpt). The C++ synchronous version runs far faster and stresses the system more, so I would expect the issue to reproduce faster, but it doesn't.
- poll should be idempotent so long as you don't change the socket by doing something like calling read or write on it between calls to poll. In other words, poll should return POLLHUP or POLLNVAL on all subsequent calls to poll if we don't manipulate the socket after observing the HUP. This means that the socket being waited on isn't hung up or invalid.

Given all the above, it seems like the runtime is somehow getting an orphaned connection to the reverse server's Unix domain socket. The reverse server uses the same filename for the Unix domain socket each time it binds a socket, so I'm hypothesizing that there may be some form of race between close and unlink that can happen if unlink happens after close. close technically isn't synchronous according to the man page (socket(7)):
When [SO_LINGER is] enabled, a close(2) or shutdown(2) will not return until all queued messages for the socket have been successfully sent or the linger timeout has been reached. Otherwise, the call returns immediately and the closing is done in the background. When the socket is closed as part of exit(2), it always lingers in the background.
[emphasis mine]
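For concreteness, here is a rough sketch of the kind of synchronous per-connection loop the C++ repro server described above runs; the variable names, buffer size, and omitted error handling are mine, not the actual repro source:

```cpp
// Sketch of a synchronous Unix domain socket reverse server
// (socket/bind/listen/accept/read/shutdown/close/unlink), using the
// original teardown order that is hypothesized to lose the race.
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>
#include <cstring>

void RunOneConnection(const char* unixDomainFile)
{
    int serverSocket = socket(AF_UNIX, SOCK_STREAM, 0);

    sockaddr_un addr = {};
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, unixDomainFile, sizeof(addr.sun_path) - 1);

    bind(serverSocket, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
    listen(serverSocket, /* backlog: */ 1);

    int clientSocket = accept(serverSocket, nullptr, nullptr);

    // Drain whatever the runtime sends over the IPC connection.
    char buffer[1024];
    while (read(clientSocket, buffer, sizeof(buffer)) > 0)
    {
    }

    // Teardown in the original order: close the sockets, then unlink.
    shutdown(clientSocket, SHUT_RDWR);
    close(clientSocket);
    close(serverSocket);
    unlink(unixDomainFile);
}
```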
I've slightly modified the order of API calls in the C++ version of the reverse server so that the server Unix domain socket is unlinked before the sockets are closed, i.e.,
unlink(unixDomainFile);
shutdown(clientSocket, SHUT_RDWR);
close(clientSocket);
close(serverSocket);
instead of
shutdown(clientSocket, SHUT_RDWR);
close(clientSocket);
close(serverSocket);
unlink(unixDomainFile);
The close would cause the runtime to go into the HUP logic and attempt to reconnect to the server. If the runtime's call to connect beats the reverse server's call to unlink, and the close hasn't fully finished in the background, then the call to connect appears to succeed since the file still exists and is still bound by the system. Without debugging into or being able to read macOS kernel code, I'm not sure how I can validate this hypothesis besides experimenting and stress testing. I'll see if I can verify the behavior by putting breakpoints on both sides and forcing the race while these stress loops run (8 consoles running the loop infinitely).
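To make that interleaving concrete, here is a simplified sketch of what the runtime-side reconnect path would look like under this hypothesis. This is illustrative code of my own, not the actual EventPipe diagnostics server implementation, and the behavior described in the comments is the unverified hypothesis above:

```cpp
// Illustrative sketch of the hypothesized race from the runtime's side:
// observe the hang-up, drop the connection, and immediately reconnect.
#include <poll.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>
#include <cstring>

int ReconnectOnHangup(int ipcFd, const char* unixDomainFile)
{
    pollfd pfd = {};
    pfd.fd = ipcFd;
    pfd.events = POLLIN;

    if (poll(&pfd, 1, /* timeout: */ -1) > 0 &&
        (pfd.revents & (POLLHUP | POLLNVAL)) != 0)
    {
        // The server hung up: close the old connection and reconnect.
        close(ipcFd);

        int fd = socket(AF_UNIX, SOCK_STREAM, 0);
        sockaddr_un addr = {};
        addr.sun_family = AF_UNIX;
        strncpy(addr.sun_path, unixDomainFile, sizeof(addr.sun_path) - 1);

        // Hypothesis: if this connect() lands after the server's close()
        // has started lingering in the background but before its unlink()
        // removes the file, the path still exists and is still bound, so
        // the connect appears to succeed even though nothing will ever
        // accept() the connection, leaving the runtime waiting forever.
        connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
        return fd;
    }

    return ipcFd;
}
```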
I've been running this small modification for just over an hour and the hang hasn't reproduced.
I'll leave this running long enough to feel confident that it is indeed the solution and then put the changes in a PR.
As for why this only happens on macOS, I'm not sure. Since macOS is BSD-based, it is quite possible that this is simply a subtle variation in how the close, unlink, and connect APIs behave when compared to Linux.
CC - @sywhang @noahfalk @dotnet/dotnet-diag
Stress run is going on 2+ hours now without an issue. I'll prepare the PR and submit it as a fix to the issue.
The tracing/eventpipe/reverse/reverse/reverse.sh and tracing/eventpipe/reverseouter/reverseouter/reverseouter.sh tests are failing on OSX with timeouts in the runtime-coreclr jitstress-isas-x86 pipeline in various stress modes. I don't know if the stress modes here matter, because the tests seem to fail in different stress modes in different runs.

https://dev.azure.com/dnceng/public/_build/results?buildId=746866&view=ms.vss-test-web.build-test-results-tab&runId=23151350&resultId=109366&paneView=debug

E.g., tracing/eventpipe/reverseouter/reverseouter/reverseouter.sh, CoreCLR OSX x64 Checked jitstress_isas_x86_nosse2 @ OSX.1013.Amd64.Open:

@josalem