BruceForstall closed this issue 4 years ago.
This looks similar to the test failure that #39129 fixed. The dumps generated on macOS don't have the native bits in them, so I can't do much until I can repro this locally. Once #39858 is in, these failures should provide more insight into what's happening here. I'll see if running under stress modes causes this to happen more frequently.
Managed to catch it locally and collect a dump. I've got it under a debugger and will be picking away at it. It isn't the same issue as #38156 (which was fixed with #39129). The stress log doesn't show the "too many open files" error, and there aren't any orphaned FDs lying around. The symptom is the same, but the frequency and determinism of hitting it are lower. It is somehow exacerbated by the outer loop tests, so I'm inclined to think it might be a timing issue. I'll post updates in this issue as I investigate.
CC @tommcdon
A couple updates on this investigation:

- The issue doesn't appear to be in the System.Net.Sockets implementation. I wrote a simple reverse server in C++ that does the socket, bind, listen, accept, read, shutdown, close, and unlink calls synchronously, and the hang still eventually happens (a sketch of that loop appears below, after the man page excerpt). The C++ synchronous version runs far faster and stresses the system more, so I would expect the issue to reproduce faster, but it doesn't.
- poll should be idempotent so long as you don't change the socket by doing something like calling read or write on it between calls to poll. In other words, poll should return POLLHUP or POLLNVAL on all subsequent calls to poll if we don't manipulate the socket after observing the HUP. This means that the socket being waited on isn't hung up or invalid.

Given all the above, it seems like the runtime is somehow getting an orphaned connection to the reverse server's Unix domain socket. The reverse server uses the same filename for the Unix domain socket each time it binds a socket, so I'm hypothesizing that there may be some form of race between close and unlink that can happen if unlink happens after close. close technically isn't synchronous according to the man page (socket(7)):
When [SO_LINGER is] enabled, a close(2) or shutdown(2) will not return until all queued messages for the socket have been successfully sent or the linger timeout has been reached. Otherwise, the call returns immediately and the closing is done in the background. When the socket is closed as part of exit(2), it always lingers in the background.
[emphasis mine]
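For concreteness, here is a rough sketch of the kind of synchronous per-connection loop the C++ repro server described above runs; the variable names, buffer size, and omitted error handling are mine, not the actual repro source:

```cpp
// Sketch of a synchronous Unix domain socket reverse server
// (socket/bind/listen/accept/read/shutdown/close/unlink), using the
// original teardown order that is hypothesized to lose the race.
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>
#include <cstring>

void RunOneConnection(const char* unixDomainFile)
{
    int serverSocket = socket(AF_UNIX, SOCK_STREAM, 0);

    sockaddr_un addr = {};
    addr.sun_family = AF_UNIX;
    strncpy(addr.sun_path, unixDomainFile, sizeof(addr.sun_path) - 1);

    bind(serverSocket, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
    listen(serverSocket, /* backlog: */ 1);

    int clientSocket = accept(serverSocket, nullptr, nullptr);

    // Drain whatever the runtime sends over the IPC connection.
    char buffer[1024];
    while (read(clientSocket, buffer, sizeof(buffer)) > 0)
    {
    }

    // Teardown in the original order: close the sockets, then unlink.
    shutdown(clientSocket, SHUT_RDWR);
    close(clientSocket);
    close(serverSocket);
    unlink(unixDomainFile);
}
```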
I've slightly modified the order of API calls in the C++ version of the reverse server so that the server Unix domain socket is unlinked before the sockets are closed, i.e.,
unlink(unixDomainFile);
shutdown(clientSocket, SHUT_RDWR);
close(clientSocket);
close(serverSocket);
instead of
shutdown(clientSocket, SHUT_RDWR);
close(clientSocket);
close(serverSocket);
unlink(unixDomainFile);
The close would cause the runtime to go into the HUP logic and attempt to reconnect to the server. If the runtime's call to connect beats the reverse server's call to unlink, and the close hasn't fully finished in the background, then the call to connect appears to succeed since the file still exists and is still bound by the system. Without debugging into or being able to read macOS kernel code, I'm not sure how I can validate this hypothesis besides experimenting and stress testing. I'll see if I can verify the behavior by putting breakpoints on both sides and forcing the race while these stress loops run (8 consoles running the loop infinitely).
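To make that interleaving concrete, here is a simplified sketch of what the runtime-side reconnect path would look like under this hypothesis. This is illustrative code of my own, not the actual EventPipe diagnostics server implementation, and the behavior described in the comments is the unverified hypothesis above:

```cpp
// Illustrative sketch of the hypothesized race from the runtime's side:
// observe the hang-up, drop the connection, and immediately reconnect.
#include <poll.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>
#include <cstring>

int ReconnectOnHangup(int ipcFd, const char* unixDomainFile)
{
    pollfd pfd = {};
    pfd.fd = ipcFd;
    pfd.events = POLLIN;

    if (poll(&pfd, 1, /* timeout: */ -1) > 0 &&
        (pfd.revents & (POLLHUP | POLLNVAL)) != 0)
    {
        // The server hung up: close the old connection and reconnect.
        close(ipcFd);

        int fd = socket(AF_UNIX, SOCK_STREAM, 0);
        sockaddr_un addr = {};
        addr.sun_family = AF_UNIX;
        strncpy(addr.sun_path, unixDomainFile, sizeof(addr.sun_path) - 1);

        // Hypothesis: if this connect() lands after the server's close()
        // has started lingering in the background but before its unlink()
        // removes the file, the path still exists and is still bound, so
        // the connect appears to succeed even though nothing will ever
        // accept() the connection, leaving the runtime waiting forever.
        connect(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr));
        return fd;
    }

    return ipcFd;
}
```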
I've been running this small modification for just over an hour and the hang hasn't reproduced.
I'll leave this running long enough to feel confident that it is indeed the solution and then put the changes in a PR.
As for why this only happens on macOS, I'm not sure. Since macOS is BSD-based, it is quite possible that this is simply a subtle variation in how the close, unlink, and connect APIs behave when compared to Linux.
CC - @sywhang @noahfalk @dotnet/dotnet-diag
Stress run is going on 2+ hours now without an issue. I'll prepare the PR and submit it as a fix to the issue.
The tracing/eventpipe/reverse/reverse/reverse.sh and tracing/eventpipe/reverseouter/reverseouter/reverseouter.sh tests are failing on OSX with timeouts in the runtime-coreclr jitstress-isas-x86 pipeline in various stress modes. I don't know if the stress modes here matter, because the tests seem to fail in different stress modes in different runs.

https://dev.azure.com/dnceng/public/_build/results?buildId=746866&view=ms.vss-test-web.build-test-results-tab&runId=23151350&resultId=109366&paneView=debug

E.g., tracing/eventpipe/reverseouter/reverseouter/reverseouter.sh, CoreCLR OSX x64 Checked jitstress_isas_x86_nosse2 @ OSX.1013.Amd64.Open:

@josalem