dotnet / runtime

.NET is a cross-platform runtime for cloud, mobile, desktop, and IoT apps.
https://docs.microsoft.com/dotnet/core/
MIT License
15.39k stars 4.75k forks source link

Test failures: reverseouter.sh, reverse.sh #39979

Closed BruceForstall closed 4 years ago

BruceForstall commented 4 years ago

The tracing/eventpipe/reverse/reverse/reverse.sh and tracing/eventpipe/reverseouter/reverseouter/reverseouter.sh tests are failing on OSX with timeouts in the runtime-coreclr jitstress-isas-x86 pipeline in various stress modes. I don't know if the stress modes here matter, because the tests seem to fail in different stress modes in different runs.

https://dev.azure.com/dnceng/public/_build/results?buildId=746866&view=ms.vss-test-web.build-test-results-tab&runId=23151350&resultId=109366&paneView=debug

E.g., tracing/eventpipe/reverseouter/reverseouter/reverseouter.sh, CoreCLR OSX x64 Checked jitstress_isas_x86_nosse2 @ OSX.1013.Amd64.Open:

export COMPlus_TieredCompilation=0
export COMPlus_EnableSSE2=0

...

Collected crash dump: /cores/crashdump_3455.dmp
    tracing/eventpipe/reverseouter/reverseouter/reverseouter.sh [FAIL]

      cmdLine:/private/tmp/helix/working/BD010A2A/w/B3D209C3/e/tracing/eventpipe/reverseouter/reverseouter/reverseouter.sh Timed Out (timeout in milliseconds: 1800000 from variable __TestTimeout, start: 7/27/2020 1:03:40 PM, end: 7/27/2020 1:33:41 PM)

      Return code:      -100
      Raw output file:      /private/tmp/helix/working/BD010A2A/w/B3D209C3/e/tracing/eventpipe/Reports/tracing.eventpipe/reverseouter/reverseouter/reverseouter.output.txt
      Raw output:
      BEGIN EXECUTION
      /tmp/helix/working/BD010A2A/p/corerun reverseouter.dll ''
        0.0s: Attempting to delete the zombied pipe: /var/folders/6z/dfqrw_852fn8p5lzhtcttmp00000gy/T/dotnet-diagnostic-26988-1595846210-socket
        0.0s: Deleted
        0.0s: Attempting to delete the zombied pipe: /var/folders/6z/dfqrw_852fn8p5lzhtcttmp00000gy/T/dotnet-diagnostic-26989-1595846210-socket
        0.0s: Deleted
        0.0s: Attempting to delete the zombied pipe: /var/folders/6z/dfqrw_852fn8p5lzhtcttmp00000gy/T/dotnet-diagnostic-48544-1595858520-socket
        0.0s: Deleted
        0.0s: Attempting to delete the zombied pipe: /var/folders/6z/dfqrw_852fn8p5lzhtcttmp00000gy/T/dotnet-diagnostic-48567-1595858524-socket
        0.0s: Deleted
        0.0s: Attempting to delete the zombied pipe: /var/folders/6z/dfqrw_852fn8p5lzhtcttmp00000gy/T/dotnet-diagnostic-48568-1595858524-socket
        0.0s: Deleted
        0.0s: Attempting to delete the zombied pipe: /var/folders/6z/dfqrw_852fn8p5lzhtcttmp00000gy/T/dotnet-diagnostic-48582-1595858526-socket
        0.0s: Deleted
        0.0s: Attempting to delete the zombied pipe: /var/folders/6z/dfqrw_852fn8p5lzhtcttmp00000gy/T/dotnet-diagnostic-48583-1595858526-socket
        0.0s: Deleted
        0.0s: Attempting to delete the zombied pipe: /var/folders/6z/dfqrw_852fn8p5lzhtcttmp00000gy/T/dotnet-diagnostic-48591-1595858527-socket
        0.0s: Deleted
        0.0s: Attempting to delete the zombied pipe: /var/folders/6z/dfqrw_852fn8p5lzhtcttmp00000gy/T/dotnet-diagnostic-48593-1595858527-socket
        0.0s: Deleted
        0.0s: Attempting to delete the zombied pipe: /var/folders/6z/dfqrw_852fn8p5lzhtcttmp00000gy/T/dotnet-diagnostic-48597-1595858528-socket
        0.0s: Deleted
        0.0s: Attempting to delete the zombied pipe: /var/folders/6z/dfqrw_852fn8p5lzhtcttmp00000gy/T/dotnet-diagnostic-48598-1595858528-socket
        0.0s: Deleted
        0.0s: Attempting to delete the zombied pipe: /var/folders/6z/dfqrw_852fn8p5lzhtcttmp00000gy/T/dotnet-diagnostic-48599-1595858528-socket
        0.0s: Deleted
        0.0s: Attempting to delete the zombied pipe: /var/folders/6z/dfqrw_852fn8p5lzhtcttmp00000gy/T/dotnet-diagnostic-48604-1595858529-socket
        0.0s: Deleted
        0.0s: Attempting to delete the zombied pipe: /var/folders/6z/dfqrw_852fn8p5lzhtcttmp00000gy/T/dotnet-diagnostic-48608-1595858530-socket
        0.0s: Deleted
        0.0s: Attempting to delete the zombied pipe: /var/folders/6z/dfqrw_852fn8p5lzhtcttmp00000gy/T/dotnet-diagnostic-48610-1595858530-socket
        0.0s: Deleted
        0.1s: Attempting to delete the zombied pipe: /var/folders/6z/dfqrw_852fn8p5lzhtcttmp00000gy/T/dotnet-diagnostic-48612-1595858531-socket
        0.1s: Deleted
        0.1s: Attempting to delete the zombied pipe: /var/folders/6z/dfqrw_852fn8p5lzhtcttmp00000gy/T/dotnet-diagnostic-48614-1595858531-socket
        0.1s: Deleted
        0.1s: Attempting to delete the zombied pipe: /var/folders/6z/dfqrw_852fn8p5lzhtcttmp00000gy/T/dotnet-diagnostic-48615-1595858532-socket
        0.1s: Deleted
        0.1s: Attempting to delete the zombied pipe: /var/folders/6z/dfqrw_852fn8p5lzhtcttmp00000gy/T/dotnet-diagnostic-48616-1595858532-socket
        0.1s: Deleted
        0.1s: Attempting to delete the zombied pipe: /var/folders/6z/dfqrw_852fn8p5lzhtcttmp00000gy/T/dotnet-diagnostic-48621-1595858532-socket
        0.1s: Deleted
        0.1s: Attempting to delete the zombied pipe: /var/folders/6z/dfqrw_852fn8p5lzhtcttmp00000gy/T/dotnet-diagnostic-48631-1595858533-socket
        0.1s: Deleted
        0.1s: Attempting to delete the zombied pipe: /var/folders/6z/dfqrw_852fn8p5lzhtcttmp00000gy/T/dotnet-diagnostic-48632-1595858534-socket
        0.1s: Deleted
        0.1s: Attempting to delete the zombied pipe: /var/folders/6z/dfqrw_852fn8p5lzhtcttmp00000gy/T/dotnet-diagnostic-48633-1595858534-socket
        0.1s: Deleted
        0.1s: Attempting to delete the zombied pipe: /var/folders/6z/dfqrw_852fn8p5lzhtcttmp00000gy/T/dotnet-diagnostic-48678-1595858540-socket
        0.1s: Deleted
        0.1s: Attempting to delete the zombied pipe: /var/folders/6z/dfqrw_852fn8p5lzhtcttmp00000gy/T/dotnet-diagnostic-48683-1595858540-socket
        0.1s: Deleted
        0.1s: Attempting to delete the zombied pipe: /var/folders/6z/dfqrw_852fn8p5lzhtcttmp00000gy/T/dotnet-diagnostic-48684-1595858541-socket
        0.1s: Deleted
        0.1s: Attempting to delete the zombied pipe: /var/folders/6z/dfqrw_852fn8p5lzhtcttmp00000gy/T/dotnet-diagnostic-48685-1595858541-socket
        0.1s: Deleted
        0.1s: Attempting to delete the zombied pipe: /var/folders/6z/dfqrw_852fn8p5lzhtcttmp00000gy/T/dotnet-diagnostic-48686-1595858541-socket
        0.1s: Deleted
        0.1s: Attempting to delete the zombied pipe: /var/folders/6z/dfqrw_852fn8p5lzhtcttmp00000gy/T/dotnet-diagnostic-48693-1595858541-socket
        0.1s: Deleted
        0.1s: Attempting to delete the zombied pipe: /var/folders/6z/dfqrw_852fn8p5lzhtcttmp00000gy/T/dotnet-diagnostic-48694-1595858541-socket
        0.1s: Deleted
        0.1s: Attempting to delete the zombied pipe: /var/folders/6z/dfqrw_852fn8p5lzhtcttmp00000gy/T/dotnet-diagnostic-48695-1595858541-socket
        0.1s: Deleted
        0.1s: Attempting to delete the zombied pipe: /var/folders/6z/dfqrw_852fn8p5lzhtcttmp00000gy/T/dotnet-diagnostic-48696-1595858541-socket
        0.1s: Deleted
        0.1s: Attempting to delete the zombied pipe: /var/folders/6z/dfqrw_852fn8p5lzhtcttmp00000gy/T/dotnet-diagnostic-48697-1595858541-socket
        0.1s: Deleted
        0.1s: Attempting to delete the zombied pipe: /var/folders/6z/dfqrw_852fn8p5lzhtcttmp00000gy/T/dotnet-diagnostic-48707-1595858541-socket
        0.1s: Deleted
        0.1s: Attempting to delete the zombied pipe: /var/folders/6z/dfqrw_852fn8p5lzhtcttmp00000gy/T/dotnet-diagnostic-48708-1595858541-socket
        0.1s: Deleted
        0.1s: Attempting to delete the zombied pipe: /var/folders/6z/dfqrw_852fn8p5lzhtcttmp00000gy/T/dotnet-diagnostic-48709-1595858541-socket
        0.1s: Deleted
        0.1s: Attempting to delete the zombied pipe: /var/folders/6z/dfqrw_852fn8p5lzhtcttmp00000gy/T/dotnet-diagnostic-48753-1595858542-socket
        0.1s: Deleted
        0.1s: Attempting to delete the zombied pipe: /var/folders/6z/dfqrw_852fn8p5lzhtcttmp00000gy/T/dotnet-diagnostic-54475-1595863684-socket
        0.1s: Deleted
        0.1s: Attempting to delete the zombied pipe: /var/folders/6z/dfqrw_852fn8p5lzhtcttmp00000gy/T/dotnet-diagnostic-54476-1595863684-socket
        0.1s: Deleted
        0.1s: Attempting to delete the zombied pipe: /var/folders/6z/dfqrw_852fn8p5lzhtcttmp00000gy/T/dotnet-diagnostic-76157-1595873819-socket
        0.1s: Deleted
        0.1s: Attempting to delete the zombied pipe: /var/folders/6z/dfqrw_852fn8p5lzhtcttmp00000gy/T/dotnet-diagnostic-76158-1595873820-socket
        0.1s: Deleted
        0.1s: ::== Running test: TEST_ReverseConnectionCanRecycleWhileTracing
        0.1s: Server name is '/var/folders/6z/dfqrw_852fn8p5lzhtcttmp00000gy/T/DOTNET_TRACE_TESTS_xzdzwgmz.ypl'
        0.5s: running sub-process: /private/tmp/helix/working/BD010A2A/p/corerun /private/tmp/helix/working/BD010A2A/w/B3D209C3/e/tracing/eventpipe/reverseouter/reverseouter/reverseouter.dll 0
        1.0s: subprocess started: True
        1.0s: subprocess PID: 3462
        1.2s: Starting EventPipeSession over standard connection
        1.2s: Waiting for connection
        1.5s: Started EventPipeSession over standard connection with session id: 0x7f82d7533340
        1.5s: Got a connection
        1.8s: { Magic=System.Byte[]; ClrInstanceId=67504b96-f064-3b71-8e0d-ca36f0212490; ProcessId=3462; Unused=0; }
        1.8s: Waiting for connection
        1.8s: Got a connection
        1.8s: { Magic=System.Byte[]; ClrInstanceId=67504b96-f064-3b71-8e0d-ca36f0212490; ProcessId=3462; Unused=0; }
        1.8s: Waiting for connection
        1.9s: Got a connection
        1.9s: { Magic=System.Byte[]; ClrInstanceId=67504b96-f064-3b71-8e0d-ca36f0212490; ProcessId=3462; Unused=0; }
        1.9s: Waiting for connection
        1.9s: Got a connection
        1.9s: { Magic=System.Byte[]; ClrInstanceId=67504b96-f064-3b71-8e0d-ca36f0212490; ProcessId=3462; Unused=0; }
        1.9s: Waiting for connection
        1.9s: Got a connection
        1.9s: { Magic=System.Byte[]; ClrInstanceId=67504b96-f064-3b71-8e0d-ca36f0212490; ProcessId=3462; Unused=0; }
        1.9s: Waiting for connection
        1.9s: Got a connection
        1.9s: { Magic=System.Byte[]; ClrInstanceId=67504b96-f064-3b71-8e0d-ca36f0212490; ProcessId=3462; Unused=0; }
        1.9s: Waiting for connection
        1.9s: Got a connection
        1.9s: { Magic=System.Byte[]; ClrInstanceId=67504b96-f064-3b71-8e0d-ca36f0212490; ProcessId=3462; Unused=0; }
        1.9s: Waiting for connection
        2.0s: Got a connection
        2.0s: { Magic=System.Byte[]; ClrInstanceId=67504b96-f064-3b71-8e0d-ca36f0212490; ProcessId=3462; Unused=0; }
        2.0s: Waiting for connection
        2.0s: Got a connection
        2.0s: { Magic=System.Byte[]; ClrInstanceId=67504b96-f064-3b71-8e0d-ca36f0212490; ProcessId=3462; Unused=0; }
        2.0s: Waiting for connection
        2.0s: Got a connection
        2.0s: { Magic=System.Byte[]; ClrInstanceId=67504b96-f064-3b71-8e0d-ca36f0212490; ProcessId=3462; Unused=0; }
        2.0s: Waiting for connection
        2.0s: Got a connection
        2.0s: { Magic=System.Byte[]; ClrInstanceId=67504b96-f064-3b71-8e0d-ca36f0212490; ProcessId=3462; Unused=0; }
        2.0s: Waiting for connection
        2.0s: Got a connection
        2.0s: { Magic=System.Byte[]; ClrInstanceId=67504b96-f064-3b71-8e0d-ca36f0212490; ProcessId=3462; Unused=0; }
        2.0s: Waiting for connection
        2.0s: Got a connection
        2.0s: { Magic=System.Byte[]; ClrInstanceId=67504b96-f064-3b71-8e0d-ca36f0212490; ProcessId=3462; Unused=0; }
        2.0s: Waiting for connection
        2.2s: Stopping EventPipeSession over standard connection
        2.4s: Stopped EventPipeSession over standard connection

      cmdLine:/private/tmp/helix/working/BD010A2A/w/B3D209C3/e/tracing/eventpipe/reverseouter/reverseouter/reverseouter.sh Timed Out (timeout in milliseconds: 1800000 from variable __TestTimeout, start: 7/27/2020 1:03:40 PM, end: 7/27/2020 1:33:41 PM)
      Test Harness Exitcode is : -100

@josalem

josalem commented 4 years ago

This looks similar to the test failure that #39129 fixed. The dumps generated on macOS don't have the native bits in them, so I can't do much until I can repro this locally. Once #39858 is in, these failures should provide more insight into what's happening here. I'll see if running under stress modes causes this to happen more frequently.

josalem commented 4 years ago

Managed to catch it locally and collect a dump. I've got it under a debugger and will be picking away at it. It isn't the same issue as #38156 (which was fixed with #39129). The stress log doesn't show the "too many open files" error, and there aren't any orphaned FDs lying around. The symptom is the same, but the frequency and determinism of hitting it is lower. It is somehow exacerbated by the outer loop tests, so I'm inclined to think it might be a timing issue. I'll update in this issue as i investigate.

CC @tommcdon

josalem commented 4 years ago

A couple updates on this investigation:

Given all the above, it seems like the runtime is somehow getting an orphaned connection to the reverse server Unix domain socket. The reverse server is using the same filename for the Unix domain socket each time it binds a socket, so I'm hypothesizing that there may be some form of race between close and unlink that can happen if unlink happens after close. close technically isn't synchronous according to the man page (socket(7)):

When [SO_LINGER is] enabled, a close(2) or shutdown(2) will not return until all queued messages for the socket have been successfully sent or the linger timeout has been reached. Otherwise, the call returns immediately and the closing is done in the background. When the socket is closed as part of exit(2), it always lingers in the background.

[emphasis mine]

I've slightly modified the order of API calls in the C++ version of the reverse server so that the server Unix domain socket is unlinked before it closes the sockets, i.e.,

unlink(unixDomainFile);
shutdown(clientSocket, SHUT_RDWR);
close(clientSocket);
close(serverSocket);

instead of

shutdown(clientSocket, SHUT_RDWR);
close(clientSocket);
close(serverSocket);
unlink(unixDomainFile);

The close would cause the runtime to go into the HUP logic and attempt to reconnect to the server. If the runtime's call to connect beats the reverse server's call to unlink and close hasn't fully finished in the background, then the call to appears to succeed since the file still exists and it is still bound by the system. Without debugging into or being able to read macOS kernel code, I'm not sure how I can validate this hypothesis besides experimenting and stress testing. I'll see if I can verify the behavior by putting breakpoints on both sides and forcing the race while these stress loops run (8 consoles running the loop infinitely).

I've been running this small modification for just over an hour and the hang hasn't reproduced.

I'll leave this running long enough to feel confident that it is indeed the solution and then put the changes in a PR.

As for why this only happens on macOS, I'm not sure. Since macOS is BSD based, it is quite possible that this is simply a subtle variation in how the close, unlink, and connect APIs work when compared to Linux.

CC - @sywhang @noahfalk @dotnet/dotnet-diag

josalem commented 4 years ago

Stress run is going on 2+ hours now without an issue. I'll prepare the PR and submit it as a fix to the issue.