apple / swift-nio

Event-driven network application framework for high performance protocol servers & clients, non-blocking.
https://swiftpackageindex.com/apple/swift-nio/documentation
Apache License 2.0
7.92k stars 644 forks source link

EventLoopTest.testShuttingDownFailsRegistration shouldn't fail #333

Closed weissi closed 6 years ago

weissi commented 6 years ago

Expected behavior

EventLoopTest.testShuttingDownFailsRegistration should be stable and not randomly fail.

Actual behavior

18:32:26 Test Case 'EventLoopTest.testShuttingDownFailsRegistration' started at 2018-04-18 17:32:26.373
18:32:26 /code/Tests/NIOTests/EventLoopTest.swift:192: error: EventLoopTest.testShuttingDownFailsRegistration : XCTAssertFalse failed - 
18:32:26 Test Case 'EventLoopTest.testShuttingDownFailsRegistration' failed (0.007 seconds)

from here: https://ci.swiftnio.io:4433/job/swift-nio-master/195/console

Steps to reproduce

dunno, run test in a loop on Linux?

If possible, minimal yet complete reproducer code (or URL to code)

test should be fairly minimal

SwiftNIO version/commit hash

9360bd5c4940f61f301e71497606ec853efa2a8d

Swift & OS version (output of swift --version && uname -a)

CI, Swift 4.0.3

weissi commented 6 years ago

yes, reproes by just looping it:

run_in_docker bash -c 'while swift test --filter NIOTests.EventLoopTest/testShuttingDownFailsRegistration; do true; done'

did it after 137 runs.

Lukasa commented 6 years ago

This is the same issue as #334. I'll do a quick sweep over the tests to confirm we aren't making this mistake anywhere else.

Lukasa commented 6 years ago

Sorry, I don't know why I said that: looking at the code again doesn't seem to back up the idea that this is the same issue.

Lukasa commented 6 years ago

I'm suspicious of the fact that the first channel registration+connect got wrapped in submit. I wonder if this ends up creating a race where Channel.close doesn't get called, because the connection attempt races with the loop.closeGently in the other line.

As a general rule, you're waiting on submit blocks, but not waiting on the futures that the calls inside those blocks create. That is probably the wrong thing to do.

Lukasa commented 6 years ago

Hrm, that theory doesn't seem to be right though, as adding a wait for that connection does not make the problem go away.

Lukasa commented 6 years ago

Ok, got it.

The issue here is a simple one of timing. The premise of the test is that the WedgeOpenHandler will keep the channel open by catching calls to close, which will prevent the selector closing the channel during graceful shutdown. This is true!

However, the WedgeOpenHandler only prevents that for the client channel. However, there is a server side to that channel too, and that channel is in the same event loop as the client side. That channel is not protected against being closed by the selector, so it will be. That opens a timing window where the selector may cause the client channel to read EOF from the socket. EOF from the socket does not cause a close to travel through the pipeline, and so this will actually close the client channel directly (via calling close0 with .eof as the error).

There are a few ways we can deal with this, but the easiest may be to set the client channel to allow remote half closure. This should prevent the reading of EOF causing a channel close. However, it's a bit perplexing and would definitely require a comment (and may not function on all OSes, it depends a bit on how intelligent the kernel is: for example, this definitely would not work if we were using AF_UNIX sockets). The other option is to wedge both the server and client channels open, and then ungate both of them at the end of the test. That one is more complex, but also more likely to be understandable and correct.

normanmaurer commented 6 years ago

Wedge both sound like a better option to me for the reason you said