Server's sockets are not shut down in an orderly fashion; sending heartbeats can occasionally fail

I've been working on the node nailgun client jvmpin — updating it to work with the current nailgun server.

I've noticed that occasionally I encounter an error from writing to a closed socket, caused when the client sends a heartbeat chunk as the server sends its exit chunk. The server doesn't wait for the client to close its side of the socket before shutting down its socket. Because the nailgun protocol has no way for the client to acknowledge the end of an exchange before the server closes the connection, the connection state must be used to terminate an exchange gracefully.

Currently there's a race condition in that the client may send a heartbeat as the server sends an exit chunk and closes its socket, resulting in the server sending back a TCP RST. To some extent this can be mitigated by not sending heartbeats after the client sends its EOF chunk, but then if the nail takes longer than the heartbeat timeout interval to finish executing then it'll fail with a timeout.

Currently, when a nail is done, the NGCommunicator calls exit() which sends the exit chunk, and calls shutdownInput() and shutdownOutput() on the socket. Immediately after, NGCommunicator.runImpl() returns, and NGCommunicator.close() is called via the try-with-resources managing the NGCommunicator being left. It's this close() call that close()s the socket. At this point, the socket is closed, but the client may still be sending a heartbeat.

Instead, and following the advice in Orderly Versus Abortive Connection Release in Java I believe the exit() method should only call shutdownOutput() on the socket, then wait (with a suitable timeout) for a read() on the socket's input stream to return -1 to indicate that the client has closed its end of the socket. At that point, close() can be called on the socket as neither party are sending any further data.

The client could also close its side of the socket after sending its EOF, in which case the server could close the connection as soon as the exit chunk is sent.

facebookarchive / nailgun

Server's sockets are not shut down in an orderly fashion; sending heartbeats can occasionally fail #175