docker / libchan

Like Go channels over the network
Apache License 2.0
2.47k stars 142 forks source link

Dead session detection #57

Open lynxbat opened 9 years ago

lynxbat commented 9 years ago

This PR adds a feature to handle unreliable client connections within a server implementation.

If a client loses connection or closes their connection poorly a server goroutine can get stuck waiting on a blocking method. Over a period of time these waiting routines stack up. In a server that handles a high frequency of sessions even a smaller session failure rate can add up quickly.

This PR adds the following features:

This provides the ability to handle client dead session for each of the blocking methods run inside a server goroutine allowing the goroutine for that session to exit gracefully.

Notes:

Docker-DCO-1.1-Signed-off-by: Nicholas Weaver lynxbat@gmail.com (github: lynxbat)

lynxbat commented 9 years ago

Not sure if needed but since this is a big add I can spend time on Hangouts with someone demoing or doing a code review if desired.

dmcgowan commented 9 years ago

Spending some time on code review will be useful. I will slot some time later in the week to review and test this change.

lynxbat commented 9 years ago

Feel free to ping me if needed.

dmcgowan commented 9 years ago

There seems to be two distinct changes in this PR, atleast from the review perspective. Let me address the accept timeout first and then heartbeat/dead session detection second.

Do you have a specific use case which is requiring the accept timeout? I have not seen this pattern widely used since normally accept is done until an explicit close. An explicit close on the listener is still allowed which would cause any further accepts to throw an error, effectively ending any accept loop. As for the code, when using the timeout method in that case I would advocate for setting the timeoutChan to nil when the timeout value is 0. Making the channel nil will have the same block forever effect in the select statement. On a more trivial note the waitForTimeout also seems to have the same functionality as time.After. Likewise errors.New(fmt.Sprintf()) can be replaced by fmt.Errorf, although in the case of timeouts creating an ErrTimeout variable maybe more useful for the caller to easily check if the error was a timeout vs an error caused by a closed listener.

I like the idea behind detecting the dead sessions and heartbeat. Will need more time to fully review and test the code. I am wondering if there is a different way to do the teardown upon bad connection detection. When the transport goes bad such as what is detected by the dead session detector, any attempt to receive should already cause EOF. If it doesn't then the normal connection teardown should begin so any callers of receive will be able to end their receive loops. In the case of a detected bad session, it may be useful to store the error received somewhere that can be checked by the transport initiator, but not necessarily throwing a non-EOF error to callers of receive. Would have to think some more about how this error is expected to be handled by the caller. In an ideal world the libchan transport would be able to fallback to use another connection, but that would be in the future with quite a bit more work.

lynxbat commented 9 years ago

I debated even adding the listener timeout since it seems like it would be more specific to a server implementation. I left it in since I made it work basically. I can understand removing from the heartbeat work and approaching separately or maybe not at all right now.