davidstump / SwiftPhoenixClient

Connect your Phoenix and iOS applications through WebSockets!

Reconnect after heartbeat timeout #217

Closed: bhyzy closed this 2 years ago

bhyzy commented 2 years ago

Problem

The socket does not reconnect correctly when the server fails to respond to a heartbeat in time but does acknowledge the subsequent client-initiated disconnection request.

Root cause analysis

Every heartbeatInterval the socket attempts to push a heartbeat message to the server (sendHeartbeat). If a response to the previous heartbeat has not been received yet (pendingHeartbeatRef is not nil), a heartbeat timeout is declared. This in turn triggers an abnormal closure of the socket via abnormalClose("heartbeat timeout"):

if let _ = self.pendingHeartbeatRef {
  // ...
  self.abnormalClose("heartbeat timeout")
  return
}
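
For context, here is a simplified, self-contained sketch of the heartbeat cycle that the check above relies on (illustrative only; the helper names are not the library's API): a ref is stored when a heartbeat is pushed and cleared when the server replies, so a leftover ref means the previous heartbeat was never answered.

final class HeartbeatTracker {
  private var pendingHeartbeatRef: String?
  private var refCounter = 0

  /// Called every `heartbeatInterval`.
  func sendHeartbeat(push: (String) -> Void, onTimeout: () -> Void) {
    if pendingHeartbeatRef != nil {
      // The previous heartbeat was never acknowledged: declare a timeout.
      pendingHeartbeatRef = nil
      onTimeout()
      return
    }
    refCounter += 1
    let ref = String(refCounter)
    pendingHeartbeatRef = ref
    push(ref)
  }

  /// Called when the server replies to the heartbeat with a matching ref.
  func receivedHeartbeatReply(ref: String) {
    if ref == pendingHeartbeatRef { pendingHeartbeatRef = nil }
  }
}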

The socket then closes the connection to the server with CloseCode.normal (1000):

internal func abnormalClose(_ reason: String) {
  self.closeWasClean = false

  /*
   We use NORMAL here since the client is the one determining to close the
   connection. However, we keep a flag `closeWasClean` set to false so that
   the client knows that it should attempt to reconnect.
   */
  self.connection?.disconnect(code: CloseCode.normal.rawValue, reason: reason)
}

The transport layer (self.connection) simply drops the connection (showing URLSessionTransport):

public func disconnect(code: Int, reason: String?) {
  // ...
  self.readyState = .closing
  self.task?.cancel(with: closeCode, reason: reason?.data(using: .utf8))
}
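
The elided portion presumably maps the raw Int code onto URLSession's close code enum before cancelling the task, along these lines (a sketch, not the library's exact code):

import Foundation

// 1000 maps onto .normalClosure; unrecognized values fall back to .invalid.
let code = 1000
let closeCode = URLSessionWebSocketTask.CloseCode(rawValue: code) ?? .invalid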

Two things can happen now:

  1. Server is unreachable: the disconnection request is not processed, and the transport layer reports an abnormal closure to the socket (self.delegate):

     private func abnormalErrorReceived(_ error: Error) {
       // ...
       // An abnormal error results in an abnormal closure, such as the internet connection
       // being dropped, so inform the delegate that the Transport has closed abnormally.
       // This will kick off the reconnect logic.
       self.delegate?.onClose(code: Socket.CloseCode.abnormal.rawValue)
     }

  2. Server is reachable: the disconnection request is processed, and the transport layer reports a closure event back to the socket (self.delegate), passing in the close code that was sent to the server (1000 = CloseCode.normal):

     public func urlSession(_ session: URLSession,
                            webSocketTask: URLSessionWebSocketTask,
                            didCloseWith closeCode: URLSessionWebSocketTask.CloseCode,
                            reason: Data?) {
       // ...
       self.delegate?.onClose(code: closeCode.rawValue)
     }

Both of these scenarios end up calling the Socket.onClose method:

public func onClose(code: Int) {
  self.closeWasClean = code != CloseCode.abnormal.rawValue
  self.onConnectionClosed(code: code)
}

In scenario 1, self.closeWasClean is set to false. In scenario 2, however, it is set to true, overriding the false value set earlier in abnormalClose. This in turn causes the reconnection logic to be skipped:

internal func onConnectionClosed(code: Int?) {
  // ...
  // Only attempt to reconnect if the socket did not close normally
  if (!self.closeWasClean) {
    self.reconnectTimer.scheduleTimeout()
  }
  // ...
}
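
To make the failure concrete, here is a minimal, standalone model of the state machine described above (illustrative only, not the library's code) showing scenario 2: the heartbeat timeout marks the close as unclean, the server's acknowledgement of the 1000 close then overwrites that flag, and the reconnect timer is never scheduled.

final class SocketModel {
  static let normalCode = 1000    // CloseCode.normal
  static let abnormalCode = 1006  // stand-in for CloseCode.abnormal (exact value not important here)

  var closeWasClean = false
  var reconnectScheduled = false

  // Mirrors abnormalClose(_:): flag the close as unclean, then ask the
  // transport to close with code 1000.
  func abnormalClose() {
    closeWasClean = false
    // connection.disconnect(code: 1000, ...) happens here
  }

  // Mirrors onClose(code:) followed by onConnectionClosed(code:).
  func onClose(code: Int) {
    closeWasClean = code != Self.abnormalCode
    if !closeWasClean { reconnectScheduled = true }
  }
}

let socket = SocketModel()
socket.abnormalClose()                        // heartbeat timeout
socket.onClose(code: SocketModel.normalCode)  // server acknowledges the 1000 close
print(socket.reconnectScheduled)              // false: the reconnect is skipped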

Solution

The proposed solution introduces another state variable, closingAbnormally, to be used instead of closeWasClean in the abnormal closure scenario. By keeping this information separate from the more generic closeWasClean state, Socket can tell whether a 1000-code disconnection was caused by normal operation or by an abnormal condition (heartbeat timeout), and run the reconnection logic accordingly.
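
Applied to the standalone model above, the idea looks roughly like this (a sketch; the actual PR may differ in naming and details):

final class FixedSocketModel {
  static let normalCode = 1000
  static let abnormalCode = 1006  // stand-in for CloseCode.abnormal

  var closeWasClean = false
  var closingAbnormally = false
  var reconnectScheduled = false

  func abnormalClose() {
    closeWasClean = false
    closingAbnormally = true  // remember that this close was triggered abnormally
    // connection.disconnect(code: 1000, ...) happens here
  }

  func onClose(code: Int) {
    // A 1000 close that we initiated because of a heartbeat timeout is not clean.
    closeWasClean = code != Self.abnormalCode && !closingAbnormally
    closingAbnormally = false
    if !closeWasClean { reconnectScheduled = true }
  }
}

let fixed = FixedSocketModel()
fixed.abnormalClose()                             // heartbeat timeout
fixed.onClose(code: FixedSocketModel.normalCode)  // server acknowledges the 1000 close
print(fixed.reconnectScheduled)                   // true: the reconnect is now scheduled

A regular, client-requested disconnect leaves closingAbnormally set to false, so closeWasClean still ends up true and no spurious reconnect is scheduled.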

bhyzy commented 2 years ago

Apologies, this PR was meant to go to our fork for now.

dsrees commented 2 years ago

@bhyzy no worries, wonderful write-up. If further testing on your fork proves this as a solution then I'd be happy to take a PR to the main repo.

bhyzy commented 2 years ago

@dsrees Perfect, thanks, I'll be sure to follow up with a PR here then. We're also adding some logging, it will be interesting to see how widespread this issue has been.