davidstump / SwiftPhoenixClient

Connect your Phoenix and iOS applications through WebSockets!
MIT License
506 stars 146 forks source link

Fix not always automatically reconnecting after hearbeat timeout #219

Closed bhyzy closed 2 years ago

bhyzy commented 2 years ago

Problem

Socket does not correctly reconnect if the server fails to respond to a heartbeat in time, but acknowledges the subsequent client-initiated disconnection request.

Root cause analysis

Every heartbeatInterval the socket tries to push a heartbeat message to the server (sendHeartbeat). If a response to the previous one has not been received yet (pendingHeartbeatRef not nil), a timeout is proclaimed. This in turn triggers an abnormal closure of the socket: abnormalClose("heartbeat timeout"):

if let _ = self.pendingHeartbeatRef {
  // ...
  self.abnormalClose("heartbeat timeout")
  return
}

The socket then closes the connection to the server with CloseCode.normal (1000):

internal func abnormalClose(_ reason: String) {
  self.closeWasClean = false

  /*
   We use NORMAL here since the client is the one determining to close the
   connection. However, we keep a flag `closeWasClean` set to false so that
   the client knows that it should attempt to reconnect.
   */
  self.connection?.disconnect(code: CloseCode.normal.rawValue, reason: reason)
}

The transport layer (self.connection) simply drops the connection (showing URLSessionTransport):

public func disconnect(code: Int, reason: String?) {
  // ...
  self.readyState = .closing
  self.task?.cancel(with: closeCode, reason: reason?.data(using: .utf8))
}

Two things can happen now:

  1. Server is unreachable: the disconnection request is not processed, the transport layer reports an abnormal closure to the socket (self.delegate):

    private func abnormalErrorReceived(_ error: Error) {
    // ...
    // An abnormal error is results in an abnormal closure, such as internet getting dropped
    // so inform the delegate that the Transport has closed abnormally. This will kick off
    // the reconnect logic.
    self.delegate?.onClose(code: Socket.CloseCode.abnormal.rawValue)
    }
  2. Server is reachable: the disconnection request is processed, the transport layer reports a closure event back to the socket (self.delegate), passing in the close code that was sent to the server (1000 = CloseCode.normal):

    public func urlSession(_ session: URLSession,
                       webSocketTask: URLSessionWebSocketTask,
                       didCloseWith closeCode: URLSessionWebSocketTask.CloseCode,
                       reason: Data?) {
    // ...
    self.delegate?.onClose(code: closeCode.rawValue)
    }

Both of these scenarios end up calling the Socket.onClose method:

public func onClose(code: Int) {
  self.closeWasClean = code != CloseCode.abnormal.rawValue
  self.onConnectionClosed(code: code)
}

In scenario 1. self.closeWasClean is set to false. In scenario 2. though, it's set to true, overriding the false value set earlier in abnormalClose. This in turn causes the reconnection logic to be skipped:

internal func onConnectionClosed(code: Int?) {
  // ...
  // Only attempt to reconnect if the socket did not close normally
  if (!self.closeWasClean) {
    self.reconnectTimer.scheduleTimeout()
  }
  // ...
}

Solution

The proposed solution introduces an enum CloseStatus to be used instead of a boolean closeWasClean, which allows the abnormal closure information to be sticky and to avoid it being overridden to true.

bhyzy commented 2 years ago

Just for some context, we have already rolled out this fix to our users and have been seeing this bug occur roughly 15,000 times/day across our entire user base. We have also been able to consistently reproduce it by enabling Network Link Conditioner on iOS and setting it to Very Bad Network. Our hypothesis is that substantial packet dropping on poor connections is the key ingredient here.

dsrees commented 2 years ago

@bhyzy Glad this worked out for you! Code looks good so far. I'll update here when I get a chance to test this. Thanks