fastly / pushpin

A proxy server for adding push to your API, used at the core of Fastly's Fanout service
https://pushpin.org
Apache License 2.0
3.66k stars 153 forks source link

connmgr: don't send keep-alives during handoff #48072

Closed jkarneges closed 1 month ago

jkarneges commented 1 month ago

Whenever proxy or handler want to hand off a session, they send a handoff-start message to connmgr, and connmgr replies with handoff-proceed. After this, connmgr is supposed to stay silent until it receives another message, at which point it assumes whoever sent the message is the new owner of the session and communication can resume as normal. However, in rare cases connmgr might send keep-alive messages to the old owner while it is waiting to hear from the new owner, leading to broken sessions and "received message out of sequence" warnings. This PR fixes connmgr to not send keep-alives during handoff.

There are two fixes:

  1. In accept_handoff(), for both server and client mode, unset the to_addr (i.e. the known peer address) of the session. Keep-alives are only sent for sessions with a known peer address. This is an easy one line fix per mode.
  2. In keep_alives_task(), for both server and client mode, account for the possibility that to_addr might get unset for sessions during batch preparation and processing. This is a bit more complicated, see below.

The keep_alives_task() tries to address keep-alive messages to multiple sessions at once, to reduce the number of messages sent. It does this by processing sessions in batches. It first decides which sessions it will be sending keep-alives for, and then it proceeds to generate and send the minimal number of messages with optimal packing. Notably, it waits for message sendability before generating and sending each message, and we need to tolerate to_addr getting unset for any sessions during this waiting. We do this by making it possible for Batch::take_group() to skip sessions via its get_id closure argument and updating the closure provided by next_batch_message() to skip sessions that don't have a to_addr.

The Batch type, including tests, is redundantly implemented twice (once in connmgr::client and once in connmgr::server). This PR updates both copies of the code. In the future we should consider sharing a single implementation of Batch to avoid having to make redundant updates.