implement backoff for reconnect

BrendanBall commented 2 years ago

related to #111.

The current reconnect feature doesn't support backoff, so it's not really useful since spamming reconnects is not good behaviour. This introduces a backoff feature, taking inspiration from https://hex.pm/packages/connection.

This was meant to be a small change, however with the existing async and handle_initial_conn_failure options, the solution turned out to be incoherent and also uncovered existing bugs with reconnect during init that could indefinitely block the supervisor startup process.

As a result, this change makes breaking changes to the api, and introduces some new apis to recover some previous functionality. In particular, async init is now the default and only behaviour, ie. a network connection is only made after finalizing the special process init.

Additionally, I've removed send_frame from the selective receive during connecting status since it is not expected for a connection to be in this status for very long and this removes the need for an extra api to wait for the connection being connected.

You can now do the following:

def handle_disconnect(%{attempt: attempt}, state) do
  {:backoff, calculate_backoff(attempt), state}
end

I realised that currently handle_cast and handle_info messages are only handled when in the websocket_loop, ie. when connected because those can both return {:reply, frame, state} which would fail if not connected at that point. This is probably currently fine since open_loop and close_loop aren't expected to last that long, so it's fine for a message to wait in the inbox until the connection is established. However since I introduced a backoff_loop this is a problem because either

handle_cast and handle_info messages must be handled in backoff_loop and :reply has to somehow be handled differently when not connected or
handle_cast and handle_info messages must NOT be handled in backoff_loop which means messages could wait in the inbox for potentially a long time, especially since the backoff is given by the library user, which could conceivably be > 10 s if exponential backoff is implemented (this flexibility is provided precisely because we want the user to be able to choose their own backoff strategy for their use case).

Using {:reply, frame, state} already makes sending messages unreliable in the sense that the user isn't given feedback on whether the message was actually successfully sent, despite websocket itself being a reliable transport. This means that using this could result in messages unexpectedly getting lost, although currently if :reply results in any error then the connection is closed, and in practice this probably mostly happens because of connection errors. I'm not sure how many people think about this when choosing between send_frame and :reply, but it is suggested to use :reply instead of send_frame (#34), which I don't think is a good thing if you want reliability.

I've added an additional callback, handle_send_result, which is called on success or error of processing a :reply action. Note that I've not named it handle_reply_result because I think that the :reply action should be renamed since it only makes sense in handle_frame, but not in handle_cast and handle_info. This allows sending a message via :reply to be just as reliable as send_frame since the user is given feedback on whether the message was sent or not.

Azolo commented 2 years ago

I think we have some disagreements on the underlying reliability of the WebSocket protocol. I tried to explain my point of view.

I also feel like you are gutting some important functionality for my personal use case, so I tried to point that out.

On the other hand, I understand the desire to want to get rid of some of the complexity that results from some of the features implemented for ease of use.

I think that unfortunately, there aren't as many changes that you can make without removing behavior that someone like me depends upon. The other problem is that WebSocket servers are so diverse in their implementations that we can't actually assume that the protocol is implemented 100% correctly.

I do really appreciate this though, and hope we can continue having a conversation on how to get this functionality into WebSockex.

BrendanBall commented 2 years ago

I wasn't planning on making so many changes when I set out to add backoff support. Some of the changes I made were for one of the following reasons:

I needed it to get tests passing
Keeping both features added a lot of complexity which seemed unnecessary to me, especially since it meant more things to test
Some features (e.g. reconnect) I considered to be broken as is and would require small changes to upgrade to the new version (and since this library isn't at v1 yet, I see this as acceptable)

However I do prefer small PRs so if we can agree on a subset of the changes then I can try split it up into smaller PRs. I think the main issue was that adding backoff had cascading effects for tests and certain assumptions/implications which prevented me from isolating it as a small PR.

Lets see what we can agree on

Azolo / websockex

implement backoff for reconnect #112