Azolo / websockex

An Elixir Websocket Client
MIT License
520 stars 100 forks

Websockex is causing my whole supervisor to shutdown #106

Closed kuon closed 2 years ago

kuon commented 2 years ago

I have a WebSockex client inside a Phoenix app, and if I turn on debug info, I see the following logs:

*DBG* #PID<0.632.0> sending frame: {:text, "2"}
*DBG* #PID<0.632.0> received frame: {:text, "3"}
*DBG* #PID<0.632.0> had the connection closed unexpectedly by the remote server
*DBG* #PID<0.819.0> attempting to connect
*DBG* #PID<0.822.0> attempting to connect
*DBG* #PID<0.825.0> attempting to connect
*DBG* #PID<0.829.0> attempting to connect
*DBG* #PID<0.833.0> attempting to connect
*DBG* #PID<0.837.0> attempting to connect
[notice] Application box exited: shutdown

What is strange is that my whole Phoenix app (called Box) is stopped.

Also I am wondering why the websocket does not reconnect.

If I call Application.ensure_all_started(:box) from the iex prompt, it reconnects immediately.

My tree is:

- Box.Application - the default Phoenix supervisor app; I just added Box.SocketClient to the children list.
- Box.SocketClient - a GenServer; in init I call Box.SocketIO.Socket.start_link()
- Box.Socket - has `use WebSockex`

defmodule Box.SocketIO.Socket do
  use WebSockex
  require Logger

  def start_link(client, opts \\ []) do
    sockopts =
      if Logger.level() == :debug do
        [debug: [:trace]]
      else
        []
      end

    opts = Enum.into(opts, %{})
    endpoint = "wss://#{opts.endpoint}/socket.io/"

    uri = URI.parse(endpoint)

    params =
      Map.get(opts, :params, %{})
      |> Map.merge(%{
        transport: "websocket",
        EIO: 3
      })

    uri = %{uri | query: URI.encode_query(params)}
    uri = URI.to_string(uri)

    WebSockex.start_link(
      uri,
      __MODULE__,
      client,
      sockopts
    )
  end

  def handle_connect(_conn, client) do
    IO.inspect(client)
    {:ok, client}
  end

  def handle_frame({:text, msg}, client) do
    send(client, {:frame, msg})
    {:ok, client}
  end
end
kuon commented 2 years ago

I removed the GenServer and used the WebSockex state directly, and the issue is gone. I guess WebSockex must be doing something that is not compatible with being nested in a GenServer.

kuon commented 2 years ago

I finally understood what was happening. It was not the GenServer's fault; that was just a coincidence.

The websocket endpoint was temporarily offline for a second, which triggered a disconnection that killed the process. The supervisor immediately restarted the websocket process, but the endpoint wasn't back up yet, so the process died again and the supervisor restarted it again. This quickly exceeded the supervisor's restart intensity ("max_restarts") and killed the whole application.
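For context, the restart intensity mentioned above corresponds to the supervisor's `:max_restarts` / `:max_seconds` options, whose OTP defaults are 3 restarts within 5 seconds. A minimal sketch of what that looks like in an application supervisor (the module names here are illustrative, not taken from the issue; the option values just restate the OTP defaults):

```elixir
# Sketch: an application supervisor with the default restart
# intensity made explicit. Six rapid reconnect failures, as in
# the log above, easily exceed 3 restarts per 5 seconds, so the
# supervisor gives up and the application exits with :shutdown.
children = [
  Box.SocketClient
]

Supervisor.start_link(children,
  strategy: :one_for_one,
  max_restarts: 3,  # at most 3 restarts... (OTP default)
  max_seconds: 5    # ...within any 5-second window (OTP default)
)
```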

The solution I ended up using is to wrap the websocket process in an intermediate process. This intermediate process traps exits and applies an exponential backoff when retrying the connection.
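That wrapper approach can be sketched roughly as follows. This is a sketch under assumptions, not kuon's actual code: the module name `Box.SocketWrapper` and the backoff schedule are hypothetical, and it assumes `Box.SocketIO.Socket.start_link/2` links the socket to the caller as in the snippet above.

```elixir
defmodule Box.SocketWrapper do
  @moduledoc """
  Hypothetical sketch: a GenServer that traps exits and restarts the
  WebSockex process itself with exponential backoff, so socket crashes
  never count against the application supervisor's restart intensity.
  """
  use GenServer
  require Logger

  def start_link(opts), do: GenServer.start_link(__MODULE__, opts)

  @impl true
  def init(opts) do
    Process.flag(:trap_exit, true)
    {:ok, %{opts: opts, attempt: 0, socket: nil}, {:continue, :connect}}
  end

  @impl true
  def handle_continue(:connect, state) do
    case Box.SocketIO.Socket.start_link(self(), state.opts) do
      {:ok, pid} ->
        {:noreply, %{state | socket: pid, attempt: 0}}

      {:error, reason} ->
        Logger.warning("connect failed: #{inspect(reason)}")
        schedule_retry(state.attempt)
        {:noreply, %{state | attempt: state.attempt + 1}}
    end
  end

  @impl true
  def handle_info({:EXIT, pid, reason}, %{socket: pid} = state) do
    # The linked socket died; back off instead of crashing ourselves.
    Logger.warning("socket exited: #{inspect(reason)}")
    schedule_retry(state.attempt)
    {:noreply, %{state | socket: nil, attempt: state.attempt + 1}}
  end

  def handle_info(:retry, state), do: {:noreply, state, {:continue, :connect}}
  def handle_info(_msg, state), do: {:noreply, state}

  # Exponential backoff capped at 30 s: 1 s, 2 s, 4 s, 8 s, ...
  defp schedule_retry(attempt) do
    delay = min(:timer.seconds(30), :timer.seconds(1) * Integer.pow(2, attempt))
    Process.send_after(self(), :retry, delay)
  end
end
```

Because the wrapper never crashes when the socket does, the application supervisor only ever sees one healthy child, and the reconnect policy lives entirely in the wrapper.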

Azolo commented 2 years ago

Sorry, I've been sick/busy for the last week or so.

There is actually an actionable item here: better logging when the process crashes. Right now, if it's under a supervisor, the error is eaten by the supervisor trap and never shown.

When I first wrote WebSockex, there was an issue with capturing stack traces correctly that prevented producing a good error message with a stacktrace. That has since been fixed and needs to be implemented here.