Azolo / websockex

An Elixir Websocket Client
MIT License
515 stars 97 forks source link

Handling {:remote, :closed} ? #43

Closed mnussbaumer closed 6 years ago

mnussbaumer commented 6 years ago

Hey, I've been using websockex in a project and it's working great (thanks for writing it). I do have one problem though.

I initiate a Supervisor with a simple_one_for_one that is responsible for initializing a websockex. It calls websockex with start (not start_link), but whenever the socket in which I'm listening is closed websockex will raise an exception, that in production gets logged as this:

CRASH REPORT==== 11-Nov-2017::12:47:29 === crasher: initial call: Elixir.WebSockex:init/5 pid: <0.1340.0> registered_name: [] exception exit: {remote,closed} in function 'Elixir.WebSockex':terminate/4 (lib/websockex.ex, line 870) ancestors: ['Elixir.Channeler.Monitor','Elixir.Channeler.Supervisor', <0.1284.0>] message_queue_len: 0 messages: [] links: [] dictionary: [] trap_exit: false status: running heap_size: 28690 stack_size: 27 reductions: 33897241 neighbours:

I have a terminate callback and I can see on the reason {:remote, :closed}, but would like to know how I can handle the :closed event so that it doesn't technically "crash"? I'm fine with writing the explanation to the docs and PR'ing if you would like, along with a section making notice of both start and start_link. Thanks

Azolo commented 6 years ago

Alright so, from the standpoint of the WebSocket spec a {:remote, :closed} is an exceptional event, since the spec requires you to negotiate closing a socket.

However, this was asked somewhere before and there are the situations where an error like this isn't log worthy. So basically, my advice was to exit(:normal) from the terminate/2 callback.

But this behavior is supported by the fact that I don't catch any errors in terminate/2 and report them in a nice format like the other callbacks. So I need to make sure that there's at least a test for this behavior if I'm recommending it. 🤐

I don't know where you would document that, but that's probably because of my indecisiveness. If you were to document it where it makes the most sense to you and submit a PR, that would be awesome for me.

mnussbaumer commented 6 years ago

Hey, so I'm trying to write a test for it but I'm not able to make it work:

defmodule TerminateClient do
    use WebSockex
    def start(url, state, opts \\ []) do
      WebSockex.start(url, __MODULE__, state, opts)
    end
    def terminate(_reason, _state) do
      IO.puts("no no")
      exit(:normal)
    end
  end

  test "doesnt_raise_error_on_exit_normal", context do
    {:ok, _} = TerminateClient.start(context.url, %{})
    WebSockex.TestServer.shutdown(context.server_ref)
    :timer.sleep 500
  end
Micaels-MacBook-Pro:websockex mnussbaumer$ mix test
...................................................................................................................................no no

  1) test doesnt_raise_error_on_exit_normal (WebSockexTest)
     test/websockex_test.exs:1375
     ** (EXIT from #PID<0.7888.0>) {:remote, :closed}

.............................................................

Finished in 2.7 seconds
193 tests, 1 failure

Randomized with seed 393944

But I'm 100% positive that I no longer get crashes on my production server, where basically it's exactly the same thing I'm doing there.

also: the :timer call is just to make sure it has enough time to receive the close event, since I don't want to trap exits, probably should just use your catch_attr? If I remove the timer, then it just concludes, most of the times without error.

I've added the module and the test to the end of websockex_test.exs

Any idea?

Azolo commented 6 years ago

Yeah, just as a general rule I wouldn't recommend using timers to wait for events to happen.

But we don't actually NEED to close the socket. We can just make it LOOK like it's closed. Get the socket from the conn during handle_connect/2 and then send(client_pid, {:tcp_closed, socket}).

Then if we monitor the started process we can listen for a :DOWN message to ensure that we exited normally.

But the reason that you're getting a failure is because there's a linked WebSockex process that started during setup that is exiting when you shutdown the server. You would need to unlink that or not shutdown the server.

mnussbaumer commented 6 years ago

@Azolo thanks! Indeed that makes sense. So I've got a passing test, that fails whenever I remove exit(:normal) from the terminate/2 callback. But I've noticed a weird behaviour that I'm not sure I understand.

defmodule TerminateClient do
    use WebSockex
    def start(url, state, opts \\ []) do
      WebSockex.start(url, __MODULE__, state, opts)
    end

    def handle_info(:get_socket, state) do
      {:noreply, state}
    end

    def terminate(_reason, _state) do
      exit(:normal)
    end
  end

  test "doesnt_raise_error_on_exit_normal", context do
    {:ok, pid} = TerminateClient.start(context.url, %{})
    Process.monitor(pid)
    send(pid, {:tcp_closed, :socket})
    assert_receive {:DOWN, _, :process, ^pid, :normal}
  end

This passes, although I'm not setting any handler for {:tcp_closed, :socket}. But if I comment out def handle_info(:get_socket....) ... end I get the following error:

1) test doesnt_raise_error_on_exit_normal (WebSockexTest)
     test/websockex_test.exs:1383
     No message matching {:DOWN, _, :process, ^pid, :normal} after 100ms.
     The following variables were pinned:
       pid = #PID<0.301.0>
     Process mailbox:
       #PID<0.303.0>
     code: assert_receive {:DOWN, _, :process, ^pid, :normal}
     stacktrace:
       test/websockex_test.exs:1389: (test)

     The following output was logged:

     12:15:31.579 [error] No handle_info/2 clause in Elixir.WebSockexTest.TerminateClient provided for {:tcp_closed, :socket}

I see the pid of the process in the mailbox is different than the socket pid, I'm assuming that's the pid monitoring? But why does it pass when I simply make a handle_info that is unrelated to :tcp_closed?

Azolo commented 6 years ago

So you have a lot of things going on there, heh.

The reason why your test is passing is actually because of a FunctionClauseError.

The default implementation from __using__ in WebSockex sets handle_info/2 as a pass through.

Which is why when you define and override it you get the error, which terminates and causes the terminate/2 callback to trigger where you exit(:normal). If you were to change it to terminate({:remote, :closed}, _) then you would see the FunctionClauseError.

The pid in your mailbox is actually the server process pid. The server is setup so that it sends it's socket pid down and you can manipulate it.

But {:tcp_closed, socket} is actually handled by WebSockex and wouldn't make it to the handle_info/2 callback. But you need the socket variable from handle_connect/2.

HOWEVER, you just reminded me of something just now. Try this:

test "can exit from terminate", %{pid: orig_pid} = context do
  defmodule TerminateClient do
    use WebSockex
    def start(url, state, opts \\ []) do
      WebSockex.start(url, __MODULE__, state, opts)
    end

    def terminate(:test_reason, _state) do
      exit(:normal)
    end
  end

  Process.unlink(orig_pid)
  ref = Process.monitor(orig_pid)

  :sys.terminate(orig_pid, :test_reason)
  assert_receive {:DOWN, ^ref, :process, ^orig_pid, :test_reason}

  {:ok, pid} = TerminateClient.start(context.url, %{})
  ref = Process.monitor(pid)

  :sys.terminate(pid, :test_reason)
  assert_receive {:DOWN, ^ref, :process, ^pid, :normal}
end

:sys.terminate/2 should hit system_terminate/4 and test our case without any unnecessary manipulation.

Azolo commented 6 years ago

Closed via #44