Azolo / websockex

An Elixir Websocket Client
MIT License
515 stars 97 forks source link

Supervised Clients don't raise crash errors #116

Open RaphSfeir opened 2 years ago

RaphSfeir commented 2 years ago

Hello,

If a Websockex client crashes in an unsupervised instance, the log catches the exit error, for instance using a simplified version of the echo_client.exs example:

defmodule EchoClient do
  use WebSockex
  require Logger

  def start_link(opts \\ []) do
    WebSockex.start_link("wss://ws.postman-echo.com/raw", __MODULE__, :fake_state, opts)
  end

  @spec echo(pid, String.t) :: :ok
  def echo(client, message) do
    Logger.info("Sending message: #{message}")
    WebSockex.send_frame(client, {:text, message})
  end

  def echo_supervised(message) do
    Logger.info("Sending message: #{message}")
    WebSockex.send_frame(__MODULE__, {:text, message})
  end

  def handle_connect(_conn, state) do
    Logger.info("Connected!")
    {:ok, state}
  end

  def handle_frame({:text, msg}, :fake_state) do
    Logger.info("Received Message: #{msg}")
    if :rand.uniform() > 0.5 do
      Logger.info("Raising random error to crash the client")
      raise ArgumentError
    end
    {:ok, :fake_state}
  end
end

{:ok, pid} = EchoClient.start_link()

EchoClient.echo(pid, "Yo Homies!")
EchoClient.echo(pid, "This and That!")
EchoClient.echo(pid, "Can you please reply yourself?")

Returns something like

[info] Raising random error to crash the client
** (EXIT from #PID<0.95.0>) {%ArgumentError{message: "argument error"}, []}

Now using a supervised version, the Client when crashing just silently restarts:

For instance using the Websockex Application, just for this example:

defmodule WebSockex.Application do
  use Application
  """
  def start(_type, _args) do
    unless URI.default_port("ws"), do: URI.default_port("ws", 80)
    unless URI.default_port("wss"), do: URI.default_port("wss", 443)

    Supervisor.start_link(
      [
        {EchoClient, name: EchoClient}
      ],
      strategy: :one_for_one,
      name: WebSockex.Supervisor
    )
  end
end

Then we can run

iex -S mix
iex(0)> EchoClient.echo_supervised("hi")

When the error raises, we just get

[info] Raising random error to crash the client
[info] Connected!

Shouldn't it also catch the crash log like in the unsupervised method? I tried using the debug: [:trace] option, but this logs everything and often is not convenient with logs overflowing. Perhaps I'm missing an option?

Thanks!

mgwidmann commented 1 year ago

I am using https://hex.pm/packages/parent to supervise my websocket because its particularly unstable and have been noticing the same things. I'll have to try with a regular supervisor to rule out parent, but can confirm seeing a similar issue.

mgwidmann commented 1 year ago

Something to add here... When it gets restarted by parent, it logs the :EXIT message which was trapped, which is:

[info]04:03:03.453 [error] GenServer #PID<0.5829.0> terminating
[info]** (stop) :too_many_restarts
[info]    (parent 0.12.1) lib/parent.ex:822: Parent.give_up!/3
[info]    (parent 0.12.1) lib/parent/restart.ex:34: anonymous fn/2 in Parent.Restart.record_restart/2
[info]    (elixir 1.14.1) lib/enum.ex:1288: anonymous fn/3 in Enum.flat_map_reduce/3
[info]    (elixir 1.14.1) lib/enum.ex:4751: Enumerable.List.reduce/3
[info]    (elixir 1.14.1) lib/enum.ex:1287: Enum.flat_map_reduce/3
[info]    (parent 0.12.1) lib/parent/restart.ex:12: Parent.Restart.perform/3
[info]    (parent 0.12.1) lib/parent.ex:781: Parent.handle_stopped_children/4
[info]    (parent 0.12.1) lib/parent.ex:496: Parent.handle_message/1
[info]Last message: {:EXIT, #PID<0.5836.0>, {%FunctionClauseError{module: nil, function: nil, arity: nil, kind: nil, args: nil, clauses: nil}, []}}

It's just a blank %FunctionClauseError{} struct, and [] no arguments. It doesn't appear like anything is actually crashing.

mgwidmann commented 1 year ago

This statement leads me to believe that there is bug crashing the process but that it is being ignored for some Erlang reason:

Unlike in "plain Erlang", proc_lib processes will not generate error reports, which are written to the terminal by the emulator. All exceptions are converted to exits which are ignored by the default logger handler.

https://www.erlang.org/doc/man/proc_lib.html

Unsure what "plain Erlang" means in the context of Erlang documentation 🤔

mgwidmann commented 1 year ago

I tracked my exit to this intentional terminate call here, the message above it is a bit concerning https://github.com/Azolo/websockex/blob/master/lib/websockex.ex#L893

In my terminate/2 callback, I am getting the following args:

[{%FunctionClauseError{module: nil, function: nil, arity: nil, kind: nil, args: nil, clauses: nil}, []}, :ok]

This makes me believe that for some reason, Websockex is terminating the process and exiting normally, resulting in no logs being printed.

mgwidmann commented 1 year ago

I have found a fix that at least exposes the error for the time being. Though, something is going on which I don't fully understand so I'm not sure its the correct fix so I won't open a PR. Reason being is, about 10% of the tests all fail since the parent process seems to be expected to receive the exit signal and log the error for you in this case. Instead, after failure in my branch, websockex is logging the error and then exiting. I'm not sure if this is the correct behavior or not but as you've pointed out the supervisor isn't logging the error either.

https://github.com/mgwidmann/websockex/tree/report-errors

Azolo commented 1 year ago

Sorry, I've been dealing with some family issues.

There was something wrong with the Elixir Logger in older versions of Elixir and while it would capture the Erlang Logger it didn't act as expected when using it directly. I don't remember the exact details.

However, since then I know the Logger has changed and is much integrated with the underlying BEAM logging. A simple error log before the error termination is probably enough now. Much the same way GenServer provides an exit stacktrace and an exit error message.

RaphSfeir commented 1 year ago

Sorry but I'm not sure I understand, how to know where the error termination will occur, especially if the functions call other modules?

RaphSfeir commented 1 year ago

Is there any more information on this?

We are considering using this library as a websocket client at my company but this issue is a big blocker. Just some hints on how to deal with it would already be great, I wouldn't mind working on a PR if I get the time.

Thanks!

mgwidmann commented 1 year ago

I've noticed my websockex process becoming a zombie while running in my "production" system (which is not really production but runs all the time on a server). When I took a look, it had only done 2 reductions and was in some sort of error state but didn't say what had happened. I killed the process before taking note of which line in websockex.ex it had failed at. Other than that, I've mostly been getting errors reported properly to the logger with my above branch. However, I wouldn't quite say its ready for real production use until this is sorted out fully.

Azolo commented 1 year ago

@RaphSfeir So if we take a look at the Erlang gen_server module it happens in terminate under specific conditions.

So, we would look at the terminate function in WebSockex. For now we can just use Exception.format/3.

By adding something like these clauses to the reason case statements:

{%_{__exception__: true} = exception, trace} ->
  Logger.error(Exception.format(:error, exception, trace))
%_{__exception__: true} = exception ->
  Logger.error(Exception.format(:error, exception))

_NOTE: It looks like I also need to wrap the mod.terminate statement in another case statement in case there's a failure in the using module's terminate callback. Which would basically replace the reason.

Literally, that plus tests should be all that we would need to get proper crash logging in place. But it's kind of an educated guess for me. If I get it done soon.

But this is probably another 3-4 hour fix for me, and I'm barely able to take an hour to research this. 😢

Azolo commented 1 year ago

When I took a look, it had only done 2 reductions and was in some sort of error state but didn't say what had happened. I killed the process before taking note of which line in websockex.ex it had failed at.

@mgwidmann I'm pretty sure this is an instance of #96 and I would kill for any more information on it. I have been completely unable to replicate it.

mgwidmann commented 1 year ago

@Azolo I've posted on the elixir forum some more detailed information. See if that looks to be the same to you. https://elixirforum.com/t/debugging-stuck-zombie-process/54301/2

RaphSfeir commented 1 year ago

@RaphSfeir So if we take a look at the Erlang gen_server module it happens in terminate under specific conditions.

So, we would look at the terminate function in WebSockex. For now we can just use Exception.format/3.

By adding something like these clauses to the reason case statements:

{%_{__exception__: true} = exception, trace} ->
  Logger.error(Exception.format(:error, exception, trace))
%_{__exception__: true} = exception ->
  Logger.error(Exception.format(:error, exception))

_NOTE: It looks like I also need to wrap the mod.terminate statement in another case statement in case there's a failure in the using module's terminate callback. Which would basically replace the reason.

Literally, that plus tests should be all that we would need to get proper crash logging in place. But it's kind of an educated guess for me. If I get it done soon.

But this is probably another 3-4 hour fix for me, and I'm barely able to take an hour to research this. 😢

Thanks, finally found the time to try your suggestion and dove a bit in the code, and it seems to be working. Took a look at @mgwidmann's fork, and it seems he is using a similar solution to solve this.

I'd be happy to work on a PR, but I noticed a lot of them are not merged or getting an answer. Is this repo still maintained, or shall I go work on my fork?

On this topic, related to another PR that was opened, I would be very happy to work more actively on this repo and I imagine I am not alone in this case, to keep up the pace if @Azolo is understandably busy with other things. But of course we'd need your greenlight for that, and closure on the contribution guidelines PR. I am using Websocket and Elixir quite often these days so I'd really like that repo to be well maintained, along with the help of the community.

Thanks!