Open RaphSfeir opened 2 years ago
I am using https://hex.pm/packages/parent to supervise my websocket because its particularly unstable and have been noticing the same things. I'll have to try with a regular supervisor to rule out parent, but can confirm seeing a similar issue.
Something to add here... When it gets restarted by parent, it logs the :EXIT
message which was trapped, which is:
[info]04:03:03.453 [error] GenServer #PID<0.5829.0> terminating
[info]** (stop) :too_many_restarts
[info] (parent 0.12.1) lib/parent.ex:822: Parent.give_up!/3
[info] (parent 0.12.1) lib/parent/restart.ex:34: anonymous fn/2 in Parent.Restart.record_restart/2
[info] (elixir 1.14.1) lib/enum.ex:1288: anonymous fn/3 in Enum.flat_map_reduce/3
[info] (elixir 1.14.1) lib/enum.ex:4751: Enumerable.List.reduce/3
[info] (elixir 1.14.1) lib/enum.ex:1287: Enum.flat_map_reduce/3
[info] (parent 0.12.1) lib/parent/restart.ex:12: Parent.Restart.perform/3
[info] (parent 0.12.1) lib/parent.ex:781: Parent.handle_stopped_children/4
[info] (parent 0.12.1) lib/parent.ex:496: Parent.handle_message/1
[info]Last message: {:EXIT, #PID<0.5836.0>, {%FunctionClauseError{module: nil, function: nil, arity: nil, kind: nil, args: nil, clauses: nil}, []}}
It's just a blank %FunctionClauseError{}
struct, and []
no arguments. It doesn't appear like anything is actually crashing.
This statement leads me to believe that there is bug crashing the process but that it is being ignored for some Erlang reason:
Unlike in "plain Erlang", proc_lib processes will not generate error reports, which are written to the terminal by the emulator. All exceptions are converted to exits which are ignored by the default logger handler.
https://www.erlang.org/doc/man/proc_lib.html
Unsure what "plain Erlang" means in the context of Erlang documentation 🤔
I tracked my exit to this intentional terminate call here, the message above it is a bit concerning https://github.com/Azolo/websockex/blob/master/lib/websockex.ex#L893
In my terminate/2
callback, I am getting the following args:
[{%FunctionClauseError{module: nil, function: nil, arity: nil, kind: nil, args: nil, clauses: nil}, []}, :ok]
This makes me believe that for some reason, Websockex
is terminating the process and exiting normally, resulting in no logs being printed.
I have found a fix that at least exposes the error for the time being. Though, something is going on which I don't fully understand so I'm not sure its the correct fix so I won't open a PR. Reason being is, about 10% of the tests all fail since the parent process seems to be expected to receive the exit signal and log the error for you in this case. Instead, after failure in my branch, websockex is logging the error and then exiting. I'm not sure if this is the correct behavior or not but as you've pointed out the supervisor isn't logging the error either.
Sorry, I've been dealing with some family issues.
There was something wrong with the Elixir Logger in older versions of Elixir and while it would capture the Erlang Logger it didn't act as expected when using it directly. I don't remember the exact details.
However, since then I know the Logger has changed and is much integrated with the underlying BEAM logging. A simple error log before the error termination is probably enough now. Much the same way GenServer provides an exit stacktrace and an exit error message.
Sorry but I'm not sure I understand, how to know where the error termination will occur, especially if the functions call other modules?
Is there any more information on this?
We are considering using this library as a websocket client at my company but this issue is a big blocker. Just some hints on how to deal with it would already be great, I wouldn't mind working on a PR if I get the time.
Thanks!
I've noticed my websockex process becoming a zombie while running in my "production" system (which is not really production but runs all the time on a server). When I took a look, it had only done 2 reductions and was in some sort of error state but didn't say what had happened. I killed the process before taking note of which line in websockex.ex
it had failed at. Other than that, I've mostly been getting errors reported properly to the logger with my above branch. However, I wouldn't quite say its ready for real production use until this is sorted out fully.
@RaphSfeir So if we take a look at the Erlang gen_server
module it happens in terminate
under specific conditions.
So, we would look at the terminate
function in WebSockex. For now we can just use Exception.format/3
.
By adding something like these clauses to the reason
case statements:
{%_{__exception__: true} = exception, trace} ->
Logger.error(Exception.format(:error, exception, trace))
%_{__exception__: true} = exception ->
Logger.error(Exception.format(:error, exception))
_NOTE: It looks like I also need to wrap the mod.terminate
statement in another case
statement in case there's a failure in the using module's terminate callback. Which would basically replace the reason.
Literally, that plus tests should be all that we would need to get proper crash logging in place. But it's kind of an educated guess for me. If I get it done soon.
But this is probably another 3-4 hour fix for me, and I'm barely able to take an hour to research this. 😢
When I took a look, it had only done 2 reductions and was in some sort of error state but didn't say what had happened. I killed the process before taking note of which line in
websockex.ex
it had failed at.
@mgwidmann I'm pretty sure this is an instance of #96 and I would kill for any more information on it. I have been completely unable to replicate it.
@Azolo I've posted on the elixir forum some more detailed information. See if that looks to be the same to you. https://elixirforum.com/t/debugging-stuck-zombie-process/54301/2
@RaphSfeir So if we take a look at the Erlang
gen_server
module it happens interminate
under specific conditions.So, we would look at the
terminate
function in WebSockex. For now we can just useException.format/3
.By adding something like these clauses to the
reason
case statements:{%_{__exception__: true} = exception, trace} -> Logger.error(Exception.format(:error, exception, trace)) %_{__exception__: true} = exception -> Logger.error(Exception.format(:error, exception))
_NOTE: It looks like I also need to wrap the
mod.terminate
statement in anothercase
statement in case there's a failure in the using module's terminate callback. Which would basically replace the reason.Literally, that plus tests should be all that we would need to get proper crash logging in place. But it's kind of an educated guess for me. If I get it done soon.
But this is probably another 3-4 hour fix for me, and I'm barely able to take an hour to research this. 😢
Thanks, finally found the time to try your suggestion and dove a bit in the code, and it seems to be working. Took a look at @mgwidmann's fork, and it seems he is using a similar solution to solve this.
I'd be happy to work on a PR, but I noticed a lot of them are not merged or getting an answer. Is this repo still maintained, or shall I go work on my fork?
On this topic, related to another PR that was opened, I would be very happy to work more actively on this repo and I imagine I am not alone in this case, to keep up the pace if @Azolo is understandably busy with other things. But of course we'd need your greenlight for that, and closure on the contribution guidelines PR. I am using Websocket and Elixir quite often these days so I'd really like that repo to be well maintained, along with the help of the community.
Thanks!
Hello,
If a
Websockex
client crashes in an unsupervised instance, the log catches the exit error, for instance using a simplified version of theecho_client.exs
example:Returns something like
Now using a supervised version, the Client when crashing just silently restarts:
For instance using the
Websockex
Application
, just for this example:Then we can run
When the error raises, we just get
Shouldn't it also catch the crash log like in the unsupervised method? I tried using the
debug: [:trace]
option, but this logs everything and often is not convenient with logs overflowing. Perhaps I'm missing an option?Thanks!