erlang-lager / lager

A logging framework for Erlang/OTP

Lager exception killing connections on RabbitMQ server #423

Closed dimas closed 6 years ago

dimas commented 6 years ago

Hello. We are using a RabbitMQ server with lager, which is configured as follows:

  {lager, [
      {error_logger_hwm, undefined},
      {handlers, [
        {lager_file_backend, [{file, "/var/log/rabbitmq/lager.log"}, {level, info}, {size, 10240000}, {count, 1} ]},
        {lager_logstash_backend, [{level, info}, {output, {tcp, "logstashhost", 1234}}, {json_encoder, jsx}, {format, json}]}
      ]}
    ]},

We have observed multiple instances of the server losing all connections to all its clients simultaneously, usually accompanied by an error similar to this:

2017-11-13 10:13:40.650 [error] <0.152.0> Lager event handler lager_logstash_backend exited with reason {'EXIT',{badarg,[{jsx_parser,value,4,[{file,"src/jsx_parser.erl"},{line,139}]},{lager_logstash_json_formatter,format,2,[{file,"src/lager_logstash_json_formatter.erl"},{line,39}]},{lager_logstash_backend,handle_log,2,[{file,"src/lager_logstash_backend.erl"},{line,116}]},{lager_logstash_backend,handle_event,2,[{file,"src/lager_logstash_backend.erl"},{line,104}]},{gen_event,server_update,4,[{file,"gen_event.erl"},{line,533}]},{gen_event,server_notify,4,[{file,"gen_event.erl"},{line,515}]},{gen_event,...},...]}}

I raised a question with RabbitMQ (see https://groups.google.com/forum/#!topic/rabbitmq-users/dK7r10t9jYI) and this is the response I got there:

You run into a Lager exception that trips up all inbound connections the moment the server has to log something on them. The JSX parser seems to be given a number (4) as an input.

I am not sure how exactly this is designed to work in Erlang (I know very little about Erlang, tbh), but in the Java world it is one of the requirements that a logging library does not let any exceptions raised in logging backends reach the code that is trying to log. And if I read the response from RabbitMQ's Michael above correctly, this looks like exactly that case.

We are on lager-3.5.1, jsx-2.7.0 and lager_logstash-0.1.3.

Cheers

jadeallenx commented 6 years ago

Hi, I'm sorry you're having this problem. The stack trace you included points the finger pretty clearly at your JSON encoder and/or the message that you're asking it to encode.

I'm not exactly sure what you expect lager to do differently. It is your third-party logstash backend that is propagating the exception from the JSON encoder (jsx); lager is merely passing it along.

As I see it you have 3 options:

dimas commented 6 years ago

@mrallen1, thanks for the super-prompt reply! I understand that the problem originates in the backend and not in lager itself. I was just curious whether it is expected that an error thrown by a backend is not suppressed by lager in some way and instead reaches the actual code that tried to log. I can only repeat that I know very little of Erlang/lager, so please excuse my ignorance, but in Java it is considered very dangerous for a logging framework to become a source of additional errors, so there are two design choices: either the framework catches and suppresses any exception a backend raises, or backends are required never to throw in the first place.

If lager is designed with the latter in mind, then the logstash backend is clearly at fault, but if there is no such requirement, then it feels like a little bit more could be done by lager itself.

Thank you for the suggestions, I will look into the first two options. We probably cannot stop using logstash, as we need to collect logs from multiple machines in one central location and we already have an ELK stack.

Thanks!

jadeallenx commented 6 years ago

Philosophically, Erlang favors a "let it crash" approach. The post I linked is long, but it explains the assumptions and mindset behind how most Erlang developers think errors ought to be handled.

Since the exception in jsx makes RabbitMQ do bad things, probably the right solution to this issue is the first option: make the handling of JSON encoding errors in the logstash backend less brittle. If the exception you describe were happening in one of the "out of the box" lager backends (console or file), that is certainly the fix I would want to provide.
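Roughly, "less brittle" could look like the sketch below. This is not the actual lager_logstash code; the module name, the state record fields and the send/2 helper are placeholders. The point is simply to catch formatter exceptions inside handle_event/2 instead of letting them escape through gen_event:

%% sketch_logstash_backend.erl - a hypothetical backend sketch showing how a
%% formatter/encoder failure (e.g. a jsx badarg) can be contained inside the
%% handler rather than crashing it.
-module(sketch_logstash_backend).
-behaviour(gen_event).

-export([init/1, handle_call/2, handle_event/2, handle_info/2,
         terminate/2, code_change/3]).

%% The record fields here are assumptions for the sketch.
-record(state, {formatter, formatter_config}).

init({Formatter, FormatterConfig}) ->
    {ok, #state{formatter = Formatter, formatter_config = FormatterConfig}}.

handle_event({log, Message}, State = #state{formatter = Formatter,
                                            formatter_config = Config}) ->
    try Formatter:format(Message, Config) of
        IoData ->
            send(IoData, State)
    catch
        Class:Reason ->
            %% Formatting/encoding failed; emit a plain fallback line
            %% instead of letting the exception escape the handler.
            send(io_lib:format("log formatting failed: ~p:~p~n",
                               [Class, Reason]), State)
    end,
    {ok, State};
handle_event(_Event, State) ->
    {ok, State}.

handle_call(_Request, State) -> {ok, ok, State}.
handle_info(_Info, State) -> {ok, State}.
terminate(_Reason, _State) -> ok.
code_change(_OldVsn, State, _Extra) -> {ok, State}.

%% Stand-in for whatever the real backend does with the formatted output
%% (tcp/udp/file).
send(_IoData, _State) -> ok.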

Vagabond commented 6 years ago

I think @dimas is right that this should not happen, or perhaps it is actually not happening and cause and effect are being misdiagnosed.

Lager uses the gen_event behaviour to house its logging backends. The calling process uses gen_event:notify (and sometimes sync_notify) to deliver messages to the backends. This should not fail. The documentation says:

notify/1 does not fail even if the specified event manager does
not exist, unless it is specified as Name.

So, let's test this. Lager has a 'crash backend' in its test suite; let's modify it slightly and add this clause:

%% Crash the handler when the log payload is exactly "boom"; otherwise do nothing.
handle_event({log, Message}, State) ->
    case lager_msg:message(Message) of
        "boom" ->
            erlang:error(boom);
        _ ->
            ok
    end,
    {ok, State};

This will crash when you try to log a message with the payload "boom", but not otherwise. Quickly force a compile of the module with make test (you can abort the command once it starts running tests) and let's load up an Erlang shell with it in our path:

erl -pa ebin deps/*/ebin .eunit

OK, now we have an Erlang shell with everything in the code path; let's configure and start lager:

Erlang/OTP 20 [erts-9.1] [source] [64-bit] [smp:8:8] [ds:8:8:10] [async-threads:10] [kernel-poll:false]

Eshell V9.1  (abort with ^G)
1> application:set_env(lager, handlers, [{lager_console_backend, [{level, debug}]}, {lager_crash_backend, [undefined, undefined]}]).
ok
2> lager:start().
09:49:39.666 [debug] Lager installed handler lager_crash_backend into lager_event
09:49:39.669 [debug] Lager installed handler error_logger_lager_h into error_logger
09:49:39.676 [debug] Supervisor gr_param_sup started gr_param:start_link(gr_lager_default_tracer_params) at pid <0.83.0>
09:49:39.676 [debug] Supervisor gr_counter_sup started gr_counter:start_link(gr_lager_default_tracer_counters) at pid <0.84.0>
09:49:39.676 [debug] Supervisor gr_manager_sup started gr_manager:start_link(gr_lager_default_tracer_params_mgr, gr_lager_default_tracer_params, []) at pid <0.85.0>
09:49:39.677 [debug] Supervisor gr_manager_sup started gr_manager:start_link(gr_lager_default_tracer_counters_mgr, gr_lager_default_tracer_counters, [{input,0},{filter,0},{output,0},{job_input,0},{job_run,0},{job_time,0},{job_error,0}]) at pid <0.86.0>
09:49:39.785 [info] Application lager started on node nonode@nohost
ok
3> lager:status().
Lager status:
Console (lager_event) at level debug
Active Traces:
Tracing Reductions:
Tracing Statistics:
  input: 0
 output: 0
 filter: 0
ok
4> gen_event:which_handlers(lager_event).
[lager_crash_backend,lager_console_backend,
 lager_backend_throttle]

Alright, we can see lager is running with the 'crash' backend installed, ready to explode. Let's see how it looks:

5> self().
<0.60.0>
6> lager:log(info, self(), "hello world").
09:52:35.272 [info] hello world
ok
7> lager:log(info, self(), "boom").
09:52:44.472 [info] boom
09:52:44.472 [error] Lager event handler lager_crash_backend exited with reason {'EXIT',{boom,[{lager_crash_backend,handle_event,2,[{file,"test/lager_crash_backend.erl"},{line,58}]},{gen_event,server_update,4,[{file,"gen_event.erl"},{line,573}]},{gen_event,server_notify,4,[{file,"gen_event.erl"},{line,555}]},{gen_event,handle_msg,6,[{file,"gen_event.erl"},{line,296}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,247}]}]}}
ok
09:52:44.472 [debug] Lager installed handler lager_crash_backend into lager_event
8> 09:52:44.472 [error] gen_event lager_crash_backend installed in lager_event terminated with reason: boom in lager_crash_backend:handle_event/2 line 58
8> self().
<0.60.0>
9> gen_event:which_handlers(lager_event).
[lager_crash_backend,lager_console_backend,
 lager_backend_throttle]

So, as we can see, logging a normal message does not crash, logging a 'boom' does make it go boom, but, crucially, the pid of the calling process (our shell) is not killed. Also note that afterwards the handler is re-installed (and reinitialized).

Now, we are in the async path, let's check the sync path:

10> lager_config:get({lager_event, async}).
true
11> lager_config:set({lager_event, async}, false).
true
12> lager_config:get({lager_event, async}).
false
13> lager:log(info, self(), "hello world").
09:55:14.370 [info] hello world
ok
14> lager:log(info, self(), "boom").
09:55:19.108 [info] boom
ok
15> 09:55:19.109 [error] Lager event handler lager_crash_backend exited with reason {'EXIT',{boom,[{lager_crash_backend,handle_event,2,[{file,"test/lager_crash_backend.erl"},{line,58}]},{gen_event,server_update,4,[{file,"gen_event.erl"},{line,573}]},{gen_event,server_notify,4,[{file,"gen_event.erl"},{line,555}]},{gen_event,handle_msg,6,[{file,"gen_event.erl"},{line,299}]},{proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,247}]}]}}
09:55:19.109 [debug] Lager installed handler lager_crash_backend into lager_event
09:55:19.110 [error] gen_event lager_crash_backend installed in lager_event terminated with reason: boom in lager_crash_backend:handle_event/2 line 58

15> self().
<0.60.0>
16> gen_event:which_handlers(lager_event).
[lager_crash_backend,lager_console_backend,
 lager_backend_throttle]

As we can see, in neither path does the crash in the backend propagate up to the calling process.

Now, what does this mean for your problem? I have two theories:

1. the logstash backend is crashing, trying to reinitialize and getting hung, or
2. your app is crashing for an unrelated reason and logstash is crashing trying to log the message and you're mixing up cause and effect.

Hopefully this helps. Feel free to follow up with any questions or if you think I've missed something.

dimas commented 6 years ago

First of all, guys, thanks a lot for taking the time to write all of that. I cannot guarantee I understood all of it (Erlang is not my speciality), but I tried.

@mrallen1, regarding "let it crash": we are used to the familiar concept of "fail fast", so I understand the approach in general. I just thought of logging as secondary, supporting functionality, not a business-critical one, so it seemed strange that exceptions during logging can break business-critical things. But apparently Erlang's concepts are very different from everything I am used to (I actually read the post you linked in full, and I admit I knew nothing of these supervisor hierarchies, links and how to correctly build reliable Erlang apps before), so I am not sure my arguments make much sense there.

@Vagabond, is there a way to confirm whether it was the case of

the logstash backend is crashing, trying to reinitialize and getting hung

Will there be some special error logged anywhere by some core library or the VM itself?

Regarding the second suspicion:

your app is crashing for an unrelated reason and logstash is crashing trying to log the message and you're mixing up cause and effect

I am going to follow up on the RabbitMQ mailing list, because it was their initial assessment that "You run into a Lager exception that trips up all inbound connections the moment the server has to log something on them."

I am also going to raise these exceptions with https://github.com/rpt/lager_logstash, but given the project's last commit was 3 years ago, I am not holding my hopes too high.

What makes it slightly problematic is that I have no idea how to reproduce the issue; it only happens in production, under load, and very infrequently. A typical heisenbug, according to the post above...

Thank you very much again.

Vagabond commented 6 years ago

Oh, interesting, this is not the backend I thought you were using. I was looking at https://libraries.io/hex/lager_logstash_backend which is quite different.

Looking at the version you're using, I have some ideas of how it could be blocking. What mode are you using it in? It looks like it supports file, tcp and udp modes.

My big guess is gen_tcp:connect() or gen_tcp:send() hanging because of network congestion issues. The backend is not using a separate process to send from, nor is it using a send_timeout; it relies only on TCP keepalives. Sadly, TCP keepalives are not very useful here, as the time they can take to notice a dead connection (depending on kernel parameters) can be quite large. Once the local send buffer fills up, a call to send() will block; eventually lager will switch from async to sync mode because of the mailbox growth, and then log calls will start to block too.

I'm not familiar with logstash, but I could probably propose some simple changes to that backend to make it behave better in adverse conditions.
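For example (just a sketch of the kind of change I mean, not a patch to that backend, and the option values are made up), opening the socket with a send timeout bounds how long a stuck send can block the handler:

%% Hypothetical connect helper: bound both the connect and the send so a
%% stalled peer cannot block the logging handler indefinitely.
connect(Host, Port) ->
    gen_tcp:connect(Host, Port,
                    [binary,
                     {keepalive, true},
                     %% return {error, timeout} instead of blocking forever
                     {send_timeout, 5000},
                     %% close the socket when a send times out
                     {send_timeout_close, true}],
                    5000). %% connect timeout in milliseconds

With send_timeout set, gen_tcp:send/2 returns {error, timeout} once the local buffer is full and the peer stops responding, and the backend can then drop the message or reconnect instead of wedging the whole lager_event process. Doing the send from a separate process would decouple it even further.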

dimas commented 6 years ago

Yeah... the fact that there are at least three backends with the same name does not help much :) I am not even 100% sure I found the right one, but I checked that the line numbers in the stack trace make sense...

We are using TCP, and your idea that the issue is caused by some congestion is in line with my suspicion that there were some logstash server outages around that time.

I am going to follow up with RabbitMQ guys and I also raised https://github.com/rpt/lager_logstash/issues/5

If lager behaves as designed and the issue is with the backend, I guess I should not be overstaying my welcome here and you guys can close this issue. If you have any suggestions for the backend, I would appreciate it if you commented on the backend ticket.

Thank you very much again!

jadeallenx commented 6 years ago

I'm going to close this ticket now, since it seems like things may be resolved. If not, please reopen or open a new ticket. Thanks.