elixir-lang / elixir

Elixir is a dynamic, functional language for building scalable and maintainable applications
https://elixir-lang.org/
Apache License 2.0
24.31k stars 3.35k forks source link

Logger never gets out of :discard mode #9028

Closed epsylonix closed 5 years ago

epsylonix commented 5 years ago

Environment

Erlang/OTP 21 Elixir 1.8.1

Current behavior

When the message queue size exceeds the discard threshold Logger enters into the discard mode with a message like that:

Logger has 661 messages in its queue, which is above :discard_threshold. Messages will be discarded until the message queue goes back to 75% of the threshold size

From that moment on all log messages are discarded until the app is restarted

> pid = :erlang.whereis(Logger)           
> Process.info(pid, :message_queue_len)   

{:message_queue_len, 0}

> :sys.get_state(pid)                  
[
  {LoggerFileBackend, :file,
   %{...}},
  {Logger.Config, false,
   {%{
      level: :debug,
      mode: :discard,
      translators: [{Logger.Translator, :translate}],
      truncate: 8096,
      utc_log: false
    },
    %{
      async_threshold: 15,
      discard_threshold: 500,
      keep_threshold: 375,
      sync_threshold: 20
    }}}
]

> require Logger
> Logger.info("test")
# nothing is logged

# manually change logger mode
> st =  :ets.lookup_element(Logger.Config, :data, 2)
%{
level: :debug,
mode: :discard,
translators: [{Logger.Translator, :translate}],
truncate: 8096,
utc_log: false
}

> st = %{st | mode: :async}
> :ets.update_element(Logger.Config, :data, {2, st})

# now logging resumes

I believe I found what the problem is. Logger's mode is updated when a new event arrives here: https://github.com/elixir-lang/elixir/blob/v1.8/lib/logger/lib/logger/config.ex#L78-L102

But when Logger enters the :discard mode, all log messages are dropped and no event get generated: https://github.com/elixir-lang/elixir/blob/v1.8/lib/logger/lib/logger.ex#L672-L678

So it seems that after entering a :discard mode Logger can't get out of it and all the log messages are dropped.

Expected behavior

After the Logger's message queue gets under the discard threshold Logger's mode should be updated appropriately.

josevalim commented 5 years ago

Note we only discard future messages. Whatever is in the inbox will still have to be processed and that should set the threshold back. Unless your whole inbox is filled with messages that are not log messages. But that would be a very pathological case. Do you have any idea why the system could be running into this issue?

josevalim commented 5 years ago

Btw, note this issue does not happen in master as we have descentralized the mode computation.

epsylonix commented 5 years ago

Indeed, I didn't think of that. Now thinking about it I believe we encountered this issue first around the time Sentry log backend was added (https://hex.pm/packages/sentry).

It goes like this:

  1. external service is down
  2. multiple crashes happen simultaneously
  3. Logger gets into the discard mode and never gets back unless the state is changed manually or the app is restarted.

I assumed we started to experience log drops when crashes happen because SentryLogger was slowing down log events processing by reporting each crash to Sentry. This is fine for a short time but obviously I expected logging to resume.

I looked into the Senty client code (https://github.com/getsentry/sentry-elixir/blob/master/lib/sentry/client.ex) - they use Task to send messages so I believe it is possible that each reported crash pushed another non-log messages into the inbox.

josevalim commented 5 years ago

@mitchellhenke Since you are using Task.async (or Task.Supervisor.async_nolink) in Sentry, can you make sure you are calling await? Because if you don't call await, then it will leak messages to the inbox once the task is done. If you are not awaiting, you should use Task.Supervisor.start_child instead.

mitchellhenke commented 5 years ago

Oops, yep!

josevalim commented 5 years ago

@mitchellhenke perfect! Let me know if you want me to review something. I also have a PR to fix this issue on Elixir side here: #9029. Once that is merged I will release a new v1.8.