TwP / logging

A flexible logging library for use in Ruby programs based on the design of Java's log4j library.
https://rubygems.org/gems/logging
MIT License

Synchronize log rolling #216

Closed · TwP closed 4 years ago

TwP commented 4 years ago

There is a bug in the log rolling code where multiple threads in the same process can contend when rolling the files. The result is that we delete log files and drop log messages. I grabbed the example from @lzap and modified the code a bit to get a better handle on the error.

require 'logging'

filename = "tmp/rolling.log"

Logging.raise_errors = true
Logging.appenders.rolling_file filename,
  keep: 20,
  size: 400_000,
  age: 2,
  layout: Logging.layouts.pattern(pattern: "[%d] %T %5l -- %c: %m\n")

log = Logging.logger.root
log.add_appenders(Logging.appenders[filename])
log.level = :debug

threads = []

(1..30).each { |thread_count|
  threads << Thread.new do
    Thread.current[:name] = "T-%02d" % thread_count
    (1..1_000).each { |n|
      log.debug "#{n} a very nice little debug message"
      log.warn "#{n} this is your last warning"
    }
  end
}

threads.each {|t| t.join}

This little script will produce a total of 60,000 log lines. We can verify that all log lines are present by running this script and then counting the lines in the generated log files ...

> ruby -I lib roly.rb
> cat tmp/*.log | wc -l
60000

And now we are getting the expected 60,000 log lines.

refs #206 /cc @lzap

lzap commented 4 years ago

Thanks for looking into this. I've modified the script to use /tmp, which is tmpfs on my Fedora, just to prevent SSD wear. However, I still see some entries being dropped :-(

$ cat /tmp/*.log | wc -l
59962

It's random:

$ cat /tmp/*.log | wc -l
59971
lzap commented 4 years ago

Okay, it looks like those lines are being dropped around the time the files are rolled. Without much understanding of the code, I just blindly made this change:

diff --git a/lib/logging/appenders/rolling_file.rb b/lib/logging/appenders/rolling_file.rb
index a187fbc..a8eced2 100644
--- a/lib/logging/appenders/rolling_file.rb
+++ b/lib/logging/appenders/rolling_file.rb
@@ -157,7 +157,9 @@ module Logging::Appenders
       return self if @io.nil?

       str = str.force_encoding(encoding) if encoding && str.encoding != encoding
-      @io.flock_sh { @io.write str }
+      @mutex.synchronize {
+        @io.flock_sh { @io.write str }
+      }

       if roll_required?
         @mutex.synchronize {

And it fixes the problem; I am able to scale up to 100,000 lines on my 16-core AMD CPU without a single line being dropped.

lzap commented 4 years ago

Which actually raises an important question: it looks like this is a common bug across all appenders. The mutex attribute is created in the Appender base class, but it is then used across appenders, e.g. in reopen in the File, Syslog, and String appenders. The problem is that most of these classes use the write method from the IO base class, which does not synchronize on the mutex from the Appender class. If I am not mistaken, all of these classes are subject to the same problem: when the output is being reopened, logs can be dropped.

I think it would be safer to declare the mutex for each individual appender. This would encourage taking more care to wrap all methods in synchronized blocks per individual class. Also, the shared mutex for all appenders represents a bottleneck: if I have multiple appenders and one of them is holding the mutex during some expensive operation, none of the other appenders can write. Now, I am on thin ice here and maybe that's not the case.

The mutex could also be defined in another abstract class via the decorator design pattern, which would make sure all of the important methods (e.g. write) are synchronized. It would declare the mutex too, so it's only available to its children.
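To illustrate the decorator idea, here is a minimal sketch. The class names (`SynchronizedAppender`, `ArrayAppender`) and method signatures are hypothetical and not part of the logging gem's API; the point is only that the decorator owns its own mutex and serializes write and reopen, so a concurrent reopen can never interrupt a write in progress.

```ruby
# Hypothetical decorator: wraps any appender-like object and serializes
# write/reopen with a mutex owned by the decorator itself.
class SynchronizedAppender
  def initialize(appender)
    @appender = appender
    @mutex = Mutex.new  # one mutex per decorated appender, not shared
  end

  def write(str)
    @mutex.synchronize { @appender.write(str) }
    self
  end

  def reopen
    @mutex.synchronize { @appender.reopen }
    self
  end
end

# A minimal in-memory appender used only to demonstrate the decorator.
class ArrayAppender
  attr_reader :lines

  def initialize
    @lines = []
  end

  def write(str)
    @lines << str
    self
  end

  def reopen
    self
  end
end

inner = ArrayAppender.new
app = SynchronizedAppender.new(inner)
threads = (1..8).map do
  Thread.new { 250.times { app.write("line"); app.reopen } }
end
threads.each(&:join)
puts inner.lines.size  # 8 threads x 250 writes = 2000, none dropped
```

Because each decorated appender gets its own mutex, one slow appender holding its lock would not block writes going to other appenders.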

lzap commented 4 years ago

Yup, I am able to reproduce:

require_relative 'lib/logging'

filename = "/tmp/file.log"

Logging.raise_errors = true
appender = Logging.appenders.file filename, layout: Logging.layouts.pattern(pattern: "[%d] %T %5l -- %c: %m\n")

log = Logging.logger.root
log.add_appenders(Logging.appenders[filename])
log.level = :debug

threads = []

(1..16).each { |thread_count|
  threads << Thread.new do
    Thread.current[:name] = "T-%02d" % thread_count
    (1..1_000).each { |n|
      log.debug "#{n} a very nice little debug message"
      log.warn "#{n} this is your last warning"
    }
  end
}

appender.reopen
appender.reopen

threads.each {|t| t.join}

When I try this, I actually get an exception because other threads try to write while the stream is closed:

/home/lzap/work/logging/lib/logging/appenders/io.rb:77:in `write': closed stream (IOError)
Traceback (most recent call last):
        11: from /home/lzap/work/logging/lib/logging/diagnostic_context.rb:474:in `block in create_with_logging_context'
        10: from test-reopen.rb:17:in `block (2 levels) in <main>'
         9: from test-reopen.rb:17:in `each'
         8: from test-reopen.rb:18:in `block (3 levels) in <main>'
         7: from /home/lzap/work/logging/lib/logging/logger.rb:93:in `debug'
         6: from /home/lzap/work/logging/lib/logging/logger.rb:406:in `log_event'
         5: from /home/lzap/work/logging/lib/logging/logger.rb:406:in `each'
         4: from /home/lzap/work/logging/lib/logging/logger.rb:406:in `block in log_event'
         3: from /home/lzap/work/logging/lib/logging/appender.rb:81:in `append'
         2: from /home/lzap/work/logging/lib/logging/appenders/buffering.rb:285:in `write'
         1: from /home/lzap/work/logging/lib/logging/appenders/io.rb:77:in `canonical_write'
/home/lzap/work/logging/lib/logging/appenders/io.rb:77:in `write': closed stream (IOError)

Something like this (this is really ugly) fixes it:

diff --git a/lib/logging/appenders/file.rb b/lib/logging/appenders/file.rb
index 9ed328a..f48c97e 100644
--- a/lib/logging/appenders/file.rb
+++ b/lib/logging/appenders/file.rb
@@ -94,5 +94,12 @@ module Logging::Appenders
     rescue Errno::EEXIST
       open_file
     end
+
+    def canonical_write( str )
+      @mutex.synchronize {
+        super(str)
+      }
+      self
+    end
   end
 end

All good now:

$ cat /tmp/*.log | wc -l
64000
lzap commented 4 years ago

Quickly looking at the code, I think it is reasonable to implement a SynchronizedIO appender that defines the mutex. Only the Console appender does not need synchronization; does it even make sense to reopen stdout? During process daemonization, programs reopen the standard streams themselves rather than the appender, so I think I am right. Therefore all the appenders except Console would use the new SynchronizedIO abstract class instead.
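A rough sketch of what such a SynchronizedIO layer could look like, assuming it sits between the concrete appenders and the underlying stream. The class and method names here are illustrative only, not the gem's actual class layout; the key property is that write and reopen take the same mutex, so a write can never hit a stream another thread has just closed.

```ruby
require 'stringio'

# Hypothetical SynchronizedIO: owns its own mutex and synchronizes the
# write path against reopen, closing and swapping the stream atomically.
class SynchronizedIO
  def initialize(io)
    @io = io
    @mutex = Mutex.new
  end

  def canonical_write(str)
    @mutex.synchronize { @io.write(str) }
    self
  end

  # Close the old stream and swap in the new one under the same mutex,
  # so no writer can observe a closed stream.
  def reopen(new_io)
    @mutex.synchronize do
      @io.close
      @io = new_io
    end
    self
  end
end

old_io = StringIO.new
new_io = StringIO.new
app = SynchronizedIO.new(old_io)

writers = (1..4).map do
  Thread.new { 100.times { app.canonical_write("x\n") } }
end
app.reopen(new_io)  # swap streams while writers are running
writers.each(&:join)

# Every line landed in one stream or the other; no IOError was raised.
total = old_io.string.lines.size + new_io.string.lines.size
puts total  # 4 threads x 100 writes = 400
```

With the unsynchronized code, the reopen in the middle would race with the writers and raise the `closed stream (IOError)` shown above.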

TwP commented 4 years ago

I saw the missing log lines as well. I'm going to merge this PR since it addresses your original issue, and I will tackle the synchronized writing (and a few other bugs) in separate PRs. This is largely to help me keep the release notes straight.

More fixes forthcoming :smile_cat: