lsyncd spinning 100% CPU, writing "Normal: waiting for 1 more child processes" indefinitely to the log

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. Some event over long periods of runtime triggers this.  It is often 
accompanied by a defunct rsync process, so that seems like the likely problem.  
It appears lsyncd is busy waiting for a process that doesn't exist any longer 
or is defunct.

What is the expected behavior? What does Lsyncd do instead?

lsyncd should not be spinning waiting on a non-existent or defunct rsync 
process.  Instead, the spinning consumes lots of CPU and this never resolves 
itself leaving the logs to fill up the filesystem.  Paste of the output of an 
strace on the process when it is spinning:

...
times(NULL)                             = 678493293
open("/var/log/lsyncd/lsyncd.log", O_WRONLY|O_CREAT|O_APPEND, 0666) = 4
fstat(4, {st_mode=S_IFREG|0644, st_size=6052297879, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 
0x2b1498d46000
fstat(4, {st_mode=S_IFREG|0644, st_size=6052297879, ...}) = 0
lseek(4, 6052297879, SEEK_SET)          = 6052297879
write(4, "Thu Aug 25 15:03:39 2011 Normal: waiting for 1 more child 
processes.\n", 69) = 69
close(4)                                = 0
munmap(0x2b1498d46000, 4096)            = 0
times(NULL)                             = 678493294
wait4(0, 0x7fff629f1168, WNOHANG, NULL) = 0
times(NULL)                             = 678493294
open("/var/log/lsyncd/lsyncd.log", O_WRONLY|O_CREAT|O_APPEND, 0666) = 4
fstat(4, {st_mode=S_IFREG|0644, st_size=6052297948, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 
0x2b1498d46000
fstat(4, {st_mode=S_IFREG|0644, st_size=6052297948, ...}) = 0
lseek(4, 6052297948, SEEK_SET)          = 6052297948
write(4, "Thu Aug 25 15:03:39 2011 Normal: waiting for 1 more child 
processes.\n", 69) = 69
close(4)                                = 0
munmap(0x2b1498d46000, 4096)            = 0
times(NULL)                             = 678493294
wait4(0, 0x7fff629f1168, WNOHANG, NULL) = 0
times(NULL)                             = 678493294
open("/var/log/lsyncd/lsyncd.log", O_WRONLY|O_CREAT|O_APPEND, 0666) = 4
fstat(4, {st_mode=S_IFREG|0644, st_size=6052298017, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 
0x2b1498d46000
fstat(4, {st_mode=S_IFREG|0644, st_size=6052298017, ...}) = 0
lseek(4, 6052298017, SEEK_SET)          = 6052298017
write(4, "Thu Aug 25 15:03:39 2011 Normal: waiting for 1 more child 
processes.\n", 69) = 69
close(4)                                = 0
munmap(0x2b1498d46000, 4096)            = 0
times(NULL)                             = 678493294
wait4(0, 0x7fff629f1168, WNOHANG, NULL) = 0
times(NULL)                             = 678493294
open("/var/log/lsyncd/lsyncd.log", O_WRONLY|O_CREAT|O_APPEND, 0666) = 4
fstat(4, {st_mode=S_IFREG|0644, st_size=6052298086, ...}) = 0
mmap(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 
0x2b1498d46000
fstat(4, {st_mode=S_IFREG|0644, st_size=6052298086, ...}) = 0
lseek(4, 6052298086, SEEK_SET)          = 6052298086
write(4, "Thu Aug 25 15:03:39 2011 Normal: waiting for 1 more child 
processes.\n", 69) = 69
close(4)                                = 0
munmap(0x2b1498d46000, 4096)            = 0
times(NULL)                             = 678493294
...

What version of Lsyncd are you using? On what operating system (host and
target(s)?

2.0.4 Centos 5.6, with a single custom patch to specify the rsync bin path and 
fix an issue where a spin occurs when rsync binary isn't available or dest 
hostname is invalid.  Relevant portion of the patch:

@@ -3069,7 +3075,7 @@
    [ 25] = "die",
    [ 30] = "again",
    [ 35] = "again",
-   [255] = "again",
+-- [255] = "again", -- dont, consisten failure, if e.g. rsync binary not 
available or dest hostname invalid
 }

Full patch attached

If applicable please copy/paste your Lsyncd configuration file.

--- This file is generated by puppet
settings = {
  logfile    = "/var/log/lsyncd/lsyncd.log",
  statusFile = "/var/run/lsyncd/lsyncd.status",
  nodaemon   = false,
}
sync{default.rsyncssh, source="/home", host="user@server", targetdir="/home", 
rsyncOps={"-au", "--exclude", "*.lock", "--exclude", "*.newlock"}, 
exclude={"dovecot-uidlist.lock", "dovecot.index.log.newlock", 
"dovecot.index.cache.lock", "subscriptions.lock", "dovecot-keywords.lock"}, 
init=false}

If applicable please copy/paste a logfile generated with '-log all' of a
minimal testcase that raises this problem.

Please provide any additional information below.

Everything runs fine for tens of thousands of updates until it gets to this 
point.  It's intermittent when the failure occurs, but it seems reasonable a 
fix can be made to not indefinitely wait on something that will never 
return...at least not busy wait indefinitely and eat up CPU.

Original issue reported on code.google.com by ari...@gmail.com on 26 Aug 2011 at 7:25

Attachments:

lsyncd-rsync_path_and_failures.txt

GoogleCodeExporter commented 9 years ago

There are several things you put into 1 issue :-)

Error level 255 is also returned when the network connection fails - it is a 
classic temporary error.

Please check if the recently released Lsyncd 2.0.5 fixes this two problems. It 
differentiates between startup and normal operation where an errorlevel 255 
during startup will make a fail, while during normal operation will run well. 
The rsyncBinary has also been made configureable.

"waiting for ",np," more child processes." happens when Lsyncd is shutting 
down. Do you have a log with this? And before this is happening? Preferable 
with "-log all" turned on. I suppose your inotify queue overflowed and Lsyncd 
thus resets.

Original comment by axk...@gmail.com on 26 Aug 2011 at 8:14

GoogleCodeExporter commented 9 years ago

I was just trying to give you any relevant information regarding our patch to 
lsyncd in case there was something we did that might have instigated the issue 
:)

I'll check into 2.0.5.

The lsync log fragment I pasted you is a section of a 35GB file with more of 
the same.  We ended up deleting the file as it filled up our /var/log, but when 
we see it again I will try to get you what happened just before the spin.  

We've tuned out inotify queues pretty high.  We previously had issues with that 
and tuned them to be quite large, since initially we pushed LOTs of data which 
generated tons of inotify events (we handle all of it fine now).  The traffic 
has actually tapered off quite a bit, so it's probably not inotify queues 
overflowing any longer.  It seems that lsyncd is waiting indefinitely for 
processes that may have died unexpectedly or been killed off?

Thanks for the quick feedback!

Original comment by ari...@gmail.com on 26 Aug 2011 at 8:28

GoogleCodeExporter commented 9 years ago

Should not be possible that Lsyncd waits for a process that doesn't exist. 
Either it still runs, hangs, or it exists as zombie waiting to be collected 
from Lsyncd. If the pid isn't there but Lsyncd still didn't collect the 
process, there must something quite amiss.

The 100% CPU thing while restarting or shutting down should be fixed with 
following patch (to lsyncd 2.0.5). Please try:

axel@prospectionist:~/lsyncd$ svn diff
Index: lsyncd.lua
===================================================================
--- lsyncd.lua  (revision 587)
+++ lsyncd.lua  (working copy)
@@ -2920,6 +2920,9 @@
 --         times ... the alarm time (only read if number is 1)
 --
 function runner.getAlarm()
+       if lsyncdStatus ~= "run" then
+               return false
+       end
        local alarm = false
        ----
        -- checks if current nearest alarm or a is earlier

Original comment by axk...@gmail.com on 26 Aug 2011 at 8:41

GoogleCodeExporter commented 9 years ago

Thanks, I will give it a whirl.

Original comment by ari...@gmail.com on 30 Aug 2011 at 2:41

GoogleCodeExporter commented 9 years ago

Done with 2.0.5

Original comment by axk...@gmail.com on 16 Sep 2011 at 10:12

Changed state: Fixed

antirek / lsyncd

lsyncd spinning 100% CPU, writing "Normal: waiting for 1 more child processes" indefinitely to the log #80