driskell / log-courier

The Log Courier Suite is a set of lightweight tools created to ship and process log files speedily and securely, with low resource usage, to Elasticsearch or Logstash instances.

log-courier holding many files open #364

Closed: pol closed this issue 5 years ago

pol commented 7 years ago

We have an issue where log-courier will intermittently hold hundreds or thousands of files open. Restarting log-courier resolves the issue, but that seems like an inappropriate long-term resolution.

Unfortunately, I don't know how to reproduce the issue. The behavior is that sometimes log-courier will keep hundreds of paths in a 100% completed state (due to log volume, we rotate logs very frequently), and there will be thousands of files listed in lsof | grep deleted held open by log-courier.

For example, let's say the path is:

/app/myapp/etc/resque/resque-worker-statused,highmem,high,medhigh,medium-8/log/main/current

Running this: lc-admin -config /etc/log-courier/log-courier.conf status | grep path | sort | uniq -c will show a line like this:

237 path: /app/myapp/etc/resque/resque-worker-statused,highmem,high,medhigh,medium-8/log/main/current

indicating that the log-courier status is listing that path 237 times. Likewise, grepping for 'completion: 100' will show hundreds of lines.

At the same time, running lsof | grep deleted will show hundreds or thousands of lines of the same path being held open.

The consequences of this are that: 1) these files use up disk space and are only released when log-courier is restarted, so disks will eventually run out of space; and 2) we need to increase the number of file handles the log-courier service is allowed to have open (service configuration of LimitNOFILE=65556) to prevent it from locking up.
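
(For reference, a sketch of the kind of systemd drop-in this refers to; the unit name and drop-in path are illustrative, not necessarily exactly what we run:)

# /etc/systemd/system/log-courier.service.d/limits.conf
[Service]
LimitNOFILE=65556

# then:
systemctl daemon-reload
systemctl restart log-courier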

There is no indication that log-courier is broken or in an error state. It is just using up many more system resources than necessary.

Any help would be appreciated.

pol commented 7 years ago

And as an aside, is there any reason to use log-courier over filebeat at this time? Other than "if it ain't broke..."? I inherited the log-courier setup and am not deeply familiar with the differences between the two projects; there may be critical features of log-courier that make it more appropriate than filebeat for some situations.

pol commented 7 years ago

I think this is probably similar (or related) to a previously posted issue (#355). If the completion value is 100, how would I determine the reason for the file being held as deleted? The logstash servers are not being overwhelmed; is there some other debugging that could be done to determine what is happening?

pol commented 7 years ago

I am continuing to work on this, and it looks like most of the files being held open are held open as intended (log-courier is trying to ship logs to logstash, and logstash is not keeping up with the volume).

But I have some lines that I think are problematic (matching my original issue). There are a bunch of files at completion: 100 with status: idle and various dead_timer values. We have the dead time config set to 10m. Here's a sample:

lc-admin -config /etc/log-courier/log-courier.conf status | grep -B 4 -A 12 'completion: 100' | grep dead_timer | tail
        dead_timer: 48m12.696290565s
        dead_timer: 18m1.75158368s
        dead_timer: 12m7.437940649s
        dead_timer: 56m40.013881409s
        dead_timer: 38m28.133104811s
        dead_timer: 6m45.432244864s
        dead_timer: 34m42.476957265s
        dead_timer: 47m5.400176294s
        dead_timer: 12m57.468869139s
        dead_timer: 40m38.394606093s

The only one listed there that should still be there is the 6m one. The rest should have been released, right, if they are idle and orphaned?

pol commented 7 years ago

Another example, what scenario would cause this to happen:

lc-admin -config /etc/log-courier/log-courier.conf status | grep -A 16 '0xc8229257a0:'                                                        Thu Nov 30 01:49:32 2017

    0xc8229257a0:
      error: n/a
      harvester:
        codecs: none
        completion: 100
        current_offset: 999018
        dead_timer: 25m42.428614567s
        last_eof_offset: 999018
        last_known_size: 999018
        processed_lines: 5582
        speed_bps: 0
        speed_lps: 0
        stale_bytes: 0
        status: idle
      id: 0xc8229257a0
      orphaned: yes

25 minutes, just sitting there. Based on my understanding, this would only happen if the processed_lines were not all shipped to logstash, so it is waiting to do so (maybe it isn't one of the 10 payload processes)? Is that what is happening?

PMDubuc commented 7 years ago

Is it hanging on to files which have been deleted? Try setting "dead time": "5m" in the "files" section of the config.
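
Something like this (a minimal sketch; the path is just a placeholder, the "dead time" line is the relevant part):

{
  "files": [
    {
      "paths": [ "/app/myapp/etc/*/log/main/current" ],
      "dead time": "5m"
    }
  ]
}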


pol commented 7 years ago

So, the dead time is set to 10m at the moment, but we have lots of files like the one in my last message: completed, orphaned, with a dead_timer way past the "dead time" setting. To repeat, my assumption is that it is waiting for logstash to ingest the pending data, but how do I confirm that?
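
One rough check I can think of (assuming the shipping to logstash is plain TCP/TLS): look at the send queue on the outbound connection, since a persistently large Send-Q would at least suggest the remote end isn't keeping up:

ss -tnp | grep log-courier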

driskell commented 7 years ago

I will look into this and into what causes might produce your symptoms (thanks for the amazing detail, it will help a lot!)

With regards to filebeat - I'm unsure of the differences. A few years ago it had some major issues when you hit pipeline saturation, so I built Log Courier specifically to cope with high saturation. Log Courier 2 brought in some smart load balancing so it could connect to multiple Logstash instances of varying performance and intelligently balance the data across them, sending more to faster servers. Filebeat might have caught up and surpassed it by now, so it's definitely worth looking at, and it likely brings lots of features with it (IIRC it has modules to parse common application logs really simply).

I still maintain Log Courier though for my own use as it's easier than upgrading something that currently works fine 👍

I'll look into this issue

driskell commented 7 years ago

OK, so dead timer is actually the remaining time before the file becomes dead.

Statistics per harvester routine (file ingestion) take place every second. Every 10 seconds (roughly every 10th statistic calculation) it will examine the file status to check for truncation (so it can warn and restart) and also check if the dead time is reached (which means no successful read from the file in X minutes as per configuration). If the dead time is reached it then stops and closes the file.

I can see that if you are setting 10m for dead time, it isn't being honoured. Can you paste that part of the config? Possibly it's not parsing for some reason and not being reported as invalid. Also, if you can let me know the version of Log Courier, that will help too.

If the dead timer is not reaching 0s (which would trigger the file to close imminently) then the files may still be getting modified. It would be worth checking those files' modification times to confirm they have indeed stopped being written.
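
Something like this should work even for the deleted ones, since the open descriptors can still be inspected through /proc (assuming the process name is log-courier; the fd number is just an example):

PID=$(pidof log-courier)
ls -l /proc/$PID/fd | grep deleted      # descriptors whose paths have been unlinked
stat -L /proc/$PID/fd/42                # size and modification time of one of them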

driskell commented 7 years ago

Maybe check lsof to see if the writing application still has those files open as well as Log Courier. If the writing application is still writing to a file, even though it's deleted, then Log Courier will keep reading from it and resetting the dead timer.
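
A rough way to check (the path filter is just an example): lsof can list files that are deleted but still open, together with every process holding them:

lsof +L1 | grep 'log/main/current' | awk '{print $1, $2}' | sort -u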

I did have a plan a long time ago to make this smarter so you can enable a reaper timer that will force death on a routine if the file is deleted. I was toying with letting this be based on available disk space or size of the file too, so it kicks in only when it's needed, i.e. when disk usage gets too high.

pol commented 7 years ago

So, more info: it appears that the real problem in our setup is Elasticsearch. It is not ingesting log events fast enough, so logstash is piling up, and in turn log-courier. My initial concern that log-courier was doing something wrong is likely itself wrong. I think everything is working right; there just isn't much feedback about it. I currently have a "worker" server (one of the heavier log generators) on which log-courier has 3600+ files open.

On first look, the situation is similar to the initial post, with lots of files held open (log-courier is watching ~80 files on this server, with most having ~20 files open, and 6 having over 200 open). Here's one:

$ lc-admin -config /etc/log-courier/log-courier.conf status | grep path | sort | uniq -c
<SNIP>
260       path: /app/myapp/etc/sv/resque-worker-statused,long_high,high,medhigh,medium,long_medium-8/log/main/current
$ lsof | grep deleted | wc -l
105212
$ lsof | grep /app/myapp/etc/sv/resque-worker-statused,long_high,high,medhigh,medium,long_medium-8/log/main/current | wc -l
26
$ lc-admin -config /etc/log-courier/log-courier.conf status | grep -B 4 'path: /app/myapp/etc/sv/resque-worker-statused,long_high,high,medhigh,medium,long_medium-8/log/main/current' | grep orphaned | sort | uniq -c
      1       orphaned: no
    259       orphaned: yes

I am not sure what it means that log-courier is working on 260 files while lsof only sees 26 of them. My gut says this is what a major logstash backlog looks like, but from the point of view of log-courier I don't know what to do about it (other than figure out how to either increase log ingestion or decrease logging).

pol commented 7 years ago

A follow-up that indicates log-courier may indeed be doing something wrong.

So, in the prior example, the disparity between lsof deleted files and the total number of active files confused me. Here are similar results from two different servers (the prior one from the example and one other):

# lc-admin -config /etc/log-courier/log-courier.conf status | grep path | wc -l
3769
# lsof | grep deleted | awk '{print $1}' | sort | uniq -c
      8 consul
 107154 log-couri
### Other Instance
# lc-admin -config /etc/log-courier/log-courier.conf status | grep path | wc -l
830
# lsof | grep deleted | awk '{print $1}' | sort | uniq -c
     11 consul
   9252 log-couri

What is going on there? Why does log-courier think it has ~3,800 or ~800 files open, while lsof thinks it has ~107,000 or ~9,000 files open? That seems very suspicious.

pol commented 7 years ago

I was running out of disk space, so I had to kill log-courier. Here's what that looked like in practice, with a bit of extra info:

lc-admin -config /etc/log-courier/log-courier.conf status | grep path | wc -l
917
root@foo-j-3 ~ # lsof | grep deleted | awk '{print $1}' | sort | uniq -c
     11 consul
  10308 log-couri
root@foo-j-3 ~ # systemctl restart log-courier
root@foo-j-3 ~ # lc-admin -config /etc/log-courier/log-courier.conf status | grep path | wc -l
8
root@foo-j-3 ~ # lsof | grep deleted | awk '{print $1}' | sort | uniq -c
     11 consul

I should have timed the restart; it took ~20s. The 25GB disk went from 88% full to 54% full after the service restart.

driskell commented 6 years ago

I will check some of my environments to see whether I'm seeing something similar.

I have a few questions though:

  1. What is the log courier version?
  2. If you filter the lsof deleted output by the same path you filter lc-admin by, do the numbers more closely match? I can see one output was filtered by path and not the other. Alternatively, can you compare the total number of files reported open by lc-admin and lsof regardless of state? It's also worth ruling out an issue with restarts and ensuring only one log-courier process is running before running the commands (was the awk on user? check that only one process is running as log-courier) - see the sketch below.
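
Roughly something like this for the comparison (the path filter is just an example):

pgrep -a log-courier                                              # check only one log-courier process is running
lsof | grep deleted | grep 'log/main/current' | wc -l
lc-admin -config /etc/log-courier/log-courier.conf status | grep 'path: ' | grep 'log/main/current' | wc -l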

I checked the codebase and the file should always close when the harvester ends, so we shouldn't see numbers like that.

I'll check more on some of my own environments and see if I can find anything similar.

driskell commented 6 years ago

Aha, I see the last one is unfiltered on path and lsof is showing many deleted files. Interesting. Are you able to grab a sample of the files reported by lsof if you see this again and compare it to lc-admin? Just to see if there's a rogue type of file (maybe even Log Courier's own log file).

pol commented 6 years ago

Version:

~ % lc-admin --version
Log Courier client version 2.0.5

And here is a file list gist:

https://gist.github.com/pol/8a5899c58aa424b69d646d4fc9449606

pol commented 6 years ago

Also, this is basically happening all of the time, so I have pretty much constant access to sample data. :P

I mean, if it goes away I will be happy, but for now I can get you pretty much anything you need.

driskell commented 6 years ago

Ahhhh, OK, so I found an explanation for the huge lsof file counts: lsof is reporting per thread, not per process.

For example, for my memcached instance I have the following (I ran lsof | head -1 to get the header line):

COMMAND     PID   TID               USER   FD      TYPE             DEVICE  SIZE/OFF       NODE NAME
memcached   862                memcached   26u     unix 0xffff8801c33fa400       0t0   23397500 /var/run/memcached/memcached.sock
memcached   862   866          memcached   26u     unix 0xffff8801c33fa400       0t0   23397500 /var/run/memcached/memcached.sock
memcached   862   867          memcached   26u     unix 0xffff8801c33fa400       0t0   23397500 /var/run/memcached/memcached.sock
memcached   862   869          memcached   26u     unix 0xffff8801c33fa400       0t0   23397500 /var/run/memcached/memcached.sock
memcached   862   870          memcached   26u     unix 0xffff8801c33fa400       0t0   23397500 /var/run/memcached/memcached.sock
memcached   862   871          memcached   26u     unix 0xffff8801c33fa400       0t0   23397500 /var/run/memcached/memcached.sock
memcached   862   873          memcached   26u     unix 0xffff8801c33fa400       0t0   23397500 /var/run/memcached/memcached.sock
memcached   862   874          memcached   26u     unix 0xffff8801c33fa400       0t0   23397500 /var/run/memcached/memcached.sock
memcached   862   875          memcached   26u     unix 0xffff8801c33fa400       0t0   23397500 /var/run/memcached/memcached.sock
memcached   862   876          memcached   26u     unix 0xffff8801c33fa400       0t0   23397500 /var/run/memcached/memcached.sock

However, I only have a single memcached process, and it only has that file open once. Because the file descriptor (26u) is shared amongst threads, lsof reports it once per thread (each line has a different TID, and the blank TID represents the main thread, which has the same ID as the PID).

This means your lsof pipeline will report (number of log-courier Golang threads) * (number of open files that are deleted), and since there are lots of files being read there will be lots of threads, so you get a huge number.

In order to prevent the duplication, you need to use specific filters in lsof to target the query; otherwise it automatically emits information for every task (thread).

This would give more realistic data: lsof -p pid-of-log-courier
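
Combining that with the earlier counts, something like this should line up far better (assuming the process name is log-courier):

lsof -p $(pidof log-courier) | grep deleted | wc -l
lc-admin -config /etc/log-courier/log-courier.conf status | grep path | wc -l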

driskell commented 6 years ago

I distributed some commands to my environments and the results were all accurate running 2.0.5. The number of path results from lc-admin matched exactly the number of open log files (I restricted lsof to the Log Courier PID using lsof -p and counted only log file paths).

Let me know if there's still anything amiss!

pol commented 6 years ago

Ok, so that solves that mystery. Does log-courier always use the same number of threads? Should the number of files reported by lsof always be active_files * num_threads?

driskell commented 6 years ago

Sorry for the late reply; I was away for the year end.

Threads are managed by Golang: it will hold as many threads as there are blocking system calls, plus one active running thread for each CPU core. So for a quad core it would be 4 threads, and if any syscall blocks it will start a new thread so that there are always 4 threads runnable.

I believe it should probably always be active_files * num_threads. Bear in mind active files would include the Log Courier log file and possibly stdin/stdout/stderr and socket connections too.
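
If you want to sanity-check it without lsof, the numbers can also be read straight from /proc (assuming the process name is log-courier):

PID=$(pidof log-courier)
ls /proc/$PID/task | wc -l    # OS threads currently held by the Go runtime
ls /proc/$PID/fd | wc -l      # actual open descriptors, counted once each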