And as an aside, is there any reason to use log-courier over filebeat at this time? Other than "if it ain't broke..."? I inherited the log-courier setup and am not deeply familiar with the differences between the two projects; there may be critical features of log-courier that make it more appropriate than filebeat for some situations.
I think this is probably similar (or related) to a previously posted issue (#355). If the completion value is 100, how would I determine the reason for the file being held as deleted? The logstash servers are not being overwhelmed; is there some other debugging I could do to determine what is happening?
I am continuing to work on this and it looks like most of the files being held open are working as intended (log-courier trying to ship logs to logstash, and logstash is not keeping up with the volume).
But I have some lines that I think are problematic (matching my original issue). There are a bunch of files that are at completion: 100 with a status: idle and various values for dead_timer. We have the dead time config set to 10m. Here's a sample:
lc-admin -config /etc/log-courier/log-courier.conf status | grep -B 4 -A 12 'completion: 100' | grep dead_timer | tail
dead_timer: 48m12.696290565s
dead_timer: 18m1.75158368s
dead_timer: 12m7.437940649s
dead_timer: 56m40.013881409s
dead_timer: 38m28.133104811s
dead_timer: 6m45.432244864s
dead_timer: 34m42.476957265s
dead_timer: 47m5.400176294s
dead_timer: 12m57.468869139s
dead_timer: 40m38.394606093s
The only one that should still be there is the 6m one. The rest should have been released if they are idle and orphaned, right?
Another example, what scenario would cause this to happen:
lc-admin -config /etc/log-courier/log-courier.conf status | grep -A 16 '0xc8229257a0:'
Thu Nov 30 01:49:32 2017
0xc8229257a0:
error: n/a
harvester:
codecs: none
completion: 100
current_offset: 999018
dead_timer: 25m42.428614567s
last_eof_offset: 999018
last_known_size: 999018
processed_lines: 5582
speed_bps: 0
speed_lps: 0
stale_bytes: 0
status: idle
id: 0xc8229257a0
orphaned: yes
25 minutes, just sitting there. Based on my understanding, this would only happen if the processed_lines were not all shipped to logstash, so it is waiting to do so (maybe it isn't one of the 10 payload processes)? Is that what is happening?
Is it hanging on to files which have been deleted? Try setting "dead time": "5m" in the "files" section of the config.
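For example, something along these lines (the paths here are placeholders, not taken from your setup):
"files": [
  {
    "paths": [ "/var/log/app/*.log" ],
    "dead time": "5m"
  }
]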
So, the dead time is set to 10m at the moment, but we have lots of files like the one in my last message: completed, orphaned, with a dead_timer way past the dead time setting. To repeat, my assumption is that it is waiting for logstash to ingest the pending data, but how do I confirm that?
I will look into this and into what might cause your symptoms (thanks for the amazing detail, it will help a lot!)
With regards to filebeat - I'm unsure of the differences. A few years ago it had some major issues when you hit pipeline saturation, so I built Log Courier specifically to cope with high saturation. Log Courier 2 brought in some smart load balancing so it could connect to multiple Logstash instances of varying performance and balance the data across them, sending more to faster servers. Filebeat might have caught up and surpassed it by now, so it's definitely worth looking at, and it likely brings lots of features with it (IIRC it has modules to parse common application logs really simply).
I still maintain Log Courier though for my own use as it's easier than upgrading something that currently works fine 👍
I'll look into this issue
OK, so dead_timer is actually the remaining time before the file becomes dead.
Statistics per harvester routine (file ingestion) are calculated every second. Every 10 seconds (roughly every 10th statistics calculation) it will examine the file status to check for truncation (so it can warn and restart) and also check whether the dead time has been reached (which means no successful read from the file in X minutes, as per the configuration). If the dead time is reached, it stops and closes the file.
I can see that if you are setting 10m for dead time, it isn't being honoured. Can you paste that part of the config? Possibly it's not parsing it for some reason and not reporting it as invalid. Also, if you can let me know the version of Log Courier, that will help too.
If the dead timer is not reaching 0s (which would trigger the file to close imminently) then the files may still be getting modified. It will be worth checking those files' modification times to confirm they really have stopped being written.
Maybe check lsof to see if the writing application still has those files open as well as Log Courier. If the writing application is still writing to a file, even though it's deleted, then Log Courier will keep reading from it and resetting the dead timer, even though it's deleted.
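For example, something along these lines (the path is a placeholder):
# For a path that still exists: when was it last written? If this keeps moving,
# the dead timer will keep resetting
stat -c '%y %n' /var/log/app/current
# Which processes have it open? This works for deleted files too, since lsof still
# shows the original name - look for the writing application alongside log-courier
lsof -nP | grep '/var/log/app/current'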
I did have a plan a long time ago to make this smarter, so you can enable a reaper timer that will force death on a routine if the file is deleted. I was toying with letting this be based on available disk space or the size of the file too, so it kicks in when it's needed - when disk usage gets too high.
So, more info: it appears that the real problem in our setup is elasticsearch. It is not ingesting log events fast enough, logstash is piling up, and in turn, so is log-courier. So my initial concern that log-courier was doing something wrong is likely itself wrong. I think everything is working right, it's just that there isn't much feedback about it. I currently have a "worker" server (one of the heavier log generators) on which log-courier has 3600+ files open.
On first look, the situation is similar to the initial post, with lots of files held open (log-courier is watching ~80 paths on this server, most with ~20 files open and 6 with over 200 open). Here's one:
$ lc-admin -config /etc/log-courier/log-courier.conf status | grep path | sort | uniq -c
<SNIP>
260 path: /app/myapp/etc/sv/resque-worker-statused,long_high,high,medhigh,medium,long_medium-8/log/main/current
$ lsof | grep deleted | wc -l
105212
$ lsof | grep /app/myapp/etc/sv/resque-worker-statused,long_high,high,medhigh,medium,long_medium-8/log/main/current | wc -l
26
$ lc-admin -config /etc/log-courier/log-courier.conf status | grep -B 4 'path: /app/myapp/etc/sv/resque-worker-statused,long_high,high,medhigh,medium,long_medium-8/log/main/current' | grep orphaned | sort | uniq -c
1 orphaned: no
259 orphaned: yes
I am not sure what it means that log-courier is working on 260 files but lsof only sees 26 of them. My gut says this is what a major logstash backlog looks like, but from the point of view of log-courier I don't know what to do about it (other than trying to figure out how to either increase log ingestion or decrease logging).
A follow-up that suggests log-courier may indeed be doing something wrong.
So, in the prior example, the disparity between lsof deleted files and the total number of active files confused me. Here are similar results from two different servers (the one from the prior example and one other):
# lc-admin -config /etc/log-courier/log-courier.conf status | grep path | wc -l
3769
# lsof | grep deleted | awk '{print $1}' | sort | uniq -c
8 consul
107154 log-couri
### Other Instance
# lc-admin -config /etc/log-courier/log-courier.conf status | grep path | wc -l
830
# lsof | grep deleted | awk '{print $1}' | sort | uniq -c
11 consul
9252 log-couri
What is going on there? Why does log-courier think it has 3000 or 800 files open, while lsof shows 100000 or 9000? That seems very suspicious.
I was running out of disk space, so I had to kill log-courier. Here's what that looked like in practice, with a bit of extra info:
lc-admin -config /etc/log-courier/log-courier.conf status | grep path | wc -l
917
root@foo-j-3 ~ # lsof | grep deleted | awk '{print $1}' | sort | uniq -c
11 consul
10308 log-couri
root@foo-j-3 ~ # systemctl restart log-courier
root@foo-j-3 ~ # lc-admin -config /etc/log-courier/log-courier.conf status | grep path | wc -l
8
root@foo-j-3 ~ # lsof | grep deleted | awk '{print $1}' | sort | uniq -c
11 consul
I should have timed the restart, it took ~20s. The 25GB disk went from 88% full to 54% full after the service restart.
I will check some of my environments to see whether I'm seeing something similar.
I have a few questions though:
I checked the codebase and the file should always close when the harvester ends, so we shouldn't see numbers like that.
I'll check more on some of my own environments to see if I can find anything similar.
Aha, I see the last one is unfiltered on path and lsof is showing many deleted files. Interesting. Are you able to grab a sample of the files reported by lsof if you see this again, and compare it to lc-admin? Just to see if there's a rogue type of file (maybe even Log Courier's own log file).
Version:
~ % lc-admin --version
Log Courier client version 2.0.5
And here is a file list gist:
https://gist.github.com/pol/8a5899c58aa424b69d646d4fc9449606
Also, this is basically happening all of the time, so I have pretty much constant access to sample data. :P
I mean, if it goes away I will be happy, but for now I can get you pretty much anything you need.
Ahhhh OK, so I found the explanation for the lsof number of files being huge. This is because lsof is reporting per thread and not per process. For example, for my memcached instance I have the following (I ran lsof | head -1 to get the header line):
COMMAND PID TID USER FD TYPE DEVICE SIZE/OFF NODE NAME
memcached 862 memcached 26u unix 0xffff8801c33fa400 0t0 23397500 /var/run/memcached/memcached.sock
memcached 862 866 memcached 26u unix 0xffff8801c33fa400 0t0 23397500 /var/run/memcached/memcached.sock
memcached 862 867 memcached 26u unix 0xffff8801c33fa400 0t0 23397500 /var/run/memcached/memcached.sock
memcached 862 869 memcached 26u unix 0xffff8801c33fa400 0t0 23397500 /var/run/memcached/memcached.sock
memcached 862 870 memcached 26u unix 0xffff8801c33fa400 0t0 23397500 /var/run/memcached/memcached.sock
memcached 862 871 memcached 26u unix 0xffff8801c33fa400 0t0 23397500 /var/run/memcached/memcached.sock
memcached 862 873 memcached 26u unix 0xffff8801c33fa400 0t0 23397500 /var/run/memcached/memcached.sock
memcached 862 874 memcached 26u unix 0xffff8801c33fa400 0t0 23397500 /var/run/memcached/memcached.sock
memcached 862 875 memcached 26u unix 0xffff8801c33fa400 0t0 23397500 /var/run/memcached/memcached.sock
memcached 862 876 memcached 26u unix 0xffff8801c33fa400 0t0 23397500 /var/run/memcached/memcached.sock
However, I only have a single memcached process, and it only has that file open once. Because the file descriptor (26u) is shared amongst threads, lsof reports it once per thread (a different TID on each line, with a blank TID representing the main thread, whose ID is the same as the PID).
This means your lsof pipeline command will report (number of log-courier Golang threads) * (number of open files that are deleted) - and as there are lots of files being read, there'll be lots of threads, and so you'll get a huuuuuge number.
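As a rough sanity check against your numbers above: 107154 / 3769 ≈ 28 and 9252 / 830 ≈ 11, which are plausible thread counts for a busy Go process, so the multiplication is about the right size to explain the discrepancy.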
In order to prevent the duplication, you need to use specific filters in lsof to target the query; otherwise it automatically starts including information for every task (thread).
This would give a better result with more realistic data: lsof -p pid-of-log-courier
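For example (assuming a single log-courier process, so pidof returns one PID):
lsof -nP -p "$(pidof log-courier)" | grep deleted | wc -l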
I distributed some commands to my environments and the results were all accurate running 2.0.5. The number of path results from lc-admin matched exactly the number of open log files (I included only the Log Courier PID using lsof -p and included only log file paths).
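In other words, roughly this kind of comparison (the /app/ filter is just an example - use whatever prefix your log paths share):
lc-admin -config /etc/log-courier/log-courier.conf status | grep -c path
lsof -nP -p "$(pidof log-courier)" | grep -c '/app/'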
Let me know if there's still anything amiss!
Ok, so that solves that mystery. Does log-courier always use the same number of threads? Should the number of files reported by lsof always be active_files * num_threads?
Sorry for the late reply, I was away for the year end.
Threads are managed by Golang - it will hold as many threads as there are blocking system calls, plus one actively running thread for each CPU core. So for a quad core it would be 4 threads, and then if any syscall blocks it'll start a new thread so that there are always 4 threads running.
I believe it should probably always be active_files * num_threads. Bear in mind active files would include the Log Courier log file and possibly stdin/stdout/stderr and socket connections too.
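If you want to check the actual thread count for that comparison, /proc exposes it (standard Linux; this assumes a single log-courier process):
grep Threads /proc/"$(pidof log-courier)"/status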
We have an issue where log-courier will intermittently hold hundreds or thousands of files open. Restarting log-courier resolves the issue, but that seems like an inappropriate long-term resolution.
Unfortunately, I don't know how to reproduce the issue. The behavior is that sometimes log-courier will keep hundreds of paths in a 100% completed state (due to log volume, we rotate logs very frequently), and there will be thousands of files listed in lsof | grep deleted held open by log-courier. For example, let's say the path is:
/app/myapp/etc/resque/resque-worker-statused,highmem,high,medhigh,medium-8/log/main/current
running this:
lc-admin -config /etc/log-courier/log-courier.conf status | grep path | sort | uniq -c
will show a line like this:
237 path: /app/myapp/etc/resque/resque-worker-statused,highmem,high,medhigh,medium-8/log/main/current
indicating that the log-courier status is listing that path 237 times. Likewise, grepping for 'completion: 100' will show hundreds of lines.
At the same time, running lsof | grep deleted will show hundreds or thousands of lines of the same path being held open. The consequences of this are that: 1) these files use up disk space and are only released when log-courier is restarted, so disks will eventually run out of space; and 2) we need to increase the number of file handles that the log-courier service is allowed to have open (service configuration of LimitNOFILE=65556) to prevent it from locking up.
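(For reference, that limit bump is just a systemd drop-in along these lines - the drop-in location and reload steps are standard systemd, nothing log-courier specific:)
mkdir -p /etc/systemd/system/log-courier.service.d
cat > /etc/systemd/system/log-courier.service.d/limits.conf <<'EOF'
[Service]
LimitNOFILE=65556
EOF
systemctl daemon-reload
systemctl restart log-courier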
There is no indication that log-courier is broken or in an error state. It is just using up many more system resources than necessary.
Any help would be appreciated.