Closed littlefish-littlefish closed 5 years ago
Checking more closely, I not only observed duplicated data, but also missing data, around the time the log-courier.conf file was replaced. We are using log-courier version 1.8. Any ideas?
I also found a place in the cookbook that does not stop log-courier before updating it.
The best thing to do is not to stop log-courier at all: just update the configuration and request a reload. This can be done using the service wrappers (they send SIGHUP to log-courier) or by using lc-admin to send a reload.
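For anyone scripting this outside the service wrappers, here is a minimal sketch of the SIGHUP approach in Python. The helper names `read_pid` and `request_reload` are made up for illustration, and the pid file path in the comment is an assumption; use whatever your init script actually writes.

```python
import os
import signal


def read_pid(pid_file: str) -> int:
    """Read a process ID from a pid file (the location is deployment-specific)."""
    with open(pid_file) as fh:
        return int(fh.read().strip())


def request_reload(pid: int) -> None:
    """Send SIGHUP to the process, asking it to reload without stopping."""
    os.kill(pid, signal.SIGHUP)


# Example call (hypothetical pid file path, adjust to your installation):
# request_reload(read_pid("/var/run/log-courier.pid"))
```

Since the process is never stopped, there is no shutdown timeout for the service wrapper to hit, so it should not end up force-killing log-courier while data is still unacknowledged.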
What is likely happening is that there's contention in the pipeline and log-courier is waiting for acknowledgements from logstash. If the pipeline is slow enough that acknowledgements for outstanding data do not arrive within X seconds, the service wrapper sometimes force kills the process. This means that when log-courier starts again it needs to re-send the logs that weren't acknowledged. Those can be duplicated if logstash was still processing and saving them when things died.
Regarding missing data, I can't explain that, unless logstash was also restarted. Once the logstash pipeline acknowledges a set of logs, log-courier records that point as acknowledged and will never send it again. So if log-courier crashes, logs should not be lost at all. They should only be lost if there are issues on the logstash -> ES side, though depending on the logstash version and persistence configuration even that might be completely prevented.
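To make the duplicate-but-not-lost behaviour concrete, here is a toy model in Python. It is only a sketch of the acknowledgement idea described above, not log-courier's actual code; the `Shipper` class, the batch size, and the in-memory `stored` list standing in for logstash/ES are all assumptions for illustration.

```python
class Shipper:
    """Toy shipper that always resumes from the last acknowledged offset."""

    def __init__(self, lines, acked_offset=0):
        self.lines = lines
        self.acked_offset = acked_offset  # stands in for the persisted resume point

    def next_batch(self, size):
        return self.lines[self.acked_offset:self.acked_offset + size]

    def acknowledge(self, count):
        self.acked_offset += count


def simulate_forced_restart(lines, batch_size):
    stored = []  # what the receiving side ended up saving
    shipper = Shipper(lines)

    # Batch 1: delivered, stored, and acknowledged.
    batch = shipper.next_batch(batch_size)
    stored.extend(batch)
    shipper.acknowledge(len(batch))

    # Batch 2: the receiver stores it, but the shipper is force killed
    # before the acknowledgement arrives, so the offset never advances.
    batch = shipper.next_batch(batch_size)
    stored.extend(batch)
    persisted_offset = shipper.acked_offset

    # Restart: resume from the persisted offset and re-send unacknowledged data.
    shipper = Shipper(lines, persisted_offset)
    while True:
        batch = shipper.next_batch(batch_size)
        if not batch:
            break
        stored.extend(batch)
        shipper.acknowledge(len(batch))
    return stored


lines = ["l0", "l1", "l2", "l3", "l4"]
stored = simulate_forced_restart(lines, batch_size=3)
print(stored)  # ['l0', 'l1', 'l2', 'l3', 'l4', 'l3', 'l4']
```

In this model the unacknowledged batch shows up twice, but nothing is missing, which is why duplicates are expected after a forced kill while lost data points to a problem elsewhere.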
If there is indeed missing data it'll be useful to know from what files and have a copy of the shutdown and startup logs from log-courier when it happened.
I have two log-courier configuration files: a default log-courier.conf and a service-specific log-courier.conf. The default log-courier.conf has some basic settings (e.g. system log), and the service log-courier.conf has the basic settings plus service-specific settings. Log-courier pushes logs to the ELK stack.
Part of the service deployment flow is:
However, after each service deployment, I noticed duplicate records in ELK Elasticsearch server. Here are some observations:
Based purely on my observations, it is hard to derive a pattern for how the log-courier agent behaves when its configuration file gets overwritten. How exactly does log-courier maintain the location pointer in this case? This matters especially for redeployment, where we start with the service log-courier.conf, overwrite it with the default log-courier.conf, and then overwrite that with the service log-courier.conf again, with log-courier restarts in between.