Duplicates observed when log-courier configuration file is overwritten

littlefish-littlefish commented 6 years ago

I have two log-courier configuration files: a default log-courier.conf and a service-specific log-courier.conf. The default log-courier.conf has some basic settings (e.g. system log), and the service log-courier.conf has the basic settings plus service-specific settings. Log-courier pushes logs to an ELK stack.

Part of the service deployment flow is:

Common setup
- call the Chef cookbook, which puts the default log-courier.conf on my system and starts log-courier.
Service-specific setup
- stop log-courier, overwrite log-courier.conf with the service-specific configuration, and start log-courier.

However, after each service deployment, I noticed duplicate records in the Elasticsearch (ES) server of our ELK stack. Here are some observations:

When both files contain overlapping configuration (e.g. the default settings), log-courier seems to pick up some log entries again at redeployment time. I consistently observed duplicate entries in the ES server: one from when the log was first generated, and a duplicate with the timestamp of the deployment.
Log-courier did not pick up the complete log file again, though, because some entries in ES have no duplicates.

Based purely on my observations, it is hard to derive a pattern for how the log-courier agent behaves when its configuration file is overwritten. How exactly does log-courier maintain its position pointer in this case? This matters especially for redeployment, where we start with the service log-courier.conf, overwrite it with the default log-courier.conf, and then overwrite it with the service log-courier.conf again, restarting log-courier in between.

littlefish-littlefish commented 6 years ago

Checking more closely, I observed not only duplicated data but also missing data around the time the log-courier.conf file was replaced. We are using log-courier version 1.8. Any ideas?

littlefish-littlefish commented 6 years ago

I also found a place in the cookbook that does not stop log-courier before updating it.

driskell commented 6 years ago

The best thing to do is not to stop log-courier at all: just update the configuration and request a reload. This can be done using the service wrappers (they send SIGHUP to log-courier) or by using lc-admin to send a reload.
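The update-then-reload flow could be sketched roughly as below, in Python rather than in the cookbook's own language. This is only an illustration under assumptions: the helper names `replace_config` and `request_reload` are hypothetical, and the location of the pid file depends on how log-courier was packaged, so it is passed in rather than hard-coded. The only behaviour taken from this thread is that log-courier reloads its configuration on SIGHUP.

```python
import os
import signal
import tempfile


def replace_config(path, new_text):
    """Replace the file at `path` with `new_text` in one atomic rename.

    Hypothetical helper: writing to a temporary file in the same directory
    and renaming it means a reader never sees a half-written configuration.
    Note that mkstemp creates the file with mode 0600; adjust permissions
    afterwards if the service needs something else.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as tmp:
            tmp.write(new_text)
            tmp.flush()
            os.fsync(tmp.fileno())
        os.replace(tmp_path, path)  # atomic on POSIX within one filesystem
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


def request_reload(pid_file):
    """Send SIGHUP to the process whose pid is stored in `pid_file`.

    The pid file path is an assumption; use whatever your init script or
    service wrapper writes. Returns the pid that was signalled.
    """
    with open(pid_file) as f:
        pid = int(f.read().strip())
    os.kill(pid, signal.SIGHUP)
    return pid
```

In a deployment step this would be called as `replace_config(conf_path, service_conf_text)` followed by `request_reload(pid_file)`, with both paths supplied by the caller. The running process is never stopped, so there is no shutdown during which the wrapper might force-kill it.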

What is likely happening is that there's contention in the pipeline and log-courier is waiting for acknowledgements from logstash. If the pipeline is slow enough that acknowledgements for outstanding data do not arrive within X seconds, the service wrapper sometimes force kills the process. This means that when log-courier starts again it will need to re-send the logs that weren't acknowledged. Those can duplicate if logstash was still trying to process and save them before things died.

Regarding missing data, I can't explain that, unless logstash was also restarted. Once the logstash pipeline acknowledges a set of logs, log-courier records that point as acknowledged and will never send it again. So in case of a log-courier crash, logs should not be lost at all. They should only be lost if there are issues on the logstash -> ES side, though depending on the logstash version and persistence configuration that might even be completely prevented.
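To make the acknowledgement behaviour concrete, here is a toy model in Python. It is not log-courier's code: `ToyShipper`, `send_batch` and `simulate` are hypothetical names, and the "file" is just a list of lines. It only illustrates the rule described above, that the resume point advances when an acknowledgement arrives and never before, which gives duplicates after a forced kill but no gaps.

```python
class ToyShipper:
    """Toy at-least-once shipper; an illustration, not log-courier itself."""

    def __init__(self, lines, acked_offset=0):
        self.lines = lines
        # Stand-in for the persisted resume point: everything before this
        # index has been acknowledged by the receiver.
        self.acked_offset = acked_offset

    def send_batch(self, sink, batch_size, ack_arrives=True):
        """Send the next batch; advance the resume point only on ack."""
        start = self.acked_offset
        batch = self.lines[start:start + batch_size]
        sink.extend(batch)  # the receiver stores the events either way
        if ack_arrives:
            self.acked_offset = start + len(batch)
        return len(batch)


def simulate():
    lines = ["a", "b", "c", "d"]
    sink = []  # what ends up stored downstream

    shipper = ToyShipper(lines)
    shipper.send_batch(sink, 2)                     # a, b stored and acked
    shipper.send_batch(sink, 2, ack_arrives=False)  # c, d stored, then killed
    saved_offset = shipper.acked_offset             # still 2

    # Restart: a new process resumes from the last acknowledged point.
    restarted = ToyShipper(lines, acked_offset=saved_offset)
    restarted.send_batch(sink, 2)                   # c, d sent again
    return sink


print(simulate())  # ['a', 'b', 'c', 'd', 'c', 'd']
```

Every line appears at least once, and the only duplicates are the ones that were in flight when the process was killed. That matches seeing some entries duplicated and others not.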

If there is indeed missing data, it would be useful to know which files it came from and to have a copy of the shutdown and startup logs from log-courier from when it happened.

driskell / log-courier

Duplicates observed when log-courier configuration file is overwritten #366