driskell / log-courier

The Log Courier Suite is a set of lightweight tools created to ship and process log files speedily and securely, with low resource usage, to Elasticsearch or Logstash instances.

Duplicates observed when log-courier configuration file is overwritten #366

Closed: littlefish-littlefish closed this issue 5 years ago

littlefish-littlefish commented 6 years ago

I have two log-courier configuration files: a default log-courier.conf and a service-specific log-courier.conf. The default log-courier.conf has some basic settings (e.g. system log), and the service log-courier.conf has the same basic settings plus service-specific settings. Log-courier pushes logs to an ELK stack.

Part of the service deployment flow is:

  1. Common setup
    • call chef cookbook, which puts the default log-courier.conf on my system and starts log-courier;
  2. Service specific setup
    • stop log-courier, overwrite log-courier.conf with service specific configuration, and start log-courier.

However, after each service deployment, I noticed duplicate records in the Elasticsearch server of our ELK stack. Here are some observations:

Based purely on my observations, it is hard to derive a pattern in how the log-courier agent behaves when its configuration file gets overwritten. How exactly does log-courier maintain its position pointer in this case? I am especially interested in redeployment, where we start with the service log-courier.conf, overwrite it with the default log-courier.conf, and then overwrite it with the service log-courier.conf again, restarting log-courier in between.

littlefish-littlefish commented 6 years ago

Checking more closely, I not only observed duplicated data, but also missing data, around the time the log-courier.conf file was replaced. We are using log-courier version 1.8. Any ideas?

littlefish-littlefish commented 6 years ago

I also found a place in the cookbook that does not stop log-courier before updating its configuration.

driskell commented 6 years ago

The best thing to do is not to stop log-courier at all: just update the configuration and request a reload. This can be done using the service wrappers (they send SIGHUP to log-courier) or by using lc-admin to send a reload.
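As a rough sketch of the "update, then reload" approach above, the following Python replaces a configuration file and then sends SIGHUP to the running process, which is what the service wrappers are described as doing. The function name, the temp-file-plus-rename write, and the idea of reading the PID from a pid file are illustrative assumptions, not part of log-courier; the paths in the usage note are hypothetical and depend on how the package was installed.

```python
import os
import signal
import tempfile
from pathlib import Path


def update_config_and_reload(config_path, new_config, pid_file):
    """Overwrite the config file, then ask the running process to reload it.

    Hypothetical helper for illustration only; not part of log-courier.
    """
    config_path = Path(config_path)

    # Write to a temporary file in the same directory and rename it into
    # place, so the process never reads a half-written configuration.
    fd, tmp_name = tempfile.mkstemp(dir=config_path.parent)
    with os.fdopen(fd, "w") as tmp:
        tmp.write(new_config)
    os.replace(tmp_name, config_path)

    # Read the PID of the running process (assumed to be stored in a pid
    # file) and send SIGHUP instead of stopping and starting the service.
    pid = int(Path(pid_file).read_text().strip())
    os.kill(pid, signal.SIGHUP)
    return pid
```

Usage would look something like `update_config_and_reload("/etc/log-courier/log-courier.conf", service_config, "/var/run/log-courier.pid")`, where both paths and `service_config` are placeholders. Because the process keeps running, there is no window in which a wrapper could force-kill it.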

What is likely happening is that there's contention in the pipeline and log-courier is waiting for acknowledgements from logstash. If the pipeline is slow enough that acknowledgements for outstanding data are not forthcoming for up to X many seconds, the service wrapper sometimes force kills the process. This means that when log-courier starts again it will need to re-send the logs that weren't acknowledged. Those can be duplicated if logstash was still trying to process and save them before things died.

Regarding missing data, I can't explain that, unless logstash was also restarted. Once the logstash pipeline acknowledges a set of logs, log-courier records that point as acknowledged and will never send it again. So if log-courier crashes, logs should not be lost at all. They should only be lost if there are issues on the logstash -> ES side, though depending on logstash version and persistence configuration even that might be completely prevented.
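To make the acknowledgement behaviour described above concrete, here is a small, self-contained Python simulation. It is a toy model under stated assumptions, not log-courier's actual implementation: `Receiver`, `ship`, the batch size, and the `die_before_ack_at` switch are all invented for illustration. It only shows why resuming from the last acknowledged offset can produce duplicates but should not drop data.

```python
class Receiver:
    """Stands in for Logstash: stores every event it is handed."""

    def __init__(self):
        self.stored = []

    def process(self, events):
        self.stored.extend(events)


def ship(lines, acked_offset, receiver, batch_size=2, die_before_ack_at=None):
    """Send lines starting at acked_offset; return the new acknowledged offset.

    If die_before_ack_at is set, the shipper is "killed" after the receiver
    has processed the batch starting at that offset, but before the
    acknowledgement for that batch is recorded.
    """
    offset = acked_offset
    while offset < len(lines):
        batch = lines[offset:offset + batch_size]
        receiver.process(batch)
        if die_before_ack_at == offset:
            # Offset is not advanced: the batch stays unacknowledged.
            return offset
        # Acknowledgement received: these lines are never sent again.
        offset += len(batch)
    return offset


lines = ["a", "b", "c", "d", "e"]
receiver = Receiver()

# First run is force-killed while the batch at offset 2 is unacknowledged.
saved = ship(lines, 0, receiver, die_before_ack_at=2)
# After the restart, shipping resumes from the last acknowledged offset.
saved = ship(lines, saved, receiver)

print(saved)            # 5
print(receiver.stored)  # ['a', 'b', 'c', 'd', 'c', 'd', 'e']
```

In this model "c" and "d" are stored twice because the receiver had already saved them when the shipper died, while every line still arrives at least once. Data would only go missing if the receiver lost events after acknowledging them, which matches the logstash -> ES caveat above.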

If there is indeed missing data, it would be useful to know which files it came from and to have a copy of the shutdown and startup logs from log-courier from when it happened.