fluent / fluentd

Fluentd: Unified Logging Layer (project under CNCF)
https://www.fluentd.org
Apache License 2.0

Some logs are missing #2597

Closed: vascoosx closed this issue 1 year ago

vascoosx commented 5 years ago

Describe the bug

In my current setup, logs go to Papertrail via syslog+TLS and to a GCP instance via HTTPS, which then forwards them to Stackdriver. Some logs that are present in Papertrail cannot be found in Stackdriver.

To Reproduce

Logs are initially sent through Heroku's logdrain. They first go to an nginx server acting as a proxy, then to fluentd, which sends them to Stackdriver.

Expected behavior

Every log that appears in Papertrail should also appear in Stackdriver.

Your Configuration

client setting:

<source>
  @type http
  tag <tag name>
  <parse>
    @type regexp
    expression /^.*<\S+>\d (?<time>\S+) host app web.1 - (?<severity>.), (?<message>.*)$/
  </parse>
  port <port>
  bind 0.0.0.0
  add_remote_addr https://<url>
</source>

Your Error Log

# /var/log/google-fluentd/google-fluentd.log

2019-09-01 06:25:01 +0000 [info]: #0 flushing all buffer forcedly
2019-09-01 06:25:01 +0000 [info]: #0 detected rotation of /var/log/nginx/access.log; waiting 5 seconds
2019-09-01 06:25:01 +0000 [info]: #0 following tail of /var/log/nginx/access.log
2019-09-01 06:25:01 +0000 [info]: #0 detected rotation of /var/log/syslog; waiting 5 seconds
2019-09-01 06:25:01 +0000 [info]: #0 following tail of /var/log/syslog

(no errors were found in nginx)

Additional context

Agent version: google-fluentd 1.4.2
OS: Ubuntu 18.04

The text below is a portion of the logs. Asterisks denote the logs that were missing:

05:54:11.468349
05:54:11.474820
05:54:11.477478 *
05:54:11.481780 *
05:54:11.484050 *
05:54:11.485974 *
05:54:11.488010 *
05:54:11.491051 *
05:54:11.492902 *
05:54:11.495263 *
05:54:11.497550 *
05:54:11.498517 *
05:54:11.499052 *
05:54:12.163430
05:54:12.272951
05:54:12.298832 * 
05:54:12.304858 *
05:54:12.307521 *
05:54:12.309893 *
05:54:12.310037 *
05:54:12.311776 *
05:54:12.313578 *
05:54:12.315410 *
05:54:12.317899 *
05:54:12.319555 *
05:54:12.321456 *
05:54:12.323302 *
05:54:12.323988 *
05:54:12.324458 *
05:54:12.796234
05:54:12.916607
repeatedly commented 5 years ago

Is this a fluentd core bug? Are the logs lost inside fluentd or in a 3rd-party plugin?

repeatedly commented 5 years ago

We can't set up GCP or other cloud services. Could you reproduce the issue in a simpler environment, e.g. a single Linux server?

vascoosx commented 5 years ago

Thank you. I'll try reproducing it in a simpler environment. Meanwhile, could you tell me whether there is any spec on the maximum throughput of the http source? Wherever the issue stems from, it seems to be load-related.

repeatedly commented 5 years ago

Meanwhile, could you tell me whether there is any spec on the maximum throughput of the http source?

I'm not sure, because it depends on machine spec, format, and more... The official documentation mentions one example: https://docs.fluentd.org/input/http#handle-large-data-with-batch-mode
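
For reference, the batch mode described on that page amounts to posting an array of records with an application/json (or application/msgpack) Content-Type, so in_http can accept many events per request. Below is a minimal sketch under that assumption; the port, host, and tag placeholders are illustrative, and as I read the docs this path bypasses a custom <parse> section such as the regexp one above, so clients would need to send already-structured records.

<source>
  @type http
  port 9880
  bind 0.0.0.0
</source>
# A client can then send many records in one request, for example:
#   POST http://<host>:9880/<tag name>
#   Content-Type: application/json
#   [{"message":"first record"},{"message":"second record"}]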

vguaglione commented 5 years ago

We have a similar problem when using the splunk_hec plugin to forward messages to an external Splunk installation via the Splunk heavy forwarder.

We have noticed that when the problem manifests, we see this error in the fluent log:

2019-08-20 14:05:44 +0000 [info]: Worker 0 finished unexpectedly with signal SIGKILL

If the worker is killed, I suspect all of the messages that were in the queue are lost. Is this a correct assumption? We're not currently configured to handle overflow conditions (by backing to a file, for example). We lost three days' worth of messages that had yet to be funneled over to Splunk when this happened.

Looking for clarification to help determine whether it's fluentd or the plugin that is problematic.

daipom commented 1 year ago

Sorry for the delay.

@vguaglione

If the worker is killed, I suspect all of the messages that were in the queue are lost. Is this a correct assumption?

We can use a file buffer. Log loss due to the process being forcibly killed cannot be completely prevented, but it can be minimized.
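
For example, here is a minimal file-buffer sketch, assuming a splunk_hec output like the one described above; the plugin name, path, and limit values are illustrative rather than a tested configuration:

<match **>
  # illustrative output; substitute your actual output plugin and its settings
  @type splunk_hec
  <buffer>
    # persist queued chunks to disk instead of keeping them only in memory
    @type file
    path /var/log/fluent/buffer/splunk
    flush_interval 10s
    chunk_limit_size 8MB
    total_limit_size 1GB
    # apply backpressure instead of dropping records when the buffer is full
    overflow_action block
    # keep retrying failed flushes rather than discarding the chunk
    retry_forever true
  </buffer>
</match>

With a file buffer, chunks that were queued but not yet flushed survive a worker restart, which limits (but does not eliminate) loss when a worker is killed.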

daipom commented 1 year ago

@vascoosx I will close this issue, as there has been no update for a while.

If you are still experiencing this problem and know anything about how to reproduce it, please re-open.