fluent / fluentd

Fluentd: Unified Logging Layer (project under CNCF)
https://www.fluentd.org
Apache License 2.0

Kubernetes container logs - in_tail lose some of rotated logs when rotation is quite fast #3434

Closed doge95 closed 3 years ago

doge95 commented 3 years ago

Describe the bug

We were doing a stress test on our EFK stack. We deployed a simple openresty web server that writes access logs to stdout, and used the vegeta tool to send HTTP requests at a constant rate (QPS: 10000 for 180 seconds). We found that when the log file is rotated quite fast, it takes a long time (around 2 mins) for in_tail to discover the new log file, and in_tail loses some of the rotated files.

To Reproduce

Docker logging driver config:

{
  "exec-opts": ["native.cgroupdriver=systemd"],
  "log-driver": "json-file",
  "log-opts": {
    "max-size": "100m",
    "max-file": "50",
    "compress": "true"
  },
  "storage-driver": "overlay2"
}
# echo "GET http://172.16.151.239" | vegeta attack -rate=10000 -duration=180s | tee results.bin | vegeta report
Requests      [total, rate, throughput]         1799999, 9999.95, 9999.94
Duration      [total, attack, wait]             3m0s, 3m0s, 324.017µs
Latencies     [min, mean, 50, 90, 95, 99, max]  107.842µs, 904.156µs, 306.604µs, 1.96ms, 3.064ms, 8.051ms, 114.653ms
Bytes In      [total, mean]                     0, 0.00
Bytes Out     [total, mean]                     0, 0.00
Success       [ratio]                           100.00%
Status Codes  [code:count]                      204:1799999
Error Set:

Expected behavior

We expect Kibana to also receive 1799999 logs. However, it only receives 622756 entries. (screenshot: Kibana showing 622756 hits)

Your Environment

Your Configuration

<source>
        @type tail
        @log_level debug
        @id in_tail_container_logs
        path /var/log/containers/*.log
        pos_file /var/log/fluentd-containers.log.pos
        tag kubernetes.*
        exclude_path ["/var/log/containers/*_kube-system_*.log", "/var/log/containers/*_default_*.log", "/var/log/containers/*_rook-ceph_*.log", "/var/log/containers/*fluent*.log"]
        read_from_head true
        follow_inodes true
        <parse>
          @type json
          time_format "%Y-%m-%dT%H:%M:%S.%NZ"
          keep_time_key true
        </parse>
      </source>
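The ~2 minute gap before in_tail discovers a new file is consistent with its periodic path re-scan: by default in_tail refreshes its watch list only every 60 seconds (`refresh_interval`) and waits just 5 seconds after rotation (`rotate_wait`) before dropping the old file. A hedged tuning sketch, not tested against this workload (the values below are illustrative); note also that in_tail reads plain text only, so once Docker's `compress: true` gzips a rotated file, its contents can no longer be tailed:

```
<source>
  @type tail
  # ... same path/pos_file/tag/exclude_path as above ...
  refresh_interval 5      # re-scan the glob more often so new files are found quickly
  rotate_wait 30          # keep reading a rotated file longer before giving up on it
  read_from_head true
  follow_inodes true
</source>
```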

      <match kubernetes.**>
          @type elasticsearch_data_stream
          data_stream_name logs-efk-container
          @log_level debug
          hosts https://elasticsearch-client.default.svc.cluster.local:9200
          user elastic
          password xxxxxx
          ssl_verify false
          reload_connections false
          reconnect_on_error true
          reload_on_failure true
          log_es_400_reason false
          request_timeout 1000s
          include_timestamp true
          <buffer>
            flush_thread_count 8
            flush_interval 1s
            retry_max_interval 5
            retry_forever true
          </buffer>
      </match>

Additional context

Docker logs:

# wc -l *.log*
    34143 8a703c5ad9672f8f8101d6ad47c4ed61b4b04d69621e84cbd0851ba1ee3d3d5f-json.log
   588621 8a703c5ad9672f8f8101d6ad47c4ed61b4b04d69621e84cbd0851ba1ee3d3d5f-json.log.1
   588621 8a703c5ad9672f8f8101d6ad47c4ed61b4b04d69621e84cbd0851ba1ee3d3d5f-json.log.2
   588620 8a703c5ad9672f8f8101d6ad47c4ed61b4b04d69621e84cbd0851ba1ee3d3d5f-json.log.3
  1800005 total
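As a quick sanity check (a throwaway script, not part of the setup), the wc totals line up with the attack, and the gap to Kibana's count shows the scale of the loss:

```python
# Line counts reported by `wc -l` above for the four rotated json-log files.
counts = [34143, 588621, 588621, 588620]

total_on_disk = sum(counts)
received_in_kibana = 622756  # from the Kibana screenshot

print(total_on_disk)                       # 1800005
print(total_on_disk - received_in_kibana)  # 1177249 lines never reached Elasticsearch
```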

The last log entry received before the gap is at 22:24. It is at line number 588614 of the log file 8a703c5ad9672f8f8101d6ad47c4ed61b4b04d69621e84cbd0851ba1ee3d3d5f-json.log.3. (screenshot: last Kibana entry at 22:24)

# grep '2021-06-24T14:24:00.849102284Z' 8a703c5ad9672f8f8101d6ad47c4ed61b4b04d69621e84cbd0851ba1ee3d3d5f-json.log.3 -n
588614:{"log":"172.24.0.214 - - [24/Jun/2021:22:24:00 +0800] \"GET / HTTP/1.1\" 204 0 \"-\" \"Go-http-client/1.1\"\n","stream":"stdout","time":"2021-06-24T14:24:00.849102284Z"}

The first log entry received after the gap is at 22:28. It is at the first line of the log file 8a703c5ad9672f8f8101d6ad47c4ed61b4b04d69621e84cbd0851ba1ee3d3d5f-json.log. (screenshot: first Kibana entry after the gap)

# grep '2021-06-24T14:25:58.573922367Z' 8a703c5ad9672f8f8101d6ad47c4ed61b4b04d69621e84cbd0851ba1ee3d3d5f-json.log -n
1:{"log":"172.24.0.214 - - [24/Jun/2021:22:25:58 +0800] \"GET / HTTP/1.1\" 204 0 \"-\" \"Go-http-client/1.1\"\n","stream":"stdout","time":"2021-06-24T14:25:58.573922367Z"}

This proves that in_tail fails to read the 2 files below when log rotation happens quite fast:

8a703c5ad9672f8f8101d6ad47c4ed61b4b04d69621e84cbd0851ba1ee3d3d5f-json.log.1.gz
8a703c5ad9672f8f8101d6ad47c4ed61b4b04d69621e84cbd0851ba1ee3d3d5f-json.log.2.gz
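One way to confirm which rotations in_tail actually tracked is to decode the pos_file, which stores one tab-separated line per watched file: path, read offset (hex), and inode (hex). A minimal sketch with a hypothetical entry (the real file here is /var/log/fluentd-containers.log.pos; the inode column can be compared against `ls -i` on the rotated files):

```python
# Hypothetical pos_file line; real entries live in /var/log/fluentd-containers.log.pos.
entry = "/var/log/containers/web.log\t00000000000186a0\t0000000000002710"

path, pos_hex, inode_hex = entry.split("\t")
offset = int(pos_hex, 16)    # bytes read so far
inode = int(inode_hex, 16)   # inode of the file being tracked

print(path, offset, inode)   # /var/log/containers/web.log 100000 10000
```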
doge95 commented 3 years ago

If we use parse type "none", then Fluentd is able to tail all the logs.

  ## Fluentd configurations:
  fileConfigs:
    01_sources.conf: |-
      <source>
        @type tail
        @log_level debug
        @id in_tail_container_logs
        path /var/log/containers/*.log
        pos_file /var/log/fluentd-containers.log.pos
        tag kubernetes.*
        exclude_path ["/var/log/containers/*_kube-system_*.log", "/var/log/containers/*_tseacloud_*.log", "/var/log/containers/*_default_*.log", "/var/log/containers/*_rook-ceph_*.log", "/var/log/containers/*fluent*.log", "/var/log/containers/*_cloudsql_*.log"]
        read_from_head true
        follow_inodes true
        <parse>
          @type none
        </parse>
      </source>
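One plausible reason `@type none` keeps up is that it skips per-line JSON parsing entirely, so the tail loop spends less time per record and falls behind rotation less easily. A rough, hypothetical illustration of the per-line cost difference (not fluentd's actual code path):

```python
import json
import timeit

# A line shaped like the Docker json-file entries shown above.
line = ('{"log":"172.24.0.214 - - \\"GET / HTTP/1.1\\" 204 0\\n",'
        '"stream":"stdout","time":"2021-06-24T14:24:00.849102284Z"}')

parse_json = timeit.timeit(lambda: json.loads(line), number=100_000)  # @type json
parse_none = timeit.timeit(lambda: line, number=100_000)              # pass-through

print(f"json: {parse_json:.3f}s  none: {parse_none:.3f}s per 100k lines")
```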
ashie commented 3 years ago