Closed: qingling128 closed this issue 3 years ago
The symptom
[debug] [upstream] connection #NNN failed to logging.googleapis.com:443
originates in a TLS handshake error in tls_net_handshake,
which is triggered by a broken connection (errno = EPIPE).
However, according to the latest findings, the issue could be related to fluent-bit accepting an IPv6 result for the DNS query when the system does not have an IPv6 address configured. The reason why this does not result in an error in net_connect_async
is still unknown, but so far the evidence strongly backs this theory.
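The failure mode described above can be sketched in a few lines. This is not fluent-bit's actual resolver code (which is written in C); `prefer_ipv4` is a hypothetical helper illustrating the fix direction, reordering getaddrinfo()-style results so A records are tried before AAAA records on a host without an IPv6 route:

```python
import socket

def prefer_ipv4(results):
    """Reorder getaddrinfo()-style results so IPv4 (A-record) entries
    come first. If the resolver instead hands back an AAAA result first
    on a host with no IPv6 connectivity, connecting to that address
    fails, surfacing as EPIPE / TLS handshake errors further up."""
    v4 = [r for r in results if r[0] == socket.AF_INET]
    other = [r for r in results if r[0] != socket.AF_INET]
    return v4 + other

# Synthetic results in the 5-tuple shape socket.getaddrinfo() returns:
# (family, type, proto, canonname, sockaddr)
results = [
    (socket.AF_INET6, socket.SOCK_STREAM, socket.IPPROTO_TCP, "",
     ("2404:6800:4004:813::200a", 443, 0, 0)),
    (socket.AF_INET, socket.SOCK_STREAM, socket.IPPROTO_TCP, "",
     ("142.251.42.138", 443)),
]

print(prefer_ipv4(results)[0][4][0])  # -> 142.251.42.138
```

The addresses are the ones reported later in this thread; on a dual-stack host with working IPv6, preferring IPv6 first would be harmless, which is why the bug only bites systems without an IPv6 address configured.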
If you want to test this in your own system, the easiest way to do it would be to add the following line to your /etc/hosts file:
173.194.213.95 logging.googleapis.com
I am also seeing these logs when trying to push logs to DataDog. Fluent Bit version: 1.8.9. Even the retry doesn't seem to work.
Is it possible to verify if the problem relates to DNS?
Is there a workaround in place?
> I am also seeing these logs when trying to push logs to DataDog. Fluent Bit version: 1.8.9. Even the retry doesn't seem to work.
> Is it possible to verify if the problem relates to DNS?
> Is there a workaround in place?
You have two ways to confirm that this issue is what's affecting you at the moment:
Add an entry in your /etc/hosts file with the IP/host mapping for the endpoint (I think it's http-intake.logs.datadoghq.com, and the IP I got for it is 3.233.146.17)
Clone and build the PR that's linked to this issue, where I fixed it
Don't hesitate to contact me in the fluent slack server if you have any questions.
@leonardo-albertovich Thanks for the hints. During my tests today I didn't encounter any connection issues. If they pop up again I will try the first thing.
PRs in progress to fix this:
Some errors I met after upgrading to v1.8.9:
First issue:
[2021/11/10 16:20:09] [ warn] [engine] failed to flush chunk '1-1636561198.640695704.flb', retry in 42 seconds: task_id=1, input=tail.0 > output=stackdriver.0 (out_id=0)
[2021/11/10 16:20:09] [ warn] [net] getaddrinfo(host='logging.googleapis.com', err=11): Could not contact DNS servers
Second issue is similar to https://github.com/fluent/fluent-bit/issues/4120, but I set the buffer_limit to 0 in the kubernetes filter and it didn't work for me.
I am wondering whether we need to set the 4192 here to 0: https://github.com/fluent/fluent-bit/blob/431459122841c4600abe6e384fcfb56e5967b276/plugins/out_stackdriver/stackdriver.c#L2153
@leonardo-albertovich,
I have Fluent Bit running as a DaemonSet in K3s.
I switched to the 1.8.x-debug image and, within a terminal in the running container, I ran nslookup:
/ # nslookup logging.googleapis.com
Server: 10.43.0.10
Address: 10.43.0.10:53
Non-authoritative answer:
Name: logging.googleapis.com
Address: 2404:6800:4004:813::200a
*** Can't find logging.googleapis.com: No answer
/ # nslookup -type=a logging.googleapis.com
Server: 10.43.0.10
Address: 10.43.0.10:53
Name: logging.googleapis.com
Address: 142.251.42.138
/ # nslookup -type=aaaa logging.googleapis.com
Server: 10.43.0.10
Address: 10.43.0.10:53
Non-authoritative answer:
Name: logging.googleapis.com
Address: 2404:6800:4004:81d::200a
/ # nslookup logging.googleapis.com
Server: 10.43.0.10
Address: 10.43.0.10:53
Name: logging.googleapis.com
Address: 142.251.42.138
*** Can't find logging.googleapis.com: No answer
After that, Fluent Bit was able to connect to Stackdriver - and after a Pod restart the DNS issue was back.
The issue is not specific to 1.8.9. I observed the same behaviour in 1.8.3 and 1.8.4.
I've worked around the issue by adding dnsConfig to the DaemonSet.
dnsConfig:
  nameservers:
    - 1.1.1.1
    - 8.8.8.8
  options:
    - name: ndots
      value: "1"
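For reference, in a DaemonSet manifest this block sits under spec.template.spec. A minimal sketch of the placement (field names per the Kubernetes Pod spec; the dnsPolicy setting is an assumption about how the workaround was applied, since the original comment didn't show it):

```yaml
apiVersion: apps/v1
kind: DaemonSet
spec:
  template:
    spec:
      # "None" makes the Pod use only the nameservers listed below,
      # instead of merging them with the cluster DNS settings.
      dnsPolicy: "None"
      dnsConfig:
        nameservers:
          - 1.1.1.1
          - 8.8.8.8
        options:
          - name: ndots
            value: "1"
```

With the default ClusterFirst policy the dnsConfig entries are merged with the cluster DNS instead of replacing it, which may or may not bypass the misbehaving resolver.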
I am not completely sure whether the issues are related but as I said, the symptoms (error messages) are the same.
@fabito the issue reported by @qingling128 was related to a specific GCE detail (instances cannot connect to some hosts through the external IPv6 address), which was exposed because the DNS client incorporated in 1.8.5 produces results in a way that differs from how the system resolver produces them.
Regardless of that, the underlying issue is in the connection error detection and handling mechanisms, which is what the related PR fixes.
In your case, if I understood you correctly, you are not running in GCE and you actually do have DNS issues if you don't override the default nameservers (DHCP?) with those, which means the issue is different.
What do you think?
> In your case, if I understood you correctly, you are not running in GCE and you actually do have DNS issues if you don't override the default nameservers (DHCP?) with those, which means the issue is different.
You are right, I am not in GCE. What intrigued me is the fact that Fluent Bit is the only workload that needed that DNS tweak - we have other workloads (within the same k3s instance) that connect to other Google APIs (storage, pubsub, etc.) which are working just fine.
In https://github.com/fluent/fluent-bit/pull/4295 you mentioned the order in which the DNS answers arrive.
You can notice in the nslookup executions I cited above that when -type is not specified, nslookup seems to randomly pick either AAAA or A as its first query (and it always fails on the second query).
Anyway, Fluent Bit probably has its own DNS client implementation (decoupled from nslookup), but I still thought it could be helpful to share my findings :-)
You are right, fluent-bit does not use the system resolver, and that's the origin of the issue in this case. I'd like to know more about your particular case, so it would be really helpful if you could clone and build the branch from PR #4295.
If you want to try to get to the bottom of it, feel free to contact me in the fluent slack so we can determine whether yours is a different problem and, if so, what its root cause is.
Thanks for chiming in!
Edit: My name in slack is Leonardo Almiñana
Bug Report
Describe the bug
When Fluent Bit 1.8.9 first restarts to apply configuration changes, we are seeing spamming errors in the log. When we enabled debug logging, it showed that the errors are the result of a failure to connect to the backend.
The same issue does not happen with Fluent Bit 1.8.4.
To Reproduce
We were testing with GCE VMs.
Step 1. Create a VM
Step 2. In the VM, install Fluent Bit and start / restart it.
// For my case, the Fluent Bit binary comes with the Ops Agent install. But it should be reproducible with just a regular Fluent Bit binary.
// Ensure nothing is in the buffer and no prior Fluent Bit process was running
// Start and stop Fluent Bit
/opt/google-cloud-ops-agent/subagents/fluent-bit/bin/fluent-bit is Fluent Bit 1.8.9
// Start Fluent Bit again
Step 3. Observe warnings in the log
** Full Fluent Bit log fluent-bit-trimmed.log
** Fluent Bit Metrics
Expected behavior
Fluent Bit resumes log ingestion smoothly after restarts.
Actual behavior
Lots of failed chunks. Further investigation revealed that some of the bad connections never recover.
Your Environment
Version used: Fluent Bit 1.8.9 (The same issue does not exist in 1.8.4)
Configuration:
[SERVICE]
    Daemon                    off
    Flush                     1
    HTTP_Listen               0.0.0.0
    HTTP_PORT                 2020
    HTTP_Server               On
    Log_Level                 info
    storage.backlog.mem_limit 50M
    storage.checksum          on
    storage.max_chunks_up     128
    storage.metrics           on
    storage.sync              normal

[INPUT]
    Buffer_Chunk_Size 512k
    Buffer_Max_Size   5M
    DB                ${buffers_dir}/default_pipeline_syslog
    Key               message
    Mem_Buf_Limit     10M
    Name              tail
    Path              /var/log/messages,/var/log/syslog
    Read_from_Head    True
    Rotate_Wait       30
    Skip_Long_Lines   On
    Tag               default_pipeline.syslog
    storage.type      filesystem

[FILTER]
    Add   logName syslog
    Match default_pipeline.syslog
    Name  modify

[FILTER]
    Emitter_Mem_Buf_Limit 10M
    Emitter_Storage.type  filesystem
    Match                 default_pipeline.syslog
    Name                  rewrite_tag
    Rule                  $logName .* $logName false

[FILTER]
    Match  syslog
    Name   modify
    Remove logName

[OUTPUT]
    Match_Regex       ^(syslog)$
    Name              stackdriver
    Retry_Limit       3
    resource          gce_instance
    stackdriver_agent Google-Cloud-Ops-Agent-Logging/2.6.0 (BuildDistro=sles15;Platform=linux;ShortName=sles;ShortVersion=15-SP2)
    tls               On
    tls.verify        Off
    workers           8