aws / aws-for-fluent-bit

The source of the amazon/aws-for-fluent-bit container image
Apache License 2.0

[upstream] connection #-1 to vpc-es-abn-opensearch-xxxxxxxxx.es.amazonaws.com:443 timed out after 10 seconds #833

Open artur-tud opened 3 months ago

We have installed Fluent Bit in our EKS cluster to transfer logs to OpenSearch. We have noticed regular timeout errors during the connection to OpenSearch. Interestingly, most logs are transferred correctly, but a small proportion of them fail to transfer. It is unclear to us how to resolve this issue. Even when we enable debug mode, we do not receive any meaningful error messages.

I realize that there are many issue tickets describing a similar problem. Unfortunately, the recommendations given there do not help me. I have tried different values for net.dns.mode and net.dns.resolver. I have also tried switching off TLS verification (tls.verify) and changing the retry limit (Retry_Limit). Nothing helped.

Configuration

Fluent Bit runs in an EKS cluster (Kubernetes 1.28). It is installed with the aws-for-fluent-bit Helm chart 0.1.33, application version 2.32.2.20240425.

Here is our current fluent-bit.conf

[SERVICE]
    HTTP_Server            On
    HTTP_Listen            0.0.0.0
    HTTP_PORT              2020
    Health_Check           On 
    HC_Errors_Count        5 
    HC_Retry_Failure_Count 5 
    HC_Period              5 
    Log_Level              warn
    Parsers_File           /fluent-bit/parsers/parsers.conf
[INPUT]
    Name                   tail
    Tag                    kube.*
    Path                   /var/log/containers/*.log
    DB                     /var/log/flb_kube.db
    multiline.parser       docker, cri
    Mem_Buf_Limit          20MB
    Skip_Long_Lines        On
    Refresh_Interval       10
[FILTER]
    Name                   kubernetes
    Match                  kube.*
    Kube_URL               https://kubernetes.default.svc.cluster.local:443
    Merge_Log              On
    Merge_Log_Key          data
    Keep_Log               On
    K8S-Logging.Parser     On
    K8S-Logging.Exclude    On
    Buffer_Size            10MB
[FILTER]
    Name                   lua
    Match                  kube.*
    script                 /fluent-bit/lua/filters.lua
    call                   format_logs
[OUTPUT]
    Name                   opensearch
    Match                  *
    AWS_Region             eu-central-1
    AWS_Auth               On
    Host                   vpc-es-abn-opensearch-xxxxxxxxx.es.amazonaws.com
    Port                   443
    tls                    on
    Buffer_Size            20MB
    Index                  aws-fluent-bit
    Type                   _doc
    Logstash_Format        On
    Logstash_Prefix        logstash
    Logstash_DateFormat    %Y.%m.%d
    Time_Key               @timestamp
    Time_Key_Format        %Y-%m-%dT%H:%M:%S
    Time_Key_Nanos         Off
    Include_Tag_Key        Off
    Tag_Key                _flb-key
    Generate_ID            Off
    Write_Operation        create
    Replace_Dots           On
    Trace_Output           Off
    Trace_Error            On
    Current_Time_Index     On
    Logstash_Prefix_Key    os_index
    Suppress_Type_Name     On
    net.dns.mode           TCP
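One set of knobs we have not fully explored yet is the upstream networking options. A sketch of what the OUTPUT section could look like with them set (the option names are from the Fluent Bit networking documentation; the values are illustrative, not a confirmed fix):

[OUTPUT]
    Name                        opensearch
    Match                       *
    # ... existing Host/Port/tls/AWS_Auth settings unchanged ...
    # Raise the connect timeout above the 10 s default seen in the log
    net.connect_timeout         30
    # Force DNS over TCP with the legacy blocking resolver
    net.dns.mode                TCP
    net.dns.resolver            LEGACY
    # Keep connections alive so each flush does not re-resolve and re-handshake
    net.keepalive               on
    net.keepalive_idle_timeout  30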

Here is the log:

Logs(fluent-bit/fluent-bit-5zt55:aws-for-fluent-bit):

Fluent Bit v1.9.10
* Copyright (C) 2015-2022 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

[2024/06/06 09:35:13] [ info] [fluent bit] version=1.9.10, commit=9be1f19e5a, pid=1
[2024/06/06 09:35:13] [ info] [storage] version=1.4.0, type=memory-only, sync=normal, checksum=disabled, max_chunks_up=128
[2024/06/06 09:35:13] [ info] [cmetrics] version=0.3.7
[2024/06/06 09:35:13] [ info] [input:tail:tail.0] multiline core started
[2024/06/06 09:35:13] [ info] [filter:kubernetes:kubernetes.0] https=1 host=kubernetes.default.svc.cluster.local port=443
[2024/06/06 09:35:13] [ info] [filter:kubernetes:kubernetes.0]  token updated
[2024/06/06 09:35:13] [ info] [filter:kubernetes:kubernetes.0] local POD info OK
[2024/06/06 09:35:13] [ info] [filter:kubernetes:kubernetes.0] testing connectivity with API server...
[2024/06/06 09:35:13] [ info] [filter:kubernetes:kubernetes.0] connectivity OK
[2024/06/06 09:35:13] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
[2024/06/06 09:35:13] [ info] [sp] stream processor started
[2024/06/06 09:35:13] [ info] [input:tail:tail.0] inotify_fs_add(): inode=9437515 watch_fd=1 name=/var/log/containers/aws-node-r8gmr_kube-system_aws-eks-nodeagent-7738e90d0c853740f085
[2024/06/06 09:35:13] [ info] [input:tail:tail.0] inotify_fs_add(): inode=8392764 watch_fd=2 name=/var/log/containers/aws-node-r8gmr_kube-system_aws-node-4185d1a2546b33074dbe1f3a2db65
[2024/06/06 09:35:13] [ info] [input:tail:tail.0] inotify_fs_add(): inode=7396358 watch_fd=3 name=/var/log/containers/aws-node-r8gmr_kube-system_aws-vpc-cni-init-66351ba3e7cad47c93f6e
[2024/06/06 09:35:13] [ info] [input:tail:tail.0] inotify_fs_add(): inode=18874491 watch_fd=4 name=/var/log/containers/ebs-csi-node-ccw24_kube-system_ebs-plugin-91ff1c3f87b909e0c468f6
[2024/06/06 09:35:13] [ info] [input:tail:tail.0] inotify_fs_add(): inode=22020220 watch_fd=5 name=/var/log/containers/ebs-csi-node-ccw24_kube-system_liveness-probe-b9e85a168bc24c359c
[2024/06/06 09:35:13] [ info] [input:tail:tail.0] inotify_fs_add(): inode=19923074 watch_fd=6 name=/var/log/containers/ebs-csi-node-ccw24_kube-system_node-driver-registrar-787c96c9e42
[2024/06/06 09:35:13] [ info] [input:tail:tail.0] inotify_fs_add(): inode=5476078 watch_fd=7 name=/var/log/containers/prometheus-prometheus-node-exporter-h8mxh_prometheus_node-exporte
[2024/06/06 09:35:14] [ info] [input:tail:tail.0] inotify_fs_add(): inode=7019429 watch_fd=8 name=/var/log/containers/kube-proxy-82zbg_kube-system_kube-proxy-ac8295ddd275612f5634a2127
...
[2024/06/06 09:36:14] [error] [upstream] connection #-1 to vpc-es-abn-opensearch-xxxxxxxxx.es.amazonaws.com:443 timed out after 10 seconds
[2024/06/06 09:36:14] [ warn] [net] getaddrinfo(host='vpc-es-abn-opensearch-xxxxxxxxx.es.amazonaws.com', err=12): Timeout while contacting DNS serve
[2024/06/06 09:36:29] [error] [upstream] connection #-1 to vpc-es-abn-opensearch-xxxxxxxxx.es.amazonaws.com:443 timed out after 10 seconds
[2024/06/06 09:36:29] [ warn] [net] getaddrinfo(host='vpc-es-abn-opensearch-xxxxxxxxx.es.amazonaws.com', err=12): Timeout while contacting DNS serve
[2024/06/06 09:36:29] [ warn] [engine] failed to flush chunk '1-1717666510.580386114.flb', retry in 9 seconds: task_id=0, input=tail.0 > output=opensearch.0 (out_id=0)
[2024/06/06 09:36:48] [error] [upstream] connection #-1 to vpc-es-abn-opensearch-xxxxxxxxx.es.amazonaws.com:443 timed out after 10 seconds
[2024/06/06 09:36:48] [ warn] [net] getaddrinfo(host='vpc-es-abn-opensearch-xxxxxxxxx.es.amazonaws.com', err=12): Timeout while contacting DNS serve
[2024/06/06 09:36:48] [ warn] [engine] failed to flush chunk '1-1717666511.76029697.flb', retry in 8 seconds: task_id=1, input=tail.0 > output=opensearch.0 (out_id=0)
[2024/06/06 09:37:07] [error] [upstream] connection #-1 to vpc-es-abn-opensearch-xxxxxxxxx.es.amazonaws.com:443 timed out after 10 seconds
[2024/06/06 09:37:07] [ warn] [net] getaddrinfo(host='vpc-es-abn-opensearch-xxxxxxxxx.es.amazonaws.com', err=12): Timeout while contacting DNS serve
[2024/06/06 09:37:07] [ warn] [engine] failed to flush chunk '1-1717666526.88917251.flb', retry in 6 seconds: task_id=2, input=tail.0 > output=opensearch.0 (out_id=0)
[2024/06/06 09:37:35] [error] [upstream] connection #-1 to vpc-es-abn-opensearch-xxxxxxxxx.es.amazonaws.com:443 timed out after 10 seconds
[2024/06/06 09:37:35] [ warn] [engine] failed to flush chunk '1-1717666526.562789262.flb', retry in 7 seconds: task_id=3, input=tail.0 > output=opensearch.0 (out_id=0)
[2024/06/06 09:37:53] [error] [upstream] connection #-1 to vpc-es-abn-opensearch-xxxxxxxxx.es.amazonaws.com:443 timed out after 10 seconds
[2024/06/06 09:37:53] [engine] caught signal (SIGSEGV)
[2024/06/06 09:37:53] [ warn] [net] getaddrinfo(host='vpc-es-abn-opensearch-xxxxxxxxx.es.amazonaws.com', err=12): Timeout while contacting DNS serve
Stream closed EOF for fluent-bit/fluent-bit-8rj9t (aws-for-fluent-bit)

After a while, if the error occurs too often, the pod restarts, as the SIGSEGV at the end of the log above shows.
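To rule Fluent Bit itself out, it would help to measure the same getaddrinfo call from a pod on the affected node. A minimal Python sketch under the assumption that the symptom is a slow or dropped VPC DNS response (the hostname in the main block is a placeholder; substitute the real VPC endpoint when running in-cluster):

```python
import socket
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout

def resolve_with_timeout(host, timeout=10.0):
    """Resolve `host` the way Fluent Bit does (getaddrinfo), but give up after `timeout` seconds."""
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(socket.getaddrinfo, host, 443, type=socket.SOCK_STREAM)
    start = time.monotonic()
    try:
        addrs = future.result(timeout=timeout)
    except FutureTimeout:
        raise RuntimeError(f"getaddrinfo for {host!r} exceeded {timeout}s (compare err=12 in the log)")
    finally:
        # Do not wait for a hung lookup; an abandoned worker thread is fine for a probe.
        pool.shutdown(wait=False)
    # Deduplicate the resolved addresses and report how long the lookup took.
    return sorted({a[4][0] for a in addrs}), time.monotonic() - start

if __name__ == "__main__":
    # Placeholder host: replace with the vpc-es-... endpoint when probing in-cluster.
    ips, took = resolve_with_timeout("localhost", timeout=5.0)
    print(f"resolved to {ips} in {took:.3f}s")
```

If this probe also stalls for about 10 seconds from a pod on the affected node, the problem is in the VPC resolver or node-local DNS path rather than in the opensearch output plugin.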