aws / aws-for-fluent-bit

The source of the amazon/aws-for-fluent-bit container image
Apache License 2.0

[http_client] broken connection to firehose.eu-central-1.amazonaws.com:443 #645

Open zvia-eiger-hs opened 1 year ago

zvia-eiger-hs commented 1 year ago

Describe the question/issue

Fluent Bit logs show frequent network connection issues to Firehose:

[error] [http_client] broken connection to firehose.eu-central-1.amazonaws.com:443 ?
[error] [tls] error: error:00000006:lib(0):func(0):EVP lib
[error] [output:kinesis_firehose:kinesis_firehose.1] Failed to send log records to production_de_logs_stream
[error] [output:kinesis_firehose:kinesis_firehose.1] Failed to send log records
[error] [output:kinesis_firehose:kinesis_firehose.1] Failed to send records
[ warn] [engine] failed to flush chunk '1-1682551542.474843075.flb', retry in 6 seconds: task_id=0, input=tail.7 > output=kinesis_firehose.1 (out_id=1)
[ info] [engine] flush chunk '1-1682551542.474843075.flb' succeeded at retry 1: task_id=0, input=tail.7 > output=kinesis_firehose.1 (out_id=1)
[error] [tls] error: unexpected EOF
[error] [aws_client] connection initialization error

Occasionally we also see log loss:

 [ warn] [engine] chunk '1-1682412324.275151234.flb' cannot be retried: task_id=0, input=tail.7 > output=kinesis_firehose.1

We have upgraded Fluent Bit to v2.31.9, but we still see the same errors.

Configuration

apiVersion: v1
kind: ConfigMap
metadata:
  name: fluent-bit-config
  labels:
    k8s-app: fluent-bit
data:
  # Configuration files: server, input, filters and output
  # ======================================================
  fluent-bit.conf: |
    [SERVICE]
        Flush         1
        Log_Level     info
        Daemon        off
        Parsers_File  /fluent-bit/parsers/parsers.conf
        HTTP_Server   On
        HTTP_Listen   0.0.0.0
        HTTP_Port     2020

    @INCLUDE input-kubernetes.conf
    @INCLUDE filter-kubernetes.conf
    @INCLUDE output-elasticsearch.conf

  input-kubernetes.conf: |
    [INPUT]
        Name              tail
        Tag               kube.production-de.*
        Path              /var/log/containers/*production-de_*.log
        Parser            docker
        DB                /var/log/flb_kube.db
        Mem_Buf_Limit     5MB
        Skip_Long_Lines   On
        Ignore_Older      2d
        Refresh_Interval  10

  filter-kubernetes.conf: |
    [FILTER]
        Name                kubernetes
        Match               kube.*
        Kube_URL            https://kubernetes.default.svc:443
        Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
        Kube_Tag_Prefix     kube.var.log.containers.
        Merge_Log           On
        Merge_Log_Trim      On
        Keep_Log            Off
        K8S-Logging.Parser  On
        K8S-Logging.Exclude Off
    [FILTER]
        Name   grep
        match  *
        exclude  event ping
    [FILTER]
        Name   grep
        match  *
        exclude  event health

  output-elasticsearch.conf: |
    [OUTPUT]
        Name            kinesis_firehose
        Match           kube.production-de.*
        region          eu-central-1
        delivery_stream production_de_logs_stream
        workers 2

  parsers.conf: |
    [PARSER]
        Name         keda
        Format       regex
        Regex        ^(?<time>[^\t]+)\t(?<severity>[^\t]+)\t(?<logger>[^\t]+)\t(?<event>[^\t]+)\t(?<keda_json>[^\t]+)$
        Time_Key     time
        Decode_Field json keda_json
        Time_Keep    On

Fluent Bit Log Output

[2023/04/25 08:42:00] [ warn] [engine] failed to flush chunk '1-1682412120.191981284.flb', retry in 9 seconds: task_id=0, input=tail.9 > output=kinesis_firehose.1 (out_id=1)
[2023/04/25 08:42:01] [error] [http_client] broken connection to firehose.eu-central-1.amazonaws.com:443 ?
[2023/04/25 08:42:01] [error] [tls] error: error:00000006:lib(0):func(0):EVP lib
[2023/04/25 08:42:01] [error] [src/flb_http_client.c:1199 errno=25] Inappropriate ioctl for device
[2023/04/25 08:42:01] [error] [output:kinesis_firehose:kinesis_firehose.1] Failed to send log records to production_de_logs_stream
[2023/04/25 08:42:01] [error] [output:kinesis_firehose:kinesis_firehose.1] Failed to send log records
[2023/04/25 08:42:01] [error] [output:kinesis_firehose:kinesis_firehose.1] Failed to send records
[2023/04/25 08:42:01] [ warn] [engine] failed to flush chunk '1-1682412120.282088735.flb', retry in 7 seconds: task_id=1, input=tail.9 > output=kinesis_firehose.1 (out_id=1)
[2023/04/25 08:42:08] [ info] [engine] flush chunk '1-1682412120.282088735.flb' succeeded at retry 1: task_id=1, input=tail.9 > output=kinesis_firehose.1 (out_id=1)
[2023/04/25 08:42:09] [ info] [engine] flush chunk '1-1682412120.191981284.flb' succeeded at retry 1: task_id=0, input=tail.9 > output=kinesis_firehose.1 (out_id=1)
[2023/04/25 08:42:16] [ info] [filter:kubernetes:kubernetes.0]  token updated
[2023/04/25 08:43:25] [error] [http_client] broken connection to firehose.eu-central-1.amazonaws.com:443 ?
[2023/04/25 08:43:25] [error] [tls] error: error:00000006:lib(0):func(0):EVP lib
[2023/04/25 08:43:25] [error] [src/flb_http_client.c:1199 errno=25] Inappropriate ioctl for device
[2023/04/25 08:43:25] [error] [output:kinesis_firehose:kinesis_firehose.1] Failed to send log records to production_de_logs_stream
[2023/04/25 08:43:25] [error] [output:kinesis_firehose:kinesis_firehose.1] Failed to send log records
[2023/04/25 08:43:25] [error] [output:kinesis_firehose:kinesis_firehose.1] Failed to send records
[2023/04/25 08:43:25] [ warn] [engine] failed to flush chunk '1-1682412205.134915032.flb', retry in 11 seconds: task_id=0, input=tail.7 > output=kinesis_firehose.1 (out_id=1)
[2023/04/25 08:43:36] [ info] [engine] flush chunk '1-1682412205.134915032.flb' succeeded at retry 1: task_id=0, input=tail.7 > output=kinesis_firehose.1 (out_id=1)
[2023/04/25 08:45:21] [ info] [input:tail:tail.2] inotify_fs_add(): inode=163014582 watch_fd=1 name=/var/log/containers/......log
[2023/04/25 08:45:24] [error] [http_client] broken connection to firehose.eu-central-1.amazonaws.com:443 ?
[2023/04/25 08:45:24] [error] [http_client] broken connection to firehose.eu-central-1.amazonaws.com:443 ?
[2023/04/25 08:45:24] [error] [output:kinesis_firehose:kinesis_firehose.1] Failed to send log records to production_de_logs_stream
[2023/04/25 08:45:24] [error] [output:kinesis_firehose:kinesis_firehose.1] Failed to send log records
[2023/04/25 08:45:24] [error] [output:kinesis_firehose:kinesis_firehose.1] Failed to send records
[2023/04/25 08:45:24] [ warn] [engine] failed to flush chunk '1-1682412324.275151234.flb', retry in 9 seconds: task_id=0, input=tail.7 > output=kinesis_firehose.1 (out_id=1)
[2023/04/25 08:45:31] [ info] [input:tail:tail.2] inotify_fs_add(): inode=76839812 watch_fd=2 name=/var/log/containers/.....log
[2023/04/25 08:45:33] [error] [http_client] broken connection to firehose.eu-central-1.amazonaws.com:443 ?
[2023/04/25 08:45:33] [error] [http_client] broken connection to firehose.eu-central-1.amazonaws.com:443 ?
[2023/04/25 08:45:33] [error] [output:kinesis_firehose:kinesis_firehose.1] Failed to send log records to production_de_logs_stream
[2023/04/25 08:45:33] [error] [output:kinesis_firehose:kinesis_firehose.1] Failed to send log records
[2023/04/25 08:45:33] [error] [output:kinesis_firehose:kinesis_firehose.1] Failed to send records
[2023/04/25 08:45:33] [ warn] [engine] chunk '1-1682412324.275151234.flb' cannot be retried: task_id=0, input=tail.7 > output=kinesis_firehose.1
[2023/04/25 08:46:11] [ info] [input:tail:tail.2] inotify_fs_add(): inode=20513258 watch_fd=3 name=/var/log/containers/............log
[2023/04/25 08:47:47] [ info] [input:tail:tail.2] inotify_fs_remove(): inode=20513258 watch_fd=3

Fluent Bit Version Info

v2.31.9

Cluster Details

EKS - Kubernetes version 1.21
Daemon deployment for Fluent Bit

Related Issues

I saw this issue, which seemed similar: https://github.com/aws/aws-for-fluent-bit/issues/354

I decided to open this bug after reading https://github.com/aws/aws-for-fluent-bit/blob/mainline/troubleshooting/debugging.md#how-do-i-tell-if-fluent-bit-is-losing-logs and https://github.com/aws/aws-for-fluent-bit/blob/mainline/troubleshooting/debugging.md#network-connection-issues

PettitWesley commented 1 year ago

> Occasionally we also see log loss:

As you saw in the guide, network connection issues happen occasionally; they only become a problem when they cause log loss.

What throughput are you sending at? Do you ever get throttle exceptions from Firehose?

As noted in that guide:

> One of the simplest causes of network connection issues is throttling, some AWS APIs will block new connections from the same IP for throttling (rather than wasting effort returning a throttling error in the response). We have seen this with the CloudWatch Logs API. So, the first thing to check when you experience network connection issues is your log ingestion/throughput rate and the limits for your destination.

I notice you are using the default retry_limit; consider increasing it: https://docs.fluentbit.io/manual/administration/scheduling-and-retries

With high throughput and only the single default retry, log loss could happen occasionally.
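
To gauge how often retries actually run out, the [engine] messages quoted earlier in this issue can be tallied. The following is a minimal Python sketch, not part of Fluent Bit or of this image: the function name `summarize` and the count keys are made up for illustration, and the patterns assume the exact log wording shown above.

```python
import re

# Patterns for the three [engine] messages quoted in this issue.
# They assume the wording shown above; other versions may differ.
FAILED = re.compile(r"failed to flush chunk '([^']+)'")
SUCCEEDED = re.compile(r"flush chunk '([^']+)' succeeded at retry \d+")
DROPPED = re.compile(r"chunk '([^']+)' cannot be retried")


def summarize(lines):
    """Count flush failures, recovered retries and dropped chunks.

    `lines` is any iterable of Fluent Bit log lines (a list, an open
    file, sys.stdin). Returns (counts, dropped_chunks), where
    dropped_chunks lists the chunk names that were given up on.
    """
    counts = {"failed_flush": 0, "recovered": 0, "dropped": 0}
    dropped_chunks = []
    for line in lines:
        match = DROPPED.search(line)
        if match:
            # "cannot be retried" means the chunk was discarded (log loss).
            counts["dropped"] += 1
            dropped_chunks.append(match.group(1))
        elif SUCCEEDED.search(line):
            counts["recovered"] += 1
        elif FAILED.search(line):
            counts["failed_flush"] += 1
    return counts, dropped_chunks
```

Calling `summarize(sys.stdin)` on piped pod logs gives a rough picture: a non-zero `dropped` count is the log loss case, while `failed_flush` entries that are matched by `recovered` entries are only transient errors.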

zvia-eiger-hs commented 1 year ago

@PettitWesley Thank you for replying. Throughput peaks at 2.20 MiB/s and is mostly around 1 MiB/s, and I do not see throttle exceptions from Firehose. We have added a retry limit, which seems to cover most log loss cases, but some still happen.

    [OUTPUT]
        Name            kinesis_firehose
        Match           kube.*
        region          eu-central-1
        delivery_stream production_de_logs_stream
        Retry_Limit 3
        workers 2

Adding some more bad logs:

[2023/05/17 13:09:31] [error] [output:kinesis_firehose:kinesis_firehose.1] Failed to send log records to production_de_logs_stream
[2023/05/17 13:09:31] [error] [output:kinesis_firehose:kinesis_firehose.1] Failed to send log records
[2023/05/17 13:09:31] [error] [output:kinesis_firehose:kinesis_firehose.1] Failed to send records
[2023/05/17 13:09:31] [ warn] [engine] failed to flush chunk '1-1684328971.482540519.flb', retry in 6 seconds: task_id=0, input=tail.1 > output=kinesis_firehose.1 (out_id=1)
[2023/05/17 13:09:37] [ info] [engine] flush chunk '1-1684328971.482540519.flb' succeeded at retry 1: task_id=0, input=tail.1 > output=kinesis_firehose.1 (out_id=1)
[2023/05/17 13:13:25] [error] [http_client] broken connection to firehose.eu-central-1.amazonaws.com:443 ?
[2023/05/17 13:13:25] [error] [http_client] broken connection to firehose.eu-central-1.amazonaws.com:443 ?
[2023/05/17 13:13:25] [error] [output:kinesis_firehose:kinesis_firehose.1] Failed to send log records to production_de_logs_stream
[2023/05/17 13:13:25] [error] [output:kinesis_firehose:kinesis_firehose.1] Failed to send log records
[2023/05/17 13:13:25] [error] [output:kinesis_firehose:kinesis_firehose.1] Failed to send records
[2023/05/17 13:13:25] [ warn] [engine] failed to flush chunk '1-1684329205.285653322.flb', retry in 6 seconds: task_id=0, input=tail.1 > output=kinesis_firehose.1 (out_id=1)
[2023/05/17 13:13:31] [error] [http_client] broken connection to firehose.eu-central-1.amazonaws.com:443 ?
[2023/05/17 13:13:31] [error] [http_client] broken connection to firehose.eu-central-1.amazonaws.com:443 ?
[2023/05/17 13:13:31] [error] [output:kinesis_firehose:kinesis_firehose.1] Failed to send log records to production_de_logs_stream
[2023/05/17 13:13:31] [error] [output:kinesis_firehose:kinesis_firehose.1] Failed to send log records
[2023/05/17 13:13:31] [error] [output:kinesis_firehose:kinesis_firehose.1] Failed to send records
[2023/05/17 13:13:31] [ warn] [engine] failed to flush chunk '1-1684329205.285653322.flb', retry in 15 seconds: task_id=0, input=tail.1 > output=kinesis_firehose.1 (out_id=1)
[2023/05/17 13:13:37] [error] [tls] error: unexpected EOF
[2023/05/17 13:13:37] [error] [aws_client] connection initialization error
[2023/05/17 13:13:45] [error] [http_client] broken connection to firehose.eu-central-1.amazonaws.com:443 ?
[2023/05/17 13:13:45] [error] [tls] error: error:00000006:lib(0):func(0):EVP lib
[2023/05/17 13:13:45] [error] [src/flb_http_client.c:1189 errno=25] Inappropriate ioctl for device
[2023/05/17 13:13:45] [error] [output:kinesis_firehose:kinesis_firehose.1] Failed to send log records to production_de_logs_stream
[2023/05/17 13:13:45] [error] [output:kinesis_firehose:kinesis_firehose.1] Failed to send log records
[2023/05/17 13:13:45] [error] [output:kinesis_firehose:kinesis_firehose.1] Failed to send records
[2023/05/17 13:13:45] [ warn] [engine] failed to flush chunk '1-1684329224.768845563.flb', retry in 7 seconds: task_id=1, input=tail.1 > output=kinesis_firehose.1 (out_id=1)
[2023/05/17 13:13:46] [ info] [engine] flush chunk '1-1684329205.285653322.flb' succeeded at retry 2: task_id=0, input=tail.1 > output=kinesis_firehose.1 (out_id=1)
[2023/05/17 13:13:52] [ info] [engine] flush chunk '1-1684329224.768845563.flb' succeeded at retry 1: task_id=1, input=tail.1 > output=kinesis_firehose.1 (out_id=1)
[2023/05/17 13:13:54] [error] [http_client] broken connection to firehose.eu-central-1.amazonaws.com:443 ?
[2023/05/17 13:14:35] [error] [http_client] broken connection to firehose.eu-central-1.amazonaws.com:443 ?
[2023/05/17 13:14:35] [error] [http_client] broken connection to firehose.eu-central-1.amazonaws.com:443 ?
[2023/05/17 13:14:35] [error] [output:kinesis_firehose:kinesis_firehose.1] Failed to send log records to production_de_logs_stream
[2023/05/17 13:14:35] [error] [output:kinesis_firehose:kinesis_firehose.1] Failed to send log records
[2023/05/17 13:14:35] [error] [output:kinesis_firehose:kinesis_firehose.1] Failed to send records
[2023/05/17 13:14:35] [ warn] [engine] failed to flush chunk '1-1684329275.63531396.flb', retry in 8 seconds: task_id=0, input=tail.1 > output=kinesis_firehose.1 (out_id=1)
[2023/05/17 13:14:43] [ info] [engine] flush chunk '1-1684329275.63531396.flb' succeeded at retry 1: task_id=0, input=tail.1 > output=kinesis_firehose.1 (out_id=1)
[2023/05/17 13:20:00] [ info] [filter:kubernetes:kubernetes.0]  token updated
[2023/05/17 13:29:03] [error] [tls] error: unexpected EOF
[2023/05/17 13:29:03] [error] [aws_client] connection initialization error
[2023/05/17 13:30:00] [error] [tls] error: unexpected EOF
[2023/05/17 13:30:00] [error] [aws_client] connection initialization error
[2023/05/17 13:30:01] [ info] [filter:kubernetes:kubernetes.0]  token updated

Note: I have bumped the version to v2.31.10 and still see the same errors :)

PettitWesley commented 1 year ago

Here are some examples of errors from one of our own tests; these errors do happen occasionally: https://github.com/aws/aws-for-fluent-bit/pull/654

kclinden commented 9 months ago

Did you ever figure out a solution for this?

elnurm commented 8 months ago

Facing the same issue in version v2.1.6

[2024/01/24 09:36:19] [error] [tls] error: error:00000006:lib(0):func(0):EVP lib
[2024/01/24 09:36:19] [error] [tls] error: error:00000006:lib(0):func(0):EVP lib
[2024/01/24 09:36:19] [error] [output:kinesis_firehose:kinesis_firehose.1] Failed to send log records to kinesis-events
[2024/01/24 09:36:19] [error] [output:kinesis_firehose:kinesis_firehose.1] Failed to send log records
[2024/01/24 09:36:19] [error] [output:kinesis_firehose:kinesis_firehose.1] Failed to send records
[2024/01/24 09:36:19] [error] [engine] chunk '14612-1706088928.508247200.flb' cannot be retried: task_id=0, input=tail.1 > output=kinesis_firehose.1