aws / aws-for-fluent-bit

The source of the amazon/aws-for-fluent-bit container image
Apache License 2.0

Frequent errors when using output plugin cloudwatch_logs #274

Open yangyang919 opened 2 years ago

yangyang919 commented 2 years ago

Describe the question/issue

I know this is not a new issue, but I'm wondering whether the root cause has been identified. We are using the cloudwatch_logs plugin to send logs from our Pods to AWS CloudWatch. The errors look like below:

[2021/12/01 01:53:50] [ warn] [engine] failed to flush chunk '1-1638323628.679735747.flb', retry in 10 seconds: task_id=0, input=emitter_for_rewrite_tag.5 > output=http.2 (out_id=2)
[2021/12/01 02:02:41] [ warn] [http_client] cannot increase buffer: current=4096 requested=36864 max=4096
[2021/12/01 02:02:41] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Recieved code 200 but response was invalid, x-amzn-RequestId header not found
[2021/12/01 02:02:41] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Failed to send log events
[2021/12/01 02:02:41] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Failed to send events
[2021/12/01 02:02:41] [ warn] [engine] failed to flush chunk '1-1638324161.187277303.flb', retry in 7 seconds: task_id=0, input=tail.0 > output=cloudwatch_logs.3 (out_id=3)
[2021/12/01 02:02:55] [ warn] [http_client] cannot increase buffer: current=4096 requested=36864 max=4096
[2021/12/01 02:02:55] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Recieved code 200 but response was invalid, x-amzn-RequestId header not found
[2021/12/01 02:02:55] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Failed to send log events
[2021/12/01 02:02:55] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Failed to send events
[2021/12/01 02:02:55] [ warn] [engine] failed to flush chunk '1-1638324175.42999390.flb', retry in 7 seconds: task_id=0, input=tail.0 > output=cloudwatch_logs.3 (out_id=3)
[2021/12/01 02:04:29] [ warn] [http_client] cannot increase buffer: current=4096 requested=36864 max=4096
[2021/12/01 02:04:29] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Recieved code 200 but response was invalid, x-amzn-RequestId header not found
[2021/12/01 02:04:29] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Failed to send log events
[2021/12/01 02:04:29] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Failed to send events
[2021/12/01 02:04:29] [ warn] [engine] failed to flush chunk '1-1638324268.678872073.flb', retry in 9 seconds: task_id=0, input=tail.0 > output=cloudwatch_logs.3 (out_id=3)
[2021/12/01 02:04:38] [ warn] [http_client] malformed HTTP response from logs.eu-central-1.amazonaws.com:443 on connection #34
[2021/12/01 02:04:38] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Failed to send log events
[2021/12/01 02:04:38] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Failed to send log events
[2021/12/01 02:04:38] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Failed to send events
[2021/12/01 02:04:38] [ warn] [engine] chunk '1-1638324268.678872073.flb' cannot be retried: task_id=0, input=tail.0 > output=cloudwatch_logs.3
[2021/12/01 02:07:49] [ warn] [http_client] cannot increase buffer: current=4096 requested=36864 max=4096
[2021/12/01 02:07:49] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Recieved code 200 but response was invalid, x-amzn-RequestId header not found
[2021/12/01 02:07:49] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Failed to send log events
[2021/12/01 02:07:49] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Failed to send events
[2021/12/01 02:07:49] [ warn] [engine] failed to flush chunk '1-1638324469.352128961.flb', retry in 11 seconds: task_id=0, input=tail.0 > output=cloudwatch_logs.3 (out_id=3)
[2021/12/01 02:08:00] [ warn] [http_client] malformed HTTP response from logs.eu-central-1.amazonaws.com:443 on connection #102
[2021/12/01 02:08:00] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Failed to send log events
[2021/12/01 02:08:00] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Failed to send log events
[2021/12/01 02:08:00] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Failed to send events
[2021/12/01 02:08:00] [ warn] [engine] chunk '1-1638324469.352128961.flb' cannot be retried: task_id=0, input=tail.0 > output=cloudwatch_logs.3
[2021/12/01 02:10:31] [ warn] [http_client] malformed HTTP response from logs.eu-central-1.amazonaws.com:443 on connection #102
[2021/12/01 02:10:31] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Failed to send log events
[2021/12/01 02:10:31] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Failed to send log events
[2021/12/01 02:10:31] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Failed to send events
[2021/12/01 02:10:31] [ warn] [engine] failed to flush chunk '1-1638324630.866492143.flb', retry in 9 seconds: task_id=0, input=tail.0 > output=cloudwatch_logs.3 (out_id=3)
[2021/12/01 02:11:41] [ warn] [http_client] cannot increase buffer: current=4096 requested=36864 max=4096
[2021/12/01 02:11:41] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Recieved code 200 but response was invalid, x-amzn-RequestId header not found

[INPUT]
    Name tail
    Path /var/log/containers/*.log
    Parser custom_cri
    Tag kube.*
    Mem_Buf_Limit 5MB
    Skip_Long_Lines On

[INPUT]
    Name systemd
    Tag host.*
    Systemd_Filter _SYSTEMD_UNIT=kubelet.service
    Read_From_Tail On

[FILTER]
    Name record_modifier
    Match *
    Record cluster_identifier {{ .Values.cluster }}
[FILTER]
    Name kubernetes
    Match kube.*
    Merge_Log On
    Keep_Log Off
    K8S-Logging.Parser On
    K8S-Logging.Exclude On
[FILTER]
    Name                lua
    Match               kube.*
    script              kubernetes_map_keys.lua
    call                kubernetes_map_keys
[FILTER]
    Name                lua
    Match               *
    script              kubernetes_map_keys.lua
    call                add_source_type
[FILTER]
    Name                lua
    Match               *
    script              kubernetes_map_keys.lua
    call                add_aligned_sent_received
[FILTER]
    Name                rewrite_tag
    Match               kube.*
    Rule                $source_type ^(istio-envoy-proxy-access-log)$ istio-envoy false

[OUTPUT]
    Name s3
    Match kube.*
    bucket upscale-sre-fluentbit-logs-dev-use1
    region {{ .Values.region }}
    total_file_size 100M
    upload_timeout 5m
    s3_key_format /%Y/%m/%d/%H/
    s3_key_format_tag_delimiters .

[OUTPUT]
    Name                cloudwatch_logs
    Match               kube.*
    region              {{ .Values.region }}
    log_group_name      fluent-bit-cloudwatch/{{ .Values.clusterName }}
    log_stream_prefix   from-fluent-bit-
    auto_create_group   true
PettitWesley commented 2 years ago

The errors look like below:

Don't see anything below 😐

yangyang919 commented 2 years ago

Just added...hit the wrong button...

PettitWesley commented 2 years ago

What version are you using? I think it's something old:

cannot increase buffer: current=4096 requested=36864 max=4096

I believe this was fixed a while ago.

Also, try the auto_retry_requests option in newer versions as well: https://github.com/aws/aws-for-fluent-bit/pull/266/files
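
For reference, a minimal sketch of what enabling that option might look like on the CloudWatch output shown above (the option name comes from the linked PR; its placement here is an assumption, so check the release notes for the version you run):

[OUTPUT]
    Name                cloudwatch_logs
    Match               kube.*
    region              {{ .Values.region }}
    log_group_name      fluent-bit-cloudwatch/{{ .Values.clusterName }}
    log_stream_prefix   from-fluent-bit-
    auto_create_group   true
    # Immediately retry a request once on transient network errors
    # (e.g. a broken keepalive connection) before the normal retry cycle.
    auto_retry_requests true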

yangyang919 commented 2 years ago

What version are you using? I think it's something old:

cannot increase buffer: current=4096 requested=36864 max=4096

I believe this was fixed a while ago.

Also, try the auto_retry_requests option in newer versions as well: https://github.com/aws/aws-for-fluent-bit/pull/266/files

The version we are using is 1.8.6, I believe. Is that too old?

PettitWesley commented 2 years ago

@yangyang919 Unfortunately there have been a lot of bug fixes throughout the 1.8.x series, so yeah, 1.8.6 is a bit old.

Please use the AWS distro, the latest version is on 1.8.9: https://github.com/aws/aws-for-fluent-bit/releases/tag/v2.21.2

yangyang919 commented 2 years ago

Thanks @PettitWesley, we'll upgrade the version and get back to you.

yangyang919 commented 2 years ago

Hi @PettitWesley ,

We have upgraded Fluent Bit to version 1.8.10. New issues appeared: the Pod gets OOMKilled and back-off restarts frequently. If we remove the CloudWatch output section, everything is normal.

Our config is shown below; do you see any problem?

[SERVICE]
    Daemon Off
    Flush 1
    Log_Level warn
    Parsers_File parsers.conf
    Parsers_File custom_parsers.conf
    HTTP_Server On
    HTTP_Listen 0.0.0.0
    HTTP_Port 2020
    Health_Check On

[INPUT]
    Name tail
    Path /var/log/containers/*.log
    Parser custom_cri
    Tag kube.*
    Mem_Buf_Limit 128MB
    Skip_Long_Lines On

[INPUT]
    Name systemd
    Tag host.*
    Systemd_Filter _SYSTEMD_UNIT=kubelet.service
    Read_From_Tail On

[FILTER]
    Name record_modifier
    Match *
    Record cluster_identifier stage-use1-saml

[FILTER]
    Name kubernetes
    Match kube.*
    Merge_Log On
    Keep_Log Off
    K8S-Logging.Parser On
    K8S-Logging.Exclude On

[FILTER]
    Name                lua
    Match               kube.*
    script              kubernetes_map_keys.lua
    call                kubernetes_map_keys

[FILTER]
    Name                lua
    Match               *
    script              kubernetes_map_keys.lua
    call                add_source_type

[FILTER]
    Name                lua
    Match               *
    script              kubernetes_map_keys.lua
    call                add_aligned_sent_received

[FILTER]
    Name                rewrite_tag
    Match               kube.*
    Rule                $source_type ^(istio-envoy-proxy-access-log)$ istio-envoy false

[OUTPUT]
    Name http
    Match kube.*
    Retry_Limit False
    Host host
    port 443
    uri /
    http_User user
    http_Passwd pwd
    format json
    tls on
    tls.debug 1

[OUTPUT]
    Name http
    Match approuter
    Retry_Limit False
    Host host
    port 443
    uri /
    http_User user
    http_Passwd pwd
    format json
    tls on
    tls.debug 1

[OUTPUT]
    Name http
    Match istio-envoy
    Retry_Limit False
    Host host
    port 443
    uri /customindex/istio-envoy
    http_User user
    http_Passwd pwd
    format json
    tls on
    tls.debug 1

[OUTPUT]
    Name                cloudwatch_logs
    Match               kube.*
    region              us-east-1
    log_group_name      fluent-bit-cloudwatch/stage-use1
    log_stream_prefix   from-fluent-bit-
    auto_create_group   true

[OUTPUT]
    Name                cloudwatch_logs
    Match               approuter
    region              us-east-1
    log_group_name      fluent-bit-cloudwatch/stage-use1
    log_stream_prefix   from-fluent-bit-
    auto_create_group   true
DrewZhang13 commented 2 years ago

A memory growth issue was reported by another customer for AWS for Fluent Bit version 2.21.1, which corresponds to Fluent Bit 1.8.9: https://github.com/aws/aws-for-fluent-bit/issues/277

PettitWesley commented 2 years ago

Please also consider checking out relevant sections of our new debugging guide: https://github.com/aws/aws-for-fluent-bit/pull/266/files

yangyang919 commented 2 years ago

Hi @PettitWesley @DrewZhang13

Thanks for replying. Here is what we observed:

  1. As you can see in our config, the outputs are sent to two places: one is Fluentd (for Elasticsearch), the other is AWS CloudWatch.
  2. If both outputs are enabled, the pods suffer very frequent restarts due to OOMKilled.
  3. If only CloudWatch or only Fluentd is enabled, OOMKills become much less frequent. They still happen, but very rarely.

So I'm thinking that tuning parameters might solve this issue. Our fluent-bit pod has resource limits like this:

    Limits:
      cpu:     200m
      memory:  512Mi
    Requests:
      cpu:     100m
      memory:  256Mi

[INPUT]
    Name tail
    Path /var/log/containers/*.log
    Parser custom_cri
    Tag kube.*
    Mem_Buf_Limit 128MB
    Skip_Long_Lines On

You have seen more cases like ours. Do you see any parameters that could be optimized?

Some logs before the pod gets OOMKilled:

[2021/12/10 11:15:50] [debug] [out coro] cb_destroy coro_id=352 [2021/12/10 11:15:50] [ info] [engine] chunk '1-1639134890.24990947.flb' is not retried (no retry config): task_id=7, input=tail.0 > output=http.0 (out_id=0) [2021/12/10 11:15:50] [debug] [task] destroy task=0x7f2633a37460 (task_id=7) [2021/12/10 11:15:50] [debug] [socket] could not validate socket status for #99 (don't worry) [2021/12/10 11:15:52] [debug] [input:tail:tail.0] inode=1166440 events: IN_MODIFY [2021/12/10 11:15:53] [debug] [task] created task=0x7f2633a37460 id=7 OK [2021/12/10 11:15:53] [debug] [tls] connection #99 SSL_connect: before SSL initialization [2021/12/10 11:15:53] [debug] [tls] connection #99 SSL_connect: SSLv3/TLS write client hello [2021/12/10 11:15:53] [debug] [tls] connection #99 WANT_READ [2021/12/10 11:15:53] [debug] [tls] connection #99 SSL_connect: SSLv3/TLS write client hello [2021/12/10 11:15:53] [debug] [tls] connection #99 SSL_connect: SSLv3/TLS read server hello [2021/12/10 11:15:53] [debug] [tls] connection #99 SSL_connect: TLSv1.3 read encrypted extensions [2021/12/10 11:15:53] [debug] [tls] connection #99 SSL_connect: SSLv3/TLS read server certificate [2021/12/10 11:15:53] [debug] [tls] connection #99 SSL_connect: TLSv1.3 read server certificate verify [2021/12/10 11:15:53] [debug] [tls] connection #99 SSL_connect: SSLv3/TLS read finished [2021/12/10 11:15:53] [debug] [tls] connection #99 SSL_connect: SSLv3/TLS write change cipher spec [2021/12/10 11:15:53] [debug] [tls] connection #99 SSL_connect: SSLv3/TLS write finished [2021/12/10 11:15:53] [debug] [http_client] not using http_proxy for header [2021/12/10 11:15:53] [debug] [tls] connection #99 SSL_connect: SSL negotiation finished successfully [2021/12/10 11:15:53] [debug] [tls] connection #99 SSL_connect: SSL negotiation finished successfully [2021/12/10 11:15:53] [debug] [tls] connection #99 SSL_connect: SSLv3/TLS read server session ticket [2021/12/10 11:15:53] [debug] [tls] connection #99 SSL_connect: SSL negotiation finished successfully [2021/12/10 11:15:53] [debug] [tls] connection #99 SSL_connect: SSL negotiation finished successfully [2021/12/10 11:15:53] [debug] [tls] connection #99 SSL_connect: SSLv3/TLS read server session ticket [2021/12/10 11:15:54] [debug] [input:tail:tail.0] inode=1294459 events: IN_MODIFY [2021/12/10 11:15:54] [ info] [output:http:http.0] fluentd-sf-c2510cb1-149f-411f-a25f-266f8863fe65.cls-01.cloud.logs.services.us10.hana.ondemand.com:443, HTTP status=200 [2021/12/10 11:15:54] [debug] [upstream] KA connection #110 to fluentd-sf-c2510cb1-149f-411f-a25f-266f8863fe65.cls-01.cloud.logs.services.us10.hana.ondemand.com:443 is now available [2021/12/10 11:15:54] [debug] [out coro] cb_destroy coro_id=416 [2021/12/10 11:15:54] [debug] [task] destroy task=0x7f2633a3b560 (task_id=8) [2021/12/10 11:15:54] [debug] [task] created task=0x7f2633a3b560 id=8 OK [2021/12/10 11:15:54] [debug] [upstream] KA connection #118 to fluentd-sf-c2510cb1-149f-411f-a25f-266f8863fe65.cls-01.cloud.logs.services.us10.hana.ondemand.com:443 has been assigned (recycled) [2021/12/10 11:15:54] [debug] [http_client] not using http_proxy for header [2021/12/10 11:15:54] [debug] [output:cloudwatch_logs:cloudwatch_logs.3] Sending 1 events [2021/12/10 11:15:54] [debug] [output:cloudwatch_logs:cloudwatch_logs.3] Sending log events to log stream from-fluent-bit-kube.var.log.containers.calico-node-7mgp2_kube-system_calico-node-87a7327d7d88b919f6b8fb8b7e42f3458010b8dc26292a5725636d1a96ee15c4.log [2021/12/10 11:15:54] [debug] [upstream] KA 
connection #44 to logs.us-east-1.amazonaws.com:443 has been assigned (recycled) [2021/12/10 11:15:54] [debug] [http_client] not using http_proxy for header [2021/12/10 11:15:54] [debug] [aws_credentials] Requesting credentials from the env provider.. [2021/12/10 11:15:54] [ warn] [http_client] malformed HTTP response from logs.us-east-1.amazonaws.com:443 on connection #44 [2021/12/10 11:15:54] [debug] [aws_client] logs.us-east-1.amazonaws.com: http_do=-1, HTTP Status: 0 [2021/12/10 11:15:54] [debug] [upstream] KA connection #44 to logs.us-east-1.amazonaws.com:443 is now available [2021/12/10 11:15:54] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Failed to send log events [2021/12/10 11:15:54] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Failed to send log events [2021/12/10 11:15:54] [error] [output:cloudwatch_logs:cloudwatch_logs.3] Failed to send events [2021/12/10 11:15:54] [debug] [out coro] cb_destroy coro_id=421 [2021/12/10 11:15:54] [debug] [retry] new retry created for task_id=8 attempts=1 [2021/12/10 11:15:54] [ warn] [engine] failed to flush chunk '1-1639134954.235814243.flb', retry in 8 seconds: task_id=8, input=tail.0 > output=cloudwatch_logs.3 (out_id=3) [2021/12/10 11:15:54] [debug] [upstream] KA connection #44 to logs.us-east-1.amazonaws.com:443 has been disconnected by the remote service [2021/12/10 11:15:54] [debug] [socket] could not validate socket status for #44 (don't worry) [2021/12/10 11:15:55] [ info] [output:http:http.0] fluentd-sf-c2510cb1-149f-411f-a25f-266f8863fe65.cls-01.cloud.logs.services.us10.hana.ondemand.com:443, HTTP status=200 [2021/12/10 11:15:55] [debug] [upstream] KA connection #104 to fluentd-sf-c2510cb1-149f-411f-a25f-266f8863fe65.cls-01.cloud.logs.services.us10.hana.ondemand.com:443 is now available [2021/12/10 11:15:55] [debug] [out coro] cb_destroy coro_id=413 [2021/12/10 11:15:55] [debug] [task] destroy task=0x7f26299b8fa0 (task_id=1) [2021/12/10 11:15:55] [ info] [output:http:http.0] fluentd-sf-c2510cb1-149f-411f-a25f-266f8863fe65.cls-01.cloud.logs.services.us10.hana.ondemand.com:443, HTTP status=200 [2021/12/10 11:15:55] [debug] [upstream] KA connection #118 to fluentd-sf-c2510cb1-149f-411f-a25f-266f8863fe65.cls-01.cloud.logs.services.us10.hana.ondemand.com:443 is now available [2021/12/10 11:15:55] [debug] [out coro] cb_destroy coro_id=420 [2021/12/10 11:15:55] [ info] [output:http:http.0] fluentd-sf-c2510cb1-149f-411f-a25f-266f8863fe65.cls-01.cloud.logs.services.us10.hana.ondemand.com:443, HTTP status=200 [2021/12/10 11:15:55] [debug] [upstream] KA connection #120 to fluentd-sf-c2510cb1-149f-411f-a25f-266f8863fe65.cls-01.cloud.logs.services.us10.hana.ondemand.com:443 is now available [2021/12/10 11:15:55] [debug] [out coro] cb_destroy coro_id=414 [2021/12/10 11:15:55] [debug] [task] destroy task=0x7f2633a3bf60 (task_id=4) [2021/12/10 11:16:02] [ info] [output:http:http.0] fluentd-sf-c2510cb1-149f-411f-a25f-266f8863fe65.cls-01.cloud.logs.services.us10.hana.ondemand.com:443, HTTP status=200 [2021/12/10 11:16:02] [debug] [upstream] KA connection #108 to fluentd-sf-c2510cb1-149f-411f-a25f-266f8863fe65.cls-01.cloud.logs.services.us10.hana.ondemand.com:443 is now available [2021/12/10 11:16:02] [debug] [upstream] KA connection #108 to fluentd-sf-c2510cb1-149f-411f-a25f-266f8863fe65.cls-01.cloud.logs.services.us10.hana.ondemand.com:443 has been disconnected by the remote service [2021/12/10 11:16:02] [debug] [out coro] cb_destroy coro_id=390 [2021/12/10 11:16:02] [debug] [task] destroy task=0x7f2633a3b420 (task_id=3) [2021/12/10 
11:16:02] [debug] [socket] could not validate socket status for #108 (don't worry) [2021/12/10 11:16:02] [debug] [output:cloudwatch_logs:cloudwatch_logs.3] Sending 1 events [2021/12/10 11:16:02] [debug] [output:cloudwatch_logs:cloudwatch_logs.3] Sending log events to log stream from-fluent-bit-kube.var.log.containers.calico-node-7mgp2_kube-system_calico-node-87a7327d7d88b919f6b8fb8b7e42f3458010b8dc26292a5725636d1a96ee15c4.log [2021/12/10 11:16:02] [debug] [http_client] not using http_proxy for header [2021/12/10 11:16:02] [debug] [aws_credentials] Requesting credentials from the env provider.. [2021/12/10 11:16:02] [debug] [upstream] KA connection #108 to logs.us-east-1.amazonaws.com:443 is now available [2021/12/10 11:16:02] [debug] [output:cloudwatch_logs:cloudwatch_logs.3] PutLogEvents http status=200 [2021/12/10 11:16:02] [debug] [output:cloudwatch_logs:cloudwatch_logs.3] Sent events to from-fluent-bit-kube.var.log.containers.calico-node-7mgp2_kube-system_calico-node-87a7327d7d88b919f6b8fb8b7e42f3458010b8dc26292a5725636d1a96ee15c4.log [2021/12/10 11:16:02] [debug] [output:cloudwatch_logs:cloudwatch_logs.3] Sent 1 events to CloudWatch [2021/12/10 11:16:02] [debug] [out coro] cb_destroy coro_id=422 [2021/12/10 11:16:02] [debug] [task] destroy task=0x7f2633a3b560 (task_id=8) [2021/12/10 11:16:02] [debug] [input:tail:tail.0] inode=1166440 events: IN_MODIFY [2021/12/10 11:16:02] [debug] [input:tail:tail.0] inode=1166440 events: IN_MODIFY [2021/12/10 11:16:03] [debug] [task] created task=0x7f2633a3b560 id=1 OK [2021/12/10 11:16:03] [debug] [tls] connection #102 SSL_connect: before SSL initialization [2021/12/10 11:16:03] [debug] [tls] connection #102 SSL_connect: SSLv3/TLS write client hello [2021/12/10 11:16:03] [debug] [tls] connection #102 WANT_READ [2021/12/10 11:16:03] [debug] [tls] connection #102 SSL_connect: SSLv3/TLS write client hello [2021/12/10 11:16:03] [debug] [tls] connection #102 SSL_connect: SSLv3/TLS read server hello [2021/12/10 11:16:03] [debug] [tls] connection #102 SSL_connect: TLSv1.3 read encrypted extensions [2021/12/10 11:16:03] [debug] [tls] connection #102 SSL_connect: SSLv3/TLS read server certificate [2021/12/10 11:16:03] [debug] [tls] connection #102 SSL_connect: TLSv1.3 read server certificate verify [2021/12/10 11:16:03] [debug] [tls] connection #102 SSL_connect: SSLv3/TLS read finished [2021/12/10 11:16:03] [debug] [tls] connection #102 SSL_connect: SSLv3/TLS write change cipher spec [2021/12/10 11:16:03] [debug] [tls] connection #102 SSL_connect: SSLv3/TLS write finished [2021/12/10 11:16:03] [debug] [http_client] not using http_proxy for header [2021/12/10 11:16:03] [debug] [tls] connection #102 SSL_connect: SSL negotiation finished successfully [2021/12/10 11:16:03] [debug] [tls] connection #102 SSL_connect: SSL negotiation finished successfully [2021/12/10 11:16:03] [debug] [tls] connection #102 SSL_connect: SSLv3/TLS read server session ticket [2021/12/10 11:16:03] [debug] [tls] connection #102 SSL_connect: SSL negotiation finished successfully [2021/12/10 11:16:03] [debug] [tls] connection #102 SSL_connect: SSL negotiation finished successfully [2021/12/10 11:16:03] [debug] [tls] connection #102 SSL_connect: SSLv3/TLS read server session ticket [2021/12/10 11:16:04] [ info] [output:http:http.2] fluentd-sf-c2510cb1-149f-411f-a25f-266f8863fe65.cls-01.cloud.logs.services.us10.hana.ondemand.com:443, HTTP status=200 [2021/12/10 11:16:04] [debug] [upstream] KA connection #102 to 
fluentd-sf-c2510cb1-149f-411f-a25f-266f8863fe65.cls-01.cloud.logs.services.us10.hana.ondemand.com:443 is now available [2021/12/10 11:16:04] [debug] [out coro] cb_destroy coro_id=59 [2021/12/10 11:16:04] [debug] [task] destroy task=0x7f2633a3b560 (task_id=1) [2021/12/10 11:16:04] [ info] [output:http:http.0] fluentd-sf-c2510cb1-149f-411f-a25f-266f8863fe65.cls-01.cloud.logs.services.us10.hana.ondemand.com:443, HTTP status=200 [2021/12/10 11:16:04] [debug] [upstream] KA connection #138 to fluentd-sf-c2510cb1-149f-411f-a25f-266f8863fe65.cls-01.cloud.logs.services.us10.hana.ondemand.com:443 is now available [2021/12/10 11:16:04] [debug] [out coro] cb_destroy coro_id=396 [2021/12/10 11:16:04] [debug] [task] destroy task=0x7f26299b9040 (task_id=5) [2021/12/10 11:16:04] [debug] [upstream] KA connection #138 to fluentd-sf-c2510cb1-149f-411f-a25f-266f8863fe65.cls-01.cloud.logs.services.us10.hana.ondemand.com:443 has been disconnected by the remote service [2021/12/10 11:16:04] [debug] [socket] could not validate socket status for #138 (don't worry) [2021/12/10 11:16:05] [ info] [output:http:http.0] fluentd-sf-c2510cb1-149f-411f-a25f-266f8863fe65.cls-01.cloud.logs.services.us10.hana.ondemand.com:443, HTTP status=200 [2021/12/10 11:16:05] [debug] [upstream] KA connection #105 to fluentd-sf-c2510cb1-149f-411f-a25f-266f8863fe65.cls-01.cloud.logs.services.us10.hana.ondemand.com:443 is now available [2021/12/10 11:16:05] [debug] [out coro] cb_destroy coro_id=410 [2021/12/10 11:16:05] [debug] [task] destroy task=0x7f2633a3a020 (task_id=9) [2021/12/10 11:16:09] [debug] [input:tail:tail.0] inode=1166155 events: IN_MODIFY [2021/12/10 11:16:09] [debug] [task] created task=0x7f2633a3a020 id=1 OK [2021/12/10 11:16:09] [debug] [upstream] KA connection #102 to fluentd-sf-c2510cb1-149f-411f-a25f-266f8863fe65.cls-01.cloud.logs.services.us10.hana.ondemand.com:443 has been assigned (recycled) [2021/12/10 11:16:09] [debug] [http_client] not using http_proxy for header [2021/12/10 11:16:09] [debug] [input:tail:tail.0] inode=1170715 events: IN_MODIFY [2021/12/10 11:16:09] [ info] [output:http:http.2] fluentd-sf-c2510cb1-149f-411f-a25f-266f8863fe65.cls-01.cloud.logs.services.us10.hana.ondemand.com:443, HTTP status=200 [2021/12/10 11:16:09] [debug] [upstream] KA connection #102 to fluentd-sf-c2510cb1-149f-411f-a25f-266f8863fe65.cls-01.cloud.logs.services.us10.hana.ondemand.com:443 is now available [2021/12/10 11:16:09] [debug] [out coro] cb_destroy coro_id=60 [2021/12/10 11:16:09] [debug] [task] destroy task=0x7f2633a3a020 (task_id=1) [2021/12/10 11:16:10] [debug] [task] created task=0x7f2633a3a020 id=1 OK [2021/12/10 11:16:10] [debug] [upstream] KA connection #110 to fluentd-sf-c2510cb1-149f-411f-a25f-266f8863fe65.cls-01.cloud.logs.services.us10.hana.ondemand.com:443 has been assigned (recycled) [2021/12/10 11:16:10] [debug] [http_client] not using http_proxy for header [2021/12/10 11:16:10] [debug] [output:cloudwatch_logs:cloudwatch_logs.3] Sending 1 events [2021/12/10 11:16:10] [debug] [output:cloudwatch_logs:cloudwatch_logs.3] Sending log events to log stream from-fluent-bit-kube.var.log.containers.oneagent-7tvjp_dynatrace_dynatrace-oneagent-10f778095fd6aa924c26a41c818872058bcf3c3c0a9832f2db0c96e73eda2bf9.log [2021/12/10 11:16:10] [debug] [upstream] KA connection #108 to logs.us-east-1.amazonaws.com:443 has been assigned (recycled) [2021/12/10 11:16:10] [debug] [http_client] not using http_proxy for header [2021/12/10 11:16:10] [debug] [aws_credentials] Requesting credentials from the env provider..

PettitWesley commented 2 years ago

@yangyang919

We have upgraded fluent-bit to version 1.8.10. New issues appeared: Pod gets OOMKilled, and Back-off restarted frequently. As you can see in our config, the outputs are sent to two places. One is fluentD for ElasticSearch, while the other is to AWS Cloud Watch.

Which outputs are seeing errors? (Both? Or one more than the other?) Usually OOMKill/high memory usage is caused by errors in Fluent Bit, which lead to retries, which means the logs pile up in the buffer. Consider using the monitoring interface to get error counts: https://docs.fluentbit.io/manual/administration/monitoring#health-check-for-fluent-bit
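
As a hedged illustration of that monitoring link, the built-in HTTP server can also flag an unhealthy state based on error and retry counts; the option names below are from the Fluent Bit monitoring docs, while the threshold values are placeholders:

[SERVICE]
    HTTP_Server            On
    HTTP_Listen            0.0.0.0
    HTTP_Port              2020
    Health_Check           On
    # /api/v1/health reports unhealthy when, within HC_Period seconds,
    # output errors or failed retries exceed these counts.
    HC_Errors_Count        5
    HC_Retry_Failure_Count 5
    HC_Period              60

Per-output error and retry counters are also exposed under /api/v1/metrics on the same server.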

Also, what are the error messages?

So I'm thinking maybe tuning parameters can solve this issue. Our fluent-bit pod has resource limits like this: Limits: cpu: 200m, memory: 512Mi; Requests: cpu: 100m, memory: 256Mi; Mem_Buf_Limit 128MB

So Mem_Buf_Limit is the amount of data the input can buffer in memory. Note that you can also enable filesystem buffering as an additional layer, though memory is still used in that case and usage can still be high; don't think of it as a filesystem-instead-of-memory feature, think of it as filesystem in addition to memory.
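
As an illustration of "filesystem in addition to memory", a minimal sketch of enabling filesystem buffering (the paths and sizes are placeholders, not recommendations):

[SERVICE]
    storage.path              /var/log/flb-storage/
    storage.sync              normal
    storage.backlog.mem_limit 5M

[INPUT]
    Name            tail
    Path            /var/log/containers/*.log
    Tag             kube.*
    Mem_Buf_Limit   5MB
    # Chunks over the memory limit spill to disk instead of pausing the input,
    # but chunks currently being processed still occupy memory.
    storage.type    filesystem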

When data is sent, it's copied into a new buffer for each output, and then the output might create additional buffers for the HTTP requests and for processing. Thus, it is very reasonable for the following to be true:

"actual memory usage" > "# of outputs" * "Total Mem_Buf_Limit for inputs"

That is just for normal/happy cases (and if your pods are producing tons of logs); in error cases, it could be much more...
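
To make the inequality concrete with the numbers quoted in this thread (purely an illustration, not a measurement): the config above has five outputs and a 128 MB Mem_Buf_Limit on the tail input, so the right-hand side is already 5 × 128 MB = 640 MB, which exceeds the 512 Mi container limit before any temporary request or filter buffers are counted.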

yangyang919 commented 2 years ago

@PettitWesley Thanks for replying! This is a very important message for us:

"actual memory usage" > "# of outputs" * "Total Mem_Buf_Limit for inputs"

We have 6 outputs in total: 3 for Fluentd and 3 for CloudWatch. Since Mem_Buf_Limit is 128 MB, the actual usage for Fluentd alone is already at least 3 * 128 MB. We found that Fluentd has many retry logs, while CloudWatch is quite stable.

I'll lower the Mem_Buf_Limit to verify this.

PettitWesley commented 2 years ago

@yangyang919 Remember that lowering the mem_buf_limit could lead to log loss under high throughput and you should check the fluent bit logs for paused/overlimit warnings.

yangyang919 commented 2 years ago

@PettitWesley Yes, thanks for the reminder. I also enabled filesystem storage. But to really solve this, we need to make sure the Fluentd side can digest our logs; otherwise, no matter how large a buffer we set, it will eventually fill up. Another question: is there a way to monitor live buffer consumption in Fluent Bit?

PettitWesley commented 2 years ago

@yangyang919 See this: https://docs.fluentbit.io/manual/administration/monitoring#metrics-examples

I think there are disk storage metrics too but only if you use the config option noted here: https://docs.fluentbit.io/manual/administration/monitoring#rest-api-interface
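
For reference, a sketch of the option being pointed at (names per the Fluent Bit monitoring docs; once enabled, chunk and buffer counters become queryable over the built-in HTTP server, e.g. at /api/v1/storage):

[SERVICE]
    HTTP_Server     On
    HTTP_Listen     0.0.0.0
    HTTP_Port       2020
    # Expose storage/chunk metrics through the HTTP monitoring server.
    storage.metrics On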

yangyang919 commented 2 years ago

Thanks @PettitWesley, I can see that the buffer is consumed entirely by the tail plugin. A further question: we have a bunch of filters; is there a way to check memory consumption by those plugins as well? I already set Mem_Buf_Limit to 20 MB, so with 6 outputs that is 120 MB maximum, and the requested memory is now 320 MB. But over a 4-hour period, we still see 5 pods out of 30 restarted once due to OOM (which is already much, much better). I suspect some other plugin is eating the memory...

{"storage_layer":{"chunks":{"total_chunks":10,"mem_chunks":10,"fs_chunks":0,"fs_chunks_up":0,"fs_chunks_down":0}},"input_chunks":{"tail.0":{"status":{"overlimit":true,"mem_size":"19.1M","mem_limit":"19.1M"},"chunks":{"total":10,"up":10,"down":0,"busy":9,"busy_size":"17.6M"}},"systemd.1":{"status":{"overlimit":false,"mem_size":"0b","mem_limit":"4.8M"},"chunks":{"total":0,"up":0,"down":0,"busy":0,"busy_size":"0b"}},"storage_backlog.2":{"status":{"overlimit":false,"mem_size":"0b","mem_limit":"0b"},"chunks":{"total":0,"up":0,"down":0,"busy":0,"busy_size":"0b"}},"emitter_for_rewrite_tag.5":{"status":{"overlimit":false,"mem_size":"0b","mem_limit":"9.5M"},"chunks":{"total":0,"up":0,"down":0,"busy":0,"busy_size":"0b"}}}}

PettitWesley commented 2 years ago

@yangyang919 Unfortunately there's currently no way to see plugin-specific memory usage.

Also, remember that the formula I gave used > not =; each plugin will have all sorts of temporary processing buffers, so it's possible for usage to be higher.

yangyang919 commented 2 years ago

Thanks @PettitWesley, today I found a very strange behavior. One pod was restarted 3 times, but its memory consumption is definitely below the requested resources (request 400M, limit 800M). Do you have any thoughts?

Screen Shot 2021-12-14 at 1 50 12 PM
PettitWesley commented 2 years ago

@yangyang919 Do the logs contain any evidence that it crashed? Can you get the stopped reason from k8s?

yangyang919 commented 2 years ago

@PettitWesley The reason is OOMKilled...

Last State:     Terminated
  Reason:       OOMKilled
  Exit Code:    137
  Started:      Tue, 14 Dec 2021 14:23:08 +0800
  Finished:     Tue, 14 Dec 2021 14:39:00 +0800

The logs before it gets restarted seem OK:

[2021/12/14 06:38:32] [debug] [input:tail:tail.0] inode=519581 events: IN_MODIFY
[2021/12/14 06:38:32] [debug] [input:tail:tail.0] inode=519581 events: IN_MODIFY
[2021/12/14 06:38:32] [debug] [input:tail:tail.0] inode=519581 events: IN_MODIFY
[2021/12/14 06:38:32] [debug] [input:tail:tail.0] inode=519581 events: IN_MODIFY
[2021/12/14 06:38:32] [debug] [input:tail:tail.0] inode=519581 events: IN_MODIFY
[2021/12/14 06:38:32] [debug] [input:tail:tail.0] inode=519581 events: IN_MODIFY
[2021/12/14 06:38:32] [debug] [input:tail:tail.0] inode=519581 events: IN_MODIFY
[2021/12/14 06:38:32] [debug] [input:tail:tail.0] inode=519581 events: IN_MODIFY
[2021/12/14 06:38:32] [debug] [input:tail:tail.0] inode=519581 events: IN_MODIFY
[2021/12/14 06:38:32] [debug] [input:tail:tail.0] inode=519581 events: IN_MODIFY
[2021/12/14 06:38:32] [debug] [input:tail:tail.0] inode=519581 events: IN_MODIFY
[2021/12/14 06:38:32] [debug] [input:tail:tail.0] inode=519581 events: IN_MODIFY
[2021/12/14 06:38:32] [debug] [input:tail:tail.0] inode=519581 events: IN_MODIFY
[2021/12/14 06:38:32] [debug] [input:tail:tail.0] inode=519581 events: IN_MODIFY
[2021/12/14 06:38:32] [ warn] [input] tail.0 paused (mem buf overlimit)
[2021/12/14 06:38:32] [debug] [task] created task=0x7fa2f4e3aa20 id=1 OK
[2021/12/14 06:38:32] [debug] [task] created task=0x7fa2f4e3b060 id=2 OK
[2021/12/14 06:38:32] [debug] [task] created task=0x7fa2f4e3b1a0 id=3 OK
[2021/12/14 06:38:32] [debug] [task] created task=0x7fa2f4e3b2e0 id=4 OK
[2021/12/14 06:38:32] [debug] [task] created task=0x7fa2f4e3b4c0 id=5 OK
[2021/12/14 06:38:32] [debug] [task] created task=0x7fa2f4e3b600 id=6 OK
[2021/12/14 06:38:32] [debug] [upstream] KA connection #93 to fluentd-sf-c2510cb1-149f-411f-a25f-266f8863fe65.cls-01.cloud.logs.services.us10.hana.ondemand.com:443 has been assigned (recycled)
[2021/12/14 06:38:32] [debug] [http_client] not using http_proxy for header
[2021/12/14 06:38:32] [debug] [output:cloudwatch_logs:cloudwatch_logs.2] Sending 3 events
[2021/12/14 06:38:32] [debug] [output:cloudwatch_logs:cloudwatch_logs.2] Sending log events to log stream from-fluent-bit-kube.var.log.containers.shipping-service-67cc57c794-vgmnd_upscale_istio-proxy-ea0865f2af6aaacf16da4f2e79ce8b3a040293d0c57f734c52a1b2ac28e97ab9.log
[2021/12/14 06:38:32] [debug] [upstream] KA connection #36 to logs.us-east-1.amazonaws.com:443 has been assigned (recycled)
[2021/12/14 06:38:32] [debug] [http_client] not using http_proxy for header
[2021/12/14 06:38:32] [debug] [aws_credentials] Requesting credentials from the env provider..

PettitWesley commented 2 years ago

@yangyang919 If K8s says it was OOMKilled, I would assume that is true... so somehow the memory usage spiked, I guess. The graph does show that, right? It just isn't as high as you expect.

But I found this article: https://blog.freshtracks.io/a-deep-dive-into-kubernetes-metrics-part-3-container-resource-metrics-361c5ee46e66

And it says:

You might think that memory utilization is easily tracked with container_memory_usage_bytes, however, this metric also includes cached (think filesystem cache) items that can be evicted under memory pressure. The better metric is container_memory_working_set_bytes as this is what the OOM killer is watching for.

yangyang919 commented 2 years ago

@PettitWesley Really weird... The pod below has restarted once in the last 60 minutes due to OOMKilled, but container_memory_working_set_bytes still looks normal...

State:          Running
  Started:      Tue, 14 Dec 2021 15:02:15 +0800
Last State:     Terminated
  Reason:       OOMKilled
  Exit Code:    137
  Started:      Tue, 14 Dec 2021 14:22:05 +0800
  Finished:     Tue, 14 Dec 2021 15:02:13 +0800
Ready:          True
Screen Shot 2021-12-14 at 3 23 35 PM
PettitWesley commented 2 years ago

@yangyang919

[2021/12/14 06:38:32] [ warn] [input] tail.0 paused (mem buf overlimit)

You do have a mem buf overlimit warning. Also, I wonder whether the memory spikes very suddenly in your case and the graph never catches up to the real memory usage. Each crash has a sudden increase in memory right before it. This is my guess.

As to what is causing that... I am not sure... no retries or other errors?

yangyang919 commented 2 years ago

@PettitWesley Yes, retries are set to unlimited. From the metrics API, we can see there are quite a few retries, which means the output destination is somewhat stuck and requests are not flowing smoothly. This is something I'm trying to check with the Fluentd side: why do they keep rejecting our requests?

But I still suspect there is a memory leak somewhere in our Fluent Bit; I'm just not sure which plugin is leaking. The memory grows very fast, basically vertically, and then the pod gets OOMKilled. If I enable only the Fluentd outputs or only the CloudWatch outputs, everything is fine. If both are enabled, OOMKills happen. I attached the config below; this is currently the most stable version. With it there are still restarts due to OOMKilled, but not as frequent. Some points worth mentioning:

  1. If Flush is changed from 5 to 10 with the rest of the config unchanged, OOMKills become very frequent, like 3 times in 10 minutes.
  2. The inputs have a memory buffer limit and the storage layer has a memory limit. I tried disabling the filter plugins one by one, but still found no strong proof that any of them leaks memory.

    service: |
        [SERVICE]
            Daemon Off
            Flush 5
            Log_Level {{ .Values.logLevel }}
            Parsers_File parsers.conf
            Parsers_File custom_parsers.conf
            HTTP_Server On
            HTTP_Listen 0.0.0.0
            HTTP_Port {{ .Values.service.port }}
            Health_Check On
            storage.path /var/log/fluentbit
            storage.sync normal
            storage.checksum Off
            storage.backlog.mem_limit 10M
            storage.metrics On

    inputs: |
        [INPUT]
            Name tail
            Path /var/log/containers/*.log
            Parser custom_cri
            Tag kube.*
            Mem_Buf_Limit 10MB
            Skip_Long_Lines On

        [INPUT]
            Name systemd
            Tag host.*
            Systemd_Filter _SYSTEMD_UNIT=kubelet.service
            Read_From_Tail On
            Mem_Buf_Limit 5MB

    filters: |
        [FILTER]
            Name record_modifier
            Match *
            Record cluster_identifier {{ .Values.cluster }}

        [FILTER]
            Name kubernetes
            Match kube.*
            Merge_Log On
            Keep_Log Off
            K8S-Logging.Parser On
            K8S-Logging.Exclude On
            Buffer_Size 0

        [FILTER]
            Name lua
            Match kube.*
            script kubernetes_map_keys.lua
            call kubernetes_map_keys

        [FILTER]
            Name lua
            Match *
            script kubernetes_map_keys.lua
            call add_source_type

        [FILTER]
            Name lua
            Match *
            script kubernetes_map_keys.lua
            call add_aligned_sent_received

        [FILTER]
            Name rewrite_tag
            Match kube.*
            Rule $source_type ^(istio-envoy-proxy-access-log)$ istio-envoy false

    outputs: |
        [OUTPUT]
            Name http
            Match kube.*
            Retry_Limit False
            Host <path:kv-v2/#Host>
            port 443
            uri /
            http_User <path:kv-v2/#http_User>
            http_Passwd <path:kv-v2/#http_Passwd>
            format json
            tls on
            tls.debug 1

        [OUTPUT]
            Name http
            Match istio-envoy
            Retry_Limit False
            Host <path:kv-v2/data/fluent-bit/{{ .Values.cluster }}#Host>
            port 443
            uri /customindex/istio-envoy
            http_User <path:kv-v2/#http_User>
            http_Passwd <path:kv-v2/#http_Passwd>
            format json
            tls on
            tls.debug 1

        [OUTPUT]
            Name cloudwatch_logs
            Match kube.*
            region {{ .Values.region }}
            log_group_name fluent-bit-cloudwatch/{{ .Values.clusterName }}
            log_stream_prefix from-fluent-bit-
            auto_create_group true

        [OUTPUT]
            Name cloudwatch_logs
            Match istio-envoy
            region {{ .Values.region }}
            log_group_name fluent-bit-cloudwatch/{{ .Values.clusterName }}
            log_stream_prefix from-fluent-bit-
            auto_create_group true

    customParsers: |
        [PARSER]
            Name docker_no_time
            Format json
            Time_Keep Off
            Time_Key time
            Time_Format %Y-%m-%dT%H:%M:%S.%L

        [PARSER]
            Name custom_cri
            Format regex
            Regex ^(?

    luaScripts:
      kubernetes_map_keys.lua: |
        function kubernetes_map_keys(tag, timestamp, record)
          if record.kubernetes == nil then
            return 0
          end
          map_keys(record.kubernetes.annotations)
          map_keys(record.kubernetes.labels)
          return 1, timestamp, record
        end

        function map_keys(table)
          if table == nil then
            return
          end
          local new_table = {}
          local changed_keys = {}
          for key, val in pairs(table) do
            local mapped_key = string.gsub(key, "[%/%.]", "")
            if mapped_key ~= key then
              new_table[mapped_key] = val
              changed_keys[key] = true
            end
          end
          for key in pairs(changed_keys) do
            table[key] = nil
          end
          for key, val in pairs(new_table) do
            table[key] = val
          end
        end

        function add_source_type(tag, timestamp, record)
          if record.kubernetes ~= nil then
            local namespace = record.kubernetes.namespace_name
            if record.kubernetes.container_name == "istio-proxy" and record.protocol ~= nil then
              record.source_type = "istio-envoy-proxy-access-log"
            elseif string.find(namespace, "compass%-system")
                or string.find(namespace, "kube%-node%-lease")
                or string.find(namespace, "kube%-public")
                or string.find(namespace, "kube%-system")
                or string.find(namespace, "istio%-system") then
              record.source_type = "kube-system"
            else
              record.source_type = "application"
            end
          end
          return 1, timestamp, record
        end

        function add_aligned_sent_received(tag, timestamp, record)
          if record.source_type ~= "istio-envoy-proxy-access-log" then
            return 0
          end
          if record.upstream_cluster:find("outbound|", 1) == 1 then
            record.bytes_sent_aligned = record.bytes_received
            record.bytes_received_aligned = record.bytes_sent
          else
            record.bytes_sent_aligned = record.bytes_sent
            record.bytes_received_aligned = record.bytes_received
          end
          return 1, timestamp, record
        end

PettitWesley commented 2 years ago

@yangyang919 Sorry I'm not really sure what's causing this for you.

For another issue, I recently did some testing/benchmarking and memory leak checking, and I found that our latest stable version is very stable in memory usage and does not have any leaks: https://github.com/aws/aws-for-fluent-bit/blob/mainline/AWS_FOR_FLUENT_BIT_STABLE_VERSION

Please try that version if you haven't already.

vkadi commented 1 year ago

@PettitWesley I can confirm that the stable version 2.28.5 is much better with the cloudwatch_logs plugin than the latest 2.31.x. We are also seeing these issues with the latest version, and the recommendation here (https://github.com/aws/aws-for-fluent-bit/issues/340) is to switch to the older cloudwatch plugin, but that one is less performant. @yangyang919 did you find any solution to stabilize these errors/restarts?

PettitWesley commented 1 year ago

@vkadi please check this issue: https://github.com/aws/aws-for-fluent-bit/issues/542

vkadi commented 1 year ago

@PettitWesley is https://github.com/aws-samples/amazon-ecs-firelens-examples/tree/mainline/examples/fluent-bit/oomkill-prevention applicable to EKS as well?

PettitWesley commented 1 year ago

@vkadi the recommendations and fluent bit config options can be used in EKS as well. Fluent Bit config language is the same no matter where you deploy it.