fluent / fluent-bit

Fast and Lightweight Logs and Metrics processor for Linux, BSD, OSX and Windows
https://fluentbit.io
Apache License 2.0

FluentBit restarts sporadically with "[engine] caught signal (SIGSEGV)" on ARM64 Nodes in Kubernetes Cluster (AWS EKS) #8630

Open cloud-thought opened 5 months ago

cloud-thought commented 5 months ago

Bug Report

Describe the bug FluentBit restarts sporadically with "[engine] caught signal (SIGSEGV)" on ARM64 Nodes in Kubernetes Cluster (AWS EKS)

To Reproduce

[2024/03/23 07:12:29] [engine] caught signal (SIGSEGV)
#0  0xaaaadc6d4804      in  flb_upstream_conn_timeouts() at src/flb_upstream.c:896
#1  0xaaaadc6b151f      in  cb_engine_sched_timer() at src/flb_engine.c:175
#2  0xaaaadc6c0f13      in  flb_sched_event_handler() at src/flb_scheduler.c:428
#3  0xaaaadc6b2e57      in  flb_engine_start() at src/flb_engine.c:1025
#4  0xaaaadc655c53      in  flb_lib_worker() at src/flb_lib.c:638
#5  0xffff88a9b647      in  ???() at ???:0
#6  0xffff883fffdb      in  ???() at ???:0
#7  0xffffffffffffffff  in  ???() at ???:0

Not sure how to reproduce the issue; it only happens on ARM64 nodes. The Fluent Bit pods have enough memory assigned to them and the nodes are not under any heavy load. Our Fluent Bit pods on the AMD64 nodes work fine; restarts occur only rarely and only on a small subset of our ARM64 nodes.

Expected behavior No restarts happening.

Screenshots None.

Your Environment

Tail generic container logs

[INPUT]
    Name              tail
    Path              /var/log/containers/*.log
    Exclude_Path      /var/log/containers/*daprd*.log,/var/log/containers/*fluent-bit*.log,/var/log/containers/*fluentd*.log
    DB                /var/log/flb_containers_log.db
    multiline.parser  cri
    Read_From_Head    On
    Tag               kube.generic.*
    Refresh_Interval  10
    storage.type      filesystem

Tail daprd container logs

[INPUT]
    Name              tail
    Path              /var/log/containers/*daprd*.log
    Exclude_Path      /var/log/containers/*fluent-bit*.log,/var/log/containers/*fluentd*.log
    DB                /var/log/flb_daprd_log.db
    multiline.parser  cri
    Read_From_Head    On
    Tag               kube.daprd.*
    Refresh_Interval  10
    storage.type      filesystem

Tail system logs

[INPUT]
    Name               systemd
    Tag                host.systemd.*
    DB                 /var/log/flb_systemd_log.db
    Read_From_Tail     On
    Systemd_Filter     _SYSTEMD_UNIT=kubelet.service
    Systemd_Filter     _SYSTEMD_UNIT=kubeproxy.service
    Systemd_Filter     _SYSTEMD_UNIT=containerd.service
    Strip_Underscores  On

Collect Kubernetes events

[INPUT]
    name  kubernetes_events
    tag   host.events.*

Gather kubernetes metadata by using the Kubelet API for generic containers

[FILTER]
    Name                 kubernetes
    Match                kube.generic.*
    Merge_Log            On
    Keep_Log             Off
    K8S-Logging.Parser   On
    K8S-Logging.Exclude  On
    Kube_Tag_Prefix      kube.generic.var.log.containers.
    Use_Kubelet          On
    Buffer_Size          0

Gather kubernetes metadata by using the Kubelet API for daprd containers

[FILTER]
    Name                 kubernetes
    Match                kube.daprd.*
    Merge_Log            On
    Keep_Log             Off
    K8S-Logging.Parser   On
    K8S-Logging.Exclude  On
    Kube_Tag_Prefix      kube.daprd.var.log.containers.
    Use_Kubelet          On
    Buffer_Size          0

Nest metadata fields

[FILTER]
    Name        nest
    Match       kube.daprd.*
    Operation   nest
    Wildcard    app_id
    Wildcard    scope
    Wildcard    type
    Wildcard    ver
    Nest_under  dapr

Record normalization for further processing

[FILTER]
    Name    modify
    Match   kube.daprd.*
    Rename  time timestamp
    Rename  msg message

Only output logs for containers with set fluentd.active=true or fluentbit.io/active=true annotations

[FILTER]
    Name        grep
    Match       kube.*
    Logical_Op  or
    Regex       $kubernetes['annotations']['fluentd.active'] true
    Regex       $kubernetes['annotations']['fluentbit.io/active'] true
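For reference, the annotations this grep filter keys on live on the pod template of the workloads whose logs should be shipped. A minimal, hypothetical pod metadata snippet (names taken from the filter above, everything else illustrative):

```yaml
# Hypothetical pod template metadata; only pods carrying one of these
# annotations pass the grep filter and get their logs forwarded.
metadata:
  annotations:
    fluentbit.io/active: "true"
```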

Rewrite tag by provided custom tag for Kubernetes scaffold logs

[FILTER]
    Name          rewrite_tag
    Match         kube.*
    Rule          $kubernetes['annotations']['fluentbit.io/tag'] ^scaffold$ custom.scaffold.$TAG false
    Emitter_Name  kubernetes_scaffold_emitter
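The rewrite rule only fires for pods that carry the custom tag annotation. A hypothetical pod metadata snippet that would match the `^scaffold$` rule above:

```yaml
# Hypothetical annotation; records from this pod get re-tagged
# custom.scaffold.<original tag> by the rewrite_tag filter.
metadata:
  annotations:
    fluentbit.io/tag: "scaffold"
```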

Add AWS metadata to container logs

[FILTER]
    Name          aws
    Match         kube.*
    imds_version  v1
    account_id    true

Add AWS metadata to custom emitter logs

[FILTER]
    Name          aws
    Match         custom.*
    imds_version  v1
    account_id    true

Remove unnecessary keys from container logs

[FILTER]
    Name    lua
    Match   kube.*
    script  /fluent-bit/scripts/remove_keys.lua
    call    remove_keys
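The `remove_keys.lua` script itself is not shown in this thread. A hypothetical sketch of what such a Fluent Bit Lua callback might look like, with the function name matching the `call remove_keys` setting above (the key names dropped here are purely illustrative):

```lua
-- Hypothetical sketch: drop a few noisy keys from each record.
-- Fluent Bit Lua filter callbacks receive (tag, timestamp, record)
-- and return (code, timestamp, record).
function remove_keys(tag, timestamp, record)
    record["stream"] = nil
    record["logtag"] = nil
    -- return code 2: record was modified, keep the original timestamp
    return 2, timestamp, record
end
```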

Final record mapping for container logs

[FILTER]
    Name    modify
    Match   kube.*
    Add     @log_group /eks/${CLUSTER_NAME}/containers
    Rename  x_request_id x-request-id

Discard Kubernetes events of type normal

[FILTER]
    Name     grep
    Match    host.events.*
    Exclude  type Normal

Kubernetes events final mapping

[FILTER]
    Name    record_modifier
    Match   host.events.*
    Record  inputSource kubernetes_events

Keep only whitelisted keys from systemd logs

[FILTER]
    Name           record_modifier
    Match          host.systemd.*
    Whitelist_key  MESSAGE
    Whitelist_key  HOSTNAME
    Whitelist_key  SYSTEMD_UNIT
    Record         inputSource systemd

Output container logs to CloudWatch grouped by EKS cluster name

[OUTPUT]
    Name                        cloudwatch_logs
    Match                       kube.*
    region                      ${REGION}
    log_group_name              /eks/fluentbit/${CLUSTER_NAME}/containers
    log_stream_template         $kubernetes['pod_name'].$kubernetes['namespace_name'].$kubernetes['container_name'].$kubernetes['pod_id']
    log_stream_prefix           fallback-stream
    auto_create_group           On
    retry_limit                 5
    net.keepalive               On
    net.keepalive_idle_timeout  4s
    log_retention_days          5

Output container logs to Elastic grouped by EKS cluster name

[OUTPUT]
    Name                es
    Match               kube.*
    Index               eks-${CLUSTER_NAME}-containers
    cloud_id            ${ELASTICSEARCH_CLOUD_ID}
    cloud_auth          ${ELASTICSEARCH_USER}:${ELASTICSEARCH_PASSWORD}
    Trace_Error         On
    tls                 On
    tls.verify          Off
    Retry_Limit         5
    write_operation     upsert
    Replace_Dots        On
    Generate_ID         On
    Suppress_Type_Name  On

Output Kubernetes events to CloudWatch grouped by EKS cluster name

[OUTPUT]
    Name                        cloudwatch_logs
    Match                       host.events.*
    region                      ${REGION}
    log_group_name              /eks/fluentbit/${CLUSTER_NAME}/host
    log_stream_name             events
    auto_create_group           On
    retry_limit                 5
    net.keepalive               On
    net.keepalive_idle_timeout  4s
    log_retention_days          5

Output host logs to CloudWatch grouped by EKS cluster name

[OUTPUT]
    Name                        cloudwatch_logs
    Match                       host.systemd.*
    region                      ${REGION}
    log_group_name              /eks/fluentbit/${CLUSTER_NAME}/host
    log_stream_prefix           systemd
    auto_create_group           On
    retry_limit                 5
    net.keepalive               On
    net.keepalive_idle_timeout  4s
    log_retention_days          5

Output systemd logs to Elastic grouped by EKS cluster name

[OUTPUT]
    Name                es
    Match               host.*
    Index               kubernetes-host-eks-${CLUSTER_NAME}
    cloud_id            ${ELASTICSEARCH_CLOUD_ID}
    cloud_auth          ${ELASTICSEARCH_USER}:${ELASTICSEARCH_PASSWORD}
    Trace_Error         On
    tls                 On
    tls.verify          Off
    Retry_Limit         5
    write_operation     upsert
    Replace_Dots        On
    Generate_ID         On
    Suppress_Type_Name  On

Output DevOps container logs to Elastic grouped by EKS cluster name

[OUTPUT]
    Name                es
    Match               custom.scaffold.*
    Index               kubernetes-scaffold-eks-${CLUSTER_NAME}
    cloud_id            ${ELASTICSEARCH_CLOUD_ID}
    cloud_auth          ${ELASTICSEARCH_USER}:${ELASTICSEARCH_PASSWORD}
    Trace_Error         On
    tls                 On
    tls.verify          Off
    Retry_Limit         5
    write_operation     upsert
    Replace_Dots        On
    Generate_ID         On
    Suppress_Type_Name  On


* Environment name and version (e.g. Kubernetes? What version?): AWS EKS 1.24
* Server type and version: Spot Instances, ARM64 
* Operating System and version: -
* Filters and plugins: see above.

**Additional context**
None.
patrick-stephens commented 5 months ago

Does it happen with 3.0?

cloud-thought commented 5 months ago

Hey @patrick-stephens, have not tried version 3.0. We are using this Helm Chart https://artifacthub.io/packages/helm/fluent/fluent-bit

patrick-stephens commented 5 months ago

I would try it in case it's already resolved; if so, it's a matter of figuring out what to backport, and if not, we know it still affects 3.0. I can't easily test ARM clusters, I'm afraid, so this relies on user info.

cloud-thought commented 5 months ago

Alright, I'll give it a try. Nonetheless, maybe other users have similar issues and would like to report them. I will post an update as soon as I have enough data.

cloud-thought commented 5 months ago

Ok, didn't have to wait for that long to encounter the same issue:

Defaulted container "fluent-bit" out of: fluent-bit, reloader
Fluent Bit v3.0.0
* Copyright (C) 2015-2024 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

___________.__                        __    __________.__  __          ________  
\_   _____/|  |  __ __   ____   _____/  |_  \______   \__|/  |_  ___  _\_____  \ 
 |    __)  |  | |  |  \_/ __ \ /    \   __\  |    |  _/  \   __\ \  \/ / _(__  < 
 |     \   |  |_|  |  /\  ___/|   |  \  |    |    |   \  ||  |    \   / /       \
 \___  /   |____/____/  \___  >___|  /__|    |______  /__||__|     \_/ /______  /
     \/                     \/     \/               \/                        \/ 

[2024/03/25 15:06:58] [engine] caught signal (SIGSEGV)
#0  0xaaaae6aec54b      in  cb_engine_sched_timer() at src/flb_engine.c:175
#1  0xaaaae6af09a7      in  flb_sched_event_handler() at src/flb_scheduler.c:428
#2  0xaaaae6aed42f      in  flb_engine_start() at src/flb_engine.c:1025
#3  0xaaaae6acd9ff      in  flb_lib_worker() at src/flb_lib.c:674
#4  0xffffaf669647      in  ???() at ???:0
#5  0xffffaefce01b      in  ???() at ???:0
#6  0xffffffffffffffff  in  ???() at ???:0

This time it happened on nearly all ARM64 nodes we have running in our cluster.

jtackaberry commented 2 months ago

We are hitting this too, and it still occurs with 3.0.7. Multiple clusters on EKS (v1.28.9) on arm64.

The problem seems related to the kubernetes_events input, because our clusters have two separate deployments of Fluent Bit: one deployed as a DaemonSet to collect pod logs, and the other as a Deployment to collect K8s events. The pod-log instances are fine; only the events instances are crashing. The key difference is that the events deployment uses the kubernetes_events input while the pod-log deployment doesn't.

Here's the config we are using (except obviously the endpoint FQDN has been redacted):

service:
  daemon: false
  health_check: true
  hot_reload: true
  http_listen: 0.0.0.0
  http_port: 2020
  http_server: true
  parsers_file: /fluent-bit/etc/conf/parsers.conf
  scheduler.cap: 300

pipeline:
  inputs:
    - alias: kubernetes_events
      db: /data/events.db
      name: kubernetes_events
      tag: kube.*

  outputs:
    - alias: loki
      compress: gzip
      drop_single_key: true
      host: loki.example.com
      labels: component=kubernetes, environment=lab, service=somesvc, cluster=some-eks-cluster, namespace=$metadata['namespace'], region=us-east-1, source=events
      match: '*'
      name: loki
      port: 443
      retry_limit: 500
      tls: true
      tls.crt_file: /secrets/cert/tls.crt
      tls.key_file: /secrets/cert/tls.key
      tls.verify: true

Log output:

Fluent Bit v3.0.7
* Copyright (C) 2015-2024 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

___________.__                        __    __________.__  __          ________  
\_   _____/|  |  __ __   ____   _____/  |_  \______   \__|/  |_  ___  _\_____  \ 
 |    __)  |  | |  |  \_/ __ \ /    \   __\  |    |  _/  \   __\ \  \/ / _(__  < 
 |     \   |  |_|  |  /\  ___/|   |  \  |    |    |   \  ||  |    \   / /       \
 \___  /   |____/____/  \___  >___|  /__|    |______  /__||__|     \_/ /______  /
     \/                     \/     \/               \/                        \/ 

[2024/06/12 17:24:26] [ info] [fluent bit] version=3.0.7, commit=73e72bdf9a, pid=1
[2024/06/12 17:24:26] [ info] [storage] ver=1.5.2, type=memory, sync=normal, checksum=off, max_chunks_up=128
[2024/06/12 17:24:26] [ info] [cmetrics] version=0.9.1
[2024/06/12 17:24:26] [ info] [ctraces ] version=0.5.1
[2024/06/12 17:24:26] [ info] [input:kubernetes_events:kubernetes_events] initializing
[2024/06/12 17:24:26] [ info] [input:kubernetes_events:kubernetes_events] storage_strategy='memory' (memory only)
[2024/06/12 17:24:26] [ info] [input:kubernetes_events:kubernetes_events] API server: https://kubernetes.default.svc:443
[2024/06/12 17:24:26] [ info] [input:kubernetes_events:kubernetes_events] thread instance initialized
[2024/06/12 17:24:26] [ info] [output:loki:loki] configured, hostname=loki.example.com:443
[2024/06/12 17:24:26] [ info] [http_server] listen iface=0.0.0.0 tcp_port=2020
[2024/06/12 17:24:26] [ info] [sp] stream processor started
[2024/06/12 17:33:24] [engine] caught signal (SIGSEGV)
#0  0xaaaac0c7d65b      in  cb_engine_sched_timer() at src/flb_engine.c:175
#1  0xaaaac0c81adb      in  flb_sched_event_handler() at src/flb_scheduler.c:428
#2  0xaaaac0c7e58b      in  flb_engine_start() at src/flb_engine.c:1025
#3  0xaaaac0c5deb3      in  flb_lib_worker() at src/flb_lib.c:674
#4  0xffff8d89ee57      in  ???() at ???:0
#5  0xffff8d907f9b      in  ???() at ???:0
#6  0xffffffffffffffff  in  ???() at ???:0
patrick-stephens commented 2 months ago

If you remove the events input does it resolve?

jtackaberry commented 2 months ago

I guess that question is for you @cloud-thought since in my case if I remove the events input I've got no inputs :)

cloud-thought commented 2 months ago

Great that this is getting more attention! We are not able to test it at this point in time, but as soon as we get the chance, I will come back with more insights on the results of deactivating the K8s events collector.

cloud-thought commented 2 months ago

@patrick-stephens So, back with some test results: we deactivated the K8s events input and indeed, there have been no more restarts for a week now. Any idea what actually correlates with this behaviour?

patrick-stephens commented 2 months ago

> @patrick-stephens So, back with some test results: we deactivated the K8s events input and indeed, there have been no more restarts for a week now. Any idea what actually correlates with this behaviour?

I don't have any knowledge of the code, I'm just good at breaking things (!) so trying to find the minimal breakage required helps narrow it down. It seems like there is something related to the events plugin then (or one of the others alongside it) so may tag in @pwhelan to see if he has some ideas.

Can you clarify the exact minimal set of plugins that both fail and do not fail for you?

cloud-thought commented 2 months ago

Relative to the configuration in my first post, the only thing I changed is the K8s events input.