fluent / fluent-bit

Fast and Lightweight Logs and Metrics processor for Linux, BSD, OSX and Windows
https://fluentbit.io
Apache License 2.0

Multiple issues with Fluent Bit - SIGSEGV, error deleting file at close, fail to drop enough chunks in order to place new data #8823

Open · mirko-lazarevic opened this issue 2 months ago

mirko-lazarevic commented 2 months ago

Bug Report

Describe the bug

We are experiencing issues with our Fluent Bit deployments in a Kubernetes cluster: on clusters of 6 to 8 nodes, 1 or 2 pods enter a CrashLoopBackOff state.

To Reproduce

Case 1: SIGSEGV

Fluent Bit v3.0.2
* Copyright (C) 2015-2024 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

___________.__                        __    __________.__  __          ________  
\_   _____/|  |  __ __   ____   _____/  |_  \______   \__|/  |_  ___  _\_____  \ 
 |    __)  |  | |  |  \_/ __ \ /    \   __\  |    |  _/  \   __\ \  \/ / _(__  < 
 |     \   |  |_|  |  /\  ___/|   |  \  |    |    |   \  ||  |    \   / /       \
 \___  /   |____/____/  \___  >___|  /__|    |______  /__||__|     \_/ /______  /
     \/                     \/     \/               \/                        \/ 

[2024/05/02 08:53:20] [ info] [fluent bit] version=3.0.2, commit=, pid=1
[2024/05/02 08:53:20] [ info] [storage] ver=1.5.2, type=memory+filesystem, sync=normal, checksum=off, max_chunks_up=192
[2024/05/02 08:53:20] [ info] [storage] backlog input plugin: storage_backlog.1
[2024/05/02 08:53:20] [ info] [cmetrics] version=0.7.3
[2024/05/02 08:53:20] [ info] [ctraces ] version=0.4.0
[2024/05/02 08:53:20] [ info] [input:tail:tail.0] initializing
[2024/05/02 08:53:20] [ info] [input:tail:tail.0] storage_strategy='filesystem' (memory + filesystem)
[2024/05/02 08:53:20] [ info] [input:tail:tail.0] db: delete unmonitored stale inodes from the database: count=0
[2024/05/02 08:53:20] [ info] [input:storage_backlog:storage_backlog.1] initializing
[2024/05/02 08:53:20] [ info] [input:storage_backlog:storage_backlog.1] storage_strategy='memory' (memory only)
[2024/05/02 08:53:20] [ info] [input:storage_backlog:storage_backlog.1] queue memory limit: 95.4M
[2024/05/02 08:53:20] [ info] [input:emitter:emitter_for_rewrite_tag.0] initializing
[2024/05/02 08:53:20] [ info] [input:emitter:emitter_for_rewrite_tag.0] storage_strategy='memory' (memory only)
[2024/05/02 08:53:20] [ info] [filter:kubernetes:kubernetes.2] https=1 host=kubernetes.default.svc port=443
[2024/05/02 08:53:20] [ info] [filter:kubernetes:kubernetes.2]  token updated
[2024/05/02 08:53:20] [ info] [filter:kubernetes:kubernetes.2] local POD info OK
[2024/05/02 08:53:20] [ info] [filter:kubernetes:kubernetes.2] testing connectivity with Kubelet...
[2024/05/02 08:53:21] [ info] [filter:kubernetes:kubernetes.2] connectivity OK
[2024/05/02 08:53:22] [ info] [http_server] listen iface=0.0.0.0 tcp_port=8081
[2024/05/02 08:53:22] [ info] [sp] stream processor started
[2024/05/02 08:53:22] [ info] [input:storage_backlog:storage_backlog.1] register tail.0/1-1714639982.798393729.flb
...
[2024/05/02 08:56:18] [ info] [input] tail.0 resume (storage buf overlimit 191/192)
[2024/05/02 08:56:18] [ warn] [input] tail.0 paused (storage buf overlimit 192/192)
[2024/05/02 08:56:18] [ info] [input] pausing tail.0
[2024/05/02 08:56:18] [ info] [input] resume tail.0
[2024/05/02 08:56:18] [ info] [input] tail.0 resume (storage buf overlimit 191/192)
[2024/05/02 08:56:18] [ warn] [input] tail.0 paused (storage buf overlimit 192/192)
[2024/05/02 08:56:18] [ info] [input] pausing tail.0
[2024/05/02 08:56:18] [ info] [input] resume tail.0
[2024/05/02 08:56:18] [ info] [input] tail.0 resume (storage buf overlimit 191/192)
[2024/05/02 08:56:36] [ info] [input:tail:tail.0] inode=262825 link_rotated: /var/log/containers/xxxxxx-9a40478e6859b88a9ba7db65755da522de6a03465eab2f88e72ca0ec1456d930.log
[2024/05/02 08:56:36] [ info] [input:tail:tail.0] inode=1972490 handle rotation(): /var/log/containers/xxxxxxxxx-9a40478e6859b88a9ba7db65755da522de6a03465eab2f88e72ca0ec1456d930.log => /var/data/kubeletlogs/5018a8564e8120570150b0764d8d194d_c-58a88ecb-c183-4d86-a4d3-e760f2377aa6-cronjob-28577336-r84gx_359216b1-1c12-4b4d-9bca-7a21df8f70af/c-58a88ecb-c183-4d86-a4d3-e760f2377aa6-cronjob/0.log (deleted)
...
[2024/05/02 08:56:36] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=1972499 watch_fd=731
[2024/05/02 08:56:36] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=1972490 watch_fd=728
[2024/05/02 08:56:36] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=1972332 watch_fd=732
[2024/05/02 08:56:36] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=1972628 watch_fd=733
[2024/05/02 08:56:36] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=1972392 watch_fd=729
[2024/05/02 08:56:36] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=1972613 watch_fd=730
[/usr/local/src/fluent-bit/lib/chunkio/src/cio_file_unix.c:468 errno=2] No such file or directory
[2024/05/02 08:56:48] [error] [storage] [cio file] error deleting file at close tail.0:1-1714640207.838891586.flb
[2024/05/02 08:56:54] [engine] caught signal (SIGSEGV)

Case 2: fail to drop enough chunks in order to place new data

Fluent Bit v3.0.2
* Copyright (C) 2015-2024 The Fluent Bit Authors
* Fluent Bit is a CNCF sub-project under the umbrella of Fluentd
* https://fluentbit.io

___________.__                        __    __________.__  __          ________  
\_   _____/|  |  __ __   ____   _____/  |_  \______   \__|/  |_  ___  _\_____  \ 
 |    __)  |  | |  |  \_/ __ \ /    \   __\  |    |  _/  \   __\ \  \/ / _(__  < 
 |     \   |  |_|  |  /\  ___/|   |  \  |    |    |   \  ||  |    \   / /       \
 \___  /   |____/____/  \___  >___|  /__|    |______  /__||__|     \_/ /______  /
     \/                     \/     \/               \/                        \/ 

[2024/05/02 08:50:43] [ info] [fluent bit] version=3.0.2, commit=, pid=1
[2024/05/02 08:50:43] [ info] [storage] ver=1.5.2, type=memory+filesystem, sync=normal, checksum=off, max_chunks_up=192
[2024/05/02 08:50:43] [ info] [storage] backlog input plugin: storage_backlog.1
[2024/05/02 08:50:43] [ info] [cmetrics] version=0.7.3
[2024/05/02 08:50:43] [ info] [ctraces ] version=0.4.0
[2024/05/02 08:50:43] [ info] [input:tail:tail.0] initializing
[2024/05/02 08:50:43] [ info] [input:tail:tail.0] storage_strategy='filesystem' (memory + filesystem)
[2024/05/02 08:50:44] [ info] [input:tail:tail.0] db: delete unmonitored stale inodes from the database: count=0
[2024/05/02 08:50:44] [ info] [input:storage_backlog:storage_backlog.1] initializing
[2024/05/02 08:50:44] [ info] [input:storage_backlog:storage_backlog.1] storage_strategy='memory' (memory only)
[2024/05/02 08:50:44] [ info] [input:storage_backlog:storage_backlog.1] queue memory limit: 95.4M
[2024/05/02 08:50:44] [ info] [input:emitter:emitter_for_rewrite_tag.0] initializing
[2024/05/02 08:50:44] [ info] [input:emitter:emitter_for_rewrite_tag.0] storage_strategy='memory' (memory only)
[2024/05/02 08:50:44] [ info] [filter:kubernetes:kubernetes.2] https=1 host=kubernetes.default.svc port=443
[2024/05/02 08:50:44] [ info] [filter:kubernetes:kubernetes.2]  token updated
[2024/05/02 08:50:44] [ info] [filter:kubernetes:kubernetes.2] local POD info OK
[2024/05/02 08:50:44] [ info] [filter:kubernetes:kubernetes.2] testing connectivity with Kubelet...
[2024/05/02 08:50:44] [ info] [filter:kubernetes:kubernetes.2] connectivity OK
[2024/05/02 08:50:45] [ info] [http_server] listen iface=0.0.0.0 tcp_port=8081
[2024/05/02 08:50:45] [ info] [sp] stream processor started
[2024/05/02 08:50:45] [ info] [input:storage_backlog:storage_backlog.1] register tail.0/1-1714639842.499964038.flb
[2024/05/02 08:50:45] [ info] [input:tail:tail.0] inotify_fs_add(): inode=1972446 watch_fd=1 name=/var/log/containers/xxxxxxxxxx-d9ae54902f2f65f8d214e2d68a5e2b3336062fb30ef411e05da8c6f1969980a2.log
[2024/05/02 08:50:45] [ info] [input:tail:tail.0] inotify_fs_add(): inode=1972463 watch_fd=2 name=/var/log/containers/xxxxxxxxxx-445b3a6bb43b9b3ed3b836977091aaaa425cfc145581d8cb095208e767b9907b.log
[2024/05/02 08:50:45] [ info] [input:tail:tail.0] inotify_fs_add(): inode=1860937 watch_fd=3 name=/var/log/containers/xxxxxxxxxx-b80c8b4cde44b2217a4b2814b3250a7545af53b3e9145fa27ff9e6f8f2a637b3.log
[2024/05/02 08:50:45] [ info] [input:tail:tail.0] inotify_fs_add(): inode=1860893 watch_fd=4 name=/var/log/containers/xxxxxxxxxx-33a9979210e4c5ee454482a74429b50e129fea1872ac013185f233939319902f.log
...
[2024/05/02 08:51:41] [ info] [input:tail:tail.0] inotify_fs_add(): inode=1966324 watch_fd=724 name=/var/log/containers/xxxxxxxx-48c64a8d9e6c1a24ff5abae5616eda485f4b8580073be41a41e6506e06d0a00a.log
[2024/05/02 08:51:42] [ warn] [input] tail.0 paused (storage buf overlimit 192/192)
[2024/05/02 08:51:42] [ info] [input] pausing tail.0
[2024/05/02 08:51:42] [ info] [input] resume tail.0
[2024/05/02 08:51:42] [ info] [input] tail.0 resume (storage buf overlimit 191/192)
[2024/05/02 08:51:42] [ warn] [input] tail.0 paused (storage buf overlimit 192/192)
[2024/05/02 08:51:42] [ info] [input] pausing tail.0
[2024/05/02 08:51:42] [ info] [input] resume tail.0
[2024/05/02 08:51:42] [ info] [input] tail.0 resume (storage buf overlimit 191/192)
[2024/05/02 08:51:43] [ warn] [input] tail.0 paused (storage buf overlimit 192/192)
[2024/05/02 08:51:43] [ info] [input] pausing tail.0
[2024/05/02 08:51:43] [ info] [input] resume tail.0
[2024/05/02 08:51:43] [ info] [input] tail.0 resume (storage buf overlimit 191/192)
[2024/05/02 08:51:43] [ warn] [input] tail.0 paused (storage buf overlimit 192/192)
[2024/05/02 08:51:43] [ info] [input] pausing tail.0
[2024/05/02 08:51:43] [ info] [input] resume tail.0
...
[2024/05/02 08:52:33] [ info] [input:tail:tail.0] inode=262829 link_rotated: /var/log/containers/xxxxxxxxxxxx-79289f1aefefd7d290c024a47554831c0ab01c4975e47f21859ae1f2347efec6.log
[2024/05/02 08:52:33] [ info] [input:tail:tail.0] inode=1972760 handle rotation(): /var/log/containers/xxxxxxxxxxxx-79289f1aefefd7d290c024a47554831c0ab01c4975e47f21859ae1f2347efec6.log => /var/data/kubeletlogs/18d9f87e0d6a4a9d9d9c0966dd7b504d_c-46c15c32-872e-4c5f-b45b-f35e4d4e9c77-cronjob-28577332-624ml_c506bd27-ae63-44f2-8576-c8e1f29c1dc6/c-46c15c32-872e-4c5f-b45b-f35e4d4e9c77-cronjob/0.log (deleted)
...
[2024/05/02 08:52:35] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=1972760 watch_fd=725
[2024/05/02 08:52:35] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=1972736 watch_fd=726
[2024/05/02 08:52:35] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=1972789 watch_fd=727
[2024/05/02 08:52:35] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=1972759 watch_fd=728
[2024/05/02 08:52:35] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=1972678 watch_fd=729
[2024/05/02 08:52:35] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=1972793 watch_fd=730
[/usr/local/src/fluent-bit/lib/chunkio/src/cio_file_unix.c:468 errno=2] No such file or directory
[2024/05/02 08:52:42] [error] [storage] [cio file] error deleting file at close tail.0:1-1714639961.986152069.flb
[/usr/local/src/fluent-bit/lib/chunkio/src/cio_file_unix.c:468 errno=2] No such file or directory
[2024/05/02 08:52:58] [error] [storage] [cio file] error deleting file at close tail.0:1-1714639977.968069594.flb
[2024/05/02 08:53:02] [error] [input chunk] fail to drop enough chunks in order to place new data


Your Environment

[SERVICE]
      Flush                   1
      Log_Level               info
      Daemon                  off
      Parsers_File            parsers.conf
      Plugins_File            plugins.conf
      HTTP_Server             On
      HTTP_Listen             0.0.0.0
      HTTP_Port               8081
      Health_Check            On
      HC_Errors_Count         1
      HC_Retry_Failure_Count  1
      HC_Period               30
      storage.path            /fluent-bit/cache
      storage.max_chunks_up   192
      storage.metrics         On

[INPUT]
      Name                              tail
      Tag                               kube.*
      Path                              /var/log/containers/logger-heartbeat-*.log
      Path_Key                          file
      DB                                /var/log/fluent-bit/fluent-bit.DB
      Parser                            cri
      Skip_Long_Lines                   On
      Refresh_Interval                  10
      storage.type                      filesystem
      storage.pause_on_chunks_overlimit on

[FILTER]
        Name                  rewrite_tag
        Match_Regex           <some-regex>
        Emitter_Mem_Buf_Limit 384MB
        Rule                  $message <some-regex> new.tag true

[FILTER]
        Name          grep
        Match_Regex   <some-regex>
        Exclude       $message <some-regex>

[FILTER]
        Name                kubernetes
        Match               kube.*
        Buffer_Size         10MB
        Kube_URL            https://kubernetes.default.svc:443
        Kube_CA_File        /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
        Kube_Token_File     /var/run/secrets/kubernetes.io/serviceaccount/token
        Kube_Tag_Prefix     kube.var.log.containers.
        Merge_Log           On
        Merge_Log_Key       log_processed
        K8S-Logging.Parser  On
        K8S-Logging.Exclude Off
        Use_Kubelet         On
        Kubelet_Port        10250
        Kubelet_Host        ${HOST_IP}

[FILTER]
        Name record_modifier
        Match *
        Record meta.cluster_name my-cluster
        Record node_name my-node

[FILTER]
        Name nest
        Match *
        Operation nest
        Wildcard meta.*
        Nest_under meta
        Remove_prefix meta.

[FILTER]
        Name          lua
        Match_Regex   <some-regex>
        call          removeSTFields
        script        /fluent-bit/bin/script.lua

[FILTER]
        Name    lua
        Match   new.tag.*
        call    removeUnwantedLogFields
        script  /fluent-bit/bin/script.lua

[OUTPUT]
        Name logger-agent
        Id logger-agent-output-1
        Match kube.*
        Retry_Limit 8

        # Connection
        Target_Host localhost
        Target_Port 8080
        Target_Path /v1/logs/ws

        # Authentication
        Authentication_Mode APIKey

        # Buffer storage
        storage.total_limit_size 5G

[OUTPUT]
        Name              logger-agent
        Id                logger-agent-output-2
        Match             new.tag.*
        Retry_Limit       8

        # Connection
        Target_Host localhost
        Target_Port 8080
        Target_Path /v1/logs/service

        # Authentication
        Authentication_Mode APIKey

        # Buffer storage
        storage.total_limit_size 5G
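
One thing worth noting about this configuration: the rapid pause/resume churn in the logs above follows directly from storage.max_chunks_up 192 combined with storage.pause_on_chunks_overlimit on. tail.0 is paused the moment 192 chunks are up and resumed as soon as one is flushed (191/192), so the input oscillates at the limit while chunks rotate to disk. A hedged sketch of one way to give the buffer more headroom (values are illustrative, not taken from this thread):

[SERVICE]
      # Illustrative values only; tune to the memory available to the pod.
      # A higher limit reduces pause/resume thrash at the boundary.
      storage.max_chunks_up                512
      # Optionally delete chunks that chunkio marks as irrecoverable
      # instead of retrying them indefinitely.
      storage.delete_irrecoverable_chunks  On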


mirko-lazarevic commented 2 months ago

Does anyone have any insights into what might be going wrong here? I've noticed that Fluent Bit suddenly exits when it encounters the error [input chunk] fail to drop enough chunks in order to place new data. Is this behavior intentional? Shouldn't Fluent Bit shut down gracefully and release resources properly?

...
if (count != 0) {
    flb_error("[input chunk] fail to drop enough chunks in order to place new data");
    exit(0);
}
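
For comparison, here is a minimal sketch of what a non-fatal variant of that branch could look like. This is not the actual fluent-bit code: flb_error() is real, but the function name and the caller contract here are assumptions for illustration only.

#include <fluent-bit/flb_log.h>   /* flb_error() */
#include <stddef.h>               /* size_t */

/* Hypothetical non-fatal variant: report the condition and return an
 * error so the caller can pause the input or retry, instead of calling
 * exit() and taking down the whole process. */
static int input_chunk_try_place_new_data(size_t count)
{
    if (count != 0) {
        flb_error("[input chunk] fail to drop enough chunks in order "
                  "to place new data");
        return -1;   /* caller decides: pause the input, retry, or drop */
    }
    return 0;
}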

cc: @agup006, @ScarletTanager

patrick-stephens commented 2 weeks ago

Can you see if it is reproducible in the latest version? There have been some bug fixes that may affect this.

Can you also raise the log_level to debug to provide more information?

There should be a stack trace as well.
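
For reference, that is a one-line change in the [SERVICE] block already posted above:

[SERVICE]
      Log_Level               debug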

patrick-stephens commented 2 weeks ago

Ah, I notice you are using custom Golang plugins; my guess is it might be one of those. Make sure you're using the latest calyptia-plugin module, which fixes a SIGSEGV in hot-reload handling/registration (or ensure you have the correct ABI updates in place).

Also, is it reproducible without the Golang plugins? We can't investigate issues that are caused by custom plugins, so it needs to be reproducible without them.
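
A simple way to test that (a sketch; stdout is a built-in Fluent Bit output, while logger-agent above is the custom Golang plugin) is to replace the custom outputs with stdout and see whether the crashes persist:

[OUTPUT]
        Name    stdout
        Match   *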

mirko-lazarevic commented 2 weeks ago

Thanks @patrick-stephens, I'll attempt to obtain the stack trace first, as I currently don't have any clear information.

patrick-stephens commented 2 weeks ago

Honestly, I would update your dependencies first as well. You can of course run your existing setup to get the stack trace, but ultimately I bet it's a dependency problem.