fluent / fluent-bit

Fast and Lightweight Logs and Metrics processor for Linux, BSD, OSX and Windows
https://fluentbit.io
Apache License 2.0

Fluent Bit gets stuck when an output is offline #4378

Closed merveyilmaz-netrd closed 2 years ago

merveyilmaz-netrd commented 2 years ago

Bug Report

**Describe the bug**
Hi, I am sending about 80 MB of logs per minute to a Kafka output, and under normal conditions this works. Then I ran a failure test: I stopped the Kafka service for 30 minutes and started it again. Since I am using filesystem buffering, chunks accumulated on the filesystem while Kafka was down, and the expected behavior after the restart is that Fluent Bit keeps sending those buffered chunks to Kafka. Instead, Fluent Bit sent logs for a few minutes after Kafka came back, then froze: it stopped sending entirely and only kept accumulating data on the filesystem. When I restarted the Fluent Bit service, it started sending logs to Kafka again, but I then saw some errors; the Fluent Bit logs from after that restart are shared below. Fluent Bit cannot create new chunks because there are too many open chunk files. I have also shared my configuration; please review it in case the issue is configuration related.

**To Reproduce**

Steps to reproduce the problem (a shell sketch follows the list):

1. Configure a `tail` input with filesystem buffering and a `kafka` output (configuration shared under "Your Environment" below).
2. Send about 80 MB of logs per minute through the pipeline.
3. Stop the Kafka service for 30 minutes so chunks accumulate on the filesystem.
4. Start the Kafka service again: delivery resumes for a few minutes, then stalls.
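A minimal sketch of that scenario, assuming Kafka runs as a systemd service named `kafka` and that a log generator is already writing roughly 80 MB/min into `/opt/flb/` (both names are assumptions, not from the report):

```sh
# Hypothetical repro; the "kafka" unit name is an assumption.
systemctl stop kafka     # take the output offline; chunks pile up under /opt/flb-storage
sleep 1800               # keep the broker down for 30 minutes
systemctl start kafka    # delivery resumes briefly, then Fluent Bit stalls
```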

**Expected behavior**
Keep sending logs when kafka is available


**Your Environment**
* Version used: 1.8.8
* Configuration:

```
[SERVICE]
    Flush                      1
    Daemon                     off
    Log_Level                  info
    HTTP_Server                on
    HTTP_Listen                0.0.0.0
    HTTP_Port                  2020
    storage.path               /opt/flb-storage
    storage.sync               normal
    storage.backlog.mem_limit  500M

[INPUT]
    Name           tail
    Path           /opt/flb/*
    Tag            test
    storage.type   filesystem
    Mem_Buf_Limit  500MB
    DB             /opt/db.file
    DB.sync        normal

[OUTPUT]
    Name                        kafka
    Match                       test
    Brokers                     XX.XXX.XXX.XXX:9092
    Topics                      logForwarding
    rdkafka.message.timeout.ms  0
    Retry_Limit                 False
```
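For reference, the storage options that bound how many chunks Fluent Bit keeps "up" in memory versus "down" on disk, and how large the filesystem buffer may grow, look roughly like this. This is a sketch based on the documented 1.8.x options `storage.max_chunks_up` and `storage.total_limit_size`; neither appears in the configuration above, and the values are placeholders:

```
[SERVICE]
    # Maximum number of chunks held in memory at once; remaining chunks stay "down" on disk.
    storage.max_chunks_up    128

[OUTPUT]
    # Cap the filesystem buffer for this output; once reached, the oldest chunks are dropped.
    storage.total_limit_size 5G
```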

* Environment name and version (e.g. Kubernetes? What version?):
* Server type and version:
* Operating System and version: CentOS 7.9
* Filters and plugins: tail input, kafka output

kc8421 commented 2 years ago

This might be a duplicate of #4373.

edsiper commented 2 years ago
> errno=24] Too many open files

You should increase your file descriptor limits.
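On a systemd-based distribution like the reporter's CentOS 7.9, raising the limit for the service itself (rather than the shell) would look roughly like this; a sketch, assuming the unit is named `td-agent-bit.service`:

```sh
# Hypothetical drop-in; unit name and limit value are assumptions.
mkdir -p /etc/systemd/system/td-agent-bit.service.d
cat > /etc/systemd/system/td-agent-bit.service.d/limits.conf <<'EOF'
[Service]
LimitNOFILE=200000
EOF
systemctl daemon-reload
systemctl restart td-agent-bit
```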

merveyilmaz-netrd commented 2 years ago
> errno=24] Too many open files
>
> You should increase your file descriptor limits.

I will try. But I have another problem, and I am not sure whether it is a bug: after a while, Fluent Bit stops accumulating chunks on the filesystem. I noticed that once the total task count reaches 2048, it stops accumulating new chunks. Is that limit configurable?

The dump below was emitted to syslog; the repeated `Dec 1 10:44:47 centos1 td-agent-bit:` prefixes are stripped for readability.

```
[2021/12/01 10:44:47] Fluent Bit Dump

===== Input =====
tail.0 (tail)
│
├─ status
│  └─ overlimit     : no
│     ├─ mem size   : 466.7M (489390065 bytes)
│     └─ mem limit  : 476.8M (500000000 bytes)
│
├─ tasks
│  ├─ total tasks   : 2048
│  ├─ new           : 0
│  ├─ running       : 2048
│  └─ size          : 3.0G (3184583125 bytes)
│
└─ chunks
   └─ total chunks  : 2367
      ├─ up chunks  : 319
      ├─ down chunks: 2048
      └─ busy chunks: 2048
         ├─ size    : 0b (0 bytes)
         └─ size err: 0

storage_backlog.1 (storage_backlog)
│
├─ status
│  └─ overlimit     : no
│     ├─ mem size   : 0b (0 bytes)
│     └─ mem limit  : 0b (0 bytes)
│
├─ tasks
│  ├─ total tasks   : 0
│  ├─ new           : 0
│  ├─ running       : 0
│  └─ size          : 0b (0 bytes)
│
└─ chunks
   └─ total chunks  : 0
      ├─ up chunks  : 0
      ├─ down chunks: 0
      └─ busy chunks: 0
         ├─ size    : 0b (0 bytes)
         └─ size err: 0

===== Storage Layer =====
total chunks     : 2367
├─ mem chunks    : 0
└─ fs chunks     : 2367
   ├─ up         : 319
   └─ down       : 2048
```
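The same chunk and task counters can also be polled while the process runs, since the configuration above enables the built-in HTTP server on port 2020; a sketch, assuming the 1.8.x monitoring endpoints:

```sh
# Endpoints per the Fluent Bit monitoring API docs; treat the paths as assumptions if your version differs.
curl -s http://127.0.0.1:2020/api/v1/metrics   # input/output record and retry counters
curl -s http://127.0.0.1:2020/api/v1/storage   # chunk counts (mem vs fs, up vs down)
```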

merveyilmaz-netrd commented 2 years ago

I have increased my fd limits as follows: `ulimit -n 200000`

But the problem still exists: after a while, Fluent Bit stops accumulating data on the filesystem.

merveyilmaz-netrd commented 2 years ago

> I have increased my fd limits as follows: `ulimit -n 200000`
>
> But the problem still exists: after a while, Fluent Bit stops accumulating data on the filesystem.

I have now raised the fd limits at the service level, and I am no longer getting the "too many open files" error, but the issue continues.

github-actions[bot] commented 2 years ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

github-actions[bot] commented 2 years ago

This issue was closed because it has been stalled for 5 days with no activity.

karan56625 commented 1 year ago

Is this fixed? I am facing the same issue.

carlosrmendes commented 9 months ago

Facing the same issue; any feedback?

damiancalabresi commented 8 months ago

I'm experiencing the same issue. In my case there is no "too many open files" error, so that is not the cause.

Something all of these failures have in common is the log message "re-schedule retry".