fluent / fluent-bit

Fast and Lightweight Logs and Metrics processor for Linux, BSD, OSX and Windows
https://fluentbit.io
Apache License 2.0

Fluent Bit gets stuck when an output is offline #4378

Closed merveyilmaz-netrd closed 2 years ago

merveyilmaz-netrd commented 2 years ago

Bug Report

**Describe the bug**
Hi, I am sending about 80 MB of logs per minute to a Kafka output, and under normal conditions this works. Then I ran a failure test: I stopped the Kafka service for 30 minutes and started it again. Since I am using filesystem buffering, chunks accumulated on the filesystem while Kafka was down, and the expected behavior after the restart is that Fluent Bit keeps sending those buffered chunks to Kafka. Instead, Fluent Bit sent logs for a few minutes after Kafka came back, then froze: it stopped sending entirely and only kept accumulating data on the filesystem. When I restarted the Fluent Bit service, it started sending logs to Kafka again, but I then saw some errors; the Fluent Bit logs from after that restart are shared below. Fluent Bit cannot create new chunks because there are too many open chunk files. I have also shared my configuration; please review it in case the issue is configuration related.

**To Reproduce**

Steps to reproduce the problem (a shell sketch follows the list):

1. Configure a `tail` input with filesystem buffering and a `kafka` output (configuration shared under "Your Environment" below).
2. Send about 80 MB of logs per minute through the pipeline.
3. Stop the Kafka service for 30 minutes so chunks accumulate on the filesystem.
4. Start the Kafka service again: delivery resumes for a few minutes, then stalls.
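A minimal sketch of that scenario, assuming Kafka runs as a systemd service named `kafka` and that a log generator is already writing roughly 80 MB/min into `/opt/flb/` (both names are assumptions, not from the report):

```sh
# Hypothetical repro; the "kafka" unit name is an assumption.
systemctl stop kafka     # take the output offline; chunks pile up under /opt/flb-storage
sleep 1800               # keep the broker down for 30 minutes
systemctl start kafka    # delivery resumes briefly, then Fluent Bit stalls
```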

**Expected behavior**
Keep sending logs when kafka is available


**Your Environment**
* Version used: 1.8.8
* Configuration:

```
[SERVICE]
    Flush                      1
    Daemon                     off
    Log_Level                  info
    HTTP_Server                on
    HTTP_Listen                0.0.0.0
    HTTP_Port                  2020
    storage.path               /opt/flb-storage
    storage.sync               normal
    storage.backlog.mem_limit  500M

[INPUT]
    Name           tail
    Path           /opt/flb/*
    Tag            test
    storage.type   filesystem
    Mem_Buf_Limit  500MB
    DB             /opt/db.file
    DB.sync        normal

[OUTPUT]
    Name                        kafka
    Match                       test
    Brokers                     XX.XXX.XXX.XXX:9092
    Topics                      logForwarding
    rdkafka.message.timeout.ms  0
    Retry_Limit                 False
```
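For reference, the storage options that bound how many chunks Fluent Bit keeps "up" in memory versus "down" on disk, and how large the filesystem buffer may grow, look roughly like this. This is a sketch based on the documented 1.8.x options `storage.max_chunks_up` and `storage.total_limit_size`; neither appears in the configuration above, and the values are placeholders:

```
[SERVICE]
    # Maximum number of chunks held in memory at once; remaining chunks stay "down" on disk.
    storage.max_chunks_up    128

[OUTPUT]
    # Cap the filesystem buffer for this output; once reached, the oldest chunks are dropped.
    storage.total_limit_size 5G
```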

* Environment name and version (e.g. Kubernetes? What version?):
* Server type and version:
* Operating System and version: CentOS 7.9
* Filters and plugins: tail input, kafka output

kc8421 commented 2 years ago

This might be a duplicate of #4373.

edsiper commented 2 years ago
> errno=24] Too many open files

You should increase your file descriptor limits.
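On a systemd-based distribution like the reporter's CentOS 7.9, raising the limit for the service itself (rather than the shell) would look roughly like this; a sketch, assuming the unit is named `td-agent-bit.service`:

```sh
# Hypothetical drop-in; unit name and limit value are assumptions.
mkdir -p /etc/systemd/system/td-agent-bit.service.d
cat > /etc/systemd/system/td-agent-bit.service.d/limits.conf <<'EOF'
[Service]
LimitNOFILE=200000
EOF
systemctl daemon-reload
systemctl restart td-agent-bit
```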

merveyilmaz-netrd commented 2 years ago
> errno=24] Too many open files
>
> You should increase your file descriptor limits.

I will try. But I have another problem, and I am not sure whether it is a bug: after a while, Fluent Bit stops accumulating chunks on the filesystem. I noticed that once the total task count reaches 2048, it stops accumulating new chunks. Is that limit configurable?

The dump below was emitted to syslog; the repeated `Dec 1 10:44:47 centos1 td-agent-bit:` prefixes are stripped for readability.

```
[2021/12/01 10:44:47] Fluent Bit Dump

===== Input =====
tail.0 (tail)
│
├─ status
│  └─ overlimit     : no
│     ├─ mem size   : 466.7M (489390065 bytes)
│     └─ mem limit  : 476.8M (500000000 bytes)
│
├─ tasks
│  ├─ total tasks   : 2048
│  ├─ new           : 0
│  ├─ running       : 2048
│  └─ size          : 3.0G (3184583125 bytes)
│
└─ chunks
   └─ total chunks  : 2367
      ├─ up chunks  : 319
      ├─ down chunks: 2048
      └─ busy chunks: 2048
         ├─ size    : 0b (0 bytes)
         └─ size err: 0

storage_backlog.1 (storage_backlog)
│
├─ status
│  └─ overlimit     : no
│     ├─ mem size   : 0b (0 bytes)
│     └─ mem limit  : 0b (0 bytes)
│
├─ tasks
│  ├─ total tasks   : 0
│  ├─ new           : 0
│  ├─ running       : 0
│  └─ size          : 0b (0 bytes)
│
└─ chunks
   └─ total chunks  : 0
      ├─ up chunks  : 0
      ├─ down chunks: 0
      └─ busy chunks: 0
         ├─ size    : 0b (0 bytes)
         └─ size err: 0

===== Storage Layer =====
total chunks     : 2367
├─ mem chunks    : 0
└─ fs chunks     : 2367
   ├─ up         : 319
   └─ down       : 2048
```
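The same chunk and task counters can also be polled while the process runs, since the configuration above enables the built-in HTTP server on port 2020; a sketch, assuming the 1.8.x monitoring endpoints:

```sh
# Endpoints per the Fluent Bit monitoring API docs; treat the paths as assumptions if your version differs.
curl -s http://127.0.0.1:2020/api/v1/metrics   # input/output record and retry counters
curl -s http://127.0.0.1:2020/api/v1/storage   # chunk counts (mem vs fs, up vs down)
```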

merveyilmaz-netrd commented 2 years ago

I have increased my fd limits as follows: `ulimit -n 200000`

But the problem still exists: after a while, Fluent Bit stops accumulating data on the filesystem.

merveyilmaz-netrd commented 2 years ago

> I have increased my fd limits as follows: `ulimit -n 200000`
>
> But the problem still exists: after a while, Fluent Bit stops accumulating data on the filesystem.

I have now raised the fd limits at the service level, and I am no longer getting the "too many open files" error, but the issue continues.

github-actions[bot] commented 2 years ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

github-actions[bot] commented 2 years ago

This issue was closed because it has been stalled for 5 days with no activity.

karan56625 commented 1 year ago

Is this fixed? I am facing the same issue.

carlosrmendes commented 9 months ago

Facing the same issue; any feedback?

damiancalabresi commented 8 months ago

I'm experiencing the same issue. In my case there is no "too many open files" error, so that is not the cause.

Something all of these failures have in common is the log message "re-schedule retry".