fluent / fluent-bit

Fast and Lightweight Logs and Metrics processor for Linux, BSD, OSX and Windows
https://fluentbit.io
Apache License 2.0

flb-pipeline worker CPU stays high even when there are no input or output events #9124

Closed · suryamanikar closed this issue 2 weeks ago

suryamanikar commented 3 months ago

Bug Report

The flb-pipeline worker thread's CPU usage stays high even when there is no input or output activity.

Details are in this thread https://fluent-all.slack.com/archives/C0CTQGHKJ/p1721759831344629

Describe the bug
I have an input (tail) and an output (tcp). I am generating traffic with the wrk tool at roughly 27K requests/sec in total. These messages go to a log file, which Fluent Bit reads and forwards over TCP to another host.

```
$ ./wrk -c 10 -t 10 -d 100 "https://100.10.7.153:443/test"
Running 2m test @ https://100.10.7.153:443/test
  10 threads and 10 connections
  Thread Stats   Avg      Stdev     Max   +/- Stdev
    Latency     4.81ms   27.54ms  459.40ms   96.78%
    Req/Sec     2.79k    742.94     3.75k    79.86%
  2730389 requests in 1.67m, 1.15GB read
Requests/sec:  27300.89
Transfer/sec:     11.74MB
```

```
[SERVICE]
    Flush        1
    Daemon       on
    Log_Level    info
    Log_File     /home/user1/fluentbit.log
    Path         /home/user1/fluentbit/
    Workers      2

[INPUT]
    Name               tail
    Path               /home/user1/debug.log
    Parser             json
    Tag                my_tag
    Buffer_Chunk_Size  512k
    Buffer_Max_Size    5M

[FILTER]
    Name    lua
    Match   my_tag
    script  /home/user1/transform_log.lua
    call    transform_record

[OUTPUT]
    Name              tcp
    Match             my_tag
    Host              100.11.7.154
    Port              15555
    Format            json_lines
    json_date_format  java_sql_timestamp
    Workers           4
```

I have tried it without the FILTER; the issue is the same.

I found that the flb-pipeline thread's epoll_wait times out frequently even when there is no data in the pipeline, which causes the CPU to stay high.
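One way to verify this (the command is a suggestion, not part of the original report; `<TID>` is a placeholder for the flb-pipeline thread ID) is to attach strace to that thread and count how often epoll_wait returns with no events:

```sh
# <TID> is the flb-pipeline thread ID (LWP), e.g. taken from `top -H -p <PID>` or `ps -L -p <PID>`.
# -c prints a per-syscall count and time summary when strace detaches (press Ctrl-C after ~10s).
strace -e trace=epoll_wait -c -p <TID>
```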

fluentbit.log:

```
[2024/07/23 22:30:07] [ info] [sp] stream processor started
[2024/07/23 22:30:07] [ info] [input:tail:tail.0] inotify_fs_add(): inode=2686195618 watch_fd=1 name=/home/user1/debug.log
[2024/07/23 22:30:07] [ info] [output:tcp:tcp.0] worker #2 started
[2024/07/23 22:30:07] [ info] [output:tcp:tcp.0] worker #1 started
[2024/07/23 22:30:07] [ info] [output:tcp:tcp.0] worker #3 started
[2024/07/23 22:33:32] [ info] [input:tail:tail.0] inode=2686195618 handle rotation(): /home/user1/debug.log => /home/user1/debug.log.1
[2024/07/23 22:33:32] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=2686195618 watch_fd=1
[2024/07/23 22:33:32] [ info] [input:tail:tail.0] inotify_fs_add(): inode=2686195618 watch_fd=2 name=/home/user1/debug.log.1
[2024/07/23 22:33:32] [ info] [input:tail:tail.0] inotify_fs_add(): inode=2686195619 watch_fd=3 name=/home/user1/debug.log
[2024/07/23 22:33:41] [ info] [input:tail:tail.0] inotify_fs_remove(): inode=2686195618 watch_fd=2
```

The flb-pipeline CPU utilization goes down only slowly, even though there is no traffic:

```
user1 - - 15 1808487 - 15.2 - - 04:00 00:00:36 Tue Jul 23 22:30:06 2024 96195 do_epoll_wait -
user1 - - 14 1808487 - 9.2 - - 06:37 00:00:36 Tue Jul 23 22:30:06 2024 96195 do_epoll_wait -
user1 - - 14 1808487 - 9.1 - - 06:39 00:00:36 Tue Jul 23 22:30:06 2024 96195 do_epoll_wait -
user1 - - 26 1808487 - 8.1 - - 07:30 00:00:36 Tue Jul 23 22:30:06 2024 96195 do_epoll_wait -
user1 - - 26 1808487 - 6.5 - - 09:16 00:00:36 Tue Jul 23 22:30:06 2024 96195 do_epoll_wait -
user1 - - 27 1808487 - 4.3 - - 14:10 00:00:36 Tue Jul 23 22:30:06 2024 96195 do_epoll_wait -
user1 - - 26 1808487 - 2.5 - - 23:51 00:00:36 Tue Jul 23 22:30:06 2024 96195 do_epoll_wait -
user1 - - 28 1808487 - 1.2 - - 49:41 00:00:36 Tue Jul 23 22:30:06 2024 96230 do_epoll_wait -
user1 - - 31 1808487 - 0.9 - - 01:04:37 00:00:37 Tue Jul 23 22:30:06 2024 96230 do_epoll_wait -
```
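The exact ps invocation is not shown in the report; a per-thread listing along the following lines (the field list is a guess based on the columns above) produces a similar view, where wchan = do_epoll_wait indicates the thread is sleeping inside epoll_wait:

```sh
# Hypothetical reconstruction of the per-thread listing above; the field order is guessed
# from the visible columns (%CPU, elapsed time, CPU time, start time, LWP, wchan).
ps -L -o user,psr,pid,pcpu,etime,cputime,lstart,lwp,wchan -p 1808487
```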

To Reproduce

On another host, run Fluent Bit as a consumer to receive the data over TCP:

```
[INPUT]
    Name         tcp
    Listen       100.11.7.154
    Port         15555
    Chunk_Size   32
    Buffer_Size  64

[OUTPUT]
    Name    file
    Match   *
    Format  json_lines
```
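A minimal sketch of starting the consumer instance, assuming the configuration above is saved to a file (the path is a placeholder, not from the report):

```sh
# Start the receiving Fluent Bit instance; -c selects the configuration file.
fluent-bit -c /path/to/consumer.conf
```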

Expected behavior

Screenshots

Your Environment
RHEL8 with 32 cores
Linux 4.18.0-477.36.1.el8_8.x86_64 #1 SMP Thu Nov 9 08:12:18 EST 2023 x86_64 x86_64 x86_64 GNU/Linux

Additional context

suryamanikar commented 3 months ago

Another finding is that the CPU usage reported by ps does not match what top reports per core. I'm attaching a screenshot for reference.

[Screenshot: 2024-07-23 at 5:14:34 PM]

suryamanikar commented 3 months ago

Later, I realized that ps averages CPU usage over the entire lifetime of the process, so it is not reliable for this; top aggregates usage across cores, which is more reliable. Please review and confirm whether my assumption is right, and then I can close this issue.
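For reference, a hedged way to compare the two views (these commands are suggestions, not from the thread): ps reports %CPU as total CPU time divided by the elapsed time since the process started, while top -H and pidstat -t sample per-thread usage over the last refresh interval.

```sh
# ps: %CPU averaged over the whole process lifetime (cputime / elapsed time)
ps -o pid,etime,cputime,pcpu -p <PID>

# top: per-thread usage sampled over the refresh interval (threads shown with -H)
top -H -p <PID>

# pidstat (from sysstat): per-thread CPU usage every second, 10 samples
pidstat -t -p <PID> 1 10
```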

github-actions[bot] commented 2 weeks ago

This issue is stale because it has been open for 90 days with no activity. Remove the stale label or add a comment, or this will be closed in 5 days. Maintainers can add the exempt-stale label.

github-actions[bot] commented 2 weeks ago

This issue was closed because it has been stalled for 5 days with no activity.