fluent / fluent-bit

Fast and Lightweight Logs and Metrics processor for Linux, BSD, OSX and Windows
https://fluentbit.io
Apache License 2.0

CPU constantly high #2469

Closed tarunwadhwa13 closed 4 years ago

tarunwadhwa13 commented 4 years ago

Bug Report

Describe the bug We have a few applications that produce heavy logging, heavy in both size and frequency. The average size of a JSON log line is around 100 KB. We have noticed that Fluent Bit, which normally uses <5% CPU, shoots up to 100% CPU usage and stays there. Every time this happens, the following line is observed in the Docker logs:

[2020/08/14 07:30:03] [ warn] [input:tail:tail.3] file=/opt/logs/flights-nav-ancillaries/flights-nav-ancillaries.json have long lines. Skipping long lines.

Memory usage during that time is <50 MiB, even though 512 MB is allocated per container; CPU allotted is 256 vCPU.
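For context, this warning is emitted by the tail input when a single line does not fit in its read buffer. Per input, that is governed by Buffer_Chunk_Size, Buffer_Max_Size and Skip_Long_Lines; below is a minimal sketch of just those keys (illustrative path, values mirroring the config further down) with their documented behavior:

[INPUT]
      Name tail
      Path /opt/logs/*.json
      # lines are read in Buffer_Chunk_Size increments and may grow up to Buffer_Max_Size
      Buffer_Chunk_Size 20MB
      Buffer_Max_Size 100MB
      # On: a line longer than Buffer_Max_Size is dropped with the
      # "have long lines. Skipping long lines." warning and tailing continues;
      # Off: Fluent Bit would stop monitoring that file instead
      Skip_Long_Lines On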

Expected behavior If the buffer is full, its contents should be flushed and Fluent Bit should start pushing data again.
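A note on that expectation: when a tail input exceeds its Mem_Buf_Limit it is paused rather than flushed on the spot, and ingestion only resumes once the outputs drain the buffered chunks. One option (not part of the setup below) that keeps ingestion going under backpressure is filesystem buffering; a minimal sketch, assuming a writable /var/lib/fluent-bit/ directory inside the container:

[SERVICE]
      # hypothetical addition: spill buffered chunks to disk
      storage.path /var/lib/fluent-bit/
      storage.sync normal
      storage.backlog.mem_limit 50MB

[INPUT]
      Name tail
      Path /opt/logs/*.json
      # with filesystem storage, Mem_Buf_Limit only caps the in-memory portion
      storage.type filesystem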

Your Environment

[SERVICE]
      Flush 4
      Daemon off
      Parsers_File parsers_mmt.conf

[INPUT]
      Name tail
      Path /opt/logs/access.log,/opt/logs/*/access.log,/opt/logs/*/*/access.log
      DB /opt/logs/Offset.db
      Refresh_Interval  5
      Path_Key file
      Mem_Buf_Limit 100MB
      Buffer_Chunk_Size 10MB
      Buffer_Max_Size 50MB
      Skip_Long_Lines On
      Parser mmt
      Tag access

[INPUT]
      Name tail
      Path /opt/logs/nginx-frontend.log,/opt/logs/*/nginx-frontend.log,/opt/logs/*/*/nginx-frontend.log
      DB /opt/logs/Offset.db
      Refresh_Interval  5
      Mem_Buf_Limit 100MB
      Buffer_Chunk_Size 10MB
      Path_Key file
      Buffer_Max_Size 50MB
      Skip_Long_Lines On
      Parser mmt
      Tag access

[INPUT]
      Name tail
      Path /opt/logs/*.log,/opt/logs/*/*.log,/opt/logs/*/*/*.log
      Exclude_Path /opt/logs/access.log,/opt/logs/nginx-frontend.log,/opt/logs/*/nginx-frontend.log,/opt/logs/*/access.log,/opt/logs/*/*/access.log,/opt/logs/*/*/nginx-frontend.log
      DB /opt/logs/Offset-app.db
      Refresh_Interval  5
      Mem_Buf_Limit 200MB
      Buffer_Chunk_Size 10MB
      Buffer_Max_Size 50MB
      Skip_Long_Lines On
      Path_Key file
      Multiline On
      Parser_Firstline mmt
      Tag app

[INPUT]
      Name tail
      Path /opt/logs/*.json,/opt/logs/*/*.json
      DB /opt/logs/Offset-app.db
      Key message
      Refresh_Interval  5
      Mem_Buf_Limit 200MB
      Buffer_Chunk_Size 20MB
      Buffer_Max_Size 100MB
      Skip_Long_Lines On
      Path_Key file
      Tag json_app

[FILTER]
      Name modify
      Match *
      Add app_name ${SERVICE}
      Add DEPLOYMENT_VERSION ${DEPLOYMENT_VERSION}
      Add hostname ${BASE_SERVER_IP}

[OUTPUT]
      Name kafka
      Match access*
      Brokers ${KAFKA_HOST}
      Topics m_${SPACE}_${SERVICE}_mon_access
      Timestamp_Key log_timestamp
      Timestamp_Format iso8601
      rdkafka.compression.codec snappy

[OUTPUT]
      Name kafka
      Match app*
      Brokers ${KAFKA_HOST}
      Topics m_${SPACE}_${SERVICE}_mon_app
      Timestamp_Key log_timestamp
      Timestamp_Format iso8601
      rdkafka.compression.codec snappy

[OUTPUT]
      Name kafka
      Match json_app
      Brokers ${KAFKA_HOST}
      Topics json_m_${SPACE}_${SERVICE}_mon_app
      Timestamp_Key log_timestamp
      Timestamp_Format iso8601
      rdkafka.compression.codec snappy
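To see which input or output is busy while the CPU is pegged, one option (not in the config above) is Fluent Bit's built-in HTTP monitoring interface; a minimal sketch of the extra [SERVICE] keys, assuming port 2020 is reachable from the host:

[SERVICE]
      Flush 4
      Daemon off
      Parsers_File parsers_mmt.conf
      # hypothetical addition: expose built-in metrics over HTTP
      HTTP_Server On
      HTTP_Listen 0.0.0.0
      HTTP_Port 2020

The per-plugin record and byte counters are then available at /api/v1/metrics (for example, curl http://127.0.0.1:2020/api/v1/metrics), which would make it easier to confirm whether tail.3 stops emitting records when the CPU spikes.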

Additional context The main problem is not the CPU utilization itself; either way, Fluent Bit shouldn't stop forwarding logs.

I took an internal dump as mentioned in the docs. It looks like it has left all the tasks:
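(For reference, the dump below was triggered by sending SIGCONT to the Fluent Bit process, e.g. kill -CONT <pid>, which is what the "[engine] caught signal (SIGCONT)" line reflects.)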

[engine] caught signal (SIGCONT)
[2020/08/14 08:46:52] Fluent Bit Dump

===== Input =====
tail.0 (tail)
│
├─ status
│  └─ overlimit     : no
│     ├─ mem size   : 0b (0 bytes)
│     └─ mem limit  : 95.4M (100000000 bytes)
│
├─ tasks
│  ├─ total tasks   : 0
│  ├─ new           : 0
│  ├─ running       : 0
│  └─ size          : 0b (0 bytes)
│
└─ chunks
   └─ total chunks  : 0
      ├─ up chunks  : 0
      ├─ down chunks: 0
      └─ busy chunks: 0
         ├─ size    : 0b (0 bytes)
         └─ size err: 0

tail.1 (tail)
│
├─ status
│  └─ overlimit     : no
│     ├─ mem size   : 0b (0 bytes)
│     └─ mem limit  : 95.4M (100000000 bytes)
│
├─ tasks
│  ├─ total tasks   : 0
│  ├─ new           : 0
│  ├─ running       : 0
│  └─ size          : 0b (0 bytes)
│
└─ chunks
   └─ total chunks  : 0
      ├─ up chunks  : 0
      ├─ down chunks: 0
      └─ busy chunks: 0
         ├─ size    : 0b (0 bytes)
         └─ size err: 0

tail.2 (tail)
│
├─ status
│  └─ overlimit     : no
│     ├─ mem size   : 765.5K (783873 bytes)
│     └─ mem limit  : 190.7M (200000000 bytes)
│
├─ tasks
│  ├─ total tasks   : 1
│  ├─ new           : 0
│  ├─ running       : 1
│  └─ size          : 765.5K (783873 bytes)
│
└─ chunks
   └─ total chunks  : 1
      ├─ up chunks  : 1
      ├─ down chunks: 0
      └─ busy chunks: 1
         ├─ size    : 765.5K (783873 bytes)
         └─ size err: 0

tail.3 (tail)
│
├─ status
│  └─ overlimit     : no
│     ├─ mem size   : 284.5K (291354 bytes)
│     └─ mem limit  : 190.7M (200000000 bytes)
│
├─ tasks
│  ├─ total tasks   : 0
│  ├─ new           : 0
│  ├─ running       : 0
│  └─ size          : 0b (0 bytes)
│
└─ chunks
   └─ total chunks  : 1
      ├─ up chunks  : 1
      ├─ down chunks: 0
      └─ busy chunks: 0
         ├─ size    : 0b (0 bytes)
         └─ size err: 0

===== Storage Layer =====
total chunks     : 2
├─ mem chunks    : 2
└─ fs chunks     : 0
   ├─ up         : 0
   └─ down       : 0

tarunwadhwa13 commented 4 years ago

I have enabled debug logs to get better insight into the issue.