fluent / fluent-bit

Fast and Lightweight Logs and Metrics processor for Linux, BSD, OSX and Windows
https://fluentbit.io
Apache License 2.0

Fluent bit stops uploading logs when there are too many open chunks #3541

Closed MPeli closed 3 years ago

MPeli commented 3 years ago

Bug Report

Describe the bug Fluent bit stops uploading logs when there are too many open chunks.

To Reproduce Not really sure how yet; I will add more details once I find more. It happens on a server where there are 1500 log files in a folder (not sure if that matters).

[error] [storage] [cio file] too many open chunks on 'tail.0/6768-1621880199.517401800.flb' (cio_file_up() line=791)

Expected behavior Fluent Bit should not get stuck; it should keep trying to reconnect.
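
For context, this error is emitted when Fluent Bit tries to bring a chunk up into memory but the number of in-memory chunks has already reached the service-level limit storage.max_chunks_up (default 128). Below is a minimal sketch of the storage settings involved; the path and values are illustrative placeholders, not taken from this report:

```
[SERVICE]
    # Location of filesystem chunks (the .flb files named in the error).
    # This path is an illustrative placeholder.
    storage.path               C:\fluent-bit\storage
    # Maximum number of chunks allowed to be "up" (loaded in memory) at once.
    # Default is 128; the error above fires when a chunk cannot be brought up.
    storage.max_chunks_up      128
    # Memory limit for the storage_backlog input that replays pending chunks.
    storage.backlog.mem_limit  5M

[INPUT]
    Name          tail
    Path          C:\path\*_*.Xlog.log
    # Per-input buffering mode; 'filesystem' persists chunks as .flb files.
    storage.type  filesystem
```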

Your Environment

Fluent Bit Enterprise - SOS Report
==================================
The following report aims to be used by Fluent Bit and Fluentd community users.

[Fluent Bit]
    Version             1.7.6
    Built Flags          FLB_HAVE_PARSER FLB_HAVE_RECORD_ACCESSOR FLB_HAVE_STREAM_PROCESSOR JSMN_PARENT_LINKS JSMN_STRICT FLB_HAVE_TLS FLB_HAVE_AWS FLB_HAVE_SIGNV4 FLB_HAVE_SQLDB FLB_HAVE_TRACE FLB_HAVE_TIMESPEC_GET FLB_HAVE_PROXY_GO FLB_HAVE_REGEX FLB_HAVE_UTF8_ENCODER FLB_HAVE_LUAJIT

[Operating System]
    Name                Windows
    Version             6.2
    Build               9200

[Hardware]
    Architecture        x64 (AMD or Intel)
    Processors          12

[Built Plugins]
    Inputs              emitter tail dummy statsd storage_backlog stream_processor winlog tcp lib forward random
    Filters             alter_size aws record_modifier throttle throttle_size kubernetes modify nest parser expect grep rewrite_tag lua stdout geoip2
    Outputs             azure azure_blob counter datadog es file forward http influxdb logdna loki nrlogs null slack splunk stackdriver stdout syslog tcp flowcounter gelf websocket cloudwatch_logs kinesis_streams s3

[SERVER] Runtime configuration
    Flush               5.000000
    Daemon              On
    Log_Level           Trace

[INPUT] Instance
    Name                tail.0 (tail, id=0)
    Flags
    Threaded            No
    Tag                 <appname>-<pid>
    Mem_Buf_Limit       95.4M
    Path                C:\path\*_*.Xlog.log,D:\path\*_*.Xlog.x64.log
    Multiline           off
    Parser_Firstline    xlog_parser
    Path_Key            path
    Offset_Key          1
    DB                  fluent.db
    Read_from_Head      On
    Tag_Regex           ^.*\\(?<appname>.+)_(?<pid>[0-9]+).Xlog.*.log$
    Buffer_Chunk_Size   128k
    Buffer_Max_Size     256k
    Ignore_Older        10d
    Routes              es.0

[INPUT] Instance
    Name                storage_backlog.1 (storage_backlog, id=1)
    Flags
    Threaded            No
    Tag                 storage_backlog.1
    Routes              es.0

[OUTPUT] Instance
    Name                es.0 (es, id=0)
    Match               *
    TLS Active          Yes
    TLS.Verify          Off
    TLS.Ca_File         (not set)
    TLS.Crt_File        (not set)
    TLS.Key_File        (not set)
    TLS.Key_Passwd      (not set)
    Retry Limit         no limit
    Host.TCP_Port       443
    Host.Name           abcd.eu-west-1.es.amazonaws.com
    Index               fluent-bit
    Logstash_Format     true
    HTTP_User           aaa
    HTTP_Passwd         bbb
    Trace_Output        On
    Trace_Error         On
    Buffer_Size         False

fluent-bit-too-many-open-chunks.log

MPeli commented 3 years ago

To Reproduce

fluent-bit.log

igorpeshansky commented 3 years ago

Seems to be the same as #3338…

josephflynch commented 3 years ago

This is affecting GCP customer(s).

niedbalski commented 3 years ago

Hello @MPeli

I am following the reproduction procedure outlined in your last comment, but on Linux, using the latest master build.

So far the issue isn't reproducible. Can you please review whether the following procedure and configuration match your setup?

  1. Created 120 files totalling ~600 MiB (a sketch for generating similar test data is included after the configuration below).

    $ ls -lh /data/*.xlog.log | wc -l
    120

    $ du -ach /data/*.xlog.log | grep -i total
    606M    total

  2. Set storage.max_chunks_up to 32 and confirmed it in the logs:

```sh
$ docker-compose logs fluentbit | grep -i chunks
fluentbit               | [2021/06/09 00:16:38] [ info] [storage] normal synchronization mode, checksum disabled, max_chunks_up=32
```

  3. Configuration file in use:

[SERVICE]
    Flush               5.000000
    Daemon              Off
    Log_Level           Trace
    storage.max_chunks_up     32

[INPUT]
    Name                tail
    Mem_Buf_Limit       95.4M
    Path                /data/*.xlog.log
    Multiline           off
    Parser_Firstline    xlog_parser
    Path_Key            path
    Offset_Key          1
    DB                  fluent.db
    Read_from_Head      On
    Buffer_Chunk_Size   128k
    Buffer_Max_Size     256k
    Ignore_Older        10d

[OUTPUT]
    Name                es
    Match               *
    Host                elasticsearch-master
    Port                9200
    Index               fluentbit

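For anyone trying to recreate a similar data set, here is a rough sketch of how the test files could be generated. The file naming and per-file size are assumptions chosen only to match the totals above (120 files, roughly 600 MiB in total); they are not the exact commands that were used:

```sh
# Sketch only: create 120 files of ~5 MiB each (~600 MiB total) under /data,
# named so they match the tail Path /data/*.xlog.log in the configuration above.
# Random base64 content is an assumption; any line-oriented text would do.
mkdir -p /data
for i in $(seq 1 120); do
    base64 /dev/urandom | head -c $((5 * 1024 * 1024)) > "/data/app_${i}.xlog.log"
done
```
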
qingling128 commented 3 years ago

We observed the same issue with 1.7.6, but it was fixed by 1.7.8. Please give that a try.

github-actions[bot] commented 3 years ago

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] commented 3 years ago

This issue was closed because it has been stalled for 5 days with no activity.

jkschulz commented 2 years ago

This doesn't seem to be fixed in 1.7.8 or 1.8.4. I tested a custom build of 1.8.4 with chunkio 1.1.5 on Windows 2016. When I forced it to lose its connection (by breaking the DNS resolver), CPU usage climbed sharply, the log file kept growing with errors, and Fluent Bit kept restarting.

After a while I started to see [error] [storage] [cio file] too many open chunks on 'tail.1/6392-1638856403.90869100.flb' (cio_file_up() line=791) in the logs. Once I saw those, I restored the connection. CPU dropped back down, but the too many open chunks errors (and others) continued in the logs. A few logs (202) from that tail input did get through right after reconnecting, but then they suddenly stopped and no more were sent. Another input, winlog (which sent logs through a different output plugin than the tail input), was able to send logs as expected after the connection was restored.

Steps to reproduce

  1. Create Windows 2016 VM

    $ gcloud compute instances create win2016 --image-project windows-cloud --image-family windows-2016 --machine-type=n1-standard-16
    # Write down username+password from output for use later
    $ gcloud compute reset-windows-password --quiet win2016   
    # For connecting via Microsoft Remote Desktop
    $ gcloud compute start-iap-tunnel --local-host-port=localhost:13000 win2016  3389
  2. Connect via Microsoft Remote Desktop using username and password from setup

  3. Open Powershell as admin, and install custom Ops Agent build

    > (New-Object Net.WebClient).DownloadFile("https://dl.google.com/cloudagents/add-google-cloud-ops-agent-repo.ps1", "${env:UserProfile}\add-google-cloud-ops-agent-repo.ps1")
    > $env:REPO_SUFFIX="20211201-2"; & $env:UserProfile\add-google-cloud-ops-agent-repo.ps1 -AlsoInstall
  4. Break DNS resolution, for example by setting an invalid DNS server in the network adapter's IPv4 properties (a PowerShell alternative is sketched after these steps). Make sure to close each of the settings windows after making the changes, otherwise they will not take effect.

  5. Observe logs. Observe CPU continuously increasing for Fluent Bit in Task Manager. Wait until you see too many open chunks in logs.

    > notepad "C:\ProgramData\Google\Cloud Operations\Ops Agent\log\logging-module.log"
  6. Restore the connection by reverting the DNS changes from step 4: set the adapter's IPv4 properties back to Obtain DNS server address automatically (or use the PowerShell sketch after these steps).

  7. Observe that CPU usage for Fluent Bit drops back down, while the too many open chunks messages continue to be logged. Check the Logs UI: some logs were ingested right after the connection was restored, but then they suddenly stop and never resume. The logs that are never ingested start at:

    [2021/12/07 06:01:35] [ info] [input:storage_backlog:storage_backlog.2] queueing tail.1:5904-1638855941.703861300.flb

  8. Create a Windows Event Log entry and observe that it is sent successfully.
    > New-EventLog -LogName Application -Source "oa-test-logger"
    > Write-EventLog -LogName Application -Source "oa-test-logger" -EntryType Error -EventID 1 -Message "after reconnecting"

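For reference, the DNS break in step 4 and the restore in step 6 can also be done from an elevated PowerShell prompt instead of the adapter's IPv4 properties dialog. This is only a sketch and assumes the adapter alias is "Ethernet"; it is not the exact procedure used in the steps above:

```powershell
# List adapters to find the right interface alias ("Ethernet" below is an assumption).
Get-NetAdapter

# Step 4 alternative: point the adapter at an unreachable DNS server to break name resolution.
Set-DnsClientServerAddress -InterfaceAlias "Ethernet" -ServerAddresses 192.0.2.1

# Step 6 alternative: restore the automatically assigned DNS servers.
Set-DnsClientServerAddress -InterfaceAlias "Ethernet" -ResetServerAddresses
```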

jkschulz commented 2 years ago

Also, a while ago I tested this on Fluent Bit 1.7.8 (via Ops Agent 2.0.3) and it failed in the same way: logs blew up when the network connection was disrupted. After disconnecting, there were errors like connection #1300 to logging.googleapis.com:443 timed out after 10 seconds in the logs, then after a few minutes of that, lots of errors like [error] [storage] [cio file] too many open chunks on. When I finally reconnected, the too many open chunks errors continued, and I don't think the connection was re-established, because no new logs made it through to Cloud Logging.
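
In case it helps anyone else reproducing this, both error patterns mentioned above can be pulled out of the agent's Fluent Bit log file with a quick search. A sketch, assuming the default Ops Agent log location (the same logging-module.log path as in the steps in my earlier comment):

```powershell
# Show the most recent occurrences of the timeout and open-chunk errors
# in the Ops Agent logging module log (default path used in the steps above).
Select-String -Path "C:\ProgramData\Google\Cloud Operations\Ops Agent\log\logging-module.log" `
    -Pattern "too many open chunks", "timed out after" |
    Select-Object -Last 20
```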