fluent / fluent-bit

Fast and Lightweight Logs and Metrics processor for Linux, BSD, OSX and Windows
https://fluentbit.io
Apache License 2.0

tail: cannot resume collector tail.N:0, already running #8972

Open willholley opened 1 month ago

willholley commented 1 month ago

Bug Report

When starting Fluent Bit, we frequently encounter the following errors:

[2024/06/17 17:19:20] [error] [input] cannot resume collector tail.1:0, already running
[2024/06/17 17:19:27] [error] [input] cannot resume collector tail.1:2, already running
[2024/06/17 17:19:27] [error] [input] cannot resume collector tail.1:6, already running

We have a perhaps unusual configuration that uses multiple tail inputs to watch the same files.

The thinking behind this was to have logically independent pipelines for different outputs so that if one output is unavailable and causes a pipeline to pause, we could still continue to ingest logs to our secondary output.

Both tail inputs use SQLite to track offsets, but each input has its own database so that rows, which are keyed on inode, do not conflict.

To Reproduce

Define multiple tail inputs with DB tracking enabled, all tailing the same directories. While the agent is watching multiple files, restart the agent service.
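
For illustration, a minimal sketch of the kind of setup described above, written in Fluent Bit's classic configuration format. The paths, tags, database file names and the use of the stdout output are assumptions chosen for the example; this is not the configuration from the affected deployment and has not been confirmed to reproduce the error.

```
# Illustrative only: paths, tags, DB files and outputs are assumed values.
[SERVICE]
    Flush        1
    Log_Level    info

# Pipeline A: tails the log directory for the primary output.
[INPUT]
    Name         tail
    Tag          primary.*
    Path         /var/log/app/*.log
    DB           /var/lib/fluent-bit/tail-primary.db

# Pipeline B: tails the same files, tracking offsets in its own database.
[INPUT]
    Name         tail
    Tag          secondary.*
    Path         /var/log/app/*.log
    DB           /var/lib/fluent-bit/tail-secondary.db

# Each output matches only one pipeline's tag, keeping the two independent.
[OUTPUT]
    Name         stdout
    Match        primary.*

[OUTPUT]
    Name         stdout
    Match        secondary.*
```

Both inputs share the same `Path` but point `DB` at different files, which mirrors the "distinct databases" arrangement described in the report.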

Expected behavior

I'd expect the agent to handle this situation without error. It seems like a race condition, perhaps due to an assumption that inode is a unique identifier for all tail inputs?

Your Environment

patrick-stephens commented 1 month ago

Can you include your actual config, or an example one that reproduces the problem? I'd also double-check whether stepping up to 3.0.7 resolves it.

willholley commented 1 month ago

I'll work on a reproducer. I wonder whether https://github.com/fluent/fluent-bit/issues/8972 is related - with lots of tail inputs we have to massage the order of the inputs to get the databases to initialize correctly, suggesting some kind of race condition.