A caveat about the strategy I used above: if you're facing high log volumes, you may find that lots of events get "stuck" in SQS. As unprocessed events pile up in SQS, individual Fluentd node performance drops, likely because each poll of the queue returns a large batch of events that the node can't work through efficiently.
You probably won't see this adverse behavior in a single-node architecture. My architecture has three Fluentd nodes polling a single SQS queue. Without my patch I was losing data, but even with my patch I still lost data because I had to purge the SQS queue periodically.
This patch may be useful in some situations, and should probably be implemented behind a configuration option, something like `toss_event_if_no_regexp_match`.
Describe the bug
Although the `aws_sdk_sqs/queue_poller` module does indeed delete messages only at the end of the code block passed to it while polling (assuming `:skip_delete` is `false`), the use of `next unless @match_regexp.match?(key)` short-circuits the block and the delete action still occurs.

My setup involves multiple Fluentd nodes pointing at one SQS queue. Fluentd stores events in S3 using the hostname as part of the path, and a regexp matching the hostname is used to pull events back into Fluentd, because I want the node that originally processed an event to process it a second time when sending it to OpenSearch.
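For context, here is a minimal sketch of the polling pattern involved (my own illustration, not the plugin's actual code; the queue URL, regexp, and the `puts` placeholder are assumptions). With `Aws::SQS::QueuePoller`, a message is deleted whenever the poll block finishes normally, and `next` is just a normal early return, whereas `throw :skip_delete` leaves the message on the queue:

```ruby
require 'aws-sdk-sqs'
require 'json'
require 'socket'

queue_url    = 'https://sqs.us-east-1.amazonaws.com/123456789012/example-queue' # placeholder
match_regexp = Regexp.new(Regexp.escape(Socket.gethostname))                    # placeholder

poller = Aws::SQS::QueuePoller.new(queue_url)

poller.poll do |msg|
  # Pull the object key out of the S3 event notification (simplified).
  key = JSON.parse(msg.body).dig('Records', 0, 's3', 'object', 'key').to_s

  # `next` only exits this block early; the poller still treats the message
  # as handled and deletes it from the queue.
  next unless match_regexp.match?(key)

  # To leave the message on the queue instead, the block must throw:
  #   throw :skip_delete unless match_regexp.match?(key)

  puts "would ingest #{key}"   # placeholder for the real ingestion work
end
```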
To Reproduce
Using multiple Fluentd nodes, send events to S3 with each node's unique hostname as part of the object path.
Ingest the events with the S3 input plugin and an appropriately configured SQS queue, setting `match_regexp` to match the hostname portion of the path.

If an event written by host A is picked up by host B, for example, the event won't be processed, but it will be deleted from the queue and will never reach your ultimate destination.
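For illustration, a minimal configuration sketch of this kind of setup (the bucket, region, queue name, tag pattern, and path layout are placeholders I've assumed, not taken from the report); each node writes under its own hostname prefix and `match_regexp` restricts ingestion to that prefix:

```
# Output side: each node writes its events under a per-host prefix.
<match app.**>
  @type s3
  s3_bucket example-logs-bucket
  s3_region us-east-1
  path "logs/#{Socket.gethostname}/"
</match>

# Input side: every node polls the same SQS queue, but only ingests
# objects whose key matches its own hostname.
<source>
  @type s3
  s3_bucket example-logs-bucket
  s3_region us-east-1
  match_regexp "logs/#{Socket.gethostname}/"
  <sqs>
    queue_name example-s3-events-queue
  </sqs>
</source>
```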
Expected behavior
The expected behavior would be that the event would not be deleted, but left on the queue for the appropriate host to process.
Additional context
I have a patch that works to prevent this issue. You all may prefer a more nuanced approach, but this works for me:
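As an illustration of the general direction only (a sketch under the same assumptions as the earlier snippet, not the actual patch), the key change is to use the queue poller's `throw :skip_delete` escape rather than `next`, so that skipped messages stay on the queue for the host whose hostname matches:

```ruby
# Sketch only, inside the SQS polling loop from the earlier snippet:
poller.poll do |msg|
  key = JSON.parse(msg.body).dig('Records', 0, 's3', 'object', 'key').to_s

  # Before: `next unless match_regexp.match?(key)` -- the message is deleted
  # even though this node never processed it.
  # After: skip deletion so another node can receive the message once its
  # visibility timeout expires.
  throw :skip_delete unless match_regexp.match?(key)

  puts "would ingest #{key}"   # placeholder for the real ingestion work
end
```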