fluent / fluent-bit

Fast and Lightweight Logs and Metrics processor for Linux, BSD, OSX and Windows
https://fluentbit.io
Apache License 2.0

Proposal: Allow Output Plugins to optionally control their own OK/Retry/Error metrics #6141

Open PettitWesley opened 2 years ago

PettitWesley commented 2 years ago

Background

Fluent Bit Flushing and Metrics Mechanism

In Fluent Bit, logs are ingested from inputs and then batched into chunks of roughly 2 MB. These are buffered and then sent to outputs. An output flush event is supposed to either succeed or fail to send the entire chunk. Flush events should ideally be independent of each other, which lets concurrent workers send multiple chunks at a time.

So for each chunk, an output must return one of FLB_OK, FLB_RETRY, or FLB_ERROR. These return values are counted and become the Fluent Bit Prometheus metrics: https://docs.fluentbit.io/manual/administration/monitoring
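As a rough illustration of the counting described above, here is a minimal sketch. The enum and struct are local stand-ins invented for this example; they are not Fluent Bit's real definitions, and the real engine records these values through its own metrics code.

```c
#include <stdint.h>

/* Local stand-ins for FLB_OK / FLB_RETRY / FLB_ERROR.
 * The numeric values here are illustrative only. */
enum flush_result { FLUSH_OK, FLUSH_RETRY, FLUSH_ERROR };

/* Hypothetical per-output counters, one per possible flush result. */
struct output_counters {
    uint64_t ok;
    uint64_t retries;
    uint64_t errors;
};

/* Record the outcome of one chunk flush: exactly one counter moves. */
static void count_flush(struct output_counters *c, enum flush_result r)
{
    switch (r) {
    case FLUSH_OK:
        c->ok++;
        break;
    case FLUSH_RETRY:
        c->retries++;
        break;
    case FLUSH_ERROR:
        c->errors++;
        break;
    }
}
```

The point of the sketch is that the unit being counted is always one chunk flush, whatever the plugin did internally with that chunk.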

S3 Output Buffering

The S3 plugin has its own buffering mechanism because it serves a unique use case. Customers expect large files in their S3 bucket, potentially several GB in size. Tons of little 2 MB files, one created for each chunk, would be a poor user experience.

Thus, AWS implemented custom buffering in out_s3. The plugin accepts data from flush events and buffers it in the file system to build up large files, so most flush events do not send any data to S3. As a result, the plugin returns FLB_OK for every flush unless there was an error writing to the buffer files. This means that the Prometheus metrics for S3 are meaningless.
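The behaviour described above can be sketched as follows. This is a simplified model, not the out_s3 source: the struct, the upload callback, and the threshold logic are all assumptions made for illustration, and buffer-write errors are left out.

```c
#include <stddef.h>
#include <stdint.h>

/* Local stand-ins for FLB_OK / FLB_RETRY / FLB_ERROR (illustrative values). */
enum flush_result { FLUSH_OK, FLUSH_RETRY, FLUSH_ERROR };

/* Hypothetical model of a plugin that buffers chunks into one large file. */
struct s3_buffer {
    size_t buffered_bytes;    /* data accumulated so far */
    size_t upload_threshold;  /* target file size before uploading */
    uint64_t uploads_ok;      /* what users actually care about */
    uint64_t uploads_failed;
};

/* Hypothetical upload callback: returns 0 on success, non-zero on failure. */
typedef int (*upload_fn)(size_t bytes);

/* Stub used to demonstrate a failing upload. */
static int upload_always_fails(size_t bytes)
{
    (void) bytes;
    return -1;
}

/* Accept one chunk. The return value only says "the chunk was buffered";
 * it carries no information about whether any upload succeeded. */
static enum flush_result buffered_flush(struct s3_buffer *b,
                                        size_t chunk_bytes,
                                        upload_fn upload)
{
    b->buffered_bytes += chunk_bytes;

    if (b->buffered_bytes >= b->upload_threshold) {
        if (upload(b->buffered_bytes) == 0) {
            b->uploads_ok++;
            b->buffered_bytes = 0;
        }
        else {
            /* Data stays buffered so a later attempt can retry it. */
            b->uploads_failed++;
        }
    }

    return FLUSH_OK;
}
```

With `upload_always_fails` passed in, every call still returns `FLUSH_OK`, so a counter driven by flush results would show only successes while `uploads_failed` keeps growing.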

Problem Statement

Because S3 has its own buffering and retry strategy, the Prometheus output metrics are useless for customers. Actually, they are worse than useless: they are misleading. In most cases, the metrics for S3 will show only success, since the plugin almost always returns FLB_OK. Thus, customers may think their uploads are all succeeding even when they are not.

Goal

S3 output metrics in the Prometheus endpoint should be meaningful and useful to users. S3 customers do not think in terms of failed chunks; they think in terms of failed file uploads.

Proposal: Allow Output Plugins to optionally control their own OK/Retry/Error metrics

Currently, there is code that increments the error, retry and success metrics when flush events are completed here:

The proposal is that Fluent Bit supports an output flag, FLB_OUTPUT_OMIT_METRICS, which, if set, will bypass the metric counter code above.

Instead, the plugin will have the option of incrementing the cmetric counters itself. This would allow the S3 output to increment the counters based on file uploads.
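A minimal sketch of how such a flag could work is shown below. Everything in it is hypothetical: `OUTPUT_OMIT_METRICS` stands in for the proposed FLB_OUTPUT_OMIT_METRICS (the bit value is assumed), and the struct and function names are invented for this example rather than taken from Fluent Bit or its cmetrics API.

```c
#include <stdint.h>

/* Stand-in for the proposed FLB_OUTPUT_OMIT_METRICS flag (bit value assumed). */
#define OUTPUT_OMIT_METRICS (1 << 0)

/* Local stand-ins for FLB_OK / FLB_RETRY / FLB_ERROR (illustrative values). */
enum flush_result { FLUSH_OK, FLUSH_RETRY, FLUSH_ERROR };

/* Hypothetical output instance with its flags and counters. */
struct output_instance {
    int flags;
    uint64_t ok;
    uint64_t retries;
    uint64_t errors;
};

/* Engine side: called when a flush completes. If the plugin opted out,
 * the engine leaves the counters alone. */
static void engine_on_flush_done(struct output_instance *o, enum flush_result r)
{
    if (o->flags & OUTPUT_OMIT_METRICS) {
        return;
    }

    switch (r) {
    case FLUSH_OK:
        o->ok++;
        break;
    case FLUSH_RETRY:
        o->retries++;
        break;
    case FLUSH_ERROR:
        o->errors++;
        break;
    }
}

/* Plugin side: an opted-out plugin updates the same counters itself,
 * here once per file upload instead of once per chunk. */
static void plugin_on_upload_done(struct output_instance *o, int upload_succeeded)
{
    if (upload_succeeded) {
        o->ok++;
    }
    else {
        o->errors++;
    }
}
```

In this model, outputs that do not set the flag behave exactly as before, which keeps the change opt-in.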

Usefulness for other plugins

This could potentially be useful for other outputs as well. Many outputs sometimes have to make multiple requests to send a single chunk. For example, if an API (ex: AWS CloudWatch PutLogEvents) has a 1 MB max payload size, then each ~2MB chunk could take 2+ requests to upload.
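The arithmetic behind the "2+ requests" figure is a ceiling division. The helper below is a hypothetical illustration; it gives a lower bound only, since a real plugin must also split on record boundaries and account for per-request overhead.

```c
#include <stddef.h>

/* Minimum number of requests needed to send chunk_bytes when each request
 * can carry at most max_payload_bytes. Returns 0 for an invalid limit. */
static size_t min_requests_for_chunk(size_t chunk_bytes, size_t max_payload_bytes)
{
    if (max_payload_bytes == 0) {
        return 0;
    }
    /* Ceiling division without floating point. */
    return (chunk_bytes + max_payload_bytes - 1) / max_payload_bytes;
}
```

For a chunk of exactly 2 MiB and a 1 MiB limit this gives 2 requests; a chunk one byte larger needs 3.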

Output plugin developers may prefer that the retry/success metrics for their plugin be based on API calls, not chunks. Essentially, the argument here is that end users do not think in terms of chunks, since a chunk is an internal Fluent Bit concept, and thus the Prometheus metrics are most useful when they are tied to end-user needs.

jeongukjae commented 1 year ago

+1

It would be even more useful if the metrics could also be controlled from Golang output plugins.

jmcarp commented 1 year ago

What's the status of this proposal? Given that metrics aren't meaningful for the s3 output plugin, do you all have a recommendation for monitoring it?

PettitWesley commented 1 year ago

@jmcarp unfortunately, I don't really have any good ideas besides that you can monitor for the rates of different error and warn messages, including these: https://github.com/aws/aws-for-fluent-bit/issues/702

Sorry!

jmcarp commented 1 year ago

I'm not sure how far along this proposal is, but what do you think about adding new metrics to the s3 output plugin like these custom metrics in the stackdriver plugin: https://github.com/fluent/fluent-bit/blob/master/plugins/out_stackdriver/stackdriver_conf.c#L524-L568? That wouldn't fix the problem of the generic metrics being misleading for the s3 plugin, but at least it would be possible to monitor the plugin using metrics.

PettitWesley commented 1 year ago

@jmcarp yea that has been my plan for some time, unfortunately I have not gotten time to implement it, and won't soon. If you or anyone else would like to try, let me know, and I can try to help.