fluent / fluent-bit

Fast and Lightweight Logs and Metrics processor for Linux, BSD, OSX and Windows
https://fluentbit.io
Apache License 2.0

Maximum number of attempts in case of failure #6407

Open pierluigilenoci opened 1 year ago

pierluigilenoci commented 1 year ago

Is your feature request related to a problem? Please describe. When an error occurs while sending logs to Elasticsearch, there are two possible behaviors: the log is discarded, or Fluent Bit keeps trying to send it indefinitely. If the log is malformed JSON, ES will never accept it, and the pod keeps retrying forever (e.g. https://github.com/wongnai/kube-slack/issues/51).

Describe the solution you'd like: I would like to be able to configure the maximum number of send attempts before a log is discarded, and ideally to configure an alternative destination for logs that cannot be sent (e.g. S3, a log folder...).

Describe alternatives you've considered: Currently we use the "fluentbit.io/exclude: true" annotation on pods that generate malformed JSON to avoid this problem.

Ref: https://github.com/fluent/fluent-bit/issues/1098

lecaros commented 1 year ago

Hi @pierluigilenoci, to what kind of error are you referring? (4xx, 5xx, other) Do you have a sample? We won't retry 4xx, for example.

pierluigilenoci commented 1 year ago

@lecaros the specific example is in the ticket linked in the description; I'll repeat it here for simplicity.

Basically, if a pod produces logs in this way:

{"name":"kube-slack","hostname":"kube-slack-5fc4b6c55c-chc42","pid":1,"level":30,"msg":"Slack message sent","time":"2019-02-04T16:39:55.107Z","v":0}

Note: This JSON log has a time field inside.

When Docker saves it under /var/log/containers, it produces this:

{"log":"{\"name\":\"kube-slack\",\"hostname\":\"kube-slack-7cf99d5dbd-ffpd7\",\"pid\":1,\"level\":30,\"msg\":\"Slack message sent\",\"time\":\"2019-02-05T13:24:26.193Z\",\"v\":0}\n","stream":"stdout","time":"2019-02-05T13:24:26.193357608Z"}

As you can see, this JSON is invalid because the key time appears twice. For this reason, Elasticsearch rejects the JSON forwarded by Fluent Bit.

patrick-stephens commented 1 year ago

I'm not sure I follow, you can specify a retry limit already: https://docs.fluentbit.io/manual/administration/scheduling-and-retries Set it to N (e.g. 1) and it won't retry sending it after the N failure(s) (first for the example). Or is there something more involved here?
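For reference, a minimal sketch of such a retry limit in classic-mode config, assuming an es output with a catch-all Match (the Match pattern is a placeholder; the host and port are borrowed from the example later in this thread):

```text
[OUTPUT]
    Name            es
    Match           *
    Host            192.168.5.20
    Port            9200
    Logstash_Format On
    # Give up on a chunk after 1 failed retry instead of retrying forever.
    # Per the scheduling docs, False / no_limits removes the cap and
    # no_retries disables retries entirely.
    Retry_Limit     1
```

Once the limit is reached, the chunk is dropped; nothing in this sketch preserves it elsewhere.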

The fallback option is another request I think entirely so we should split that out.

pierluigilenoci commented 1 year ago

The problem is just this:

The request is basically to implement a fallback to avoid losing logs that may be important but are too malformed to be ingested by ES.

patrick-stephens commented 1 year ago

So you want a middle case where it's a recoverable error to then do something to "fix" the error? I think that probably needs something specific to ES/Opensearch as it sounds very specific.

pierluigilenoci commented 1 year ago

@patrick-stephens I was actually thinking of simply configuring a second output to use as a "fallback" to the first.

For example (using the very example in the documentation page you sent):

[OUTPUT]
    Name        http
    Host        192.168.5.6
    Port        8080
    Retry_Limit False

[OUTPUT]
    Name            es
    Host            192.168.5.20
    Port            9200
    Logstash_Format On
    Retry_Limit     5
    Fallback        http # You can rename the option as you like.

patrick-stephens commented 1 year ago

Right so in fact do chuck away the logs but just send them to another output as well (what happens if that fails?). It probably fits nicer and people can do it right now in fact - I've done similar with in-cluster Loki clusters receiving all data then external receivers (e.g. Grafana Cloud) so data is always locally available if networking is knackered.
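A minimal sketch of that "send to another output as well" setup as it can be configured today, assuming a catch-all Match and the file output as the local copy (the Match pattern and Path are placeholders). Note that this duplicates every record rather than only the ones the first output rejects, so it is not the conditional fallback being requested:

```text
[OUTPUT]
    Name            es
    Match           *
    Host            192.168.5.20
    Port            9200
    Logstash_Format On
    Retry_Limit     5

# Second output matching the same records, so a copy is kept
# even when the es output gives up after its retries.
[OUTPUT]
    Name            file
    Match           *
    Path            /var/log/fluent-bit-copy
```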

In this case it is formalising a fallback aspect, i.e. if X fails try Y. You probably want to use aliases rather than just a type of output (e.g. you may have multiple HTTP outputs above). Also that output can presumably still match data as well and have a fallback too (although a death spiral of fallbacks having a cycle would be fun!).

Just clarifying the request really, not suggesting anything. :+1:

pierluigilenoci commented 1 year ago

@patrick-stephens of course it's just a suggestion and the example is just to show what I mean.

And it is not possible to think of all the possible mental contortions that any whimsical programmer could have. Let's say that if 99% of the uses are covered, we're on course. 😝

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days. Maintainers can add the exempt-stale label.

pierluigilenoci commented 1 year ago

/remove-lifecycle stale

pierluigilenoci commented 8 months ago

@patrick-stephens, any news?

patrick-stephens commented 8 months ago

I am not working on this, I'm afraid, @pierluigilenoci. It may be worth highlighting it in the community channels to encourage some contribution.