Open pierluigilenoci opened 1 year ago
Hi @pierluigilenoci, to what kind of error are you referring? (4xx, 5xx, other) Do you have a sample? We won't retry 4xx, for example.
@lecaros the specific example is in the ticket link in the description, I report it here too for simplicity.
Basically, if a pod produces logs in this way:
{"name":"kube-slack","hostname":"kube-slack-5fc4b6c55c-chc42","pid":1,"level":30,"msg":"Slack message sent","time":"2019-02-04T16:39:55.107Z","v":0}
Note: this JSON log has a `time` field inside.
When Docker saves it under /var/log/containers, it produces this:
{"log":"{"name":"kube-slack","hostname":"kube-slack-7cf99d5dbd-ffpd7","pid":1,"level":30,"msg":"Slack message sent","time":"2019-02-05T13:24:26.193Z","v":0}\n","stream":"stdout","time":"2019-02-05T13:24:26.193357608Z"}
As you can see, this JSON is invalid because the key `time` appears twice.
For this reason, Elasticsearch rejects the record forwarded by fluent-bit.
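To make the failure mode concrete, here is a small Python sketch (not fluent-bit code, and with a shortened payload) of what happens when the nested JSON carried in the `log` field is decoded and combined with the outer record: both the container runtime and the application contribute a `time` key, so a record that keeps both ends up with a duplicate key that Elasticsearch will not index.

```python
import json

# Outer record as written by Docker's json-file driver; the application's
# own JSON line is carried as an escaped string in the "log" field.
outer = json.loads(
    '{"log": "{\\"msg\\": \\"Slack message sent\\", '
    '\\"time\\": \\"2019-02-05T13:24:26.193Z\\"}\\n",'
    '"stream": "stdout", "time": "2019-02-05T13:24:26.193357608Z"}'
)

inner = json.loads(outer["log"])  # the application's JSON payload

# Both levels define "time": a naive merge would carry the key twice.
clash = set(outer) & set(inner)
print(clash)  # the conflicting key(s)
```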
I'm not sure I follow; you can already specify a retry limit: https://docs.fluentbit.io/manual/administration/scheduling-and-retries. Set it to N (e.g. 1) and fluent-bit won't retry sending after N failures (the first failure, in that example). Or is there something more involved here?
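For instance, a minimal sketch of that setting (host and port are placeholders):

```
[OUTPUT]
    Name        es
    Match       *
    Host        192.168.5.20
    Port        9200
    Retry_Limit 1
```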
The fallback option is, I think, another request entirely, so we should split that out.
The problem is just this:
The request is basically to implement a fallback, to avoid losing logs that may be important but are too malformed to be ingested by ES.
So you want a middle case where, for a recoverable error, something is done to "fix" the record? I think that probably needs logic specific to ES/OpenSearch, as it sounds very plugin-specific.
@patrick-stephens I was actually thinking of simply configuring a second output to use as a "fallback" for the first.
For example (using the very example from the documentation page you sent):
[OUTPUT]
    Name        http
    Host        192.168.5.6
    Port        8080
    Retry_Limit False

[OUTPUT]
    Name            es
    Host            192.168.5.20
    Port            9200
    Logstash_Format On
    Retry_Limit     5
    Fallback        http   # You can rename the option as you like.
Right, so you do in fact chuck away the logs from the failing output, but also send them to another output (what happens if that one fails?). It probably fits nicely, and people can do it right now in fact: I've done something similar with in-cluster Loki clusters receiving all data, plus external receivers (e.g. Grafana Cloud), so data is always available locally if the networking is knackered.
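For reference, the "send to both" pattern described here is already expressible today: two [OUTPUT] sections with the same Match pattern each receive a full copy of the stream. A sketch, with placeholder hosts standing in for an in-cluster Loki and an external receiver:

```
[OUTPUT]
    Name  loki
    Match *
    Host  loki.logging.svc
    Port  3100

[OUTPUT]
    Name  http
    Match *
    Host  external-receiver.example.com
    Port  443
```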
In this case it is formalising a fallback aspect, i.e. if X fails, try Y. You probably want to reference outputs by alias rather than just by type (e.g. you may have multiple HTTP outputs, as above). Also, that fallback output can presumably still match data itself and have a fallback of its own (although a death spiral of fallbacks forming a cycle would be fun!).
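The cycle concern could be guarded against at config-load time: if each output's hypothetical Fallback option names another output's alias, the chains can be validated with a simple visited-set walk. A sketch (the `fallbacks` mapping and the alias names are invented for illustration, not fluent-bit internals):

```python
def find_fallback_cycle(fallbacks):
    """Return an alias involved in a fallback cycle, or None.

    fallbacks maps an output alias to the alias it falls back to
    (or None when it has no fallback).
    """
    for start in fallbacks:
        seen = set()
        node = start
        while node is not None:
            if node in seen:
                return node  # following fallbacks revisits this alias
            seen.add(node)
            node = fallbacks.get(node)
    return None

# A chain es -> http -> s3 is fine; making s3 fall back to es closes a loop.
assert find_fallback_cycle({"es": "http", "http": "s3", "s3": None}) is None
assert find_fallback_cycle({"es": "http", "http": "s3", "s3": "es"}) is not None
```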
Just clarifying the request really, not suggesting anything. :+1:
@patrick-stephens of course it's just a suggestion and the example is just to show what I mean.
And it's impossible to anticipate every mental contortion a whimsical programmer might come up with. Let's say that if 99% of the use cases are covered, we're on course. 😝
This issue is stale because it has been open 90 days with no activity. Remove the stale label or comment, or this will be closed in 5 days. Maintainers can add the exempt-stale label.
/remove-lifecycle stale
@patrick-stephens, any news?
I am not working on this, I'm afraid, @pierluigilenoci; it may be worth highlighting it in the community channels to encourage a contribution.
Is your feature request related to a problem? Please describe. When an error occurs while sending logs to Elasticsearch, there are two possible behaviors: the log is discarded, or fluent-bit keeps trying to send it indefinitely. If the log is malformed JSON, ES will never accept it, and the pod retries forever. (e.g. https://github.com/wongnai/kube-slack/issues/51 )
Describe the solution you'd like I would like to be able to configure the maximum number of send attempts before discarding, and possibly an alternative destination for logs that cannot be sent (e.g. S3, a log folder...).
Describe alternatives you've considered Currently we use the "fluentbit.io/exclude: true" annotation on the pod that generates malformed JSON to avoid this problem.
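For context, that workaround uses the fluent-bit Kubernetes filter's exclude annotation, set on the offending pod's metadata (the pod name below is illustrative):

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: kube-slack
  annotations:
    fluentbit.io/exclude: "true"
```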
Ref: https://github.com/fluent/fluent-bit/issues/1098