Open sdwerwed opened 7 months ago
Looks like fluentd is not retrying on error 400 and is dropping data. We want to avoid losing data due to a temporary misconfiguration on the OpenSearch side or because some limit has been reached.
This is why Fluentd provides a secondary mechanism to prevent data loss. Why not try it? https://docs.fluentd.org/output/secondary_file
These backup chunks can be restored with fluent-logger-ruby: https://groups.google.com/g/fluentd/c/6Pn4XDOPxoU/m/CiYFkJXXfAEJ?pli=1
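For reference, a minimal sketch of how that <secondary> suggestion could be wired into the OpenSearch output (the tag pattern, host, index name, and paths below are placeholders, not taken from this issue):

```
<match app.**>
  @type opensearch
  host opensearch.example.com
  port 9200
  index_name app-logs

  <buffer>
    @type file
    path /var/log/fluent/buffer/opensearch
    total_limit_size 60GB
  </buffer>

  # Chunks the primary output permanently gives up on are handed to the
  # secondary and dumped to disk instead of being lost.
  <secondary>
    @type secondary_file
    directory /var/log/fluent/failed
    basename dump.${chunk_id}
  </secondary>
</match>
```

The dumped chunks can then be re-ingested later, e.g. with the fluent-logger-ruby approach from the thread linked above.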
Thanks for this, I can check it and maybe implement it as a workaround for now.
But shouldn't the plugin keep retrying? That is why I set a 60 GB buffer: if pushes fail, the data should accumulate in the buffer until OpenSearch is fixed.
No, it shouldn't, because there is no recovery mechanism for that error. A 400 error is often very hard to resolve by simply resending. Perhaps specifying retry_tag might fit your case: https://github.com/fluent/fluent-plugin-opensearch?tab=readme-ov-file#retry_tag
This is because Fluentd's retry mechanism is tightly coupled to the associated conditions. That is why we chose to give up resending when a 400 status occurs.
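Roughly what that retry_tag routing could look like (tags, hosts, and index names here are illustrative placeholders):

```
<match app.**>
  @type opensearch
  host opensearch.example.com
  port 9200
  index_name app-logs
  # Events that need to be retried are re-emitted under this tag
  # instead of being retried in place with their original tag.
  retry_tag retry_os
</match>

<match retry_os>
  # Retried events land here and can be handled separately,
  # e.g. sent to a different index or backed up to files.
  @type opensearch
  host opensearch.example.com
  port 9200
  index_name app-logs-retry
</match>
```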
Are you suggesting a workflow that starts with an input, passes through a filter into the OpenSearch output plugin, and uses the secondary file output plugin (leveraging the retry tag for matching), followed by manually running the script outlined here (https://groups.google.com/g/fluentd/c/6Pn4XDOPxoU/m/CiYFkJXXfAEJ?pli=1)?
How do we ensure the backup file doesn't grow excessively large without implementing some form of rotation?
I appreciate this as a temporary solution, thank you.
It would be ideal to have a more comprehensive, automated solution supported by Fluentd and its plugins, eliminating the need for manual intervention across 100 AKS clusters and for additional developer resources. I understand this is a complex issue. An optimal solution would allow configuration through flags such as enable_retry_on_400 with a customizable retry duration, for example a maximum of 10 days or even unlimited.
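On the file-growth concern: with out_secondary_file, one option is to keep append disabled and include ${chunk_id} in the basename, so each failed chunk is written to its own file (bounded by the chunk size limits) rather than appending to a single ever-growing file; cleaning up old dump files would still need to happen outside Fluentd, e.g. via a cron job. A sketch, with a placeholder directory:

```
<secondary>
  @type secondary_file
  directory /var/log/fluent/failed
  basename dump.${chunk_id}   # one file per failed chunk, not one growing file
  append false                # default: do not keep appending to the same file
</secondary>
```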
There is no automated solution for this case. There are quite a few cases to consider when deciding how to handle the retry mechanism and re-emit into another data pipeline. So it's impossible to implement error-free, write-once delivery or a complete retry solution when sending over the network stack (TCP/IP).
Steps to replicate
We had a case where the maximum number of open shards had been reached in OpenSearch, so fluentd was getting an error.
Error:
The error is OK and expected, but we did not expect to lose the data. Once we increased the maximum number of open shards in OpenSearch, the old logs were never pushed. It looks like fluentd is not retrying on error 400 and is dropping data. We want to avoid losing data due to a temporary misconfiguration on the OpenSearch side or because some limit has been reached.
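(For context, the open-shard limit hit here is typically the cluster.max_shards_per_node setting, which can be raised through the cluster settings API; the value below is only illustrative:)

```
PUT _cluster/settings
{
  "persistent": {
    "cluster.max_shards_per_node": 2000
  }
}
```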
Configuration
Expected Behavior or What you need to ask
We expected the data to be kept in the buffer and retried until it succeeded, without losing anything. How can we achieve that when getting similar errors?
Using Fluentd and OpenSearch plugin versions
Ubuntu
Kubernetes
Fluentd: fluentd 1.16.2
OpenSearch plugin: fluent-plugin-opensearch (1.1.4), opensearch-ruby (3.0.1)
OpenSearch: v2.10.0