fluent / fluent-plugin-opensearch

OpenSearch Plugin for Fluentd
Apache License 2.0

The plugin does not retry on specific errors and drops the data on error 400 #134

Open sdwerwed opened 7 months ago

sdwerwed commented 7 months ago


Steps to replicate

We had a case where the maximum number of open shards had been reached in OpenSearch, so Fluentd was getting an error.

Error:

[warn]: #0 send an error event to @ERROR: error_class=Fluent::Plugin::OpenSearchErrorHandler::OpenSearchError error="400 - Rejected by OpenSearch [error type]: illegal_argument_exception [reason]: 'Validation Failed: 1: this action would add [2] total shards, but this cluster currently has [2999]/[3000] maximum shards open;'"

The error itself is expected, but we did not expect to lose the data. Once we increased the maximum number of open shards in OpenSearch, the old logs were never pushed. It looks like Fluentd is not retrying on error 400 and is dropping the data. We do not want to lose data because of a temporary misconfiguration in OpenSearch or because some limit has been reached.

Configuration

<match **>
  @log_level info
  @type opensearch
  host "#{ENV['OPENSEARCH_URL']}"
  port 443
  user "#{ENV['OPENSEARCH_USERNAME']}"
  password "#{ENV['OPENSEARCH_FLUENTD_PASSWORD']}"
  include_timestamp true
  scheme https
  ssl_verify false
  ssl_version TLSv1_2

  id_key _hash
  index_date_pattern "now/d"
  target_index_key target_index
  index_name xxxxxx
  templates {"fluentd-logs-template": "/opt/bitnami/fluentd/conf/template.conf"}
  reload_connections false
  reconnect_on_error true
  reload_on_failure true
  log_os_400_reason true
  bulk_message_request_threshold 20m
  tag_key fluentd
  request_timeout 15s

  <buffer>
    @type file
    path /opt/bitnami/fluentd/logs/buffers/
    flush_thread_count 2
    flush_interval 10s
    chunk_limit_size 160m
    total_limit_size 58g
  </buffer>
</match>

Expected Behavior or What you need to ask

We expected the data to stay in the buffer and be retried until the push succeeded, without losing anything. How can we achieve that when getting similar errors?

Using Fluentd and OpenSearch plugin versions

cosmo0920 commented 7 months ago

> It looks like Fluentd is not retrying on error 400 and is dropping the data. We do not want to lose data because of a temporary misconfiguration in OpenSearch or because some limit has been reached.

This is why Fluentd provides a secondary output mechanism to prevent data loss. Why not try it? https://docs.fluentd.org/output/secondary_file

These backup chunks can be restored with fluent-logger-ruby: https://groups.google.com/g/fluentd/c/6Pn4XDOPxoU/m/CiYFkJXXfAEJ?pli=1
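
For example, a <secondary> section with the secondary_file output could be nested inside the existing <match **> block. This is only a sketch following the secondary_file documentation; the backup directory below is an illustrative path, not something prescribed by the plugin:

<match **>
  @type opensearch
  # ... existing opensearch settings from the configuration above ...
  <secondary>
    @type secondary_file
    # illustrative backup directory; adjust to your own layout
    directory /opt/bitnami/fluentd/logs/error
    basename dump.${chunk_id}
  </secondary>
</match>

Chunks that the primary output ultimately fails to flush are then written to that directory instead of being discarded, and can be restored later with the fluent-logger-ruby approach linked above.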

sdwerwed commented 7 months ago

> It looks like Fluentd is not retrying on error 400 and is dropping the data. We do not want to lose data because of a temporary misconfiguration in OpenSearch or because some limit has been reached.

> This is why Fluentd provides a secondary output mechanism to prevent data loss. Why not try it? https://docs.fluentd.org/output/secondary_file

> These backup chunks can be restored with fluent-logger-ruby: https://groups.google.com/g/fluentd/c/6Pn4XDOPxoU/m/CiYFkJXXfAEJ?pli=1

Thanks for this; I can check it and maybe implement it as a workaround for now.

But shouldn't the plugin keep retrying? That is why I set a 60 GB buffer: in case of push issues, the data accumulates in the buffer until OpenSearch is fixed.

cosmo0920 commented 7 months ago

> It looks like Fluentd is not retrying on error 400 and is dropping the data. We do not want to lose data because of a temporary misconfiguration in OpenSearch or because some limit has been reached.

> This is why Fluentd provides a secondary output mechanism to prevent data loss. Why not try it? https://docs.fluentd.org/output/secondary_file These backup chunks can be restored with fluent-logger-ruby: https://groups.google.com/g/fluentd/c/6Pn4XDOPxoU/m/CiYFkJXXfAEJ?pli=1

> Thanks for this; I can check it and maybe implement it as a workaround for now.

> But shouldn't the plugin keep retrying? That is why I set a 60 GB buffer: in case of push issues, the data accumulates in the buffer until OpenSearch is fixed.

No, it shouldn't, because there is no recovery mechanism for that error. A 400 error is often hard to resolve simply by resending. Perhaps specifying retry_tag would fit your case: https://github.com/fluent/fluent-plugin-opensearch?tab=readme-ov-file#retry_tag

This is because Fluentd's retry mechanism is tightly coupled to the associated error conditions. That is why we chose to give up resending when a 400 status occurs.
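
A sketch of that retry_tag approach, following the README section linked above: records the plugin re-emits are raised to the Fluentd router under the configured tag, so a separate <match> can pick them up. The retry_os tag name and the file fallback below are illustrative choices, not defaults of the plugin:

<match **>
  @type opensearch
  # records re-emitted by the plugin are tagged with this instead of their original tag
  retry_tag retry_os
  # ... existing settings from the configuration above ...
</match>

# separate pipeline for the re-emitted records
# (this could also point at a second opensearch output with different settings)
<match retry_os>
  @type file
  path /opt/bitnami/fluentd/logs/retry
</match>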

sdwerwed commented 7 months ago

> It looks like Fluentd is not retrying on error 400 and is dropping the data. We do not want to lose data because of a temporary misconfiguration in OpenSearch or because some limit has been reached.

> This is why Fluentd provides a secondary output mechanism to prevent data loss. Why not try it? https://docs.fluentd.org/output/secondary_file These backup chunks can be restored with fluent-logger-ruby: https://groups.google.com/g/fluentd/c/6Pn4XDOPxoU/m/CiYFkJXXfAEJ?pli=1

> Thanks for this; I can check it and maybe implement it as a workaround for now. But shouldn't the plugin keep retrying? That is why I set a 60 GB buffer: in case of push issues, the data accumulates in the buffer until OpenSearch is fixed.

> No, it shouldn't, because there is no recovery mechanism for that error. A 400 error is often hard to resolve simply by resending. Perhaps specifying retry_tag would fit your case: https://github.com/fluent/fluent-plugin-opensearch?tab=readme-ov-file#retry_tag

> This is because Fluentd's retry mechanism is tightly coupled to the associated error conditions. That is why we chose to give up resending when a 400 status occurs.

Are you suggesting a workflow that starts with an input, goes through a filter into the OpenSearch output plugin, and uses the secondary file output plugin (leveraging retry_tag for matching), followed by manually running a script as outlined here (https://groups.google.com/g/fluentd/c/6Pn4XDOPxoU/m/CiYFkJXXfAEJ?pli=1)?

How do we ensure the file doesn't grow excessively large without implementing some form of rotation?

I appreciate this as a temporary solution, thank you.

It would be ideal to have a more comprehensive, automated solution supported by Fluentd and its plugins, eliminating the need for manual intervention across 100 AKS Clusters and avoiding the necessity for additional developer resources. I understand this is a complex issue. An optimal solution would allow for configurations through flags such as enable_retry_on_400 with customizable retry durations, for example, a maximum of 10 days or even unlimited.

cosmo0920 commented 7 months ago

There is no automated solution for this case. There are many different scenarios to consider for how to handle the retry mechanism and re-emit records into another data pipeline. So it is impossible to implement write-once delivery without errors, or a complete retry mechanism, when sending over the network stack (TCP/IP).