fluent / fluentd

Fluentd: Unified Logging Layer (project under CNCF)
https://www.fluentd.org
Apache License 2.0

Add functionality to discard only failed chunks #2280

Open ialidzhikov opened 5 years ago

ialidzhikov commented 5 years ago

From the fluentd docs:

Fluentd will abort the attempt to transfer the failing chunks on the following conditions:

- The number of retries exceeds retry_max_times (default: none)
- The seconds elapsed since the first retry exceeds retry_timeout (default: 72h)

In these events, all chunks in the queue are discarded. If you want to avoid this, you can enable retry_forever to make Fluentd retry indefinitely.

So let's say you have a queue with 1000 chunks, and you use fluent-plugin-elasticsearch to dispatch the events to different elasticsearch instances, for example:

host elasticsearch-logging.${record['kubernetes']['namespace_name']}.svc

In this case, if 1 chunk fails (it exceeds retry_max_times because its elasticsearch instance was down), all 1000 chunks in the queue are discarded, even though the other 999 would potentially succeed because they are sent dynamically to different elasticsearch hosts. So isn't it reasonable to add support for discarding only the failing chunks from the queue?
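To make this concrete, here is a minimal sketch of such a setup (the match pattern and retry values are illustrative; the full configuration we use is posted further down in this thread):

<match kubernetes.**>
  @type elasticsearch_dynamic
  # Each chunk is routed to a per-namespace elasticsearch instance.
  host elasticsearch-logging.${record['kubernetes']['namespace_name']}.svc
  <buffer tag, time>
    # Once ONE chunk exhausts its retries (because its instance is down),
    # the WHOLE queue is discarded, including chunks for healthy instances.
    retry_max_times 4
    # retry_forever true would be today's only way to avoid discarding the queue
  </buffer>
</match>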

repeatedly commented 5 years ago

Hmm... currently we have the option to drop the oldest chunk for the buffer overflow case: https://docs.fluentd.org/v1.0/articles/output-plugin-overview#control-flushing So adding the same feature for the retry limit case seems reasonable for unstable destinations.
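For reference, that existing behaviour is a buffer option; a minimal sketch (the size value is illustrative):

<buffer>
  total_limit_size 6GB
  # When the buffer is full, drop the oldest queued chunk instead of raising
  # an error (the default overflow_action is throw_exception).
  overflow_action drop_oldest_chunk
</buffer>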

ialidzhikov commented 5 years ago

Thank you for the reply, @repeatedly. The issue has also been accepted on the elasticsearch plugin side. Instead of preparing a fix that is specific to the elasticsearch dynamic plugin, I think the right place is the buffer logic. Would it be possible to add a new buffer parameter retry_exceeded_action with the values drop_all_chunks and drop_failed_chunks:

<buffer>
   retry_exceeded_action (drop_all_chunks|drop_failed_chunks)
</buffer>

I think users like us, with a dynamic host configuration, would prefer the drop_failed_chunks option because the host of each chunk is different. It would also be fine from a backward-compatibility point of view: the default would be drop_all_chunks. And it should not be too difficult on the development side, right? Only the failed chunk would be purged after the retry limit is exceeded, instead of clearing the whole queue. What do you think about such a new parameter?
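For clarity, a fuller sketch of how the proposed parameter could sit next to the existing retry options (retry_exceeded_action is the new parameter proposed here, not an existing fluentd option; drop_all_chunks stays the backward-compatible default):

<buffer>
  retry_type periodic
  retry_wait 75
  retry_max_times 4
  # Proposed: what to do once retry_max_times / retry_timeout is exceeded.
  #   drop_all_chunks    - current behaviour, clear the whole queue (default)
  #   drop_failed_chunks - purge only the chunk whose retries were exhausted
  retry_exceeded_action drop_failed_chunks
</buffer>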

ialidzhikov commented 5 years ago

@repeatedly and @cosmo0920, can I start on a draft PR implementing the proposal above?

ialidzhikov commented 5 years ago

I had a look at the code. Currently, the retry state is per output plugin. What do you think about making the retry configurable: per output plugin or per chunk? Do you have another idea for solving the problem? We currently use the elasticsearch_dynamic plugin to send logs to n elasticsearch instances, and m of them can be unavailable (scaled down) by design. Our config currently looks like this:

<match **>
    @id elasticsearch_dynamic
    @type elasticsearch_dynamic
    host elasticsearch-logging.${record['kubernetes']['namespace_name']}.svc
    @include /etc/fluent/config.d/es_plugin.options
    <buffer tag, time>
      @type file
      # Limit the number of queued chunks
      queued_chunks_limit_size 4096
      # The number of threads of output plugins, which is used to write chunks in parallel
      flush_thread_count 32
      # the size limitation of this buffer plugin instance
      total_limit_size 6GB
      path /foo/bar

      chunk_limit_size 50MB
      chunk_full_threshold 0.9
      timekey 300

      flush_mode interval
      flush_interval 60s
      timekey_wait 0
      flush_at_shutdown true
      flush_thread_interval 30.0
      overflow_action drop_oldest_chunk

      retry_type periodic
      retry_wait 75
      retry_randomize false
      retry_max_times 4
    </buffer>
    # Avoid backing up chunks that fail with unrecoverable errors
    <secondary>
      @type null
    </secondary>
</match>

We don't want the whole queue to be cleared because of a few failed chunks. We also don't want to drop the oldest chunk, because it may have been retried only once and could be dropped due to the failures of other chunks.

repeatedly commented 5 years ago

Sorry for the delayed response. I missed your comment.

What do you think about making the retry configurable: per output plugin or per chunk?

What does "per chunk" means? Do you want to change the behaviour like kafka plugin's ignore_exceptions?

https://github.com/fluent/fluent-plugin-kafka/blob/9360ac2cd5e6f6312c2831a1cfb1d18587ef824d/lib/fluent/plugin/out_kafka2.rb#L45
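For context, that option takes a list of exception class names; a rough sketch of how it would be used (other kafka options omitted, and the exception class name is only an illustrative assumption):

<match app.**>
  @type kafka2
  brokers broker1:9092
  # If a flush fails with one of these exceptions, the error is ignored and
  # the chunk is discarded instead of being retried.
  ignore_exceptions ["Kafka::DeliveryFailed"]
</match>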

ialidzhikov commented 5 years ago

Thank you for your reply, @repeatedly. The approach with ignore_exceptions is just a way to say, "hey, I failed to send this chunk, so I will discard it". uken/fluent-plugin-elasticsearch#562 has been proposed for this.

I would also add that I would like to have a retry mechanism per host. Instead of dropping a failed chunk immediately, I would like to apply a retry policy, and only when that policy is exceeded should the chunks for the given host be dropped. Currently the whole output plugin is designed to work with a single static host, and there is only one retry state per output plugin.

ignore_exceptions is a good workaround, but the per-host retry feature is still missing.

WDYT? Maybe I'm missing something?

repeatedly commented 5 years ago

ignore_exceptions is a good workaround, but the per-host retry feature is still missing.

The hard point is that "host" depends on the plugin implementation; e.g. out_file doesn't have a host. So if we implement this feature, a tag- or record-based condition would be better.
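Purely as a hypothetical illustration (retry_exceeded_action does not exist, and the namespace chunk key assumes the routing field is available as a top-level record field), a record-based condition could mean chunking by the routing field and dropping only the exhausted chunks:

<buffer tag, time, namespace>
  retry_max_times 4
  # Hypothetical: drop only the chunk that exhausted its retries. Because
  # "namespace" is a chunk key, that chunk only holds records for the one
  # failing destination, so healthy destinations keep their data.
  retry_exceeded_action drop_failed_chunks
</buffer>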