ialidzhikov opened this issue 5 years ago
Hmm... currently, we drop the first chunk in the buffer overflow case: https://docs.fluentd.org/v1.0/articles/output-plugin-overview#control-flushing So adding the same feature for the retry limit case seems reasonable for unstable destinations.
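(For reference, a minimal sketch of that existing overflow handling; the output type and sizes here are just placeholders:)

<match **>
  @type elasticsearch
  <buffer>
    @type memory
    total_limit_size 512MB
    # When the buffer is full, drop the oldest queued chunk instead of raising an error
    overflow_action drop_oldest_chunk
  </buffer>
</match>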
Thank you for the reply @repeatedly. The issue has also been accepted by the elasticsearch plugin maintainers. Instead of preparing a fix that is specific to the elasticsearch_dynamic plugin, I think the right place for it is the buffer logic.
Is it possible to add a new buffer parameter retry_exceeded_action with the values drop_all_chunks and drop_failed_chunks:

<buffer>
  retry_exceeded_action (drop_all_chunks|drop_failed_chunks)
</buffer>

I think users like us with a dynamic host configuration will prefer the drop_failed_chunks option, because the host of each chunk is different. It should also be fine from the backward-compatibility side - the default option would be drop_all_chunks. And it should not be too difficult from the development side, right - only the failed chunk would be purged after the retry limit is exceeded, instead of clearing the whole queue? What do you think about such a new parameter?
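To make the proposal concrete, a sketch of how it could sit next to the existing retry parameters (retry_exceeded_action and its values are hypothetical, not existing Fluentd options):

<buffer>
  retry_type periodic
  retry_wait 75
  retry_max_times 4
  # Hypothetical: purge only the chunk whose retries were exhausted, keep the rest of the queue
  retry_exceeded_action drop_failed_chunks
</buffer>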
@repeatedly and @cosmo0920 can I start to implement a draft PR with the proposal above?
I had a look at the code. Currently the retry state is per output plugin. What do you think about making the retry configurable - per output plugin or per chunk? Do you have an idea for a solution to the problem? We use the elasticsearch_dynamic plugin to send logs to n elasticsearch instances, and m of them can be unavailable (scaled down) by design. Currently our config looks like:
<match **>
  @id elasticsearch_dynamic
  @type elasticsearch_dynamic
  host elasticsearch-logging.${record['kubernetes']['namespace_name']}.svc
  @include /etc/fluent/config.d/es_plugin.options
  <buffer tag, time>
    @type file
    # Limit the number of queued chunks
    queued_chunks_limit_size 4096
    # The number of threads of output plugins, which is used to write chunks in parallel
    flush_thread_count 32
    # The size limitation of this buffer plugin instance
    total_limit_size 6GB
    path /foo/bar
    chunk_limit_size 50MB
    chunk_full_threshold 0.9
    timekey 300
    flush_mode interval
    flush_interval 60s
    timekey_wait 0
    flush_at_shutdown true
    flush_thread_interval 30.0
    overflow_action drop_oldest_chunk
    retry_type periodic
    retry_wait 75
    retry_randomize false
    retry_max_times 4
  </buffer>
  # Avoid backing up unrecoverable errors
  <secondary>
    @type null
  </secondary>
</match>
We don't want the whole queue to be cleared because of a few failed chunks. We also don't want to drop the oldest chunk, because it might have been retried only once and would be dropped only because other chunks keep failing.
Sorry for the delayed response. I missed your comment.
What do you think about having the retry configurable - per output plugin or per chunk?
What does "per chunk" mean? Do you want to change the behaviour to be like the kafka plugin's ignore_exceptions?
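(For context, a minimal sketch of that kafka-style option, assuming fluent-plugin-kafka's ignore_exceptions parameter; the broker address, topic, and exception class are just examples:)

<match **>
  @type kafka2
  brokers broker1:9092
  default_topic logs
  # Chunks whose flush raises one of these exceptions are discarded instead of retried
  ignore_exceptions ["Kafka::DeliveryFailed"]
  <format>
    @type json
  </format>
</match>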
Thank you for your reply @repeatedly. The approach with ignore_exceptions is just a way to say, "hey, I failed to send the chunk, so I will discard it". uken/fluent-plugin-elasticsearch#562 has been proposed for it.
I would also add that I would like to have a retry mechanism per host. Instead of dropping the failed chunks right away, I would like to apply a retry policy, and when this policy is exceeded, only the chunks for the given host should be dropped. Currently the whole output plugin is designed to work with a single static host, and there is only one retry state per output plugin.
ignore_exceptions is a good workaround but still the feature for retry per host is missing.
WDYT? Maybe I miss something?
ignore_exceptions is a good workaround but still the feature for retry per host is missing.
The hard point is that host depends on the plugin implementation, e.g. out_file doesn't have host. So if we implement this feature, a tag- or record-based condition is better.
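(As an illustration of a tag-based condition, a sketch of what a split routing could look like today: each destination gets its own <match> section, and therefore its own output instance, buffer, and retry state. Tags and hosts below are placeholders:)

<match logs.team-a.**>
  @type elasticsearch
  host elasticsearch-logging.team-a.svc
  <buffer>
    retry_max_times 4   # retries for team-a do not affect team-b's queue
  </buffer>
</match>

<match logs.team-b.**>
  @type elasticsearch
  host elasticsearch-logging.team-b.svc
  <buffer>
    retry_max_times 4
  </buffer>
</match>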
From fluentd docs:
So let's say you have a queue with 1000 chunks, for example, and you use fluent-plugin-elasticsearch to dispatch the events dynamically to different elasticsearch instances. In this case, if 1 chunk fails (exceeds retry_max_times because its elasticsearch instance was down), all 1000 chunks in the queue will be discarded (999 of them would potentially succeed because they are sent to different elasticsearch hosts). So isn't it reasonable to add support for discarding only the failing chunks from the queue?