qingling128 opened this issue 6 years ago (status: Open)
Sorry for the delayed response.
What does 'bad' mean? I assume bad entries can't be flushed to the destination, right?
If so, you can use emit_error_event to send bad entries to the @ERROR label: https://docs.fluentd.org/v1.0/articles/api-plugin-helper-event_emitter
For bad chunks, we will add a backup feature to avoid useless retries: https://github.com/fluent/fluentd/issues/1856
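As an illustration, events sent through emit_error_event can be caught with a label section in the config; the output plugin and file path below are placeholders, not part of the original discussion:

```
<label @ERROR>
  # Events emitted via router.emit_error_event land here.
  <match **>
    @type file                  # any output works; file is just an example
    path /var/log/fluent/error  # placeholder path
  </match>
</label>
```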
The bad (can be flushed to the destination at a later time) entries here are not due to client error, but to temporary server-side errors. We would like to retry them later instead of dropping them.
It seems we can use router.emit to retry them. We just need to figure out a way to back off for a period of time before re-emitting them, since router.emit does not seem to have a retry_wait parameter. Any idea if that's possible with the existing support? Otherwise we'll come up with something on our own.
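For what it's worth, the backoff itself is easy to approximate outside of Fluentd. A minimal sketch, assuming a hypothetical send_entry predicate standing in for the real delivery call (none of these names are Fluentd API):

```ruby
# Sketch only: split a flushed chunk into successes and failures, and
# compute an exponential backoff delay (mirroring retry_wait semantics)
# before re-emitting the failures with router.emit.
def partition_entries(entries, &send_entry)
  entries.partition(&send_entry)  # => [delivered, failed]
end

# Delay before the Nth retry: retry_wait * 2**attempt, capped at max_wait.
def backoff_delay(retry_wait, attempt, max_wait: 300)
  [retry_wait * (2**attempt), max_wait].min
end
```

The failed half could then be re-emitted with router.emit after sleeping (or scheduling a timer) for backoff_delay seconds.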
The bad (can be flushed to the destination at a later time) entries here are not due to client error, but to temporary server-side errors.
Currently, we have a default retry mechanism for that. Hmm... I didn't understand the situation correctly. There are several failure situations: for bad entries you can use router.emit_error_event to rescue them and send the normal events to the destination. I assume your problem is the temporary-failure case, but your pipeline contains good and bad entries in one buffer chunk, right?
@repeatedly - Yes! Our problem is the temporary-failure case, but with ready-to-go and failed-but-need-to-be-retried entries in the same buffer chunk. By throwing an exception, we can retry everything, but instead we want to retry only the failed-but-need-to-be-retried ones.
I see... I understand the situation now.
Currently, the output plugin doesn't support such a retry/rollback feature.
How do you check whether an entry is ready-to-go or failed-but-need-to-be-retried? Do you try the write first, and the response contains the list of error events?
@repeatedly - We have a local binary running that provides metadata for log entries based on a label. Our Fluentd output plugin makes one request per log entry if the metadata for that label is not cached locally. Depending on whether the metadata is available, we might choose to retry a subset of log entries.
The reason behind this: We are using one Fluentd instance to send logs for multiple containers. There is a short time gap (under one minute) between when a container starts running and when we have the metadata available for that container. For any logs the container emitted during this gap, we want to buffer the logs until later. But we don't want to buffer everything in the same chunk as logs from other "mature" containers that already have metadata available.
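A minimal sketch of that per-label cache (names here are hypothetical; fetch_remote stands in for the RPC to the local metadata binary). Note that ||= deliberately does not cache a nil result, so a "young" container is looked up again on its next log entry:

```ruby
# Sketch: cache metadata per label so the plugin makes at most one RPC
# per label instead of one per log entry. `fetch_remote` is a hypothetical
# stand-in for the call to the local metadata binary.
class MetadataCache
  def initialize(&fetch_remote)
    @fetch_remote = fetch_remote
    @cache = {}
  end

  # Returns cached metadata, or nil while the container is still "young";
  # nil results are not cached, so the next entry triggers another lookup.
  def lookup(label)
    @cache[label] ||= @fetch_remote.call(label)
  end
end
```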
@portante and his co-workers reported a similar issue. Below is his explanation:
For example, there might be 100 records in a chunk, and 13 of them are bad. We want to have the ES plugin, or any other output plugin, send off the 87 that are good, but return the 13 that are bad.
ref: https://github.com/uken/fluent-plugin-elasticsearch/pull/398#issuecomment-380301130
Originally reported in Red Hat's Bugzilla.
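The contract being asked for could look roughly like this. This is an illustration only, not an existing Fluentd API; sender is a hypothetical delivery callable returning true/false per record, much as a bulk API reports per-item errors:

```ruby
# Sketch: attempt every record in the chunk, then hand back only the
# failures, so the caller can re-buffer 13 records instead of retrying
# all 100. `sender` stands in for whatever actually ships records.
def flush_partial(records, sender)
  delivered, failed = records.partition { |r| sender.call(r) }
  failed  # caller schedules only these for a later retry
end
```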
@qingling128 I see. Does your record have a specific field for your metadata?
In v1, Fluentd decides the target chunk by chunk keys: <buffer CHUNK_KEYS>.
So if one or more fields refer to the corresponding metadata, good and bad chunks are separated before the flush.
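For example, chunking by such a field would look like this in v1 config (the output plugin name is a placeholder):

```
<match app.**>
  @type my_output              # placeholder output plugin
  <buffer local_resource_id>   # one chunk per local_resource_id value
    flush_interval 10s
  </buffer>
</match>
```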
@cosmo0920 I saw the patch. Why doesn't the @ERROR label fit this case? A DLQ seems like a specific version of the @ERROR label.
According to Red Hat's Bugzilla, it seems there is also a performance issue: in the ES plugin case, UTF-8 encoding conversion is too heavy to handle before handing events to fluent-plugin-elasticsearch.
@repeatedly @cosmo0920 As I stated in the original issue, there was not necessarily a performance issue but a problem with the way chunks were consumed by fluent-plugin-elasticsearch. If any part of deserializing and processing the records in the chunk failed, an exception would bubble up and the chunk would be retried indefinitely; there was no way to get past bad processing. Do we have an example of using the router with an error label that would negate the need for #1948, so I could close it?
@repeatedly - Yes, our records do have a specific field, local_resource_id, that we rely on to retrieve metadata.
The problem with deciding the chunk by local_resource_id is that it would force us to flush log entries (this happens after retrieving the metadata locally) to the Logging API (via HTTP/RPC requests) more slowly, because it results in separate requests for each local_resource_id.
In a container world, this means that if a Fluentd instance is handling logs from 50 containers, the number of requests we make to the Logging API becomes 50 times the original request count. Previously, we could bundle log entries from 50 containers into one Logging API request; now we would have to make 50 requests (from 50 flushes). This would significantly slow us down.
Hi, any news about this feature request?
Hi,
Any chance there is a way to do partial retries in Fluentd? Right now we maintain an output plugin https://github.com/GoogleCloudPlatform/fluent-plugin-google-cloud. For each buffer chunk the plugin processes, we might have some "good" entries and some "bad" entries. We'd like to enable users to flush the "good" entries right away but retry the "bad" entries later.
In the current implementation, it seems that an output plugin can only throw an exception to indicate that the chunk was not processed and should be retried. The problem is that we either block the "good" entries as well, or we retry the "good" entries again and generate duplicate logs.
Is there a known way to do partial retries correctly with the built-in support?
Alternatively we are thinking about writing a new filter plugin to handle this case or dig into the core code and contribute / maintain a fluentd fork with partial retries.
We'd like a sanity check before we go too far in the wrong direction.
Thanks, Ling