elastic / logstash

Logstash - transport and process your logs, events, or other data
https://www.elastic.co/products/logstash

Sending DLQ messages to disk not optimal in all cases #8242

Open hartfordfive opened 7 years ago

hartfordfive commented 7 years ago

Logstash: 5.6

Currently, it seems the only option for the dead letter queue is to write its events to disk. This works well when Logstash runs on VMs with persistent drives that aren't part of auto-scalable groups, but it doesn't work as well when, for example, Logstash runs in auto-scaled Kubernetes containers without persistent volume claims. In that scenario you can end up losing events from the DLQ, and you also can't distribute the task of processing/indexing DLQ events evenly across other Logstash instances.
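For reference, the disk-based behavior described above requires `dead_letter_queue.enable: true` in `logstash.yml`, and the entries are read back on the same host with the `dead_letter_queue` input plugin. A minimal sketch (the path and pipeline ID here are illustrative assumptions):

```
# Pipeline that reprocesses DLQ entries written by a pipeline named "main".
input {
  dead_letter_queue {
    path        => "/var/lib/logstash/dlq"  # must match path.dead_letter_queue
    pipeline_id => "main"
    commit_offsets => true                  # remember position across restarts
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
  }
}
```

Note that this pipeline can only run on the instance that wrote the DLQ files, which is exactly the coupling the issue is about.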

For example, if a single pipeline indexes a large volume of events and also generates a large number of mapping conflicts, the load on that instance increases considerably, and the task of processing its DLQ can't be shared. Considering this, it would be great if an option could be added to publish DLQ messages directly to a Kafka topic. Many Logstash instances could then subscribe to that topic via the Kafka input and share the task of processing/indexing the messages evenly.
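As an interim approximation of this proposal (no Kafka-backed DLQ exists today), each instance could run a small pipeline that drains its local disk DLQ into a shared Kafka topic, which any number of instances then consume. A sketch under that assumption; the hosts, topic name, group ID, and paths below are all illustrative:

```
# Pipeline A (runs on every instance): forward local DLQ entries to Kafka.
input {
  dead_letter_queue {
    path        => "/var/lib/logstash/dlq"
    pipeline_id => "main"
  }
}
output {
  kafka {
    bootstrap_servers => "kafka:9092"
    topic_id          => "logstash-dlq"
    codec             => json
  }
}

# Pipeline B (runs on any instance in the pool): reprocess the shared topic.
input {
  kafka {
    bootstrap_servers => "kafka:9092"
    topics            => ["logstash-dlq"]
    group_id          => "dlq-processors"  # one consumer group => work is shared
    codec             => json
  }
}
output {
  elasticsearch {
    hosts => ["localhost:9200"]
  }
}
```

Because all consumers share one Kafka consumer group, partitions of the DLQ topic are balanced across instances, which achieves the even distribution described above, though it still relies on the local disk DLQ as the first hop.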

guyboertje commented 7 years ago

@hartfordfive The concept that you propose was part of the DLQ Design discussions we had.

The challenge with the DLQ revolves around the idea of an event reaching its intended destination: if that can't happen, for any of various reasons, the minimal response must be to write the event via a DLQ implementation that is as error-proof as possible. We delivered this implementation first. We did intend that other implementations might be tried before the file-based last resort, but network-IO-based destinations are unreliable too, hence the fallback.

Since the file-based DLQ is the fallback, if it is itself under threat of loss then there can be no "no data loss" guarantee at all.

Feel free to comment on the DLQ topics issue #8436, as I think it is material to this issue.