garethhumphriesgkc opened this issue 9 months ago
Note also that under no circumstances is recovery of the failed node automatic: once the write permissions have been removed, Logstash shuts down either itself or that pipeline and requires manual intervention to come back up.
Thank you for the thoughtful and well-reduced report. I agree that the behavior is surprising, and that we should work to find a way to eliminate this failure-case.
While I don't have an immediate solution, I can add some commentary that may help us narrow down on what the behavior should be.
There are a couple of things at play here:
The pipeline stopping isn't back-pressure per se (such as TCP receive-window reduction); rather, when the pipeline is shut down its inputs are shut down with it, and most inputs that accept inbound connections will simply hang up, leaving the connecting client to handle being hung up on by retrying or rerouting.
This is interesting in the pipeline-to-pipeline case because a downstream pipeline shutting down as a result of a plugin crash does not propagate to the upstream pipeline, and therefore does not shut down the input plugin in that upstream pipeline. The upstream pipeline's inputs continue to run and receive events without hanging up, but its outputs are now blocked waiting for the crashed pipeline to come back (it won't). Without a PQ the inputs become blocked relatively quickly, which translates to the TCP receive window filling and TCP back-pressure that may or may not be handled well by the connected client.
In a way, not propagating the crash is by-design and part of what makes several of the pipeline-to-pipeline design patterns work to do things like reload transformation or output pipeline definitions without restarting the inputs.
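To make that shape concrete, here is a minimal sketch of the kind of two-pipeline layout being discussed - the pipeline ids, port, and path are made up for illustration, not taken from the attached PoC:

# pipelines.yml (sketch)
- pipeline.id: intake
  config.string: |
    input  { beats { port => 5044 } }
    output { pipeline { send_to => ["outbound"] } }
- pipeline.id: outbound
  config.string: |
    input  { pipeline { address => "outbound" } }
    output { file { path => "/tmp/raw/events.log" } }

If the file output in outbound crashes, only the outbound pipeline stops; intake keeps running, its beats listener stays open, and its pipeline output blocks waiting for the outbound address to come back. That same isolation is what lets you edit and reload outbound without ever restarting the intake listener.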
Currently, your GoodThing behavior relies on (a) the File output hard-crashing when it encounters a permissions issue, and (b) that crash cascading into the shutdown of the input plugin that is receiving data. If the File output doesn't crash, or if the crashing pipeline isn't the same pipeline the input is running in, you get your BadThing behavior instead (a cascade of back-pressure that fills the input's TCP receive window, with the connected client handling that propagated back-pressure by blocking).
For historical reasons, pipelines that crash stay stopped until a human intervenes. The logic (at least at the time) was that a crash is by definition an extraordinary circumstance that was not planned for, so human intervention is needed anyway. Additionally, automatically restarting a pipeline that is in a repeated-crashing state is a VeryDifficult problem to solve in a generalized form and can cause data loss with some inputs, since events can be routed to a non-working pipeline that cannot process them; with the default memory queue, or with inputs that have no application-level acknowledgement scheme, this can be a bad problem.
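For reference only (emphatically not a fix for the failover problem in this issue): the queue type can be set per pipeline, which only changes how long the upstream pipeline can absorb events before back-pressure reaches its inputs. A sketch, with a made-up pipeline id, config path, and size:

# pipelines.yml (sketch) - give the upstream pipeline a disk-backed queue
- pipeline.id: intake
  queue.type: persisted
  queue.max_bytes: 1gb
  path.config: "/usr/share/logstash/pipeline/intake.conf"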
This is certainly made more complex with the introduction and increased adoption of pipeline-to-pipeline. Again, I don't have answers, but I hope this commentary helps make sense of what is going on so that we can scheme toward a solution.
Hi,
Thanks for your thoughts, I think we're on the same wavelength. It's certainly not a one-line fix, and arguably not even a bug per se, but I felt it was worth raising.
Note that the chmod is there to simulate any kind of backend failure - running out of disk space is what prompted the investigation and the repro - and it seems to be the pipeline stopping, rather than the specific failure, that causes the problem. I don't know of a way to gracefully stop a single pipeline, but I suspect doing so would have the same effect.
In the BadThing case I've not yet seen failover happen at all as a result of the TCP window filling: I get a warning in the log for every event, but nothing on stdout (or in the file, obviously), ad infinitum. I have the pipeline sized as small as possible, so there shouldn't be any queueing of note within Logstash:
pipeline.batch.size: 1
pipeline.workers: 1
I re-ran the repro with an event being generated every second, ran the chmod after 10 events, and after about 6 hours I got sick of waiting and manually killed the receiver with docker kill. As soon as it was gone, the other receiver picked up at event 21370 - so every intermediate event was lost and there was no sign of the initial receiver ever deciding to stop accepting. I crudely calculated that the TCP window should have filled after around 15 minutes at most.
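For what it's worth, a back-of-the-envelope version of that calculation, where both the event size and the receive-buffer size are guesses rather than measurements:

~300 bytes/event * 1 event/second ≈ 300 B/s into the socket
receive buffer on the order of 128 KB to 512 KB
128 KB / 300 B ≈ 7 minutes to fill; 512 KB / 300 B ≈ 30 minutes

Either way, six hours is well past the point where the window should have been exhausted.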
I also found during that test that two seconds' worth of events (2 events at 1 per second) made it to stdout but not to disk before the failover happened - so it seems that any failing sub-pipeline will lose data. Granted, it's hard (impossible?) to avoid this cleanly.
There's a lot to think about here, and the actual solution may be quite involved. For my use-case I think a band-aid fix would suffice:
Logstash information: docker: docker.elastic.co/logstash/logstash:8.12.1
JVM (e.g. java -version): bundled
OS version (uname -a if on a Unix-like system): Ubuntu 24.04
Description of the problem including expected versus actual behaviour:
I have found what seems to be an issue with back-pressure propagation in lumberjack/beats plugins.
I have written and attached a simple PoC with docker compose, consisting of a data generator and two data receivers. The generator creates an event every 5 seconds and sends it to one of the receivers via the lumberjack output (with both receivers configured in hosts: []). Each receiver writes any event it receives to disk and prints it on stdout.
When the receivers have a single output stanza with multiple outputs configured, as soon as one output is unable to process a message successfully the entire pipeline blocks, which applies back-pressure to the input. This back-pressure propagates across the network to the generator (via a simple disconnect, I expect), which chooses a new destination from hosts: []. No data is lost and failover is instantaneous. This is a GoodThing™.
However, when the receivers are configured with the same outputs as pipelines rather than inline in the output stanza, this doesn't happen. The pipeline stalls but doesn't disconnect upstream, so failover never happens. This is a BadThing™.
The relevant difference can be boiled down to this:
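(The attached config isn't reproduced here; a sketch of what the inline-output variant of the receiver presumably looks like, with the port and path made up and TLS settings omitted for brevity:)

# receiver, both outputs inline (sketch)
input {
  beats { port => 5044 }
}
output {
  file   { path => "/tmp/raw/events.log" }
  stdout { codec => rubydebug }
}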
vs this:
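(and, for comparison, a sketch of the pipeline-to-pipeline variant - the pipeline ids and virtual addresses are likewise made up:)

# receiver, outputs as separate pipelines (sketch) - pipelines.yml
- pipeline.id: receive
  config.string: |
    input  { beats { port => 5044 } }
    output { pipeline { send_to => ["to_file", "to_stdout"] } }
- pipeline.id: to_file
  config.string: |
    input  { pipeline { address => "to_file" } }
    output { file { path => "/tmp/raw/events.log" } }
- pipeline.id: to_stdout
  config.string: |
    input  { pipeline { address => "to_stdout" } }
    output { stdout { codec => rubydebug } }

In this layout, a crash of the file output stops only the to_file pipeline; the receive pipeline's beats input stays connected, which is the stall described above.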
Steps to reproduce:
I have attached a docker-compose stack that demonstrates the issue:
Edit docker-compose.yml to switch between the two scenarios.
When it starts, you will see the generator start sending data to one receiver. You can then break the filesystem output on that receiver by running
docker exec output-pipelines-poc-recv1-1 chmod 444 /tmp/raw/
What happens next depends on how the output is configured:
Both outputs directly in the output stanza: the receiver's pipeline blocks, the back-pressure surfaces as a disconnect, and the generator fails over to the other receiver with no data loss (the GoodThing case described above).
Both outputs via output pipelines: only the file pipeline stops, the beats input stays connected, the receiver stalls, and failover never happens (the BadThing case described above).