elastic / logstash

Logstash - transport and process your logs, events, or other data
https://www.elastic.co/products/logstash

Change shutdown behaviour when PQ is enabled. #8458

Open jakelandis opened 6 years ago

jakelandis commented 6 years ago

Logstash, by default, will not shut down while events are still in flight. This is to help protect against data loss.

However, when Logstash is used with the Persistent Queue (PQ), the PQ protects against data loss by only acknowledging a batch of events when that batch is closed (which can only happen after it reaches the outputs). Events remain in the PQ until acknowledged.

These two strategies against data loss (shutdown behavior and PQ) are a bit at odds with each other. It is not necessary to halt the shutdown process when the PQ is enabled and events are still in flight, since the PQ protects against data loss.

This behavior can have negative consequences, in particular when centralized configuration management is used and an output is stalled. For example, an ES host moves to a new IP and the output stalls with in-flight events, so you go into central config management to fix the IP address; however, because the shutdown behavior is protecting against data loss, Logstash cannot restart the pipeline. That would require an admin to log into every Logstash box and forcibly restart each instance, which should not be necessary when the PQ is enabled.

pipeline.unsafe_shutdown is nearly identical to what we want as the default behavior for the PQ; however, we should expose a configurable timeout that lets outputs finish before forcing the shutdown (to help avoid duplicates). Optionally, we could also check whether the in-flight event count is going down, in addition to the timeout. And the name pipeline.unsafe_shutdown is a bit ominous for an option we would simply recommend using with the PQ.

Note: setting pipeline.unsafe_shutdown as a startup option when the PQ is enabled should be a sufficient workaround; just note that there is a 15 second (hard-coded) delay to push healthy data out, otherwise duplicates may happen.
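A minimal logstash.yml sketch of that workaround (setting names are the current ones; exact defaults and the grace-period behaviour may vary by Logstash version):

```yaml
# Enable the persistent queue so unacknowledged events survive a restart.
queue.type: persisted

# Allow shutdown to proceed even with events still in flight; the PQ keeps
# the unacknowledged events, but duplicates are possible since outputs only
# get the (hard-coded) ~15 second window mentioned above to flush.
pipeline.unsafe_shutdown: true
```

The same setting can also be passed on the command line as --pipeline.unsafe_shutdown.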

EDIT: Looking at the code a bit more... I'm not sure pipeline.unsafe_shutdown is acceptable for pipeline restarts, since I don't think the pipeline worker threads that are hung are explicitly killed.

jsvd commented 6 years ago

However, when Logstash is used with the Persistent Queue (PQ), the PQ protects against data loss by only acknowledging a batch of events when that batch is closed (which can only happen after it reaches the outputs). Events remain in the PQ until acknowledged.

One thing we should keep in mind is that few inputs properly support acking and retransmission the way the beats input does. Having unsafe_shutdown-like behaviour for the PQ makes data loss less likely, but by default we would still be making the decision to drop data that the source may not retransmit.

That said, I agree that the "ES IP changed and now Logstash is stuck" scenario is real. For the most part this should be a rare occurrence, and a "kill -9" is enough. For more dynamic scenarios, like a cloud environment with remote management, it may make sense to have a remote force-shutdown command, or even some notion of: "this pipeline uses only the beats input, so we can turn it off immediately and safely".

jakelandis commented 6 years ago

@jsvd - doesn't the input worker for the pipeline get shut down before the pipeline workers?

Meaning that the input thread would be joined (and no longer accepting data) by the time we get to the pipeline workers' shutdown (which is what checks for stalling).

E.g. we would still check for stalling, but if the PQ is enabled we would take the pipeline.unsafe_shutdown path in the code, and sprinkle in some config/logic to help prevent duplicates.

jsvd commented 6 years ago

If the pipeline is stalled on an output, the PQ can become full and so exert back pressure. I know it's a more convoluted scenario, but if someone goes to the effort of turning the PQ on, the choice to accept data loss should always be made through an explicit action by the user.
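For context, the point at which the PQ fills up and starts exerting back pressure is governed by its sizing settings in logstash.yml (illustrative values; defaults differ by version):

```yaml
queue.type: persisted
# Back pressure on inputs begins once either limit is reached:
queue.max_bytes: 1gb   # maximum bytes the queue may hold on disk
queue.max_events: 0    # 0 means unlimited events; only the byte limit applies
```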

jakelandis commented 6 years ago

I think I understand - if the queue is full then inputs are blocked, and non-retrying inputs would lose data.

However, there is really no recourse for the user in that case. It seems they only get to choose between losing data vs. doing a memory dump and some forensics to grab the non-queued events from memory?

If I understand correctly, this is an issue for both PQ and non-PQ, and is tangential to, but not directly related to, the proposal here.

Perhaps a more robust solution would be to use the DLQ (or something like it) and push any in-flight events (either pre- or post-queue) there before shutdown, such that, independent of the queue implementation, shutdown is always safe with respect to data loss?
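For reference, the existing DLQ is configured in logstash.yml as below; today it only captures events that certain outputs (e.g. elasticsearch) reject as unprocessable, so the idea above would be an extension of it rather than current behaviour (values are illustrative):

```yaml
dead_letter_queue.enable: true
dead_letter_queue.max_bytes: 1gb
# Defaults to <path.data>/dead_letter_queue if not set explicitly.
path.dead_letter_queue: /var/lib/logstash/dead_letter_queue
```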

jsvd commented 6 years ago

However, there is really no recourse for the user in that case. It seems they only get to choose between losing data vs. doing a memory dump and some forensics to grab the non-queued events from memory?

Indeed. My concern here came from, historically, seeing situations where buggy plugins would cause Logstash to terminate the pipeline, or init scripts would restart Logstash periodically, and defaulting to accepting "unsafe" termination would cause hours of data loss. The reasons for most of these situations are now gone thanks to much cleaner pipeline and plugin lifecycle semantics, the PQ, and the "safe" shutdown strategy.

If I understand correctly, this is an issue for both PQ and non-PQ, and is tangential to, but not directly related to, the proposal here. Perhaps a more robust solution would be to use the DLQ (or something like it) and push any in-flight events (either pre- or post-queue) there before shutdown, such that, independent of the queue implementation, shutdown is always safe with respect to data loss?

++ this can be addressed in another issue