[Design] plugins shutdown sequence refactor proposal

colinsurprenant commented 9 years ago

We need to refactor the shutdown sequence involving the input plugins and establish a clear shutdown semantic.

Current Design

Currently the base plugin class exposes the following methods related to shutdown:

shutdown
- only the lumberjack input plugin (of all input/filter/output/codec) call this method
- never overridden
- does nothing
finished
- called from many plugins
- never overridden
- does nothing
teardown
- called from many plugins
- overridden by many plugins for the plugin shutdown bookkeeping
finished?
- only the rackspace input plugin uses it
running?
- only the sqs input uses it
terminating?
- a hand full of plugins uses it

Current input plugin shutdown sequence description: There are basically 3 situations:

1- all input plugins have exited normally, they have somehow reached the end of their input:

the plugin will exit its run method, the plugin thread will exit
the pipeline thread will call the plugin teardown method

2- a SIGINT or SIGTERM is received and the pipeline needs to terminate the input plugins and the pipeline:

a) SIG* signal is trapped in the pipeline thread
b) Pipeline#shutdown is called from the signal handler:
- all input plugins threads are injected an exception + wakeup like this:
thread.raise(LogStash::ShutdownSignal) thread.wakeup
- all input plugins teardown methods are called in sequence
- (b) normally the plugins exit their run method and threads at which point the pipeline thread will again call the input teardown methods

3- an uncatched exception in the input plugin

the plugin teardown method is called from the pipeline thread
the plugin is “restarted” by calling the plugin run method again in a begin/retry
Problems
current semantic is not clear and shutdown handling is inconsistent throughout the plugins
injecting a rogue exception by doing a raise in the live thread is a very bad idea and can lead to many weird problems. see this post discussing the related problems
per the previous reason, a LogStash::ShutdownSignal exception can happen virtually anywhere, event in the teardown method
teardown is called twice on the input plugins in a SIGINT/SIGTERM situation
thread safety issues are not clearly defined/understood in the current design
Suggested Refactor

Here's my suggestion for a first iteration cleanup. Once cleaned up, we will be able to more easily build upon it and start thinking about decoupling further the plugin execution/lifecycle from the pipeline in further iterations.

We basically need 2 things for the input plugins:

1- signal the plugin it need to stop 2- ask the plugin to do its bookkeeping once it is stopped

For this I suggest that the plugins exposes only 3 methods: stop, stop? and close.

stop is called from outside the plugin thread (from the pipeline thread) and is to ask the plugin to stop. this method must be thread-safe bacause it will be called from outside the plugin thread.
stop? returns true if the stop method was called and can be used within the plugin to verify if a stop was requested.
close is called once when the plugin is stopped to perform any final bookkeeping.

The way to know a plugin is stopped is simply by waiting the return of the plugin run method. When the run method exits and the plugin thread exits, this is only when the close method should be called, once.

The same semantic can be applied to the filter and output plugins which would support 2 shutdown modes:

1- the current mode of inserting a SHUTDOWN event in the queue so that all inflight events will be processed before the filter and output plugins are closed 2- a new mode using the stop method for an immediate shutdown in which case the inflight events will be lost when not using queue persistence, in other words, with queue persistence, we will be able to do an immediate shutdown without loosing inflight events.

jordansissel commented 9 years ago

I'll get you some feedback on this soon.

jordansissel commented 9 years ago

For this I suggest that the plugins exposes only 3 methods: stop, stop? and close.

I might be misreading a bit, but if we consider the public interfaces, is 'stop' the only public interface of these three? Seems like 'stop?' would only be invoked inside a plugin to inquire if it should stop, and 'close' would only be invoked by the plugin as part of of the 'run' method (for inputs) completing.

What we have today is a kind of in-band signalling (the LogStash::ShutdownEvent), and you propose an out-of-band signal (plugin.stop). I think this is good, but I wonder if in-band is even needed anymore once we have a way to interrupt with persistence of in-flight state. Thoughts?

jsvd commented 9 years ago

The public interfaces would be stop and close. close (the current teardown) should be called by the inputworker in the pipeline in the ensure block after run:

def inputworker
  plugin.run
rescue ..
  # ..
ensure
  plugin.close #bookkeeping
end

Only stop? would be private because it's used by the plugin itself.

One more detail: since only inputs need an out of band signal (they don't receive events, they generate them), filters and outputs can still use the shutdown event. This is good to ensure data safety, changing filters and outputs to use stop/stop? would potentially bypass a non-empty queue and lose data.

[EDIT] I now think stop? should also be public, since the pipeline needs it to know when to ignore exceptions when plugins are shutting down

jsvd commented 9 years ago

I've created PR #3812 (now #3895) that implements @colinsurprenant original idea with a couple of modifications:

restricts the stop, stop?, close refactor to the LogStash::Inputs::Base class: this only concerns the shutting down of inputs. filters and outputs still use the ShutdownEvent;
renames close back to teardown to reuse the current semantics of teardown: "perform additional bookkeeping at shutdown time". This reduces the refactor work.

This PR then changes the Plugin API to:

add stop (public), stop? (private) to LogStash::Inputs::Base
remove shutdown, finished. finished?, running?, terminating? from LogStash::Plugin

I've opened a meta issue to track which plugins need to be refactored: https://github.com/elastic/logstash/issues/3813

jsvd commented 9 years ago

I've also added a PR against logstash-devutils to add a test most input plugins can use to test that stop properly exits run https://github.com/elastic/logstash-devutils/pull/32

ph commented 9 years ago

Just asking, is there any reason not to have a running? method which is basically this:

def running?; !stop?; end

it make writing code in plugin a bit more positive..

while running?
end

instead of

while !stop?
end

purbon commented 9 years ago

+2 on being positive :-)

On Tue, 1 Sep 2015 18:42 Pier-Hugues Pellerin notifications@github.com wrote:

Just asking, is there any reason not to have a running? method which is basically this:

def running?; !stop?; end

it make writing code in plugin a bit more positive..

while running?end

instead of

while !stop?end

— Reply to this email directly or view it on GitHub https://github.com/elastic/logstash/issues/3210#issuecomment-136787538.

jsvd commented 9 years ago

Note: another common pattern will be break if stop?

Nothing against running?, though I'd prefer to reduce the api to just 3 calls: .teardown, .stop and 1 other method to check if stop was called. Not sure if it's worth to widen the api. We can just replace stop? with running?

jordansissel commented 9 years ago

I prefer positive predicates, but in this case break unless running? feels ... weird. We have to have a negative to determine when to stop (break if stop? vs break unless running?) so we can't avoid it, I don't think. The alternative is something wierder while running? ... end that makes it maybe less explicit about when we can break. I dunno. I'm neutral on stop? vs running? partially because stop is either a noun or a command, where running? is more a question or an active verb. Questions feel more appropriate for predicate methods running? or should_stop? or osmething. I don't know. Neutral or leaning towards positive naming :)

jordansissel commented 9 years ago

since only inputs need an out of band signal

When persistence lands, won't we want out-of-band signalling to terminate a filter/output?

jsvd commented 9 years ago

Yes they'll need something similar, but I'm not sure if writing that code in advance is the "right thing". Whenever persistence is introduced, this api could move the LogStash::Plugin class, it shouldn't require another huge refactoring since it's just adding new stuff (stop, stop?)

jsvd commented 9 years ago

Votes? :)

# stop?
break if stop?
next unless stop?
while !stop?
 # ..
end

# stopping?
break if stopping?
next unless stopping?
while !stopping?
 # ..
end

# stop_called?
break if stop_called?
next unless stop_called?
while !stop_called?
 # ..
end

# stop_requested?
break if stop_requested?
next unless stop_requested?
while !stop_requested?
 # ..
end

# should_stop?
break if should_stop?
next unless should_stop?
while !should_stop?
 # ..
end

# terminating?
# potentially also renaming plugin.stop to plugin.terminate
break if terminating?
next unless terminating?
while !terminating?
 # ..
end

# running?
break unless running?
next if running?
while running?
 # ..
end

[EDIT] added running?

purbon commented 9 years ago

From your examples @jsvd I would avoid the should option, I feels strange to me. On the other side, what is the problem of having only stop?, if need more context I like the idea of having things like requested_to_stop? ... but I think this are very personal things.

ph commented 9 years ago

I am okay with stopping? and running? since they both are positive predicate and I don't see an issue of having them both. Their concept and usage is similar to #present? and #empty?

purbon commented 9 years ago

yeah! @ph like being positive! and one can be just the negation of the other.

colinsurprenant commented 9 years ago

I don't like stopping? since it implies that it is in the state of performing a stop but in reality it only reports that the stop method was called. In that respect stop_called?, stop_requested? and should_stop? are better but I think stop? is just as well self explanatory and shorter and cleaner :P so my vote is stop?.

I don't dislike double non-positives ;) running? could be just fine too but since the api is to call stop, then having stop? is more obviously related IMO.

colinsurprenant commented 9 years ago

since we are introducing braking API changes, I would bite the bullet and include the OOB signalling for filters and outputs too. if we postpone we'll have to do another major version bump. also, we'll be ready for persistence when it lands without necessarily doing a major version bump.

colinsurprenant commented 9 years ago

as I commented in #3812 - if we are to break the API and must go through all plugins, any reason to keep teardown around? IMO close is more appropriate and commonly used for this?

jsvd commented 9 years ago

yep we had this discussion before, I've provided the example of test suites as use of the teardown metaphor. At this point I don't mind whatever we call it, I'll just need to rescan all plugins and open new tasks for refactoring if we decide to rename it.

As for including the OOB signalling for filters/outputs, it will not break the existing API, just increase it (unless we rename teardown to close), so no reason to do it now. The only thing necessary to do NOW in the filters/outputs is to remove the "useless" methods like finished?

jsvd commented 9 years ago

Refactor work needed on the plugins

Input plugins

Input plugins are now shutdown by an external call to plugin.stop instead of catching LogStash::ShutdownSignal exception. Unless overridden, stop will simply makes stop? return true, thus allowing run to poll this and return after seeing the change.

In some plugins extra work must be done in stop to instruct run that it's time to return. For example, in the logstash-input-udp it's necessary to call @socket.close to make the blocking read on the socket raise an exception, thus breaking out the loop. So, different input plugins will require different stop strategies.

Refactoring an input plugin involves:

removing rescue of ShutdownSignal exception
understand the nature of the run loop: is it a while? a consumer.subscribe {|event| }? is there a blocking operation on a socket/fd?
use stop? and/or override stop to make run return
put any other cleanup/bookkeeping tasks in close (currently done in teardown)
remove any calls to shutdown, finished, finished?, running? or terminating?

Then for testing you can use the shared example provided in https://github.com/elastic/logstash-devutils/pull/32 in the following manner:

describe LogStash::Inputs::Http do
  let(:port) { rand(5000) + 1025 }
  it_behaves_like "an interruptible input plugin" do
    let(:config) { { "port" => port } }
  end
end

Filters and Outputs

Both will still shutdown using the ShutdownEvent sent by the pipeline so no major changes necessary. The work in these plugins is:

to remove calls that are no longer in the pipeline <-> plugin contract: shutdown, finished, finished?, running? or terminating?
rename teardown to close

elastic / logstash