Asynchronous start of input plugins

andrewvc commented 8 years ago

This line of inquiry was inspired by the question here: https://github.com/logstash-plugins/logstash-input-rabbitmq/issues/58

This is how it works today, which is great, but has a problem I will describe below.

Logstash connects to rabbitmq server starts ingesting logs
RabbitMQ Goes down, logstash repeatedly retries connecting
Connection attempt succeeds, logstash succeeds in processing logs.

The above is great, however, if in step 2. Logstash is restarted, ALL inputs are blocked until the one RabbitMQ input connected to the down server is connected since its register method blocks waiting for an RMQ server.

I propose the following:

Input plugins should have their register method invoked asynchronously
Input plugins should have an internal state machine with the states :starting, :running, :stopped, :permanent_error, :transient_error. These states could be accessible via API, and state transitions would be logged.

These state machines could be really useful from an operator standpoint if exposed from an API or used in a centralized config mgmt system.

talevy commented 8 years ago

by "asynchronously" you mean to have each input running on its own thread?

andrewvc commented 8 years ago

@talevy they currently do all run in separate threads, this means that 'register' will run in a separate thread as well. Current each input's register method is run in the main thread. If one plugin's register method is stuck, subsequent ones won't start and workers won't start either.

talevy commented 8 years ago

as in:

change this existing snippet[1] into something like:

...
@inputs.each do |plugin|
    input_thread = Thread.new do
        plugin.register
        inputworker(plugin)
    end
    @input_threads << input_thread
end

andrewvc commented 8 years ago

@talevy exactly. I do think we'd want that state machine stuff added in for debugability however.

guyboertje commented 8 years ago

one thing with FSMs that I find incredibly useful is storing the state history so for example (I know that the states mentioned in the original post are speculative) to know that one reached transient error from running or starting would be important knowledge.

jcarterch commented 8 years ago

@andrewvc, @talevy, I've been troubleshooting this issue and came across this and the source discussion on this problem. Is @talevy's solution usable with minor tweaking, or was this deprioritized due to high level of effort?

I'm looking to have up to 30 rabbit nodes spread out across different locations, and being unable to restart logstash without losing ingest if a single node is down is a bit scary!

edit: I'm not familiar with how logstash plugins are designed, but I know for example that redis input does not block in this type of situation. Is this potentially something that could/should be changed within the logstash-input-rabbitmq plugin?

elastic / logstash

Asynchronous start of input plugins #4931