Feature Request: Setting up a Highly Available (HA) pipeline with Logstash Nodes

heymatthew commented 9 years ago

As a kibana user watching transactional data, I would like to configure a logstash pipeline with strong guarantees that events are not lost so that I can use ELK as a forensic analysis tool.

In logstash 1.4, the pipeline currently reads off a connection and acks as soon as it's made it into memory.

This is a problem if the underlying software or the hardware were to fail while logstash is processing a message. Some examples of software failure:

Operating system stability
Codec stacktrace on edge case (e.g. network outages on virtualised switches)
VM Hypervisor failure

Scope:

This change will extend 'the life of an event in logstash'
Extend the lumberjack Input with needs_ha option
Extend the elasticsearch Output with provides_ha option
Opt in: both needs_ha and provides_ha should be opt in as it will likely have a performance hit

Out of scope:

Existing Inputs and Outputs plugins should still operate without code changes
Setting up clustering technologies to be Highly Available (HA)

Requirements:

The Pipeline should only allow one HA output to simplify when an 'ack' is ready to be sent
The Pipeline should allow one or more inputs to ack on delivery to an HA output
Outputs labeled provides_ha should only write when all events have been processed from the input's spooled batch
Input labeled needs_ha should only ack spooled event batches when all events have been written to the HA Output
Clients who have not received an ack should resend if disconnect happens before they're acked. Most clients which implement ack already do this.
Events should be cancelled as soon as possible when an Input disconnects to stop them writing to an Output.
Cancelled events from drop{} should not block other events being written/acked.
Cloned events should only ack on delivery of both messages to an HA output.

Configuration example:

input {
  lumberjack {
    port            => 2000
    ssl_certificate => "/path/to/selfsigned.crt"
    ssl_key         => "/path/to/selfsigned.key"
    needs_ha        => true
  }
}

output {
  elasticsearch {
    host        => "172.17.0.2"
    provides_ha => true
  }
}

heymatthew commented 9 years ago

These changes are based on logstash branch 1.4 because it's what we're running on production at the moment. We're prepared can rebase on 1.5 if these changes line up with the roadmap.

This work was done by @alcinnz and myself over ~2 months. We've been peer programming this up because we're running up a logstash cluster on some fairly flakey openstack nodes that have network issues in open-vswitch. A combination of network outages, the compress spooler codec and logstash cause stack traces while the output is processing messages, and this was causing message loss in the queue.

Yes, this codec has a bug that should be addressed with another ticket, but we're tasked to solve this at a higher level.

Is this a feature that lines up with logstash's roadmap? Are there additional tests you'd like to see on this feature branch?

heymatthew commented 9 years ago

Also, if you want to check out these changes, you need to:

Build the lumberjack gem from https://github.com/catalyst/logstash-forwarder branch feature/ha-proof-of-concept
Checkout https://github.com/catalyst/logstash branch feature/ha-proof-of-concept and install lumberjack gem.
Run with example configuration detailed in this issue's description.

Feedback welcome :).

suyograo commented 9 years ago

Hi @theflimflam thanks for raising this issue and working on this. Adding end-to-end acks are definitely part of Logstash' roadmap. Improving resiliency is an important focus in our next version and we plan to tackle it in stages. We will review this issue soon (we are wrapping up a release candidate for 1.5)

heymatthew commented 9 years ago

Cool bananas. I'm doing a few more tests and may be doing some interactive rebasing to keep the patchset small and easy to follow.

When 1.5 is stable, I'll rebase on that.

How is master being used in the logstash project? Will that become the new 2.0? Are the logstash branching conventions documented someplace?

jordansissel commented 9 years ago

I haven't done a full review of your proposal yet, but here's some early feedback. As @suyograo mentioned, end-to-end acknowledgements is on the roadmap for attempting, as is increasing message resiliency - message resiliency and end-to-end acks are separate thing.

The Pipeline should only allow one HA output to simplify when an 'ack' is ready to be sent

"Only one output can do acknowledgements" seems like a significant limitation. I don't like this.

guarantees that events are not lost

This doesn't specify the messaging guarantee clearly enough. In general, there are roughly 3 guarantee modes I see in messaging: At least once, exactly once, and at most once.

"Not lost" can be satifisfied by both "at least once" and "exactly once" modes. Logstash currently leans towards "at least once" when we have put energy into it.

I feel like trying to implement "exactly once" (as I see proposed in this ticket) will be extremely complex in code, hard to debug, and limiting (only one output can provide it as proposed, for example).

"Exactly once" as a concept can be almost perfectly achieved probalistically(1) without a need for complex inter-component communication to do partial and full acknowledgement of event sets. The probabalistic implementation is feels (to me) that it will be easier to write, easier to test, lower latency, more resilient to multi-hop transportation problems, etc.

(1) (pardon any math errors, it's almost midnight) by probabilstic, I mean that for any given event sent end-to-end, it has probability P of being lost. If you want to achieve a certain probability G of message success, the probability of all N messages being lost is something like P^N. To achieve G probability of at least 1 message succeeding, we can solve for N in this equation: G = 1 - P^N where G is chosen by us, P is known failure probability. For example, a coin-toss scenario (P = 0.5), to achieve G=0.9995 (99.95% success) you'd send: log_0.5(1 - 0.9995) == 11 copies of a message. The coin toss is an extreme example as most message transport has much higher probability than in the normal case.

I like the probabalistic model because I think it fits more problems than an acknowledgement-based approach to "exactly once". In a true "exactly once" you would have the application emitting the event not complete its transaction until the endpoint had acknowledged receipt of the log of that event. Think of a web application that wouldn't finish serving a page until the log of that event was stored and acknowledged in Elasticsearch and sync'd to disk and all replicas (Elasticsearch doesn't provide an API for "is this document sync'd to disk" at this time, to my knowledge).

Further, "exactly once" transmission doesn't account for catastrophic failures occuring to data at rest (natural disaster destroying data stores) or major corruption (all records lost to corruption).

Cancelled events from drop{} should not block other events being written/acked.

This gets a bit complex once we see taht the multiline filter takes N events and drops (cancels) most of them while it is merging events. This means that you can get N events into an input and have 1 event leave an output. The event to "acknowledge" at this point is unclear. An event can also be mutated beyond recognition, or even cloned with the clone filter, so it's even less clear how that behavior should be handled.

It's also notable that most inputs/outputs use transmission protocols that don't support object acknowledgements at all (tcp, udp, irc, redis, etc). There are ways we can improve this (though not perfectly), and it would provide much more value to support more than just lumberjack, for example.

I appreciate your efforts in working towards this feature!

I don't know if we have figured out the right design for this feature, so we'll need to iron that out before we consider the code details.

It's also unclear if we have yet agreed as to the scope. Is "highly available" simply a concept within one logstash agent process itself? Or would the these "HA acks" reach from Elasticsearch all the way back to the origin of the event?

heymatthew commented 9 years ago

Hey @jordansissel, I would not be personally offended if the changes in this form were not merged, especially with this on the roadmap already. Also would be happy to tweak the scope if it would help align this with the logstash project's goals. I just needed to code something out to get a feel for the scope, so lets call this a prototype.

Our company is motivated to get the ELK stack on a very unreliable cloud environment where we experience virtual machines going down inexplicably, and weekly rolling network outages. I've got something like 80% time (woo hoo) to hash out a solution for this for the time being and I'd like to make that useful to the logstash project.

Highly Available (for this context) is a concept that you can apply to any tech that can be clustered with redundancy, and forms a quorum between members to assert endpoints that are not clustered will not accept messages. Messages being processed by logstash are being treated as a half processed, so if the input or output goes down, we try call cancel as soon as we can to stop double ups. As you've pointed out, it's very hard to guarantee exactly one message so we're going for the "at least once" scenario.

You can configure actual HA clusters with Elasticsearch (write consistency) and RabbitMQ (mirroring). However, needs_ha and provides_ha are names to hang the imagination on, painting the inputs and outputs with an intent. If there's a better term we can change it out, completely open for suggestions. e.g. we could use ack_on_delivery on both inputs and outputs to mean the same setup in their respective contexts.

The only reason we've said there can only be one HA output is to reduce scope. What we were trying to avoid is keeping state about if a message has been delivered to one endpoint but not the other. This can still be done as you can register multiple callback procs on event objects, the input would have to know a bit more about the pipeline though, and I worry this would be too big a change.

Could we possibly make this the subject of a future change to keep this scope small?

If you need strong guarantees to multiple outpus, you can always do this with fanout in RMQ and a second logstash node. Someone dabbling in clustering techniques in their pipelines would likely have the skills to configure that sort of setup without too much help.

This feature would only make sense to pair up inputs that support or require ack to mark as processed, and outputs that can send to technologies that can cluster. If an output like Redis doesn't support ack, then it'll simply can not support this setup. That doesn't mean you can't use it, it just means you can fall back to the current way Redis inputs and outputs work in the pipeline.

The inputs we've targeted for the prototype are logstash forwarder, log courier, and RMQ (through publisher ack). The outputs we're targeting are Elasticsearch and RabbitMQ.

So a setup like this:

                       ___                   ___
                      |   |                 |   |
                      |RMQ|                 |ES |
                      |   |                 |   |
                       ‾‾‾                   ‾‾‾
                        |                     |
 ___        ___        ___        ___        ___
|   |      |   |      |   |      |   |      |   |
|LSF| ---> |LS1| ---> |RMQ| ---> |LS2| ---> |ES |
|   |   (A)|   |(B)   |   |   (C)|   |(D)   |   |
 ‾‾‾        ‾‾‾        ‾‾‾        ‾‾‾        ‾‾‾
                        |                     |
                       ___                   ___
                      |   |                 |   |
                      |RMQ|                 |ES |
                      |   |                 |   |
                       ‾‾‾                   ‾‾‾
LSF = Logstash Forwarder
LS1,LS2 = Logstash
RMQ = RabbitMQ
ES = Elasticsearch

Would not ack on input (A) till (B) had been written to, and would only ack (C) when (D) had been written to. (D) would not ack back to (A), but we'd be able to give stronger assurances about messages over volatile networks, or even if LS1 or LS2 were completely lost.

Also, we've been working on RMQ patches. It turns out it's not too much work on top of what's been done. We can have them in the PR or not. Either way, I'm putting these on a small production system this week to see how they do :).

I look forward to hearing your thoughts on the patches. They're not really ready in their current state because they're based on 1.4.

Some related work:

Adding round robin failover for RabbitMQ module elasticsearch/logstash/pull/1943
Adding round robin failover for RabbitMQ module logstash-plugins/logstash-output-rabbitmq/pull/4

Looking forward to hearing back after your review.

What changes to the patches and/or scope would make this work acceptable?

heymatthew commented 9 years ago

Has anyone had a chance to review the scope on this ticket?

The associated patches are now 417 commits away from 1.5 and I'm worried associated code will be suffering from bitrot.

untergeek commented 9 years ago

While this is something we're excited about and would like to merge, we're in code freeze for Logstash 1.5, so this will not be included in the Logstash 1.5 release. I apologize for the inconvenience, but yes, things will need to be refactored to work with the code that is released.

heymatthew commented 9 years ago

PRs associated with this proof of concept have been closed, and I'm afraid I don't really have the time to work on this anymore.

Feel free to hold this ticket open, but I suggest closing it in favour of #2632.

elastic / logstash

Feature Request: Setting up a Highly Available (HA) pipeline with Logstash Nodes #2579