
Event Mills or how to turn a text stream into an Event stream #4858

Open guyboertje opened 8 years ago

guyboertje commented 8 years ago

Motivation

After talking about core changes to include a persistent queue, we decided to divide up some functionality that is now in the inputs, and put some of it before the Persistent Queue (PQ) and some after.

We will remove the inconsistency where some input sources provide byte oriented data and others provide line oriented data. We will ensure that every input that can provide byte oriented data does so.

Any inputs that naturally provide Event streams will not change.

The concept of line and multiline as codecs is deprecated, because they are boundary detectors and not decoders. Codecs will be split into decoders and encoders, both available in the same LS library. Decoders are specifically for protocol/format handling.

Decoders go after the PQ.

Event boundary detection.

In byte oriented data we need to find where each event starts and stops. Most of the time this is at newline (LF) characters, but not always; in some cases the event boundaries span multiple lines. I have some POC state machines that allow for continuous detection of both line and multiline boundaries.
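For concreteness, here is a minimal sketch of a continuous line boundary detector (hypothetical names, not the actual POC code): it accepts arbitrary byte chunks, emits each complete line, and buffers any trailing partial line until more data arrives.

# Sketch: continuous line boundary detection over arbitrary chunks.
class LineDetector
  def initialize(delimiter = "\n")
    @delimiter = delimiter
    @buffer = ""
  end

  # Feed a chunk of byte oriented data; yields each complete line found.
  def extract(chunk)
    @buffer << chunk
    while (i = @buffer.index(@delimiter))
      yield @buffer.slice!(0, i + @delimiter.size).chomp(@delimiter)
    end
  end

  # Return whatever remains when the stream ends,
  # e.g. a file with no final newline.
  def flush
    remainder, @buffer = @buffer, ""
    remainder
  end
end

Feeding "line 1\nline" and then " 2\n" yields "line 1" and then "line 2".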

Identity management.

When looking for event boundaries in byte oriented data, chunks from different origins must be kept separate by a property: identity. In the case of the File Input, each file is a different origin; in the case of the TCP Input we could receive byte oriented data from any origin over any connection, so ideally the far end should transmit the identity.

Event Mills

An Event Mill is used by the Input: byte oriented data is fed in and Events come out the other side. Based on the LS Input config it should know whether to include multiline capabilities. The Mill should be called with an identity and some bytes, and internally it should create a new machine per identity. For line and multiline it should look like this:

Input -> (identity, byte oriented data) -> LineFSM -> (line) -> Input [callback] -> (hash) -> Eventifier -> (event) -> PQ

or

Input -> (identity, byte oriented data) -> MultilineFSM -> (lines as one string) -> Input [callback] -> (hash) -> Eventifier -> (event) -> PQ
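A hedged sketch of that shape (EventMill and Eventifier follow the description above and are not an existing API; LineDetector is the sketch from the boundary detection section):

# Sketch: an Event Mill keeps one boundary-detecting machine per identity.
class EventMill
  def initialize(machine_factory, &callback)
    @machine_factory = machine_factory  # builds a detector, e.g. LineDetector
    @machines = Hash.new { |h, identity| h[identity] = machine_factory.call }
    @callback = callback                # receives (identity, hash) per event
  end

  # Called by the Input with an identity and some bytes.
  def feed(identity, data)
    @machines[identity].extract(data) do |message|
      @callback.call(identity, "message" => message)
    end
  end
end

# Usage: the Input wires the mill's callback to the Eventifier and the PQ.
mill = EventMill.new(-> { LineDetector.new }) do |identity, hash|
  # event = Eventifier.to_event(hash, identity); queue.push(event)
end
mill.feed("file:/var/log/app.log", "line 1\nline 2\npartial")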

If possible the Event Mill will be written as a JRuby extension.

Summary:

Data makes a Journey via some Transport mechanism from the Origin to the Mill to the PQ Storage.

colinsurprenant commented 8 years ago

One thing I'd like to clarify (maybe this should be in another discussion): in our last discussion about the persistence interface there seemed to be some confusion about the decoupling the PQ provides between the input+milling stage and the filter+output stage.

guyboertje commented 8 years ago

Yes, it is a less understood concept.

guyboertje commented 8 years ago

Thanks to @purbon's questions about multiline XML docs that have no \n character at the end of the file, we need a multiline (or pattern) boundary detector that operates on raw byte data, and not only on data that has previously been put through a line (or character) boundary detector.

guyboertje commented 8 years ago

NOTE: I removed the comment that showed POC FSM code; it was too premature to share.

guyboertje commented 8 years ago

I will analyse the input plugins to see how Event Mills may be used and what changes are required so the inputs can, with minimum effort, yield Events to the PQ.

guyboertje commented 8 years ago

We need to differentiate between byte oriented data that is plain text, where character or pattern boundary detection may be applied directly, and data that must be decoded first.

We may need a user directive to tell us whether a chunk of byte oriented data is actually a full event that can be decoded after the queue or whether it is an encoded chunk that needs decoding and milling to find the events within it.

Many inputs use local decoration of the Event. It will be problematic to include enough metadata or directives (i.e. context) in the event before the PQ such that a generic decoration function can be applied to the Event after it is read from the PQ in the filter-output stage.

There needs to be provision for charset conversion before any EventMill.
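As a sketch (the charset name would come from the proposed encoding directive in the mill config, not from any existing option):

# Normalise incoming bytes to UTF-8 before boundary detection.
def normalize_charset(bytes, charset = "ISO-8859-1")
  bytes.force_encoding(charset)
       .encode("UTF-8", invalid: :replace, undef: :replace)
end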

For illustration purposes (a flow diagram labelled A(i), A(ii) and B accompanied the original comment and is not reproduced here; the discussion points below refer to it):

Discussion Point (from A(i) above): Is this a special case of B?

Discussion Point (from A(ii) above): How much of a requirement is it to unpack the raw bytes into text so that boundary detection can be done on it?

Discussion Point (from B above): Whether ProtocolDecode should occur before the Event is generated and persisted, or after the Event is taken from the PQ for further processing, depends on whether LocalDecorate and GlobalDecorate can also be moved after the PQ. Is it feasible for an input to register its Decorator class (and Decoder class, if directed) in a lookup structure in the pipeline, so that a worker thread can find, use and cache these classes via a field in the metadata of each event?
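Purely for illustration, such a lookup structure might amount to this (DecoratorRegistry and the metadata field name are assumptions, not existing code):

# Illustrative registry: inputs register their decorator at register time;
# workers look it up after the PQ via a metadata field on the event.
class DecoratorRegistry
  def initialize
    @decorators = {}
  end

  def register(input_id, decorator)
    @decorators[input_id] = decorator
  end

  def decorate(event)
    decorator = @decorators[event.get("[@metadata][input_id]")]
    decorator.call(event) if decorator
  end
end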

Discussion Point: There is a very small chance that a serialised hash contains raw bytes in a field, and that those raw bytes need event boundary detection and event generation, i.e. Milling. Is it feasible to Mill and generate secondary Events after the PQ? How will the worker thread be directed to do this? Would we disallow the use of PatternMills (multiline), or only cater for pattern boundary detection within the raw bytes of one Event? e.g.

{"message": "log line 1::-::log line 2::-::log line 3::-::log line 4::-::"}

Where ::-:: is the boundary pattern
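In the simplest case, milling that field after the PQ is a split on the boundary pattern (sketch):

# Sketch: secondary milling of raw bytes held in an Event field.
raw = "log line 1::-::log line 2::-::log line 3::-::log line 4::-::"
messages = raw.split("::-::")
# => ["log line 1", "log line 2", "log line 3", "log line 4"]
# Each element would then become a secondary Event after the PQ.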

guyboertje commented 8 years ago

Inputs::Beats

guyboertje commented 8 years ago

Inputs::CouchDBChanges

guyboertje commented 8 years ago

Inputs::Elasticsearch

guyboertje commented 8 years ago

Inputs::EventLog

guyboertje commented 8 years ago

Inputs::Exec

guyboertje commented 8 years ago

Inputs::File

guyboertje commented 8 years ago

Inputs::Ganglia

guyboertje commented 8 years ago

Inputs::Gelf

guyboertje commented 8 years ago

Inputs::Generator

guyboertje commented 8 years ago

Inputs::Graphite

Discussion Point: For sources that do not supply a timestamp in the data, do we need an 'ingest' timestamp in the metadata (or do we put one in regardless)?

guyboertje commented 8 years ago

Inputs::Http

jordansissel commented 8 years ago

Thanks to @purbon with questions about multiline xml docs that have no \n character at the end of the file

Here's an oddity for ya -- I have a little USB stick that talks to my power meter at home to gather power usage. The interface it presents when plugged in to my computer is a serial port that emits XML documents continuously. I wonder, for XML documents in general, if it would make sense to have an XML document mill? I use REXML::Parsers::StreamParser, which is probably slow, but it does let me stream XML documents and emit each one as it is completed. Something to think about.
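As a rough sketch of the idea (hedged: this tracks element depth with REXML's stream listener to notice when a document's root element closes; a production mill would need an incremental parser that tolerates concatenated documents):

require "rexml/parsers/streamparser"
require "rexml/streamlistener"

# Sketch: signal a document boundary when element depth returns to zero.
class XmlDocListener
  include REXML::StreamListener

  def initialize(&on_document)
    @depth = 0
    @on_document = on_document
  end

  def tag_start(name, attrs)
    @depth += 1
  end

  def tag_end(name)
    @depth -= 1
    @on_document.call if @depth.zero?  # root closed: document complete
  end
end

listener = XmlDocListener.new { puts "document complete" }
REXML::Parsers::StreamParser.new("<doc><a/><b/></doc>", listener).parse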

andrewvc commented 8 years ago

@jordansissel since we're JRuby-only, maybe we should use a Java parser? REXML is famously slow / not perfectly conforming. Maybe https://docs.oracle.com/javase/tutorial/jaxp/stax/ ?

The API isn't terrible. I've written a Wikipedia XML parser using it: https://github.com/andrewvc/wikiparse/blob/java/src/main/java/wikielastic/wiki/WikiParser.java

jordansissel commented 8 years ago

Yeah, I don't have opinions on the implementation; I just wanted to offer another use case (milling XML documents). +1 on avoiding REXML for speed reasons.


guyboertje commented 8 years ago

Inputs::HTTP_Poller

guyboertje commented 8 years ago

@jordansissel, @andrewvc: For me the biggest unanswered question is whether the user wants to A) decode the XML into an Event, or B) simply put the XML string into a message field of an Event and output it as such.

Case B: put the XML string into the message field as-is; nothing needs decoding before the PQ.

Case A: one of...

  1. do a full XML stream -> new Event decode in the input; no Milling is required.
  2. do B, then decode into the existing Event after the PQ.

For A2 we will tokenise twice.

guyboertje commented 8 years ago

@jordansissel, @andrewvc:

Another interesting twist with pattern boundary detection is whether it is line oriented or byte oriented. Example. Pattern: start: "--- begin ---", end: "--- end ---". File:

--- begin ---\n
line 1\n
line 2\n
--- end ---\n
garbage 1\n
--- begin ---\n
line 3\n
--- end ---\n

If it is byte oriented and exclusive, then the Event messages look like this: "\nline 1\nline 2\n" and "\nline 3\n". If it is byte oriented and inclusive, then the Event messages look like this: "--- begin ---\nline 1\nline 2\n--- end ---\n" and "--- begin ---\nline 3\n--- end ---\n".

If it is line oriented and exclusive, then the Event messages look like this: "line 1\nline 2" and "line 3". If it is line oriented and inclusive, then the Event messages look like this: "--- begin ---\nline 1\nline 2\n--- end ---" and "--- begin ---\nline 3\n--- end ---".
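A sketch of the byte oriented case in both modes (hypothetical; a real detector would work incrementally on chunks rather than over a whole string):

# Sketch: byte oriented begin/end pattern extraction over a buffer.
BEGIN_PAT = "--- begin ---".freeze
END_PAT   = "--- end ---".freeze

def extract(buffer, inclusive:)
  pattern = /#{Regexp.escape(BEGIN_PAT)}(.*?)#{Regexp.escape(END_PAT)}/m
  buffer.scan(pattern).map do |(body)|
    inclusive ? "#{BEGIN_PAT}#{body}#{END_PAT}" : body
  end
end

data = "--- begin ---\nline 1\nline 2\n--- end ---\ngarbage 1\n" \
       "--- begin ---\nline 3\n--- end ---\n"
extract(data, inclusive: false) # => ["\nline 1\nline 2\n", "\nline 3\n"]
extract(data, inclusive: true)  # => ["--- begin ---\nline 1\nline 2\n--- end ---", "--- begin ---\nline 3\n--- end ---"]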

andrewvc commented 8 years ago

@guyboertje the HTTP poller is not JSON-only. I've used it in the past to deal with CSV. It should just return plain data as a string; users can use a JSON filter if needed.

guyboertje commented 8 years ago

@andrewvc - thanks for the update.

guyboertje commented 8 years ago

As we proposed in the last meeting, a new config option mill is required.

I suggest, after some analysis and convergence talks with the Beats team, that we define a channel inside the mill config, composed of compute elements that let the user specify exactly the transforms required for their source data and source input.

However:

input {
  file {
    codec => line | multiline | json_lines
    path => ...
    ...
    type => "sometype"
  }
}

Will generate an invalid config error.

Generic apache log file - using a < 5.0 config

input {
  file {
    path => ...
    ...
    type => "sometype"
  }
}

Will add a line mill to the input at register time, because file produces bytes and each event is one line.

Generic apache log file - using a >= 5.0 config

input {
  file {
    path => ...
    ...
    type => "sometype"
  }
  mill => {
    encoding {
      charset => UTF-8
      force => true
    }
    line { end => LF }
  }
}

For a file of pretty-printed, comma-separated JSON objects:

{
  ...
},
{
  ...
},
{
  ...
}

input {
  file {
    path => ...
    ...
    type => "sometype"
  }
  mill => {
    encoding {
      charset => UTF-8
      force => true
    }
    line { end => LF }
    multiline {
      begin => "\A{\z"
      end => "\A},?\z"
      inclusive => true
    }
  }
}

filter {
  if [type] == "sometype" {
    json {
      source => "message"
    }
  }
}

zslayton commented 6 years ago

I'm working on a codec plugin that would be much better implemented with the help of a mill. Is this improvement still planned? I notice that many of the issues related to updating the codec model haven't been updated since mid-2016.

colinsurprenant commented 6 years ago

@zslayton we haven't moved forward with the mills concept yet, and there is no short-term plan for it either (that does not mean it will not happen at some point).