
Event Mills or how to turn a text stream into an Event stream #4858

Open guyboertje opened 8 years ago

guyboertje commented 8 years ago

Motivation

After talking about core changes to include a persistent queue, we decided to divide up some functionality that is now in the inputs, and put some of it before the Persistent Queue (PQ) and some after.

We will remove the inconsistency where some input sources provide byte oriented data and others provide line oriented data. We will ensure that every input that can provide byte oriented data does so.

Any inputs that naturally provide Event streams will not change.

The concept of line and multiline as codecs is deprecated, because they are boundary detectors and not decoders. Codecs will be split into decoders and encoders, both available in the same LS library. Decoders are specifically for protocol/format handling.

Decoders go after the PQ.

Event boundary detection.

In byte oriented data we need to find where each event starts and stops. Most of the time this is at newline (LF) characters, but not always; in some cases the event boundaries span multiple lines. I have some POC state machines that allow for continuous detection of both line and multiline boundaries.
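For concreteness, here is a minimal sketch of a continuous line boundary detector (hypothetical names, not the actual POC code): it accepts arbitrary byte chunks, emits each complete line, and buffers any trailing partial line until more data arrives.

# Sketch: continuous line boundary detection over arbitrary chunks.
class LineDetector
  def initialize(delimiter = "\n")
    @delimiter = delimiter
    @buffer = ""
  end

  # Feed a chunk of byte oriented data; yields each complete line found.
  def extract(chunk)
    @buffer << chunk
    while (i = @buffer.index(@delimiter))
      yield @buffer.slice!(0, i + @delimiter.size).chomp(@delimiter)
    end
  end

  # Return whatever remains when the stream ends,
  # e.g. a file with no final newline.
  def flush
    remainder, @buffer = @buffer, ""
    remainder
  end
end

Feeding "line 1\nline" and then " 2\n" yields "line 1" and then "line 2".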

Identity management.

When looking for event boundaries in byte oriented data, chunks from different origins must be kept separate by a property: identity. In the case of the File Input, each file is a different origin; in the case of the TCP Input we could receive byte oriented data from any origin over any connection, so ideally the far end should transmit the identity.

Event Mills

An Event Mill is used by the Input: byte oriented data is fed in and Events come out the other side. Based on the LS Input config it should know whether to include multiline capabilities. The Mill should be called with an identity and some bytes, and internally it should create a new machine per identity. For line and multiline it should look like this:

Input -> (identity, byte oriented data) -> LineFSM -> (line) -> Input [callback] -> (hash) -> Eventifier -> (event) -> PQ

or

Input -> (identity, byte oriented data) -> MultilineFSM -> (lines as one string) -> Input [callback] -> (hash) -> Eventifier -> (event) -> PQ
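A hedged sketch of that shape (EventMill and Eventifier follow the description above and are not an existing API; LineDetector is the sketch from the boundary detection section):

# Sketch: an Event Mill keeps one boundary-detecting machine per identity.
class EventMill
  def initialize(machine_factory, &callback)
    @machine_factory = machine_factory  # builds a detector, e.g. LineDetector
    @machines = Hash.new { |h, identity| h[identity] = machine_factory.call }
    @callback = callback                # receives (identity, hash) per event
  end

  # Called by the Input with an identity and some bytes.
  def feed(identity, data)
    @machines[identity].extract(data) do |message|
      @callback.call(identity, "message" => message)
    end
  end
end

# Usage: the Input wires the mill's callback to the Eventifier and the PQ.
mill = EventMill.new(-> { LineDetector.new }) do |identity, hash|
  # event = Eventifier.to_event(hash, identity); queue.push(event)
end
mill.feed("file:/var/log/app.log", "line 1\nline 2\npartial")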

If possible the Event Mill will be written as a JRuby extension.

Summary:

Data makes a Journey via some Transport mechanism from the Origin to the Mill to the PQ Storage.

colinsurprenant commented 8 years ago

One thing I'd like to clarify (maybe this should be in another discussion): in our last discussion about the persistence interface there seemed to be some confusion about the decoupling the PQ provides between the input+milling stage and the filter+output stage.

guyboertje commented 8 years ago

Yes, it is a less understood concept.

guyboertje commented 8 years ago

Thanks to @purbon's questions about multiline XML docs that have no \n character at the end of the file, we need a multiline (or pattern) boundary detector that operates on raw byte data, and not only on data that has previously been put through a line (or character) boundary detector.

guyboertje commented 8 years ago

NOTE: I removed the comment that showed POC FSM code; it was too premature to share.

guyboertje commented 8 years ago

I will analyse the input plugins to see how Event Mills may be used and what changes are required so the inputs can, with minimum effort, yield Events to the PQ.

guyboertje commented 8 years ago

We need to differentiate between byte oriented data that is plain text, where character or pattern boundary detection may be applied directly, and data that must be decoded first.

We may need a user directive to tell us whether a chunk of byte oriented data is actually a full event that can be decoded after the queue or whether it is an encoded chunk that needs decoding and milling to find the events within it.

Many inputs use local decoration of the Event. It will be problematic to include enough metadata or directives (i.e. context) in the event before the PQ such that a generic decoration function can be applied to the Event after it is read from the PQ in the filter-output stage.

There needs to be provision for charset conversion before any EventMill.
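As a sketch (the charset name would come from the proposed encoding directive in the mill config, not from any existing option):

# Normalise incoming bytes to UTF-8 before boundary detection.
def normalize_charset(bytes, charset = "ISO-8859-1")
  bytes.force_encoding(charset)
       .encode("UTF-8", invalid: :replace, undef: :replace)
end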

For illustration purposes (a flow diagram labelled A(i), A(ii) and B accompanied the original comment and is not reproduced here; the discussion points below refer to it):

Discussion Point (from A(i) above): Is this a special case of B?

Discussion Point (from A(ii) above): How much of a requirement is it to unpack the raw bytes into text so that boundary detection can be done on it?

Discussion Point (from B above): Whether ProtocolDecode should occur before the Event is generated and persisted, or after the Event is taken from the PQ for further processing, depends on whether LocalDecorate and GlobalDecorate can also be moved after the PQ. Is it feasible for an input to register its Decorator class (and Decoder class, if directed) in a lookup structure in the pipeline, so that a worker thread can find, use and cache these classes via a field in the metadata of each event?
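Purely for illustration, such a lookup structure might amount to this (DecoratorRegistry and the metadata field name are assumptions, not existing code):

# Illustrative registry: inputs register their decorator at register time;
# workers look it up after the PQ via a metadata field on the event.
class DecoratorRegistry
  def initialize
    @decorators = {}
  end

  def register(input_id, decorator)
    @decorators[input_id] = decorator
  end

  def decorate(event)
    decorator = @decorators[event.get("[@metadata][input_id]")]
    decorator.call(event) if decorator
  end
end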

Discussion Point: There is a very small chance that a serialised hash contains raw bytes in a field, and that those raw bytes need event boundary detection and event generation, i.e. Milling. Is it feasible to Mill and generate secondary Events after the PQ? How will the worker thread be directed to do this? Would we disallow the use of PatternMills (multiline), or only cater for pattern boundary detection within the raw bytes of one Event? e.g.

{"message": "log line 1::-::log line 2::-::log line 3::-::log line 4::-::"}

Where ::-:: is the boundary pattern
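In the simplest case, milling that field after the PQ is a split on the boundary pattern (sketch):

# Sketch: secondary milling of raw bytes held in an Event field.
raw = "log line 1::-::log line 2::-::log line 3::-::log line 4::-::"
messages = raw.split("::-::")
# => ["log line 1", "log line 2", "log line 3", "log line 4"]
# Each element would then become a secondary Event after the PQ.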

guyboertje commented 8 years ago

Inputs::Beats

guyboertje commented 8 years ago

Inputs::CouchDBChanges

guyboertje commented 8 years ago

Inputs::Elasticsearch

guyboertje commented 8 years ago

Inputs::EventLog

guyboertje commented 8 years ago

Inputs::Exec

guyboertje commented 8 years ago

Inputs::File

guyboertje commented 8 years ago

Inputs::Ganglia

guyboertje commented 8 years ago

Inputs::Gelf

guyboertje commented 8 years ago

Inputs::Generator

guyboertje commented 8 years ago

Inputs::Graphite

Discussion Point: For sources that do not supply a timestamp in the data, do we need an 'ingest' timestamp in the metadata (or do we put one in regardless)?

guyboertje commented 8 years ago

Inputs::Http

jordansissel commented 8 years ago

Thanks to @purbon with questions about multiline xml docs that have no \n character at the end of the file

Here's an oddity for ya -- I have a little USB stick that talks to my power meter at home to gather power usage. The interface it presents when plugged in to my computer is a serial port that emits XML documents continuously. I wonder, for XML documents in general, if it would make sense to have an XML document mill? I use REXML::Parsers::StreamParser, which is probably slow, but it does let me stream XML documents and emit each one as it is completed. Something to think about.
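As a rough sketch of the idea (hedged: this tracks element depth with REXML's stream listener to notice when a document's root element closes; a production mill would need an incremental parser that tolerates concatenated documents):

require "rexml/parsers/streamparser"
require "rexml/streamlistener"

# Sketch: signal a document boundary when element depth returns to zero.
class XmlDocListener
  include REXML::StreamListener

  def initialize(&on_document)
    @depth = 0
    @on_document = on_document
  end

  def tag_start(name, attrs)
    @depth += 1
  end

  def tag_end(name)
    @depth -= 1
    @on_document.call if @depth.zero?  # root closed: document complete
  end
end

listener = XmlDocListener.new { puts "document complete" }
REXML::Parsers::StreamParser.new("<doc><a/><b/></doc>", listener).parse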

andrewvc commented 8 years ago

@jordansissel since we're JRuby-only, maybe we should use a Java parser? REXML is famously slow / not perfectly conforming. Maybe https://docs.oracle.com/javase/tutorial/jaxp/stax/ ?

The API isn't terrible. I've written a Wikipedia XML parser using it: https://github.com/andrewvc/wikiparse/blob/java/src/main/java/wikielastic/wiki/WikiParser.java

jordansissel commented 8 years ago

Yeah, I don't have opinions on the implementation; I just wanted to offer another use case (milling XML documents). +1 on avoiding REXML for speed reasons.


guyboertje commented 8 years ago

Inputs::HTTP_Poller

guyboertje commented 8 years ago

@jordansissel, @andrewvc: For me the biggest unanswered question is whether the user wants to A) decode the XML into an Event, or B) simply put the XML string into a message field of an Event and output it as such.

Case B: put the XML string into the message field as-is; nothing needs decoding before the PQ.

Case A: one of...

  1. do a full XML stream -> new Event decode in the input; no Milling is required.
  2. do B, then decode into the existing Event after the PQ.

For A2 we will tokenise twice.

guyboertje commented 8 years ago

@jordansissel, @andrewvc:

Another interesting twist with pattern boundary detection is whether it is line oriented or byte oriented. Example. Pattern: start: "--- begin ---", end: "--- end ---". File:

--- begin ---\n
line 1\n
line 2\n
--- end ---\n
garbage 1\n
--- begin ---\n
line 3\n
--- end ---\n

If it is byte oriented and exclusive, then the Event messages look like this: "\nline 1\nline 2\n" and "\nline 3\n". If it is byte oriented and inclusive, then the Event messages look like this: "--- begin ---\nline 1\nline 2\n--- end ---\n" and "--- begin ---\nline 3\n--- end ---\n".

If it is line oriented and exclusive, then the Event messages look like this: "line 1\nline 2" and "line 3". If it is line oriented and inclusive, then the Event messages look like this: "--- begin ---\nline 1\nline 2\n--- end ---" and "--- begin ---\nline 3\n--- end ---".
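A sketch of the byte oriented case in both modes (hypothetical; a real detector would work incrementally on chunks rather than over a whole string):

# Sketch: byte oriented begin/end pattern extraction over a buffer.
BEGIN_PAT = "--- begin ---".freeze
END_PAT   = "--- end ---".freeze

def extract(buffer, inclusive:)
  pattern = /#{Regexp.escape(BEGIN_PAT)}(.*?)#{Regexp.escape(END_PAT)}/m
  buffer.scan(pattern).map do |(body)|
    inclusive ? "#{BEGIN_PAT}#{body}#{END_PAT}" : body
  end
end

data = "--- begin ---\nline 1\nline 2\n--- end ---\ngarbage 1\n" \
       "--- begin ---\nline 3\n--- end ---\n"
extract(data, inclusive: false) # => ["\nline 1\nline 2\n", "\nline 3\n"]
extract(data, inclusive: true)  # => ["--- begin ---\nline 1\nline 2\n--- end ---", "--- begin ---\nline 3\n--- end ---"]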

andrewvc commented 8 years ago

@guyboertje the HTTP poller is not JSON-only. I've used it in the past to deal with CSV. It should just return plain data as a string; users can use a JSON filter if needed.

guyboertje commented 8 years ago

@andrewvc - thanks for the update.

guyboertje commented 8 years ago

As we proposed in the last meeting, a new config option mill is required.

I suggest, after some analysis and convergence talks with the Beats team, that we define a channel inside the mill config, composed of compute elements that let the user specify exactly the transforms required for their source data and source input.

However:

input {
  file {
    codec => line | multiline | json_lines
    path => ...
    ...
    type => "sometype"
  }
}

Will generate an invalid config error.

Generic apache log file - using a < 5.0 config

input {
  file {
    path => ...
    ...
    type => "sometype"
  }
}

Will add a line mill to the input at register time, because file produces bytes and each event is one line.

Generic apache log file - using a >= 5.0 config

input {
  file {
    path => ...
    ...
    type => "sometype"
  }
  mill => {
    encoding {
      charset => UTF-8
      force => true
    }
    line { end => LF }
  }
}

For a file of pretty-printed, comma-separated JSON objects:

{
  ...
},
{
  ...
},
{
  ...
}

input {
  file {
    path => ...
    ...
    type => "sometype"
  }
  mill => {
    encoding {
      charset => UTF-8
      force => true
    }
    line { end => LF }
    multiline {
      begin => "\A{\z"
      end => "\A},?\z"
      inclusive => true
    }
  }
}

filter {
  if [type] == "sometype" {
    json {
      source => "message"
    }
  }
}

zslayton commented 6 years ago

I'm working on a codec plugin that would be much better implemented with the help of a mill. Is this improvement still planned? I notice that many of the issues related to updating the codec model haven't been updated since mid-2016.

colinsurprenant commented 6 years ago

@zslayton we haven't moved forward with the mills concept yet, and there is no short-term plan for it either (that does not mean it will not happen at some point).