Ingest Node - enrich data as it gets in via pipelines

martijnvg commented 9 years ago

related issues from other projects:

Beats: https://github.com/elastic/beats/issues/805, Merged! Hurray!
Kibana: https://github.com/elastic/kibana/issues/5974

There are many use-cases where it is important to enrich incoming data. This enrichment may be something simple like using a regular expression to extract metadata from an existing field, or something more advanced like a geoip lookup or language identification. The filter stage of the Logstash processing pipeline provides great examples of the ways in which data is often enriched. Node ingest implements a new type of ES node, which performs this enrichment prior to indexing.

Node ingest is a pure Java implementation of the filters in Logstash, integrated with Elasticsearch. It works by wrapping the bulk/index APIs and executing a pipeline composed of multiple processors to enrich the documents. A processor is just a component that can modify incoming documents (the source, before ES turns it into a document). A pipeline is a list of processors grouped under a unique id. If node ingest is enabled, the index and bulk APIs can reroute a request and its documents through a pipeline.
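A minimal sketch of the processor/pipeline model described above, written in Python for brevity rather than in the Java the plugin actually uses. Every name here (`regex_extract`, `lowercase`, `Pipeline`, `PIPELINES`, `index`) is hypothetical and only illustrates the idea; none of it is the plugin's API.

```python
import re
from typing import Callable, Dict, List, Optional

# Assumption: a processor is any callable that mutates the document source in place.
Processor = Callable[[dict], None]


def regex_extract(field: str, pattern: str, target: str) -> Processor:
    """Hypothetical processor: store the first capture group of `pattern` in `target`."""
    compiled = re.compile(pattern)

    def run(source: dict) -> None:
        match = compiled.search(str(source.get(field, "")))
        if match:
            source[target] = match.group(1)

    return run


def lowercase(field: str) -> Processor:
    """Hypothetical processor: lowercase a string field if it is present."""

    def run(source: dict) -> None:
        if isinstance(source.get(field), str):
            source[field] = source[field].lower()

    return run


class Pipeline:
    """A list of processors grouped under a unique id."""

    def __init__(self, pipeline_id: str, processors: List[Processor]) -> None:
        self.id = pipeline_id
        self.processors = processors

    def execute(self, source: dict) -> dict:
        # Processors run in order, each seeing the changes made by the previous one.
        for processor in self.processors:
            processor(source)
        return source


# In-memory registry standing in for the stored pipeline configuration.
PIPELINES: Dict[str, Pipeline] = {}


def index(source: dict, pipeline: Optional[str] = None) -> dict:
    """Stand-in for the index API: enrich first, then hand the source on."""
    if pipeline is not None:
        source = PIPELINES[pipeline].execute(source)
    return source  # a real node would index the enriched source at this point


PIPELINES["access_log"] = Pipeline(
    "access_log",
    [regex_extract("message", r"status=(\d+)", "status"), lowercase("method")],
)
doc = index({"message": "GET /a status=404", "method": "GET"}, pipeline="access_log")
print(doc)
```

Requests without a pipeline pass through untouched, which mirrors the idea that enrichment is opt-in per request.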

The ingest plugin runs on dedicated client nodes. After bulk and index requests have been enriched, they continue their way into the cluster.

Node ingest will be a plugin in the elasticsearch project, implementing 2 main aspects:

The first is a pure Java implementation of Pipeline and Processor, as well as initial processor implementations of grok, geoip, kv/mutate and date. This Java implementation can then be reused in other places, such as Logstash itself, the reindex API, and so on. In the first version of the ingest plugin the processor implementations can reside in the ingest plugin, but the framework and processor implementations shouldn’t rely on any ES-specific code, so that later on they can be moved to an isolated library.

The second part is the integration with Elasticsearch. This includes interception of the bulk/index APIs, management APIs (stats and so on in a future phase), storage and live reload of the configuration, support for multiple "live" pipelines, and simulation of pipeline execution.
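To make "simulation of pipeline execution" concrete, here is a rough Python sketch of a simulate function: it takes a pipeline definition and sample documents and returns the transformed documents, optionally with the document state after each processor (as the simulate task in the checklist below describes). The function name, the `verbose` flag and the response keys `doc` and `processor_results` are assumptions made for this sketch, not the real API.

```python
import copy
from typing import Callable, List

Processor = Callable[[dict], None]


def simulate(processors: List[Processor], docs: List[dict], verbose: bool = False) -> List[dict]:
    """Run `processors` over copies of `docs` without indexing anything."""
    results = []
    for doc in docs:
        source = copy.deepcopy(doc)  # never modify the caller's documents
        steps = []
        for processor in processors:
            processor(source)
            if verbose:
                # Snapshot the document after each processor has run.
                steps.append(copy.deepcopy(source))
        result = {"doc": source}
        if verbose:
            result["processor_results"] = steps
        results.append(result)
    return results


def set_value(field: str, value) -> Processor:
    """Hypothetical processor: set a top-level field."""

    def run(source: dict) -> None:
        source[field] = value

    return run


def rename(old: str, new: str) -> Processor:
    """Hypothetical processor: rename a top-level field if it exists."""

    def run(source: dict) -> None:
        if old in source:
            source[new] = source.pop(old)

    return run


original = {"msg": "hello"}
report = simulate(
    [set_value("env", "test"), rename("msg", "message")],
    [original],
    verbose=True,
)
print(report)
```

Working on deep copies keeps the simulation free of side effects, so the same sample documents can be replayed against several candidate pipelines.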

The goal of the ingest plugin is to make data enrichment easier; it will not replace Logstash. The ingest plugin should make data enrichment easier in most cases where events are only stored in Elasticsearch. For example, when only Filebeat is used to ship logs, a Logstash instance will no longer be required. In cases where events are stored in multiple outputs, a Logstash installation is still required. Also, at some point Logstash will reuse the pipeline/processor framework, so the end goal is that both Elasticsearch and Logstash benefit from the ingest initiative.

Development happens in a feature branch: https://github.com/elastic/elasticsearch/tree/feature/ingest

Current node ingest tasks:

[x] Hook into the index and bulk APIs. If the ingest plugin is enabled a pipeline_id parameter is available to select what pipeline should be used to preprocess the documents before the index/bulk APIs get executed. #13941
[x] Manage pipeline configuration. Pipelines are stored as documents in an index. Each ingest node keeps the pipelines in memory so they are available when needed. A background process makes sure that an ingest node eventually picks up modifications. #13941
[x] The pipeline document enrichment should be non-blocking and happen via a dedicated thread pool. #13990
[x] Add first version of CRUD pipeline APIs. #14047
[x] Data substructure manipulation. Processors should be able to introduce new nested fields with no pre-existing parent structure. For example, it should be possible to create a new field "location.lat" without "location" existing. #14250
[x] Add grok processor. #14132
[x] Add geoip processor. #14208
[x] Add date processor. #14184
[x] Add kv/mutate processor. #14253
[x] Strict configuration validation #14552
[x] geoip processor output fields should be configurable #14582
[x] Add simulate API. This allows pipelines to be tested out before actually being used. This API accepts a pipeline definition and actual documents, and returns the transformed documents, optionally showing how each document is modified by each processor. #14572
[x] Do not fail whole bulk request if any pipeline for a single document fails #14888
[x] Split mutate processor into separate processors for each function (e.g. update, remove, etc) #14938
[x] Add support for setting nested fields in document #14250
[x] Data and metadata manipulation. #14644
[x] Processors and factories should throw exception on the interface.
[x] Throw exception when grok expression does not match #15132
[x] Add ability to provide custom patterns within Grok Processor config definition #15167
[x] Reduce number of fields operated on by processors to just one. #15133
[x] Support for ingest transient metadata #15036
[x] on failure pipeline handler #14548 / #15565 (tal)
[x] append processor #14324 #15577
[x] Ingest nodes should update pipelines in a sync manner #14998 / #15203 (mvg)
[x] support for templating in any processor that sets a field value #14990 #15415 (mvg)
[x] Add support for ingest node boolean flag to enable/disable ingest at the node level. If a node with node.ingest set to false receives an ingest request, it should explicitly fail. (mvg) #15610
[x] Figure out if geoip2 library can be used without suppressing jvm access checks. (mvg) https://github.com/maxmind/GeoIP2-java/pull/52
[x] Ingest forks a thread for each bulk item in a bulk request. Instead ingest should use one thread to process an entire bulk request. (mvg) #15593
[x] Ingest should only try to load the pipelines if the .ingest index has been started. (mvg) #15203
[x] Cut over to a non-guice structure. We should at most bind one class to an instance directly rather than use the dep-injection framework that we have to un-do once we get rid of guice. #15203 (mvg)
[x] Change pipeline_id param name to pipeline. #15618
[x] Add index template for the .ingest index, which should be installed by default. #15001 #15631
[x] Move ingest infrastructure to core
[x] Move processors with no dependencies to core
[x] Make grok a module as it has external deps, but it will be installed by default
[x] Make geoip a plugin which needs to be installed manually
[x] If node ingest has been disabled then it should redirect a bulk/index request that has the ingest parameter to a node that has ingest enabled.
[x] Add more descriptive error messages to pipeline factory exceptions (tal) #16010
[x] In addition to the isolated processor unit tests we should also test combination of processors. #15247
[x] Benchmark ingest plugin. #14425 (mvg&tal)
[x] Move DedotProcessor into its own plugin #16322
[x] Add processor tags to on_failure metadata #16202
[x] Documentation before release #16009
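Two of the checklist items above lend themselves to a short sketch: creating a nested field such as "location.lat" when "location" does not exist yet, and not failing a whole bulk request when the pipeline fails for a single document. The Python below is illustrative only; `set_field`, `run_bulk` and the response shape are hypothetical, and treating an existing non-object parent as an error is an assumption of the sketch.

```python
from typing import Any, Callable, List


def set_field(source: dict, path: str, value: Any) -> None:
    """Set a dotted path such as "location.lat", creating missing parent objects."""
    *parents, leaf = path.split(".")
    current = source
    for key in parents:
        current = current.setdefault(key, {})
        if not isinstance(current, dict):
            # Assumption: an existing non-object parent is treated as an error.
            raise TypeError(f"field '{key}' is not an object")
    current[leaf] = value


def run_bulk(items: List[dict], pipeline: Callable[[dict], None]) -> List[dict]:
    """Run `pipeline` on every item; a failure marks only that item as failed."""
    responses = []
    for item in items:
        try:
            pipeline(item)
            responses.append({"ok": True, "source": item})
        except Exception as exc:
            # Record the failure for this item and keep processing the rest.
            responses.append({"ok": False, "error": str(exc)})
    return responses


doc = {}
set_field(doc, "location.lat", 52.37)

responses = run_bulk(
    [{"id": 1}, {"id": 2, "location": "unknown"}],
    lambda source: set_field(source, "location.lat", 52.37),
)
print(doc)
print(responses)
```

In this sketch the second bulk item fails because its "location" is a string, while the first item is still enriched and returned.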

possible v2 tasks:

Make it possible to let the ingest plugin know what pipeline to use via index settings / index template.
configuration management
_simulate w/ verbose should show document diffs between processor results #14698
Composite pipelines. It would be convenient to pre-define certain pipelines that process specific things that can be re-used for other documents. For example, there may be a pipeline that processes date + geoip, and these operate on fields that are common to other documents that may require further processing.
Add ability to show stats in _simulate response to show resource usage and execution times of pipelines/processors
A pipeline should be able to choose what (custom) thread it uses. Some pipelines just do some simple modifications to the incoming documents while others may reach out to external systems to enrich the incoming documents. #14616
Grok discover api #15041
Add notion of transactions for processors which mutate multiple fields within a document. This will allow on_failure processors to receive a document with a pre-failed-processor state (ref: https://github.com/elastic/elasticsearch/issues/14548#issuecomment-161799133)
Add compare processor. #14647
Add json processor, which converts a JSON string into a structured JSON object.
Add kv processor.
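The composite pipelines idea from the list above could look roughly like the Python sketch below: a shared pipeline (for example date + geoip handling) is registered once and referenced from other pipelines through a processor that delegates to it. This is only an illustration of the proposal; `pipeline_ref`, `PIPELINES`, `execute` and `set_value` are hypothetical names, the shared processors are stand-ins, and there is no cycle detection.

```python
from typing import Callable, Dict, List

Processor = Callable[[dict], None]

# Registry of pipeline id -> ordered list of processors (hypothetical).
PIPELINES: Dict[str, List[Processor]] = {}


def set_value(field: str, value) -> Processor:
    """Hypothetical processor: set a top-level field."""

    def run(source: dict) -> None:
        source[field] = value

    return run


def pipeline_ref(pipeline_id: str) -> Processor:
    """Hypothetical processor that runs another registered pipeline."""

    def run(source: dict) -> None:
        # Resolved at execution time, so the shared pipeline can change independently.
        for processor in PIPELINES[pipeline_id]:
            processor(source)

    return run


def execute(pipeline_id: str, source: dict) -> dict:
    for processor in PIPELINES[pipeline_id]:
        processor(source)
    return source


# Stand-ins for the shared date + geoip processing mentioned in the task.
PIPELINES["common"] = [set_value("timestamp_parsed", True), set_value("geo_resolved", True)]
PIPELINES["web_logs"] = [pipeline_ref("common"), set_value("type", "web")]
PIPELINES["app_logs"] = [pipeline_ref("common"), set_value("type", "app")]

web = execute("web_logs", {"message": "GET /"})
app = execute("app_logs", {"message": "started"})
print(web)
print(app)
```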

timini commented 9 years ago

Very interesting feature, would use this.

marcelhallmann commented 8 years ago

When will this plugin be ready to use? Sounds very interesting!

Will the plugin be usable with elastic 1.x?

javanna commented 8 years ago

hi @marcelhallmann we don't have a date nor a targeted version for now, but you can monitor the progress in this meta issue. We will have a first release whenever all of the needed features for the first phase are in. We are developing against master (3.x) and considering backporting to 2.x. We will not backport to 1.x though.

McStork commented 8 years ago

Like many, I am looking forward to this. On this Ingest Node feature, how does one contribute custom filters? We plan on developing one and it would be great if it could be linked to Ingest Node, without us needing to develop it for Logstash.

timini commented 8 years ago

Update on this?

martijnvg commented 8 years ago

@McStork @timini If you're interested in writing your own processor you could take a look at the geoip processor which has been developed as a plugin: https://github.com/elastic/elasticsearch/tree/master/plugins/ingest-geoip

Beware that this is unreleased code and that things may change. Also it is unknown when ingest gets released.

McStork commented 8 years ago

@martijnvg Thanks!

mgcrea commented 8 years ago

I'm looking forward to this feature to ingest json and replace fluentd. :+1:

ryanmaclean commented 8 years ago

This looks fantastic! Which branch did this end up in? I'm getting a 404 on https://github.com/elastic/elasticsearch/tree/feature/ingest (linked to above, just prior to the checklist)

javanna commented 8 years ago

@ryanmaclean the branch was merged to master and deleted. Ingest node and all of its processors will be released with the next major release (namely 5.0).

timini commented 8 years ago

So this is available in master now?

Any documentation out there?

javanna commented 8 years ago

The first version of the docs is published as part of our reference: https://www.elastic.co/guide/en/elasticsearch/reference/master/ingest.html .

javanna commented 8 years ago

hi @luto65 I feel like this discussion would be better suited for our discuss forums. Would you mind posting your questions there? Then if the result of the discussion is a feature request, or a bug, a new issue can be opened on github. Thanks!

martijnvg commented 8 years ago

Closing this issue, all required tasks for phase 1 are completed.

elastic / elasticsearch

Ingest Node - enrich data as it gets in via pipelines #14049