elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Ingest Node - enrich data as it gets in via pipelines #14049

Closed martijnvg closed 8 years ago

martijnvg commented 9 years ago

[currently open issues]

related issues from other projects:

There are many use-cases where it is important to enrich incoming data. This enrichment may be something simple like using a regular expression to extract metadata from an existing field, or something more advanced like a geoip lookup or language identification. The filter stage of the Logstash processing pipeline provides great examples of the ways in which data is often enriched. Node ingest implements a new type of ES node, which performs this enrichment prior to indexing.

Node ingest is a pure Java implementation of the filters in Logstash, integrated with Elasticsearch. It works by wrapping the bulk/index APIs and executing a pipeline composed of multiple processors to enrich the documents. A processor is just a component that can modify incoming documents (the source, before ES turns it into a document). A pipeline is a list of processors grouped under a unique id. If node ingest is enabled, the index and bulk APIs can reroute a request and its documents through a pipeline.
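As an illustration of the processor/pipeline model described above, here is a minimal Java sketch. All names (`Processor`, `SetProcessor`, `LowercaseProcessor`, `Pipeline`) and the choice of a `Map<String, Object>` for the document source are assumptions made for this example; they are not taken from the ingest plugin's actual code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical interface: a processor modifies the document source in place.
interface Processor {
    void execute(Map<String, Object> source);
}

// Hypothetical processor: sets a field to a fixed value.
final class SetProcessor implements Processor {
    private final String field;
    private final Object value;

    SetProcessor(String field, Object value) {
        this.field = field;
        this.value = value;
    }

    @Override
    public void execute(Map<String, Object> source) {
        source.put(field, value);
    }
}

// Hypothetical processor: lowercases a string field if it is present.
final class LowercaseProcessor implements Processor {
    private final String field;

    LowercaseProcessor(String field) {
        this.field = field;
    }

    @Override
    public void execute(Map<String, Object> source) {
        Object value = source.get(field);
        if (value instanceof String) {
            source.put(field, ((String) value).toLowerCase(Locale.ROOT));
        }
    }
}

// A pipeline is an ordered list of processors grouped under a unique id.
final class Pipeline {
    private final String id;
    private final List<Processor> processors;

    Pipeline(String id, List<Processor> processors) {
        this.id = id;
        this.processors = new ArrayList<>(processors); // defensive copy
    }

    String id() {
        return id;
    }

    // Runs every processor, in order, against the same source map.
    void execute(Map<String, Object> source) {
        for (Processor processor : processors) {
            processor.execute(source);
        }
    }
}
```

In this sketch, an index or bulk request routed through a pipeline would call `Pipeline.execute` on each document's source before indexing continues.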

The ingest plugin runs on dedicated client nodes. After bulk and index requests have been enriched, they continue their way into the cluster.

Node ingest will be a plugin in the elasticsearch project, implementing 2 main aspects:

The first is a pure Java implementation of Pipeline and Processor, as well as initial processor implementations of grok, geoip, kv/mutate, and date. This Java implementation can then be reused in other places, such as Logstash itself, the reindex API, and so on. In the first version of the ingest plugin the processor implementations can reside in the ingest plugin, but the framework and processor implementations shouldn’t rely on any ES-specific code, so that later on they can be moved to an isolated library.

The second part is the integration with Elasticsearch. This includes interception of the bulk/index APIs, management APIs (stats and so on in future phase), storage and live reload of the configuration, supporting multiple "live" pipelines, and simulation of pipeline execution.
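A rough Java sketch of what "multiple live pipelines", "live reload", and "simulation" could look like on the integration side follows. It is only an assumption-laden illustration: the `PipelineStore` class, its method names, and the use of `Consumer<Map<String, Object>>` as a stand-in for a processor are all hypothetical and do not reflect the plugin's real storage or APIs.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Consumer;

// Hypothetical in-memory store holding several live pipelines, keyed by id.
final class PipelineStore {
    // pipeline id -> ordered processors; each processor mutates the source map
    private final Map<String, List<Consumer<Map<String, Object>>>> pipelines =
            new ConcurrentHashMap<>();

    // Registers a pipeline or replaces an existing one ("live reload").
    // Executions that already fetched the old list finish with the old list.
    void put(String id, List<Consumer<Map<String, Object>>> processors) {
        pipelines.put(id, Collections.unmodifiableList(new ArrayList<>(processors)));
    }

    // Runs the pipeline against the source in place, as an intercepted
    // index or bulk request would in this sketch.
    void execute(String id, Map<String, Object> source) {
        List<Consumer<Map<String, Object>>> processors = pipelines.get(id);
        if (processors == null) {
            throw new IllegalArgumentException("no pipeline with id [" + id + "]");
        }
        for (Consumer<Map<String, Object>> processor : processors) {
            processor.accept(source);
        }
    }

    // Simulation: runs the pipeline on a shallow copy and returns the result,
    // leaving the caller's map untouched (nested values are still shared).
    Map<String, Object> simulate(String id, Map<String, Object> source) {
        Map<String, Object> copy = new HashMap<>(source);
        execute(id, copy);
        return copy;
    }
}
```

The design choice here is that replacing a pipeline swaps an immutable list in one step, so a reload never exposes a half-updated pipeline to a request that is being processed.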

The goal of the ingest plugin is to make data enrichment easier; it will not replace Logstash. The ingest plugin should make data enrichment easier in most cases where events are only stored in Elasticsearch. For example, when only Filebeat is used to ship logs, a Logstash instance will no longer be required. In cases where events are stored in multiple outputs, a Logstash installation is still required. Also, at some point Logstash will reuse the pipeline/processor framework, so the end goal is that both Elasticsearch and Logstash benefit from the ingest initiative.

Development happens in a feature branch: https://github.com/elastic/elasticsearch/tree/feature/ingest

Current node ingest tasks:

possible v2 tasks:

timini commented 9 years ago

Very interesting feature, would use this.

marcelhallmann commented 8 years ago

When will this plugin be ready to use? Sounds very interesting!

Will the plugin be usable with Elasticsearch 1.x?

javanna commented 8 years ago

hi @marcelhallmann we don't have a date nor a targeted version for now, but you can monitor the progress in this meta issue. We will have a first release whenever all of the needed features for the first phase are in. We are developing against master (3.x) and considering backporting to 2.x. We will not backport to 1.x though.

McStork commented 8 years ago

Like many others, I am looking forward to this. How does one contribute custom filters to the Ingest Node feature? We plan on developing one, and it would be great if it could be linked to Ingest Node without us needing to develop it for Logstash.

timini commented 8 years ago

Update on this?

martijnvg commented 8 years ago

@McStork @timini If you're interested in writing your own processor you could take a look at the geoip processor which has been developed as a plugin: https://github.com/elastic/elasticsearch/tree/master/plugins/ingest-geoip

Beware that this is unreleased code and that things may change. Also, it is not yet known when ingest will be released.

McStork commented 8 years ago

@martijnvg Thanks!

mgcrea commented 8 years ago

I'm looking forward to this feature to ingest json and replace fluentd. :+1:

ryanmaclean commented 8 years ago

This looks fantastic! Which branch did this end up in? I'm getting a 404 on https://github.com/elastic/elasticsearch/tree/feature/ingest (linked to above, just prior to the checklist)

javanna commented 8 years ago

@ryanmaclean the branch was merged to master and deleted. Ingest node and all of its processors will be released with the next major release (namely 5.0).

timini commented 8 years ago

So this is available in master now?

Any documentation out there?

javanna commented 8 years ago

The first version of the docs is published as part of our reference: https://www.elastic.co/guide/en/elasticsearch/reference/master/ingest.html .

javanna commented 8 years ago

hi @luto65 I feel like this discussion would be better suited for our discuss forums. Would you mind posting your questions there? Then if the result of the discussion is a feature request, or a bug, a new issue can be opened on github. Thanks!

martijnvg commented 8 years ago

Closing this issue, all required tasks for phase 1 are completed.