elastic / elasticsearch

Free and Open, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

[Ingest Pipeline] Ability to split documents #56769

Open m-adams opened 4 years ago

m-adams commented 4 years ago

It is common for tools to output data in a combined format where one document contains several entities: for example, a tool that scans several hosts for compliance or vulnerabilities, or an API that provides an update for every train, bus, etc. We really want to split those entities out into separate docs while copying some high-level information. This is possible using Logstash and the Split filter, but not with Ingest Pipelines.

The feature would allow this kind of document to be processed and split without having to include Logstash in the ingest chain.
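For comparison, the Logstash Split filter mentioned above is only a short filter block; a minimal sketch, assuming the array to split lives in a field named `entities` (a placeholder):

```
filter {
  split {
    # Emit one event per element of this array field;
    # the other top-level fields are copied into each event.
    field => "entities"
  }
}
```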

elasticmachine commented 4 years ago

Pinging @elastic/es-core-features (:Core/Features/Ingest)

ianmuscat commented 4 years ago

If it helps, this was my use case for needing this feature https://discuss.elastic.co/t/split-json-array-into-multiple-events-using-ingest-pipelines/238519

jeffvestal commented 3 years ago

This would be really helpful. I'm sending in batches of metrics from a source where it is difficult to change the format so that they are sent one at a time or in bulk format. If I could split the array of JSON measurements up with an ingest pipeline, I could avoid having to add an additional parsing layer.

Example part of an ingested doc. I want to split the JSON measurements in data out into individual documents:

 {
        "_index" : "metrics_test",
        "_type" : "_doc",
        "_id" : "IsWOhHcBCOjOntNtCq76",
        "_score" : 1.0,
        "_source" : {
          "data" : [
            {
              "type" : "measure1",
              "date" : "2021-02-08T00:44:32-06:00",
              "value" : "164",
              "unit" : "count"
            },
            {
              "type" : "measure1",
              "date" : "2021-02-08T00:55:16-06:00",
              "value" : "22",
              "unit" : "count"
            },
...
kofi-grid commented 3 years ago

I could also use this. I'm trying to cut out Logstash and can't without this!

hungnguyen-elastic commented 3 years ago

this is very much needed! +1

ghost commented 3 years ago

this is very much needed! +1

ChenTsungYu commented 3 years ago

this is very much needed! +1

christophercutajar commented 3 years ago

this is very much needed! +1

cvanhalt commented 3 years ago

this is very much needed! +1

SpencerLN commented 3 years ago

This is one of the final remaining items preventing us from decommissioning our Logstash instances and fully migrating to beats + Ingest Pipeline. We have multiple data sources that include arrays in the JSON data that need to be split into their own documents while potentially inheriting some properties from the parent document.

For example, input:

{
    "user_id": "abc123",
    "time": "1994-11-05T13:15:30Z",
    "events": [
        {
            "event_name": "view_page",
            "event_metadata": "blah"
        },
        {
            "event_name": "click_submit",
            "event_metadata": "blah"
        }
    ]
}

output:

{
    "user_id": "abc123",
    "time": "1994-11-05T13:15:30Z",
    "event": {
        "event_name": "view_page",
        "event_metadata": "blah"
    }
}
{
    "user_id": "abc123",
    "time": "1994-11-05T13:15:30Z",
    "event": {
        "event_name": "click_submit",
        "event_metadata": "blah"
    }
}
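Until something like this lands server-side, the transformation above can be done client-side before indexing; a minimal Python sketch (the function and field names come from this example, not from any Elasticsearch API):

```python
def split_events(doc, array_field="events", target_field="event"):
    """Split one document into many: one output doc per element of
    array_field, each inheriting all other top-level parent fields."""
    parent = {k: v for k, v in doc.items() if k != array_field}
    return [{**parent, target_field: item} for item in doc.get(array_field, [])]

doc = {
    "user_id": "abc123",
    "time": "1994-11-05T13:15:30Z",
    "events": [
        {"event_name": "view_page", "event_metadata": "blah"},
        {"event_name": "click_submit", "event_metadata": "blah"},
    ],
}
docs = split_events(doc)  # two documents, each with user_id and time copied
```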
burgermannetje commented 3 years ago

Is there any confirmation that this will be realized? I can see this has been added to the enhancements and 'needs-triage'. Is there any information on where this stands?

hendry-lim commented 2 years ago

+1 We have the same requirements from our customer. We are enriching our data with the enrich processor that may match multiple documents. We need to index the same document + enriched fields into multiple documents if the enrichment matched multiple documents.

LeonBirk commented 2 years ago

We have the same issue - the split processor of the httpjson input in the Filebeat threatintel module does not work properly for our use case (getting Attributes from MISP events as documents).

response.split:
  target: body.response
  split:
    target: body.Event.Object
    split:
      target: body.Event.Object.Attribute

still leaves us with

    {
        "Attribute": [
            {
                "category": "Network activity",
                "deleted": false,
                "to_ids": true,
                "value": "https://redacted.net/ls/click?upn=5c-2BN7OI7J"
            },
            {
                "category": "Network activity",
                "to_ids": true,
                "type": "domain",
                "uuid": "76bfee8d-4d2f-4aee-aba6-ab714b1e65ab",
                "value": "redacted.net"
            }
        ],
        "ObjectReference": [
            {
                "Object": {
                    "distribution": "5"
                },
                "comment": "",
                "deleted": false,
                "uuid": "895b6048-1bb1-4f6a-bdb4-cf7fb45f4fcc"
            }
        ],
        "comment": "Redirector URL contained in mail",
        "event_id": "3835"

    }

This could be resolved through splitting docs with an ingest pipeline.

djmcgreal-cc commented 2 years ago

Given the Processor interface's execute() definition, it looks like it would be impossible to implement a split without substantial changes.

That is a great shame!

ghost commented 2 years ago

I am also eager to have it.

emmanuelmathot commented 2 years ago

+1

legoguy1000 commented 2 years ago

I too would love to see this feature

smnschndr commented 2 years ago

+1

felix-lessoer commented 2 years ago

Pretty important feature!

miastensdotter commented 2 years ago

+1

petericebear commented 2 years ago

+1 This would be really helpful for availability use cases where a room with multiple availability dates and prices comes within a single document. This would be a real timesaver if it gets made.

itaykat commented 2 years ago

+1

luizhlelis commented 2 years ago

+1

ylasri commented 2 years ago

+1

renormalist commented 2 years ago

+1

JeppeMariagerLam commented 2 years ago

+1

Rick25-dev commented 2 years ago

Any update? It has been 2 years @elastic/es-core-features (:Core/Features/Ingest)

praveenvinay commented 2 years ago

+1

ericleong86 commented 1 year ago

+1 this is needed. We often query APIs for multiple products and they always return results in an array format. We need to split the results into multiple documents instead. Creating a different API request for each specific object is not practical in my use case.

fatihakafou commented 1 year ago

+1 This is needed!

LBoraz commented 1 year ago

+1 lack of this feature forces an unnecessary trip to logstash

djptek commented 1 year ago

Depending on your use case, Filebeat has a processor:

https://www.elastic.co/guide/en/beats/filebeat/current/filebeat-input-httpjson.html#response-split

which may be relevant as it allows you to:

"...convert a map, array, or string into multiple events..."

and optionally

"...fields from the parent document (at the same level as target) will be kept..."

expertalex commented 1 year ago

+1 it will help us avoid using elastic-serverless-forwarder with expand_event_list_from_field

janniten commented 1 year ago
fixcer commented 1 year ago

+1 Pretty important feature!

skbriink commented 1 year ago

+1 would love this!

tomgregoryelastic commented 1 year ago

+1

timor-raiman commented 1 year ago

+1

jeskoriesner commented 1 year ago

+1

renzedj commented 1 year ago

+1

thornade commented 11 months ago

+1

shiya-kohn commented 11 months ago

👍 We need this for blue-green sharding

llermaly commented 11 months ago

This would be very useful to split a text body into chunks to overcome the 512-token limit of embedding models.
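As a stopgap, that chunking can also be done before ingest; a naive whitespace-based Python sketch (the 512 limit and the overlap value are assumptions about the model in use, and a real tokenizer counts tokens differently):

```python
def chunk_text(text, max_tokens=512, overlap=50):
    """Split text into whitespace-token chunks of at most max_tokens words,
    with `overlap` words shared between consecutive chunks."""
    words = text.split()
    step = max_tokens - overlap
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), step)] or [""]

chunks = chunk_text("word " * 1000, max_tokens=512, overlap=50)
```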

luc-anise commented 10 months ago

+1

carlopuri commented 9 months ago

Please add this feature; it would solve my use cases with this enabled on an ingest pipeline. Thanks a lot

matt-isett commented 9 months ago

This would be very useful to split a text body into chunks to overcome the 512-token limit of the embedding models

DrMxxxxx commented 8 months ago

This would be a veeeeery helpful feature. I have the need to split an array in one document to several documents.

thekofimensah commented 8 months ago

It's been 3 years, @elastic. How many more +1s are needed?

dijkstrajesse commented 8 months ago

+1

Lexinga commented 5 months ago

+1