elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Ingest Pipelines does not resolve {{{_id}}} during bulk upsert #89194

Open amnonshgong opened 2 years ago

amnonshgong commented 2 years ago

Elasticsearch Version

7.10.2

Installed Plugins

No response

Java Version

openjdk version "15.0.1" 2020-10-20

OS Version

Linux 2f78a71fa18c 5.10.104-linuxkit #1 SMP Thu Mar 17 17:08:06 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Problem Description

When using upserts inside a bulk request with an ingest pipeline, the {{{_id}}} template snippet does not resolve to the provided document id. This is a divergence from the regular, non-bulk upsert command where this template snippet does get resolved.

Some context / motivation

According to the Elastic documentation, the recommended way to paginate is search_after, with a tiebreaker field included in the sort. A natural candidate that meets the uniqueness requirement is the document _id, but it does not have doc_values enabled. The documentation states: _The _id field is restricted from use in aggregations, sorting, and scripting. In case sorting or aggregating on the _id field is required, it is advised to duplicate the content of the _id field into another field that has doc_values enabled._ An ingest pipeline that copies {{{_id}}} into such a field is the natural way to do this. Without support for this use case, we cannot use search_after if our system requires bulk upserts.
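To make the motivation concrete, here is a minimal Python sketch of the request bodies that search_after paging with a tiebreaker would need. The helper name `build_search_page` is hypothetical, and the field names `text.keyword` and `pipeline.final.docId.keyword` assume the default dynamic mapping (string fields indexed as text with a keyword subfield); neither is confirmed by the setup in this issue. The sketch only builds the JSON bodies and does not contact a cluster.

```python
import json


def build_search_page(sort_field, tiebreaker_field, page_size, search_after=None):
    """Build a search body sorted on a primary field plus a unique tiebreaker.

    `search_after` should be the `sort` array of the last hit on the
    previous page, or None for the first page.
    """
    body = {
        "size": page_size,
        "sort": [
            {sort_field: "asc"},
            # The tiebreaker must be unique per document, which is why a
            # copy of _id in a doc_values-enabled field is attractive.
            {tiebreaker_field: "asc"},
        ],
    }
    if search_after is not None:
        body["search_after"] = list(search_after)
    return body


# Assumed field names, see the note above.
first_page = build_search_page("text.keyword", "pipeline.final.docId.keyword", 10)

# Hypothetical sort values taken from the last hit of the first page.
next_page = build_search_page(
    "text.keyword",
    "pipeline.final.docId.keyword",
    10,
    search_after=["doc upserted", "2"],
)

print(json.dumps(next_page, indent=2))
```

If the tiebreaker field is empty for every bulk-upserted document, as in the reproduction below, it no longer distinguishes documents and the paging order becomes ambiguous.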

Steps to Reproduce

Setup Index

DELETE /upsert-test/

PUT /upsert-test/

PUT /_ingest/pipeline/test-default-pipeline
{
  "processors" : [
    { "set" : { "field" : "pipeline.default.executed", "value" : "true" } },
    { "set" : { "field" : "pipeline.default.docId", "value" : "{{{_id}}}" } }
  ]
}

PUT /_ingest/pipeline/test-final-pipeline
{
  "description" : "Ingest pipeline for setting paging context",
  "processors" : [
    { "set" : { "field" : "pipeline.final.executed", "value" : "true" } },
    { "set" : { "field" : "pipeline.final.docId", "value" : "{{{_id}}}" } }
  ]
}

PUT /upsert-test/_settings
{
  "index": {
    "default_pipeline": "test-default-pipeline",
    "final_pipeline": "test-final-pipeline"
  }
} 

Run bulk update

POST _bulk
{ "update" : {"_id" : "1", "_index" : "upsert-test"} }
{ "doc": { "text": "doc updated" }, "upsert": { "text": "doc upserted" }}

Check document

# Command
GET /upsert-test/_doc/1

# Output
{
  "_index" : "upsert-test",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "_seq_no" : 0,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "pipeline" : {
      "default" : {
        "docId" : "",
        "executed" : "true"
      },
      "final" : {
        "docId" : "",
        "executed" : "true"
      }
    },
    "text" : "doc upserted"
  }
}

Actual Result

Document 1 was upserted, and both the default and final pipelines were executed. However, docId was set to an empty string instead of the document id.

Expected Result

I'd expect the result of the bulk operation above to be the same as that of the following non-bulk upsert, which does set the docId field:

# Upsert document #2
POST /upsert-test/_update/2
{
  "upsert": { "text": "doc upserted" },
  "doc":    { "text": "doc updated" }
}

# Get document
GET /upsert-test/_doc/2

# Result
{
  "_index" : "upsert-test",
  "_type" : "_doc",
  "_id" : "2",
  "_version" : 1,
  "_seq_no" : 1,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "pipeline" : {
      "default" : {
        "docId" : "2",
        "executed" : "true"
      },
      "final" : {
        "docId" : "2",
        "executed" : "true"
      }
    },
    "text" : "doc upserted"
  }
}
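The difference between the two results can also be checked programmatically. The following Python sketch uses a hypothetical helper, `unresolved_doc_ids`, which compares each pipeline's docId against the document _id in a GET response. The sample dictionaries are abbreviated copies of the two responses shown in this issue; nothing here calls Elasticsearch.

```python
def unresolved_doc_ids(get_response):
    """Return the names of pipelines whose docId differs from the document _id."""
    doc_id = get_response["_id"]
    pipelines = get_response["_source"].get("pipeline", {})
    return sorted(
        name
        for name, fields in pipelines.items()
        if fields.get("docId") != doc_id
    )


# Abbreviated response for document 1 (bulk upsert).
bulk_result = {
    "_id": "1",
    "_source": {
        "pipeline": {
            "default": {"docId": "", "executed": "true"},
            "final": {"docId": "", "executed": "true"},
        },
        "text": "doc upserted",
    },
}

# Abbreviated response for document 2 (non-bulk upsert).
single_result = {
    "_id": "2",
    "_source": {
        "pipeline": {
            "default": {"docId": "2", "executed": "true"},
            "final": {"docId": "2", "executed": "true"},
        },
        "text": "doc upserted",
    },
}

print(unresolved_doc_ids(bulk_result))    # ['default', 'final']
print(unresolved_doc_ids(single_result))  # []
```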

Logs (if relevant)

No response

elasticsearchmachine commented 2 years ago

Pinging @elastic/es-data-management (Team:Data Management)