elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Bulk API documentation does not indicate that, for update operations, pipelines are only supported with the `upsert` action and no other update actions #114811

Open bczifra opened 1 week ago

bczifra commented 1 week ago

Elasticsearch Version

All versions since ingest pipelines were introduced

Installed Plugins

No response

Java Version

bundled

OS Version

N/A

Problem Description

The Bulk API documentation mentions a pipeline query parameter:

(Optional, string) ID of the pipeline to use to preprocess incoming documents. If the index has a default ingest pipeline specified, then setting the value to _none disables the default ingest pipeline for this request. If a final pipeline is configured it will always run, regardless of the value of this parameter.

However, it doesn't indicate that for update operations, this parameter is only supported for upsert actions and no other update actions. Moreover, the final pipeline will not run for update operations.

I believe this is because Elasticsearch doesn't have access to the complete document for the other update actions and, as such, it can't reliably execute any ingest pipelines because they may rely on the other fields for that document.
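To illustrate the hypothesis above, here is a minimal Python sketch. It is not Elasticsearch code: `set_copy_from` is a hypothetical toy stand-in for the `set` processor with `copy_from`, and the merge is a plain dictionary merge. It only shows why a processor that reads another field can give a different result on a partial update payload than on the full document.

```python
from copy import deepcopy


def set_copy_from(doc, field, copy_from):
    """Toy stand-in (hypothetical) for a `set` processor using `copy_from`.

    Copies doc[copy_from] into doc[field]; skips missing or empty values,
    loosely mimicking `ignore_empty_value: true`.
    """
    out = deepcopy(doc)
    value = out.get(copy_from)
    if value not in (None, ""):
        out[field] = value
    return out


# Document as stored in the index, and a partial update that omits field_a.
stored = {"field_a": "value-a-1"}
partial = {"field_c": "value-c-1"}

# Processor applied to the partial payload only: field_a is not visible,
# so nothing can be copied into field_b.
on_partial = set_copy_from(partial, "field_b", "field_a")

# Processor applied to the merged document: field_a is visible.
merged = {**stored, **partial}
on_merged = set_copy_from(merged, "field_b", "field_a")

print(on_partial)
print(on_merged)
```

The two results differ only because the partial payload lacks the field the processor reads from, which is the situation described above.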

Please document this behavior: searching for pipeline on the Bulk API page doesn't yield any related information.

Related:

Steps to Reproduce

  1. Create an ingest pipeline
PUT _ingest/pipeline/bulk-test-pipeline
{
  "processors": [
    {
      "set": {
        "field": "field_b",
        "copy_from": "field_a",
        "override": true,
        "ignore_empty_value": true,
        "ignore_failure": true
      }
    }
  ]
}
  2. Create a document using the bulk API and use the above pipeline
POST _bulk?pipeline=bulk-test-pipeline
{"create":{"_index":"bulk-test","_id":"doc_1"}}
{"field_a": "value-a-1"}

Results: Expected. On document creation, the pipeline ran as expected and the value from field_a is correctly copied to field_b:

    "hits": [
      {
        "_index": "bulk-test",
        "_id": "doc_1",
        "_score": 1,
        "_source": {
          "field_a": "value-a-2",
          "field_b": "value-a-1"
        }
      }
    ]
  }
}
  3. Update the above document using the bulk API + pipeline:
POST _bulk?pipeline=bulk-test-pipeline
{"update":{"_index":"bulk-test","_id":"doc_1"}}
{"doc": {"field_a": "value-a-2"}}

Results: Unexpected. On document update, the pipeline parameter does not seem to be honoured. The new value from field_a is not copied to field_b:

{
  "_index": "bulk-test",
  "_id": "doc_1",
  "_version": 2,
  "_seq_no": 1,
  "_primary_term": 1,
  "found": true,
  "_source": {
    "field_a": "value-a-2",
    "field_b": "value-a-1"
  }
}
  4. Create a second document with the bulk API + pipeline:
POST _bulk?pipeline=bulk-test-pipeline
{"create":{"_index":"bulk-test","_id":"doc_2"}}
{"field_c": "value-c-1"}

Results: Expected. This document contains an existing field that is unrelated to the ingest pipeline. There is currently no value for field_a or field_b:

{
  "_index": "bulk-test",
  "_id": "doc_2",
  "_version": 1,
  "_seq_no": 2,
  "_primary_term": 1,
  "found": true,
  "_source": {
    "field_c": "value-c-1"
  }
}
  5. Update the second document with bulk API + pipeline:
POST _bulk?pipeline=bulk-test-pipeline
{"update":{"_index":"bulk-test","_id":"doc_2"}}
{"doc": {"field_a": "value-a-1"}}

Results: Unexpected. On document update, the pipeline parameter does not seem to be honoured. The value from field_a is not copied to field_b. This test shows that the issue is not related to the `override` flag on the Set processor in the pipeline.

{
  "_index": "bulk-test",
  "_id": "doc_2",
  "_version": 2,
  "_seq_no": 3,
  "_primary_term": 1,
  "found": true,
  "_source": {
    "field_c": "value-c-1",
    "field_a": "value-a-1"
  }
}
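
For anyone who wants to script the reproduction above, the following Python sketch builds the same bulk request bodies as newline-delimited JSON. It only constructs the request path and bodies and does not send anything. The helper names (`bulk_body`, `create_action`, `update_action`) are made up for this example and are not part of any Elasticsearch client.

```python
import json


def bulk_body(pairs):
    """Serialize (action, payload) pairs as NDJSON.

    The Bulk API expects one JSON object per line and a trailing newline
    after the last line.
    """
    lines = []
    for action, payload in pairs:
        lines.append(json.dumps(action))
        lines.append(json.dumps(payload))
    return "\n".join(lines) + "\n"


def create_action(index, doc_id, source):
    # Hypothetical helper: a `create` action followed by the full source.
    return ({"create": {"_index": index, "_id": doc_id}}, source)


def update_action(index, doc_id, partial):
    # Hypothetical helper: an `update` action followed by a partial `doc`.
    return ({"update": {"_index": index, "_id": doc_id}}, {"doc": partial})


# Same path and pipeline name as in the steps above.
path = "/_bulk?pipeline=bulk-test-pipeline"

# One request body per bulk step of the reproduction, in order.
requests = [
    bulk_body([create_action("bulk-test", "doc_1", {"field_a": "value-a-1"})]),
    bulk_body([update_action("bulk-test", "doc_1", {"field_a": "value-a-2"})]),
    bulk_body([create_action("bulk-test", "doc_2", {"field_c": "value-c-1"})]),
    bulk_body([update_action("bulk-test", "doc_2", {"field_a": "value-a-1"})]),
]

for body in requests:
    print("POST", path)
    print(body, end="")
```

Each body can be sent with any HTTP client, using the `application/x-ndjson` content type, to the path shown.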

Logs (if relevant)

No response

elasticsearchmachine commented 1 week ago

Pinging @elastic/es-docs (Team:Docs)

elasticsearchmachine commented 1 week ago

Pinging @elastic/es-data-management (Team:Data Management)

dakrone commented 1 week ago

Relates to https://github.com/elastic/elasticsearch/issues/104941