elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch
Other
1.24k stars 24.85k forks source link

Ingest Script Processor can leak mutable parameter references #91977

Open joegallo opened 1 year ago

joegallo commented 1 year ago

Consider a pipeline like the following:

PUT _ingest/pipeline/test-pipeline-1
{
  "processors" : [
    {
      "script": {
        "lang": "painless",
        "params": {
          "list": [1,2,3,4]
        },
        "source": "ctx.list = params.get('list');"
      }
    },
    {
      "append": {
        "field": "list",
        "value": 10
      }
    }
  ]
}

And then repeatedly invoke the _simulate API against that pipeline:

POST /_ingest/pipeline/test-pipeline-1/_simulate
{
  "docs": [
    {
      "_index": "index",
      "_id": "id",
      "_source": {
        "message": "hello"
      }
    }
  ]
}

Each invocation will result in growing a new '10' because the script is leaking a reference to the underlying mutable 'list' parameter:

{
  "docs" : [
    {
      "doc" : {
        "_index" : "index",
        "_id" : "id",
        "_version" : "-3",
        "_source" : {
          "message" : "hello",
          "list" : [
            1,
            2,
            3,
            4,
            10,
            10,
            10,
            10,
            10,
            10,
            10 # weeeeeeeeeeeeeeeeeeeeeee
          ]
        },
        "_ingest" : {
          "timestamp" : "2022-11-28T17:46:35.03053Z"
        }
      }
    }
  ]
}

Fixing this in the calling code is relatively easy, the pipeline should be rewritten as:

PUT _ingest/pipeline/test-pipeline-1
{
  "processors" : [
    {
      "script": {
        "lang": "painless",
        "params": {
          "list": [1,2,3,4]
        },
        "source": "ctx.list = new ArrayList(params.get('list'));"
      }
    },
    {
      "append": {
        "field": "list",
        "value": 10
      }
    }
  ]
}

Now the list parameter is defensively copied and so we're no longer banging on an unintentionally static object.


Note: the params object itself is unmodifiable, you can't invent/inject a new parameter at runtime via script, however it's only shallowly immutable -- the tree of maps and lists that it represents is made of mutable maps and lists.

IMHO we should update ScriptProcessor to either return a shared deeply immutable param tree, or we should duplicate the params tree on each invocation in which case it's still mutable but no longer shared.

The current behavior is harmless enough when ingest pipelines merely leak the mutable reference, but it can cause rare and hard to track down race-condition bugs when the pipelines actually do mutate the shared mutable reference.

elasticsearchmachine commented 1 year ago

Pinging @elastic/es-data-management (Team:Data Management)

joegallo commented 1 year ago

Based on discussion today, the next step here is to benchmark the non-breaking change version of this where the params are deeply copied and not directly mutable. (Or, perhaps, copy on write?)

joegallo commented 1 month ago

When we get around to fixing this bug, we need to remember that parameters can be deeply nested. One of the parameters could be a list of maps of lists of maps, etc.

In an ideal world, none of these ever would have been mutable to begin with. Since that ship has sailed, however, I imagine the simplest version to implement is to always build a new parameters object-tree on each invocation, which isn't necessarily super performant.

And of course a deeply nested copy on write object tree is easy enough to speculate about on paper, but devilishly hard to implement...

joegallo commented 1 month ago

https://github.com/elastic/elasticsearch/issues/53679 seems to be related to this, and I agree with https://github.com/elastic/elasticsearch/issues/53679#issuecomment-773550143 about the most ideal solution -- the question in my mind is how we get there from here (what with it being a breaking change to some people, at least).