elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch
Other
977 stars 24.82k forks source link

Using reindex api with slices in creating child documents have different routing id from parent id #28465

Open shribigb opened 6 years ago

shribigb commented 6 years ago

Elasticsearch version (bin/elasticsearch --version): 6.1.3

JVM version (java -version): 1.8-56

OS version (uname -a if on a Unix-like system): Ubuntu 16 LTS

Description of the problem including expected versus actual behavior:

When creating child documents using reindex api with slicing and painless scripting, I noticed that value for routing id was different from the parent id. When I re-tried without slicing I noticed expected behavior. To isolate the issue, I created single node single shard index.

Steps to reproduce:

  1. I first indexed all the parent docuements. I made following call using reindex API
    POST _reindex?slices=9&requests_per_second=-1&wait_for_completion=true
    {
    "source": {
     "index": "index-cutover-v2",
     "type": "files",
     "query": {"terms": {"filesets": [31712]}}
    },  
    "script": {
    "source": "if (ctx._type == 'files') {ctx._source.file_flag_join = params.file_flag_join; ctx._id=ctx._source.id+'file'; ctx._type='doc' } else {ctx.op = 'noop'}",
    "lang": "painless",
    "params": {
      "file_flag_join" : {
        "name": "file"
      }
    }
    },
    "dest": {
    "index": "index-cutover-v5"
    }
    }
  2. Once the reindexing was done I used _refresh API to make data available for searching.
  3. After step 2, I indexed child documents using following api call
    POST _reindex?slices=9&requests_per_second=-1&wait_for_completion=true
    {
    "source": {
     "index": "index-cutover-v2",
     "type": "flags",
     "query": {"terms": {"filesets": [31712]}}
    },  
    "script": {
    "source": "if (ctx._type == 'flags') {ctx._type='doc';ctx._id=ctx._source.id+'flag';ctx._routing=ctx._source.file+'file';ctx._source.file_flag_join = params.file_flag_join;ctx._source.file_flag_join.parent = ctx._routing;} else {ctx.op = 'noop'}",
    "lang": "painless",
    "params": {
      "file_flag_join" : {
        "name": "flag"
      }
    }
    },
    "dest": {
    "index": "index-cutover-v5"
    }
    }
    "
    1. After step 3 was finished I reran _refresh API.
    2. I tried to pull single flag document and I received following.
      {
      "_index": "index-cutover-v5",
      "_type": "doc",
      "_id": "1186231450flag",
      "_score": 2,
      "_routing": "110486171file",
      "_source": {
      "severity": 4,
      "reason": "",
      "file_flag_join": {
          "parent": "110486274file",
          "name": "flag"
      },
      "file": 110486171,
      "offset": 10111689,
      "filesets": [
          29825,
          31712
      ],
      "id": 1186231450
      }
      }

      If you notice, _routing and file_flag_join.parent have different values.

If I remove slicing and try again I dont see this behavior and I see consistency in the routing and parent values.

jimczi commented 6 years ago

The issue is caused by the inner map file_flag_join in the params of the script. If this inner map is copied in the context document and then modified to add new fields, this also modifies the original params. It works when slicing is off because you always override the new field with its new value but when multiple slices execute the same script they modify this inner map concurrently which leads to a race condition. The solution is to clone the map in the source like this: ctx._source.file_flag_join = new HashMap(params.file_flag_join)

We could force a deep copy of the params before execution or make the inner map immutable to avoid this trappy behavior, @nik9000 WDYT ?

nik9000 commented 6 years ago

I think it'd be breaking to make those maps immutable but it is pretty tempting. It feels cleaner than making a copy every time though.

bleskes commented 6 years ago

@nik9000 what should we do here? this feels like serious bug?

nik9000 commented 6 years ago

Either of the things I suggested should work to be honest. It might be better to make a script context for reindex and see if that solves the problem. It might. And if it doesn't we can build on it from there.

elasticmachine commented 5 years ago

Pinging @elastic/es-distributed

fcofdez commented 2 years ago

@henningandersen and I discussed this issue during a triage session and we think this issue should be moved to the script label as the issue seems to be related to how the script is set up using a non thread-safe data structure in the script (maybe this should be forbidden?)

elasticsearchmachine commented 2 years ago

Pinging @elastic/es-core-infra (Team:Core/Infra)

stu-elastic commented 2 years ago

This is an on-going issue with params, we should provide a way for users to access params in a safe way. Since we have metadata() and field(<path>) as replacements for ctx, perhaps a fields-like params API would be worthwhile.