elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Reindex API parent set to null, removes routing as well #26183

Closed saurabhjaluka closed 7 years ago

saurabhjaluka commented 7 years ago

Elasticsearch version: 5.5.1

Plugins installed: none

JVM version: 1.8

OS version: Ubuntu

Description of the problem including expected versus actual behavior:

SourceIndex: has a parent field in the documents
DestinationIndex: no parent field in the documents

When I use the reindex API with a Painless script to migrate data from source to destination and set parent to null in the script, routing is also set to null. My requirement is just to remove the parent field and keep the routing field in the destination.

Expected behavior: Routing should not be set to null; only parent should be set to null.

Steps to reproduce:

Source index:

curl -XPUT "http://localhost:9200/source-index/?pretty" -d '{
  "mappings": {
      "post": {
         "_parent": { "type": "parent_type" },
         "properties": {
              "title": {
                 "type" : "keyword"
              },
              "description": {
                  "type" : "keyword"
              },
              "articleId": {
                  "type" : "keyword"
              },
              "engineId":{
                  "type": "keyword"
              }
          }
      },
      "parent_type": {
           "properties": {
             "engineId": {
               "type": "keyword"
             },
             "groupIds":{
               "type": "long"
            }
          }
      }
  }
}'

Destination index:

curl -XPUT "http://localhost:9200/dest-index/?pretty" -d '{
  "mappings": {
      "post": {
         "properties": {
              "title": {
                 "type" : "keyword"
              },
              "description": {
                  "type" : "keyword"
              },
              "articleId": {
                  "type" : "keyword"
              },
              "engineId":{
                  "type": "keyword"
              }
          }
      },
      "parent_type": {
           "properties": {
             "engineId": {
               "type": "keyword"
             },
             "groupIds":{
               "type": "long"
            }
          }
      }
  }
}'

Sample Document:

curl -XPOST "localhost:9200/source-index/post/1?routing=12345&parent=12345" -d '{"articleId": "abcd", "title":"hello 1", "description":"this is a test document","engineId":"12345"}'

Fetch Document to verify routing and parent field:

curl localhost:9200/source-index/_search

Response:
{
    "took": 2,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 1,
        "max_score": 1,
        "hits": [
            {
                "_index": "source-index",
                "_type": "post",
                "_id": "1",
                "_score": 1,
                "_routing": "12345",
                "_parent": "12345",
                "_source": {
                    "articleId": "abcd",
                    "title": "hello 1",
                    "description": "this is a test document",
                    "engineId": "12345"
                }
            }
        ]
    }
}

Reindex api:

curl -XPOST 'localhost:9200/_reindex?pretty' -H 'Content-Type: application/json' -d'
{
  "source": {
    "index": "source-index"
  },
  "dest": {
    "index": "dest-index",
    "routing":"keep"
  },
 "script":{
 "inline":"ctx._parent = null;"
 }
}'

Verify document at destination index:

curl localhost:9200/dest-index/_search

Response:
{
    "took": 0,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "failed": 0
    },
    "hits": {
        "total": 1,
        "max_score": 1,
        "hits": [
            {
                "_index": "dest-index",
                "_type": "post",
                "_id": "1",
                "_score": 1,
                "_source": {
                    "articleId": "abcd",
                    "description": "this is a test document",
                    "title": "hello 1",
                    "engineId": "12345"
                }
            }
        ]
    }
}

Parent is absent, which is correct, but routing is absent as well.

Helpful information:

I debugged the ES code and traced this to the apply function in https://github.com/elastic/elasticsearch/blob/v5.5.1/modules/reindex/src/main/java/org/elasticsearch/index/reindex/AbstractAsyncBulkByScrollAction.java. When the parent changes to newValue, the function scriptChangedParent (definition: https://github.com/elastic/elasticsearch/blob/v5.5.1/modules/reindex/src/main/java/org/elasticsearch/index/reindex/TransportReindexAction.java) also sets routing to the new value of parent.

The next if block, the one for routing, leaves the routing field as it is because newValue equals oldValue (12345 in my case). But routing was already set to null in the previous step.
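
To make the sequence above easier to follow, here is a minimal Python sketch of the two checks as I understand them. It is only a model of the behavior described in this comment, not the actual Java source; the `IndexRequest` class and `apply_script_changes` function are hypothetical names made up for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class IndexRequest:
    """Hypothetical stand-in for the per-document request that reindex builds."""
    parent: Optional[str]
    routing: Optional[str]


def apply_script_changes(request, old_ctx, new_ctx):
    """Copy script-made metadata changes onto the request, parent first."""
    # Step 1: the parent check. As described above, a changed parent
    # also overwrites routing with the new parent value (here: None).
    if new_ctx["_parent"] != old_ctx["_parent"]:
        request.parent = new_ctx["_parent"]
        request.routing = new_ctx["_parent"]
    # Step 2: the routing check compares against the value the script saw.
    # An untouched routing counts as "no change", so it is never restored.
    if new_ctx["_routing"] != old_ctx["_routing"]:
        request.routing = new_ctx["_routing"]
    return request


old_ctx = {"_parent": "12345", "_routing": "12345"}
new_ctx = dict(old_ctx)
new_ctx["_parent"] = None  # models `ctx._parent = null;` in the script

result = apply_script_changes(IndexRequest("12345", "12345"), old_ctx, new_ctx)
print(result)  # IndexRequest(parent=None, routing=None)
```

In this model the routing is lost in step 1, and step 2 never puts it back because the script did not touch `_routing`.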

Let me know if extra info is required. Also, I would love to contribute a fix for this.

nik9000 commented 7 years ago

I see the problem and agree with your analysis. One problem is that _parent is no longer configurable in 6.0, which is (probably) the first version that'd get this fix. For indexes created after 6.0, _parent has been replaced by join fields, which don't have this problem because they require explicitly setting the routing everywhere. So you aren't super likely to be able to use this fix by the time it is ready.

I wonder if a workaround is good enough. Something like always changing the routing in a consistent way so that the condition triggers. It isn't clean, but it'd work. Another option is to perform this reindex manually and/or do it with one of the reindex helpers, like the one in the Python or Perl client. They likely don't have this issue.
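
A rough sketch of the manual route with the Python client, assuming a 5.x-era `elasticsearch` package where bulk actions accept a `_routing` key. It uses `helpers.scan` and `helpers.bulk` rather than the client's own reindex helper; `to_bulk_action` and `migrate` are hypothetical names, and none of this has been run against the cluster from this issue.

```python
def to_bulk_action(hit, dest_index):
    """Turn one search hit into a bulk index action without `_parent`."""
    action = {
        "_op_type": "index",
        "_index": dest_index,
        "_type": hit["_type"],
        "_id": hit["_id"],
        "_source": hit["_source"],
    }
    # Carry routing over explicitly; `_parent` is deliberately left out.
    if "_routing" in hit:
        action["_routing"] = hit["_routing"]
    return action


def migrate(client, source_index, dest_index):
    """Scroll through the source index and bulk-index into the destination."""
    # Imported here so the pure helper above works without the client installed.
    from elasticsearch import helpers

    hits = helpers.scan(client, index=source_index)
    actions = (to_bulk_action(hit, dest_index) for hit in hits)
    return helpers.bulk(client, actions)


# A hit shaped like the search response earlier in this issue.
sample_hit = {
    "_index": "source-index",
    "_type": "post",
    "_id": "1",
    "_routing": "12345",
    "_parent": "12345",
    "_source": {"articleId": "abcd", "title": "hello 1"},
}
action = to_bulk_action(sample_hit, "dest-index")
print(action)
```

Calling `migrate(Elasticsearch(["localhost:9200"]), "source-index", "dest-index")` would then copy the documents, with routing set per document and no parent.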

saurabhjaluka commented 7 years ago

Thanks @nik9000 . Yeah, I might go for an approach for reindexing using logstash for now.

saurabhjaluka commented 7 years ago

Workaround in case anyone needs it:

migration.sh

pathToLogstash="<path-to-logstash>"
sourceHost="localhost:9200"
targetHost="localhost:9200"

sourceIndex="source-index"
targetIndex="dest-index"

input="input { elasticsearch { hosts => [\"${sourceHost}\"] index => \"$sourceIndex\" size => 5000 scroll => \"5m\" docinfo => true } }"

filter="filter { json { source => \"message\" } mutate { remove_field => [ \"@version\" ] remove_field => [\"@timestamp\"] remove_field => [ \"_parent\" ]} }"

output="output { elasticsearch { index => \"$targetIndex\" hosts => [\"${targetHost}\"] document_type => \"%{[@metadata][_type]}\" document_id => \"%{[@metadata][_id]}\" routing => \"%{engineKey}\" manage_template => false } }"

${pathToLogstash} -e "${input} ${filter} ${output}"

nik9000 commented 7 years ago

> Thanks @nik9000 . Yeah, I might go for an approach for reindexing using logstash for now.

Thanks for understanding!

I'm going to close this issue as "wontfix". Sorry!

benbenwilde commented 6 years ago

Why didn't we fix this? I am having issues with this today.

benbenwilde commented 6 years ago

@nik9000 I have found a better workaround for this issue that works without having to use a different client or having to modify document ids.

In a painless script, if you want to change/remove _parent but not change _routing:

ctx._parent = null;
ctx._routing = new StringBuffer(ctx._routing);
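
For completeness, a small Python sketch that builds the full `_reindex` request body with this script, reusing the index names and the `"routing": "keep"` setting from the original report. `build_reindex_body` is a hypothetical helper name. My understanding is that the trick works because a StringBuffer never compares equal to the original String, so the routing check sees a change and writes the value back; treat that as an assumption rather than a confirmed reading of the source.

```python
import json

# The two-line Painless workaround from this comment, joined into one script.
PAINLESS_WORKAROUND = (
    "ctx._parent = null; "
    "ctx._routing = new StringBuffer(ctx._routing);"
)


def build_reindex_body(source_index, dest_index):
    """Build the JSON body for POST /_reindex with the workaround script."""
    return {
        "source": {"index": source_index},
        "dest": {"index": dest_index, "routing": "keep"},
        "script": {"inline": PAINLESS_WORKAROUND},
    }


body = build_reindex_body("source-index", "dest-index")
print(json.dumps(body, indent=2))
```

The printed JSON can be passed to curl with `-d` exactly like the reindex call in the original report. Documents that have no routing at all may need a null check in the script; I have not verified that case.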

saurabhjaluka commented 6 years ago

@benbenwilde that's an easy solution, I don't know why I did not think about it. Glad to know you found the solution.

lwpk110 commented 5 years ago

@benbenwilde Hey, bro, you solved my problem. I'm giving you 100 likes.