nik9000 commented 9 years ago

The time has come for Elasticsearch to implement a native API for reindexing! The first request I've found for this is #492, filed back in 2010. The Task Management API (#15117) will make this easier to manage. This meta ticket will cover the following use cases:

[x] Resharding
[x] Incompatible mapping updates
[x] touching documents to pick up mapping updates made on the fly
[x] Limiting the reindexed documents using a query
[x] Update-by-query style changes to portions of the index at a time

[ ] Copying an index from a remote cluster into this one

NOTICE: This meta issue gets fairly rambly from here on out. It will change. The list above will change. Everything is up for negotiation and everything needs to be prototyped before we're sure of anything. Things higher on the list are more likely to be in the final product.

Resharding

It'd work like:

# Stop writes to index

curl -XPUT localhost:9200/index_v2 -d'{
  "settings": {
    "number_of_shards": 10
  }
}'

curl -XPOST localhost:9200/a_single_command_to_start_copying_all_documents_from_index_v1_to_index_v2
# Save the returned task id

while ! curl -sf "localhost:9200/_task/$TASK_ID?pretty&awaitComplete"; do
  echo "not done"
done

# Do any manual checks that index_v2 is ok. Maybe warm it. Maybe raise its number of replicas if you built it with 0 replicas.

curl -XPOST localhost:9200/_aliases -d '{
    "actions": [
        { "remove": { "alias": "index", "index": "index_v1" }},
        { "add":    { "alias": "index", "index": "index_v2" }}
    ]
}
'

curl -XDELETE localhost:9200/index_v1

# Resume writes to index

You see from the example that it's not automatic or atomic. It's still an event, and it's very similar to an old blog post about changing mappings with no downtime. The advantages of this as opposed to the scroll implementation proposed in the blog post are:

Elasticsearch can handle the messy details of the scroll API, like sort: "_doc", clearing the context when the copy is done, and retrying when things fail.
Elasticsearch can optimize the process to the point where it can do filesystem level things rather than scroll. The first implementation of reindex won't support such optimizations but they are totally possible and could cut the runtime down significantly.

The two curl commands in the middle are the new bits. This should start a background task to perform the copy:

curl -XPOST localhost:9200/a_single_command_to_start_copying_all_documents_from_index_v1_to_index_v2

and this should block for a while waiting for the task to complete:

curl -s "localhost:9200/_task/$TASK_ID?pretty&awaitComplete"

This all piggybacks on the Task Management API (#15117), which isn't done yet, so it'll likely change. The reason this reindex command is a task is that it can take a long time. I, @nik9000, have personally seen these scroll type reindexes take hours for pretty big indexes. So if it's going to take hours you'll need a way to cancel it or throttle it. The task management API should have those ways, though I have no idea what they'll look like.
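For comparison, here is a minimal sketch of the copy and wait steps written against the `_reindex` and `_tasks` endpoints that the feature later shipped with. It is an illustration under assumptions, not the definitive invocation: `ES_HOST` and the helper names `reindex_body`, `start_reindex` and `task_status` are made up for this example, and parameter names may differ between versions.

```shell
#!/bin/sh
# Assumption: the cluster is reachable at ES_HOST (defaults to a local node).
ES_HOST="${ES_HOST:-localhost:9200}"

# Print the JSON body that asks _reindex to copy every document
# from index $1 into index $2. (Hypothetical helper name.)
reindex_body() {
  printf '{"source":{"index":"%s"},"dest":{"index":"%s"}}' "$1" "$2"
}

# Start the copy as a background task. The response contains a task id
# that should be saved for later status checks.
start_reindex() {
  curl -s -XPOST "$ES_HOST/_reindex?wait_for_completion=false" \
    -H 'Content-Type: application/json' \
    -d "$(reindex_body "$1" "$2")"
}

# Fetch the current status of the task with id $1.
task_status() {
  curl -s "$ES_HOST/_tasks/$1?pretty"
}
```

Calling `start_reindex index_v1 index_v2` and then `task_status` with the returned id would replace the two placeholder curl commands in the example above.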

You may ask "Why don't you combine the index creation, alias swap, and index delete into one task?" And that'd be a good question. It won't be part of the first implementation of this but might be part of later ones. Right now I don't like the idea very much. Keep reading. Maybe you'll agree with me. Maybe not. Leave a comment?

Incompatible mapping updates

These'll work almost just like resharding. So much so that I won't give a curl example because I trust you, dear reader, can figure it out. The manual check of the index becomes much more important in this case. It's fairly believable that you'd want to keep both indexes alive for a period of time to test both. A/B testing or something.

The other way that mapping updates differ from resharding is that filesystem level optimizations are much, much less likely.

`touch`ing documents to pick up mapping updates made on the fly

Some mapping updates can be made to an index on the fly but aren't picked up:

Adding a new field to a property
Adding a new property to a type when "dynamic": false

This offers a fairly complete example of how adding a field using the PUT mapping API works and how you could use the reindex API to touch the documents.

This use case differs from the resharding and incompatible mapping update use cases in that the document isn't being added to an empty index, it's being updated in an existing index. So if the reindex process goes to touch the document but it has changed since the scroll took its snapshot of the index, then the document shouldn't be changed. Luckily, Elasticsearch has built in support for optimistic concurrency control.
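The version check described above can be sketched as a small decision function. This only simulates the rule locally and does not talk to Elasticsearch; the names `can_touch` and `touch_or_skip` are hypothetical.

```shell
#!/bin/sh
# $1 = document version seen by the scroll snapshot
# $2 = document version at the time of the touch
# Succeeds only when nobody has written to the document in between.
can_touch() {
  [ "$1" -eq "$2" ]
}

# Report what the reindex process should do with the document.
touch_or_skip() {
  if can_touch "$1" "$2"; then
    echo "touch"
  else
    echo "skip"   # version conflict: the newer write wins, leave it alone
  fi
}
```

For example, `touch_or_skip 3 3` prints `touch`, while `touch_or_skip 3 4` prints `skip` because the document changed after the snapshot was taken.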

Limiting the reindexed documents using a query

This seems like the logical extension to the other use cases more than a use case on its own. It's just a useful optimization on top of the other use cases. For example, you could use a query to only touch documents modified after a certain time.
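As a rough sketch of the "modified after a certain time" example, the request body below adds a range query to the source side of a `_reindex` call. The helper name `reindex_since_body`, the field name and the date are assumptions chosen for illustration.

```shell
#!/bin/sh
# Print a _reindex body that copies only documents whose date field
# is at or after a cutoff.
# $1 = source index, $2 = destination index,
# $3 = date field name, $4 = cutoff value
reindex_since_body() {
  printf '{"source":{"index":"%s","query":{"range":{"%s":{"gte":"%s"}}}},"dest":{"index":"%s"}}' \
    "$1" "$3" "$4" "$2"
}
```

`reindex_since_body index_v1 index_v2 last_modified_time 2016-01-01` prints a body that could be sent with `curl -XPOST localhost:9200/_reindex -H 'Content-Type: application/json' -d ...`.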

Update-by-query style changes to portions of the index at a time

"Increment counter on all documents matching this query" is a fairly normal operation on a relational database and Elasticsearch could have it too. It's fairly different internally from the other proposals but could be quite compelling, though I admit to not having a good use case for it in mind. The trouble with this use case is that it tempts "increment counter on all documents" operations, which are fairly inefficient in Elasticsearch. It's fairly inefficient in any system with concurrency control and most of them implement it anyway, but Elasticsearch makes an effort to make it difficult to do very inefficient things. It's inefficient because in Elasticsearch an update is an atomic delete and index operation, and both of those operations are more expensive than their relational counterparts. The delete itself is just as cheap, but deleted documents have to be reclaimed a segment at a time rather than with the aggressive measures relational databases use. The index is much more expensive because the whole document has to be reanalyzed.

In many cases it'd be faster to copy the documents to a new index and then do the alias swap dance than it would be to touch every document in the index.

Even with all that it may be a fairly useful API to implement.
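A minimal sketch of what "increment counter on all documents matching this query" could look like against the `_update_by_query` endpoint. The helper names, field names and `ES_HOST` are assumptions; the script key is `source` in recent releases, while older releases used `inline`.

```shell
#!/bin/sh
# Assumption: the cluster is reachable at ES_HOST (defaults to a local node).
ES_HOST="${ES_HOST:-localhost:9200}"

# Print a body that increments field $1 on every document
# where field $2 equals $3. (Hypothetical helper name.)
increment_body() {
  printf '{"script":{"source":"ctx._source.%s += 1","lang":"painless"},"query":{"term":{"%s":"%s"}}}' \
    "$1" "$2" "$3"
}

# Send the body $2 to index $1. conflicts=proceed skips documents that
# changed after the snapshot instead of aborting the whole request.
update_by_query() {
  curl -s -XPOST "$ES_HOST/$1/_update_by_query?conflicts=proceed" \
    -H 'Content-Type: application/json' \
    -d "$2"
}
```

Usage would be something like `update_by_query index "$(increment_body counter user kimchy)"`. Every matched document is still deleted and reindexed, so the cost concerns above apply.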

Copying an index from a remote cluster into this one

Maybe the most ambitious use case on the list, the idea here is to scroll on a remote cluster and index into the cluster handling the request. This seems like a sensible way to implement basic disaster recovery. It'd be better if the query could subscribe to updates and get them streamed back, but even as is it'd be fairly nice to run daily/hourly updates. Especially if the documents had a last_modified_time style column.
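Releases after the ones this ticket covers added a `remote` block to the source side of `_reindex` for this purpose. The body below is a sketch under that assumption: the helper name and host value are illustrative, and the remote host also has to be allowed in the receiving cluster's settings before such a request is accepted.

```shell
#!/bin/sh
# Print a _reindex body that scrolls index $2 on the remote cluster at $1
# and indexes the documents into local index $3. (Hypothetical helper name.)
remote_reindex_body() {
  printf '{"source":{"remote":{"host":"%s"},"index":"%s"},"dest":{"index":"%s"}}' \
    "$1" "$2" "$3"
}
```

`remote_reindex_body http://otherhost:9200 index index_copy` prints a body that would be posted to `_reindex` on the cluster receiving the documents.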

naivefun commented 9 years ago

Exactly +1

lukas-vlcek commented 9 years ago

@nik9000 nice! Is there any idea which ES version is targeted?

niemyjski commented 9 years ago

Please make this available in 3.0! Currently we use Foundatio to help us with this and it would be really, really nice to have this sooner rather than later: https://github.com/exceptionless/Foundatio/blob/master/src/Elasticsearch/Jobs/ReindexWorkItemHandler.cs

nik9000 commented 9 years ago

> @nik9000 nice! Is there any idea which ES version is targeted?

3.0.0 initially but I really want to backport it to 2.3 as well.

Another thing I should mention: I talked with @imotov who is doing the task management. All tasks will have an option for wait_until_completion to wait for the copy. It'll make using the API simpler for smaller copies but isn't something you'd want to use for large copies.

I've started the implementation for the first 4 use cases in #15125. Right now it's a plugin. I believe it'll be a prebundled plugin, which will be a new thing in 2.3.

xiaoshi2013 commented 9 years ago

Very nice looking

bdharrington7 commented 8 years ago

@nik9000 I'm curious, in your example for resharding you have comments indicating that we would have to stop indexing. What kind of steps would have to be taken if this wasn't possible?

nik9000 commented 8 years ago

> @nik9000 I'm curious, in your example for resharding you have comments indicating that we would have to stop indexing. What kind of steps would have to be taken if this wasn't possible?

You'll want to have some way of replaying the same updates on your application side against both indexes. At some point I'd like to be able to install a redirect mechanism in Elasticsearch for the duration of the reindex operation. It seems like an obvious thing. It's not part of what I'm working on now and it introduces yet more complexity around versioning, but it's important.

nik9000 commented 8 years ago

I've updated the list of things that reindex will do. Right now we don't have external cluster support and I don't know when that'll become a priority.

Right now this is what is left before we can merge the feature/reindex branch down to master:

[x] Progress from the task API: #16461
[x] Retry bulk failures if they are safe to retry. Like rejection exceptions. #16556
[x] Cancelation #16613
[x] Move it from a plugin to a module so it ships with Elasticsearch by default #16619
[x] Actually merge it to master #16861

Here are things that are left to do in the first phase of the project:

[x] Throttling
[x] Backporting to 2.x

honzakral commented 8 years ago

This is indeed a super useful API, cannot wait!

Would it be possible to also, in future versions, provide additional functionality to allow update operations on the target index instead of only index operations? My use case for this is entity centric indexing: imagine you have an index containing events and wish to group them by session. With the reindex API it should be possible to read the source events, apply a script (or just extract a field) to get the ID of a target document and pass it as a parameter to a specified update script.

Another use case we see a lot with users is that they want to move some data out of one index to another. Would it be possible to combine the reindex with delete-by-query essentially? After a document is indexed in the target index a delete operation will be issued on the source index. Of course this couldn't be done atomically, but even on best effort basis this would be super useful for a lot of people - essentially executing reindex and delete-by-query at the same time (on the same point in time snapshot of the index) with no additional guarantees than those two operations have individually.

I am happy to create individual issues for these use cases if they make sense to people.

nik9000 commented 8 years ago

I'm going to close this because reindex is done and live in 2.3.0 and 5.0.0-alpha1. I think @HonzaKral's point is really another feature request. @HonzaKral, can you make a new issue for it? Sorry!

honzakral commented 8 years ago

Done as #17998 and #17997

nik9000 commented 8 years ago

Thanks!

elastic / elasticsearch

Reindex API #15201
