Closed nik9000 closed 8 years ago
Exactly +1
@nik9000 nice! Is there any idea which ES version is targeted?
Please make this available in 3.0! Currently we use foundatio to help us with this and it would be really really nice to have this sooner than later: https://github.com/exceptionless/Foundatio/blob/master/src/Elasticsearch/Jobs/ReindexWorkItemHandler.cs
3.0.0 initially but I really want to backport it to 2.3 as well.
Another thing I should mention: I talked with @imotov who is doing the task management. All tasks will have an option for `wait_until_completion` to wait for the copy. It'll make using the API simpler for smaller copies but isn't something you'd want to use for large copies.
I've started the implementation for the first 4 use cases in #15125. Right now it's a plugin. I believe it'll be a prebundled plugin, which will be a new thing in 2.3.
Very nice looking
@nik9000 I'm curious, in your example for resharding you have comments indicating that we would have to stop indexing. What kind of steps would have to be taken if this wasn't possible?
You'll want to have some way of replaying the same updates on your application side against both indexes. At some point I'd like to be able to install a redirect mechanism in Elasticsearch for the duration of the reindex operation. It seems like an obvious thing. It's not part of what I'm working on now and it introduces yet more complexity around versioning, but it's important.
I've updated the list of things that reindex will do. Right now we don't have external cluster support and I don't know when that'll become a priority.
Right now this is what is left before we can merge the feature/reindex branch down to master:
Here are things that are left to do in the first phase of the project:
This is indeed a super useful API, cannot wait!
Would it be possible, in future versions, to also allow `update` operations on the target index rather than only `index` operations? My use case for this is entity-centric indexing: imagine you have an index containing events and wish to group them by session. With the reindex API it should be possible to read the source events, apply a script (or just extract a field) to get the ID of a target document, and pass it as a parameter to a specified update script.
Another use case we see a lot with users is that they want to move some data out of one index to another. Would it be possible to essentially combine the reindex with delete-by-query? After a document is indexed in the target index, a `delete` operation would be issued on the source index. Of course this couldn't be done atomically, but even on a best-effort basis this would be super useful for a lot of people: essentially executing reindex and delete-by-query at the same time (on the same point-in-time snapshot of the index) with no guarantees beyond those the two operations have individually.
I am happy to create individual issues for these use cases if they make sense to people.
I'm going to close this because reindex is done and live in 2.3.0 and 5.0.0-alpha1. I think @HonzaKral's point is really another feature request. @HonzaKral, can you make a new issue for it? Sorry!
Done as #17998 and #17997
Thanks!
The time has come for Elasticsearch to implement a native API for reindexing! The first request I've found for this is #492, filed back in 2010. The Task Management API (#15117) will make this easier to manage. This meta ticket will cover the following use cases:
* `touch`ing documents to pick up mapping updates made on the fly
* [ ] Copying an index from a remote cluster into this one
NOTICE: This meta issue gets fairly rambly from here on out. It will change. The list above will change. Everything is up for negotiation and everything needs to be prototyped before we're sure of anything. Things higher on the list are more likely to be in the final product.
Resharding
It'd work like:
You see from the example that it's not automatic or atomic. It's still an event and it's very similar to an old blog post about changing mappings with no downtime. The advantages of this as opposed to the scroll implementation proposed in the blog post are:
`sort: "_doc"` and clearing the context when the copy is done and retrying when things fail.
The two curl commands in the middle are the new bits. This should start a background task to perform the copy:
and this should block for a while waiting for the task to complete:
This all piggybacks on the Task Management API (#15117) which isn't done yet, so it'll likely change. The reason this reindex command is a task is that it can take a long time. I, @nik9000, have personally seen these scroll type reindexes take hours for pretty big indexes. So if it's going to take hours you'll need a way to cancel it or throttle it. And the task management API should have those ways, though I have no idea what they'll look like.
You may ask "Why don't you combine the index creation, alias swap, and index delete into one task?" And that'd be a good question. It won't be part of the first implementation of this but might be part of later ones. Right now I don't like the idea very much. Keep reading. Maybe you'll agree with me. Maybe not. Leave a comment?
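The manual sequence discussed in this section (create the new index, copy, swap the alias, delete the old index) can be sketched as an ordered list of requests. This is only a rough illustration: the `_reindex` path and body follow the shape of the API that eventually shipped in 2.3, not necessarily the syntax proposed in this issue, and the helper name and its arguments are made up for the example.

```python
def resharding_plan(alias, old_index, new_index, shards):
    """Return the (method, path, body) requests for a manual reshard, in order.

    Hypothetical helper: it only builds the requests, it does not send them.
    The manual check of the new index happens between steps 2 and 3.
    """
    return [
        # 1. Create the destination index with the new shard count.
        ("PUT", "/" + new_index, {"settings": {"number_of_shards": shards}}),
        # 2. Copy the documents (body shape of the API as later shipped).
        ("POST", "/_reindex", {
            "source": {"index": old_index},
            "dest": {"index": new_index},
        }),
        # 3. Swap the alias in a single _aliases call.
        ("POST", "/_aliases", {"actions": [
            {"remove": {"index": old_index, "alias": alias}},
            {"add": {"index": new_index, "alias": alias}},
        ]}),
        # 4. Drop the old index once you are happy with the new one.
        ("DELETE", "/" + old_index, None),
    ]
```

Keeping the steps separate, rather than hiding them in one call, leaves room for the manual check before the alias swap, which is the reason given above for not combining them into one task.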
Incompatible mapping updates
These'll work almost just like resharding. So much so that I won't give a curl example because I trust you, dear reader, can figure it out. The manual check of the index becomes much more important in this case. It's fairly believable that you'd want to keep both indexes alive for a period of time to test both. A/B testing or something.
The other way that mapping updates differ from resharding is that filesystem-level optimizations are much, much less likely.
`touch`ing documents to pick up mapping updates made on the fly
Some mapping updates can be made to an index on the fly but aren't picked up:
"dynamic": false
This offers a fairly complete example of how adding a field using the PUT mapping API works and how you could use the reindex API to `touch` the documents.
This use case differs from the resharding and incompatible mapping update use cases in that the document isn't being added to an empty index; it's being updated in an existing index. So if the reindex process goes to touch a document that has changed since the scroll took its snapshot of the index, the document shouldn't be changed. Luckily, Elasticsearch has built-in support for optimistic concurrency control.
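To make the concurrency rule concrete, here is a small in-memory simulation of version-checked writes. It is not Elasticsearch code: `Index`, `VersionConflict`, and `touch_all` are hypothetical names invented for this sketch, and the only assumption carried over from the text is that a touch is skipped when the document's version no longer matches the one the scroll saw.

```python
class VersionConflict(Exception):
    """Raised when a write expects a version that is no longer current."""


class Index:
    """Toy stand-in for an index: doc_id -> (source, version)."""

    def __init__(self):
        self.docs = {}

    def index(self, doc_id, source, expected_version=None):
        current = self.docs.get(doc_id)
        if expected_version is not None:
            # Optimistic concurrency: refuse the write if someone got there first.
            if current is None or current[1] != expected_version:
                raise VersionConflict(doc_id)
        new_version = 1 if current is None else current[1] + 1
        self.docs[doc_id] = (dict(source), new_version)
        return new_version


def touch_all(index, snapshot):
    """Re-index each snapshotted doc unchanged, skipping docs modified since.

    snapshot maps doc_id -> (source, version) as seen when the scroll started.
    Returns (touched_ids, skipped_ids).
    """
    touched, skipped = [], []
    for doc_id, (source, version) in snapshot.items():
        try:
            index.index(doc_id, source, expected_version=version)
            touched.append(doc_id)
        except VersionConflict:
            skipped.append(doc_id)
    return touched, skipped
```

A document that was written between the snapshot and the touch ends up in the skipped list and keeps its newer content, which is the behaviour described above.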
Limiting the reindexed documents using a query
This seems like the logical extension to the other use cases more than a use case on its own. It's just a useful optimization on top of the other use cases. For example, you could use a query to only `touch` documents modified after a certain time.
Update-by-query style changes to portions of the index at a time
"Increment `counter` on all documents matching this query" is a fairly normal operation on a relational database and Elasticsearch could have it too. It's fairly different internally from the other proposals but could be quite compelling, though I admit to not having a good use case for it in mind. The trouble with this use case is that it tempts "increment `counter` on all documents" operations, which are fairly inefficient in Elasticsearch. That kind of operation is fairly inefficient in any system with concurrency control and most of them implement it anyway, but Elasticsearch makes an effort to make it difficult to do very inefficient things. It's inefficient because in Elasticsearch an update is an atomic delete and index operation, and both of those operations are more expensive than their relational counterparts. The delete itself is just as cheap, but deleted documents have to be reclaimed a segment at a time rather than with the aggressive measures relational databases use. The index is much more expensive because the whole document has to be reanalyzed.
In many cases it'd be faster to copy the documents to a new index and then do the alias swap dance on it than it would be to update every document in the index.
Even with all that it may be a fairly useful API to implement.
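For illustration, here is one shape such a request could take. It is modeled on the `_update_by_query` endpoint that Elasticsearch later shipped, not on anything specified in this issue; the helper name is hypothetical, and the script key is an assumption that varies by version (`inline` in 2.x and 5.x, `source` in later releases).

```python
def increment_counter_request(index, query, field="counter"):
    """Build (method, path, body) for "increment `field` on docs matching query".

    Hypothetical helper: it only builds the request, it does not send it.
    """
    body = {
        # Restrict the update to a portion of the index.
        "query": query,
        # Assumption: "inline" is the 2.x/5.x key; later versions use "source".
        "script": {"inline": "ctx._source.%s += 1" % field},
    }
    return "POST", "/%s/_update_by_query" % index, body
```

Passing a narrow query keeps this away from the "on all documents" case that the text above warns about.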
Copying an index from a remote cluster into this one
Maybe the most ambitious use case on the list, the idea here is to scroll on a remote cluster and index into the cluster handling the request. This seems like a sensible way to implement basic disaster recovery. It'd be better if the query could subscribe to updates and get them streamed back, but even as is it'd be fairly nice to run daily/hourly updates. Especially if the documents had a `last_modified_time` style column.
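A periodic incremental pull could look roughly like the request body below. The `remote.host` form follows the reindex-from-remote support that shipped later (in 5.0), not a syntax defined in this issue; the helper name, the host, and the index names are placeholders, and `last_modified_time` is the hypothetical field from the paragraph above.

```python
def remote_reindex_body(remote_host, source_index, dest_index, since):
    """Build a reindex body that pulls only recently modified docs from a remote cluster.

    Hypothetical helper; `since` is whatever timestamp the previous run ended at.
    """
    return {
        "source": {
            # The remote cluster to scroll on.
            "remote": {"host": remote_host},
            "index": source_index,
            # Only fetch documents modified at or after the last run.
            "query": {"range": {"last_modified_time": {"gte": since}}},
        },
        # The index in the cluster handling the request.
        "dest": {"index": dest_index},
    }
```

Running this hourly with `since` set to the previous run's start time gives the daily/hourly update pattern mentioned above, at the cost of missing deletes on the remote side.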