Reindex resiliency - GitHub Issues

henningandersen commented 5 years ago

We want to make reindex resilient to node restarts and failures, such that reindex can continue to run across such events.

There are two primary problems to solve:

- Data node resiliency. Reindex relies on scroll queries, which are not resilient.
- Coordinator node resiliency. Reindex runs on the host receiving the request and cannot survive if that node dies or is restarted.

Search resiliency

[ ] Search ordered by seq_no and handle query failures by retrying from last seq_no (inclusive)
[ ] Support reindex from remote when source version is 6.6+
[ ] Add support for alternative numeric ordering attribute, particularly useful for reindex from remote against a pre-6.5 source.
[ ] Back-off strategy on repeated failures
[ ] Verify overhead of seq_no ordering
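
A minimal sketch of the retry-from-last-seq_no and back-off items above, using an in-memory stand-in for the source index rather than a real Elasticsearch client. `FlakySource`, `open_scroll` and `reindex_resiliently` are hypothetical names invented for illustration, and the exponential delay is only one possible back-off strategy. The restart is inclusive because several documents can share a seq_no, so the destination writes have to be idempotent (here, keyed by document id).

```python
import time


class TransientSearchError(Exception):
    """Stands in for a search failure, e.g. a data node restart."""


class FlakySource:
    """Hypothetical in-memory source index that fails on chosen batches."""

    def __init__(self, docs, fail_on_batches=()):
        # docs: iterable of (seq_no, doc_id, body); kept sorted by seq_no
        self.docs = sorted(docs)
        self.fail_on_batches = set(fail_on_batches)
        self.batches_served = 0

    def open_scroll(self, min_seq_no, batch_size):
        """Yield batches of docs with seq_no >= min_seq_no, in seq_no order."""
        hits = [d for d in self.docs if d[0] >= min_seq_no]
        for start in range(0, len(hits), batch_size):
            self.batches_served += 1
            if self.batches_served in self.fail_on_batches:
                raise TransientSearchError(f"batch {self.batches_served} failed")
            yield hits[start:start + batch_size]


def reindex_resiliently(source, dest, batch_size=2, max_retries=5,
                        base_delay=0.01, sleep=time.sleep):
    """Copy all docs into dest, restarting the search after failures."""
    last_seq_no = 0  # lowest seq_no; nothing has been copied yet
    consecutive_failures = 0
    while True:
        try:
            # Inclusive restart: docs sharing last_seq_no are read again.
            for batch in source.open_scroll(last_seq_no, batch_size):
                for seq_no, doc_id, body in batch:
                    dest[doc_id] = body  # idempotent: re-reads just overwrite
                    last_seq_no = seq_no
                consecutive_failures = 0  # progress made, reset the back-off
            return dest
        except TransientSearchError:
            consecutive_failures += 1
            if consecutive_failures > max_retries:
                raise
            # Exponential back-off: base_delay, 2x, 4x, ...
            sleep(base_delay * 2 ** (consecutive_failures - 1))
```

With documents at seq_no 0, 1, 1, 2, 3 and a failure after the first batch of two, an exclusive restart (seq_no > 1) would skip the second document at seq_no 1; the inclusive restart re-reads one document and loses none.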

Coordinator node resiliency:

[ ] POC to clarify this subject more (#43382)
[ ] Decide on start reindex job action name
- indices:data/write/start_reindex
- indices:admin/reindex/start_reindex
- cluster:admin/reindex/start_reindex
- indices:data/reindex/start_reindex
[ ] Decide on persistent reindex task name
[ ] Evaluate how we want to do timeouts for waiting on initial task creation or reindex task completion
[ ] Refactor common parts from data frames and roll-up
[ ] Add reindex persistent task and remove it when done (#43382)
[ ] Allocation of reindex persistent task (#43382)
[ ] Store progress information periodically into .tasks index
[ ] Resume from existing progress information when allocated to new node
[ ] Make updates to persistent tasks resilient against master failovers
[ ] Support async durability on destination, ensuring data in checkpoint is fsync'ed into destination
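
The "store progress" and "resume from existing progress" items above could look roughly like the following sketch. Everything here is an assumption made for illustration: `CheckpointStore` is an in-memory stand-in for documents in the .tasks index, `run_reindex` and its `crash_after` parameter are hypothetical, and the checkpoint format (a single `last_seq_no` field) is not taken from any actual design. The checkpoint is written only after the corresponding documents have been written to the destination, which mirrors the durability item in spirit.

```python
import json


class CheckpointStore:
    """Hypothetical stand-in for progress documents kept in the .tasks index."""

    def __init__(self):
        self._docs = {}

    def save(self, task_id, checkpoint):
        # Serialize to JSON to mimic storing a document, not a live object.
        self._docs[task_id] = json.dumps(checkpoint)

    def load(self, task_id):
        raw = self._docs.get(task_id)
        return json.loads(raw) if raw is not None else None


def run_reindex(task_id, source_docs, dest, store,
                checkpoint_every=2, crash_after=None):
    """Copy docs in seq_no order, checkpointing every `checkpoint_every` docs.

    source_docs: list of (seq_no, doc_id, body), sorted by seq_no.
    crash_after: simulate losing the node after this many writes in this run.
    Returns True if the reindex completed, False if the simulated node died.
    """
    checkpoint = store.load(task_id) or {"last_seq_no": -1}
    resume_from = checkpoint["last_seq_no"]
    written = 0
    for seq_no, doc_id, body in source_docs:
        if seq_no < resume_from:
            continue  # covered by the checkpoint; resume is inclusive
        dest[doc_id] = body  # idempotent write keyed by document id
        written += 1
        if written % checkpoint_every == 0:
            store.save(task_id, {"last_seq_no": seq_no})
        if crash_after is not None and written >= crash_after:
            return False  # node died; progress since last checkpoint is redone
    return True
```

A second call with the same `task_id` and `store`, as would happen when the persistent task is allocated to a new node, picks up from the stored `last_seq_no` instead of starting over.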

Slicing:

[ ] Investigate having multiple in flight search and bulk requests as an alternative

Benchmarking:

[ ] Compare rally original indexing to reindex
[ ] Overhead of scripting and ingest pipelines

Misc:

[ ] Handle write failures by retrying when appropriate
[ ] Refined error handling, filter out known/retryable errors
[ ] HLRC support for new persistent task id.
[ ] Examine if transport client in 7.x can call resilient reindex (workaround).
[ ] Add serialization tests for get reindex request
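
For the write-failure items above, one possible shape is to split failed bulk items into retryable and fatal ones by HTTP status. This is only a sketch: `split_bulk_failures` is a hypothetical helper, and the choice of 429 and 503 as the retryable statuses is an assumption, not a decision recorded in this issue. The input follows the general layout of a bulk response, where each entry in `items` is keyed by its action type.

```python
# Assumption: rejected (429) and unavailable (503) writes are worth retrying;
# everything else, e.g. a 409 version conflict, is reported as a failure.
RETRYABLE_STATUSES = {429, 503}


def split_bulk_failures(bulk_response):
    """Partition failed bulk items into (retryable, fatal) lists.

    Each list holds (doc_id, status) tuples in response order.
    """
    retryable, fatal = [], []
    for item in bulk_response.get("items", []):
        # Each item has a single key naming the action, e.g. {"index": {...}}.
        result = next(iter(item.values()))
        status = result.get("status", 0)
        if 200 <= status < 300:
            continue  # successful write, nothing to do
        target = retryable if status in RETRYABLE_STATUSES else fatal
        target.append((result.get("_id"), status))
    return retryable, fatal
```

The retryable list would then be fed back into the next bulk request, ideally with the same kind of back-off used for search failures.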

Docs

[ ] Clarify how to use resilient reindex in reference docs (conflict handling, parameters)

elasticmachine commented 5 years ago

Pinging @elastic/es-distributed

Tim-Brooks commented 5 years ago

I made some updates to the meta issues under coordinator node.

Mpdreamz commented 5 years ago

Should we consider defaulting wait_for_completion to false as a breaking change?

It's a bit trappy in the sense that a client could disconnect and leave the reindex in an unknown state. TCP disconnection could potentially also be considered a way to cancel the reindex operation when wait_for_completion=true OOTB.

henningandersen commented 5 years ago

@Mpdreamz we discussed this in our weekly sync today. It seems both defaults have benefits. The current default gives the easiest OOTB experience for someone new to ES when playing around with it.

Also, we think we need a strong argument for changing the default, to ensure that we only do breaking changes when necessary. Do you have a good case to present on this?

Note that part of this project intends to introduce reindex as jobs, so a disconnected client would leave the job in a healthy state. Finding the job again will require looking it up through the new reindex job API (probably something like GET _reindex/ or GET _reindex/<job-id>).

For search, cancelling the job on TCP disconnect makes sense, since the result is going nowhere and a search has no side effects. For reindex, side effects are the job's primary purpose, and whether or not the user wants to cancel is less obvious. We think being explicit is better, also to ensure that a network issue does not result in stopping the job.

elastic / elasticsearch

Reindex resiliency #42612