elastic / elasticsearch

Free and Open Source, Distributed, RESTful Search Engine
https://www.elastic.co/products/elasticsearch

Reindex resiliency #42612

Open henningandersen opened 5 years ago

henningandersen commented 5 years ago

We want to make reindex resilient to node restarts and failures, such that reindex can continue to run across such events.

There are two primary problems to solve:

Search resiliency

Coordinator node resiliency:

Slicing:

Benchmarking:

Misc:

Docs

elasticmachine commented 5 years ago

Pinging @elastic/es-distributed

Tim-Brooks commented 5 years ago

I made some updates to the meta issues under coordinator node.

Mpdreamz commented 5 years ago

Should we consider defaulting wait_for_completion to false as a breaking change?

It's a bit trappy in the sense that a client could disconnect and leave the reindex in an unknown state. TCP disconnection could potentially also be considered a way to cancel the reindex operation when wait_for_completion=true OOTB.
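To make the trade-off concrete, here is a minimal sketch of the non-blocking flow: start the reindex with wait_for_completion=false, keep the returned task id, and poll GET _tasks/<task_id> for it. The helper names, the `fetch_json` transport callable, and the base URL are assumptions made for illustration; nothing here is taken from an official client.

```python
import json
import time
from urllib.parse import urlencode


def build_reindex_request(base_url, source_index, dest_index, wait_for_completion=False):
    """Return (method, url, body) for a reindex call. Nothing is sent here."""
    query = urlencode({"wait_for_completion": str(wait_for_completion).lower()})
    body = json.dumps({"source": {"index": source_index}, "dest": {"index": dest_index}})
    return "POST", f"{base_url}/_reindex?{query}", body


def poll_task(fetch_json, base_url, task_id, max_polls=10, delay=1.0):
    """Poll GET _tasks/<task_id> until the response reports completed=true.

    fetch_json is a hypothetical transport: a callable (method, url) -> dict.
    Returns the final task status, or None if the task did not finish
    within max_polls attempts.
    """
    for _ in range(max_polls):
        status = fetch_json("GET", f"{base_url}/_tasks/{task_id}")
        if status.get("completed"):
            return status
        time.sleep(delay)
    return None
```

With this flow a dropped connection only loses the poll loop, not the reindex itself, as long as the client has kept the task id.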

henningandersen commented 5 years ago

@Mpdreamz we discussed this in our weekly sync today. It seems both defaults have benefits. The current default gives the easiest OOTB experience for someone new to ES when playing around with it.

Also, we think we need a strong argument for changing the default, to ensure that we only do breaking changes when necessary. Do you have a good case to present on this?

Notice that part of this project intends to introduce reindex as jobs, so a disconnected client would leave the job in a healthy state, though finding the job again will require looking it up through the new reindex job API (probably something like GET _reindex/ or GET _reindex/<job-id>).

For search, cancelling the job on TCP disconnect makes sense, since the result is going nowhere anyway and a search has no side effects. For reindex, the job has side effects as its primary purpose, and whether or not the user wants to cancel is less obvious. We think being explicit is better, also to ensure that a network issue does not result in stopping the job.
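As a rough client-side illustration of "being explicit": in the sketch below a network error while following a reindex task is simply retried, and the task is only cancelled when the caller asks for it, via the existing POST _tasks/<task_id>/_cancel endpoint. The function name, the `send` transport callable, and the `cancel_requested` hook are hypothetical and exist only for this example.

```python
def follow_task(send, base_url, task_id, cancel_requested, max_attempts=5):
    """Follow a reindex task without treating a dropped connection as a cancel.

    send is a hypothetical transport: a callable (method, url) -> dict that
    may raise ConnectionError. cancel_requested is a callable returning True
    once the user has explicitly asked to stop the task.
    """
    for _ in range(max_attempts):
        if cancel_requested():
            # Explicit user intent: ask the cluster to cancel the task.
            return send("POST", f"{base_url}/_tasks/{task_id}/_cancel")
        try:
            status = send("GET", f"{base_url}/_tasks/{task_id}")
        except ConnectionError:
            # Network blip: the task keeps running server-side, so just retry.
            continue
        if status.get("completed"):
            return status
    return None
```

The design choice mirrored here is that the disconnect path and the cancel path are separate, so a flaky network cannot stop a long-running reindex by accident.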