elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

Minimize downtime while remedying corrupt document migration failures #100768

Open joshdover opened 3 years ago

joshdover commented 3 years ago

As noted in #100631, when an upgrade migration fails due to a corrupt document in the index, the source index is left in an unusable state because the write block remains in place. Unfortunately, we don't have a safe way to automatically clean up this write block on failure: other Kibana instances may still be able to complete the migration successfully, and removing the write block before they're done could lead to data loss.
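For illustration only (not an endorsed remediation path): once the corrupt document has been dealt with and no other Kibana instance can still complete the migration, an admin could clear the leftover write block with an index-settings update. This is a minimal sketch; the index name and the 7.x @elastic/elasticsearch client usage are assumptions, not something prescribed in this issue.

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Clear the write block that SET_SOURCE_WRITE_BLOCK left on the source index.
// Setting index.blocks.write to false re-enables writes; use with care, since
// another instance mid-migration could still be relying on the block.
async function clearWriteBlock(index: string): Promise<void> {
  await client.indices.putSettings({
    index,
    body: { index: { blocks: { write: false } } },
  });
}

// '.kibana_1' is an example source index name, not taken from this issue.
clearWriteBlock('.kibana_1').catch((err) => {
  console.error(err);
  process.exit(1);
});
```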

What we can do is provide a better experience for admins to handle this situation in order to minimize any downtime they may encounter while addressing the root cause. Possible options:

elasticmachine commented 3 years ago

Pinging @elastic/kibana-core (Team:Core)

joshdover commented 3 years ago

Also of note: in 8.0, when we plan to stop supporting the scenario where Kibana instances are configured with different plugins enabled, we could safely remove the write block, but only in the corrupt-document case.

Though, based on the conversation in https://github.com/elastic/kibana/pull/100171#discussion_r640479472, this may already be the case. If so, we could safely remove the write block in 7.x when corrupt objects are detected.
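To make that conditional cleanup concrete, here is a purely hypothetical sketch, not the actual Kibana migration state machine: the failure shape and the error-type name are assumptions, and only the settings call maps to a real Elasticsearch API.

```ts
import { Client } from '@elastic/elasticsearch';

// Hypothetical failure shape; Kibana's real migration errors are not modeled here.
interface MigrationFailure {
  errorType: string; // e.g. 'corrupt_saved_object' (assumed name)
  sourceIndex: string;
}

async function maybeRemoveWriteBlock(
  client: Client,
  failure: MigrationFailure
): Promise<void> {
  if (failure.errorType !== 'corrupt_saved_object') {
    // For any other failure, another Kibana instance may still be able to
    // finish the migration, so removing the block risks data loss; keep it.
    return;
  }
  // A corrupt document fails the migration for every instance alike, so the
  // block can be lifted to keep the source index usable.
  await client.indices.putSettings({
    index: failure.sourceIndex,
    body: { index: { blocks: { write: false } } },
  });
}
```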

ppf2 commented 3 years ago

Not sure if this will only be addressed in 8.0. It would be nice if we could backport this to 7.x. We had some failed production migrations due to Kibana leaving write blocks in place when encountering Elasticsearch exceptions (not isolated to document corruption during migration). As a result, on Cloud, Elasticsearch was successfully upgraded to a later 7.x version while Kibana was left on an older 7.x minor.

Example:

[.kibana_task_manager] SET_SOURCE_WRITE_BLOCK -> CREATE_REINDEX_TEMP. took: 96ms.

[.kibana_task_manager] [validation_exception]: Validation Failed: 1: this action would add [2] shards, but this cluster currently has [2000]/[2000] maximum normal shards open;

[.kibana_task_manager] migration failed, dumping execution log:
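In this example the blocker is the cluster-wide shard limit rather than a corrupt document. A minimal sketch of one way an operator could unblock the migration, assuming the 7.x @elastic/elasticsearch client and that temporarily raising cluster.max_shards_per_node is acceptable (deleting unused indices would be the cleaner fix):

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Raise the per-node shard limit so the migration's temp index (the [2] new
// shards in the validation_exception above) can be created. The value below
// is only an example.
async function raiseShardLimit(maxShardsPerNode: number): Promise<void> {
  await client.cluster.putSettings({
    body: {
      // transient settings are cleared on a full cluster restart, which keeps
      // this change from lingering once the migration has completed
      transient: { 'cluster.max_shards_per_node': maxShardsPerNode },
    },
  });
}

raiseShardLimit(1100).catch((err) => {
  console.error(err);
  process.exit(1);
});
```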