Minimize downtime while remedying corrupt document migration failures

joshdover commented 3 years ago

As noted in #100631, when an upgrade migration fails due to a corrupt document in the index, the source index is left in an unusable state because the write block remains in place. Unfortunately, we don't have a safe way to automatically clean up this write block on failure: other Kibana instances may still be able to complete the migration, and removing the write block before they finish could lead to data loss.
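
For reference, a minimal sketch of what clearing the write block by hand amounts to. The endpoint (`PUT /<index>/_settings`) and the `index.blocks.write` setting are Elasticsearch's index settings API; the helper name, the `EsRequest` type, and the index name are hypothetical and only illustrate the shape of the request. Per the caveat above, this is only safe once no other Kibana instance is still migrating.

```typescript
// Hypothetical request shape for this sketch; not a Kibana or client type.
interface EsRequest {
  method: 'PUT';
  path: string;
  body: { index: { blocks: { write: boolean } } };
}

// Builds the index-settings update that clears the write block on `indexName`.
// It only constructs the request; sending it is left to whatever HTTP or
// Elasticsearch client the admin already uses.
function buildRemoveWriteBlockRequest(indexName: string): EsRequest {
  return {
    method: 'PUT',
    path: `/${encodeURIComponent(indexName)}/_settings`,
    body: { index: { blocks: { write: false } } },
  };
}

// Placeholder index name, assumed for illustration only.
const removeBlockRequest = buildRemoveWriteBlockRequest('.kibana_example_001');
console.log(removeBlockRequest.method, removeBlockRequest.path, JSON.stringify(removeBlockRequest.body));
```

Building the request separately from sending it keeps the sketch independent of any particular client library.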

What we can do is provide a better experience for admins to handle this situation in order to minimize any downtime they may encounter while addressing the root cause. Possible options:

Provide a dry run feature to allow admins to easily detect corrupt objects prior to upgrading #55404
Provide a CLI for resetting the index state so that old Kibana versions can continue working while the admin investigates the root cause
Provide an interactive migration mode - https://github.com/elastic/kibana/issues/100685
"Quarantine" corrupt objects - this idea has many problems (such as breaking Kibana in hard to anticipate ways) and was previously abandoned in #55406
Add support for read-time migrations that don't block upgrades
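
To make the dry run option concrete, here is a rough sketch of what a corrupt-object scan could look like. The definition of "corrupt" used here is an assumption made for illustration (a document with no string `type` field, or with no attributes stored under the key named by `type`); the real migration may apply different or stricter checks. `RawDoc` and `findCorruptDocs` are hypothetical names.

```typescript
// Hypothetical shape of a raw document read from the Kibana index.
interface RawDoc {
  _id: string;
  _source: Record<string, unknown>;
}

// Returns the ids of documents that fail the assumed sanity checks:
//   1. `_source.type` is a non-empty string
//   2. `_source` has a property named after that type
// Nothing is written or modified, which is the point of a dry run.
function findCorruptDocs(docs: RawDoc[]): string[] {
  return docs
    .filter((doc) => {
      const type = doc._source.type;
      if (typeof type !== 'string' || type.length === 0) return true;
      return !Object.prototype.hasOwnProperty.call(doc._source, type);
    })
    .map((doc) => doc._id);
}

// Made-up sample documents for illustration.
const sampleDocs: RawDoc[] = [
  { _id: 'dashboard:1', _source: { type: 'dashboard', dashboard: { title: 'a' } } },
  { _id: 'dashboard:2', _source: { dashboard: {} } }, // missing `type`
  { _id: 'visualization:3', _source: { type: 'visualization' } }, // missing attributes
];

console.log(findCorruptDocs(sampleDocs)); // [ 'dashboard:2', 'visualization:3' ]
```

Because the scan is read-only, it could run against the live index before the upgrade without taking a write block.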

elasticmachine commented 3 years ago

Pinging @elastic/kibana-core (Team:Core)

joshdover commented 3 years ago

Also of note: in 8.0, when we plan to stop supporting the scenario where Kibana instances are configured with different plugins enabled, we could safely remove the write block when the failure is caused only by corrupt documents.

Though, based on the conversation in https://github.com/elastic/kibana/pull/100171#discussion_r640479472, this may already be the case. If so, we could safely remove the write block in 7.x when corrupt objects are detected.

ppf2 commented 3 years ago

Not sure if this will only be addressed in 8.0. It would be nice if we could backport this to 7.x. We had some failed production migrations because Kibana left write blocks in place after encountering Elasticsearch exceptions (not limited to document corruption during migration). As a result, on Cloud, Elasticsearch was successfully upgraded to a later 7.x version while Kibana was left on an older 7.x minor.

Example:

[.kibana_task_manager] SET_SOURCE_WRITE_BLOCK -> CREATE_REINDEX_TEMP. took: 96ms.

[.kibana_task_manager] [validation_exception]: Validation Failed: 1: this action would add [2] shards, but this cluster currently has [2000]/[2000] maximum normal shards open;

[.kibana_task_manager] migration failed, dumping execution log:
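
The arithmetic behind the validation error above can be sketched as a pre-flight check: the cluster already has 2000 of 2000 shards open, so adding 2 more exceeds the limit. The function name and the idea of running this before setting the write block are assumptions for illustration, not existing Kibana behaviour.

```typescript
// Hypothetical pre-flight check: would creating the temporary index fit
// within the cluster's open-shard limit?
function canAddShards(openShards: number, maxShards: number, shardsToAdd: number): boolean {
  return openShards + shardsToAdd <= maxShards;
}

// Numbers taken from the log line above: 2 new shards, 2000/2000 open.
console.log(canAddShards(2000, 2000, 2)); // false
```

Running a check like this before SET_SOURCE_WRITE_BLOCK would let the migration fail without leaving the source index blocked.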

elastic/kibana #100768