Handle cluster_block_exception during reindexing the TM index

ersin-erdal commented 3 days ago

Resolves: https://github.com/elastic/response-ops-team/issues/249

This PR increases the task claiming interval when a cluster_block_exception occurs, to avoid generating too many errors while the TM index is being reindexed.
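
As a rough illustration of the idea (not the code in this PR), the interval selection could look something like the sketch below. All names (`isClusterBlockException`, `nextPollInterval`, the two constants) and the interval values are assumptions made for the example, and the error shape is only a guess at what an Elasticsearch client error might carry.

```typescript
// Hypothetical sketch: every name and value here is an assumption for
// illustration, not the actual Task Manager implementation.
const DEFAULT_POLL_INTERVAL_MS = 3_000; // assumed normal claiming interval
const BLOCKED_POLL_INTERVAL_MS = 60_000; // assumed interval while the index is blocked

// Assumed minimal shape of an Elasticsearch client error.
interface EsErrorLike {
  message?: string;
  body?: { error?: { type?: string } };
}

// Returns true when the error looks like a cluster_block_exception,
// checking the structured error type first and the message as a fallback.
function isClusterBlockException(err: unknown): boolean {
  if (typeof err !== 'object' || err === null) {
    return false;
  }
  const e = err as EsErrorLike;
  return (
    e.body?.error?.type === 'cluster_block_exception' ||
    (e.message ?? '').includes('cluster_block_exception')
  );
}

// Picks the delay before the next claiming cycle based on the last error seen.
function nextPollInterval(lastError: unknown): number {
  return isClusterBlockException(lastError)
    ? BLOCKED_POLL_INTERVAL_MS
    : DEFAULT_POLL_INTERVAL_MS;
}
```

With this shape, the poller returns to the default interval on the first cycle that no longer fails with the block error, which matches the "everything comes back to normal" behaviour expected once the block is removed.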

To verify:

1. Run your local Kibana.
2. Create a user with the kibana_system and kibana_admin roles.
3. Log out and log in with your new user.
4. Use the request below to put a write block on the TM index: `PUT /.kibana_task_manager_9.0.0_001/_block/write`
5. Observe the error messages in your terminal and the interval at which they occur.

Use the request below in the Kibana console to remove the write block.

PUT /.kibana_task_manager_9.0.0_001/_settings
{
  "index": {
    "blocks.write": false
  }
}

pmuellr commented 2 days ago

Haven't reviewed the code yet, but I did take it for a spin.

Notes:

- If the write block is still on and Kibana is restarted, messages like this are logged: `Task ML:saved-objects-sync-task: Error running task: ML:saved-objects-sync-task, index [.kibana_task_manager_9.0.0_001] blocked by: [FORBIDDEN/8/index write (api)];: cluster_block_exception`. Guessing this is probably OK, but why would we be trying to write a task that presumably already exists? Is that the way "ensureScheduled" (or whatever) works with TM? It's not clear whether it's all the tasks or just some. Not sure it's worth doing anything about this; if anything, it's a great signal that the TM index is write-blocked :-)
- When using the update-by-query claimer, there's a long, JSON-filled error logged every 3s: `Failed to poll for work: { big JSON wad here }`. Seems like we should try not to log that every 3s, but perhaps the number of folks using that claimer, by the time we're in version 8.last, will be almost or literally none.
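
One possible way to avoid logging the same poll failure every 3s is to throttle it. The sketch below is only an illustration of that idea: `createThrottledLogger` and its parameters are hypothetical names, not an existing Kibana or Task Manager API, and the clock is injectable purely so the behaviour can be checked without waiting.

```typescript
// Hypothetical helper: wraps a log function so that it emits at most one
// message per `minIntervalMs`, and reports how many calls were suppressed.
function createThrottledLogger(
  log: (msg: string) => void,
  minIntervalMs: number,
  now: () => number = Date.now
): (msg: string) => boolean {
  let lastLoggedAt: number | undefined;
  let suppressed = 0;

  return (msg: string): boolean => {
    const t = now();
    // Inside the quiet window: count the message but do not log it.
    if (lastLoggedAt !== undefined && t - lastLoggedAt < minIntervalMs) {
      suppressed += 1;
      return false;
    }
    // Outside the window: log, mentioning anything that was dropped.
    const suffix =
      suppressed > 0 ? ` (${suppressed} similar messages suppressed)` : '';
    log(msg + suffix);
    lastLoggedAt = t;
    suppressed = 0;
    return true;
  };
}
```

A caller could wrap the "Failed to poll for work" logging with, say, a 60-second window, so the big JSON payload shows up roughly once a minute instead of on every poll.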

Other than that, seems to work as described. Looks like it's logging the Discovery service message ~1/minute, and then you can see errors updating task claims, etc, as expected. When the block is removed, everything comes back to normal.

elastic / kibana

Handle cluster_block_exception during reindexing the TM index #201297
