Pinging @elastic/es-core-infra (Team:Core/Infra)
I have just noticed that for ML a couple of the steps in the plan potentially clash:
> Cancel pre-existing tasks running on a node that is marked as shutting down

and:

> Check within the plugin lifecycle for the safety of shutdown
For ML we would like to hook into the node shutdown API so that we can tell running ML jobs to persist their latest state and stop working; then, when they restart on the new node, they won't have to redo any work. This is a massive benefit for ML, especially in an autoscaling world where every ML node will get stopped and replaced as a cluster scales up from a small size.
But since the ML jobs are persistent tasks, if they are cancelled by the framework before the plugin gets to react to node shutdown, then all the benefit will be lost. So it would be best for ML if the pre-existing tasks weren't cancelled until plugins have said they are ready for shutdown - then ML can stop the jobs in a more controlled way that's unique to node shutdown.
In fact, if there is a release where "Cancel pre-existing tasks running on a node that is marked as shutting down" is implemented but "Check within the plugin lifecycle for the safety of shutdown" is not, then that will make the situation with ML worse than it is today. The reason is that jobs and datafeeds are separate persistent tasks. If they are cancelled in a random order and that happens to be job first, datafeed second, then the datafeed may fail if it finds its job has been unable to accept data. This would mean that the datafeed wouldn't restart on the replacement node. So we would move to a situation where ML randomly stopped working after node shutdown.
Maybe the implementation order of the different steps needs to be changed. Or if not, at least persistent tasks need to be able to opt out of being cancelled during node shutdown so that there isn't a release where ML is in a worse situation than it's in today.
@droberts195 thanks for the information; I think we can design this in such a way as to accommodate that.
Right now we do no cancellation of persistent tasks (we only avoid assigning new ones to nodes shutting down). I'm also currently looking into the best way to signal from a plugin that it is "safe to shut down". We can look at making the service that would cancel persistent tasks wait until all plugins have signaled that they are okay to shut down, and I think that would work in this case?
> Maybe the implementation order of the different steps needs to be changed.

These steps are not necessarily in implementation order; they're also flexible to be added/removed/changed as we develop. They're more of a way for us to tie the different implementation pieces to this issue.
> We can look at making the service that would cancel persistent tasks wait until all plugins have signaled that they are okay to shut down, and I think that would work in this case?
Yes, that would be perfect if you could do that.
I agree with the thoughts here but I wanted to add the following:
@droberts195 I don't think this will be a big problem for what ML wants to do, but I wanted to note that, particularly in the restart case, we want the plugins to be quick in getting to a state where they are safe for shutdown. Restarts should be fairly responsive, so we wouldn't really want a node taking ~30 minutes to be ready to shut down. The time from the original shutdown request to the node being ready to be terminated should be on the order of a minute for the restart case. For the removal case it is likely to take time because we need to empty the node before it terminates.
This doesn't affect the conclusions drawn from the discussion above, but I did want to highlight it.
ML will almost always be able to persist state for all jobs running on a node within a minute or so. It could theoretically take longer but that would mean one of the following:
- the nodes holding the `.ml-state` index are running very slowly and take more than a minute to index the state documents we persist

But even in these cases the disruption to the cluster of not shutting down gracefully needs to be considered. If a job doesn't persist state before the node shuts down then, when it restarts on another node (or the same node restarted) after shutdown, it will have to revert to the most recent model snapshot it did persist, delete results that are newer than that, and re-analyse the data that is more recent than that model snapshot. So the extra processing required to do this will be a lot more than the processing required to store the latest state before shutdown.
I don't think we'll ever see ML state persistence taking 30 minutes unless there's a major problem with the nodes holding the state index. For large deployments we might see 2-3 minutes. We could always define a cutoff of say 5 minutes where if state persistence still hasn't finished we just kill the processes (by cancelling the persistent tasks). But I think it would be sad to set that cutoff at 1 minute because then that's setting us up for major cluster activity after the restart during ML autoscaling - this is what happens today and what we'd like the node shutdown API to solve for ML.
@droberts195 I think that's fine. Note that there is not, at this stage, going to be an explicit cutoff here, but the intention is that whatever plugins need to do to prepare for shutdown should be quick, and no more than a few minutes. There is a user expectation of promptness we need to meet most of the time; otherwise users are likely to kill the process manually anyway out of frustration (for the on-prem case) or get frustrated that our Cloud service is slow at performing config changes and upgrades (in the ESS/ECE/ECK case). A few minutes for large deployments still sounds reasonable to me, with most deployments being ready within a minute the majority of the time.
I believe since this API has been released, we can close this issue. Any further work can go into dedicated issues.
This issue supersedes #49064, which will be closed.
The node shutdown API should provide a safe way for operators to shut down a node, ensuring all relevant orchestration steps are taken to prevent cluster instability and data loss. The feature can be used to decommission, power cycle, or upgrade nodes.
An example of marking a node as part of the shutdown:
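For example, with a placeholder node ID; the body takes a shutdown `type` such as `restart` or `remove`, plus a human-readable `reason`:

```
PUT /_nodes/my-node-id/shutdown
{
  "type": "restart",
  "reason": "Demonstrating the node shutdown API"
}
```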
And retrieving the shutdown status:
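For example, again with a placeholder node ID (the response shape below is abbreviated and illustrative; it reports an overall status plus per-component statuses covering shard migration, persistent tasks, and plugin readiness):

```
GET /_nodes/my-node-id/shutdown
```

```
{
  "nodes": [
    {
      "node_id": "my-node-id",
      "type": "RESTART",
      "reason": "Demonstrating the node shutdown API",
      "status": "COMPLETE",
      "shard_migration": { "status": "COMPLETE" },
      "persistent_tasks": { "status": "COMPLETE" },
      "plugins": { "status": "COMPLETE" }
    }
  ]
}
```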
Here are some high-level tasks that need to be completed for this:
- Add a `ShutdownAwarePlugin` hook so that a plugin can signal whether it is safe to shut down and stop its work while shutting down

Phase 2: