Pinging @elastic/es-core-infra (Team:Core/Infra)
I have just noticed that for ML a couple of the steps in the plan potentially clash:
> Cancel pre-existing tasks running on a node that is marked as shutting down

and:

> Check within the plugin lifecycle for the safety of shutdown
For ML we would like to hook into the node shutdown API so that we can tell running ML jobs to persist their latest state and stop working; then, when they restart on the new node, they won't have to redo any work. This is a massive benefit for ML, especially in an autoscaling world where every ML node will get stopped and replaced as a cluster scales up from a small size.
But since the ML jobs are persistent tasks, if they are cancelled by the framework before the plugin gets to react to node shutdown, then all the benefit will be lost. So it would be best for ML if the pre-existing tasks weren't cancelled until plugins have said they are ready for shutdown - then ML can stop the jobs in a more controlled way that's unique to node shutdown.
In fact, if there is a release where "Cancel pre-existing tasks running on a node that is marked as shutting down" is implemented but "Check within the plugin lifecycle for the safety of shutdown" is not, then that will make the situation with ML worse than it is today. The reason is that jobs and datafeeds are separate persistent tasks. If they are cancelled in a random order and that happens to be job first, datafeed second, then the datafeed may fail if it finds its job has been unable to accept data. This would mean that the datafeed wouldn't restart on the replacement node. So we would move to a situation where ML randomly stopped working after node shutdown.
Maybe the implementation order of the different steps needs to be changed. Or if not, at least persistent tasks need to be able to opt out of being cancelled during node shutdown so that there isn't a release where ML is in a worse situation than it's in today.
@droberts195 thanks for the information; I think we can design this in such a way as to accommodate that.
Right now we do no cancellation of persistent tasks (we only avoid assigning new ones to nodes shutting down). I'm also currently looking into the best way to signal from a plugin that it is "safe to shut down". We can look at making the service that would cancel persistent tasks wait until all plugins have signaled that they are okay to shut down, and I think that would work in this case?
> Maybe the implementation order of the different steps needs to be changed.

These steps are not necessarily in implementation order; they're also flexible to be added/removed/changed as we develop. They're more of a way for us to tie the different implementation pieces to this issue.
> We can look at making the service that would cancel persistent tasks wait until all plugins have signaled that they are okay to shut down, and I think that would work in this case?
Yes, that would be perfect if you could do that.
I agree with the thoughts here but I wanted to add the following:
@droberts195 I don't think this will be a big problem for what ML wants to do, but I wanted to note that, particularly in the restart case, we want the plugins to be quick in getting to a state where they are safe for shutdown. Restarts should be fairly responsive, so we wouldn't really want a node taking ~30 minutes to be ready to shut down. The time from the original shutdown request to the node being ready to be terminated should be on the order of a minute for the restart case. For the removal case it is likely to take time because we need to empty the node before it terminates.
This doesn't affect the conclusions drawn from the discussion above, but I did want to highlight it.
ML will almost always be able to persist state for all jobs running on a node within a minute or so. It could theoretically take longer but that would mean one of the following:
- the nodes holding the `.ml-state` index are running very slowly and take more than a minute to index the state documents we persist

But even in these cases the disruption to the cluster of not shutting down gracefully needs to be considered. If a job doesn't persist state before the node shuts down then, when it restarts on another node (or the same node restarted) after shutdown, it will have to revert to the most recent model snapshot it did persist, delete results that are newer than that, and re-analyse the data that is more recent than that model snapshot. So the extra processing required to do this will be a lot more than the processing required to store the latest state before shutdown.
I don't think we'll ever see ML state persistence taking 30 minutes unless there's a major problem with the nodes holding the state index. For large deployments we might see 2-3 minutes. We could always define a cutoff of say 5 minutes where if state persistence still hasn't finished we just kill the processes (by cancelling the persistent tasks). But I think it would be sad to set that cutoff at 1 minute because then that's setting us up for major cluster activity after the restart during ML autoscaling - this is what happens today and what we'd like the node shutdown API to solve for ML.
@droberts195 I think that's fine. Note that there is not, at this stage, going to be an explicit cutoff here, but the intention is that whatever plugins need to do to prepare for shutdown should be quick, and no more than a few minutes. There is a user expectation of promptness we need to meet most of the time; otherwise users are likely to kill the process manually anyway out of frustration (for the on-prem case) or get frustrated that our Cloud service is slow at performing config changes and upgrades (in the ESS/ECE/ECK case). A few minutes for large deployments still sounds reasonable to me, with most deployments being ready within a minute the majority of the time.
I believe since this API has been released, we can close this issue. Any further work can go into dedicated issues.
This issue supersedes #49064, which will be closed.
The node shutdown API should provide a safe way for operators to shut down a node, ensuring all relevant orchestration steps are taken to prevent cluster instability and data loss. The feature can be used to decommission, power cycle, or upgrade nodes.
An example of marking a node as part of the shutdown:
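For example, with a placeholder node ID; the body takes a shutdown `type` such as `restart` or `remove`, plus a human-readable `reason`:

```
PUT /_nodes/my-node-id/shutdown
{
  "type": "restart",
  "reason": "Demonstrating the node shutdown API"
}
```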
And retrieving the shutdown status:
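For example, again with a placeholder node ID (the response shape below is abbreviated and illustrative; it reports an overall status plus per-component statuses covering shard migration, persistent tasks, and plugin readiness):

```
GET /_nodes/my-node-id/shutdown
```

```
{
  "nodes": [
    {
      "node_id": "my-node-id",
      "type": "RESTART",
      "reason": "Demonstrating the node shutdown API",
      "status": "COMPLETE",
      "shard_migration": { "status": "COMPLETE" },
      "persistent_tasks": { "status": "COMPLETE" },
      "plugins": { "status": "COMPLETE" }
    }
  ]
}
```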
Here are some high-level tasks that need to be completed for this:
- Add a `ShutdownAwarePlugin` hook so that a plugin can signal whether it is safe to shut down and stop its work while shutting down

Phase 2: