CloudStack currently offers a 'Maintenance' Mode, which facilitates the live migration of all VMs from a host and removes the host from the cluster for maintenance.
Proposed Feature: "Waiting for Maintenance" Mode
The proposed "Waiting for Maintenance" Mode introduces a preparatory state that addresses scenarios where live migration is impractical or impossible. This feature would enable gradual decommissioning or maintenance while avoiding service disruption.
General Idea of How It Might Work:
1. Operator Responsibilities:
Customer communication and notification are handled entirely by the cloud operator, outside of CloudStack. The operator informs customers that they have a time window in which to voluntarily restart their VMs before the cutoff date.
2. CloudStack Responsibilities:
Block the creation of new VMs on any host or cluster marked as 'Waiting for Maintenance'.
Ensure restarted VMs are relocated to clusters with matching host tags.
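To make the two CloudStack responsibilities concrete, here is a minimal Python sketch of the placement rule. It is not CloudStack code: the `WaitingForMaintenance` state, the `Host` class, and `eligible_hosts` are hypothetical names used only for illustration, and it assumes host tags can be compared as plain sets.

```python
from dataclasses import dataclass, field
from enum import Enum


class HostState(Enum):
    ENABLED = "Enabled"
    WAITING_FOR_MAINTENANCE = "WaitingForMaintenance"  # hypothetical new state
    MAINTENANCE = "Maintenance"


@dataclass
class Host:
    name: str
    cluster: str
    state: HostState
    tags: frozenset = field(default_factory=frozenset)


def eligible_hosts(hosts, required_tags):
    """Return hosts that may receive a new or restarted VM.

    A host qualifies only if it is enabled (so hosts waiting for
    maintenance are skipped) and carries every required host tag.
    """
    required = frozenset(required_tags)
    return [
        h for h in hosts
        if h.state is HostState.ENABLED and required <= h.tags
    ]


hosts = [
    Host("old-01", "legacy", HostState.WAITING_FOR_MAINTENANCE, frozenset({"gpu"})),
    Host("new-01", "modern", HostState.ENABLED, frozenset({"gpu"})),
    Host("new-02", "modern", HostState.ENABLED, frozenset()),
]

# A VM that needs the "gpu" tag is restarted: only new-01 qualifies.
candidates = eligible_hosts(hosts, {"gpu"})
```

With this rule, running VMs on `old-01` are left alone, but any VM that is stopped and started again lands on a host outside the flagged cluster.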
Scenario 1: Decommissioning an Old Compute Cluster
Problem:
Legacy clusters with outdated CPU architectures cannot perform live migration due to compatibility issues (e.g., VMs freezing during migration, causing downtime).
Existing VMs must restart to migrate to a new cluster with compatible architectures.
The old cluster remains active, risking the placement of new VMs and hindering decommissioning.
Scenario 2: Maintenance of GPU Clusters with GPU Passthrough
Problem:
GPU passthrough prevents live migration, unlike vGPU setups that allow seamless migration.
Downtime-free maintenance is not feasible, requiring customer cooperation to restart affected VMs.
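In both scenarios the operator needs to know which VMs are still sitting on flagged hosts as the cutoff date approaches, and when a host has fully drained. A rough Python sketch of that bookkeeping follows. The function names and the VM-to-host mapping are assumptions made for illustration, not an existing CloudStack API.

```python
def vms_still_pending(vm_to_host, flagged_hosts):
    """List VMs that are still running on a host flagged for maintenance."""
    return sorted(vm for vm, host in vm_to_host.items() if host in flagged_hosts)


def ready_for_maintenance(vm_to_host, host):
    """A flagged host can enter maintenance once no VM is placed on it."""
    return all(h != host for h in vm_to_host.values())


# Hypothetical placement data: VM name -> host name.
placements = {"vm-a": "old-01", "vm-b": "new-01", "vm-c": "old-02"}
flagged = {"old-01", "old-02"}

pending_before = vms_still_pending(placements, flagged)
ready_before = ready_for_maintenance(placements, "old-01")

# The customer restarts vm-a and it is placed on a host outside the flagged set.
placements["vm-a"] = "new-01"
pending_after = vms_still_pending(placements, flagged)
ready_after = ready_for_maintenance(placements, "old-01")
```

The list of pending VMs is what the operator would use to follow up with customers before the cutoff date.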
STEPS TO REPRODUCE
NA
EXPECTED RESULTS
Refer to Above
ACTUAL RESULTS
Unable to smoothly decommission compute servers where live migration is not possible.
Good idea @btzq. I wonder if we need a new state, as there is already "prepare for maintenance". That state might be overloaded, i.e. set manually instead of automatically. Let's investigate.
This is similar to how AWS handles scheduled maintenance: https://aws.amazon.com/maintenance-help/