0-complexity / openvcloud


Virtual machine failover within 1' in case of CPU node failure #903

Open FastGeert opened 6 years ago

FastGeert commented 6 years ago

Currently it can take up to 5' before VMs are failed over. By today's standards this is way too slow; it needs to be brought down to less than 1'.

The implementation should prevent split-brain issues, where the CPU node becomes inaccessible while the VMs running on it are still doing iops / networking IO. Currently OVS does not have an API to terminate running edge client connections. One way to prevent two VMs from doing simultaneous IO to the same vdisk is to temporarily block all network access between the failed CPU node and the storagerouters. We need to investigate whether the same can be done for the networking IO of the VMs that might still be running on the failing CPU node.

Another way to prevent split-brain issues is to perform a shutdown or power cycle via the IPMI interface of the physical machine. This approach does not require cleanup actions in firewalls when the node gets re-enabled.
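A minimal sketch of the IPMI fencing idea, assuming the node's BMC is reachable and `ipmitool` is available on the controller; the function and credential names here are illustrative, not the actual OpenvCloud API:

```python
# Sketch: fence a failed CPU node by power-cycling it over IPMI, so no
# VM on it can still issue IO to a vdisk after failover has started.
# bmc_host / user / password and the use of ipmitool are assumptions.
import subprocess

def build_ipmi_command(bmc_host, user, password, action="cycle"):
    """Build an ipmitool command line for a chassis power action.

    action: one of "off", "on", "cycle", "status".
    """
    if action not in ("off", "on", "cycle", "status"):
        raise ValueError("unsupported chassis power action: %s" % action)
    return [
        "ipmitool", "-I", "lanplus",
        "-H", bmc_host, "-U", user, "-P", password,
        "chassis", "power", action,
    ]

def fence_node(bmc_host, user, password):
    """Hard power-cycle the node; raises if ipmitool reports failure."""
    subprocess.check_call(build_ipmi_command(bmc_host, user, password, "cycle"))
```

Because the node is physically powered off before its VMs are restarted elsewhere, no firewall rules have to be cleaned up afterwards.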

FastGeert commented 6 years ago

FYI @hofkensj @Wvandebroeck

FastGeert commented 6 years ago

@grimpy Can you provide a high-level overview of how you would accomplish this requirement?

hofkensj commented 6 years ago

Take a look at https://github.com/0-complexity/openvcloud/issues/587 - is this a full duplicate?

grimpy commented 6 years ago

We need to write an API to issue IPMI commands on physical nodes (cloudbroker/node).

grimpy commented 6 years ago

The agentcontroller needs a monitor for its agents: when it detects that an agent hasn't connected within the last minute, it schedules a jumpscript on the master agent.

This jumpscript will call the maintenance mode api in the portal for this node (force=True)
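The watchdog loop described above could look roughly like this; `AgentWatchdog` and the `on_dead` callback are illustrative names, and the real jumpscript / portal maintenance API is only hinted at in the comments:

```python
# Sketch of an agentcontroller-side watchdog: track last contact per
# node agent, and flag any node silent for more than `timeout` seconds.
import time

class AgentWatchdog:
    def __init__(self, timeout=60.0, on_dead=None, clock=time.monotonic):
        self.timeout = timeout    # seconds without contact before a node is suspect
        self.on_dead = on_dead    # callback: e.g. schedule the maintenance jumpscript
        self.clock = clock        # injectable clock, makes the logic testable
        self.last_seen = {}       # node id -> timestamp of last agent contact
        self.flagged = set()      # nodes already reported, to avoid duplicates

    def heartbeat(self, node_id):
        """Record that this node's agent just connected."""
        self.last_seen[node_id] = self.clock()
        self.flagged.discard(node_id)

    def check(self):
        """Return newly dead nodes and fire on_dead once per failure."""
        now = self.clock()
        dead = [n for n, t in self.last_seen.items()
                if now - t > self.timeout and n not in self.flagged]
        for node_id in dead:
            self.flagged.add(node_id)
            if self.on_dead:
                # here the real implementation would call the portal
                # maintenance-mode API for this node with force=True
                self.on_dead(node_id)
        return dead
```

Running `check()` every few seconds keeps detection latency well under the 1' target while firing the maintenance action only once per failure.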

FastGeert commented 6 years ago

Uptime daemon

Small UDP gevent-based daemon, running on every node, that sends a uuid.uuid4() (== 16 random bytes; do not use the string format) to all other uptime daemons (their addresses are loaded once at startup from a configuration file that is kept up to date via a jumpscript), which simply echo the data back to where it came from. When the uptime daemon does not receive an echo within x seconds (let's put that in the config file), it notifies the uptime monitor running in the kubernetes controller. The uptime daemon is installed as a systemd service and is started automatically at boot time. The uptime daemon uses both the management and the backend network.
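The ping/echo exchange can be sketched with the stdlib as follows; the real daemon would use gevent, read its peer list from the config file, and notify the uptime monitor on timeout, all of which is left out here:

```python
# Sketch of the uptime daemon's core exchange: send 16 random bytes to
# a peer over UDP and expect the exact same bytes echoed back.
import os
import socket
import threading

def run_echo_server(sock):
    """Echo every datagram back to its sender (the peer side)."""
    while True:
        data, addr = sock.recvfrom(64)
        sock.sendto(data, addr)

def probe_peer(peer_addr, token, timeout=1.0):
    """Send our token to a peer; True iff it is echoed back in time."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.settimeout(timeout)
    try:
        s.sendto(token, peer_addr)
        data, _ = s.recvfrom(64)
        return data == token
    except socket.timeout:
        return False
    finally:
        s.close()

# 16 raw random bytes, as the issue specifies (not the uuid string form)
TOKEN = os.urandom(16)
```

Comparing the echoed payload against the random token guards against stale or spoofed replies being mistaken for a live peer.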

Uptime monitor

Small UDP gevent-based daemon that receives notifications from the uptime daemons when they fail to get an echo from a certain other uptime daemon. When the uptime monitor discovers that more than 50% of the uptime daemons report that a certain node is not responding, it starts investigating by running a jumpscript from one of the controller agents:
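The >50% quorum rule above can be sketched as follows; `UptimeMonitor` and its method names are illustrative, and the investigation jumpscript itself is out of scope:

```python
# Sketch of the monitor's quorum check: collect "no echo" reports per
# suspect node and trigger only when a strict majority of daemons agree.
class UptimeMonitor:
    def __init__(self, total_daemons):
        self.total = total_daemons
        self.reports = {}  # suspect node -> set of daemons reporting it

    def notify(self, reporter, suspect):
        """Record one daemon's 'no echo from suspect' report.

        Returns True once more than 50% of all daemons have reported
        the suspect, i.e. the point where the monitor should run the
        investigation jumpscript from one of the controller agents.
        """
        peers = self.reports.setdefault(suspect, set())
        peers.add(reporter)
        return len(peers) * 2 > self.total

    def clear(self, suspect):
        """Forget reports once the node is confirmed alive or fenced."""
        self.reports.pop(suspect, None)
```

Keeping reporters in a set makes repeated reports from the same daemon idempotent, so a flapping reporter cannot fake a majority on its own.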

FastGeert commented 6 years ago

Don't forget about: