0-complexity / openvcloud


Virtual machine failover within 1' in case of CPU node failure #903

Open FastGeert opened 6 years ago

FastGeert commented 6 years ago

Currently it can take up to 5' before VMs are failed over. By today's standards this is way too slow; it needs to be brought down to less than 1'.

The implementation should prevent split-brain issues, where the CPU node becomes inaccessible while the VMs running on it are still doing iops / networking IO. Currently OVS does not have an API to terminate running edge client connections. One way to prevent two VMs from doing simultaneous IO to the same vdisk is to temporarily block all network access between the failed CPU node and the storagerouters. We need to investigate whether the same can be done for the networking IO of the VMs that might still be running on the failing CPU node.

Another way to prevent split-brain issues is to perform a shutdown or power cycle via the IPMI interface of the physical machine. This approach does not require cleanup actions in firewalls when the node gets re-enabled.
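A minimal sketch of the IPMI fencing idea, assuming the node's BMC is reachable and `ipmitool` is available on the controller; the function and credential names here are illustrative, not the actual OpenvCloud API:

```python
# Sketch: fence a failed CPU node by power-cycling it over IPMI, so no
# VM on it can still issue IO to a vdisk after failover has started.
# bmc_host / user / password and the use of ipmitool are assumptions.
import subprocess

def build_ipmi_command(bmc_host, user, password, action="cycle"):
    """Build an ipmitool command line for a chassis power action.

    action: one of "off", "on", "cycle", "status".
    """
    if action not in ("off", "on", "cycle", "status"):
        raise ValueError("unsupported chassis power action: %s" % action)
    return [
        "ipmitool", "-I", "lanplus",
        "-H", bmc_host, "-U", user, "-P", password,
        "chassis", "power", action,
    ]

def fence_node(bmc_host, user, password):
    """Hard power-cycle the node; raises if ipmitool reports failure."""
    subprocess.check_call(build_ipmi_command(bmc_host, user, password, "cycle"))
```

Because the node is physically powered off before its VMs are restarted elsewhere, no firewall rules have to be cleaned up afterwards.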

FastGeert commented 6 years ago

FYI @hofkensj @Wvandebroeck

FastGeert commented 6 years ago

@grimpy Can you provide a high-level overview of how you would accomplish this requirement?

hofkensj commented 6 years ago

Take a look at https://github.com/0-complexity/openvcloud/issues/587 - is this a full duplicate?

grimpy commented 6 years ago

We need to write an API to issue IPMI commands on physical nodes (cloudbroker/node).

grimpy commented 6 years ago

The agentcontroller needs a monitor for its agents: when it detects that an agent hasn't connected within the last minute, it schedules a jumpscript on the master agent.

This jumpscript will call the maintenance mode api in the portal for this node (force=True)
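The watchdog loop described above could look roughly like this; `AgentWatchdog` and the `on_dead` callback are illustrative names, and the real jumpscript / portal maintenance API is only hinted at in the comments:

```python
# Sketch of an agentcontroller-side watchdog: track last contact per
# node agent, and flag any node silent for more than `timeout` seconds.
import time

class AgentWatchdog:
    def __init__(self, timeout=60.0, on_dead=None, clock=time.monotonic):
        self.timeout = timeout    # seconds without contact before a node is suspect
        self.on_dead = on_dead    # callback: e.g. schedule the maintenance jumpscript
        self.clock = clock        # injectable clock, makes the logic testable
        self.last_seen = {}       # node id -> timestamp of last agent contact
        self.flagged = set()      # nodes already reported, to avoid duplicates

    def heartbeat(self, node_id):
        """Record that this node's agent just connected."""
        self.last_seen[node_id] = self.clock()
        self.flagged.discard(node_id)

    def check(self):
        """Return newly dead nodes and fire on_dead once per failure."""
        now = self.clock()
        dead = [n for n, t in self.last_seen.items()
                if now - t > self.timeout and n not in self.flagged]
        for node_id in dead:
            self.flagged.add(node_id)
            if self.on_dead:
                # here the real implementation would call the portal
                # maintenance-mode API for this node with force=True
                self.on_dead(node_id)
        return dead
```

Running `check()` every few seconds keeps detection latency well under the 1' target while firing the maintenance action only once per failure.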

FastGeert commented 6 years ago

Uptime daemon

Small UDP gevent-based daemon, running on every node, that sends a uuid.uuid4() (== 16 random bytes; do not use the string format) to all other uptime daemons (their addresses are loaded once at startup from a configuration file that is kept up to date via a jumpscript), which simply echo the data back to where it came from. When the uptime daemon does not receive an echo within x seconds (let's put that in the config file), it notifies the uptime monitor running in the kubernetes controller. The uptime daemon is installed as a systemd service and is started automatically at boot time. The uptime daemon uses both the management and the backend network.
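The ping/echo exchange can be sketched with the stdlib as follows; the real daemon would use gevent, read its peer list from the config file, and notify the uptime monitor on timeout, all of which is left out here:

```python
# Sketch of the uptime daemon's core exchange: send 16 random bytes to
# a peer over UDP and expect the exact same bytes echoed back.
import os
import socket
import threading

def run_echo_server(sock):
    """Echo every datagram back to its sender (the peer side)."""
    while True:
        data, addr = sock.recvfrom(64)
        sock.sendto(data, addr)

def probe_peer(peer_addr, token, timeout=1.0):
    """Send our token to a peer; True iff it is echoed back in time."""
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.settimeout(timeout)
    try:
        s.sendto(token, peer_addr)
        data, _ = s.recvfrom(64)
        return data == token
    except socket.timeout:
        return False
    finally:
        s.close()

# 16 raw random bytes, as the issue specifies (not the uuid string form)
TOKEN = os.urandom(16)
```

Comparing the echoed payload against the random token guards against stale or spoofed replies being mistaken for a live peer.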

Uptime monitor

Small UDP gevent-based daemon that receives notifications from the uptime daemons when they fail to get an echo from a certain other uptime daemon. When the uptime monitor discovers that more than 50% of the uptime daemons report that a certain node is not responding, it starts investigating by running a jumpscript from one of the controller agents:
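The >50% quorum rule above can be sketched as follows; `UptimeMonitor` and its method names are illustrative, and the investigation jumpscript itself is out of scope:

```python
# Sketch of the monitor's quorum check: collect "no echo" reports per
# suspect node and trigger only when a strict majority of daemons agree.
class UptimeMonitor:
    def __init__(self, total_daemons):
        self.total = total_daemons
        self.reports = {}  # suspect node -> set of daemons reporting it

    def notify(self, reporter, suspect):
        """Record one daemon's 'no echo from suspect' report.

        Returns True once more than 50% of all daemons have reported
        the suspect, i.e. the point where the monitor should run the
        investigation jumpscript from one of the controller agents.
        """
        peers = self.reports.setdefault(suspect, set())
        peers.add(reporter)
        return len(peers) * 2 > self.total

    def clear(self, suspect):
        """Forget reports once the node is confirmed alive or fenced."""
        self.reports.pop(suspect, None)
```

Keeping reporters in a set makes repeated reports from the same daemon idempotent, so a flapping reporter cannot fake a majority on its own.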

FastGeert commented 6 years ago

Don't forget about: