HWCloudEngine / hybrid_cloud


[BGW HA] Implement Watchdog #40

Open eshedg opened 8 years ago


Overview of new functionality

Implement a watchdog process that monitors the critical runtime components of BGW and ensures that a failed BGW instance ends up in a consistent inactive state.

Context of new functionality

The L2GW Agent is responsible for switching to a substitute BGW instance in case of failure. However, it detects failures only through disconnection from the OVSDB. The watchdog's responsibility is to ensure that the L2GW Agent experiences an OVSDB disconnection even when the failed component is not the OVSDB.

Design Guideline

The general direction is to use a STONITH technique: in case of a partial failure, kill all surviving processes, then reboot the host so that it completes its cleanup and becomes available as a substitute for the next failing BGW.
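
To make the guideline concrete, here is a minimal sketch of a single watchdog pass, not a proposed implementation. The `Component` type, the `watchdog_pass` function, and the injected `is_alive`, `kill`, and `reboot` callables are all hypothetical names introduced for illustration. The issue does not specify which BGW components are monitored or how liveness is probed, so those details are left to the caller.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Component:
    """A monitored BGW runtime component (hypothetical structure)."""
    name: str
    is_alive: Callable[[], bool]   # liveness probe; how to probe is an assumption
    kill: Callable[[], None]       # terminates the component's process


def watchdog_pass(components: Iterable[Component],
                  reboot: Callable[[], None]) -> List[str]:
    """Run one monitoring pass and return the names of failed components.

    If at least one component has failed, every surviving component is
    killed (so that OVSDB goes down too and the L2GW Agent sees a
    disconnection), and then the host reboot is requested.
    """
    # Probe each component exactly once so the decision is consistent.
    status = [(c, c.is_alive()) for c in components]
    failed = [c.name for c, alive in status if not alive]

    if failed:
        for c, alive in status:
            if alive:
                c.kill()
        reboot()

    return failed
```

In a real deployment the callables would wrap whatever process supervision and reboot mechanism the host provides, and `watchdog_pass` would be called periodically. Injecting them keeps the decision logic testable without touching real processes.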