OpenClovis / SAFplus-Availability-Scalability-Platform

Middleware that provides libraries, GUI, and code generator to design multi-node (clustered) applications that are highly available, redundant, and scalable. Provides sub-second node and application fault detection and failover, and useful application libraries including distributed hash tables (checkpoint), event, logging, and communications. Implements SA-Forum APIs where applicable. Used anywhere reliability is a must -- like telecom, wireless, defense and enterprise computing. Download stable release with installer from: ftp.openclovis.com
www.openclovis.com
GNU General Public License v2.0

Simplification of Booting Procedure (Boot Level). #117

Open nateshrevankar opened 10 years ago

nateshrevankar commented 10 years ago

Phase 2: Simplification of Booting Procedure (Boot Level).

  1. The whole “boot level” concept is unnecessarily complex. I feel the entire code (“bmInitialize”, “bmStart”, etc.) can be removed and replaced with a thread that simply does:

A. Start safplus_logd and safplus_gms if not already started (see above). Also start the AMF threads (not sure when this happens today).

B. Start the rest of the SAFplus services, based on the XML/database definition. Today these services require that a cluster master exists. This is a chicken-and-egg problem because this node is not actually ready to become master until these very services come up. So a single-SC cluster theoretically cannot boot up. Today this is resolved by the master falsely claiming it can become master when in fact it cannot. This can cause nasty issues where the cluster database is lost (never got synchronized) during back-to-back failovers, which should be accounted for as a catastrophic fault but are not.

C. At the same time, all SAFplus services are tolerant of brief intervals where no master exists (during failover). So please modify the SAFplus services to also tolerate having no master at startup.

D. Wait until they come up. (This should not even be needed: user services can already tolerate failover, so why not also a service outage because a service has not started yet?)

E. Does the system issue a “node is ready” event today? If so, it goes here. If not, we need to add one.

F. Let cluster master AMF start bringing up user’s services.
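
To make the proposed flow concrete, steps A to F can be sketched as a single linear boot thread. This is only an illustrative model in Python, not SAFplus code: the `Node` class, `boot_sequence`, the log strings, and the service names `ckpt` and `event` are all hypothetical placeholders, and the real services would be read from the XML/database definition. The point it tries to show is that nothing in the sequence requires a cluster master to exist.

```python
import threading


class Node:
    """Hypothetical stand-in for one SAFplus node (not a real SAFplus API)."""

    def __init__(self, core, services):
        self.core = list(core)          # step A daemons, e.g. safplus_logd, safplus_gms
        self.services = list(services)  # step B services (placeholder names)
        self.running = set()
        self.master = None              # no cluster master is known at startup
        self.log = []                   # ordered record of what the boot thread did

    def start(self, name):
        """Start a service unless it is already running."""
        if name in self.running:
            return False
        self.running.add(name)
        self.log.append("start " + name)
        return True


def boot_sequence(node):
    """Linear replacement for the boot-level logic, following steps A to F."""
    for name in node.core:              # A: core daemons, skipped if already running
        node.start(name)
    for name in node.services:          # B + C: started although node.master is None
        node.start(name)
    # D: starts are synchronous in this sketch, so this is only a sanity check;
    # a real system would wait or poll here.
    missing = [n for n in node.core + node.services if n not in node.running]
    if missing:
        raise RuntimeError("services failed to start: " + ", ".join(missing))
    node.log.append("event node-ready")        # E: announce that the node is ready
    node.log.append("handoff to master AMF")   # F: master AMF may start user services


# Usage: safplus_logd is already running before the boot thread starts.
node = Node(["safplus_logd", "safplus_gms"], ["ckpt", "event"])
node.start("safplus_logd")
worker = threading.Thread(target=boot_sequence, args=(node,))
worker.start()
worker.join()
```

In this model the already-running `safplus_logd` is not started a second time, and the node reaches the "node ready" step while `node.master` is still unset, which is the behavior requested in step C.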

nateshrevankar commented 10 years ago

Please take care of this.