OpenClovis / SAFplus-Availability-Scalability-Platform

Middleware that provides libraries, GUI, and code generator to design multi-node (clustered) applications that are highly available, redundant, and scalable. Provides sub-second node and application fault detection and failover, and useful application libraries including distributed hash tables (checkpoint), event, logging, and communications. Implements SA-Forum APIs where applicable. Used anywhere reliability is a must -- like telecom, wireless, defense and enterprise computing. Download stable release with installer from: ftp.openclovis.com
www.openclovis.com
GNU General Public License v2.0

SAFplus_amf startup needs to be completely rewritten. #113

Closed nateshrevankar closed 10 years ago

nateshrevankar commented 10 years ago

Initial Comments from Andrew:

SAFplus_amf startup needs to be completely rewritten. The issues are:

  1. Get rid of the EoMain and EoInitialize junk. “Main” should do all that. You will still have to initialize the “EO” to get IOC messaging going but forget about all these callbacks.
  2. (this will be already done in our log rework) Nodename.log, and nodename.log rotation: This is where stdout goes and is completely redundant with safplus_logd. It needs to be completely removed and stdout should be redirected to /dev/null unless a “debug” flag is passed on the command line. If “debug” is on, leave it going to stdout.
  3. Removing nodename.log means that safplus_logd must handle the bootup logs. It therefore likely needs to be started right away, although a few logs can be sent to shared memory before logd starts consuming them. So hard-code safplus_logd’s startup EARLY rather than relying on some XML definition file (but leave the XML definition of safplus_logd in the file for process monitoring).
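
Point 2 above could be sketched roughly as follows. This is only an illustration in plain C, not SAFplus code: the flag spelling `--debug` and the helper names `debug_flag_present()` and `setup_stdout()` are assumptions made up for the example.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 if a "--debug" argument is present, 0 otherwise.
 * The flag spelling is hypothetical; the issue only says "debug". */
int debug_flag_present(int argc, char *argv[])
{
    assert(argv != NULL);
    for (int i = 1; i < argc; i++) {
        if (strcmp(argv[i], "--debug") == 0)
            return 1;
    }
    return 0;
}

/* Sends stdout to /dev/null unless debug is on.
 * Returns 0 on success, -1 if /dev/null could not be opened. */
int setup_stdout(int debug)
{
    if (debug)
        return 0;   /* debug on: leave stdout going where it already goes */
    if (freopen("/dev/null", "w", stdout) == NULL)
        return -1;
    return 0;
}
```

A caller would run `setup_stdout(debug_flag_present(argc, argv))` early in `main()`, before anything is printed.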

nateshrevankar commented 10 years ago

Purpose:

  1. Simplify AMS start.
  2. Consolidate all component initialization in cpmMain().
  3. Shared-memory-based logd changes (implemented separately).
  4. Removal of node-specific log file names and log rotation (implemented separately).

Approach: In the existing code, initialization is done partly by the EO, AMF and logd modules. Consolidate it into a single, simple sequence:

  1. Get the name and node info and initialize them in the EO module.
  2. Get the APP configuration and apply it to the current system.
  3. Then perform the normal start.

New control flow sequence:

  1. cpmMain() reads the node name from the config files and sets it in environment variables.
  2. EO is initialized for IOC port info.
  3. main() applies the APP settings for the current node.
  4. All components are initialized serially.
  5. logd initialization depends on whether the “debug” flag is passed on the command line. If “debug” is on, leave output going to stdout.
  6. Nodename.log and nodename.log rotation (implemented separately, as noted above).
  7. Then perform the normal start.

Target Files: ClInt32T main(ClInt32T argc, ClCharT *argv[], ClCharT *envp[]) and cpmMain() in file clCpmMain.c are the main functions to change.

Estimation: 6 days for the changes and 2 days for verification.

The present function call sequence is:

ClInt32T main(ClInt32T argc, ClCharT *argv[], ClCharT *envp[]) - clCpmMain.c
--cpmValidateEnv()
--clCpmParseCmdLine(argc, argv)
--clAmsSetInstantiateCommand(argc, argv)
--cpmMain(argc, argv)
----loadAspInstallInfo()
----CL_CPM_IOC_ADDRESS_GET(myCh, mySl, address)
----clEoNodeRepresentativeDeclare()
----clAppConfigure()
----clEoInitialize(ClInt32T argc, ClCharT *argv[])
------clEoStaticQueueInit()
------clASPInitialize()
--------clLoadEnvVars()
--------clEoSetup()
----------clDbgInitialize()
----------eoProtoInit() - init all protocols like RMD, msg,
----------clIocParseConfig(gClEoNodeName,&gpClEoIocConfig)
----------ClRcT clEoGetConfig(ClCharT *compCfgFile)
------------clParseXML(filePath,CL_EO_CONFIG_FILE_NAME,&clEoConfigListParserData) ...() /* Parse & Copy config.*/
----------clEoEssentialLibInitialize()
------------ClEoEssentialLibInfoT gEssentialLibInfo[]
----------clLogUtilLibInitialize()
----------clAspBasicLibInitialize()
----------clOsalSigHandlerInitialize()
----------clEoStaticMutexInit()
----------clEoCreate(&eoConfig, &pThis)
----------clAspClientLibInitialize(void)
--------clCpmTargetInfoInitialize()
------clEoMyEoObjectGet(&pThis)
------eoConfig.clEoCreateCallout(argc, argv);

nateshrevankar commented 10 years ago

The call sequence is consolidated to:

ClInt32T main(ClInt32T argc, ClCharT *argv[], ClCharT *envp[]) - clCpmMain.c
--cpmValidateEnv()
--clCpmParseCmdLine(argc, argv)
--clAmsSetInstantiateCommand(argc, argv)
--cpmMain(argc, argv)
----loadAspInstallInfo()
----CL_CPM_IOC_ADDRESS_GET(myCh, mySl, address)
----clEoNodeRepresentativeDeclare()
----clAppConfigure()
----clEoStaticQueueInit()
----clCPM_Initialize()
------clLoadEnvVars();
------clDbgInitialize();
------Read configuration file
------clEoEssentialLibInitialize()
------clAspBasicLibInitialize()
------clCPM_EoSetup() // do only clEoCreate(&eoConfig, &pThis);
------clCpmTargetInfoInitialize()
----clEoMyEoObjectGet(&pThis)
----clEoRefInc(pThis);
----eoConfig.clEoCreateCallout(argc, argv);

clCPM_Initialize() performs all major initialization.
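
One way to picture a single serial initialization sequence is a table of steps run in order, stopping at the first failure. The sketch below is a generic illustration only; `InitStep`, `run_init_steps()`, `step_ok()` and `step_fail()` are hypothetical names and are not the SAFplus functions listed above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical step signature: 0 means success, non-zero is an error code. */
typedef int (*InitFn)(void);

typedef struct {
    const char *name;   /* label used in the progress message */
    InitFn      fn;     /* the initializer to call */
} InitStep;

/* Runs the steps in table order. Returns 0 if every step succeeds;
 * otherwise returns the failing step's code and, if failed_index is
 * not NULL, stores the index of the step that failed. */
int run_init_steps(const InitStep *steps, size_t count, size_t *failed_index)
{
    assert(steps != NULL || count == 0);
    for (size_t i = 0; i < count; i++) {
        int rc = steps[i].fn();
        fprintf(stderr, "init [%s] rc=[%d]\n", steps[i].name, rc);
        if (rc != 0) {
            if (failed_index != NULL)
                *failed_index = i;
            return rc;   /* later steps are not run */
        }
    }
    return 0;
}

/* Stand-in steps for the example; real steps would wrap actual initializers. */
int step_ok(void)   { return 0; }
int step_fail(void) { return 18; }
```

Keeping the order in one table makes the sequence easy to read and reorder, which is the point of consolidating startup into one place.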

Implementation: the sequence above is implemented and brings the SAF services up and running.

root@ubuntu:~/aspp/etc# ps -eaf | grep -i saf
root 6344 1 0 20:34 ? 00:00:00 /root/aspp/bin/safplus_amf -c 0 -l 1 -n SCNodeI0
root 6353 1 0 20:34 ? 00:00:00 python /root/aspp/etc/safplus_watchdog.py
root 6383 6344 0 20:34 ? 00:00:00 /root/aspp/bin/safplus_logd
root 6392 6344 0 20:34 ? 00:00:00 /root/aspp/bin/safplus_gms clGmsConfig.xml
root 6421 6344 2 20:34 ? 00:00:00 /root/aspp/bin/safplus_event
root 6430 6344 0 20:34 ? 00:00:00 /root/aspp/bin/safplus_name
root 6431 6344 1 20:34 ? 00:00:00 /root/aspp/bin/safplus_ckpt
root 6467 6344 1 20:34 ? 00:00:00 /root/aspp/bin/safplus_msg

Testing: The initial CPM-related logs are not displayed yet; this will be fixed. Further code refactoring is in progress.

Improvement: clCPM_EoSetup() currently calls only clEoCreate(&eoConfig, &pThis) and will be simplified further.

nateshrevankar commented 10 years ago

Progress:

  1. safplus_amf, safplus_watchdog, safplus_logd and safplus_gms are created in order, with the Server and Client libraries initialized.
  2. Unnecessary clAspBasicLibInitialize() calls are removed.
  3. Client library entries that were null are removed.
  4. safplus_event, safplus_name, safplus_ckpt and safplus_msg are not getting created.
  5. AMF exits after about 60 seconds.

Issues:

  1. Check why SAF goes down a few seconds after init.
  2. safplus_event, safplus_name, safplus_ckpt and safplus_msg are not getting created.

root@ubuntu:~/aspp# ./eval start --asp-log-level=debug
INFO num of bearer : 1
...
INFO Starting AMF...
ValidateEnv()
clCpmParseCmdLine()
forked new process with PID[12447]
Set NodeName[SCNodeI0] cpmName[cpmServer_SCNodeI0] iocAddress [1]
clLoadEnvVars()
clDbgInitialize() done

Reading CPM (EO) configuration file
Initializing essential libraries... cpm tableSize [8]
Initializing essential library [OSAL]... rc=[0]
Initializing essential library [MEMORY]... rc=[0]
Initializing essential library [HEAP]... rc=[0]
Initializing essential library [BUFFER]... rc=[0]
Initializing essential library [TIMER]... rc=[0]
Initializing essential library [IOC]... rc=[0]
Initializing essential library [RMD]... rc=[0]
Initializing essential library [EO]... rc=[0]

clCpmTargetInfoInitialize()

doing APP init from cpmMAIN()

Initializing essential libraries... EO tableSize [8]
Initializing essential library [OSAL]... rc=[0]
Initializing essential library [MEMORY]... rc=[0]
Initializing essential library [HEAP]... rc=[0]
Initializing essential library [BUFFER]... rc=[0]
Initializing essential library [TIMER]... rc=[0]
Initializing essential library [IOC]... rc=[0]
Initializing essential library [RMD]... rc=[0]
Initializing essential library [EO]... rc=[0]

Process [/root/aspp/bin/safplus_amf] exited normally
Returned from cpmMAIN rc=[0]

root@ubuntu:~/aspp# ps -eaf | grep -i saf
root 12432 5599 0 19:28 pts/2 00:00:00 grep --color=auto -i saf
root@ubuntu:~/aspp# ps -eaf | grep -i saf
root 12447 1 2 19:28 ? 00:00:00 /root/aspp/bin/safplus_amf -c 0 -l 1 -n SCNodeI0
root 12457 1 3 19:28 ? 00:00:00 python /root/aspp/etc/safplus_watchdog.py
root 12483 12447 0 19:28 ? 00:00:00 /root/aspp/bin/safplus_logd
root 12496 12447 0 19:28 ? 00:00:00 /root/aspp/bin/safplus_gms clGmsConfig.xml
root 12504 5599 0 19:28 pts/2 00:00:00 grep --color=auto -i saf
root@ubuntu:~/aspp# ps -eaf | grep -i saf
root 12447 1 1 19:28 ? 00:00:00 /root/aspp/bin/safplus_amf -c 0 -l 1 -n SCNodeI0
root 12457 1 1 19:28 ? 00:00:00 python /root/aspp/etc/safplus_watchdog.py
root 12483 12447 3 19:28 ? 00:00:00 /root/aspp/bin/safplus_logd
root 12496 12447 1 19:28 ? 00:00:00 /root/aspp/bin/safplus_gms clGmsConfig.xml
root 12509 5599 0 19:28 pts/2 00:00:00 grep --color=auto -i saf
root@ubuntu:~/aspp# ps -eaf | grep -i saf
root 12447 1 0 19:28 ? 00:00:00 /root/aspp/bin/safplus_amf -c 0 -l 1 -n SCNodeI0
root 12457 1 0 19:28 ? 00:00:00 python /root/aspp/etc/safplus_watchdog.py
root 12483 12447 1 19:28 ? 00:00:00 /root/aspp/bin/safplus_logd
root 12496 12447 0 19:28 ? 00:00:00 /root/aspp/bin/safplus_gms clGmsConfig.xml
root 12511 5599 0 19:28 pts/2 00:00:00 grep --color=auto -i saf
root@ubuntu:~/aspp# ps -eaf | grep -i saf
root 12447 1 0 19:28 ? 00:00:00 /root/aspp/bin/safplus_amf -c 0 -l 1 -n SCNodeI0
root 12457 1 0 19:28 ? 00:00:00 python /root/aspp/etc/safplus_watchdog.py
root 12483 12447 0 19:28 ? 00:00:00 /root/aspp/bin/safplus_logd
root 12496 12447 0 19:28 ? 00:00:00 /root/aspp/bin/safplus_gms clGmsConfig.xml
root 12532 5599 0 19:28 pts/2 00:00:00 grep --color=auto -i saf
root@ubuntu:~/aspp# ps -eaf | grep -i saf
root 12447 1 0 19:28 ? 00:00:00 /root/aspp/bin/safplus_amf -c 0 -l 1 -n SCNodeI0
root 12457 1 0 19:28 ? 00:00:00 python /root/aspp/etc/safplus_watchdog.py
root 12483 12447 0 19:28 ? 00:00:00 /root/aspp/bin/safplus_logd
root 12496 12447 0 19:28 ? 00:00:00 /root/aspp/bin/safplus_gms clGmsConfig.xml
root 12574 5599 0 19:29 pts/2 00:00:00 grep --color=auto -i saf
root@ubuntu:~/aspp# ps -eaf | grep -i saf
root 12447 1 0 19:28 ? 00:00:00 /root/aspp/bin/safplus_amf -c 0 -l 1 -n SCNodeI0
root 12457 1 0 19:28 ? 00:00:00 python /root/aspp/etc/safplus_watchdog.py
root 12483 12447 0 19:28 ? 00:00:00 /root/aspp/bin/safplus_logd
root 12597 5599 0 19:29 pts/2 00:00:00 grep --color=auto -i saf
root@ubuntu:~/aspp# ps -eaf | grep -i saf
root 12447 1 0 19:28 ? 00:00:00 /root/aspp/bin/safplus_amf -c 0 -l 1 -n SCNodeI0
root 12457 1 0 19:28 ? 00:00:00 python /root/aspp/etc/safplus_watchdog.py
root 12483 12447 0 19:28 ? 00:00:00 /root/aspp/bin/safplus_logd
root 12606 5599 0 19:29 pts/2 00:00:00 grep --color=auto -i saf
root@ubuntu:~/aspp# ps -eaf | grep -i saf
root 12457 1 0 19:28 ? 00:00:00 python /root/aspp/etc/safplus_watchdog.py
root 12655 5599 0 19:29 pts/2 00:00:00 grep --color=auto -i saf
root@ubuntu:~/aspp# ps -eaf | grep -i saf
root 12663 5599 0 19:29 pts/2 00:00:00 grep --color=auto -i saf

nateshrevankar commented 10 years ago

Issues Fixed:

  1. SAF going down a few seconds after init: it now runs without error for more than an hour.
  2. safplus_event, safplus_name, safplus_ckpt and safplus_msg are now created and run without errors.

Process status:

root@ubuntu:~# ps -eaf | grep saf
root 31385 16686 0 18:34 pts/1 00:00:00 grep --color=auto saf
root@ubuntu:~# ps -eaf | grep saf
root 31405 1 1 18:34 ? 00:00:00 /root/aspp/bin/safplus_amf -c 0 -l 1 -n SCNodeI0
root 31414 1 1 18:34 ? 00:00:00 python /root/aspp/etc/safplus_watchdog.py
root 31440 31405 3 18:34 ? 00:00:00 /root/aspp/bin/safplus_logd
root 31453 31405 1 18:34 ? 00:00:00 /root/aspp/bin/safplus_gms clGmsConfig.xml
root 31464 16686 0 18:34 pts/1 00:00:00 grep --color=auto saf
root@ubuntu:~# ps -eaf | grep saf
root 31405 1 0 18:34 ? 00:00:00 /root/aspp/bin/safplus_amf -c 0 -l 1 -n SCNodeI0
root 31414 1 0 18:34 ? 00:00:00 python /root/aspp/etc/safplus_watchdog.py
root 31440 31405 1 18:34 ? 00:00:00 /root/aspp/bin/safplus_logd
root 31453 31405 0 18:34 ? 00:00:00 /root/aspp/bin/safplus_gms clGmsConfig.xml
root 31479 31405 0 18:34 ? 00:00:00 /root/aspp/bin/safplus_event
root 31489 16686 0 18:34 pts/1 00:00:00 grep --color=auto saf
root@ubuntu:~# ps -eaf | grep saf
root 31405 1 0 18:34 ? 00:00:00 /root/aspp/bin/safplus_amf -c 0 -l 1 -n SCNodeI0
root 31414 1 0 18:34 ? 00:00:00 python /root/aspp/etc/safplus_watchdog.py
root 31440 31405 0 18:34 ? 00:00:00 /root/aspp/bin/safplus_logd
root 31453 31405 0 18:34 ? 00:00:00 /root/aspp/bin/safplus_gms clGmsConfig.xml
root 31479 31405 0 18:34 ? 00:00:00 /root/aspp/bin/safplus_event
root 31490 31405 0 18:34 ? 00:00:00 /root/aspp/bin/safplus_name
root 31491 31405 0 18:34 ? 00:00:00 /root/aspp/bin/safplus_ckpt
root 31518 31405 0 18:34 ? 00:00:00 /root/aspp/bin/safplus_msg
root 31598 16686 0 18:35 pts/1 00:00:00 grep --color=auto saf
root@ubuntu:~# date
Fri Jan 3 18:35:10 IST 2014
root@ubuntu:~# date
Fri Jan 3 18:47:51 IST 2014
root@ubuntu:~# ps -eaf | grep saf
root 382 16686 0 18:47 pts/1 00:00:00 grep --color=auto saf
root 31405 1 0 18:34 ? 00:00:00 /root/aspp/bin/safplus_amf -c 0 -l 1 -n SCNodeI0
root 31414 1 0 18:34 ? 00:00:00 python /root/aspp/etc/safplus_watchdog.py
root 31440 31405 0 18:34 ? 00:00:02 /root/aspp/bin/safplus_logd
root 31453 31405 0 18:34 ? 00:00:00 /root/aspp/bin/safplus_gms clGmsConfig.xml
root 31479 31405 0 18:34 ? 00:00:00 /root/aspp/bin/safplus_event
root 31490 31405 0 18:34 ? 00:00:00 /root/aspp/bin/safplus_name
root 31491 31405 0 18:34 ? 00:00:00 /root/aspp/bin/safplus_ckpt
root 31518 31405 0 18:34 ? 00:00:00 /root/aspp/bin/safplus_msg
root@ubuntu:~#

nateshrevankar commented 10 years ago

Issue: The initial logs during boot ("Welcome ....") are not written to the sys.latest file.

The init sequence is:

fork new process and call cpmMAIN()
clLogLibInitialize() from cpmMAIN()
clLogDebugLevelSet - by setting env variable for log level
clLogInitialize()
clLogHandleInitHandleCreate()
clLogHandleDBGet [rc = 0x40012]
Error 0x12 = CL_ERR_INVALID_STATE

Cause: The log module looks up the EO object, which is only initialized during EO init as part of the Client library. Because the CPM and EO library functionality are now separated, the EO object instance in the process context is still invalid at that point, so the log module fails to initialize. The initial logs are therefore missing from the log file, but once the EO library is initialized as a client library, logs are written and the system comes up with all related processes. Working on getting the "Welcome...." log into the sys.latest file.
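
A common way around this kind of ordering problem is to hold early messages in a small buffer and replay them once the logger becomes usable. The sketch below only illustrates that idea with an in-process buffer; it is not the SAFplus log API and not the shared-memory mechanism mentioned earlier, and every name in it (`early_log()`, `early_log_attach()`, `stderr_sink()`, the slot sizes) is an assumption made up for the example.

```c
#include <assert.h>
#include <stdio.h>

#define EARLY_LOG_SLOTS 8     /* hypothetical capacity */
#define EARLY_LOG_LEN   128   /* hypothetical max message length */

typedef void (*LogSink)(const char *msg);

static char    early_buf[EARLY_LOG_SLOTS][EARLY_LOG_LEN];
static size_t  early_count = 0;
static LogSink sink = NULL;   /* NULL until the real logger is ready */

/* Logs a message: forwards it if a sink is attached, otherwise buffers it.
 * Returns 0 on success, -1 if the buffer is full and the message is dropped. */
int early_log(const char *msg)
{
    assert(msg != NULL);
    if (sink != NULL) {
        sink(msg);
        return 0;
    }
    if (early_count == EARLY_LOG_SLOTS)
        return -1;
    /* snprintf truncates over-long messages and always NUL-terminates. */
    snprintf(early_buf[early_count], EARLY_LOG_LEN, "%s", msg);
    early_count++;
    return 0;
}

/* Called once the logger is usable: replays buffered messages in order,
 * then switches to direct forwarding. Returns the number replayed. */
size_t early_log_attach(LogSink ready_sink)
{
    assert(ready_sink != NULL);
    size_t replayed = early_count;
    for (size_t i = 0; i < early_count; i++)
        ready_sink(early_buf[i]);
    early_count = 0;
    sink = ready_sink;
    return replayed;
}

/* Example sink that writes to stderr and counts how often it was called. */
size_t sink_calls = 0;
void stderr_sink(const char *msg)
{
    fprintf(stderr, "%s\n", msg);
    sink_calls++;
}
```

With this shape, startup code would call `early_log()` from the first line of `main()` and call `early_log_attach()` at the point where the logger's dependencies are initialized.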

nateshrevankar commented 10 years ago

I have made the changes so that SAFplus_amf startup has the whole start procedure in cpmMain().

The changes are in 2 files :

  1. clCpmMain.c
  2. clEo.c

The changes are:

  1. main() starts cpmMain() without daemonizing it, so it runs as a foreground process.
  2. cpmMain() does the basic name setting for the process.
  3. clCPM_Initialize() is a new function that does all the initialization:
     a. LoadEnvVars
     b. DbgInitialize
     c. eoProtoInit
     d. Reading configuration file
     e. EssentialLibInitialize
     f. LogUtilLibInitialize
     g. AspBasicLibInitialize
     h. EoCreate
     i. AspClientLibInitialize
     j. CpmTargetInfoInitialize
     k. Start EO essentials.

The changes are uploaded at: https://github.com/nateshrevankar/SAFplus-Availability-Scalability-Platform/commit/df2135ab47fdf2bafebe36d6d5bbca0182550a69