OpenClovis / SAFplus-Availability-Scalability-Platform

Middleware that provides libraries, GUI, and code generator to design multi-node (clustered) applications that are highly available, redundant, and scalable. Provides sub-second node and application fault detection and failover, and useful application libraries including distributed hash tables (checkpoint), event, logging, and communications. Implements SA-Forum APIs where applicable. Used anywhere reliability is a must -- like telecom, wireless, defense and enterprise computing. Download stable release with installer from: ftp.openclovis.com
www.openclovis.com
GNU General Public License v2.0

Simplifying SAF Startup procedure #111

Closed nateshrevankar closed 10 years ago

nateshrevankar commented 10 years ago

Initial comments from Andrew:

Startup: “safplus start” should start the AMF watchdog. The watchdog then begins its normal operation; in doing so, it detects that SAFplus is not running and starts it. This start procedure should be exactly the same code that runs after failure detection.

AMF watchdog: The purpose of the AMF watchdog is to be an extremely simple program that restarts SAFplus if it stops. It needs to have ZERO bugs because nothing monitors it. The simplest way to make a zero-bug program is to make it simple. Currently it is extremely complex and has been known to quit by raising unhandled exceptions or simply quitting for an unknown reason. The AMF watchdog needs to be rewritten to become a < 50 line program that:

  1. daemonizes itself
  2. loops doing:
     a) tests for the existence of SAFplus.
     b) If it is dead, it checks for a SINGLE “quit” indicator (I suggest touching a file in var/run); if that indicator exists, the watchdog quits.
     c) Otherwise it delays 30 seconds, then starts safplus_amf.
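
A minimal sketch of the loop described above, not the actual safplus_watchdog.py. The quit-file path, the AMF command line, the use of pgrep for the liveness test, and the one-second polling interval are all assumptions made for illustration; daemonizing is left out.

```python
import os
import subprocess
import time

# Hypothetical locations: the comment above only says "a file in var/run".
QUIT_FILE = "/root/safplus/var/run/safplus_quit"
AMF_CMD = ["/root/safplus/bin/safplus_amf"]
RESTART_DELAY = 30  # seconds, as proposed above


def amf_running():
    """Return True if a process named safplus_amf exists (assumes pgrep is installed)."""
    return subprocess.call(["pgrep", "-x", "safplus_amf"],
                           stdout=subprocess.DEVNULL) == 0


def watchdog_step(is_running, quit_requested, start, sleep=time.sleep):
    """Run one loop iteration; return False when the watchdog should exit."""
    if is_running():
        return True           # a) SAFplus is alive, nothing to do
    if quit_requested():
        return False          # b) quit indicator present, watchdog quits
    sleep(RESTART_DELAY)      # c) wait, then restart
    start()
    return True


def main():
    while watchdog_step(amf_running,
                        lambda: os.path.exists(QUIT_FILE),
                        lambda: subprocess.Popen(AMF_CMD)):
        time.sleep(1)  # polling interval is an assumption
```

Passing the three checks in as callables keeps the decision logic in one small function that can be exercised without starting any real process; calling main() runs the loop.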
nateshrevankar commented 10 years ago

The function for Start needs to be:

  - Start Watchdog alone.
  - Watchdog checks for the existence of a file (this file is to be created if AMF should NOT be restarted).
  - If the restart file is not available, start AMF as a daemon process:
    o Save previous core, log, db files (better to move this to Watchdog before starting SAF)
    o Load configuration of TIPC module.
    o Specify library paths
    o For System controller, start SNMP daemon
    o For Simulation, start GMS configurations
    o Start AMF by having node name and address.
    o Load HPI if it is not a simulation

nateshrevankar commented 10 years ago

The following things are accomplished in the uploaded files:

  1. "start" from the shell will start safplus_watchdog_start.py.
  2. This in turn starts safplus_watchdog.py in the background and returns.
  3. The watchdog process checks for the presence of the restart file "safplus_restart" (this file is created on a graceful exit; on an abrupt exit it is NOT created).
  4. If the restart file is not available (which implies an abrupt shutdown of SAF), the watchdog will start the AMS process.
  5. The AMS process will clean shared memory and save the previous core, log, and db files, then start AMF.
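
As an illustration of the "save previous core, log, db files" step, here is a small sketch that moves the contents of one directory into a timestamped backup directory. The function name, the directory layout, and the timestamp format are assumptions; the committed scripts may do this differently.

```python
import os
import shutil
import time


def save_previous_files(src_dir, backup_root, stamp=None):
    """Move everything in src_dir into backup_root/<stamp>/ and return that path.

    Returns None when src_dir is missing or empty, so no empty backup is created.
    backup_root is assumed to be outside src_dir.
    """
    if not os.path.isdir(src_dir):
        return None
    names = os.listdir(src_dir)
    if not names:
        return None
    stamp = stamp or time.strftime("%Y%m%d-%H%M%S")
    dest = os.path.join(backup_root, stamp)
    os.makedirs(dest, exist_ok=True)
    for name in names:
        shutil.move(os.path.join(src_dir, name), os.path.join(dest, name))
    return dest
```

It would be called once per directory (for example the log directory) before AMF is started.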

The initial WIP version is pushed in github: https://github.com/nateshrevankar/SAFplus-Availability-Scalability-Platform/commit/3ce40e7a2a5bd606df6e1709d5081f969fa1c31c

Basic testing has been performed and it is working well. Further testing is in progress, and fine tuning may be needed because the watchdog and SAF start are inter-related.

nateshrevankar commented 10 years ago

The start-up process is verified and uploaded in location: https://github.com/nateshrevankar/SAFplus-Availability-Scalability-Platform/commit/382b04064fca111a803747b579b12add017375f7

The supported functions are:

  start   : Start Watchdog
  stop    : Kill Watchdog
  restart : Stop ASP. After 30 sec, the watchdog will start SAF again.
  status  : ASP status.
  help    : Help menu.
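
The command handling listed above could be dispatched roughly as follows. This is only a sketch: the help text, the handler table, and the exit codes are hypothetical and are not taken from safplus_watchdog_start.py.

```python
HELP_TEXT = "usage: safplus_model {start|stop|restart|status|help}"  # hypothetical wording


def dispatch(argv, handlers):
    """Run the handler named by argv[1]; print the help string for anything else."""
    command = argv[1] if len(argv) > 1 else "help"
    handler = handlers.get(command)
    if handler is None:
        print(HELP_TEXT)
        # Assumed convention: asking for help is not an error, an unknown command is.
        return 0 if command == "help" else 1
    return handler()
```

A caller would pass sys.argv together with a dict mapping "start", "stop", "restart", and "status" to the functions that implement them.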

nateshrevankar commented 10 years ago

Test Case 1: Procedure:

  1. Freshly load the image and then perform "./safplus_model start"
  2. Verify the file which is starting the script.

Expected result: The safplus_watchdog_start.py script is started with the user-supplied "start" command.
Result: PASS.

Test Case 2: Procedure:

  1. Freshly load the image and then perform "./safplus_model start" from root privileges.
  2. Verify the tasks started by running safplus_watchdog_start.py.

Expected result: The following task should be spawned:
root XXXXX 1 0 15:01 ? 00:00:00 python /root/eval/etc/safplus_watchdog.py
Result: PASS.

Test Case 3: Procedure:

  1. Freshly load the image and then perform "./safplus_model start" without root privileges.
  2. Verify the tasks started by running safplus_watchdog_start.py.

Expected result: The watchdog and ASP should not start, and the error "ASP is not being run in root user mode" should be displayed.
Result: PASS.

Test Case 4: Procedure:

  1. Start watchdog "./safplus_model start" when a watchdog process is still running.
  2. Observe the tasks started.

Expected result: The watchdog exits without creating the task, reporting the error "Watchdog is already running on node [XX], pid YYY".
Result: PASS.
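
The thread does not say how the running watchdog is detected. One common approach, shown here purely as an assumption, is a PID file combined with a signal-0 liveness probe; the function name and the PID-file idea are hypothetical.

```python
import os


def running_watchdog_pid(pid_file):
    """Return the PID recorded in pid_file if that process is alive, else None."""
    try:
        with open(pid_file) as f:
            pid = int(f.read().strip())
    except (OSError, ValueError):
        return None  # no PID file, or unreadable contents
    if pid <= 0:
        return None
    try:
        os.kill(pid, 0)  # signal 0 only checks that the process exists
    except ProcessLookupError:
        return None      # stale PID file
    except PermissionError:
        pass             # process exists but belongs to another user
    return pid
```

A start script could call this first and print the "already running" message with the returned PID instead of spawning a second watchdog.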

Test Case 5: Procedure:

  1. Delete restart file (safplus_restart) if available in location "/root/safplus/var/run"
  2. Start Watchdog by using command.
  3. Observe the tasks created.

Expected result: The watchdog task is started from Python, and AMS is started with its related tasks.
Result: PASS.

root@ubuntu:~/eval/var/run# ps -eaf | grep -i saf
root 4212 6499 0 17:15 pts/3 00:00:00 grep --color=auto -i saf
root 31172 1 0 15:01 ? 00:00:00 python /root/eval/etc/safplus_watchdog.py
root 32047 1 0 15:04 ? 00:00:07 /root/eval/bin/safplus_amf -c 0 -l 1 -n SCNodeI0
root 32061 32047 12 15:04 ? 00:16:47 /root/eval/bin/safplus_logd
root 32072 32047 0 15:04 ? 00:00:06 /root/eval/bin/safplus_gms clGmsConfig.xml
root 32089 32047 0 15:04 ? 00:00:06 /root/eval/bin/safplus_event
root 32098 32047 0 15:04 ? 00:00:06 /root/eval/bin/safplus_name
root 32099 32047 0 15:04 ? 00:00:07 /root/eval/bin/safplus_ckpt
root 32126 32047 0 15:04 ? 00:00:06 /root/eval/bin/safplus_msg


Test Case 6: Procedure:

  1. Create the restart file (safplus_restart) in location "/root/safplus/var/run".
  2. Start Watchdog by using the start command.
  3. Observe the tasks created.

Expected result: The watchdog should not create the AMF process; instead it should kill itself, since AMF has exited gracefully (indicated by the safplus_restart file).
Result: PASS.

Test Case 7: Procedure:

  1. Start watchdog and wait till the AMF is started along with required processes.
  2. Stop watchdog using command "./safplus_model stop"

Expected result:

  1. AMF related processes to be stopped.
  2. Then kill watchdog, hence all SAF related tasks are killed by clearing temp files.

Result: PASS.

Test Case 8: Procedure:

  1. Create some core and log files in their respective locations, e.g. "/root/safplus/var/log".
  2. Start Watchdog by using the start command.
  3. Observe that the files were backed up.

Expected result: While starting AMS, the core and log files are backed up.
Result: PASS.

Test Case 9: Procedure:

  1. Start watchdog and wait till the AMF is started along with required processes.
  2. Restart using command "./safplus_model restart"
  3. Observe the PIDs of Watchdog, AMS, logd ...

Expected result:

  1. AMF related processes to be stopped.
  2. All services related to AMS to be started with newer PIDs.
  3. Watchdog PID to remain same.

Result: PASS.

Test Case 10: Procedure:

  1. Issue Help command "./safplus_model help"

Expected result: The help string is displayed on the terminal.
Result: PASS.

Test Case 11: Procedure:

  1. Issue unsupported command "./safplus_model abcdefgh"

Expected result: safplus_watchdog_start should display the help string on the terminal.
Result: PASS.

Test Case 12: Procedure:

  1. Start watchdog and wait till the AMF is started along with required processes.
  2. Get status using command "./safplus_model status"

Expected result: AMF process status to be displayed to the user.
Result: PASS.

Test Case 13: Procedure:

  1. Start watchdog and wait till the AMF is started along with required processes.
  2. Kill the process related to watchdog.
  3. Make sure that the AMS processes are running.

Expected result: Only the watchdog process to be started, without disrupting SAF processes.
Result: PASS.

Test Case 14: Procedure:

  1. Stop AMF and watchdog and make sure the related processes are not running
  2. Restart ASP from watchdog by issuing command "./safplus_model restart"

Expected result: No action is taken, and the user is informed that no SAF processes are running: "ASP is not running on node [%s]. Cleaning up anyway..."
Result: PASS.

root@ubuntu:~/eval# ./eval restart
DEBUG checkTipc: True
WARNING ASP is not running on node [1]. Cleaning up anyway...
INFO Waiting for AMF to shutdown...
DEBUG checkTipc: True
INFO Unloading TIPC ...
DEBUG disable bearer :tipc-config -bd=eth:eth0 ...


Test Case 15: Procedure:

  1. Stop AMF and watchdog and make sure the related processes are not running
  2. Issue stop again from watchdog using command "./safplus_model stop"

Expected result: No action is taken, and the user is informed that no SAF processes are running: "ASP is not running on node [%s]. Cleaning up anyway..."
Result: PASS.

root@ubuntu:~/eval# ./eval stop
DEBUG checkTipc: True
WARNING ASP is not running on node [1]. Cleaning up anyway...
INFO Waiting for AMF to shutdown...
DEBUG checkTipc: True
INFO Unloading TIPC ...
DEBUG disable bearer :tipc-config -bd=eth:eth0 ...


Test Case 16: Procedure:

  1. Start Watchdog and AMF with related processes.
  2. Create restart file (safplus_restart) in location "/root/safplus/var/run"
  3. Kill SAF related processes.

Expected result: After 30 seconds, the watchdog checks whether SAF exited gracefully and then kills itself.
Result: PASS.

Test Case 17: Procedure:

  1. Start Watchdog and AMF with related processes.
  2. Delete restart file (safplus_restart) if any in location "/root/safplus/var/run"
  3. Kill SAF related processes.

Expected result: After 30 seconds, the watchdog detects that SAF has gone down and starts AMF with all related processes.
Result: PASS.

Test Case 18: Procedure:

  1. Start Watchdog and AMF with related processes.
  2. Delete restart file (safplus_restart) if any in location "/root/safplus/var/run"
  3. Kill SAF related processes and create a disk-full scenario in the virtual machine.

Expected result: After 30 seconds, the watchdog detects that SAF has gone down and starts AMF with all related processes, displaying errors on the terminal. The start of all SAF related processes is not guaranteed.
Result: PASS - The disk low error is displayed to user.

Test Case 19: Procedure:

  1. Start Watchdog and AMF with related processes.
  2. Kill any SAF related non-critical process (event, name, logd).

Expected result: AMF detects that the non-critical process is down and restarts that particular process within the stipulated time. For example, kill logd and then verify the process PID.
Result: PASS.

Test Case 20: Procedure:

  1. Start Watchdog and AMF with related processes.
  2. Kill any SAF related critical process (ckpt, msg, gms).

Expected result: Because a critical process is down, the AMF process is started afresh with its related processes.
Result: PASS.

Test Case 21: Procedure:

  1. Remove directory which stores restart file (run directory) in location "/root/safplus/var/"
  2. Start watchdog and AMF.

Expected result: Watchdog and AMF processes to be started without any error.
Result: PASS.

Test Case 22: Procedure:

  1. Start Watchdog and AMF with related processes.
  2. Create restart file (safplus_restart) in location "/root/safplus/var/run"
  3. Restart AMS through watchdog by issuing the restart command.

Expected result: Watchdog checks for the restart file; if present, it stops AMF gracefully and kills itself.
Result: PASS.

Test Case 23: Procedure:

  1. Start Watchdog and AMF with related processes.
  2. Create restart file (safplus_restart) in location "/root/safplus/var/run"
  3. Stop AMS through watchdog issuing command.
  4. Check for restart file (safplus_restart).

Expected result: Watchdog finds the restart file, stops AMF gracefully, kills itself, and preserves the restart file.
Result: PASS. APP should remove restart file then start Watchdog.

Test Case 24: Procedure:

  1. Start Watchdog and AMF with related processes.
  2. kill AMF by restarting command and make sure the restart file is not created
  3. Restart AMS through watchdog issuing command within 30 sec after AMF killing.

Expected result: AMF process not available and display appropriate message
Result: PASS.

root@ubuntu:~/eval# ./eval restart
DEBUG checkTipc: True
WARNING ASP is not running on node [1]. Cleaning up anyway...
INFO Waiting for AMF to shutdown...
DEBUG checkTipc: True
INFO Unloading TIPC ...
DEBUG disable bearer :tipc-config -bd=eth:eth0 ...


Test Case 25: Procedure:

  1. Start Watchdog and AMF with related processes.
  2. kill AMF by restarting command and make sure the restart file is not created
  3. Restart AMS through watchdog issuing command within 30 sec after AMF killing.
  4. Make sure the watchdog is up and running.

Expected result: AMF process not available message is displayed and watchdog to start AMF within 30 sec.
Result: PASS.

root@ubuntu:~/eval/var/run# ps -eaf | grep -i saf
root 7360 1 0 18:49 ? 00:00:00 python /root/eval/etc/safplus_watchdog.py
root 7447 1 0 18:49 ? 00:00:00 /root/eval/bin/safplus_amf -c 0 -l 1 -n SCNodeI0
root 7464 7447 0 18:49 ? 00:00:00 /root/eval/bin/safplus_logd
root 7473 7447 0 18:49 ? 00:00:00 /root/eval/bin/safplus_gms clGmsConfig.xml
root 7493 7447 1 18:49 ? 00:00:00 /root/eval/bin/safplus_event
root 7502 7447 0 18:49 ? 00:00:00 /root/eval/bin/safplus_name
root 7503 7447 1 18:49 ? 00:00:00 /root/eval/bin/safplus_ckpt
root 7530 7447 0 18:49 ? 00:00:00 /root/eval/bin/safplus_msg
root 7567 6499 0 18:49 pts/3 00:00:00 grep --color=auto -i saf

root@ubuntu:~/eval/var/run# ps -eaf | grep -i saf
root 7360 1 0 18:49 ? 00:00:00 python /root/eval/etc/safplus_watchdog.py
root 7447 1 0 18:49 ? 00:00:00 /root/eval/bin/safplus_amf -c 0 -l 1 -n SCNodeI0
root 7464 7447 2 18:49 ? 00:00:00 /root/eval/bin/safplus_logd
root 7530 7447 0 18:49 ? 00:00:00 /root/eval/bin/safplus_msg
root 7608 7597 0 18:49 pts/4 00:00:00 python /root/eval/etc/safplus_watchdog_start.py restart
root 7637 6499 0 18:49 pts/3 00:00:00 grep --color=auto -i saf
root@ubuntu:~/eval/var/run# ps -eaf | grep -i saf
root 7360 1 0 18:49 ? 00:00:00 python /root/eval/etc/safplus_watchdog.py
root 7464 1 2 18:49 ? 00:00:00 /root/eval/bin/safplus_logd
root 7530 1 0 18:49 ? 00:00:00 /root/eval/bin/safplus_msg
root 7608 7597 0 18:49 pts/4 00:00:00 python /root/eval/etc/safplus_watchdog_start.py restart
root 7640 6499 0 18:49 pts/3 00:00:00 grep --color=auto -i saf
root@ubuntu:~/eval/var/run# ps -eaf | grep -i saf
root 7360 1 0 18:49 ? 00:00:00 python /root/eval/etc/safplus_watchdog.py
root 7608 7597 0 18:49 pts/4 00:00:00 python /root/eval/etc/safplus_watchdog_start.py restart
root@ubuntu:~/eval/var/run# ps -eaf | grep -i saf
root 7360 1 0 18:49 ? 00:00:00 python /root/eval/etc/safplus_watchdog.py
root 7784 6499 0 18:50 pts/3 00:00:00 grep --color=auto -i saf

root@ubuntu:~/eval/var/run# ps -eaf | grep -i saf
root 7360 1 0 18:49 ? 00:00:00 python /root/eval/etc/safplus_watchdog.py
root 7863 1 0 18:50 ? 00:00:00 /root/eval/bin/safplus_amf -c 0 -l 1 -n SCNodeI0
root 7878 7863 12 18:50 ? 00:00:32 /root/eval/bin/safplus_logd
root 7888 7863 0 18:50 ? 00:00:00 /root/eval/bin/safplus_gms clGmsConfig.xml
root 7905 7863 0 18:50 ? 00:00:00 /root/eval/bin/safplus_event
root 7914 7863 0 18:50 ? 00:00:00 /root/eval/bin/safplus_name
root 7915 7863 0 18:50 ? 00:00:00 /root/eval/bin/safplus_ckpt
root 7950 7863 0 18:50 ? 00:00:00 /root/eval/bin/safplus_msg
root 8118 6499 0 18:54 pts/3 00:00:00 grep --color=auto -i saf


nateshrevankar commented 10 years ago

The changes are merged to the master branch: https://github.com/OpenClovis/SAFplus-Availability-Scalability-Platform/commit/6ef206e3e5dcb901ea24775c03627e94ee6de8c1

nateshrevankar commented 10 years ago

Suggested Changes:

  1. Running "safplus start" should remove the restart file so that you don't get into a case where running "safplus start" refuses to start SAFplus (happening on my machine).
  2. Also is the sense backwards? If "safplus_restart" file exists you DON'T restart SAFplus?!!! Please don't EVER do that -- let the words diverge from the semantics of the code. It leads to great confusion. I recommend changing the file name to safplus_no_restart.

Updated with the corrections mentioned:

  1. The (no) restart file name is changed to "safplus_no_restart".
  2. When the user issues "safplus start", it deletes the file and then performs a normal start.
  3. "safplus restart" removes the no-restart file and then starts AMF.
  4. "safplus stop" stops AMF and the watchdog, and then removes the no-restart file.

The git link is: https://github.com/OpenClovis/SAFplus-Availability-Scalability-Platform/commit/a042e0acda5ef1c2c4cd8305b70db062bc1fda4f
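
The marker-file semantics after the rename can be sketched as below. The function names and the run_dir parameter are hypothetical; only the file name "safplus_no_restart" and the rule that "start" deletes it first come from the comments above.

```python
import os

NO_RESTART_FILE = "safplus_no_restart"


def clear_no_restart(run_dir):
    """Delete the no-restart marker if present; return True if it was removed."""
    try:
        os.remove(os.path.join(run_dir, NO_RESTART_FILE))
        return True
    except FileNotFoundError:
        return False


def should_restart(run_dir):
    """The watchdog restarts AMF only when the marker is absent."""
    return not os.path.exists(os.path.join(run_dir, NO_RESTART_FILE))


def start(run_dir, start_watchdog):
    """'safplus start': remove a stale marker first so start never refuses to run."""
    clear_no_restart(run_dir)
    start_watchdog()
```

With this naming, the presence of the file means exactly what its name says, which addresses the confusion raised in the suggested changes.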

nateshrevankar commented 10 years ago

Updating code review comments for the changed code in starting Watchdog and the no_start file:

  1. Code review comments on appropriate file names and function names are being addressed.
  2. ZAP functionality for killing SAF processes is updated for the recent watchdog changes.
  3. ASP.py can be used as an independent library and does not call any functions outside this file.

The code review is tracked in: https://docs.google.com/a/openclovis.com/spreadsheet/ccc?key=0AjGC0u9a-NgKdHlvZHFZZ1dyckdEbE5wd3lrN3ZtSUE&usp=drive_web#gid=0

Rough consolidation of LOC: Added: 10, Deleted: 80, Modified: 5

IN-PROGRESS: Further improvement for start-up python script in starting AMF in asp.py file by removing run time database.

nateshrevankar commented 10 years ago

I have updated the start procedure for Watchdog and SAFplus_AMF.

The change details are:

  1. Code review comments from Andrew (https://docs.google.com/a/openclovis.com/spreadsheet/ccc?key=0AjGC0u9a-NgKdHlvZHFZZ1dyckdEbE5wd3lrN3ZtSUE&usp=drive_web#gid=0 )
  2. ASP.py as a separate library which is accessed from the watchdog.
  3. Loading TIPC before starting watchdog.
  4. Removing dynamic database d[] during ASP start in asp.py file.
  5. Separating TIPC dependent code to safplus_tipc.py.
  6. ZAP command.
  7. Issuing Python commands rather than shell commands while starting AMF.
  8. General python coding.

The files for review are:

  1. asp.py
  2. safplus_watchdog.py
  3. safplus_watchdog_start.py
  4. safplus_tipc.py

The changed files are available in my branch in location: https://github.com/nateshrevankar/SAFplus-Availability-Scalability-Platform/commit/628953bcf4513082acd358dfb2d3dc3877ba80a5

nateshrevankar commented 10 years ago

completed