box293 closed this issue 5 years ago
I wonder what signal svcadm disable sends to the process. If it's SIGKILL, there's no way to catch it and delete the socket. I couldn't find anything about it in a Google search. Unfortunately, ndo2db does not log any shutdown messages.
void ndo2db_parent_sighandler(int sig){
    switch (sig){
    case SIGTERM: /* <-- this SHOULD be the signal it gets */
    case SIGINT:
        /* forward signal to all members of this group of processes */
        kill(0, sig);
        break;
    case SIGCHLD:
        /* cleanup children that exit, so we don't have zombies */
        while(waitpid(-1,NULL,WNOHANG)>0);
        return;
    default:
        printf("Caught the Signal '%d' but don't care about this.\n", sig);
    }
    /* cleanup the socket */
    ndo2db_cleanup_socket(); /* <-- this is where the socket gets deleted */
    /* free memory */
    ndo2db_free_program_memory();
    exit(0);
    return;
}
I tested this further today. Using kill -15 pid should send SIGTERM to the process.
root@core-041:/var/tmp/ndoutils-ndoutils-2.1.3# ps -ef | grep ndo
root@core-041:/var/tmp/ndoutils-ndoutils-2.1.3# rm -f /usr/local/nagios/var/ndo.sock
root@core-041:/var/tmp/ndoutils-ndoutils-2.1.3# /usr/local/nagios/bin/ndo2db -c /usr/local/nagios/etc/ndo2db.cfg
root@core-041:/var/tmp/ndoutils-ndoutils-2.1.3# ps -ef | grep ndo
nagios 4157 1 0 14:19:09 ? 0:00 /usr/local/nagios/bin/ndo2db -c /usr/local/nagios/etc/ndo2db.cfg
root@core-041:/var/tmp/ndoutils-ndoutils-2.1.3# kill -15 4157
root@core-041:/var/tmp/ndoutils-ndoutils-2.1.3# ps -ef | grep ndo
root@core-041:/var/tmp/ndoutils-ndoutils-2.1.3# ls -la /usr/local/nagios/var/ndo.sock
srwxr-xr-x 1 nagios nagios 0 Jul 5 14:19 /usr/local/nagios/var/ndo.sock
That test shows the problem has nothing to do with the service manifest file or how the kill is initiated.
I added some debug code:
void ndo2db_parent_sighandler(int sig){
    ndo2db_log_debug_info(NDO2DB_DEBUGL_PROCESSINFO, 2,"TEST 1\n");
    switch (sig){
    case SIGTERM:
    case SIGINT:
        ndo2db_log_debug_info(NDO2DB_DEBUGL_PROCESSINFO, 2,"TEST 1A\n");
        ndo2db_log_debug_info(NDO2DB_DEBUGL_PROCESSINFO, 2,"sig = %i\n", sig);
        /* forward signal to all members of this group of processes */
        kill(0, sig);
        break;
    case SIGCHLD:
        /* cleanup children that exit, so we don't have zombies */
        while(waitpid(-1,NULL,WNOHANG)>0);
        return;
    default:
        printf("Caught the Signal '%d' but don't care about this.\n", sig);
    }
    ndo2db_log_debug_info(NDO2DB_DEBUGL_PROCESSINFO, 2,"TEST 1B\n");
    /* cleanup the socket */
    ndo2db_cleanup_socket();
    /* free memory */
    ndo2db_free_program_memory();
    exit(0);
    return;
}
I only ever see this:
[1530767664.401914] [001.2] [pid=21375] TEST 1A
[1530767664.401932] [001.2] [pid=21375] sig = 15
I never see the TEST 1B message.
I ran the same test on RHEL and got this output:
[1530768741.353795] [001.2] [pid=5516] TEST 1A
[1530768741.353798] [001.2] [pid=5516] sig = 15
[1530768741.353802] [001.2] [pid=5516] TEST 1B
So it's something to do with Solaris and the kill function. It's as if the process is killed outright at the kill(0, sig) call, so ndo2db_cleanup_socket() is never called.
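One possible explanation, offered as a hypothesis and not something confirmed in this thread: kill(0, sig) also signals the calling process itself. If the handler is installed with signal(), and the platform gives signal() System V semantics (the disposition is reset to the default when the handler is entered, and the signal is not blocked while the handler runs), then the forwarded SIGTERM would terminate the process inside the handler, right after "TEST 1A" and before the cleanup code. The sketch below emulates both behaviours with sigaction() flags in a throwaway child process. demo_handler and cleanup_reached are hypothetical names for illustration, not ndo2db code.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for ndo2db_parent_sighandler(): forward the signal
 * to the whole process group, then take the "cleanup and exit" path. */
static void demo_handler(int sig) {
    kill(0, sig);  /* the process group includes the caller itself */
    _exit(0);      /* reached only if the forwarded signal did not kill us */
}

/* Runs the handler in a child process.
 * sysv_semantics != 0 emulates a handler that is reset on entry and does not
 * block its own signal; 0 keeps the handler installed and the signal blocked.
 * Returns 1 if the child reached its cleanup path (normal exit),
 * 0 if it was terminated by the signal, -1 on setup failure. */
int cleanup_reached(int sysv_semantics) {
    int ready[2];
    if (pipe(ready) != 0) return -1;

    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) {
        close(ready[0]);
        /* Own process group, so kill(0, ...) cannot reach the parent. */
        if (setpgid(0, 0) != 0) _exit(2);

        struct sigaction sa;
        memset(&sa, 0, sizeof sa);
        sa.sa_handler = demo_handler;
        sigemptyset(&sa.sa_mask);
        sa.sa_flags = sysv_semantics ? (SA_RESETHAND | SA_NODEFER) : 0;
        if (sigaction(SIGTERM, &sa, NULL) != 0) _exit(2);

        /* Tell the parent the handler is installed, then wait for SIGTERM. */
        if (write(ready[1], "x", 1) != 1) _exit(2);
        for (;;) pause();
    }

    assert(pid > 0);  /* parent branch from here on */
    close(ready[1]);
    char c;
    ssize_t n = read(ready[0], &c, 1);
    close(ready[0]);
    kill(pid, n == 1 ? SIGTERM : SIGKILL);

    int status;
    if (waitpid(pid, &status, 0) != pid || n != 1) return -1;
    return WIFEXITED(status) && WEXITSTATUS(status) == 0;
}
```

If this hypothesis holds, cleanup_reached(1) mirrors the Solaris log (no "TEST 1B") and cleanup_reached(0) mirrors the RHEL log. Installing the real handler with sigaction() and no SA_RESETHAND, or running the cleanup before forwarding the signal, would be two ways to test it.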
Since there are no more services/daemons to worry about for ndo-3, I'll be closing this.
I don't think Solaris 11 deletes /usr/local/nagios/var/ndo.sock when the service is disabled (stopped). This causes the ndo2db service to fail to start the next time it is enabled, because it detects that the ndo.sock file already exists. You have to delete the ndo.sock file manually to resolve the issue.
Here you can see ndo2db is not running (currently disabled) and the ndo.sock file does not exist:
Now I enable the ndo2db service:
You can see it is running, and checking the nagios.log file shows it's working:
Now I disable the ndo2db service:
You can see the service is not running and also the ndo.sock file exists.
Now to start the service again:
You can see that the service was not started.
Now to delete the ndo.sock file and then clear the maintenance status on the service:
For comparison, here you can see that when the ndo2db service is stopped on a CentOS 6.x server the ndo.sock file is deleted:
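The failure described in this report (a leftover ndo.sock blocking the next start) could also be handled at startup instead of only at shutdown. The following is a minimal sketch of that idea, not how ndo2db behaves: before binding, probe the existing socket file with connect() and remove it only if nothing is listening. remove_stale_socket is a hypothetical helper name, and treating ECONNREFUSED as "stale" is an assumption of this sketch.

```c
#define _XOPEN_SOURCE 700
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/un.h>
#include <unistd.h>

/* Hypothetical helper: decide what to do with an existing socket file.
 * Returns  0 if the path does not exist (nothing to do),
 *          1 if a stale socket file was found and removed,
 *          2 if a live listener answered (file left alone),
 *         -1 on any other error (file left alone). */
int remove_stale_socket(const char *path) {
    assert(path != NULL);

    struct stat st;
    if (lstat(path, &st) != 0)
        return errno == ENOENT ? 0 : -1;
    if (!S_ISSOCK(st.st_mode))
        return -1;  /* refuse to delete anything that is not a socket */

    struct sockaddr_un addr;
    if (strlen(path) >= sizeof addr.sun_path)
        return -1;  /* path does not fit in sun_path */
    memset(&addr, 0, sizeof addr);
    addr.sun_family = AF_UNIX;
    strcpy(addr.sun_path, path);

    int fd = socket(AF_UNIX, SOCK_STREAM, 0);
    if (fd < 0) return -1;

    /* A successful connect means some process still owns the socket. */
    int rc = connect(fd, (struct sockaddr *)&addr, sizeof addr);
    int err = errno;
    close(fd);

    if (rc == 0) return 2;
    if (err != ECONNREFUSED) return -1;  /* unclear state: do not delete */

    /* Nobody is listening: the file is a leftover from a previous run. */
    return unlink(path) == 0 ? 1 : -1;
}
```

A daemon could call this on its configured socket path before bind(); a return value of 2 would still be a reason to refuse to start, since another instance is running.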