Solaris 11: ndo.sock is not deleted when service is disabled/stopped

box293 commented 7 years ago

I don't think that Solaris 11 is deleting the /usr/local/nagios/var/ndo.sock when the service is disabled (stopped). This causes the ndo2db service to fail to start when it is enabled because it detects that the ndo.sock file exists. Requires you to delete the ndo.sock file to resolve the issue.

Here you can see ndo2db is not running (currently disabled) and the ndo.sock file does not exist:

root@core-041:~# ps -ef | grep ndo
--->No Output<---

root@core-041:~# svcs -xv ndo2db
svc:/network/nagios/ndo2db:default (NDO2DB daemon)
 State: disabled since May  8, 2017 05:00:37 PM EST
Reason: Disabled by an administrator.
   See: http://support.oracle.com/msg/SMF-8000-05
   See: http://www.nagios.org
   See: /var/svc/log/network-nagios-ndo2db:default.log
Impact: This service is not running.

root@core-041:~# ls -la /usr/local/nagios/var/ndo.sock 
/usr/local/nagios/var/ndo.sock: No such file or directory

Now I enable the ndo2db service

root@core-041:~# svcadm clear ndo2db
svcadm: Instance "svc:/network/nagios/ndo2db:default" is not in a maintenance or degraded state.

root@core-041:~# svcadm enable ndo2db
root@core-041:~# svcs -xv ndo2db
svc:/network/nagios/ndo2db:default (NDO2DB daemon)
 State: online since May  8, 2017 05:02:55 PM EST
   See: http://www.nagios.org
   See: /var/svc/log/network-nagios-ndo2db:default.log
Impact: None.

You can see it was running, checking the nagios.log file shows it's working:

[1494226975] ndomod: Successfully reconnected to data sink!  0 items lost, 680 queued items to flush.
[1494226976] ndomod: Successfully flushed 680 queued items to data sink.

Now I disable the ndo2db service:

root@core-041:~# svcadm disable ndo2db

root@core-041:~# ls -la /usr/local/nagios/var/ndo.sock 
srwxr-xr-x   1 nagios   nagios         0 May  8 17:02 /usr/local/nagios/var/ndo.sock

root@core-041:~# svcs -xv ndo2db
svc:/network/nagios/ndo2db:default (NDO2DB daemon)
 State: disabled since May  8, 2017 05:03:57 PM EST
Reason: Disabled by an administrator.
   See: http://support.oracle.com/msg/SMF-8000-05
   See: http://www.nagios.org
   See: /var/svc/log/network-nagios-ndo2db:default.log
Impact: This service is not running.

You can see the service is not running and also the ndo.sock file exists.

Now to start the service again:

root@core-041:~# svcadm enable ndo2db

root@core-041:~# svcs -xv ndo2db
svc:/network/nagios/ndo2db:default (NDO2DB daemon)
 State: maintenance since May  8, 2017 05:05:42 PM EST
Reason: Start method failed repeatedly, last exited with status 1.
   See: http://support.oracle.com/msg/SMF-8000-KS
   See: http://www.nagios.org
   See: /var/svc/log/network-nagios-ndo2db:default.log
Impact: This service is not running.

root@core-041:~# tail -20 /var/svc/log/network-nagios-ndo2db:default.log
[ May  8 17:02:55 Enabled. ]
[ May  8 17:02:55 Executing start method ("/usr/local/nagios/bin/ndo2db -c /usr/local/nagios/etc/ndo2db.cfg"). ]
[ May  8 17:02:55 Method "start" exited with status 0. ]
[ May  8 17:03:57 Stopping because service disabled. ]
[ May  8 17:03:57 Executing stop method (:kill). ]
[ May  8 17:05:42 Enabled. ]
[ May  8 17:05:42 Executing start method ("/usr/local/nagios/bin/ndo2db -c /usr/local/nagios/etc/ndo2db.cfg"). ]
Could not bind socket: Address already in use
[ May  8 17:05:42 Method "start" exited with status 1. ]
[ May  8 17:05:42 Executing start method ("/usr/local/nagios/bin/ndo2db -c /usr/local/nagios/etc/ndo2db.cfg"). ]
Could not bind socket: Address already in use
[ May  8 17:05:42 Method "start" exited with status 1. ]
[ May  8 17:05:42 Executing start method ("/usr/local/nagios/bin/ndo2db -c /usr/local/nagios/etc/ndo2db.cfg"). ]
Could not bind socket: Address already in use
[ May  8 17:05:42 Method "start" exited with status 1. ]

root@core-041:~# ls -la /usr/local/nagios/var/ndo.sock 
srwxr-xr-x   1 nagios   nagios         0 May  8 17:02 /usr/local/nagios/var/ndo.sock

root@core-041:~# ps -ef | grep ndo
--->No Output<---

You can see that the service was not started.

Now to delete the ndo.sock file and then clear the maintenance status on the service:

root@core-041:~# rm -f /usr/local/nagios/var/ndo.sock

root@core-041:~# svcadm clear ndo2db

root@core-041:~# svcs -xv ndo2db
svc:/network/nagios/ndo2db:default (NDO2DB daemon)
 State: online since May  8, 2017 05:10:39 PM EST
   See: http://www.nagios.org
   See: /var/svc/log/network-nagios-ndo2db:default.log
Impact: None.

root@core-041:~# ps -ef | grep ndo
  nagios  2318  2317   1 17:10:47 ?           0:00 /usr/local/nagios/bin/ndo2db -c /usr/local/nagios/etc/ndo2db.cfg
  nagios  2317  2315   0 17:10:47 ?           0:00 /usr/local/nagios/bin/ndo2db -c /usr/local/nagios/etc/ndo2db.cfg
  nagios  2315     1   0 17:10:39 ?           0:00 /usr/local/nagios/bin/ndo2db -c /usr/local/nagios/etc/ndo2db.cfg

root@core-041:~# tail /var/svc/log/network-nagios-ndo2db:default.log
[ May  8 17:10:39 Leaving maintenance because clear requested. ]
[ May  8 17:10:39 Enabled. ]
[ May  8 17:10:39 Executing start method ("/usr/local/nagios/bin/ndo2db -c /usr/local/nagios/etc/ndo2db.cfg"). ]
[ May  8 17:10:39 Method "start" exited with status 0. ]

root@core-041:~# tail /usr/local/nagios/var/nagios.log  
[1494227407] ndomod: Still unable to reconnect to data sink.  0 items lost, 475 queued items to flush.
[1494227427] ndomod: Still unable to reconnect to data sink.  0 items lost, 501 queued items to flush.
[1494227447] ndomod: Successfully reconnected to data sink!  0 items lost, 527 queued items to flush.
[1494227448] ndomod: Successfully flushed 527 queued items to data sink.

For comparision, here you can see that when the ndo2db service is stoppped on a CentOS 6.x server the ndo.sock file is deleted:

[root@core-014 ndoutils-2.1.3]# ls -la /usr/local/nagios/var/ndo.sock 
srwxr-xr-x 1 nagios nagios 0 May  8 16:27 /usr/local/nagios/var/ndo.sock

[root@core-014 ndoutils-2.1.3]# stop ndo2db
ndo2db stop/waiting

[root@core-014 ndoutils-2.1.3]# ls -la /usr/local/nagios/var/ndo.sock 
ls: cannot access /usr/local/nagios/var/ndo.sock: No such file or directory

[root@core-014 ndoutils-2.1.3]# start ndo2db
ndo2db start/running, process 2408

[root@core-014 ndoutils-2.1.3]# ls -la /usr/local/nagios/var/ndo.sock 
srwxr-xr-x 1 nagios nagios 0 May  8 17:18 /usr/local/nagios/var/ndo.sock

jfrickson commented 7 years ago

I wonder what signal svcadm disable sends to the process. If it's a SIGKILL, there's no way to catch it and delete the socket. Couldn't find anything about it in google search. Unfortunately, ndo2db does not log any shutdown messages.

void ndo2db_parent_sighandler(int sig){

    switch (sig){
    case SIGTERM:   <<<<<<<<<<<<----------- this SHOULD be the signal it gets
    case SIGINT:
        /* forward signal to all members of this group of processes */
        kill(0, sig);
        break;
    case SIGCHLD:
        /* cleanup children that exit, so we don't have zombies */
        while(waitpid(-1,NULL,WNOHANG)>0);
        return;
    default:
        printf("Caught the Signal '%d' but don't care about this.\n", sig);
    }

    /* cleanup the socket */
    ndo2db_cleanup_socket();   <<<<<<<<<<<<----------- this is where the socket gets deleted

    /* free memory */
    ndo2db_free_program_memory();

    exit(0);

    return;
        }

box293 commented 6 years ago

I tested this further today. Using kill -15 pid should send the SIGTERM to the process.

root@core-041:/var/tmp/ndoutils-ndoutils-2.1.3# ps -ef | grep ndo
root@core-041:/var/tmp/ndoutils-ndoutils-2.1.3# rm -f /usr/local/nagios/var/ndo.sock
root@core-041:/var/tmp/ndoutils-ndoutils-2.1.3# /usr/local/nagios/bin/ndo2db -c /usr/local/nagios/etc/ndo2db.cfg
root@core-041:/var/tmp/ndoutils-ndoutils-2.1.3# ps -ef | grep ndo
  nagios  4157     1   0 14:19:09 ?           0:00 /usr/local/nagios/bin/ndo2db -c /usr/local/nagios/etc/ndo2db.cfg
root@core-041:/var/tmp/ndoutils-ndoutils-2.1.3# kill -15 4157
root@core-041:/var/tmp/ndoutils-ndoutils-2.1.3# ps -ef | grep ndo
root@core-041:/var/tmp/ndoutils-ndoutils-2.1.3# ls -la /usr/local/nagios/var/ndo.sock
srwxr-xr-x   1 nagios   nagios         0 Jul  5 14:19 /usr/local/nagios/var/ndo.sock

That test showed it's nothing to do with the service manifest file or how the kill is initiated.

I added some debug code:

void ndo2db_parent_sighandler(int sig){
    ndo2db_log_debug_info(NDO2DB_DEBUGL_PROCESSINFO, 2,"TEST 1\n");
    switch (sig){
    case SIGTERM:
    case SIGINT:
        ndo2db_log_debug_info(NDO2DB_DEBUGL_PROCESSINFO, 2,"TEST 1A\n");
        ndo2db_log_debug_info(NDO2DB_DEBUGL_PROCESSINFO, 2,"sig = %i\n", sig);
        /* forward signal to all members of this group of processes */
        kill(0, sig);
        break;
    case SIGCHLD:
        /* cleanup children that exit, so we don't have zombies */
        while(waitpid(-1,NULL,WNOHANG)>0);
        return;
    default:
        printf("Caught the Signal '%d' but don't care about this.\n", sig);
    }
    ndo2db_log_debug_info(NDO2DB_DEBUGL_PROCESSINFO, 2,"TEST 1B\n");
    /* cleanup the socket */
    ndo2db_cleanup_socket();
    /* free memory */
    ndo2db_free_program_memory();
    exit(0);
    return;
        }

I only ever see this:

[1530767664.401914] [001.2] [pid=21375] TEST 1A
[1530767664.401932] [001.2] [pid=21375] sig = 15

I never receive the TEST 1B message.

I went and did the same test on RHEL and I got this output:

[1530768741.353795] [001.2] [pid=5516] TEST 1A
[1530768741.353798] [001.2] [pid=5516] sig = 15
[1530768741.353802] [001.2] [pid=5516] TEST 1B

So it's something to do with Solaris and the kill function, it's like the process is being killed entirely hence the ndo2db_cleanup_socket() function is never called.

hedenface commented 5 years ago

Since there are no more services/daemons to worry about for ndo-3, I'll be closing this.

NagiosEnterprises / ndoutils

Solaris 11: ndo.sock is not deleted when service is disabled/stopped #33