bstansell / conserver

Logged, multi-user access to device consoles
https://www.conserver.com/
BSD 3-Clause "New" or "Revised" License

conserver dying repeatedly with buffer overflow errors (sig6) #72

Open · netllama opened this issue 3 years ago

netllama commented 3 years ago

I'm running the latest conserver (8.2.6) on a Linux-x86_64 server with nearly 6000 console sessions. It's worked fine literally for years, until yesterday, when it started dying with a buffer overflow error (and signal 6).

The conserver itself is configured as follows:

config * {
    reinitcheck 5;
    initdelay 15;
}
default full {
    rw *;
}
default * {
    logfile /mnt/logs/conserver/&.log;
    logfilemax 2m;
    timestamp 10lab;
    include full;
    port 623;
    options reinitoncc;
    ipmiworkaround checksum;
}

access * {
    trusted 127.0.0.1 ;
    allowed 127.0.0.1 ;
    admin ncotoolsprod,root;
}

All of the individual console sessions use the following configuration:

console         c001.for002.ix { type ipmi; master  127.0.0.1; host ipmi.c001.for002.ix; password XXX ; username ABC; }

The only output coming from the conserver when it dies is the following:

[Wed Aug 25 14:30:57 2021] conserver (4773): child pid 17117: signal(6), restarting
*** buffer overflow detected ***: /apps/nco-conserver/sbin/conserver terminated
[Wed Aug 25 14:30:57 2021] conserver (4773): child pid 17118: signal(6), restarting
*** buffer overflow detected ***: /apps/nco-conserver/sbin/conserver terminated
[Wed Aug 25 14:30:57 2021] conserver (4773): child pid 17119: signal(6), restarting
*** buffer overflow detected ***: /apps/nco-conserver/sbin/conserver terminated
[Wed Aug 25 14:30:57 2021] conserver (4773): child pid 17120: signal(6), restarting
*** buffer overflow detected ***: /apps/nco-conserver/sbin/conserver terminated
[Wed Aug 25 14:30:57 2021] conserver (4773): child pid 17121: signal(6), restarting
*** buffer overflow detected ***: /apps/nco-conserver/sbin/conserver terminated
[Wed Aug 25 14:30:57 2021] conserver (4773): child pid 17122: signal(6), restarting
*** buffer overflow detected ***: /apps/nco-conserver/sbin/conserver terminated
[Wed Aug 25 14:30:57 2021] conserver (4773): child pid 17123: signal(6), restarting
*** buffer overflow detected ***: /apps/nco-conserver/sbin/conserver terminated
[Wed Aug 25 14:30:57 2021] conserver (4773): child pid 17124: signal(6), restarting
*** buffer overflow detected ***: /apps/nco-conserver/sbin/conserver terminated
[Wed Aug 25 14:30:57 2021] conserver (4773): child pid 17125: signal(6), restarting
*** buffer overflow detected ***: /apps/nco-conserver/sbin/conserver terminated
[Wed Aug 25 14:30:57 2021] conserver (4773): child pid 17126: signal(6), restarting

If I try stopping the daemon and restarting, it resumes the same sig6 failure behavior. If I reboot the server, then it's stable for several hours before descending back into the same mess again.

The server had been up for over a week prior to this problem beginning for the first time.

I'm at a loss where to begin debugging this, and would appreciate any ideas/guidance/assistance.
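
One way to get more to go on here (a sketch, assuming conserver is started from an interactive shell rather than a service manager, and that paths are adjusted as needed) is to enable core dumps and pull a backtrace from one of the aborting children:

# allow the children to write core files when they abort
ulimit -c unlimited
# send cores somewhere predictable (example pattern; needs root)
echo '/tmp/core.%e.%p' | sudo tee /proc/sys/kernel/core_pattern
# start conserver as usual from this shell, wait for the next SIGABRT, then:
gdb /apps/nco-conserver/sbin/conserver /tmp/core.conserver.<pid>
(gdb) bt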

netllama commented 3 years ago

I attached to the parent conserver process with strace, and it's spewing this stuff repeatedly every time one of the child processes dies:

--- SIGCHLD {si_signo=SIGCHLD, si_code=CLD_KILLED, si_pid=27575, si_uid=0, si_status=SIGABRT, si_utime=0, si_stime=0} ---
rt_sigreturn({mask=[]})                 = -1 EINTR (Interrupted system call)
wait4(-1, [{WIFSIGNALED(s) && WTERMSIG(s) == SIGABRT}], WNOHANG|WSTOPPED, NULL) = 27575
stat("/etc/localtime", {st_mode=S_IFREG|0644, st_size=127, ...}) = 0
write(1, "[Wed Aug 25 15:04:49 2021] conse"..., 84) = 84
socket(AF_INET, SOCK_STREAM, IPPROTO_TCP) = 7
setsockopt(7, SOL_SOCKET, SO_REUSEADDR, [1], 4) = 0
fcntl(7, F_GETFL)                       = 0x2 (flags O_RDWR)
fcntl(7, F_SETFL, O_RDWR|O_NONBLOCK)    = 0
bind(7, {sa_family=AF_INET, sin_port=htons(31194), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
getsockname(7, {sa_family=AF_INET, sin_port=htons(31194), sin_addr=inet_addr("0.0.0.0")}, [16]) = 0
clone(child_stack=NULL, flags=CLONE_CHILD_CLEARTID|CLONE_CHILD_SETTID|SIGCHLD, child_tidptr=0x7f65c9786a10) = 27576
close(7)                                = 0
nanosleep({tv_sec=0, tv_nsec=750000000}, NULL) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)

I'm not sure how to interpret any of this though.
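
For what it's worth, the parent's trace mostly shows it reaping the dead child and forking a replacement; the more interesting data is what the child itself does just before the abort. One way to capture that (a sketch; the output prefix is arbitrary) is to follow forks and split the trace per pid:

# -ff writes one trace file per process (the parent and every child it forks)
strace -f -ff -o /tmp/conserver-trace -p <parent pid>
# after the next crash, /tmp/conserver-trace.<child pid> contains that child's
# final syscalls leading up to the SIGABRT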

netllama commented 3 years ago

After some more experimentation, I've found that the conserver is recoverable (for a few hours) without rebooting the entire OS.

Also, running "console -u", even when the buffer overflow storm has not started, will often fail after reporting only a subset of the sessions, with this error:

console: FileRead(): fd 5: Connection reset by peer
console: lost connection

netllama commented 3 years ago

After additional triage efforts, I have determined that all of this instability is triggered when the hostname (ipmi.c001.for002.ix in the example provided above) is unreachable, due to a DNS resolution failure or an unroutable IP address assignment (such as 192.168.x.x). Once I commented out all console session entries associated with unreachable hosts, the instability went away completely, and conserver has remained stable for nearly 24 hours now. Prior to this change, the conserver was going into the buffer overflow crash state within 4 hours of being started.

This seems to conclusively point to a conserver bug, as a few unreachable sessions should never cause the entire conserver to become unstable and effectively unresponsive.
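
If it helps to reproduce this outside the production setup, a minimal configuration along these lines ought to exercise the same code path (a sketch only: the hostname is deliberately unresolvable, the credentials are placeholders, and access/logging details may need adjusting):

config * { reinitcheck 5; initdelay 15; }
default * {
    logfile /tmp/&.log;
    port 623;
    ipmiworkaround checksum;
}
access * { trusted 127.0.0.1; allowed 127.0.0.1; }
console repro001 {
    type ipmi;
    master 127.0.0.1;
    host ipmi.does-not-resolve.invalid;
    username ABC;
    password XXX;
}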

bstansell commented 3 years ago

Sorry for all the trouble you've been having. Couple things:

* What OS is this on?
* Is this built from source or was it installed via yum/apt/etc?
* Can you share the output of `conserver -V`?
* What command line are you starting conserver with?
* Can you share a part of the logs where inaccessible/unknown hosts are producing errors (before the signal 6 issue)?

It certainly points to a bug somewhere. I'm hoping some of the above will help pinpoint the general location/situation.

netllama commented 3 years ago

Sorry for all the trouble you've been having. Couple things:

* What OS is this on?

Ubuntu 18.04.5 LTS (bionic): GNU/Linux 5.4.0-1055-aws x86_64

* Is this built from source or was it installed via yum/apt/etc?

Built from source as follows:

./configure --prefix=${APP_PATH} --exec-prefix=${APP_PATH} --localstatedir=/data --with-libwrap --with-openssl --with-freeipmi=/apps/nco-freeipmi --with-pam --with-ipv6
make DESTDIR=${BUILD_PATH} && make DESTDIR=${BUILD_PATH} install

* Can you share the output of `conserver -V`?

conserver: conserver.com version 8.2.6
conserver: default access type `r'
conserver: default escape sequence `^Ec'
conserver: default configuration in `/apps/nco-conserver/etc/conserver.cf'
conserver: default password in `/apps/nco-conserver/etc/conserver.passwd'
conserver: default logfile is `/var/log/conserver'
conserver: default pidfile is `/var/run/conserver.pid'
conserver: default limit is 16 members per group
conserver: default primary port referenced as `conserver'
conserver: default secondary base port referenced as `0'
conserver: options: freeipmi, libwrap, openssl, pam
conserver: freeipmi version: 1.2.6
conserver: openssl version: OpenSSL 1.1.1  11 Sep 2018
conserver: built with `./configure --prefix=/apps/nco-conserver --exec-prefix=/apps/nco-conserver --localstatedir=/data --with-libwrap --with-openssl --with-freeipmi=/apps/nco-freeipmi --with-pam --with-ipv6'
* What command line are you starting conserver with?

/apps/nco-conserver/sbin/conserver -C /apps/nco-conserver/etc/config/conserver.cf
* Can you share a part of the logs where inaccessible/unknown hosts are producing errors (before the signal 6 issue)?

Sure, but the catch here is that the "normal" errors from a host whose hostname does resolve and is routable are no different from those where the hostname fails to resolve or has no valid route. These are hosts that did not resolve (had no DNS record):

[Thu Aug 26 00:07:55 2021] conserver (15024): ERROR: [c040.bom001.ix] Could not connect to IPMI host `ipmi.c040.bom001.ix': forcing down
[Thu Aug 26 00:07:55 2021] conserver (15024): ERROR: [c039.bom001.ix] Could not connect to IPMI host `ipmi.c039.bom001.ix': forcing down
[Thu Aug 26 00:07:55 2021] conserver (15024): ERROR: FileWrite(): fd 0: Bad file descriptor
[Thu Aug 26 00:07:55 2021] conserver (15024): ERROR: [c038.bom001.ix] Could not connect to IPMI host `ipmi.c038.bom001.ix': forcing down
[Thu Aug 26 00:07:55 2021] conserver (15024): ERROR: FileWrite(): fd 0: Bad file descriptor
[Thu Aug 26 00:07:55 2021] conserver (15090): ERROR: [c151.fra002.ix] Could not connect to IPMI host `ipmi.c151.fra002.ix': forcing down
[Thu Aug 26 00:07:55 2021] conserver (15090): ERROR: FileWrite(): fd 0: Bad file descriptor
[Thu Aug 26 00:07:55 2021] conserver (15090): ERROR: [c150.fra002.ix] Could not connect to IPMI host `ipmi.c150.fra002.ix': forcing down
[Thu Aug 26 00:07:55 2021] conserver (15090): ERROR: FileWrite(): fd 0: Bad file descriptor
[Thu Aug 26 00:07:55 2021] conserver (15090): ERROR: [c149.fra002.ix] Could not connect to IPMI host `ipmi.c149.fra002.ix': forcing down
[Thu Aug 26 00:07:55 2021] conserver (15090): ERROR: FileWrite(): fd 0: Bad file descriptor
[Thu Aug 26 00:07:55 2021] conserver (15090): ERROR: [c148.fra002.ix] Could not connect to IPMI host `ipmi.c148.fra002.ix': forcing down
[Thu Aug 26 00:07:55 2021] conserver (15090): ERROR: FileWrite(): fd 0: Bad file descriptor
[Thu Aug 26 00:07:55 2021] conserver (15090): ERROR: [c147.fra002.ix] Could not connect to IPMI host `ipmi.c147.fra002.ix': forcing down
[Thu Aug 26 00:07:55 2021] conserver (15090): ERROR: FileWrite(): fd 0: Bad file descriptor
[Thu Aug 26 00:07:56 2021] conserver (16043): ERROR: [c006.sjc004.dev] connect timeout: forcing down
[Thu Aug 26 00:07:56 2021] conserver (16043): ERROR: FileWrite(): fd 0: Bad file descriptor
[Thu Aug 26 00:07:56 2021] conserver (16043): ERROR: [c003.sjc004.dev] connect timeout: forcing down
[Thu Aug 26 00:07:56 2021] conserver (16043): ERROR: [c026.sjc003.dev] connect timeout: forcing down
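
As an interim mitigation while this is open, one option (a sketch; it assumes the console entries are generated from a plain host list, and the file names here are made up) is to drop hosts that no longer resolve before regenerating the config:

# keep only hostnames that resolve; getent goes through the same resolver
# path conserver uses at runtime (note this does not catch hosts that
# resolve but are unroutable)
while read -r host; do
    getent hosts "$host" > /dev/null && echo "$host"
done < ipmi-hosts.txt > ipmi-hosts.resolvable.txt
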
netllama commented 3 years ago

@bstansell is there any other information that I can provide to continue with the investigation?

netllama commented 3 years ago

@bstansell any updates?

netllama commented 2 years ago

@bstansell as it's now been over a month, will there ever be any follow-up on this bug?

netllama commented 1 year ago

Over a year, and no response.

agialluc commented 10 months ago

FYI: you may want to take a look at https://github.com/bstansell/conserver/issues/93

netllama commented 10 months ago

@agialluc wow, thanks, that does look very relevant. Thanks for sharing!