NagiosEnterprises / nrpe

NRPE Agent
GNU General Public License v2.0

Signal 6 (SIGABRT) with every execution of a command #227

Status: Closed (hariwe closed this issue 4 years ago)

hariwe commented 4 years ago

Affects Version: 4.0.0 with latest pull request #225
OS: RHEL 7

I'm running nrpe under systemd. Every time a check is executed, the spawned nrpe process is terminated by signal 6. However, check_nrpe still receives a result and works fine.

GDB Core Output:

> Reading symbols from /opt/nagios/plugins/bin/nrpe...Reading symbols from /opt/nagios/plugins/bin/nrpe...(no debugging symbols found)...done.
> (no debugging symbols found)...done.
> [New LWP 24473]
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib64/libthread_db.so.1".
> Core was generated by `/opt/nagios/plugins/bin/nrpe -c /opt/nagios/plugins/etc/nrpe.cfg -f'.
> Program terminated with signal 6, Aborted.
> #0  0x00007f22d2238207 in raise () from /lib64/libc.so.6
> (gdb) #0  0x00007f22d2238207 in raise () from /lib64/libc.so.6
> #1  0x00007f22d22398f8 in abort () from /lib64/libc.so.6
> #2  0x00007f22d227ad27 in __libc_message () from /lib64/libc.so.6
> #3  0x00007f22d2283489 in _int_free () from /lib64/libc.so.6
> #4  0x00007f22d285770d in CRYPTO_free () from /lib64/libcrypto.so.10
> #5  0x00007f22d2913f5e in EVP_CIPHER_CTX_cleanup ()
>    from /lib64/libcrypto.so.10
> #6  0x00007f22d2c8edc5 in ssl_clear_cipher_ctx () from /lib64/libssl.so.10
> #7  0x00007f22d2c8fc75 in SSL_free () from /lib64/libssl.so.10
> #8  0x0000000000407cb9 in handle_connection ()
> #9  0x0000000000408296 in wait_for_connections ()
> #10 0x0000000000408383 in run_src ()
> #11 0x00000000004039c2 in main ()
> (gdb)   Id   Target Id         Frame 
> * 1    Thread 0x7f22d30c6840 (LWP 24473) 0x00007f22d2238207 in raise ()
>    from /lib64/libc.so.6
> (gdb) 
> Thread 1 (Thread 0x7f22d30c6840 (LWP 24473)):
> #0  0x00007f22d2238207 in raise () from /lib64/libc.so.6
> #1  0x00007f22d22398f8 in abort () from /lib64/libc.so.6
> #2  0x00007f22d227ad27 in __libc_message () from /lib64/libc.so.6
> #3  0x00007f22d2283489 in _int_free () from /lib64/libc.so.6
> #4  0x00007f22d285770d in CRYPTO_free () from /lib64/libcrypto.so.10
> #5  0x00007f22d2913f5e in EVP_CIPHER_CTX_cleanup ()
>    from /lib64/libcrypto.so.10
> #6  0x00007f22d2c8edc5 in ssl_clear_cipher_ctx () from /lib64/libssl.so.10
> #7  0x00007f22d2c8fc75 in SSL_free () from /lib64/libssl.so.10
> #8  0x0000000000407cb9 in handle_connection ()
> #9  0x0000000000408296 in wait_for_connections ()
> #10 0x0000000000408383 in run_src ()
> #11 0x00000000004039c2 in main ()

sawolf commented 4 years ago

Thanks for reporting this.

I haven't been able to reproduce this issue so far. Running under gdb with detach-on-fork off, follow-fork-mode child, and breakpoints set for SSL_free and abort, I do encounter SSL_free but never abort.

Does the issue occur if you run systemd's command (/opt/nagios/plugins/bin/nrpe -c /opt/nagios/plugins/etc/nrpe.cfg -f) in your terminal? Do you get any log messages when trying to run commands? ~What distribution are you running?~ I'm running CentOS 7 on this machine, so I wouldn't expect any different behavior there.

hariwe commented 4 years ago

Hi,

I cannot reproduce this when I run nrpe directly from the command line (as user nagios), so it seems to be related to systemd. My service file:

[Service]
Type=simple
Restart=on-abort
ExecStart=/opt/nagios/plugins/bin/nrpe -c /opt/nagios/plugins/etc/nrpe.cfg -f
ExecReload=/bin/kill -HUP $MAINPID
User=nagios
Group=nagios

hariwe commented 4 years ago

I was finally able to reproduce the issue in Valgrind using the latest master branch.

The Valgrind output is as follows:

==8227== Invalid write of size 1
==8227==    at 0x4C2D0F3: strcpy (vg_replace_strmem.c:513)
==8227==    by 0x40758F: handle_connection (nrpe.c:1927)
==8227==    by 0x40668A: wait_for_connections (nrpe.c:1441)
==8227==    by 0x4047FC: run_src (nrpe.c:642)
==8227==    by 0x403CF5: main (nrpe.c:224)
==8227==  Address 0x75cf438 is 0 bytes after a block of size 88 alloc'd
==8227==    at 0x4C2BF79: calloc (vg_replace_malloc.c:762)
==8227==    by 0x4074FC: handle_connection (nrpe.c:1919)
==8227==    by 0x40668A: wait_for_connections (nrpe.c:1441)
==8227==    by 0x4047FC: run_src (nrpe.c:642)
==8227==    by 0x403CF5: main (nrpe.c:224)

This patch fixes the issue:

--- nrpe-4.0.0/src/nrpe.c       2020-01-15 17:01:48.000000000 +0100
+++ nrpe-4.0.0.patched/src/nrpe.c       2020-02-27 13:59:56.562148344 +0100
@@ -1912,9 +1912,9 @@

        } else {

-               pkt_size = (sizeof(v3_packet) - NRPE_V4_PACKET_SIZE_OFFSET) + strlen(send_buff);
+               pkt_size = (sizeof(v3_packet) - NRPE_V4_PACKET_SIZE_OFFSET) + strlen(send_buff) + 1;
                if (packet_ver == NRPE_PACKET_VERSION_3) {
-                       pkt_size = (sizeof(v3_packet) - NRPE_V3_PACKET_SIZE_OFFSET) + strlen(send_buff);
+                       pkt_size = (sizeof(v3_packet) - NRPE_V3_PACKET_SIZE_OFFSET) + strlen(send_buff) + 1;
                }
                v3_send_packet = calloc(1, pkt_size);
                send_pkt = (char *)v3_send_packet;
@@ -1923,7 +1923,7 @@
                v3_send_packet->packet_type = htons(RESPONSE_PACKET);
                v3_send_packet->result_code = htons(result);
                v3_send_packet->alignment = 0;
-               v3_send_packet->buffer_length = htonl(strlen(send_buff));
+               v3_send_packet->buffer_length = htonl(strlen(send_buff) + 1);
                strcpy(&v3_send_packet->buffer[0], send_buff);

                /* calculate the crc 32 value of the packet */
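
The sizing rule behind the patch can be sketched in isolation. strcpy() copies strlen() bytes plus the terminating NUL, so a buffer sized with strlen() alone is one byte short, which matches the "Invalid write of size 1 ... 0 bytes after a block" report above. The following is only a minimal sketch under stated assumptions: demo_packet, demo_packet_size and demo_packet_new are hypothetical names, the struct is a simplified stand-in rather than the real v3_packet layout, and the byte-order conversion (htons/htonl) done in nrpe.c is left out.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical, simplified packet type (NOT the real v3_packet layout).
 * The payload is a C99 flexible array member, so sizeof(demo_packet)
 * covers the header fields only. */
typedef struct {
    int16_t  packet_version;
    int16_t  packet_type;
    uint32_t crc32_value;
    int16_t  result_code;
    int16_t  alignment;
    int32_t  buffer_length;
    char     buffer[];
} demo_packet;

/* Bytes needed for the header plus the payload INCLUDING its NUL terminator.
 * Dropping the "+ 1" reproduces the off-by-one that Valgrind reported. */
size_t demo_packet_size(const char *payload)
{
    return sizeof(demo_packet) + strlen(payload) + 1;
}

/* Allocate and fill a packet. The caller owns the result and must free() it.
 * Returns NULL if the allocation fails. */
demo_packet *demo_packet_new(const char *payload, size_t *out_size)
{
    size_t payload_bytes = strlen(payload) + 1;   /* text + NUL */
    size_t total = demo_packet_size(payload);
    demo_packet *pkt = calloc(1, total);

    if (pkt == NULL)
        return NULL;

    /* Host byte order here; the real code converts with htonl(). */
    pkt->buffer_length = (int32_t)payload_bytes;

    /* Copies exactly payload_bytes bytes, which now fit in the allocation. */
    memcpy(pkt->buffer, payload, payload_bytes);

    if (out_size != NULL)
        *out_size = total;
    return pkt;
}
```

Calling demo_packet_new("OK", &size) yields a packet whose buffer_length is 3 and whose allocation is sizeof(demo_packet) + 3 bytes. Keeping the allocation size and the advertised length derived from the same payload_bytes value makes it harder for the two to drift apart again.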

sawolf commented 4 years ago

I'm really surprised that the patch shown here would fix the original issue. That said, I do think it's a good change regardless. I can verify the valgrind issue affects my development machine as well.

hariwe commented 4 years ago

Thanks! The previous pull request #228 improved things: the crash no longer happened every time, but it still occurred occasionally. This patch now fixes it completely for me.