Icinga / icinga2

The core of our monitoring platform with a powerful configuration language and REST API.
https://icinga.com/docs/icinga2/latest
GNU General Public License v2.0
1.99k stars 573 forks source link

Segfault Upgrade Icinga 2.13.7 -> 2.14.0 #9854

Open Crited opened 1 year ago

Crited commented 1 year ago

Describe the bug

Icinga 2 crashes immediatly after start/restart.

After upgrading Icinga 2 from 2.13.7 to 2.14.0 the configuration master ran into a segfault.

To Reproduce

Unknown. First occurence of the error after deploying new configuration -> a reload of the service.

Expected behavior

Icinga 2 should not crash.

Your Environment

Include as many relevant details about the environment you experienced the problem in

Additional context

Worth mentioning:

Aug 23 11:04:42 **** systemd[1]: Started Icinga host/service/network monitoring system.
Aug 23 11:04:45 **** systemd[1]: icinga2.service: Main process exited, code=exited, status=139/n/a
Aug 23 11:04:45 **** systemd[1]: icinga2.service: Failed with result 'exit-code'.
Aug 23 11:06:47 **** systemd[1]: Starting Icinga host/service/network monitoring system...

As well as the segfault itself.

Aug 23 11:06:58 **** icinga2[22890]: [2023-08-23 11:06:58 +0200] information/cli: Closing console log.
Aug 23 11:06:58 **** systemd[1]: Started Icinga host/service/network monitoring system.
Aug 23 11:07:00 **** kernel: icinga2[23000]: segfault at 1b0 ip 0000000000f6146c sp 00007f401f74b1c0 error 4 in icinga2[400000+ebe000]
Aug 23 11:07:00 **** kernel: Code: 00 48 89 43 10 48 c7 43 18 00 00 00 00 0f ae f0 f0 48 83 45 00 01 e8 83 37 a6 ff f0 48 83 45 00 01 0f b6 44 24 0c 48 8b 0c 24 <4c> 8b a1 b0 01 00 00 4c 8d a9 b8 01 00 00 88 44 24 28 48 8b 84 24
Aug 23 11:07:01 **** systemd[1]: icinga2.service: Main process exited, code=exited, status=139/n/a

We've attempted to produce a core dump, but were not able to. Due to the fact that the system is in production the environment was downgraded to 2.13.8.

julianbrost commented 1 year ago

Looks like it's a problem in icinga::RedisConnection::FireAndForgetQueries():

(gdb) info line *0x0000000000f6146c
Line 59 of "/usr/include/boost/asio/executor_work_guard.hpp"
   starts at address 0xf61463 <icinga::RedisConnection::FireAndForgetQueries(std::vector<std::vector<icinga::String, std::allocator<icinga::String> >, std::allocator<std::vector<icinga::String, std::allocator<icinga::String> > > >, icinga::RedisConnection::QueryPriority, icinga::RedisConnection::QueryAffects)+307>
   and ends at 0xf61490 <icinga::RedisConnection::FireAndForgetQueries(std::vector<std::vector<icinga::String, std::allocator<icinga::String> >, std::allocator<std::vector<icinga::String, std::allocator<icinga::String> > > >, icinga::RedisConnection::QueryPriority, icinga::RedisConnection::QueryAffects)+352>.

Can you please share what was logged regarding Icinga DB, i.e. something like grep -i icingadb /var/log/icinga2/icinga2.log?

Crited commented 1 year ago

Sadly, the logs are not really helpful in that regard:

[2023-08-23 11:54:27 +0200] information/IcingaDB: 'icingadb' started.
[2023-08-23 11:54:27 +0200] information/IcingaDB: Trying to connect to Redis server (async) on host '10.0.64.81:6380'
[2023-08-23 11:54:27 +0200] information/IcingaDB: Connected to Redis server
[2023-08-23 11:54:27 +0200] information/IcingaDB: Starting initial config/status dump
[2023-08-23 11:54:30 +0200] information/IcingaDB: Initial config/status dump finished in 3.07589 seconds.
[2023-08-23 11:54:37 +0200] information/IcingaDB: Pending queries: 0 (Input: 5109/s; Output: 5210/s)
[2023-08-23 11:58:59 +0200] information/IcingaDB: Flushing history data buffer to Redis.
[2023-08-23 11:58:59 +0200] information/IcingaDB: 'icingadb' stopped.
[2023-08-23 11:59:05 +0200] information/IcingaDB: 'icingadb' started.
[2023-08-23 11:59:05 +0200] information/IcingaDB: Trying to connect to Redis server (async) on host '10.0.64.81:6380'
[2023-08-23 11:59:05 +0200] information/IcingaDB: Connected to Redis server
[2023-08-23 11:59:05 +0200] information/IcingaDB: Starting initial config/status dump
[2023-08-23 11:59:08 +0200] information/IcingaDB: Initial config/status dump finished in 2.62474 seconds.
[2023-08-23 11:59:15 +0200] information/IcingaDB: Pending queries: 0 (Input: 1290/s; Output: 1298/s)
[2023-08-23 12:01:02 +0200] information/IcingaDB: Flushing history data buffer to Redis.
[2023-08-23 12:01:02 +0200] information/IcingaDB: 'icingadb' stopped.
[2023-08-23 12:01:08 +0200] information/IcingaDB: 'icingadb' started.
[2023-08-23 12:01:08 +0200] information/IcingaDB: Trying to connect to Redis server (async) on host '10.0.64.81:6380'
[2023-08-23 12:01:08 +0200] information/IcingaDB: Connected to Redis server
[2023-08-23 12:01:09 +0200] information/IcingaDB: Starting initial config/status dump
[2023-08-23 12:01:11 +0200] information/IcingaDB: Initial config/status dump finished in 2.60669 seconds.
[2023-08-23 12:01:18 +0200] information/IcingaDB: Pending queries: 0 (Input: 1283/s; Output: 1283/s)
[2023-08-23 12:06:18 +0200] information/IcingaDB: Pending queries: 0 (Input: 641/s; Output: 641/s)
[2023-08-23 12:11:18 +0200] information/IcingaDB: Pending queries: 0 (Input: 629/s; Output: 629/s)
Al2Klimov commented 1 year ago

Looks like it's a problem in icinga::RedisConnection::FireAndForgetQueries():

(gdb) info line *0x0000000000f6146c
Line 59 of "/usr/include/boost/asio/executor_work_guard.hpp"
   starts at address 0xf61463 <icinga::RedisConnection::FireAndForgetQueries(std::vector<std::vector<icinga::String, std::allocator<icinga::String> >, std::allocator<std::vector<icinga::String, std::allocator<icinga::String> > > >, icinga::RedisConnection::QueryPriority, icinga::RedisConnection::QueryAffects)+307>
   and ends at 0xf61490 <icinga::RedisConnection::FireAndForgetQueries(std::vector<std::vector<icinga::String, std::allocator<icinga::String> >, std::allocator<std::vector<icinga::String, std::allocator<icinga::String> > > >, icinga::RedisConnection::QueryPriority, icinga::RedisConnection::QueryAffects)+352>.

Are you sure?

$ git diff v2.13.8..v2.14.0 -- lib/icingadb/redisconnection.cpp
$

Neither git diff v2.13.8..v2.14.0 -- lib/icingadb shows me anything suspicious.

Anyway I'm afraid we can't do much without a core dump: https://icinga.com/docs/icinga-2/latest/doc/21-development/#core-dump

julianbrost commented 1 year ago

Are you sure?

I mean that's what GDB says for icinga2-bin-debuginfo-2.14.0 from our RHEL 8 repo.

Double-checking:

# objdump -dC --start-address 0xf61442 --stop-address 0xf61481 /usr/lib64/icinga2/sbin/icinga2

/usr/lib64/icinga2/sbin/icinga2:     file format elf64-x86-64

Disassembly of section .text:

0000000000f61442 <icinga::RedisConnection::FireAndForgetQueries(std::vector<std::vector<icinga::String, std::allocator<icinga::String> >, std::allocator<std::vector<icinga::String, std::allocator<icinga::String> > > >, icinga::RedisConnection::QueryPriority, icinga::RedisConnection::QueryAffects)@@Base+0x112>:
  f61442:   00 48 89                add    %cl,-0x77(%rax)
  f61445:   43 10 48 c7             rex.XB adc %cl,-0x39(%r8)
  f61449:   43 18 00                rex.XB sbb %al,(%r8)
  f6144c:   00 00                   add    %al,(%rax)
  f6144e:   00 0f                   add    %cl,(%rdi)
  f61450:   ae                      scas   %es:(%rdi),%al
  f61451:   f0 f0 48 83 45 00 01    lock lock addq $0x1,0x0(%rbp)
  f61458:   e8 83 37 a6 ff          callq  9c4be0 <icinga::Utility::GetTime()@@Base>
  f6145d:   f0 48 83 45 00 01       lock addq $0x1,0x0(%rbp)
  f61463:   0f b6 44 24 0c          movzbl 0xc(%rsp),%eax
  f61468:   48 8b 0c 24             mov    (%rsp),%rcx
  f6146c:   4c 8b a1 b0 01 00 00    mov    0x1b0(%rcx),%r12
  f61473:   4c 8d a9 b8 01 00 00    lea    0x1b8(%rcx),%r13
  f6147a:   88 44 24 28             mov    %al,0x28(%rsp)
  f6147e:   48                      rex.W
  f6147f:   8b                      .byte 0x8b
  f61480:   84                      .byte 0x84

Yup, matches the binary code excerpt logged by the kernel:

Aug 23 11:07:00 **** kernel: Code: 00 48 89 43 10 48 c7 43 18 00 00 00 00 0f ae f0 f0 48 83 45 00 01 e8 83 37 a6 ff f0 48 83 45 00 01 0f b6 44 24 0c 48 8b 0c 24 <4c> 8b a1 b0 01 00 00 4c 8d a9 b8 01 00 00 88 44 24 28 48 8b 84 24

So pretty sure, I'd say.