Icinga / icingadb

Icinga configuration and state database supporting multiple environments
https://icinga.com
GNU General Public License v2.0
60 stars 21 forks source link

IcingaDB fails with: 127.0.0.1:6380: i/o timeout #164

Closed mwaldmueller closed 3 years ago

mwaldmueller commented 4 years ago

I played around with huge environments (e.g. 20.000 Hosts/200.000 Services) and IcingaDB fails with:

Apr 23 10:26:45 idb3.localdomain icingadb[15916]: time="2020-04-23T10:26:45+02:00" level=info msg="Added 2 state history entries in the last 20 seconds"
Apr 23 10:26:56 idb3.localdomain icingadb[15916]: time="2020-04-23T10:26:56+02:00" level=info msg="Icinga 2 sent no heartbeat for 15 seconds. Pausing sync" UUID=038b40c4-2860-41e3-8c6d-05c186d37b07 context=HA environment=da39a3ee5e6b4b0d3255bfef95601890afd80709
Apr 23 10:26:58 idb3.localdomain icingadb[15916]: time="2020-04-23T10:26:58+02:00" level=info msg="Icinga 2 sent heartbeat. Starting sync" UUID=038b40c4-2860-41e3-8c6d-05c186d37b07 context=HA environment=da39a3ee5e6b4b0d3255bfef95601890afd80709
Apr 23 10:27:05 idb3.localdomain icingadb[15916]: time="2020-04-23T10:27:05+02:00" level=info msg="Added 3 state history entries in the last 20 seconds"
Apr 23 10:27:13 idb3.localdomain icingadb[15916]: time="2020-04-23T10:27:13+02:00" level=info msg="Icinga 2 sent no heartbeat for 15 seconds. Pausing sync" UUID=038b40c4-2860-41e3-8c6d-05c186d37b07 context=HA environment=da39a3ee5e6b4b0d3255bfef95601890afd80709
Apr 23 10:27:25 idb3.localdomain icingadb[15916]: time="2020-04-23T10:27:25+02:00" level=info msg="Added 2 state history entries in the last 20 seconds"
Apr 23 10:27:29 idb3.localdomain icingadb[15916]: time="2020-04-23T10:27:29+02:00" level=info msg="Icinga 2 sent heartbeat. Starting sync" UUID=038b40c4-2860-41e3-8c6d-05c186d37b07 context=HA environment=da39a3ee5e6b4b0d3255bfef95601890afd80709
Apr 23 10:27:45 idb3.localdomain icingadb[15916]: time="2020-04-23T10:27:45+02:00" level=info msg="Added 2 state history entries in the last 20 seconds"
Apr 23 10:27:45 idb3.localdomain icingadb[15916]: time="2020-04-23T10:27:45+02:00" level=info msg="Icinga 2 sent no heartbeat for 15 seconds. Pausing sync" UUID=038b40c4-2860-41e3-8c6d-05c186d37b07 context=HA environment=da39a3ee5e6b4b0d3255bfef95601890afd80709
Apr 23 10:28:01 idb3.localdomain icingadb[15916]: time="2020-04-23T10:28:01+02:00" level=fatal msg="read tcp 127.0.0.1:47914->127.0.0.1:6380: i/o timeout"
Apr 23 10:28:01 idb3.localdomain systemd[1]: icingadb.service: Main process exited, code=exited, status=1/FAILURE
Apr 23 10:28:01 idb3.localdomain systemd[1]: icingadb.service: Failed with result 'exit-code'.

I don't think it's related to system resources. A restart of icingadb.service solves the problem.

Versions: icinga2: v2.12.0-rc1 icingadb: v1.0.0-rc1 icingaweb2: v2.8.0-rc1 icingadb-web: v1.0.0-rc1

mwaldmueller commented 4 years ago

"debug"-output:


Apr 23 11:27:01 idb3.localdomain icingadb[17817]: time="2020-04-23T11:27:01+02:00" level=info msg="Icinga 2 sent no heartbeat for 15 seconds. Pausing sync" UUID=92352a36-985f-4166-b266-d3b83dd5ab9d context=HA environment=da39a3ee5e6b4b0d3255bfef95601890afd80709
Apr 23 11:27:01 idb3.localdomain icingadb[17817]: time="2020-04-23T11:27:01+02:00" level=debug msg="servicegroup_customvar: Lost responsibility"
Apr 23 11:27:01 idb3.localdomain icingadb[17817]: time="2020-04-23T11:27:01+02:00" level=debug msg="checkcommand: Lost responsibility"
Apr 23 11:27:01 idb3.localdomain icingadb[17817]: time="2020-04-23T11:27:01+02:00" level=debug msg="checkcommand_customvar: Lost responsibility"
Apr 23 11:27:01 idb3.localdomain icingadb[17817]: time="2020-04-23T11:27:01+02:00" level=debug msg="checkcommand_argument: Lost responsibility"
Apr 23 11:27:01 idb3.localdomain icingadb[17817]: time="2020-04-23T11:27:01+02:00" level=debug msg="checkcommand_envvar: Lost responsibility"
Apr 23 11:27:01 idb3.localdomain icingadb[17817]: time="2020-04-23T11:27:01+02:00" level=debug msg="eventcommand: Lost responsibility"
Apr 23 11:27:01 idb3.localdomain icingadb[17817]: time="2020-04-23T11:27:01+02:00" level=debug msg="eventcommand_customvar: Lost responsibility"
Apr 23 11:27:01 idb3.localdomain icingadb[17817]: time="2020-04-23T11:27:01+02:00" level=debug msg="eventcommand_argument: Lost responsibility"
Apr 23 11:27:01 idb3.localdomain icingadb[17817]: time="2020-04-23T11:27:01+02:00" level=debug msg="eventcommand_envvar: Lost responsibility"
Apr 23 11:27:01 idb3.localdomain icingadb[17817]: time="2020-04-23T11:27:01+02:00" level=debug msg="notificationcommand: Lost responsibility"
Apr 23 11:27:01 idb3.localdomain icingadb[17817]: time="2020-04-23T11:27:01+02:00" level=debug msg="notificationcommand_customvar: Lost responsibility"
Apr 23 11:27:01 idb3.localdomain icingadb[17817]: time="2020-04-23T11:27:01+02:00" level=debug msg="notificationcommand_argument: Lost responsibility"
Apr 23 11:27:01 idb3.localdomain icingadb[17817]: time="2020-04-23T11:27:01+02:00" level=debug msg="notificationcommand_envvar: Lost responsibility"
Apr 23 11:27:01 idb3.localdomain icingadb[17817]: time="2020-04-23T11:27:01+02:00" level=debug msg="hostgroup: Lost responsibility"
Apr 23 11:27:01 idb3.localdomain icingadb[17817]: time="2020-04-23T11:27:01+02:00" level=debug msg="hostgroup_customvar: Lost responsibility"
Apr 23 11:27:01 idb3.localdomain icingadb[17817]: time="2020-04-23T11:27:01+02:00" level=debug msg="zone: Lost responsibility"
Apr 23 11:27:01 idb3.localdomain icingadb[17817]: time="2020-04-23T11:27:01+02:00" level=debug msg="timeperiod: Lost responsibility"
Apr 23 11:27:01 idb3.localdomain icingadb[17817]: time="2020-04-23T11:27:01+02:00" level=debug msg="timeperiod_customvar: Lost responsibility"
Apr 23 11:27:01 idb3.localdomain icingadb[17817]: time="2020-04-23T11:27:01+02:00" level=debug msg="timeperiod_override_include: Lost responsibility"
Apr 23 11:27:01 idb3.localdomain icingadb[17817]: time="2020-04-23T11:27:01+02:00" level=debug msg="timeperiod_override_exclude: Lost responsibility"
Apr 23 11:27:01 idb3.localdomain icingadb[17817]: time="2020-04-23T11:27:01+02:00" level=debug msg="timeperiod_range: Lost responsibility"
Apr 23 11:27:01 idb3.localdomain icingadb[17817]: time="2020-04-23T11:27:01+02:00" level=debug msg="comment: Lost responsibility"
Apr 23 11:27:01 idb3.localdomain icingadb[17817]: time="2020-04-23T11:27:01+02:00" level=debug msg="customvar: Lost responsibility"
Apr 23 11:27:01 idb3.localdomain icingadb[17817]: time="2020-04-23T11:27:01+02:00" level=debug msg="action_url: Lost responsibility"
Apr 23 11:27:01 idb3.localdomain icingadb[17817]: time="2020-04-23T11:27:01+02:00" level=debug msg="customvar_flat: Lost responsibility"
Apr 23 11:27:01 idb3.localdomain icingadb[17817]: time="2020-04-23T11:27:01+02:00" level=debug msg="notes_url: Lost responsibility"
Apr 23 11:27:01 idb3.localdomain icingadb[17817]: time="2020-04-23T11:27:01+02:00" level=debug msg="icon_image: Lost responsibility"
Apr 23 11:27:01 idb3.localdomain icingadb[17817]: time="2020-04-23T11:27:01+02:00" level=debug msg="usergroup: Lost responsibility"
Apr 23 11:27:01 idb3.localdomain icingadb[17817]: time="2020-04-23T11:27:01+02:00" level=debug msg="usergroup_customvar: Lost responsibility"
Apr 23 11:27:01 idb3.localdomain icingadb[17817]: time="2020-04-23T11:27:01+02:00" level=debug msg="notification: Lost responsibility"
Apr 23 11:27:01 idb3.localdomain icingadb[17817]: time="2020-04-23T11:27:01+02:00" level=debug msg="notification_customvar: Lost responsibility"
Apr 23 11:27:01 idb3.localdomain icingadb[17817]: time="2020-04-23T11:27:01+02:00" level=debug msg="notification_user: Lost responsibility"
Apr 23 11:27:01 idb3.localdomain icingadb[17817]: time="2020-04-23T11:27:01+02:00" level=debug msg="notification_usergroup: Lost responsibility"
Apr 23 11:27:01 idb3.localdomain icingadb[17817]: time="2020-04-23T11:27:01+02:00" level=debug msg="notification_recipient: Lost responsibility"
Apr 23 11:27:01 idb3.localdomain icingadb[17817]: time="2020-04-23T11:27:01+02:00" level=debug msg="endpoint: Lost responsibility"
Apr 23 11:27:01 idb3.localdomain icingadb[17817]: time="2020-04-23T11:27:01+02:00" level=debug msg="host_state: Lost responsibility"
Apr 23 11:27:01 idb3.localdomain icingadb[17817]: time="2020-04-23T11:27:01+02:00" level=debug msg="host: Lost responsibility"
Apr 23 11:27:01 idb3.localdomain icingadb[17817]: time="2020-04-23T11:27:01+02:00" level=debug msg="host_customvar: Lost responsibility"
Apr 23 11:27:01 idb3.localdomain icingadb[17817]: time="2020-04-23T11:27:01+02:00" level=debug msg="hostgroup_member: Lost responsibility"
Apr 23 11:27:01 idb3.localdomain icingadb[17817]: time="2020-04-23T11:27:01+02:00" level=debug msg="user_customvar: Lost responsibility"
Apr 23 11:27:01 idb3.localdomain icingadb[17817]: time="2020-04-23T11:27:01+02:00" level=debug msg="user: Lost responsibility"
Apr 23 11:27:01 idb3.localdomain icingadb[17817]: time="2020-04-23T11:27:01+02:00" level=debug msg="usergroup_member: Lost responsibility"
Apr 23 11:27:01 idb3.localdomain icingadb[17817]: time="2020-04-23T11:27:01+02:00" level=debug msg="downtime: Lost responsibility"
Apr 23 11:27:01 idb3.localdomain icingadb[17817]: time="2020-04-23T11:27:01+02:00" level=debug msg="service_customvar: Lost responsibility"
Apr 23 11:27:01 idb3.localdomain icingadb[17817]: time="2020-04-23T11:27:01+02:00" level=debug msg="servicegroup_member: Lost responsibility"
Apr 23 11:27:01 idb3.localdomain icingadb[17817]: time="2020-04-23T11:27:01+02:00" level=debug msg="servicegroup: Lost responsibility"
Apr 23 11:27:42 idb3.localdomain icingadb[17817]: time="2020-04-23T11:27:42+02:00" level=debug msg="service_state: Insert preparation worker started"
Apr 23 11:27:42 idb3.localdomain icingadb[17817]: time="2020-04-23T11:27:42+02:00" level=debug msg="service_state: Update preparation worker started"
Apr 23 11:27:42 idb3.localdomain icingadb[17817]: time="2020-04-23T11:27:42+02:00" level=debug msg="service_state: Delete execution worker started"
Apr 23 11:27:42 idb3.localdomain icingadb[17817]: time="2020-04-23T11:27:42+02:00" level=debug msg="service_state: Lost responsibility"
Apr 23 11:27:42 idb3.localdomain icingadb[17817]: time="2020-04-23T11:27:42+02:00" level=debug msg="service_state: Insert execution worker started"
Apr 23 11:27:42 idb3.localdomain icingadb[17817]: time="2020-04-23T11:27:42+02:00" level=debug msg="service_state: Insert preparation worker stopped"
Apr 23 11:27:42 idb3.localdomain icingadb[17817]: time="2020-04-23T11:27:42+02:00" level=debug msg="service_state: Update comparison worker started"
Apr 23 11:27:42 idb3.localdomain icingadb[17817]: time="2020-04-23T11:27:42+02:00" level=debug msg="service_state: Update execution worker started"
Apr 23 11:27:49 idb3.localdomain icingadb[17817]: time="2020-04-23T11:27:49+02:00" level=fatal msg="read tcp 127.0.0.1:51912->127.0.0.1:6380: i/o timeout"
Apr 23 11:27:49 idb3.localdomain systemd[1]: icingadb.service: Main process exited, code=exited, status=1/FAILURE
Apr 23 11:27:49 idb3.localdomain systemd[1]: icingadb.service: Failed with result 'exit-code'.```
Al2Klimov commented 4 years ago

Apropos system resources: How much %CPU does our Redis consume?

mwaldmueller commented 4 years ago

Thanks, can you please provide packages that also include #168?

I tested with those packages:

ii  icingadb                             1.0.0+rc1.21.ge7ef0ce.20200430.1045+buster-0 amd64        database connector for Icinga 2
ii  icingadb-redis                       1.0.0+rc1.21.ge7ef0ce.20200430.1045+buster-0 amd64        Redis service for Icinga DB

Resources from my prior tests:

# ps -C icinga2,mysqld,icingadb,icinga-redis-se -o %cpu,%mem,cmd
%CPU %MEM CMD
40.3  8.2 /usr/bin/icinga-redis-server *:6380
15.6  8.7 /usr/sbin/mysqld
 6.0  9.4 /usr/sbin/icingadb --config /etc/icingadb/icingadb.ini
 0.0  0.1 /usr/lib/x86_64-linux-gnu/icinga2/sbin/icinga2 --no-stack-rlimit daemon --close-stdio -e /var/log/icinga2/error.log
96.4 42.8 /usr/lib/x86_64-linux-gnu/icinga2/sbin/icinga2 --no-stack-rlimit daemon --close-stdio -e /var/log/icinga2/error.log
60.2 13.1 /usr/lib/x86_64-linux-gnu/icinga2/sbin/icinga2 --no-stack-rlimit daemon --close-stdio -e /var/log/icinga2/error.log
mwaldmueller commented 4 years ago

I still got the error:

Apr 30 15:31:04 idb3.localdomain icingadb[12940]: time="2020-04-30T15:31:04+02:00" level=fatal msg="read tcp 127.0.0.1:48348->127.0.0.1:6380: i/o timeout"
Apr 30 15:31:04 idb3.localdomain systemd[1]: icingadb.service: Main process exited, code=exited, status=1/FAILURE
Apr 30 15:31:04 idb3.localdomain systemd[1]: icingadb.service: Failed with result 'exit-code'

Versions: icinga2: v2.12.0-rc1 icingadb: v1.0.0+rc1.21.ge7ef0ce.20200430.1045+buster-0 icingadb-redis: v1.0.0+rc1.21.ge7ef0ce.20200430.1045+buster-0 icingaweb2: v2.8.0-rc1 icingadb-web: v1.0.0-rc1

Al2Klimov commented 4 years ago

168 Is not ready, yet.

Al2Klimov commented 4 years ago

@N-o-X Why does IcingaDB's Redis error handling differ from the MySQL one?

N-o-X commented 4 years ago

@Al2Klimov I think some things were added to the MySQL part after the first implementation and I just didn't think of also adding those things to Redis. Would be nice to have that 👍

Al2Klimov commented 4 years ago

Just discovered that our Redis lib provides configurable auto-retrying which covers net I/O errors by itself. Shall we use that instead?

N-o-X commented 4 years ago

If that's easier to use, then to build such a thing from ground up, sure why not 👍

N-o-X commented 3 years ago

Closing this issue, because our RC2 changes completely reworked our Redis client implementation.