Looks more like something is trying to stop icinga2, but it isn't stopping in time, so it is killed. Can you please also share timestamps for the systemd log messages as well as the icinga2 logs (/var/log/icinga2/icinga2.log)?
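(For example, both can be pulled for the same window with something like the following; the unit name and date are just taken from this report:)
journalctl -u icinga2.service --since "2023-01-03" -o short-precise   # systemd messages with precise timestamps
grep '2023-01-03' /var/log/icinga2/icinga2.log                        # Icinga 2's own log for the same day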
@julianbrost Yes, the reload is triggered by Puppet, as I wrote above. But it times out, and the main process is unresponsive during the grace period.
Jan 03 10:24:34 safe-reload[317362]: Reloading Icinga 2: Done
Jan 03 10:25:18 systemd[1]: Reloaded Icinga host/service/network monitoring system.
Jan 03 13:24:10 puppet-agent[323515]: (/Stage[main]/Icinga::Master/Icinga2::Object::Endpoint...
Jan 03 13:24:11 systemd[1]: Reloading Icinga host/service/network monitoring system.
Jan 03 13:24:38 safe-reload[323975]: Validating config files: Done
Jan 03 13:24:38 safe-reload[323975]: Reloading Icinga 2: Done
Jan 03 13:31:30 systemd[1]: icinga2.service: State 'stop-sigterm' timed out. Killing.
Jan 03 13:31:30 systemd[1]: icinga2.service: Killing process 312434 (icinga2) with signal SIGKILL.
Jan 03 13:31:30 systemd[1]: icinga2.service: Killing process 317424 (icinga2) with signal SIGKILL.
Jan 03 13:31:30 systemd[1]: icinga2.service: Killing process 317433 (icinga2) with signal SIGKILL.
Jan 03 13:31:30 systemd[1]: icinga2.service: Killing process 324036 (icinga2) with signal SIGKILL.
Jan 03 13:31:30 systemd[1]: icinga2.service: Killing process 324045 (icinga2) with signal SIGKILL.
Jan 03 13:31:30 systemd[1]: icinga2.service: Killing process 317427 (icinga2) with signal SIGKILL.
Jan 03 13:31:30 systemd[1]: icinga2.service: Killing process 317428 (icinga2) with signal SIGKILL.
Jan 03 13:31:30 systemd[1]: icinga2.service: Killing process 317429 (icinga2) with signal SIGKILL.
Jan 03 13:31:30 systemd[1]: icinga2.service: Killing process 317430 (icinga2) with signal SIGKILL.
Jan 03 13:31:30 systemd[1]: icinga2.service: Killing process 317431 (icinga2) with signal SIGKILL.
Jan 03 13:31:30 systemd[1]: icinga2.service: Killing process 317463 (icinga2) with signal SIGKILL.
Jan 03 13:31:30 systemd[1]: icinga2.service: Killing process 317464 (icinga2) with signal SIGKILL.
Jan 03 13:31:30 systemd[1]: icinga2.service: Killing process 317465 (icinga2) with signal SIGKILL.
Jan 03 13:31:30 systemd[1]: icinga2.service: Killing process 317466 (icinga2) with signal SIGKILL.
Jan 03 13:31:30 systemd[1]: icinga2.service: Killing process 317467 (n/a) with signal SIGKILL.
Jan 03 13:31:30 systemd[1]: icinga2.service: Killing process 317469 (n/a) with signal SIGKILL.
Jan 03 13:31:30 systemd[1]: icinga2.service: Killing process 317470 (icinga2) with signal SIGKILL.
Jan 03 13:31:30 systemd[1]: icinga2.service: Killing process 317471 (icinga2) with signal SIGKILL.
Jan 03 13:31:30 systemd[1]: icinga2.service: Killing process 317475 (icinga2) with signal SIGKILL.
Jan 03 13:31:30 systemd[1]: icinga2.service: Killing process 317478 (icinga2) with signal SIGKILL.
Jan 03 13:31:30 systemd[1]: icinga2.service: Killing process 324037 (icinga2) with signal SIGKILL.
Jan 03 13:31:30 systemd[1]: icinga2.service: Killing process 324038 (icinga2) with signal SIGKILL.
Jan 03 13:31:30 systemd[1]: icinga2.service: Killing process 324039 (icinga2) with signal SIGKILL.
Jan 03 13:31:30 systemd[1]: icinga2.service: Killing process 324041 (icinga2) with signal SIGKILL.
Jan 03 13:31:30 systemd[1]: icinga2.service: Killing process 324042 (icinga2) with signal SIGKILL.
Jan 03 13:31:30 systemd[1]: icinga2.service: Killing process 324046 (icinga2) with signal SIGKILL.
Jan 03 13:31:30 systemd[1]: icinga2.service: Killing process 324059 (icinga2) with signal SIGKILL.
Jan 03 13:31:30 systemd[1]: icinga2.service: Killing process 324061 (icinga2) with signal SIGKILL.
Jan 03 13:31:30 systemd[1]: icinga2.service: Killing process 324062 (icinga2) with signal SIGKILL.
Jan 03 13:31:30 systemd[1]: icinga2.service: Killing process 324064 (icinga2) with signal SIGKILL.
Jan 03 13:31:30 systemd[1]: icinga2.service: Killing process 324066 (icinga2) with signal SIGKILL.
Jan 03 13:31:30 systemd[1]: icinga2.service: Main process exited, code=killed, status=9/KILL
Jan 03 13:31:30 systemd[1]: icinga2.service: Killing process 317433 (icinga2) with signal SIGKILL.
Jan 03 13:31:30 systemd[1]: icinga2.service: Killing process 324036 (icinga2) with signal SIGKILL.
Jan 03 13:31:30 systemd[1]: icinga2.service: Killing process 317424 (icinga2) with signal SIGKILL.
Jan 03 13:31:30 systemd[1]: icinga2.service: Killing process 317465 (icinga2) with signal SIGKILL.
Jan 03 13:31:30 systemd[1]: icinga2.service: Killing process 317466 (icinga2) with signal SIGKILL.
Jan 03 13:31:30 systemd[1]: icinga2.service: Killing process 324039 (icinga2) with signal SIGKILL.
Jan 03 13:31:30 systemd[1]: icinga2.service: Killing process 324042 (icinga2) with signal SIGKILL.
Jan 03 13:31:30 systemd[1]: icinga2.service: Killing process 324046 (icinga2) with signal SIGKILL.
Jan 03 13:31:30 systemd[1]: icinga2.service: Failed with result 'timeout'.
Jan 03 13:31:30 systemd[1]: Stopped Icinga host/service/network monitoring system.
Jan 03 13:31:30 systemd[1]: icinga2.service: Consumed 7h 56min 11.767s CPU time.
Jan 03 13:31:30 systemd[1]: Starting Icinga host/service/network monitoring system...
Jan 03 13:31:30 icinga2[324112]: [2023-01-03 13:31:30 +0000] information/cli: Icinga application loader (version: r2.13.6-1)
Jan 03 13:31:49 puppet-agent[323515]: (/Stage[main]/Icinga2::Service/Service[icinga2]) Triggered 'refresh' from 1 event
Jan 03 13:31:50 puppet-agent[323515]: Applied catalog in 518.05 seconds
Is there anything logged to /var/log/icinga2/icinga2.log in that time?
It would also be helpful to have backtraces from the processes at the time it should reload (so in the above logs, this would have been somewhere around 13:30). Please have a look at https://icinga.com/docs/icinga-2/latest/doc/21-development/#gdb-backtrace-from-running-process for how to obtain them.
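(Roughly, per that doc, something along these lines against the running main process would do; PID 312434 is taken from the systemd log above, and a debug-symbols package, if available, makes the traces much more useful:)
# attach to the running process and dump full backtraces of all threads into a file
gdb -p 312434 -batch -ex 'set pagination off' -ex 'thread apply all bt full' > icinga2-backtrace.txt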
No, not really. The reload/restart is irrelevant; it looks like a memory leak to me:
The node has 8 GB of RAM; it used to run fine at ~2.4 GB.
The instance was upgraded from 2.13.5-1.bullseye to 2.13.6-2+debian11 on 2023-01-02. However, it only started leaking memory about 24h after the upgrade... so not necessarily related to the upgrade.
Start-Date: 2023-01-02 10:21:25
Commandline: apt upgrade -y
Upgrade: icinga2-ido-pgsql:amd64 (2.13.5-1.bullseye, 2.13.6-2+debian11), icingaweb2-common:amd64 (2.11.1-1.bullseye, 2.11.3-2.bullseye), icinga-php-thirdparty:amd64 (0.11.0-1.bullseye, 0.11.0-2.bullseye), php-icinga:amd64 (2.11.1-1.bullseye, 2.11.3-2.bullseye), icinga2-bin:amd64 (2.13.5-1.bullseye, 2.13.6-2+debian11), postgresql-client-13:amd64 (13.9-0+deb11u1, 13.9-1.pgdg110+1), icinga2:amd64 (2.13.5-1.bullseye, 2.13.6-2+debian11), icingaweb2:amd64 (2.11.1-1.bullseye, 2.11.3-2.bullseye), icinga-php-library:amd64 (0.10.0-1+debian11, 0.10.1-1+debian11), icingaweb2-module-monitoring:amd64 (2.11.1-1.bullseye, 2.11.3-2.bullseye), postgresql-client-common:amd64 (244.pgdg110+1, 246.pgdg110+1), dbconfig-common:amd64 (2.0.19, 2.0.22~pgdg110+1), telegraf:amd64 (1.24.2-1, 1.25.0-1), icinga2-common:amd64 (2.13.5-1.bullseye, 2.13.6-2+debian11), libpq5:amd64 (15.0-1.pgdg110+1, 15.1-1.pgdg110+1), puppet-agent:amd64 (7.20.0-1bullseye, 7.21.0-1bullseye)
End-Date: 2023-01-02 10:22:04
@julianbrost Here are the backtraces, let me know if you need anything else:
Currently the icinga2 service is at 4.1 GB.
From the heaptrack profiler:
A flame graph of leaked memory:
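(For reference, a profile like this can be captured by attaching heaptrack to the running daemon; a rough sketch, assuming heaptrack is installed and using the main process PID:)
# attach to the running main process (PID is an example) and record allocations until interrupted
heaptrack -p 312434
# afterwards, open the data file heaptrack prints on exit, e.g. in the GUI
heaptrack_gui heaptrack.icinga2.*.gz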
This looks like your database server isn't able to keep up with the number of queries generated by Icinga. These queries are then buffered. If this happens only for a short time, they are processed afterwards and the queue shrinks again. If the database server can't keep up with the sustained rate of queries, this memory usage will grow. There should also be hints about this in the logs, please do check them.
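The queue size is logged periodically by the IDO feature, so a quick way to watch it is, for example:
# watch whether the IDO backlog keeps growing over time
grep 'Pending queries' /var/log/icinga2/icinga2.log | tail -n 20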
@julianbrost That's weird. The allocations are growing linearly and the DB backend is pretty much idle. The DB might need some tuning, but that's not the main issue.
A workaround is to downgrade the master instances to the previous version, 2.13.5:
apt install icinga2=2.13.5-1.bullseye icinga2-bin=2.13.5-1.bullseye icinga2-common=2.13.5-1.bullseye icinga2-ido-pgsql=2.13.5-1.bullseye
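(If going this route, one would presumably also want to hold the packages so the next apt upgrade / Puppet run doesn't pull 2.13.6 back in, e.g.:)
# prevent apt from upgrading the downgraded packages again
apt-mark hold icinga2 icinga2-bin icinga2-common icinga2-ido-pgsql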
With 2.13.5 the memory usage is constant, and I haven't made any changes to the DB backend:
Here's the pending-queries log. With 2.13.6 the query backlog only grows; after the downgrade it recovers (the throughput is basically constant the whole time):
[2023-01-05 06:50:11 +0000] information/IdoPgsqlConnection: Pending queries: 588928 (Input: 1714/s; Output: 1342/s)
[2023-01-05 06:50:21 +0000] information/IdoPgsqlConnection: Pending queries: 592830 (Input: 1775/s; Output: 1386/s)
[2023-01-05 06:50:31 +0000] information/IdoPgsqlConnection: Pending queries: 595250 (Input: 1767/s; Output: 1539/s)
[2023-01-05 06:50:41 +0000] information/IdoPgsqlConnection: Pending queries: 600774 (Input: 1738/s; Output: 1195/s)
[2023-01-05 06:50:51 +0000] information/IdoPgsqlConnection: Pending queries: 604802 (Input: 1683/s; Output: 1312/s)
[2023-01-05 06:51:01 +0000] information/IdoPgsqlConnection: Pending queries: 610195 (Input: 1771/s; Output: 1254/s)
[2023-01-05 06:51:11 +0000] information/IdoPgsqlConnection: Pending queries: 616477 (Input: 1787/s; Output: 1181/s)
[2023-01-05 06:51:21 +0000] information/IdoPgsqlConnection: Pending queries: 620253 (Input: 1718/s; Output: 1348/s)
[2023-01-05 06:51:31 +0000] information/IdoPgsqlConnection: Pending queries: 625293 (Input: 1758/s; Output: 1266/s)
[2023-01-05 06:51:41 +0000] information/IdoPgsqlConnection: Pending queries: 632946 (Input: 1784/s; Output: 1036/s)
[2023-01-05 06:51:51 +0000] information/IdoPgsqlConnection: Pending queries: 638382 (Input: 1739/s; Output: 1206/s)
[2023-01-05 06:52:01 +0000] information/IdoPgsqlConnection: Pending queries: 640290 (Input: 1739/s; Output: 1556/s)
[2023-01-05 06:52:11 +0000] information/IdoPgsqlConnection: Pending queries: 640998 (Input: 1640/s; Output: 1577/s)
---- icinga2 downgrade to 2.13.5
[2023-01-05 06:52:21 +0000] information/IdoPgsqlConnection: Pending queries: 641006 (Input: 1/s; Output: 1577/s)
[2023-01-05 06:52:31 +0000] information/IdoPgsqlConnection: Pending queries: 641054 (Input: 5/s; Output: 1577/s)
[2023-01-05 06:52:41 +0000] information/IdoPgsqlConnection: Pending queries: 641074 (Input: 2/s; Output: 1577/s)
[2023-01-05 06:52:51 +0000] information/IdoPgsqlConnection: Pending queries: 641094 (Input: 2/s; Output: 1577/s)
[2023-01-05 06:53:01 +0000] information/IdoPgsqlConnection: Pending queries: 641389 (Input: 607/s; Output: 577/s)
[2023-01-05 06:53:11 +0000] information/IdoPgsqlConnection: Pending queries: 687177 (Input: 6316/s; Output: 1854/s)
[2023-01-05 06:53:21 +0000] information/IdoPgsqlConnection: Pending queries: 724263 (Input: 5403/s; Output: 1832/s)
[2023-01-05 06:53:31 +0000] information/IdoPgsqlConnection: Pending queries: 723213 (Input: 1754/s; Output: 1865/s)
[2023-01-05 06:53:41 +0000] information/IdoPgsqlConnection: Pending queries: 722565 (Input: 1782/s; Output: 1846/s)
[2023-01-05 06:53:51 +0000] information/IdoPgsqlConnection: Pending queries: 721805 (Input: 1746/s; Output: 1816/s)
[2023-01-05 06:54:01 +0000] information/IdoPgsqlConnection: Pending queries: 720947 (Input: 1736/s; Output: 1820/s)
[2023-01-05 06:54:11 +0000] information/IdoPgsqlConnection: Pending queries: 719505 (Input: 1702/s; Output: 1848/s)
[2023-01-05 06:54:21 +0000] information/IdoPgsqlConnection: Pending queries: 718963 (Input: 1765/s; Output: 1817/s)
[2023-01-05 06:54:31 +0000] information/IdoPgsqlConnection: Pending queries: 718263 (Input: 1769/s; Output: 1846/s)
[2023-01-05 06:54:41 +0000] information/IdoPgsqlConnection: Pending queries: 717381 (Input: 1734/s; Output: 1814/s)
[2023-01-05 06:54:51 +0000] information/IdoPgsqlConnection: Pending queries: 715831 (Input: 1681/s; Output: 1837/s)
[2023-01-05 06:55:01 +0000] information/IdoPgsqlConnection: Pending queries: 714835 (Input: 1781/s; Output: 1880/s)
[2023-01-05 06:55:11 +0000] information/IdoPgsqlConnection: Pending queries: 714408 (Input: 1790/s; Output: 1830/s)
[2023-01-05 06:55:21 +0000] information/IdoPgsqlConnection: Pending queries: 713572 (Input: 1716/s; Output: 1793/s)
[2023-01-05 06:55:31 +0000] information/IdoPgsqlConnection: Pending queries: 712734 (Input: 1752/s; Output: 1838/s)
[2023-01-05 06:55:41 +0000] information/IdoPgsqlConnection: Pending queries: 712144 (Input: 1780/s; Output: 1834/s)
[2023-01-05 06:55:51 +0000] information/IdoPgsqlConnection: Pending queries: 711250 (Input: 1753/s; Output: 1833/s)
[2023-01-05 06:56:01 +0000] information/IdoPgsqlConnection: Pending queries: 710282 (Input: 1730/s; Output: 1823/s)
[2023-01-05 06:56:11 +0000] information/IdoPgsqlConnection: Pending queries: 709484 (Input: 1709/s; Output: 1791/s)
[2023-01-05 06:56:21 +0000] information/IdoPgsqlConnection: Pending queries: 709502 (Input: 1769/s; Output: 1766/s)
[2023-01-05 06:56:31 +0000] information/IdoPgsqlConnection: Pending queries: 710262 (Input: 1763/s; Output: 1701/s)
[2023-01-05 06:56:41 +0000] information/IdoPgsqlConnection: Pending queries: 709901 (Input: 1731/s; Output: 1769/s)
[2023-01-05 06:56:51 +0000] information/IdoPgsqlConnection: Pending queries: 708856 (Input: 1682/s; Output: 1793/s)
[2023-01-05 06:57:01 +0000] information/IdoPgsqlConnection: Pending queries: 708387 (Input: 1775/s; Output: 1819/s)
[2023-01-05 06:57:11 +0000] information/IdoPgsqlConnection: Pending queries: 708012 (Input: 1785/s; Output: 1822/s)
[2023-01-05 06:57:21 +0000] information/IdoPgsqlConnection: Pending queries: 707121 (Input: 1717/s; Output: 1801/s)
[2023-01-05 06:57:31 +0000] information/IdoPgsqlConnection: Pending queries: 706326 (Input: 1754/s; Output: 1835/s)
[2023-01-05 06:57:41 +0000] information/IdoPgsqlConnection: Pending queries: 705924 (Input: 1782/s; Output: 1820/s)
[2023-01-05 06:57:51 +0000] information/IdoPgsqlConnection: Pending queries: 705482 (Input: 1737/s; Output: 1777/s)
[2023-01-05 06:58:01 +0000] information/IdoPgsqlConnection: Pending queries: 704864 (Input: 1734/s; Output: 1796/s)
[2023-01-05 06:58:11 +0000] information/IdoPgsqlConnection: Pending queries: 704010 (Input: 1707/s; Output: 1795/s)
[2023-01-05 06:58:21 +0000] information/IdoPgsqlConnection: Pending queries: 703222 (Input: 1762/s; Output: 1835/s)
[2023-01-05 06:58:31 +0000] information/IdoPgsqlConnection: Pending queries: 702655 (Input: 1766/s; Output: 1835/s)
[2023-01-05 06:58:41 +0000] information/IdoPgsqlConnection: Pending queries: 701523 (Input: 1746/s; Output: 1853/s)
[2023-01-05 06:58:51 +0000] information/IdoPgsqlConnection: Pending queries: 699702 (Input: 1691/s; Output: 1877/s)
...
[2023-01-05 07:55:01 +0000] information/IdoPgsqlConnection: Pending queries: 256663 (Input: 1769/s; Output: 1786/s)
[2023-01-05 07:55:11 +0000] information/IdoPgsqlConnection: Pending queries: 256487 (Input: 1782/s; Output: 1800/s)
[2023-01-05 07:55:21 +0000] information/IdoPgsqlConnection: Pending queries: 255353 (Input: 1714/s; Output: 1829/s)
[2023-01-05 07:55:31 +0000] information/IdoPgsqlConnection: Pending queries: 253883 (Input: 1754/s; Output: 1895/s)
[2023-01-05 07:55:41 +0000] information/IdoPgsqlConnection: Pending queries: 252695 (Input: 1781/s; Output: 1892/s)
[2023-01-05 07:55:51 +0000] information/IdoPgsqlConnection: Pending queries: 251337 (Input: 1749/s; Output: 1876/s)
[2023-01-05 07:56:01 +0000] information/IdoPgsqlConnection: Pending queries: 249877 (Input: 1732/s; Output: 1873/s)
I did some optimization on the DB backend side to improve query throughput, and it did help. However, the options are very limited since IdoPgsql uses only a single connection. It appears that the latest version, 2.13.6, allocates more memory.
Sometimes IdoPgsql gets stuck and is unable to recover without a service restart.
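(A quick way to confirm the single IDO connection and whether it is idle or stuck on a query is pg_stat_activity on the backend; a sketch, assuming the database and user are both named icinga2:)
# show the state of the connection(s) the IDO feature holds
psql -U icinga2 -d icinga2 -c "
  SELECT pid, state, wait_event_type, now() - query_start AS running_for, left(query, 60) AS query
    FROM pg_stat_activity
   WHERE datname = 'icinga2';"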
I've lowered the frequency of some checks and managed to keep memory usage within reasonable limits. But the IDO design isn't optimal; I'll try switching to IcingaDB.
Describe the bug
Main process stops responding, UI claims that the monitoring backend is down.
icinga2 has status:
AFAIK default icinga2.service config:
To Reproduce
Not sure. The crash is probably triggered by configuration changes made by Puppet, though the config reload should be safe (the configuration is valid).
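(The reload Puppet triggers is effectively validate-then-reload, which can also be run by hand to rule out a config problem, e.g.:)
# validate the configuration, then trigger the same reload that Puppet's service refresh performs
icinga2 daemon -C
systemctl reload icinga2.service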
/etc/systemd/system/icinga2.service.d/limits.conf:
Currently the master process requires about 3.4 GB of RAM; the secondary master is at 2.5 GB.
Your Environment
Include as many relevant details about the environment you experienced the problem in.
Version (icinga2 --version): 2.13.6-1
Enabled features (icinga2 feature list): api apiuser checker ido-pgsql influxdb mainlog notification
Icinga Web 2 version: 2.11.3
Config validation (icinga2 daemon -C):