Icinga / icinga2

The core of our monitoring platform with a powerful configuration language and REST API.
https://icinga.com/docs/icinga2/latest
GNU General Public License v2.0

Satellite sync failure #9752

Open ymartin-ovh opened 1 year ago

ymartin-ovh commented 1 year ago

Hello

When starting a new icinga2 instance inside a zone, icinga2 fails to start because config validation fails: the groups referenced by hosts are missing.

Looking at the satellite filesystem (/var/lib/icinga2/api/packages/_api/bdd5cdff-6e46-4795-a2cf-64ef56d3b397/conf.d):

I expect Icinga satellites to load packages and hostgroups before hosts.

To fix this, I do:

Once the hostgroups config is synced:

I didn't experience this before icinga2 2.13.7-1+debian11

Regards

Al2Klimov commented 1 year ago

Hello Martin!

How did you create those hosts and groups in the first place?

Best, A/K

ymartin-ovh commented 1 year ago

Hello

Hosts and groups were created through API calls (masters).

Al2Klimov commented 1 year ago

Please share their config.

ymartin-ovh commented 1 year ago

On a satellite, I have something like this.

icinga2 tried to create the host before the hostgroup and refused to activate the host object because the hostgroup was missing. As a workaround, I run chattr +i on the hosts folder to ensure that icinga2 syncs packages and hostgroups first.

/var/lib/icinga2/api/packages/_api/164466ca-64c1-4e63-b4e0-eaaa65d9c493/conf.d/hosts/foobar.conf:

object Host "foobar" {
        import "webservers-host"

        address = "10.19.65.220"
        groups = [ "www-hosts" ]
        vars["delivery_status"] = "delivered"
        version = 1673339012.627591
        zone = "labeu"
}

/var/lib/icinga2/api/packages/_api/164466ca-64c1-4e63-b4e0-eaaa65d9c493/conf.d/hostgroups/www-hosts.conf:

object HostGroup "www-hosts" {
        version = 1664280940.247957
        zone = "global-templates"
}

Al2Klimov commented 1 year ago

Why did you put them in different zones?

ymartin-ovh commented 1 year ago

I want a single group that contains all www-hosts across all regions.

I have this on icinga2 config too (masters & satellites) /etc/icinga2/zones.conf:

object Zone "global-templates" {
  global = true
}

Hosts and groups are synced to the satellite in the labeu region.

Al2Klimov commented 1 year ago

Have you tried https://icinga.com/docs/icinga-2/latest/doc/17-language-reference/#group-assign instead?

ymartin-ovh commented 1 year ago

I will try this.

However, I think that host activation should depend on hostgroup activation. I will create host.vars.groups and add an assign relationship on all groups. I have the impression that I am re-implementing what Icinga2 already does with the host.groups list.

Regards

ymartin-ovh commented 1 year ago

Hello @Al2Klimov

I can't find a way to create a group object with an assign rule through the API. Do you know how to do this?

Regards

Al2Klimov commented 1 year ago

I'm afraid that's impossible via API.

ymartin-ovh commented 1 year ago

Hum,

I have my satellite fresh start issue with 2.14.0 too.

On the first run, I can see the following error in logs:

icinga2[3965374]: icinga2: /usr/include/boost/smart_ptr/intrusive_ptr.hpp:199: T* boost::intrusive_ptr<T>::operator->() const [with T = icinga::Host]: Assertion `px != 0' failed.
icinga2[3965374]: Caught SIGABRT
Build information:
  Compiler: GNU 10.2.1
  Build host: runner-hh8q3bz2-project-575-concurrent-0
  OpenSSL version: OpenSSL 1.1.1n  15 Mar 2022

Application information:

General paths:
  Config directory: /etc/icinga2
  Data directory: /var/lib/icinga2
  Log directory: /var/log/icinga2
  Cache directory: /var/cache/icinga2
  Spool directory: /var/spool/icinga2
  Run directory: /run/icinga2

Old paths (deprecated):
  Installation root: /usr
  Sysconf directory: /etc
  Run directory (base): /run
  Local state directory: /var

Internal paths:
  Package data directory: /usr/share/icinga2
  State path: /var/lib/icinga2/icinga2.state
  Modified attributes path: /var/lib/icinga2/modified-attributes.conf
  Objects path: /var/cache/icinga2/icinga2.debug
  Vars path: /var/cache/icinga2/icinga2.vars
  PID path: /run/icinga2/icinga2.pid

Stacktrace:
 0# icinga::Application::SigAbrtHandler(int) in /usr/lib/x86_64-linux-gnu/icinga2/sbin/icinga2
 1# 0x00007F6570AED140 in /lib/x86_64-linux-gnu/libpthread.so.0
 2# gsignal in /lib/x86_64-linux-gnu/libc.so.6
 3# abort in /lib/x86_64-linux-gnu/libc.so.6
 4# 0x00007F65705FD40F in /lib/x86_64-linux-gnu/libc.so.6
 5# 0x00007F657060C662 in /lib/x86_64-linux-gnu/libc.so.6
 6# 0x000055DDCFA96793 in /usr/lib/x86_64-linux-gnu/icinga2/sbin/icinga2
 7# icinga::Comment::OnAllConfigLoaded() in /usr/lib/x86_64-linux-gnu/icinga2/sbin/icinga2
 8# 0x000055DDCF7E1645 in /usr/lib/x86_64-linux-gnu/icinga2/sbin/icinga2
 9# icinga::WorkQueue::RunTaskFunction(std::function<void ()> const&) in /usr/lib/x86_64-linux-gnu/icinga2/sbin/icinga2
10# 0x000055DDCF7F462C in /usr/lib/x86_64-linux-gnu/icinga2/sbin/icinga2
11# icinga::WorkQueue::RunTaskFunction(std::function<void ()> const&) in /usr/lib/x86_64-linux-gnu/icinga2/sbin/icinga2
12# icinga::WorkQueue::WorkerThreadProc() in /usr/lib/x86_64-linux-gnu/icinga2/sbin/icinga2
13# 0x00007F657110B787 in /lib/x86_64-linux-gnu/libboost_thread.so.1.74.0
14# 0x00007F6570AE1EA7 in /lib/x86_64-linux-gnu/libpthread.so.0
15# clone in /lib/x86_64-linux-gnu/libc.so.6

Al2Klimov commented 1 year ago

@julianbrost Please say you can decode those addresses via your recent gdb magic. 😭

ymartin-ovh commented 1 year ago

I'm bisecting: the crash with this stack trace has been present since 2.13.6.

ymartin-ovh commented 1 year ago

crash report with 2.13.6: report.1693489668.561474-2.13.6.txt

ymartin-ovh commented 1 year ago

crash report with 2.14.0: report.1693490091.825202-2.14.0.txt

ymartin-ovh commented 1 year ago

For now, to not trigger the bug, for the first start of a satellite:

julianbrost commented 1 year ago

> @julianbrost Please say you can decode those addresses via your recent gdb magic. 😭

No magic involved there, just install the package with the debug symbols for that exact version.

ymartin-ovh commented 1 year ago

Hello

Checking with debug symbols, the abort I see is related to the config validation failure caused by missing groups (the topic of my bug report):

Stacktrace:
 0# icinga::Application::SigAbrtHandler(int) in /usr/lib/x86_64-linux-gnu/icinga2/sbin/icinga2
 1# 0x00007F8ADAB7B140 in /lib/x86_64-linux-gnu/libpthread.so.0
 2# gsignal in /lib/x86_64-linux-gnu/libc.so.6
 3# abort in /lib/x86_64-linux-gnu/libc.so.6
 4# 0x00007F8ADA68B40F in /lib/x86_64-linux-gnu/libc.so.6
 5# 0x00007F8ADA69A662 in /lib/x86_64-linux-gnu/libc.so.6
 6# 0x0000561357608793 in /usr/lib/x86_64-linux-gnu/icinga2/sbin/icinga2
 7# icinga::Comment::OnAllConfigLoaded() in /usr/lib/x86_64-linux-gnu/icinga2/sbin/icinga2
 8# 0x0000561357353645 in /usr/lib/x86_64-linux-gnu/icinga2/sbin/icinga2
 9# icinga::WorkQueue::RunTaskFunction(std::function<void ()> const&) in /usr/lib/x86_64-linux-gnu/icinga2/sbin/icinga2
10# 0x000056135736662C in /usr/lib/x86_64-linux-gnu/icinga2/sbin/icinga2
11# icinga::WorkQueue::RunTaskFunction(std::function<void ()> const&) in /usr/lib/x86_64-linux-gnu/icinga2/sbin/icinga2
12# icinga::WorkQueue::WorkerThreadProc() in /usr/lib/x86_64-linux-gnu/icinga2/sbin/icinga2
13# 0x00007F8ADB199787 in /lib/x86_64-linux-gnu/libboost_thread.so.1.74.0
14# 0x00007F8ADAB6FEA7 in /lib/x86_64-linux-gnu/libpthread.so.0
15# clone in /lib/x86_64-linux-gnu/libc.so.6

When the satellite starts with an empty configuration directory, the daemon fails to start up with a valid full configuration; in /var/lib/icinga2/api/packages/_api//conf.d, hostgroups is missing.

The Icinga2 master sends config objects in this order:

I'm trying to understand why the master does not send the hostgroups configuration first.

ymartin-ovh commented 1 year ago

#8  0x0000561357608793 in boost::intrusive_ptr<icinga::Host>::operator->() const [clone .part.0] [clone .lto_priv.0] (this=<optimized out>) at /usr/include/boost/smart_ptr/intrusive_ptr.hpp:199
        __PRETTY_FUNCTION__ = {<optimized out> <repeats 71 times>}
#9  0x0000561357521958 in boost::intrusive_ptr<icinga::Host>::operator-> (this=<synthetic pointer>) at ../lib/icinga/./lib/icinga/comment.cpp:75
        __PRETTY_FUNCTION__ = {<optimized out> <repeats 71 times>}
#10 icinga::Comment::OnAllConfigLoaded (this=0x7f8ac1c35000) at ../lib/icinga/./lib/icinga/comment.cpp:71
        host = {px = <optimized out>}

=> m_Checkable = host->GetServiceByShortName(GetServiceName()); Do we need to check whether host is null?

ymartin-ovh commented 1 year ago

In void Comment::OnAllConfigLoaded(), doing this:

-        if (GetServiceName().IsEmpty())
+        if (GetServiceName().IsEmpty() || ! host)

fixes my issue.
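The guard proposed above can be illustrated in isolation. This is only a minimal sketch, not Icinga's code: `Host`, `Service`, `HostRegistry` and `CommentTargetExists` are hypothetical stand-ins, and `std::shared_ptr` replaces `boost::intrusive_ptr` so the example builds with the standard library alone. It shows the one point raised in the thread: test the host pointer before calling `GetServiceByShortName` on it.

```cpp
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-ins for icinga::Service and icinga::Host.
struct Service {
    std::string shortName;
};

struct Host {
    std::map<std::string, std::shared_ptr<Service>> services;

    // Returns the service with the given short name, or null if absent.
    std::shared_ptr<Service> GetServiceByShortName(const std::string& name) const
    {
        auto it = services.find(name);
        if (it == services.end())
            return nullptr;
        return it->second;
    }
};

// Hypothetical registry: host name -> host object.
using HostRegistry = std::map<std::string, std::shared_ptr<Host>>;

// Returns true if the object a comment refers to (a host, or a service
// on that host) exists. A missing host is reported instead of dereferenced.
bool CommentTargetExists(const HostRegistry& hosts,
                         const std::string& hostName,
                         const std::string& serviceName)
{
    std::shared_ptr<Host> host;
    auto it = hosts.find(hostName);
    if (it != hosts.end())
        host = it->second;

    // Guard first: calling a member on a null host is what trips
    // the `px != 0` assertion in the stack trace.
    if (!host)
        return false;

    // An empty service name means the comment targets the host itself.
    if (serviceName.empty())
        return true;

    return host->GetServiceByShortName(serviceName) != nullptr;
}
```

With a registry that contains only "foobar", a comment on an unknown host is rejected cleanly rather than aborting the process; how the caller reports that condition is left open here.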

Al2Klimov commented 1 year ago

Hah! This should be fixed by:

(I said its absence would cause problems.)

You've already tested a custom patch to Icinga. Please could you also test that PR's commit cherry-picked on top of the support/2.14 or support/2.13 branch, whichever will build? (I guess only support/2.13 due to #9577.) If you need a 2.14 (or can't reproduce with 2.13 anymore) I guess(!) you could also revert #9577 and then cherry-pick.

ymartin-ovh commented 1 year ago

I cherry-picked #7786 onto v2.14 and needed to adapt one thing:

--- a/lib/remote/apilistener-configsync.cpp
+++ b/lib/remote/apilistener-configsync.cpp
@@ -459,8 +459,7 @@ void ApiListener::SendRuntimeConfigObjects(const JsonRpcConnection::Ptr& aclient
                        bool unresolved_dep = false;

                        /* skip this type (for now) if there are unresolved load dependencies */
-                       for (const String& loadDep : type->GetLoadDependencies()) {
-                               Type::Ptr pLoadDep = Type::GetByName(loadDep);
+                       for (auto pLoadDep : type->GetLoadDependencies()) {

The patch seems to help avoid the null reference use (for comments), but Icinga2 still tries to load hosts before groups. Daemon dies with exit code 139. I can't reach a state where the satellite receives all configuration (hosts and hostgroups).

Satellite API package content (hostgroups missing):

ls -l /var/lib/icinga2/api/packages/_api/6c96ea8d-b3a2-4666-8c20-5f60d584460d/conf.d/
total 124
drwx------ 2 nagios nagios 69632 Sep 12 10:11 comments
drwx------ 2 nagios nagios 45056 Sep 12 10:11 downtimes
drwx------ 2 nagios nagios  4096 Sep 12 10:11 hosts

Al2Klimov commented 1 year ago

> Daemon dies with exit code 139.

Despite the PR?

ymartin-ovh commented 1 year ago

@Al2Klimov I only applied #7786, with the adaptation to type->GetLoadDependencies that the #9577 change requires. I did not include yesterday's diff from #9861.

Al2Klimov commented 1 year ago

> The patch seems to help avoid the null reference use (for comments)

At least one thing it fixes, OK.

> I cherry-picked #7786 onto v2.14 and needed to adapt one thing:
>
> --- a/lib/remote/apilistener-configsync.cpp
> +++ b/lib/remote/apilistener-configsync.cpp
> @@ -459,8 +459,7 @@ void ApiListener::SendRuntimeConfigObjects(const JsonRpcConnection::Ptr& aclient
>                         bool unresolved_dep = false;
>
>                         /* skip this type (for now) if there are unresolved load dependencies */
> -                       for (const String& loadDep : type->GetLoadDependencies()) {
> -                               Type::Ptr pLoadDep = Type::GetByName(loadDep);
> +                       for (auto pLoadDep : type->GetLoadDependencies()) {

Please could you open a new PR into that PR, i.e. bugfix/api-runtime-object-sync-order is your base branch and your adaption is the diff? (Mention me in this case.)

ymartin-ovh commented 1 year ago

Ok, I will do this.

Hum, about load dependencies: how can I express that HostGroup should be a dependency of Host (i.e. load HostGroup objects before Host objects)?

Regards

Al2Klimov commented 1 year ago

Not sure that you wanna actually do this, but see https://github.com/Icinga/icinga2/pull/8119/files#diff-7529d2f2812859b880d8d6cdf34b2ad783226b8bc59a79077a602f76c6fde0f0 .

ymartin-ovh commented 1 year ago

https://github.com/Icinga/icinga2/pull/8119/files#diff-7529d2f2812859b880d8d6cdf34b2ad783226b8bc59a79077a602f76c6fde0f0 => I don't understand the relationship of "load_after Host".

The Host object references the group name, not the other way around.

Al2Klimov commented 1 year ago

This was only an example, just make the opposite if you wanna test it. But the directive is always load_after.
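To make the effect of a load_after declaration concrete, here is a small standalone sketch of dependency-ordered processing. It is not Icinga's implementation: `LoadAfter` and `OrderTypes` are hypothetical names, and the dependency table passed in is illustrative (in particular, "Host after HostGroup" is the relationship being asked about, not something the thread confirms exists). The sweep loop follows the idea in the diff comment quoted earlier, "skip this type (for now) if there are unresolved load dependencies".

```cpp
#include <map>
#include <set>
#include <stdexcept>
#include <string>
#include <vector>

// Maps a type name to the types that must be handled before it
// (the meaning given to load_after in this sketch).
using LoadAfter = std::map<std::string, std::set<std::string>>;

// Sweeps over the types repeatedly and emits each type once all of its
// dependencies have been emitted. Types with unresolved dependencies are
// skipped for now and retried in the next sweep. Type names are assumed
// to be unique.
std::vector<std::string> OrderTypes(const std::vector<std::string>& types,
                                    const LoadAfter& loadAfter)
{
    std::vector<std::string> ordered;
    std::set<std::string> done;

    while (ordered.size() < types.size()) {
        bool progressed = false;

        for (const std::string& type : types) {
            if (done.count(type))
                continue;

            bool unresolved = false;
            auto deps = loadAfter.find(type);
            if (deps != loadAfter.end()) {
                for (const std::string& dep : deps->second) {
                    if (!done.count(dep)) {
                        unresolved = true;
                        break;
                    }
                }
            }

            if (unresolved)
                continue; // retry in a later sweep

            ordered.push_back(type);
            done.insert(type);
            progressed = true;
        }

        // A full sweep without progress means a cycle or an unknown type.
        if (!progressed)
            throw std::runtime_error("cyclic or unknown load_after dependency");
    }

    return ordered;
}
```

Given the types Comment, Host and HostGroup, with Host declared after HostGroup and Comment after Host, the function returns HostGroup, Host, Comment. The input order only matters between types that have no dependency relationship.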

ymartin-ovh commented 1 year ago

So I was wrong: #7786 does not fix the issue with comment objects. The exit code 139 was the segfault I addressed yesterday by checking the host reference (#9861).

I can make a PR to refresh #7786 so that it applies against v2.14.0, but I don't know how to check whether it works. So far I haven't found any improvement with it. Maybe the diff will make more sense if I add load_after HostGroup; in host.ti.

Al2Klimov commented 1 year ago

Yes, test the latter if you believe it will help.

ymartin-ovh commented 1 year ago

I updated #9861.

So far, none of my tests with load_after seem to change anything. When Icinga starts, the satellite receives and tries to load objects in this order: ... / comments / ... / hosts / ... / hostgroups.

Maybe I am missing something.

Al2Klimov commented 4 months ago

Do all three together fix your problem?