Closed (esunar closed this 2 years ago)
(by vern) With most subordinate charms, you don't want them to be installed more than once on a machine. The nrpe charm is special in that you can relate it to other charms on the same machine, and each relation can enable additional checks.
This is a good thing and should not be called out by juju-lint.
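For illustration, a hyper-converged bundle that intentionally produces this pattern is wired up roughly as in the sketch below. The relation lines mirror the bundle.yaml excerpt quoted later in this issue; the application names, charm names, and placement details are assumptions for the example, not a specific deployment.

    # Sketch of a hyper-converged bundle fragment (illustrative only).
    # Both principal charms are placed on the same machine, and a single
    # nrpe subordinate application is related to each of them, so two
    # nrpe-host units legitimately end up on that machine.
    applications:
      ceph-osd:
        charm: ceph-osd
        num_units: 1
        to: ["0"]            # assumed placement, same machine as nova-compute-kvm
      nova-compute-kvm:
        charm: nova-compute
        num_units: 1
        to: ["0"]
      nrpe-host:
        charm: nrpe          # subordinate; no placement of its own
    relations:
      - [ nova-compute-kvm, nrpe-host ]
      - [ "ceph-osd:nrpe-external-master", "nrpe-host:nrpe-external-master" ]
      - [ "nova-compute-kvm:nrpe-external-master", "nrpe-host:nrpe-external-master" ]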
(by nobuto) There is still confusion because of this false positive. People may be surprised by the error and tend to remove relations even when those "duplicate" relations are necessary.
Just for the record:
[1 relation]
    $ juju status --relations | grep nrpe-host:
    ceph-osd:juju-info    nrpe-host:general-info    juju-info    subordinate

    $ juju run-action --wait nrpe-host/2 list-nrpe-checks
    unit-nrpe-host-2:
      UnitId: nrpe-host/2
      id: "18"
      results:
        checks:
          check-conntrack: /usr/local/lib/nagios/plugins/check_conntrack.sh -w 80 -c 90
          check-disk-root: '/usr/lib/nagios/plugins/check_disk -u GB -w 25% -c 20% -K 5% -p / '
          check-load: /usr/lib/nagios/plugins/check_load -w 32,16,8 -c 64,32,16
          check-mem: /usr/local/lib/nagios/plugins/check_mem.pl -C -h -u -w 85 -c 90
          check-swap: /usr/lib/nagios/plugins/check_swap -w 40% -c 25%
          check-swap-activity: /usr/local/lib/nagios/plugins/check_swap_activity -i 5 -w 10240 -c 40960
        timestamp: Fri Jul 31 02:40:00 UTC 2020
      status: completed
      timing:
        completed: 2020-07-31 02:40:01 +0000 UTC
        enqueued: 2020-07-31 02:39:57 +0000 UTC
        started: 2020-07-31 02:40:00 +0000 UTC
-> 6 checks
[2 relations]
    $ juju status --relations | grep nrpe-host:
    ceph-osd:juju-info               nrpe-host:general-info            juju-info              subordinate
    ceph-osd:nrpe-external-master    nrpe-host:nrpe-external-master    nrpe-external-master   subordinate

    $ juju run-action --wait nrpe-host/2 list-nrpe-checks
    unit-nrpe-host-2:
      UnitId: nrpe-host/2
      id: "20"
      results:
        checks:
          check-ceph-osd: /usr/local/lib/nagios/plugins/check_ceph_osd_services.py
          check-conntrack: /usr/local/lib/nagios/plugins/check_conntrack.sh -w 80 -c 90
          check-disk-root: '/usr/lib/nagios/plugins/check_disk -u GB -w 25% -c 20% -K 5% -p / '
          check-load: /usr/lib/nagios/plugins/check_load -w 32,16,8 -c 64,32,16
          check-mem: /usr/local/lib/nagios/plugins/check_mem.pl -C -h -u -w 85 -c 90
          check-swap: /usr/lib/nagios/plugins/check_swap -w 40% -c 25%
          check-swap-activity: /usr/local/lib/nagios/plugins/check_swap_activity -i 5 -w 10240 -c 40960
        timestamp: Fri Jul 31 02:41:11 UTC 2020
      status: completed
      timing:
        completed: 2020-07-31 02:41:12 +0000 UTC
        enqueued: 2020-07-31 02:41:11 +0000 UTC
        started: 2020-07-31 02:41:12 +0000 UTC
-> 7 checks
[3 relations]
    $ juju status --relations | grep nrpe-host:
    ceph-osd:juju-info                       nrpe-host:general-info            juju-info              subordinate
    ceph-osd:nrpe-external-master            nrpe-host:nrpe-external-master    nrpe-external-master   subordinate
    nova-compute-kvm:nrpe-external-master    nrpe-host:nrpe-external-master    nrpe-external-master   subordinate

    $ juju run-action --wait nrpe-host/2 list-nrpe-checks
    unit-nrpe-host-2:
      UnitId: nrpe-host/2
      id: "22"
      results:
        checks:
          check-ceph-osd: /usr/local/lib/nagios/plugins/check_ceph_osd_services.py
          check-conntrack: /usr/local/lib/nagios/plugins/check_conntrack.sh -w 80 -c 90
          check-disk-root: '/usr/lib/nagios/plugins/check_disk -u GB -w 25% -c 20% -K 5% -p / '
          check-libvirtd: /usr/local/lib/nagios/plugins/check_systemd.py libvirtd
          check-load: /usr/lib/nagios/plugins/check_load -w 32,16,8 -c 64,32,16
          check-mem: /usr/local/lib/nagios/plugins/check_mem.pl -C -h -u -w 85 -c 90
          check-nova-compute: /usr/local/lib/nagios/plugins/check_systemd.py nova-compute
          check-swap: /usr/lib/nagios/plugins/check_swap -w 40% -c 25%
          check-swap-activity: /usr/local/lib/nagios/plugins/check_swap_activity -i 5 -w 10240 -c 40960
        timestamp: Fri Jul 31 02:42:31 UTC 2020
      status: completed
      timing:
        completed: 2020-07-31 02:42:32 +0000 UTC
        enqueued: 2020-07-31 02:42:29 +0000 UTC
        started: 2020-07-31 02:42:32 +0000 UTC
-> 9 checks
(by ec0) So, I think a generic way to exempt subordinates from the multiple-placement check would address this. It would definitely need to be exception based, as the default situation with the majority of subordinate charms is that they expect not to be installed multiple times. NRPE is indeed an exception to this; NTP, as an example, definitely does not expect to be installed multiple times.
Triaging as high.
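A rough sketch of what such an exception-based rule could look like is below. The keys are hypothetical and are not existing juju-lint configuration; the point is only that exemptions would be opt-in per subordinate rather than a blanket change to the duplicate-placement check.

    # Hypothetical lint-rule fragment (keys are illustrative, not real juju-lint syntax).
    # Subordinates listed under allow-multiple-per-machine would be skipped by the
    # duplicate-placement check; anything else (e.g. ntp) keeps the current behaviour.
    subordinates:
      allow-multiple-per-machine:
        - nrpe-host
      single-per-machine:
        - ntp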
(by ec0) One question here, can you not simply relate nrpe-host to both nova-compute-kvm and ceph-osd, and use a single nrpe-host?
(by ec0) *Reword: Can you not simply relate nrpe-host to the juju-info relation of either nova-compute-kvm or ceph-osd, and use a single nrpe-host? Which checks are missing in that situation, and is it possible to monitor those checks by relating via the external-master or local-monitors relations or similar? Just want to make sure we fix this in the right place, because it might also make sense to raise a bug against NRPE and/or ceph-osd/nova-compute to handle this situation more elegantly.
This is basically https://bugs.launchpad.net/fce-templates/+bug/1855659 assigned to the correct project.
Looks like there are some duplicate relations, as we are using a hyper-converged architecture (nova-compute and ceph-osd are on the same physical host).
    2019-12-06 15:42:28 [INFO] following subordinates where found on machines more than once:
    2019-12-06 15:42:28 [ERROR] -> nrpe-host [21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35]

    $ git grep 'nrpe-host' config/bundle.yaml | egrep 'nova-compute|ceph-osd'
    config/bundle.yaml:  - [ nova-compute-kvm, nrpe-host ]
    config/bundle.yaml:  - [ "ceph-osd:nrpe-external-master", "nrpe-host:nrpe-external-master" ]
    config/bundle.yaml:  - [ "nova-compute-kvm:nrpe-external-master", "nrpe-host:nrpe-external-master" ]
We end up with the following situation:

    Unit                  Workload  Agent  Machine  Public address  Ports          Message
    ceph-osd/0            active    idle   21       10.16.41.53                    Unit is ready (6 OSD)
      nrpe-host/18        active    idle            10.16.41.53     icmp,5666/tcp  ready
    nova-compute-kvm/0    active    idle   21       10.16.41.53                    Unit is ready
      nrpe-host/20        active    idle            10.16.41.53                    ready

    Machine  State    DNS          Inst id                             Series  AZ     Message
    21       started  10.16.41.53  sf-jkt001-hyperconverge020-rack003  bionic  01-03  Deployed
We can see that two units of the nrpe-host application land on the same physical machine. This is not a bug, as one of them is related to ceph-osd and the other to nova-compute-kvm. Monitored services are chosen based on the principal relation, and we have two principal charms collocated here, each providing a different set of Nagios checks.
Hence the juju-lint alert seems to be a false positive, as this is an expected situation.
Imported from Launchpad using lp2gh.
date created: 2019-12-10T09:59:09Z
owner: majduk
assignee: jfguedez
the launchpad url