ITRS-Group / monitor-merlin

Module for Effortless Redundancy and Loadbalancing In Naemon
https://itrs-group.github.io/monitor-merlin/
GNU General Public License v2.0

Current Merlin releases appear to trigger instability in Naemon #158

Open eschoeller opened 8 months ago

eschoeller commented 8 months ago

There is a release available here on GitHub: 2022.06.02. I started off using that. I then migrated to the packages hosted on this mirror: https://download.opensuse.org/repositories/home:/itrs-op5/CentOS_8_Stream/

In both cases I ran into situations where naemon would crash (and dump a core) and merlind would eventually peg itself at 99% CPU usage. I went in circles for a while trying to determine what was going on. I eventually narrowed it down to service and host checks that returned a CRITICAL state: naemon crashed when it attempted to generate a notification for them (even though I had notifications disabled globally). During my initial load testing I was using mostly ping checks that all returned OK, so I rarely hit this condition. But the moment I started getting checks that returned CRITICAL, things would break.

Anyway, long story short: I built merlin from source and everything is fine now. But given the run-around I went through, I figured I'd report this here for anyone else who may encounter this problem, or merely as a suggestion that it might be an appropriate time to package a new release.

I did see in the GitHub issues (#146) that there was a 2022.06.30 release, but I never actually found it.

I can re-configure these systems to trigger the issue again pretty easily if you need more info, but since the issue is fixed in the current source code I doubt any further troubleshooting is needed.
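For anyone who wants to exercise the CRITICAL path on a test system, here is a minimal sketch of a dummy check plugin. It relies only on the conventional plugin exit codes (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). The script, the `plugin_result` helper and the message text are hypothetical illustrations, not part of Merlin or Naemon, and I have not confirmed that this alone reproduces the crash.

```python
#!/usr/bin/env python3
"""Hypothetical test plugin: reports a fixed state using the conventional exit codes."""

# Conventional Nagios/Naemon plugin exit codes.
EXIT_CODES = {"OK": 0, "WARNING": 1, "CRITICAL": 2, "UNKNOWN": 3}


def plugin_result(state, message):
    """Return (exit_code, status_line) for the given state name."""
    state = state.upper()
    if state not in EXIT_CODES:
        # Unrecognised input is reported as UNKNOWN rather than raising.
        return EXIT_CODES["UNKNOWN"], f"UNKNOWN - unrecognised state {state!r}"
    return EXIT_CODES[state], f"{state} - {message}"


if __name__ == "__main__":
    code, line = plugin_result("CRITICAL", "forced failure for testing")
    print(line)
    # When installed as a real check command, finish with: raise SystemExit(code)
```

As written, the script only prints the status line. To use it as an actual check command, it would need to exit with the returned code as noted in the final comment, so that naemon sees the CRITICAL state.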

Some additional information about the systems where I encountered these problems:

CentOS Stream release 8
4.18.0-527.el8.x86_64 #1 SMP Thu Nov 23 14:16:19 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
libnaemon-1.4.1-18.1.x86_64
naemon-thruk-1.4.1-13.1.noarch
naemon-livestatus-1.4.1-14.1.x86_64
naemon-1.4.1-13.1.noarch
naemon-devel-1.4.1-18.1.x86_64
naemon-core-1.4.1-18.1.x86_64
naemon-vimvault-1.4.0-3.2.x86_64

NAME="Red Hat Enterprise Linux"
VERSION="8.9 (Ootpa)"
4.18.0-513.5.1.el8_9.x86_64 #1 SMP Fri Sep 29 05:21:10 EDT 2023 x86_64 x86_64 x86_64 GNU/Linux
naemon-livestatus-1.4.1-14.1.x86_64
naemon-1.4.1-13.1.noarch
libnaemon-1.4.1-18.1.x86_64
naemon-vimvault-1.4.0-3.2.x86_64
naemon-core-1.4.1-18.1.x86_64
naemon-thruk-1.4.1-13.1.noarch

(Both CentOS and RHEL systems were originally fetching naemon from https://labs.consol.de/repo/stable/rhel8/x86_64/ but switched to https://download.opensuse.org/repositories/home:/naemon/CentOS_7/)

It also seems like I may have had this exact same problem on a set of Debian machines on 3/6/2023 after naemon got upgraded there. I suppose the fix slipped my mind!

PRETTY_NAME="Debian GNU/Linux 10 (buster)"
4.19.0-25-amd64 #1 SMP Debian 4.19.289-2 (2023-08-08) x86_64 GNU/Linux
ii libnaemon:amd64 1.4.1-1 amd64
ii naemon 1.4.1-1 amd64
ii naemon-core 1.4.1-1 amd64
ii naemon-dev 1.4.1-1 amd64
ii naemon-livestatus 1.4.1-1 amd64
ii naemon-thruk 1.4.1-1 amd64
ii naemon-vimvault 1.4.0-1 amd64
ii thruk 3.10-1 amd64