NagiosEnterprises / ndoutils

NDOUtils - Database Output for Nagios Core
GNU General Public License v2.0

Help wanted: Bug & MySQL C API review #57

hedenface closed this issue 5 years ago

hedenface commented 5 years ago

Currently working through a complete rewrite of NDOUtils (check the ndo-3 branch).

One of the goals is to remove the need for the kernel message queue, as it is the source of many admin headaches in larger Nagios systems.

I've currently hit a roadblock, and was curious if any previous or new contributors to core or ndo would be willing to take a look and get some fresh eyes on it. @knweiss comes to mind immediately.

Here's the main issue: I'm trying to avoid a lot of individual insert calls to MySQL by building several bulk inserts in a loop. Originally, during the rewrite, we were doing individual inserts for brevity and to get it working, but the initial performance testing, once that was complete, revealed that something needed to change immediately.

https://github.com/NagiosEnterprises/ndoutils/blob/ndo-3/src/ndo-startup.c#L527-L810

Here is ndo_write_hosts. All of the ancillary data (a host's parent hosts, contacts, contactgroups, and customvars) depends on the host already existing in the nagios_hosts table. So we loop over all hosts, build the appropriate queries, insert the data, and repeat until all hosts have been inserted. THEN we loop over them again and build numerous queries for each of the related objects.
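For readers following along, the chunked bulk-insert idea described above can be sketched roughly as follows. This is only an illustration, not the code in ndo-startup.c: the helper names `buf_append` and `build_bulk_insert`, the single `display_name` column, and the use of `?` placeholders (with values bound separately) are all assumptions made for the example. The one thing it tries to show is that every append into the query buffer is bounds-checked, and that each query holds at most a configured number of rows (the role ndo_max_insert_values plays in the description).

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: append text to a fixed-size buffer while tracking
 * the used length. Returns 0 on success, or -1 if the text (plus the
 * terminating NUL) would not fit; the buffer is left unchanged then. */
static int buf_append(char *buf, size_t cap, size_t *len, const char *text)
{
    size_t n = strlen(text);

    assert(buf != NULL && len != NULL && *len < cap);
    if (n >= cap - *len) {
        return -1;
    }
    memcpy(buf + *len, text, n + 1); /* n + 1 copies the NUL as well */
    *len += n;
    return 0;
}

/* Hypothetical helper: build one multi-row INSERT covering at most
 * max_values of the remaining rows. Values are assumed to be bound to
 * the '?' placeholders elsewhere. Returns the number of rows in the
 * query, or -1 if the buffer is too small. */
static int build_bulk_insert(char *buf, size_t cap,
                             size_t remaining, size_t max_values)
{
    size_t len = 0;
    size_t rows = 0;

    assert(buf != NULL && cap > 0);
    buf[0] = '\0';

    /* Table and column names are illustrative only. */
    if (buf_append(buf, cap, &len,
                   "INSERT INTO nagios_hosts (display_name) VALUES ") != 0) {
        return -1;
    }

    while (rows < remaining && rows < max_values) {
        if (rows > 0 && buf_append(buf, cap, &len, ",") != 0) {
            return -1;
        }
        if (buf_append(buf, cap, &len, "(?)") != 0) {
            return -1;
        }
        rows++;
    }

    return (int) rows;
}
```

A caller would loop while rows remain: build a query, execute it, subtract the returned row count from `remaining`, and repeat. Treating a -1 return as a hard error, rather than sending a truncated query, keeps a buffer that is too small from silently corrupting neighboring memory.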

This all works, except that once it gets to the custom variables, I get a segfault. I've narrowed this down, and what seems to be happening is that (char *) var_query_on_update is simply not readable any longer. On a large system (15k+ hosts), it usually starts erroring around the 500th host, no matter how big (or small) the ndo_max_insert_values integer is set (via ndo.cfg).

If anyone has time to review the code and help out - we'd certainly appreciate it.

Likewise, if anyone has experience with the MySQL C API and can point out a flaw, or something in this code that is going to blow up one day, that would also be appreciated. (Keep in mind that all of the functions in ndo-startup.c are currently being rewritten to follow the ndo_write_hosts and ndo_write_services pattern of insertion.)

Thanks!

hedenface commented 5 years ago

Started adding an isolation case: https://github.com/NagiosEnterprises/ndoutils/commit/77320ba37aa641832b9fa26b023c6b00a0df71f0

hedenface commented 5 years ago

I believe this issue was successfully replicated here (commit https://github.com/NagiosEnterprises/ndoutils/commit/7d946d9dadfda89b927a88c26b07099595b5c51b) https://github.com/NagiosEnterprises/ndoutils/blob/7d946d9dadfda89b927a88c26b07099595b5c51b/src/bug-test.c - and this also contains the fix.

Which ...is a bit silly on my part, but that's how these things go, I suppose https://github.com/NagiosEnterprises/ndoutils/blob/7d946d9dadfda89b927a88c26b07099595b5c51b/src/bug-test.c#L50

nook24 commented 5 years ago

I found this issue via your reddit post. However, have you already checked out Statusengine 2? It's a fully working NDO drop-in replacement. No kernel message queues, no other random issues anymore. https://statusengine.org/oldstable/ https://github.com/nook24/statusengine/issues/41

sawolf commented 5 years ago

@nook24 I hadn't seen that website before, that's pretty nice!

I think for now we still need to have a solution that we "control", but the approach for your project is certainly interesting.

nook24 commented 4 years ago

Any news about ndo-3? :)

sawolf commented 4 years ago

Yea, I suppose I can give an update:

nook24 commented 4 years ago

Is @hedenface not working on Nagios projects anymore?

> CPU load on 50k hosts/services dropped from a peak of 100+ to ~1.27.

This sounds great. But was NDOUtils producing this high load, or Nagios Core itself, or some other process?

> I was asked to remove it last week because we weren't sure whether we would release the finished product as open-source software.

That's too bad, even though I haven't used NDO since November 2014.