bernisys commented 3 months ago

Feature Request

Is your feature request related to a problem? Please describe

We have a test VM which was able to handle the overall amount of data up to Cacti 1.2.23 or so, but after we took the step to either 25 and 27, we have problems with the poller cycle overrunning the boost part. Something has changed which causes it to run much slower than before. At this is where the problem starts: At one point, boost gets killed by timout constraints, when the next boost is starting. From that time on it cannot complete any more, because boost runs on more and more data, while data is collected in parallel, so it gets killed even earlier (in terms of amount of processed data) and starts from zero each time, over and over again, causing more and more DB load until the whole machine stalls completely.

While this problem exists in test, we would not dare to update our production systems.

Do you have any idea what might be the problem here? What had been changed that slows down the process? Switching back to the old version was instantly resolving the issue. I have already adapted the sql server parameters in accordance to the suggestions in the GUI, but this was only giving a slight improvement. I have also increased the amount of RAM for the DB itself, but that was also not really helping.

Describe the solution you'd like

The problem is starting with how boost collects the whole meta-information which it uses to run with. If i have understood the code correctly:

each boost cycle basically starts with the collection of data-IDs across the whole set of poller_output_boostarch* tables
then the table is sorted by data-ID and split into sections, assigning one consecutive and equally large set of IDs to each child process
then the child processes are forked, working on their specific set, starting from the smallest data-ID

When the boost process is overrunning another one, it gets killed and the whole process starts from scratch. Which means that boost has no chance to know, which data has been already processed.

One solution that i have come up with and which i am currently testing is: I try to limit the amount of tables processed by boost, so that the DB is stressed a bit less (faster processing if all tables can be kept in RAM). I've tried around with this, adding an array_slice() to the places where boost_get_arch_table_names() is used to retrieve the list of tables. _BTW: Slight inconsistency: In boost_process_local_data_ids() the variable is called "archive_tables" while in boost_output_rrd_data() and boost_prepare_process_table() it is just "archtables".

Another idea i came up with involves the actual mechanism that keeps track of the data IDs being processed. Here's what i think will help out of those kind of situations:

don't rebuild the table poller_output_boost_local_data_ids every time
clear that table only if a boost process was able to finish in time
if at the start of a boost run this table is empty, fill it with data
remember the name of the last (=highest timestamp) archive-table which is getting processed in the current run
if a boost run completes in time, re-fill it, in case new archive tables are present, which have not been processed yet

One question, what does poller_output_boost_processes actually do? If I remember clearly, it once contained the child process IDs, but nowadays that table is mostly empty.

Describe alternatives you've considered

More CPUs and RAM for a faster database ... but that's a rather lame approach, anyone can do that. ;) It is not solving the actual issue, it just suppresses the symptoms. It's better to actually improve the product and not circumvent effects while the actual source of the problem still exists.

bernisys commented 3 months ago

p.S. not sure if that's really a feature or rather a bug ... you decide :)

TheWitness commented 3 months ago

I guess this need an online session. How big is the VM? Upload some stats. Those arch tables should not exist unless there is an issue. Waiting for more info here.

TheWitness commented 3 months ago

@bernisys, here is a script. Run this as follows:

./showproc /tmp/output_for_cacti.txt

You can adjust the sleep time that is reasonable for your install. Run it when boost is running.

#!/bin/bash

if [ -f "/usr/bin/mariadb" ]; then
  program="mariadb"
else
  program="mysql"
fi

append=""
if [ "$1" != "" ]; then
  echo "Appending to file $1"
  append="| tee -a $1"
fi

sql="SELECT USER, TIME, STATE, SUBSTRING(INFO,1,80) AS INFO
  FROM information_schema.processlist
  WHERE STATE NOT LIKE '%Master%'
  AND INFO NOT LIKE '%processlist%'
  AND INFO IS NOT NULL
  ORDER BY TIME DESC
  LIMIT 20"

while true; do
  clear
  date
  $program -e "$sql" $append | awk -F "\t" '{printf("%-10s %-5s %-20s %-80s\n", $1, $2, $3, $4)}'
  sleep 1
done

bernisys commented 3 months ago

Hi Larry, thanks a lot for checking, here are some quick pointers:

The VM has 6 vCPUs and 32GB of RAM
Intel(R) Xeon(R) CPU E5-2660 v3 @ 2.60GHz
MySQL subprocesses are using 100% CPU constantly on 4-6 cores
Load is around 7-9 at the moment, peaking 14-16 sometimes
before adapting the SQL parameters it went >40 (increased innodb_buffer_pool_size from 8 GB to 12, and then to 16, switched off innodb_doublewrite plus a few other things that the installer noted)
Total RAM usage is at 14GB, but also 10GB of swap is noted
This started only when we were trying to upgrade to 1.2.25-27 - up to i think 1.2.24 we never saw this problem

Things that are unrelated (trouble started only after the upgrade with no other change in the constellation) but are worth noting:

The cacti DB is also kept on the same VM, but i think for a test VM that is okay - also in Prod on a more powerful machine we didn't see any unsolvable trouble with the DB on the same host during the past years. We think about a separation already - needs some thorough planning, as we would want to have the DB on the same physical switch/FEX chassis.
There is a small elastic test instance running in parallel on that VM with very little load, also here we're in a process to separate ELK to a new VM

Running the script it constantly shows this rough picture:

Thu Jul 18 17:18:07 CEST 2024
USER       TIME  STATE                INFO
cacti      6     Sending data         SELECT DISTINCT dtr.data_source_name, dif.data_name FROM graph_templates_item AS
cacti      4     Sending data         SELECT DISTINCT dtr.data_source_name, dtr.data_source_name FROM data_template_rr
cacti      3     Sending data         SELECT DISTINCT dtr.data_source_name, dtr.data_source_name FROM data_template_rr
cacti      2     Sending data         SELECT DISTINCT dtr.data_source_name, dtr.data_source_name FROM data_template_rr
cacti      1     Sending data         SELECT DISTINCT data_source_name, rrd_name, rrd_path FROM data_template_rrd AS d

TIME going mostly up tp 3..4 but sometimes up to 8..10 - for both selects shown above Here's some high timer examples, grabbed out of a log i am writing to disk:

cacti      10    Sending data         SELECT DISTINCT dtr.data_source_name, dtr.data_source_name FROM data_template_rr
cacti      11    Sending data         SELECT DISTINCT data_source_name, rrd_name, rrd_path FROM data_template_rrd AS d

I will keep that log running over night, i can attach it here if needed. Anything else you need to know? I can upload some screenshot for example of "htop" as well, if that helps.

bernisys commented 3 months ago

Here's the top list, yeah there's a lot of crap running, but mostly using little to no CPU and only a little amount of precious RAM. The real tops are the mariadb processes.

TheWitness commented 3 months ago

This is quite bad. Basically, your "ELASTIC SEARCH" + "MariaDB" is taking all the ram, and pushing everything way into swap. If that swap is physical disk, you are dead. Even if the vdisk is NVMe/SSD, you are ruining them (pre-mature wear).

In my assessment, you need to get this server to above 64GB, I mean that's the smallest. I would go to 96GB or some increment like that, or remove Elastic Search from the system.

So, once one of those two things happen and maybe even and doubles the cores, reboot. Then, keep your eye on the total memory and make sure you are not swapping. The reason that I shot on the high side (96GB) is due to the Apache workers, and disk cache. Depending on the size of the server and then number of concurrent connections, Apache can take several Gig of memory. And you want the sum of "du -hs /var/www/html/cacti/rra" to fit into disk cache if you can.

I hope this helps. Not much of a Cacti issue, more of a systems issue.

bernisys commented 3 months ago

Hi Larry, yes it is bad .. and we strongly consider moving ELK out of that picture anyway. We just stuffed it on the same test machined, because there wasn't any other server to use. This takes time and money though, so it won't happen too soon, but we work on a new project for an environment extension, and i plan to add a new VM for Test ELK. Biggest problem is: Companies seem to have no money - in my home environment i am much more flexible. Sad capitalistic world.

Se let's get back to the story - sorry, it will be a longer text again, simply because i still have a few doubts i want to clarify with you. Always keep in mind, that this is a constructive talk among techies, trying to improve things, not trying to blame anyone.

1) Quick-fix: RAM and CPU extension is on the way already, but needs time ... sigh ... I considered +16GB (to 48) and +4 vCPUs (to 10) last week, even before you wrote your assessment above, so i see that generally we're on a similar level of thinking, that sounds promising indeed :)

2) I have stopped ELK now, just to see how it behaves without it over the weekend. If it still fails to process, i would tend to not blame it too much on ELK - a bit, yes, but not entirely. Currently there's just one table in progress, i've cleared out all others, let's see if it stays like that.

3) The point with the disk cache is absolutely valid, and maybe the increase of RAM i configured for mysql has somehow influenced the disk cache. So i might have traded one thing for another. I can delete several files, as they are not processed (see 3.1, active device count), which will probably help a bit.

3.1) On the other hand, what needs to be taken into consideration is that while the test VM has roughly the same devices in the DB as the prod environment, ~95% of them are disabled. Which means, it's only processing around 200 devices instead of 4000. So boost is having less stress, but the queries used during polling are facing the same stress level (i thought this was a halfway brilliant idea to show any problems during the poll cycles) This makes our test VM kind of a canary, simply to make it fail before prod fails. And this works quite well as it seems, has protected us for inconveniences in Prod a few times now, we see performance degradations on test quite easily. There are several queries which go across the whole host_snmp_cache or graph_templates_item, which are by far the largest tables, and those are really good indicators for things going wrong.

4) Apache workers are more or less silent, they are the first ones swapped out when memory gets depleted. Because the system is not used too much by others. I can decrease the workers, good point. Would not limit the users in this case. Here's a crazy idea .. would cacti run without Apache running on the main instance?

5) I checked the swap content, and as i can see there are no critical components in it. Most of it are sleeping background shells i use for administration (running in screen sessions), sleeping parts of Elastic, only one part of Mariadb which gets flushed out pretty early and is never touched. But this was also the case when we ran 1.2.23 which - as mentioned - was running quite fine with the same setup otherwise.

Coming to that point, here's now just one simple logical question that might weaken your point that it is not a cacti issue a bit, so please don't be too upset: Why did it work perfectly smooth when we were on 1.2.23 then? Or the other way around - what has so significantly changed in 1.2.25-27 that makes it (or the DB) so slow or ram-hungry in comparison to the earlier version?

As far as i am aware we changed nothing else in the system, just did the upgrade and it stalled. I consider this a strong indicator that something has changed inside cacti which is making it less performant. Probably for a reason, but maybe there is room for improvement. This thought made me look into the boost code first where I detected the let's say suboptimal handling of data-IDs.

Don't get me wrong, I really don't want to sound rude, i wanna help improving the product. (It's a little obsession, trying to improve everything .. i consider it not a bad one though ;) ) As a techie and engineer, I tend to try looking into the facts, and here is a situation that shows a flaw, and i always try to fix flaws so that they don't reappear fall on our feet later. And i think this boost improvement can be implemented with not too much effort.

If boost would not process the same data over and over again in such a situation, this would already be a real benefit. I know, it's a corner case, but if a system is temporarily overloaded for whatever reasons, this can happen everywhere. In such cases it would really be best if boost would skip the already processed data (i.e. remember the last arch table and the last completed data ID for each child, and not reconstructing the table)

Since I have been reading the code already, and I am getting a bit familiar with boost mechanism (again), I can try to fiddle that part into the boost script, but i'm not sure how long this would take - and might be i need a few pointers from your side.

I can start with porting the changes into my cacti fork on my github account and ping you, there are a few debugging prints added and the array_slice part, which i am observing at the moment. I still need to add some DB calls to remember the last processed tables and data-IDs in the settings table, i've seen that several other internal variables are already saved there, so i intend to follow that example.

bernisys commented 3 months ago

Update:

after stopping ELK there is more RAM and the DB is able to keep up again.

But still the question is in the room: Why was it all working fine on 1.2.23 and fails once we go to 1.2.25 and over? Something must have changed inside the SQL queries, so that they cause more stress on the DB.

And please consider the adaptation of the boost process. I recommend it, as it will give us more time to react in a congestion situation. I can help with the development, i can start by pulling in the debugging statements, then you can already merge those, and i can also try to come up with an algorithm that can at least skip processing the already processed data in a boost kill & re-run situation. (Just not this weekend ... age++ event coming up ...)

Cheers & have a nice weekend yourselves @all ! Let's have a chat next week. Do you use Discord by chance?

bernisys commented 3 months ago

Checking the swap infos from all processes, i see that there is still 5GB swapped out from MySQL ... Swap usage just went from ~12Gb down to 8GB, but there was still enough space (swap and RAM) that was available even when ELK was running. And RAM was also free to a certain extent. 14GB used, now down to around 7GB, but the system still had more than 16 gigs to breathe. So what's with all the free RAM? RRD files take around 8GB, so that would still fit into the buffers easily. Didn't check the buffer/cache stats unfortunately, but at the moment they are around 4GB, and there's still 23GB marked as available now. I think avail mem was around 16GB previously, so there's still a gap that i don't quite understand here. Only thing i could imagine is that the DB files in ELK were hogging the FS-cache. But then again, ELK would be dropping in performance, which it did not that much, even when a lot was cached there. Still strange - any idea for my missing piece of information?

TheWitness commented 3 months ago

You can flush that swap this way, may slow the system though. Do it from a separate window, with one window running htop while the other window does this:

swapoff -a
swapon -a

TheWitness commented 3 months ago

You can flush that swap this way, may slow the system though. Do it from a separate window, with on window running htop while the other window does this:

swapoff -a
swapon -a

TheWitness commented 3 months ago

I have it from a few other large customers that boost is working fine on a few 15k host systems with no issues.

There is that one boost bug that I should get some time to work on tonight I hope.

It's important that innodb tables all fit in memory, if they can't things get slow. Check out the mysqltuner. It'll provide a nice report.

TheWitness commented 3 months ago

We use slack. You can join us there if you like.

bernisys commented 2 months ago

Hi Larry, sorry i didn't respond .. bits of stress were hitting me and the garden wants to see some care too ...

Yup, emptying swap was known to me, but I first wanted to see what's actually in it and if the data is actually used over time. I produced some script to check all /proc/ files, collect and summarize the information, showing the top 20 or so hoggers. Tell me if you're interested. (I think I should create a repo for my tools ...) mysql swaps out a fair bit, but never reclaims it, even if enough RAM is available - so it seems to be unused. The swapoff command was taking a lot of time, this stuff is slow as hell (decreasing with about 16MB/s), but of course it shall not stress the system too much while doing the relocation.

A colleague needed the ELK service for development so he restarted with lower mem settings, and instantly the arch tables started to appear again. So I have just now lowered the mysql memory settings a bit (sometimes less is more) and restarted. Will check and see what happens. A few tables have been created over the weekend, i hope they can be processed with more RAM available for the OS.

If mysql uses the OS filesystem cache, then the data is twice in RAM, same for ELK ... once in the FS cache and also in their own cache. That's quite a lot of overhead. I edited the mariadb config to introduce innodb_flush_method=O_DIRECT now, let's see what happens ...

TheWitness commented 2 months ago

You and @xmacan have this fondness for gardens. ELK is memory hungry, you just have to watch it take as much as it can. Just keep stuff out of swap, and more cores and memory is always good.

The other thing about ELK, is when it searches, it'll use a lot of core time and IO. So, it would be best to move it to separate hardware.

This is so evident in the NetFlow development we are doing right now. With a lot of cores, NVMe storage and using the Aria storage engine, MariaDB can access many GB/sec across multiple servers driving all the cores to 100% at the same time. I was shocked.

How is that possible you ask? I wrote a parallel query API for Cacti that leverages MaxScale to distribute the query shards across all the servers. It uses a map reduce algorithm to achieve it's speed. Pretty cool thing for this old geek.

TheWitness commented 2 months ago

Closing this one.

Cacti / cacti

Boosting boost #5786

Feature Request

Is your feature request related to a problem? Please describe

Describe the solution you'd like

Describe alternatives you've considered