Cacti / cacti

Cacti ™
http://www.cacti.net

Graphs broken since 1.2.x #4838

Closed danielnavcom closed 2 years ago

danielnavcom commented 2 years ago

Constant broken graphs - 1 minute polling

We only have Cisco devices added, not many, and we create graphs for interface traffic, unicast, errors, and non-unicast. The server has SSDs, Intel Xeon E5 CPUs, and 128GB of RAM, with 1 spine collector (1 process, 32 threads). We changed the process and thread counts, but nothing seems to fix it, so I don't think it's the config.

There is nothing wrong in the logs (we log at the detailed level) in the timeframe when the white interruptions show in the graphs - see attached images: https://imgur.com/a/LS3Oxg4

We do occasionally see something weird in the logs - see attached images: https://imgur.com/a/C2mQqfZ. If we create errors/discards interface graphs, those messages appear more often, complaining about them.

We even removed the graphs and data sources, re-indexed, and re-created them, but the problem still shows. We use SNMP v3 on all devices.

We are currently running Cacti and spine 1.2.21.

Boost is enabled and seems to work fine.

We have no idea what changed in 1.2.x that causes this issue.

Another issue is that we cannot even create graphs for non-unicast packets - they all fail, and the logs suggest invalid OIDs are being used - see image: https://imgur.com/a/07FA6JF

I believe this is due to the templates not being updated during the upgrade, or something like that - I saw some posts about it.

Do you guys have any idea what could cause these issues and how to fix them?

Thanks in advance.

netniV commented 2 years ago

If you feel that you need to re-import the templates, that can be done at any time; the packages are included with Cacti and can be downloaded to the local machine before importing them into Cacti.

danielnavcom commented 2 years ago

Upgraded from git to latest: Version 1.3.0 - Dev 2298283f7 @ 2022-06-21 08:25

Will check and see if the problem still shows.

Now, log is spammed with these messages:

21.06.2022 14:25:26 - CMDPHP = (cacti_version_compare(1.3.0.99.1655789148.2298283f7, 1.3.0.99.1593739325, <))

danielnavcom commented 2 years ago

Also, I cannot see Tholds after the upgrade to this version. The plugin is installed (latest release) and active, but where are my thresholds?

TheWitness commented 2 years ago

Check to see if thold is disabled. How is your cacti.log?
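
One quick way to check that from the CLI (just a sketch: the plugin_config table and its columns are assumed from the Cacti 1.2.x schema, and the database name/credentials are placeholders for your install):

# Sketch: look at thold's row in the plugin management table
mysql -u cactiuser -p cacti \
  -e "SELECT directory, version, status FROM plugin_config WHERE directory = 'thold';"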

TheWitness commented 2 years ago

Don't use 1.3.x/develop. It's not entirely stable right now.

danielnavcom commented 2 years ago

I downgraded to 1.2.22, but it seems to break the thresholds/CDEFs... The thresholds now read in megabytes instead of megabits; it's a mess here.

danielnavcom commented 2 years ago

Restored from backup, re-ran the 1.2.22 installer, and imported the templates. Not sure if it changes anything; I still get this in the logs:

21.06.2022 15:59:06 - POLLER: Poller[Main Poller] PID[3292] WARNING: Data Template 'Interface - Unicast Packets' is impacted by lack of complete information
21.06.2022 15:59:06 - POLLER: Poller[Main Poller] PID[3292] WARNING: There are 1 Data Sources not returning all data leaving rows in the poller output table. Details to follow.

I will watch the graphs.

TheWitness commented 2 years ago

You have to be more specific; I need log entries from the Cacti log. If thold is getting consistently disabled, there is a patch for that in the develop branch of thold.

TheWitness commented 2 years ago

Importing packages or templates?

TheWitness commented 2 years ago

Filter on "SYSTEM STATS:" in your Cacti log and screen capture an image.
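
If the CLI is quicker, something like this pulls the same lines (log path assumed for a default /var/www/html/cacti install):

# Show the most recent poller statistics lines from the Cacti log
grep "SYSTEM STATS:" /var/www/html/cacti/log/cacti.log | tail -n 20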

danielnavcom commented 2 years ago

I specified in the first post how we run.

spine

During the installation I checked the archived files to import

https://imgur.com/a/gvlLqTm

We log everything at the detailed level.

TheWitness commented 2 years ago

Increase your device timeouts. What is your Max OIDs setting?

TheWitness commented 2 years ago

For that number of devices, you only really need 1 process and, say, 10 threads.

danielnavcom commented 2 years ago

The problem still shows: no errors in the logs, all green, and we still have these interruptions.

I have now set 1 process and 12 threads.

Maximum OIDs Per Get Request - 5

Bulk Walk Maximum Repetitions - auto detect on re-index

SNMP Timeout - 5000

All devices are added with snmp v3.

TheWitness commented 2 years ago

Just out of curiosity, is SELinux enabled or enforcing?

danielnavcom commented 2 years ago

SELinux status: disabled

TheWitness commented 2 years ago

Okay, it looks like the poller interval is 1 minute. Verify that with the settings in Console > Configuration > Settings > Poller. Then do the following two things.

rrdtool info somebrokenrrdfile.rrd

Gather the data source id from that RRD file and then run:

SELECT * FROM poller_item WHERE local_data_id = ?\G // replacing ? with the data source id. Post the output.
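
For example (a sketch only: the filename and data source id 123 are hypothetical, and the database name/credentials are placeholders):

# Cacti RRD filenames normally end in the local data source id, e.g. _123.rrd
rrdtool info /var/www/html/cacti/rra/switch01_traffic_in_123.rrd | head -n 20

# Then dump the poller items for that data source id
mysql -u cactiuser -p cacti -e "SELECT * FROM poller_item WHERE local_data_id = 123\G"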

danielnavcom commented 2 years ago

Here is an image of a graph with an interruption and its details: https://imgur.com/a/tLjT1Vc

Here is the rrdtool info output: https://pastebin.com/Qkc5hhze

Here is the SQL output: https://pastebin.com/PRT0hGYK

TheWitness commented 2 years ago

Okay, so two things:

1) Update lib/poller.php from the 1.2.x branch (one way to do that is sketched below).
2) Check your standard error log for segmentation faults. We've been dealing with some segmentation faults associated with spine and SNMP v3. Let me know if you find them.
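
For item 1, a sketch of one way to grab just that file (install path assumed; keep a backup of the original):

cd /var/www/html/cacti
cp lib/poller.php lib/poller.php.bak
wget -O lib/poller.php https://raw.githubusercontent.com/Cacti/cacti/1.2.x/lib/poller.php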

danielnavcom commented 2 years ago

Updated lib/poller.php

There are absolutely zero segfaults or any other errors in the SQL, Apache/Nginx, or Cacti logs...

TheWitness commented 2 years ago

Good on the segfault side. It could be Net-SNMP API related (though that API has been around for a long time) or agent related. Anyway, that specific data source should never have gapped unless there was a timeout, and by default we no longer issue warnings for timeouts unless you specifically enable that.

One way to work around periodic timeouts would be to increase the heartbeat on your data source profile to something like 240 and then write a script to update all your RRD files.

danielnavcom commented 2 years ago

How can I change the heartbeat on the data source profile? It is currently set to 2 minutes in production.

The problem still exists right now.

TheWitness commented 2 years ago

This is something simple enough:

cd /var/www/html/cacti/rra
for file in `find . -name \*.rrd`;do
   rrdtool tune $file --heartbeat=240
done

It's not that complicated.

TheWitness commented 2 years ago

Oh, wait, there is more to it ;)

TheWitness commented 2 years ago

cd /var/www/html/cacti/rra
for file in `find . -name \*.rrd`;do
   data_sources=`rrdtool info $file | grep "ds\[" | sed -e 's/ds\[//g' | awk -F "]" '{print $1}' | uniq`
   for ds in $data_sources;do
      rrdtool tune $file --heartbeat $ds:240
   done
done

TheWitness commented 2 years ago

Pretty sure that'll do it.

danielnavcom commented 2 years ago

Set it. Any clue how to set it on the Presets -> Data Profiles as well, so it's the default when we add new graphs?

TheWitness commented 2 years ago

Yes

danielnavcom commented 2 years ago

How?

TheWitness commented 2 years ago

I thought this column was read/write from the GUI, but I guess not. It's easier to do with a database update until we can update the GUI.
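
A sketch of what that database update could look like (table and column names assumed from the Cacti 1.2.x schema; the profile id 1 is only an example, so look yours up first):

# List the profiles, then bump the heartbeat on the one you use
mysql -u cactiuser -p cacti -e "SELECT id, name, step, heartbeat FROM data_source_profiles;"
mysql -u cactiuser -p cacti -e "UPDATE data_source_profiles SET heartbeat = 240 WHERE id = 1;"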

danielnavcom commented 2 years ago

Did it from the database. I believe there are no more interruptions right now; I'll keep checking over the next few days. But why would a lower heartbeat cause this? It was fine until these versions.

TheWitness commented 2 years ago

It increases the time permitted between samples before RRDtool records a gap. So basically, if your device times out once in a while, your graphs will remain smooth.
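
You can confirm the new value took on any one file with something like this (filename hypothetical):

# minimal_heartbeat should now read 240 for each data source in the file
rrdtool info /var/www/html/cacti/rra/switch01_traffic_in_123.rrd | grep minimal_heartbeat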

TheWitness commented 2 years ago

Closing this now.