Closed danielnavcom closed 2 years ago
If you feel that you need to re-import the templates, this can be done at any time. The packages are included with Cacti and can be downloaded to the local machine before importing them into Cacti.
Upgraded from git to latest: Version 1.3.0 - Dev 2298283f7 @ 2022-06-21 08:25
Will check and see if problem still shows.
Now, log is spammed with these messages:
21.06.2022 14:25:26 - CMDPHP = (cacti_version_compare(1.3.0.99.1655789148.2298283f7, 1.3.0.99.1593739325, <))
Also, I cannot see Tholds after the upgrade to this version. The plugin is installed, at the latest release, and active, but where are my thresholds?
Check to see if thold is disabled. How is your cacti.log?
Don't use 1.3.x/develop. It's not entirely stable right now.
I downgraded to 1.2.22, but it seems to break the thresholds/CDEFs... The thresholds are now reading in megabytes, not megabits. It's a mess here.
Restored from backup, re-ran the 1.2.22 installer, and imported the templates. Not sure if it changes anything; I still get this in the logs:
21.06.2022 15:59:06 - POLLER: Poller[Main Poller] PID[3292] WARNING: Data Template 'Interface - Unicast Packets' is impacted by lack of complete information
21.06.2022 15:59:06 - POLLER: Poller[Main Poller] PID[3292] WARNING: There are 1 Data Sources not returning all data leaving rows in the poller output table. Details to follow.
I will watch the graphs..
You have to be more specific. Need log entries from the Cacti log. If thold is getting consistently disabled, there is a thold patch for that in the develop branch of thold.
Importing packages or templates?
Filter on "SYSTEM STATS:" in your Cacti log and screen capture an image.
I specified in the first post how we run.
spine
During the installation I checked the archived files to import
We log detailed everything.
Increase your device timeouts. What is your Max OIDs set to?
With that many devices, you only really need 1 process and say 10 threads too.
Problem still shows, no errors in logs, all green, still have such interruptions.
I have now set 1 process and 12 threads.
Maximum OIDs Per Get Request - 5
Bulk Walk Maximum Repetitions - auto detect on re-index
SNMP Timeout - 5000
All devices are added with snmp v3.
Just out of curiosity, is SELinux enabled or enforcing?
SELinux status: disabled
Okay, it looks like the poller interval is 1 minute. Verify that with the settings in Console > Configuration > Settings > Poller. Then do the following two things.
rrdtool info somebrokenrrdfile.rrd
Gather the data source id off that RRDfile and then:
SELECT * FROM poller_item WHERE local_data_id = ?\G
Replace ? with the data source ID, then post the output.
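For anyone following along, a minimal sketch of pulling the ID out of an RRD file name. It assumes the common Cacti naming convention in which the file name ends in the data source ID (for example `sw1_traffic_in_345.rrd`, or `345.rrd` with structured paths). The function name `rrd_data_source_id` and the example path are hypothetical; check the result against the data source list in the console before relying on it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the trailing numeric part of an RRD file name.
# Assumes names look like <host>_<ds-name>_<id>.rrd or <id>.rrd.
rrd_data_source_id() {
  local name
  name=$(basename "$1" .rrd)   # strip directory and .rrd suffix
  echo "${name##*_}"           # keep what follows the last underscore
}

rrd_data_source_id /var/www/html/cacti/rra/sw1_traffic_in_345.rrd
```

The printed value is what would replace ? in the query above.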
Here is the image of a graph with interruption and its details: https://imgur.com/a/tLjT1Vc
Here is rrdinfo https://pastebin.com/Qkc5hhze
Here is sql: https://pastebin.com/PRT0hGYK
Okay, so two things:
1) Update lib/poller.php from the 1.2.x branch.
2) Check your standard error log for Segmentation faults. We've been dealing with some segmentation faults associated with spine and snmpv3. Let me know if you find them.
Updated lib/poller.php
There are absolutely 0 segfaults or any other error concerning sql, apache/nginx, cacti logs...
Good on the segfault side. It could be Net-SNMP API related, though it's been around for a long time, or agent related. Anyway, that specific data source should not have gapped ever unless there was a timeout. So, by default we are no longer issuing warnings for timeouts unless you specifically enable that.
One way to work around periodic timeouts would be to increase the heartbeat on your data source profile to something like 240, and then write a script to update all your RRDfiles.
How can I change the heartbeat on the data source profile, which is currently set to 2 minutes in production?
Problem still exists right now.
This is something simple enough:
cd /var/www/html/cacti/rra
for file in `find . -name \*.rrd`;do
rrdtool tune $file --heartbeat=240
done
It's not that complicated.
Oh, wait, there is more to it ;)
cd /var/www/html/cacti/rra
for file in `find . -name \*.rrd`;do
data_sources=`rrdtool info $file | grep "ds\[" | sed -e 's/ds\[//g' | awk -F "]" '{print $1}' | uniq`
for ds in $data_sources;do
rrdtool tune $file --heartbeat $ds:240
done
done
Pretty sure that'll do it.
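To confirm the tune took effect, the heartbeat of each data source can be read back from `rrdtool info`, which prints lines of the form `ds[name].minimal_heartbeat = N`. A minimal sketch; the function name `list_heartbeats` is hypothetical, and it only parses text on stdin, so it can be tried on saved output first.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `rrdtool info` output on stdin and print
# "<data-source-name> <heartbeat>" for every data source found.
list_heartbeats() {
  sed -n 's/^ds\[\(.*\)]\.minimal_heartbeat = \([0-9]*\)$/\1 \2/p'
}

# Usage (file name is a placeholder):
#   rrdtool info somebrokenrrdfile.rrd | list_heartbeats
```

Every data source in a tuned file should then report 240.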
Set it. Any clue how to set it in Presets -> Data profile as well, so that it becomes the default when we add new graphs?
Yes
How?
I thought this column was read write from the GUI, but I guess not. Easier to do with a database update till we can update the GUI.
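The thread does not show the statement that was used. As a sketch only: this assumes the profile lives in a table named `data_source_profiles` with a `heartbeat` column, which should be verified with `DESCRIBE data_source_profiles;` on your own schema first. The function name `heartbeat_update_sql` and the profile ID are hypothetical. The helper only prints the SQL so it can be reviewed before anything touches the database.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print (not run) an UPDATE for one data source profile.
# Table and column names are assumptions; verify them before use.
heartbeat_update_sql() {
  local profile_id=$1 heartbeat=$2
  printf 'UPDATE data_source_profiles SET heartbeat = %d WHERE id = %d;\n' \
    "$heartbeat" "$profile_id"
}

heartbeat_update_sql 2 240
```

After reviewing the printed statement and taking a backup, it can be pasted into the MySQL client connected to the Cacti database.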
Done it from the database. I believe that there are no more interruptions right now; will still check over the next few days. But why would a lower heartbeat cause this? It was fine until these versions.
It increases the time allowed between samples before RRDtool records a gap. So basically, if your device times out once in a while, your graphs will remain smooth.
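The arithmetic behind that can be sketched as follows, assuming updates land exactly one step apart with no jitter. The function name `max_missed_polls` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: how many consecutive polls can be missed before the
# time between two successful updates exceeds the heartbeat.
# Missing k polls leaves (k + 1) * step seconds between updates.
max_missed_polls() {
  local step=$1 heartbeat=$2
  local n=$(( heartbeat / step - 1 ))
  (( n < 0 )) && n=0
  echo "$n"
}

max_missed_polls 60 120   # 1-minute polling, 2-minute heartbeat
max_missed_polls 60 240   # 1-minute polling, 4-minute heartbeat
```

With a 60 second step this prints 1 and then 3. At a heartbeat of 120, a single missed poll already puts the next update right at the limit, so any delay in that update can turn into a gap; at 240 there is room for a couple of timeouts in a row.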
Closing this now.
Constant broken graphs - 1 minute polling
We have only Cisco devices added, not many, and create graphs for interface traffic, unicast, errors, and non-unicast. The server has SSDs, Intel Xeon E5 CPUs, and 128GB of RAM. 1 spine collector - 1 process, 32 threads. We did change processes and threads, but nothing seems to fix it, so it's not from the config, I guess.
There is nothing wrong in the logs (we log detailed) in that timeframe (when the white interruptions show in the graphs). - see added images https://imgur.com/a/LS3Oxg4
We do see something weird in the logs from time to time (rarely) - see added images https://imgur.com/a/C2mQqfZ ^ if we create errors/discards interface graphs - the above messages appear more often complaining about them.
We even removed graphs and data sources, re-indexed, re-created them, problem still shows. We use snmp v3 on all devices.
We are currently running cacti and spine 1.2.21
Booster is enabled and seems to work fine.
We have no idea what changed in 1.2.x that causes this issue.
Another issue is that we cannot even create graphs with non-unicast packets. They all fail in the logs; it seems that it is using invalid OIDs. - see image https://imgur.com/a/07FA6JF
I believe this is due to templates not updating during update, or something like that - I saw some posts around about it.
Do you guys have any idea what could cause these issues and how to fix them?
Thanks in advance.