Cacti / cacti
http://www.cacti.net
GNU General Public License v2.0

cacti-1.2.19 spine on collector stops polling items on a given device after graph template changes #4815

Closed: nicolatron closed this issue 2 years ago

nicolatron commented 2 years ago

@bmfmancini Larry asked me to open an issue after discussion on this forum post: https://forums.cacti.net/viewtopic.php?t=62191

I changed graph templates for a number of graphs. After the change, Spine doesn't poll items when running on the collector; if I move the devices to the main Cacti host, polling works. When polling fails (running from the collector) I see this error in the Cacti logs:

2022-06-09 10:48:04 - CMDPHP ERROR: A DB Exec Failed!, Error: Out of range value for column 'snmp_sysUpTimeInstance' at row 1

To Reproduce

Have devices handled by a collector (not the main host). Make modifications to a graph template, for example changing the data source maximum values for traffic_in or traffic_out. Then, for all (or presumably just some) of those graphs, change the template to the modified one. Spine polling for those devices stops working.

System: Linux SLES 15 SP2, cacti-1.2.19, cacti-spine 1.2.19, mysql Ver 15.1 Distrib 10.4.22-MariaDB, RRDtool 1.7.0, NET-SNMP 5.7.3, PHP 7.4.6. I'm using boost, with one main Cacti server and 2 collectors (almost all devices run on one of the collectors).

More detail and screenshots on the forum post. Thanks!

TheWitness commented 2 years ago

Please post your SQL_MODE from the server.cnf. This is our requirement. Nothing more, nothing less.

# important for compatibility
sql_mode=NO_ENGINE_SUBSTITUTION,NO_AUTO_CREATE_USER
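To confirm what the running server actually uses (a quick check, plain MariaDB and nothing Cacti-specific):

SELECT @@GLOBAL.sql_mode;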
TheWitness commented 2 years ago

BTW, I strongly encourage you to get to Cacti 1.2.21. There were real quality issues from 1.2.16 through 1.2.20.

nicolatron commented 2 years ago

I will migrate to 1.2.21 on Monday as soon as I'm back at work. I tried the upgrade before and had to roll back due to problems, but I did another test upgrade on the collector I pulled out to serve as a test main, and it looks more or less good; better than the actual main on 1.2.19 anyway.

This is just for the record. As said, I'll upgrade to the latest stable version (1.2.21) and report if I encounter problems there, and forget about 1.2.19 and its problems, since any effort spent nailing down bugs there may be a waste for everyone when they may already be fixed in 1.2.21. I can waste my own time as I see fit, but wasting your time would be a shame.

My main production server on 1.2.19 is now reporting "Poller boost cache not empty" errors. I also saw entries in the MySQL log pointing to insufficient open files and maximum packet size; I'd swear those resources were there before, so maybe the system upgrade I did recently overwrote some config files.

Resolved with: /etc/systemd/system/mysql.service

# Increase NOFILE limit to a value big enough for mariadb.
# The default (1024) is too small and will cause a warning and crashes.
#LimitNOFILE=32184
LimitNOFILE=100000
LimitMEMLOCK=100000

/etc/my.cnf

max_allowed_packet=300M
open_files_limit=100000
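Note that a systemd override like the one above only takes effect after reloading the unit files and restarting the service:

systemctl daemon-reload
systemctl restart mysql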

I did some more tests. I reviewed Sean's YouTube video about installing remote pollers, and deleted the problematic remote poller entirely to start it over as a new remote and see whether it would retain the problem of actually polling devices. I did it all, but on the last part of the "Cacti Server v1.2.19 - Installation Wizard" it seems to hang forever. On the main server the collector looks OK, and I can even transfer devices, but launching the polling from Spine results in no polling again. I can't tell whether the "Out of range value" error is still there; all I see in the log is:

06/11/2022 08:50:01 - POLLER: Poller[7] PID[3438] Poller is upgrading from pre Cacti 1.0, exiting till upgraded.

And if I try to enter the GUI, the installation wizard kicks in.

As for my Weathermap, it seems the TARGET directive is not working for traffic-based RRDs (which go through boost, as I use the "SET rrd_use_poller_output 1" directive in my Weathermap .conf). I can't see the colored lines or current traffic, although the overlib graph part looks OK. This is the only problem that has carried over to my 1.2.21 test installation; one of my fears was that the Weathermap plugin might not work on 1.2.21 at all, so I'll call it a day :).
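For context, the relevant bits of a map config look roughly like this (the link and node names and the RRD path are illustrative; only the SET line is the one quoted above):

SET rrd_use_poller_output 1

LINK core-to-edge
    NODES core edge
    TARGET /opt/cacti/rra/urano_traffic_in_1427.rrd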

It seems rude to close an issue so fast, but in a few days I'll close it so it doesn't distract as an open issue.

Thanks a lot!

TheWitness commented 2 years ago

Weathermap may need some love. I don't use it and Howie moved on. I think Thomas has a good fork.

netniV commented 2 years ago

Yes, @thurban did take over supporting weathermaps under 1.2.x since his clients were using it too I think. Alas, Howie has moved on to other pastures now so his focus is elsewhere, but he will always be appreciated.

nicolatron commented 2 years ago

Got an interesting error when preparing things for the 1.2.19 to 1.2.21 upgrade:

mysqldump: Error 2020: Got packet bigger than 'max_allowed_packet' when dumping table poller_resource_cache at row: 4434

My max_allowed_packet was set to 500M, but I didn't have a [mysqldump] section in my.cnf, so maybe that setting was not applying. Anyway, looking at row 4434 in poller_resource_cache, it referred to a weird file: plugins/weathermap/configs/.~lock.wan.xlsx# That's a lock file of an Excel spreadsheet I use to generate weathermaps. I made a Python script that uses the Excel data to generate Weathermap's .conf files, and I keep both the spreadsheet and the script in the weathermap/configs directory. That directory is also shared via sshfs so I can mount it from my desktop and modify the spreadsheet from there (opening it with LibreOffice). If the files in that directory end up somehow in the database (something I was not aware of), maybe I should keep that config dir cleaner and move the spreadsheet and scripts to a separate directory; this might explain some of the apparently uncommon nuances I've been having with both Cacti and Weathermap.

I just noticed I also have a Cacti database backup stored in the weathermap/configs directory, and it also has an entry in poller_resource_cache. I just checked, and after deleting "plugins/weathermap/configs/.~lock.wan.xlsx#" my poller_resource_cache is 269.86MB in size. I will clean some more and make sure the weathermap/configs directory only contains its .conf files.

Edit: After cleanup and scheduling a "Rebuild Resource Cache", the size of the poller_resource_cache table went down to 113.86MB.
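For what it's worth, the packet-size setting for the dump itself appears to need its own [mysqldump] section, since the [mysqld] value doesn't apply to the client; something like:

[mysqldump]
max_allowed_packet=500M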

netniV commented 2 years ago

Cacti will attempt to sync any folders within a plugin's directory unless they are marked in the INFO file for exclusion. For example, the following is from reportit:

nosync = exports, archive, cache, tmp, *.html, *.conf

Since most plugins store their configuration within the database, or at least used to because there was no need for localised versions, the default is to take all files when that directive is absent.

nicolatron commented 2 years ago

INFO file from @thurban weathermap repository (both master and develop branches) has this:

[info]
[...]
nosync = true
[...]
thurban commented 2 years ago

Which means that only a file called "true" is not synced (or stored into the DB).

That needs to be changed, as I don't think it makes sense to sync any Weathermap files to remote pollers, given that the RRD files live mainly on the main server.
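Something along these lines, mirroring the ReportIt example above, would probably do (the exact exclusion list is a guess and would need checking):

[info]
[...]
nosync = configs, output, *.conf
[...]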

TheWitness commented 2 years ago

Don't create tgz's in the Cacti base directory, create them elsewhere.

nicolatron commented 2 years ago
sql_mode=NO_ENGINE_SUBSTITUTION,NO_AUTO_CREATE_USER

Oh my, I forgot about this, sorry. I had this:

sql_mode=NO_ENGINE_SUBSTITUTION,STRICT_TRANS_TABLES

Now I have your suggested setting.

Entropy keeps following my steps, or staying one step ahead.

I did a test yesterday changing that "nosync = true" to "nosync = *" and my poller_resource_cache table got emptied. I changed it back to "nosync = true" and hit the "Rebuild Resource Cache" link, but the table remains empty after a couple of hours and polling cycles have passed. I'm not sure it wasn't already empty before, as I didn't check, and my system is anything but stable. Things seem to be mostly working anyway (without data in that table).
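For the record, checking it is just a plain row count (standard SQL, nothing Cacti-specific):

SELECT COUNT(*) FROM poller_resource_cache;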

The poller cache looks OK, and I see data there for all of the offending graphs/data sources that I've checked.

I had first disabled and then deleted the other collector I had, as its presence impacted performance negatively (even while disabled), and as of now I don't have a topology or RRD volume that justifies the need for collectors. This performance problem came with 1.2.19 and didn't go away.

I've disabled all my weathermaps but one so as not to pollute the log file, as most of them are far from warning-free.

As of now I'm getting the errors below in the log. I've been making all kinds of changes to the graph template, associated data sources and SNMP query, to no avail, including adding all kinds of name and title suggestions under Data Collection -> Data Queries -> SNMP - Interface Statistics -> Associated Graph Templates [0 Interface - Traffic (bits/sec)] (1043 graphs using it).

2022-06-14 06:30:02 - POLLER: Poller[1] PID[25718] WARNING: Poller Output Table not Empty.  Issues: 584, DS[768, 754, 754, 753, 753, 752, 752, 751, 751, 750, 750, 749, 749, 748, 748, 747, 747, 746, 746, 745], Additional Issues Remain.  Only showing first 20
2022-06-14 06:30:02 - SYSTEM WARNING: Primary Admin account notifications disabled!  Unable to send administrative Email.
2022-06-14 06:30:30 - POLLER: Poller[1] PID[25718] WARNING: There are 293 Data Sources not returning all data leaving rows in the poller output table.  Details to follow.
2022-06-14 06:30:30 - POLLER: Poller[1] PID[25718] WARNING: Data Template '0 Interface - Traffic' is impacted by lack of complete information
2022-06-14 06:30:30 - SYSTEM STATS: Time:27.9067 Method:spine Processes:1 Threads:5 Hosts:230 HostsPerProcess:230 DataSources:3860 RRDsProcessed:0
2022-06-14 06:30:30 - WEATHERMAP Weathermap 0.98a starting - Normal logging mode. Turn on DEBUG in Cacti for more information
2022-06-14 06:30:30 - WEATHERMAP [Map 9] core.conf: Map: /opt/cacti/plugins/weathermap/configs/core.conf -> /opt/cacti/plugins/weathermap/output/7564965613fbee89c14e.html & /opt/cacti/plugins/weathermap/output/7564965613fbee89c14e.png
2022-06-14 06:30:30 - WEATHERMAP [Map 9] core.conf: About to write image file. If this is the last message in your log, increase memory_limit in php.ini [WMPOLL01]
2022-06-14 06:30:30 - SYSTEM DSSTATS STATS: Time:0.19 Type:HOURLY
2022-06-14 06:30:30 - WEATHERMAP [Map 9] core.conf: Wrote map to /opt/cacti/plugins/weathermap/output/7564965613fbee89c14e.png and /opt/cacti/plugins/weathermap/output/7564965613fbee89c14e.thumb.png
2022-06-14 06:30:30 - WEATHERMAP STATS: Weathermap 0.98a run complete - Tue, 14 Jun 22 06:30:30 +0200: 1 maps were run in 0 seconds with 0 warnings.

The number of graphs/data sources with issues is not stable; in the last days' logs it has been 584, 56, 528, 640... and yesterday at some point it was 2052, which I think was almost all of the (1043) graphs using my "Interface - Traffic (bits/sec)" template.
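A query like the following (assuming the standard poller_output layout) shows which data sources are leaving rows behind:

SELECT local_data_id, COUNT(*) AS rows_left
FROM poller_output
GROUP BY local_data_id
ORDER BY rows_left DESC
LIMIT 20;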

It always comes down to my dreaded graph template, so I include it with its dependencies. Never ever use it in production if you value your system; it has the devil's seed in it.

cacti_graph_template_0interface-_traffic_bitssec.xml.txt

TheWitness commented 2 years ago

What changes did you make to the Graph Template?

nicolatron commented 2 years ago

In the graph template itself, not much.

I removed the items I didn't want, such as traffic % or sum-all totals, and left only simple items (area, GPRINT last/average/max) for both the input and output data sources; on output I applied a CDEF to make it negative. I think I also changed the data source names. There is also a comment "COMMENT: Description: |query_ifAlias|", but that was there already.

Then I got the errors about the RRD having a different output max.

So I went to Templates -> Data Source and changed "Interface - Traffic" to force a Maximum Value of 100000000 for both traffic_in and traffic_out (I just checked and that is no longer the case; now the user is allowed to override the value on graph creation, so maybe that changed during the upgrade process).

At some point I also toggled the checkboxes that allow users to change Index Type, Index Value and Output Type ID on and off.

Under Data Collection -> "SNMP - Interface Statistics" I made some changes too, going there and then to Associated Graph Templates "0 Interface - Traffic (bits/sec) 64-bit counters".

I don't think I changed the data sources there. It points to: [screenshot]

Those data sources with that particular name I think I imported from the forums; it's a very nice collection of interface counters, and I used them to build a graph of error percentages on interfaces (sum unicast+broadcast+multicast packets and divide by the sum of errors+discards).

On interface.xml it looks like this:

        <ifHCInOctets>
            <name>Bytes In - 64-bit Counters</name>
            <method>walk</method>
            <source>value</source>
            <direction>output</direction>
            <oid>.1.3.6.1.2.1.31.1.1.1.6</oid>
        </ifHCInOctets>
        <ifHCOutOctets>
            <name>Bytes Out - 64-bit Counters</name>
            <method>walk</method>
            <source>value</source>
            <direction>output</direction>
            <oid>.1.3.6.1.2.1.31.1.1.1.10</oid>
        </ifHCOutOctets>

I also set a large number of suggestions for both title and name, fearing there was no option available.

[screenshot]

That's what I can remember, but I may well be forgetting something; sorry I can't be more precise :-(

TheWitness commented 2 years ago

Here was the possible fatal flaw:

At some point also changed the checkbox to allow users to change Index Type, Index Value and Output Type ID on and off.

In Cacti 1.2.21 you can no longer do that. For the broken data sources, if you edit them, are those fields still populated?

We incorporated a repair script into the upgrade process from 1.2.x (I don't remember which) to 1.2.20. You might want to force a rerun of that upgrade. If you are not at 1.2.21, get there.
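A rough sketch of forcing it from the CLI (the exact flag may differ between versions, so check the script's help output first); alternatively, setting the cacti column in the version table back to 1.2.20 makes the installer run those steps again:

cd /opt/cacti
php -q cli/upgrade_database.php --forcever=1.2.20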

nicolatron commented 2 years ago

It is still enabled if you mean this...

[screenshot]

I'm at 1.2.21 now; how can I rerun the upgrade?

TheWitness commented 2 years ago

Go to the broken Data Sources and show one of them instead.

TheWitness commented 2 years ago

This might be related to another issue. Let me know, for one of the borked Data Sources, how many polling items that device has.

nicolatron commented 2 years ago

I hope you mean this: [screenshots]

nicolatron commented 2 years ago

That device has 82 graphs and 85 data sources. It has 146 entries in the poller cache (View Poller Cache) and 1118 entries in the data query cache (View Data Query Cache).

TheWitness commented 2 years ago

That's weird. I may just be getting senile. Show the Graph page too.

TheWitness commented 2 years ago

The one Data Source you showed is definitely broken. See a "working" Interface Data Source below.

[screenshot]

Notice the "Custom Data" Section contains those three very important elements that are required for proper functioning of Cacti. How recover at this point would require you dumping your database and sending to developers at cacti dot net.

nicolatron commented 2 years ago

Template Graph:

[screenshots]

And one graph from that device with the offending data source:

[screenshot]

RRDtool Command:

/usr/bin/rrdtool graph - \
--imgformat=PNG \
--start='-86400' \
--end='-300' \
--pango-markup  \
--title='URANO - Traffic - loopback.1' \
--vertical-label='bits per second' \
--slope-mode \
--base=1000 \
--height=120 \
--width=500 \
--alt-autoscale \
COMMENT:"From 2022-06-13 15\:47\:55 To 2022-06-14 15\:42\:55\c" \
COMMENT:"  \n" \
--color BACK#F3F3F3 \
--color CANVAS#FDFDFD \
--color SHADEA#CBCBCB \
--color SHADEB#999999 \
--color FONT#000000 \
--color AXIS#2C4D43 \
--color ARROW#2C4D43 \
--color FRAME#2C4D43 \
--border $rrdborder --font TITLE:11:'Arial' \
--font AXIS:8:'Arial' \
--font LEGEND:8:'Courier' \
--font UNIT:8:'Arial' \
--font WATERMARK:6:'Arial' \
--slope-mode \
--watermark 'Generated by Cacti®' \
DEF:a='/opt/cacti/rra/urano_traffic_in_1427.rrd':'traffic_in':AVERAGE \
DEF:b='/opt/cacti/rra/urano_traffic_in_1427.rrd':'traffic_out':AVERAGE \
CDEF:cdefb='a,8,*' \
CDEF:cdeff='b,-8,*' \
CDEF:cdefg='b,8,*' \
COMMENT:'Description\: |query_ifAlias|\n'  \
AREA:cdefb#00FF00FF:'I'  \
GPRINT:cdefb:LAST:'Cur\:%8.2lf %s'  \
GPRINT:cdefb:AVERAGE:'Avg\:%8.2lf %s'  \
GPRINT:cdefb:MAX:'Max\:%8.2lf %s\n'  \
AREA:cdeff#0000FFFF:'O'  \
GPRINT:cdefg:LAST:'Cur\:%8.2lf %s'  \
GPRINT:cdefg:AVERAGE:'Avg\:%8.2lf %s'  \
GPRINT:cdefg:MAX:'Max\:%8.2lf %s\n' 

RRDtool command length = 1220 characters.

RRDtool Says:

OK
TheWitness commented 2 years ago

Check out my note above. Since it benefits all of the community, I'll do a freebie and review your dump.

nicolatron commented 2 years ago

Thanks Larry, I sent you a mail about that.

Some more information...

When I did the upgrade from 1.2.19, it hung at the end of the upgrade process. There was no apparent error in cacti.log, so after waiting a few minutes (not many) I just restarted Apache (clever me), and then Cacti was (apparently) ready to go.

Probably that clever move skipped some of the upgrade scripts and led to the subsequent problems.

Also, when I did the upgrade my database was not configured as suggested. I had sql_mode=NO_ENGINE_SUBSTITUTION,STRICT_TRANS_TABLES instead of the recommended sql_mode=NO_ENGINE_SUBSTITUTION,NO_AUTO_CREATE_USER, and I hadn't followed the other advice on Cacti's GitHub homepage either (converting table engines to InnoDB and so on). I guess any of that could have caused the current problems.
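Checking which tables are still on another engine is easy from information_schema (assuming the database is named cacti):

SELECT TABLE_NAME, ENGINE
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'cacti' AND ENGINE <> 'InnoDB';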

TheWitness commented 2 years ago

We are good. Hit the reset button on 1.2.21. So far there is only one medium-severity issue there, and @netniV has fixed it already.