Cacti / cacti

Cacti ™
http://www.cacti.net
GNU General Public License v2.0
1.61k stars 399 forks source link

Recache Event Loop can cause Interface Graphs to show gaps #3693

Closed DatBoiTroy closed 3 years ago

DatBoiTroy commented 4 years ago

Describe the bug

We have a handful of devices whose interface graphs are gapping for one, 5-min polling cycle every 2 hours. What we have been able to determine is that the gapping starts for a device after it physically goes down then restores, so it seems to be tied to the uptime. Every two hours, all of the affected devices log that a Recache Assert Failed on all of their data queries then a NaN is written to the RRD files, even though debugging shows actual values are returned by the device.

This issue is affecting interface graphs 95% of the time, but gaps show up in other graphs (uptime, latency, etc.) every once in a great while when recache events happen back-to-back. There are roughly 15 devices doing this at the moment in our production instance and they all show Recache Assert Failed events in the remote poller logs at the same time. I have added two of these devices to our lab setup and there is no gapping or recache events every 2 hours until after a device goes physically down then restores.

This problem started after we upgraded to the one of the develop commits of Cacti 1.2.12 from 1.2.9. There are no errors logged anywhere that I have looked when the gapping occurs.

What seems to be happening is that Cacti is using an uptime assert value that is multiple times larger than the uptime reported in the last cycle when these recache events occur. After that, the assert value is correct until 2 hours later. Here are some of the logs showing the recache events in our lab:

2020/07/13 22:40:02 - SPINE: Poller[3] PID[22976] Device[8] HT[2] DQ[4] RECACHE ASSERT FAILED: '47934056<24174325'
2020/07/13 22:40:02 - SPINE: Poller[3] PID[22976] Device[8] HT[2] DQ[10] RECACHE ASSERT FAILED: '47934098<24174325'
2020/07/13 22:40:02 - SPINE: Poller[3] PID[22976] Device[8] HT[2] DQ[11] RECACHE ASSERT FAILED: '47934139<24174325'
2020/07/13 22:40:02 - SPINE: Poller[3] PID[22976] Device[8] HT[1] DQ[4] RECACHE ASSERT FAILED: '47934056<24174331'
2020/07/13 22:40:02 - SPINE: Poller[3] PID[22976] Device[8] HT[1] DQ[10] RECACHE ASSERT FAILED: '47934098<24174331'
2020/07/13 22:40:02 - SPINE: Poller[3] PID[22976] Device[8] HT[1] DQ[11] RECACHE ASSERT FAILED: '47934139<24174331'
2020/07/14 00:40:02 - SPINE: Poller[3] PID[26782] Device[8] HT[2] DQ[4] RECACHE ASSERT FAILED: '47934056<24894334'
2020/07/14 00:40:02 - SPINE: Poller[3] PID[26782] Device[8] HT[2] DQ[10] RECACHE ASSERT FAILED: '47934098<24894334'
2020/07/14 00:40:02 - SPINE: Poller[3] PID[26782] Device[8] HT[2] DQ[11] RECACHE ASSERT FAILED: '47934139<24894334'
2020/07/14 00:40:02 - SPINE: Poller[3] PID[26782] Device[8] HT[1] DQ[4] RECACHE ASSERT FAILED: '47934056<24894342'
2020/07/14 00:40:02 - SPINE: Poller[3] PID[26782] Device[8] HT[1] DQ[10] RECACHE ASSERT FAILED: '47934098<24894342'
2020/07/14 00:40:02 - SPINE: Poller[3] PID[26782] Device[8] HT[1] DQ[11] RECACHE ASSERT FAILED: '47934139<24894342'
2020/07/14 02:45:02 - SPINE: Poller[3] PID[30728] Device[8] HT[2] DQ[4] RECACHE ASSERT FAILED: '47934056<25644304'
2020/07/14 02:45:02 - SPINE: Poller[3] PID[30728] Device[8] HT[2] DQ[10] RECACHE ASSERT FAILED: '47934098<25644304'
2020/07/14 02:45:02 - SPINE: Poller[3] PID[30728] Device[8] HT[2] DQ[11] RECACHE ASSERT FAILED: '47934139<25644304'
2020/07/14 02:45:02 - SPINE: Poller[3] PID[30728] Device[8] HT[1] DQ[4] RECACHE ASSERT FAILED: '47934056<25644311'
2020/07/14 02:45:02 - SPINE: Poller[3] PID[30728] Device[8] HT[1] DQ[10] RECACHE ASSERT FAILED: '47934098<25644311'
2020/07/14 02:45:02 - SPINE: Poller[3] PID[30728] Device[8] HT[1] DQ[11] RECACHE ASSERT FAILED: '47934139<25644311'
2020/07/14 04:45:02 - SPINE: Poller[3] PID[34573] Device[8] HT[2] DQ[4] RECACHE ASSERT FAILED: '47934056<26364310'
2020/07/14 04:45:02 - SPINE: Poller[3] PID[34573] Device[8] HT[2] DQ[10] RECACHE ASSERT FAILED: '47934098<26364310'
2020/07/14 04:45:02 - SPINE: Poller[3] PID[34573] Device[8] HT[2] DQ[11] RECACHE ASSERT FAILED: '47934139<26364310'
2020/07/14 04:45:02 - SPINE: Poller[3] PID[34573] Device[8] HT[1] DQ[4] RECACHE ASSERT FAILED: '47934056<26364316'
2020/07/14 04:45:02 - SPINE: Poller[3] PID[34573] Device[8] HT[1] DQ[10] RECACHE ASSERT FAILED: '47934098<26364316'
2020/07/14 04:45:02 - SPINE: Poller[3] PID[34573] Device[8] HT[1] DQ[11] RECACHE ASSERT FAILED: '47934139<26364316'
2020/07/14 06:50:03 - SPINE: Poller[3] PID[38712] Device[8] HT[2] DQ[4] RECACHE ASSERT FAILED: '47934056<27114382'
2020/07/14 06:50:03 - SPINE: Poller[3] PID[38712] Device[8] HT[2] DQ[10] RECACHE ASSERT FAILED: '47934098<27114382'
2020/07/14 06:50:03 - SPINE: Poller[3] PID[38712] Device[8] HT[2] DQ[11] RECACHE ASSERT FAILED: '47934139<27114382'
2020/07/14 06:50:03 - SPINE: Poller[3] PID[38712] Device[8] HT[1] DQ[4] RECACHE ASSERT FAILED: '47934056<27114388'
2020/07/14 06:50:03 - SPINE: Poller[3] PID[38712] Device[8] HT[1] DQ[10] RECACHE ASSERT FAILED: '47934098<27114388'
2020/07/14 06:50:03 - SPINE: Poller[3] PID[38712] Device[8] HT[1] DQ[11] RECACHE ASSERT FAILED: '47934139<27114388'
2020/07/14 08:55:02 - SPINE: Poller[3] PID[42637] Device[8] HT[2] DQ[4] RECACHE ASSERT FAILED: '47934056<27864336'
2020/07/14 08:55:02 - SPINE: Poller[3] PID[42637] Device[8] HT[2] DQ[10] RECACHE ASSERT FAILED: '47934098<27864336'
2020/07/14 08:55:02 - SPINE: Poller[3] PID[42637] Device[8] HT[2] DQ[11] RECACHE ASSERT FAILED: '47934139<27864336'
2020/07/14 08:55:02 - SPINE: Poller[3] PID[42637] Device[8] HT[1] DQ[4] RECACHE ASSERT FAILED: '47934056<27864344'
2020/07/14 08:55:02 - SPINE: Poller[3] PID[42637] Device[8] HT[1] DQ[10] RECACHE ASSERT FAILED: '47934098<27864344'
2020/07/14 08:55:02 - SPINE: Poller[3] PID[42637] Device[8] HT[1] DQ[11] RECACHE ASSERT FAILED: '47934139<27864344'
2020/07/14 10:55:02 - SPINE: Poller[3] PID[46503] Device[8] HT[2] DQ[4] RECACHE ASSERT FAILED: '47934056<28584300'
2020/07/14 10:55:02 - SPINE: Poller[3] PID[46503] Device[8] HT[2] DQ[10] RECACHE ASSERT FAILED: '47934098<28584300'
2020/07/14 10:55:02 - SPINE: Poller[3] PID[46503] Device[8] HT[2] DQ[11] RECACHE ASSERT FAILED: '47934139<28584300'
2020/07/14 10:55:02 - SPINE: Poller[3] PID[46503] Device[8] HT[1] DQ[4] RECACHE ASSERT FAILED: '47934056<28584306'
2020/07/14 10:55:02 - SPINE: Poller[3] PID[46503] Device[8] HT[1] DQ[10] RECACHE ASSERT FAILED: '47934098<28584306'
2020/07/14 10:55:02 - SPINE: Poller[3] PID[46503] Device[8] HT[1] DQ[11] RECACHE ASSERT FAILED: '47934139<28584306'

To Reproduce

Steps to reproduce the behaviour:

  1. Add a test device to Cacti, enable debug, then create a few interface statistics graphs.

  2. Physically power off the device for a short time then turn it back on.

  3. Let Cacti run for a few hours after the device has restored.

  4. If gapping starts, check the remote poller's log. There will likely be a Recache Assert Failed event at the times of the gaps.

Expected behavior

These devices use uptime for the re-index method for data queries. They should not recache every two hours and the recache should not cause a total loss of data for that polling cycle.

Screenshots

example-gapping-int-graphs

Server Information

Cacti version: 1.2.12 (Develop Commit 5181b79) Spine version: 1.2.12 Server OS: Ubuntu 18.04 LTS Apache version: 2.4.29 PHP version: 7.2.24-0ubuntu0.18.04.6 MariaDB version: 10.4.13-MariaDB-1:10.4.13+maria~bionic Number of Remote Pollers: 5

DatBoiTroy commented 4 years ago

For now, I may have found a temporary fix to stop the gapping once it starts. In Settings > Device Defaults, set the Re-index Method for Data Queries to None and save. Find the device under Management > Devices and have it Sync to Device Template. After a few polling cycles, the Re-index Method for Data Queries can be set back to Uptime and sync the device to the template again to make sure the data queries go back to using Uptime for recaching.

TheWitness commented 4 years ago

Check and see if there is a problem with the table. It should update the uptime value as a part of the spine process.

TheWitness commented 4 years ago

Also, grab the 1.2.13 spine source and try using that. This does sound familiar.

DatBoiTroy commented 4 years ago

What should I be looking for on the table? And I am setting up Spine 1.2.13 on our lab now. I'll let you know if there is any change!

DatBoiTroy commented 4 years ago

Unfortunately, the gapping and recache events are still happening using the release of Spine 1.2.13.

TheWitness commented 4 years ago

Run the repair database. What is the reindex method set to?

DatBoiTroy commented 4 years ago

The re-index method is Uptime. I ran the mysqlcheck command to repair the database and rebuilt the poller cache from the CLI. Monitoring to see if there is any change.

DatBoiTroy commented 4 years ago

I just checked this morning after letting things run overnight. No change in the gaps so far.

TheWitness commented 4 years ago

To me, this check does not even look valid:

47934056<27864336

Seems to me it's larger. So, I'm wondering when your reindex this data query, maybe the uptime the data query is saving is not the uptime of the SNMP agent, but the uptime of the host. They will be different. Even then, it looks to me that in fact:

47934056 > 27864336

So, something on your system is jacked up. I'm just speechless for the moment. It's almost like strtoll() is not working as expected.

DatBoiTroy commented 4 years ago

I understand how off this issue seems and I am at a loss as well. I use a Ansible playbook to manage both our production and lab instances and I was able to reproduce this once. I am going to do a fresh, manual lab install of 1.2.12 and see if I am able to reproduce this issue. Thanks for your assistance so far.

DatBoiTroy commented 3 years ago

Hi guys. Its been a while since I have responded on this issue, but this is still on-going and I am continuing to troubleshoot on my side. To refresh everyone, I only see these "Recache Assert Failed" entries roughly every two hours for devices whose SNMP agent was restarted or the device was rebooted. I have a fix that stops the loop temporarily, but it has to be done for each and every SNMP agent reboot.

I have included some debug logs for one of my devices that is showing this behavior. All I can say is that 95% of the the time, Cacti uses the uptime value returned from the previous polling cycle, specifically from host thread 1, to compare to the device's current uptime. However, where I see the "Recache Assert Failed" log entries, the device's uptime is being compared to the assert value stored in the poller_reindex MySQL table. The values stored in the poller_reindex table never seem to be updated either whether that is intentional or not.

2020/09/18 12:40:13 - SPINE: Poller[poller-02] PID[8260] Device[407] HT[1] DQ[Cisco EnvMon IOS - Memory Usage] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 233997071 < output: 234027164)
2020/09/18 12:40:13 - SPINE: Poller[poller-02] PID[8260] Device[407] HT[1] DQ[Cisco EnvMon IOS - Temperature] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 233997071 < output: 234027164)
2020/09/18 12:40:13 - SPINE: Poller[poller-02] PID[8260] Device[407] HT[1] DQ[Cisco EnvMon IOS - Power] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 233997071 < output: 234027164)
2020/09/18 12:40:13 - SPINE: Poller[poller-02] PID[8260] Device[407] HT[1] DQ[SNMP - Interface Statistics] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 233997071 < output: 234027164)
2020/09/18 12:40:13 - SPINE: Poller[poller-02] PID[8260] Device[407] HT[2] DQ[Cisco EnvMon IOS - Memory Usage] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 233997071 < output: 234027161)
2020/09/18 12:40:13 - SPINE: Poller[poller-02] PID[8260] Device[407] HT[2] DQ[Cisco EnvMon IOS - Temperature] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 233997071 < output: 234027161)
2020/09/18 12:40:13 - SPINE: Poller[poller-02] PID[8260] Device[407] HT[2] DQ[Cisco EnvMon IOS - Power] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 233997071 < output: 234027161)
2020/09/18 12:40:13 - SPINE: Poller[poller-02] PID[8260] Device[407] HT[2] DQ[SNMP - Interface Statistics] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 233997071 < output: 234027161)
============================================
2020/09/18 12:35:12 - SPINE: Poller[poller-02] PID[1984] Device[407] HT[1] DQ[Cisco EnvMon IOS - Memory Usage] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 233967063 < output: 233997071)
2020/09/18 12:35:12 - SPINE: Poller[poller-02] PID[1984] Device[407] HT[1] DQ[Cisco EnvMon IOS - Temperature] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 233967063 < output: 233997071)
2020/09/18 12:35:12 - SPINE: Poller[poller-02] PID[1984] Device[407] HT[1] DQ[Cisco EnvMon IOS - Power] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 233967063 < output: 233997071)
2020/09/18 12:35:12 - SPINE: Poller[poller-02] PID[1984] Device[407] HT[1] DQ[SNMP - Interface Statistics] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 233967063 < output: 233997071)
2020/09/18 12:35:12 - SPINE: Poller[poller-02] PID[1984] Device[407] HT[2] DQ[Cisco EnvMon IOS - Memory Usage] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 233967063 < output: 233997070)
2020/09/18 12:35:12 - SPINE: Poller[poller-02] PID[1984] Device[407] HT[2] DQ[Cisco EnvMon IOS - Temperature] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 233967063 < output: 233997070)
2020/09/18 12:35:12 - SPINE: Poller[poller-02] PID[1984] Device[407] HT[2] DQ[Cisco EnvMon IOS - Power] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 233967063 < output: 233997070)
2020/09/18 12:35:12 - SPINE: Poller[poller-02] PID[1984] Device[407] HT[2] DQ[SNMP - Interface Statistics] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 233967063 < output: 233997070)
============================================
2020/09/18 12:30:12 - SPINE: Poller[poller-02] PID[53272] Device[407] HT[1] NOTICE: Spike Kill in Effect for '10.10.68.16'
2020/09/18 12:30:12 - SPINE: Poller[poller-02] PID[53272] Device[407] HT[1] DQ[Cisco EnvMon IOS - Memory Usage] RECACHE ASSERT FAILED: '374881858<233967063'
2020/09/18 12:30:12 - SPINE: Poller[poller-02] PID[53272] Device[407] HT[1] DQ[Cisco EnvMon IOS - Memory Usage] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 374881858 < output: 233967063)
2020/09/18 12:30:12 - SPINE: Poller[poller-02] PID[53272] Device[407] HT[1] NOTICE: Spike Kill in Effect for '10.10.68.16'
2020/09/18 12:30:12 - SPINE: Poller[poller-02] PID[53272] Device[407] HT[1] DQ[Cisco EnvMon IOS - Temperature] RECACHE ASSERT FAILED: '374881817<233967063'
2020/09/18 12:30:12 - SPINE: Poller[poller-02] PID[53272] Device[407] HT[1] DQ[Cisco EnvMon IOS - Temperature] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 374881817 < output: 233967063)
2020/09/18 12:30:12 - SPINE: Poller[poller-02] PID[53272] Device[407] HT[1] NOTICE: Spike Kill in Effect for '10.10.68.16'
2020/09/18 12:30:12 - SPINE: Poller[poller-02] PID[53272] Device[407] HT[1] DQ[Cisco EnvMon IOS - Power] RECACHE ASSERT FAILED: '374881776<233967063'
2020/09/18 12:30:12 - SPINE: Poller[poller-02] PID[53272] Device[407] HT[1] DQ[Cisco EnvMon IOS - Power] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 374881776 < output: 233967063)
2020/09/18 12:30:12 - SPINE: Poller[poller-02] PID[53272] Device[407] HT[1] NOTICE: Spike Kill in Effect for '10.10.68.16'
2020/09/18 12:30:12 - SPINE: Poller[poller-02] PID[53272] Device[407] HT[1] DQ[SNMP - Interface Statistics] RECACHE ASSERT FAILED: '374881764<233967063'
2020/09/18 12:30:12 - SPINE: Poller[poller-02] PID[53272] Device[407] HT[1] DQ[SNMP - Interface Statistics] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 374881764 < output: 233967063)
2020/09/18 12:30:12 - SPINE: Poller[poller-02] PID[53272] Device[407] HT[2] NOTICE: Spike Kill in Effect for '10.10.68.16'
2020/09/18 12:30:12 - SPINE: Poller[poller-02] PID[53272] Device[407] HT[2] DQ[Cisco EnvMon IOS - Memory Usage] RECACHE ASSERT FAILED: '374881858<233967059'
2020/09/18 12:30:12 - SPINE: Poller[poller-02] PID[53272] Device[407] HT[2] DQ[Cisco EnvMon IOS - Memory Usage] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 374881858 < output: 233967059)
2020/09/18 12:30:12 - SPINE: Poller[poller-02] PID[53272] Device[407] HT[2] NOTICE: Spike Kill in Effect for '10.10.68.16'
2020/09/18 12:30:12 - SPINE: Poller[poller-02] PID[53272] Device[407] HT[2] DQ[Cisco EnvMon IOS - Temperature] RECACHE ASSERT FAILED: '374881817<233967059'
2020/09/18 12:30:12 - SPINE: Poller[poller-02] PID[53272] Device[407] HT[2] DQ[Cisco EnvMon IOS - Temperature] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 374881817 < output: 233967059)
2020/09/18 12:30:12 - SPINE: Poller[poller-02] PID[53272] Device[407] HT[2] NOTICE: Spike Kill in Effect for '10.10.68.16'
2020/09/18 12:30:12 - SPINE: Poller[poller-02] PID[53272] Device[407] HT[2] DQ[Cisco EnvMon IOS - Power] RECACHE ASSERT FAILED: '374881776<233967059'
2020/09/18 12:30:12 - SPINE: Poller[poller-02] PID[53272] Device[407] HT[2] DQ[Cisco EnvMon IOS - Power] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 374881776 < output: 233967059)
2020/09/18 12:30:12 - SPINE: Poller[poller-02] PID[53272] Device[407] HT[2] NOTICE: Spike Kill in Effect for '10.10.68.16'
2020/09/18 12:30:12 - SPINE: Poller[poller-02] PID[53272] Device[407] HT[2] DQ[SNMP - Interface Statistics] RECACHE ASSERT FAILED: '374881764<233967059'
2020/09/18 12:30:12 - SPINE: Poller[poller-02] PID[53272] Device[407] HT[2] DQ[SNMP - Interface Statistics] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 374881764 < output: 233967059)
============================================
2020/09/18 12:25:12 - SPINE: Poller[poller-02] PID[50722] Device[407] HT[1] NOTICE: Spike Kill in Effect for '10.10.68.16'
2020/09/18 12:25:12 - SPINE: Poller[poller-02] PID[50722] Device[407] HT[1] DQ[Cisco EnvMon IOS - Memory Usage] RECACHE ASSERT FAILED: '374881858<233937038'
2020/09/18 12:25:12 - SPINE: Poller[poller-02] PID[50722] Device[407] HT[1] DQ[Cisco EnvMon IOS - Memory Usage] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 374881858 < output: 233937038)
2020/09/18 12:25:12 - SPINE: Poller[poller-02] PID[50722] Device[407] HT[1] NOTICE: Spike Kill in Effect for '10.10.68.16'
2020/09/18 12:25:12 - SPINE: Poller[poller-02] PID[50722] Device[407] HT[1] DQ[Cisco EnvMon IOS - Temperature] RECACHE ASSERT FAILED: '374881817<233937038'
2020/09/18 12:25:12 - SPINE: Poller[poller-02] PID[50722] Device[407] HT[1] DQ[Cisco EnvMon IOS - Temperature] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 374881817 < output: 233937038)
2020/09/18 12:25:12 - SPINE: Poller[poller-02] PID[50722] Device[407] HT[1] NOTICE: Spike Kill in Effect for '10.10.68.16'
2020/09/18 12:25:12 - SPINE: Poller[poller-02] PID[50722] Device[407] HT[1] DQ[Cisco EnvMon IOS - Power] RECACHE ASSERT FAILED: '374881776<233937038'
2020/09/18 12:25:12 - SPINE: Poller[poller-02] PID[50722] Device[407] HT[1] DQ[Cisco EnvMon IOS - Power] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 374881776 < output: 233937038)
2020/09/18 12:25:12 - SPINE: Poller[poller-02] PID[50722] Device[407] HT[1] NOTICE: Spike Kill in Effect for '10.10.68.16'
2020/09/18 12:25:12 - SPINE: Poller[poller-02] PID[50722] Device[407] HT[1] DQ[SNMP - Interface Statistics] RECACHE ASSERT FAILED: '374881764<233937038'
2020/09/18 12:25:12 - SPINE: Poller[poller-02] PID[50722] Device[407] HT[1] DQ[SNMP - Interface Statistics] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 374881764 < output: 233937038)
2020/09/18 12:25:12 - SPINE: Poller[poller-02] PID[50722] Device[407] HT[2] NOTICE: Spike Kill in Effect for '10.10.68.16'
2020/09/18 12:25:12 - SPINE: Poller[poller-02] PID[50722] Device[407] HT[2] DQ[Cisco EnvMon IOS - Memory Usage] RECACHE ASSERT FAILED: '374881858<233937036'
2020/09/18 12:25:12 - SPINE: Poller[poller-02] PID[50722] Device[407] HT[2] DQ[Cisco EnvMon IOS - Memory Usage] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 374881858 < output: 233937036)
2020/09/18 12:25:12 - SPINE: Poller[poller-02] PID[50722] Device[407] HT[2] NOTICE: Spike Kill in Effect for '10.10.68.16'
2020/09/18 12:25:12 - SPINE: Poller[poller-02] PID[50722] Device[407] HT[2] DQ[Cisco EnvMon IOS - Temperature] RECACHE ASSERT FAILED: '374881817<233937036'
2020/09/18 12:25:12 - SPINE: Poller[poller-02] PID[50722] Device[407] HT[2] DQ[Cisco EnvMon IOS - Temperature] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 374881817 < output: 233937036)
2020/09/18 12:25:12 - SPINE: Poller[poller-02] PID[50722] Device[407] HT[2] NOTICE: Spike Kill in Effect for '10.10.68.16'
2020/09/18 12:25:12 - SPINE: Poller[poller-02] PID[50722] Device[407] HT[2] DQ[Cisco EnvMon IOS - Power] RECACHE ASSERT FAILED: '374881776<233937036'
2020/09/18 12:25:12 - SPINE: Poller[poller-02] PID[50722] Device[407] HT[2] DQ[Cisco EnvMon IOS - Power] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 374881776 < output: 233937036)
2020/09/18 12:25:12 - SPINE: Poller[poller-02] PID[50722] Device[407] HT[2] NOTICE: Spike Kill in Effect for '10.10.68.16'
2020/09/18 12:25:12 - SPINE: Poller[poller-02] PID[50722] Device[407] HT[2] DQ[SNMP - Interface Statistics] RECACHE ASSERT FAILED: '374881764<233937036'
2020/09/18 12:25:12 - SPINE: Poller[poller-02] PID[50722] Device[407] HT[2] DQ[SNMP - Interface Statistics] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 374881764 < output: 233937036)
2020/09/18 12:25:12 - SPINE: Poller[poller-02] PID[50722] Device[407] HT[2] RECACHE: Processing 4 items in the auto reindex cache for '10.10.68.16'
============================================
2020/09/18 12:20:14 - SPINE: Poller[poller-02] PID[48194] Device[407] HT[1] NOTICE: Spike Kill in Effect for '10.10.68.16'
2020/09/18 12:20:14 - SPINE: Poller[poller-02] PID[48194] Device[407] HT[1] DQ[Cisco EnvMon IOS - Memory Usage] RECACHE ASSERT FAILED: '374881858<233907243'
2020/09/18 12:20:14 - SPINE: Poller[poller-02] PID[48194] Device[407] HT[1] DQ[Cisco EnvMon IOS - Memory Usage] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 374881858 < output: 233907243)
2020/09/18 12:20:14 - SPINE: Poller[poller-02] PID[48194] Device[407] HT[1] NOTICE: Spike Kill in Effect for '10.10.68.16'
2020/09/18 12:20:14 - SPINE: Poller[poller-02] PID[48194] Device[407] HT[1] DQ[Cisco EnvMon IOS - Temperature] RECACHE ASSERT FAILED: '374881817<233907243'
2020/09/18 12:20:14 - SPINE: Poller[poller-02] PID[48194] Device[407] HT[1] DQ[Cisco EnvMon IOS - Temperature] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 374881817 < output: 233907243)
2020/09/18 12:20:14 - SPINE: Poller[poller-02] PID[48194] Device[407] HT[1] NOTICE: Spike Kill in Effect for '10.10.68.16'
2020/09/18 12:20:14 - SPINE: Poller[poller-02] PID[48194] Device[407] HT[1] DQ[Cisco EnvMon IOS - Power] RECACHE ASSERT FAILED: '374881776<233907243'
2020/09/18 12:20:14 - SPINE: Poller[poller-02] PID[48194] Device[407] HT[1] DQ[Cisco EnvMon IOS - Power] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 374881776 < output: 233907243)
2020/09/18 12:20:14 - SPINE: Poller[poller-02] PID[48194] Device[407] HT[1] NOTICE: Spike Kill in Effect for '10.10.68.16'
2020/09/18 12:20:14 - SPINE: Poller[poller-02] PID[48194] Device[407] HT[1] DQ[SNMP - Interface Statistics] RECACHE ASSERT FAILED: '374881764<233907243'
2020/09/18 12:20:14 - SPINE: Poller[poller-02] PID[48194] Device[407] HT[1] DQ[SNMP - Interface Statistics] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 374881764 < output: 233907243)
2020/09/18 12:20:14 - SPINE: Poller[poller-02] PID[48194] Device[407] HT[2] NOTICE: Spike Kill in Effect for '10.10.68.16'
2020/09/18 12:20:14 - SPINE: Poller[poller-02] PID[48194] Device[407] HT[2] DQ[Cisco EnvMon IOS - Memory Usage] RECACHE ASSERT FAILED: '374881858<233907242'
2020/09/18 12:20:14 - SPINE: Poller[poller-02] PID[48194] Device[407] HT[2] DQ[Cisco EnvMon IOS - Memory Usage] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 374881858 < output: 233907242)
2020/09/18 12:20:14 - SPINE: Poller[poller-02] PID[48194] Device[407] HT[2] NOTICE: Spike Kill in Effect for '10.10.68.16'
2020/09/18 12:20:14 - SPINE: Poller[poller-02] PID[48194] Device[407] HT[2] DQ[Cisco EnvMon IOS - Temperature] RECACHE ASSERT FAILED: '374881817<233907242'
2020/09/18 12:20:14 - SPINE: Poller[poller-02] PID[48194] Device[407] HT[2] DQ[Cisco EnvMon IOS - Temperature] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 374881817 < output: 233907242)
2020/09/18 12:20:14 - SPINE: Poller[poller-02] PID[48194] Device[407] HT[2] NOTICE: Spike Kill in Effect for '10.10.68.16'
2020/09/18 12:20:14 - SPINE: Poller[poller-02] PID[48194] Device[407] HT[2] DQ[Cisco EnvMon IOS - Power] RECACHE ASSERT FAILED: '374881776<233907242'
2020/09/18 12:20:14 - SPINE: Poller[poller-02] PID[48194] Device[407] HT[2] DQ[Cisco EnvMon IOS - Power] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 374881776 < output: 233907242)
2020/09/18 12:20:14 - SPINE: Poller[poller-02] PID[48194] Device[407] HT[2] NOTICE: Spike Kill in Effect for '10.10.68.16'
2020/09/18 12:20:14 - SPINE: Poller[poller-02] PID[48194] Device[407] HT[2] DQ[SNMP - Interface Statistics] RECACHE ASSERT FAILED: '374881764<233907242'
2020/09/18 12:20:14 - SPINE: Poller[poller-02] PID[48194] Device[407] HT[2] DQ[SNMP - Interface Statistics] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 374881764 < output: 233907242)
============================================
2020/09/18 12:15:13 - SPINE: Poller[poller-02] PID[45614] Device[407] HT[1] NOTICE: Spike Kill in Effect for '10.10.68.16'
2020/09/18 12:15:13 - SPINE: Poller[poller-02] PID[45614] Device[407] HT[1] DQ[Cisco EnvMon IOS - Memory Usage] RECACHE ASSERT FAILED: '374881858<233877150'
2020/09/18 12:15:13 - SPINE: Poller[poller-02] PID[45614] Device[407] HT[1] DQ[Cisco EnvMon IOS - Memory Usage] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 374881858 < output: 233877150)
2020/09/18 12:15:13 - SPINE: Poller[poller-02] PID[45614] Device[407] HT[1] NOTICE: Spike Kill in Effect for '10.10.68.16'
2020/09/18 12:15:13 - SPINE: Poller[poller-02] PID[45614] Device[407] HT[1] DQ[Cisco EnvMon IOS - Temperature] RECACHE ASSERT FAILED: '374881817<233877150'
2020/09/18 12:15:13 - SPINE: Poller[poller-02] PID[45614] Device[407] HT[1] DQ[Cisco EnvMon IOS - Temperature] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 374881817 < output: 233877150)
2020/09/18 12:15:13 - SPINE: Poller[poller-02] PID[45614] Device[407] HT[1] NOTICE: Spike Kill in Effect for '10.10.68.16'
2020/09/18 12:15:13 - SPINE: Poller[poller-02] PID[45614] Device[407] HT[1] DQ[Cisco EnvMon IOS - Power] RECACHE ASSERT FAILED: '374881776<233877150'
2020/09/18 12:15:13 - SPINE: Poller[poller-02] PID[45614] Device[407] HT[1] DQ[Cisco EnvMon IOS - Power] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 374881776 < output: 233877150)
2020/09/18 12:15:13 - SPINE: Poller[poller-02] PID[45614] Device[407] HT[1] NOTICE: Spike Kill in Effect for '10.10.68.16'
2020/09/18 12:15:13 - SPINE: Poller[poller-02] PID[45614] Device[407] HT[1] DQ[SNMP - Interface Statistics] RECACHE ASSERT FAILED: '374881764<233877150'
2020/09/18 12:15:13 - SPINE: Poller[poller-02] PID[45614] Device[407] HT[1] DQ[SNMP - Interface Statistics] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 374881764 < output: 233877150)
2020/09/18 12:15:13 - SPINE: Poller[poller-02] PID[45614] Device[407] HT[2] NOTICE: Spike Kill in Effect for '10.10.68.16'
2020/09/18 12:15:13 - SPINE: Poller[poller-02] PID[45614] Device[407] HT[2] DQ[Cisco EnvMon IOS - Memory Usage] RECACHE ASSERT FAILED: '374881858<233877147'
2020/09/18 12:15:13 - SPINE: Poller[poller-02] PID[45614] Device[407] HT[2] DQ[Cisco EnvMon IOS - Memory Usage] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 374881858 < output: 233877147)
2020/09/18 12:15:13 - SPINE: Poller[poller-02] PID[45614] Device[407] HT[2] NOTICE: Spike Kill in Effect for '10.10.68.16'
2020/09/18 12:15:13 - SPINE: Poller[poller-02] PID[45614] Device[407] HT[2] DQ[Cisco EnvMon IOS - Temperature] RECACHE ASSERT FAILED: '374881817<233877147'
2020/09/18 12:15:13 - SPINE: Poller[poller-02] PID[45614] Device[407] HT[2] DQ[Cisco EnvMon IOS - Temperature] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 374881817 < output: 233877147)
2020/09/18 12:15:13 - SPINE: Poller[poller-02] PID[45614] Device[407] HT[2] NOTICE: Spike Kill in Effect for '10.10.68.16'
2020/09/18 12:15:13 - SPINE: Poller[poller-02] PID[45614] Device[407] HT[2] DQ[Cisco EnvMon IOS - Power] RECACHE ASSERT FAILED: '374881776<233877147'
2020/09/18 12:15:13 - SPINE: Poller[poller-02] PID[45614] Device[407] HT[2] DQ[Cisco EnvMon IOS - Power] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 374881776 < output: 233877147)
2020/09/18 12:15:13 - SPINE: Poller[poller-02] PID[45614] Device[407] HT[2] NOTICE: Spike Kill in Effect for '10.10.68.16'
2020/09/18 12:15:13 - SPINE: Poller[poller-02] PID[45614] Device[407] HT[2] DQ[SNMP - Interface Statistics] RECACHE ASSERT FAILED: '374881764<233877147'
2020/09/18 12:15:13 - SPINE: Poller[poller-02] PID[45614] Device[407] HT[2] DQ[SNMP - Interface Statistics] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 374881764 < output: 233877147)
============================================
2020/09/18 12:10:14 - SPINE: Poller[poller-02] PID[42954] Device[407] HT[1] DQ[Cisco EnvMon IOS - Memory Usage] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 233816991 < output: 233847234)
2020/09/18 12:10:14 - SPINE: Poller[poller-02] PID[42954] Device[407] HT[1] DQ[Cisco EnvMon IOS - Temperature] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 233816991 < output: 233847234)
2020/09/18 12:10:14 - SPINE: Poller[poller-02] PID[42954] Device[407] HT[1] DQ[Cisco EnvMon IOS - Power] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 233816991 < output: 233847234)
2020/09/18 12:10:14 - SPINE: Poller[poller-02] PID[42954] Device[407] HT[1] DQ[SNMP - Interface Statistics] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 233816991 < output: 233847234)
2020/09/18 12:10:14 - SPINE: Poller[poller-02] PID[42954] Device[407] HT[2] DQ[Cisco EnvMon IOS - Memory Usage] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 233816991 < output: 233847231)
2020/09/18 12:10:14 - SPINE: Poller[poller-02] PID[42954] Device[407] HT[2] DQ[Cisco EnvMon IOS - Temperature] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 233816991 < output: 233847231)
2020/09/18 12:10:14 - SPINE: Poller[poller-02] PID[42954] Device[407] HT[2] DQ[Cisco EnvMon IOS - Power] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 233816991 < output: 233847231)
2020/09/18 12:10:14 - SPINE: Poller[poller-02] PID[42954] Device[407] HT[2] DQ[SNMP - Interface Statistics] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 233816991 < output: 233847231)
2020/09/18 12:10:14 - SPINE: Poller[poller-02] PID[42954] Device[407] HT[2] RECACHE: Processing 4 items in the auto reindex cache for '10.10.68.16'
============================================
2020/09/18 12:05:11 - SPINE: Poller[poller-02] PID[40360] Device[407] HT[1] DQ[Cisco EnvMon IOS - Memory Usage] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 233787102 < output: 233816991)
2020/09/18 12:05:11 - SPINE: Poller[poller-02] PID[40360] Device[407] HT[1] DQ[Cisco EnvMon IOS - Temperature] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 233787102 < output: 233816991)
2020/09/18 12:05:11 - SPINE: Poller[poller-02] PID[40360] Device[407] HT[1] DQ[Cisco EnvMon IOS - Power] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 233787102 < output: 233816991)
2020/09/18 12:05:11 - SPINE: Poller[poller-02] PID[40360] Device[407] HT[1] DQ[SNMP - Interface Statistics] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 233787102 < output: 233816991)
2020/09/18 12:05:11 - SPINE: Poller[poller-02] PID[40360] Device[407] HT[2] DQ[Cisco EnvMon IOS - Memory Usage] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 233787102 < output: 233816988)
2020/09/18 12:05:11 - SPINE: Poller[poller-02] PID[40360] Device[407] HT[2] DQ[Cisco EnvMon IOS - Temperature] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 233787102 < output: 233816988)
2020/09/18 12:05:11 - SPINE: Poller[poller-02] PID[40360] Device[407] HT[2] DQ[Cisco EnvMon IOS - Power] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 233787102 < output: 233816988)
2020/09/18 12:05:11 - SPINE: Poller[poller-02] PID[40360] Device[407] HT[2] DQ[SNMP - Interface Statistics] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 233787102 < output: 233816988)
============================================
2020/09/18 12:00:12 - SPINE: Poller[poller-02] PID[37773] Device[407] HT[1] DQ[Cisco EnvMon IOS - Memory Usage] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 233756990 < output: 233787102)
2020/09/18 12:00:12 - SPINE: Poller[poller-02] PID[37773] Device[407] HT[1] DQ[Cisco EnvMon IOS - Temperature] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 233756990 < output: 233787102)
2020/09/18 12:00:12 - SPINE: Poller[poller-02] PID[37773] Device[407] HT[1] DQ[Cisco EnvMon IOS - Power] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 233756990 < output: 233787102)
2020/09/18 12:00:12 - SPINE: Poller[poller-02] PID[37773] Device[407] HT[1] DQ[SNMP - Interface Statistics] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 233756990 < output: 233787102)
2020/09/18 12:00:12 - SPINE: Poller[poller-02] PID[37773] Device[407] HT[2] DQ[Cisco EnvMon IOS - Memory Usage] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 233756990 < output: 233787100)
2020/09/18 12:00:12 - SPINE: Poller[poller-02] PID[37773] Device[407] HT[2] DQ[Cisco EnvMon IOS - Temperature] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 233756990 < output: 233787100)
2020/09/18 12:00:12 - SPINE: Poller[poller-02] PID[37773] Device[407] HT[2] DQ[Cisco EnvMon IOS - Power] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 233756990 < output: 233787100)
2020/09/18 12:00:12 - SPINE: Poller[poller-02] PID[37773] Device[407] HT[2] DQ[SNMP - Interface Statistics] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 233756990 < output: 233787100)
============================================
2020/09/18 11:55:11 - SPINE: Poller[poller-02] PID[35193] Device[407] HT[1] DQ[Cisco EnvMon IOS - Memory Usage] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 233727145 < output: 233756990)
2020/09/18 11:55:11 - SPINE: Poller[poller-02] PID[35193] Device[407] HT[1] DQ[Cisco EnvMon IOS - Temperature] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 233727145 < output: 233756990)
2020/09/18 11:55:11 - SPINE: Poller[poller-02] PID[35193] Device[407] HT[1] DQ[Cisco EnvMon IOS - Power] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 233727145 < output: 233756990)
2020/09/18 11:55:11 - SPINE: Poller[poller-02] PID[35193] Device[407] HT[1] DQ[SNMP - Interface Statistics] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 233727145 < output: 233756990)
2020/09/18 11:55:11 - SPINE: Poller[poller-02] PID[35193] Device[407] HT[2] DQ[Cisco EnvMon IOS - Memory Usage] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 233727145 < output: 233756987)
2020/09/18 11:55:11 - SPINE: Poller[poller-02] PID[35193] Device[407] HT[2] DQ[Cisco EnvMon IOS - Temperature] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 233727145 < output: 233756987)
2020/09/18 11:55:11 - SPINE: Poller[poller-02] PID[35193] Device[407] HT[2] DQ[Cisco EnvMon IOS - Power] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 233727145 < output: 233756987)
2020/09/18 11:55:11 - SPINE: Poller[poller-02] PID[35193] Device[407] HT[2] DQ[SNMP - Interface Statistics] RECACHE OID: .1.3.6.1.2.1.1.3.0, (assert: 233727145 < output: 233756987)
DatBoiTroy commented 3 years ago

After spending the last few days reading through and understanding Spine's poller.c source code, I have a better idea what is happening but I do not know why it is happening.

What seems to be happening is that once every 2 hours roughly, the remote pollers are querying the main poller's cacti.poller_reindex table to check query reindex points rather than their local cacti.poller_reindex tables. Sometimes this happen for one polling cycle and other times, 2 or 3 consecutive polling cycles (as shown in the last list of logs I posted above). The rest of the time, they query and update these values in their local table. Also, the values in the main poller's cacti.poller_reindex table are never updated, so unless the actual device uptime is larger, this loop keeps happening. It makes sense why changing the reindex method for a data query works because doing so also deletes the row for that host_id/data_query_id pair from the cacti.poller_reindex table on all pollers.

Anyone have any idea why this would be happening? In the meantime, I am going monitor and see if a process is stepping on things every two hours for some reason.

TheWitness commented 3 years ago

When you go to the Data Collectors interface, do you have a full sync interval set? If so, disable it for a time and see if the problems go away. Report back.

DatBoiTroy commented 3 years ago

Wow, I didn't realize that was a setting under each individual remote poller. They were set for 2 hours, so I disabled the sync interval on one of them. I'll let you know what happens.

DatBoiTroy commented 3 years ago

I think that may be the issue. Looking through the logs, I can predict exactly when I will see gaps in my graphs. There should be two gaps on the graphs of my affected devices since I made the change this morning. The devices associated to the poller I disabled have not had a gap since the change while all others have had a gap on schedule.

DatBoiTroy commented 3 years ago

After having the one poller's sync interval disabled all day, I re-enabled it and the gaps did come back whenever it ran its full sync. I think we can safely say that is the cause of the gaps. Do you have any suggestions on how to clear up this issue other than disabling the sync interval on the remote pollers?

TheWitness commented 3 years ago

I'm not sure how I want to fix this. Minimizing full sync's to only when you have a site offline for long periods would be the proper solution.

TheWitness commented 3 years ago

Okay, this is resolved now.

TheWitness commented 3 years ago

Should be a simple update from lib/poller.php from the 1.2.x branch to resolve.