Cacti / cacti


When performing actions against Devices, replicated device information could sometimes be lost #3916

Closed: eschoeller closed this issue 3 years ago

eschoeller commented 3 years ago

I've been at this for about 5 hours now...

I upgraded to 1.2.15 and my remote pollers stopped working. I could revive them by moving devices around, but then they'd stop working again, suddenly reporting "DataSources:0" while still showing an accurate Host count. I then realized I could disable/enable the devices to get the remote pollers collecting data again, but this kept happening over and over.
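For reference, one quick way to see whether a remote poller's cache has actually been emptied is to compare the host table against the poller cache. This is only a sketch, assuming shell access to the main Cacti database; table and column names are from the stock 1.2.x schema:

# Hosts assigned to each poller vs. entries remaining in the poller
# cache. A non-zero host count alongside a zero poller_item count
# matches the "Hosts:N ... DataSources:0" symptom described above.
mysql cacti -e "SELECT poller_id, COUNT(*) AS hosts FROM host GROUP BY poller_id;"
mysql cacti -e "SELECT poller_id, COUNT(*) AS cache_items FROM poller_item GROUP BY poller_id;"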

Finally I realized the failures were just intermittent enough that they could be related to the boost process, since it too runs intermittently. Sure enough, once the boost process ran, all the pollers dropped to zero data sources, and even after the boost run completed they never came back until I manually intervened. I reproduced this several times. Then I downgraded to 1.2.14. Everything is fine again, and I've seen several boost processes run while the remote pollers retained their data source polling.
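To correlate the drop with boost runs, the boost backlog can be watched directly. Again just a sketch; poller_output_boost is the stock boost transfer table, and the mysql invocation assumes local socket access:

# Boost accumulates updates in poller_output_boost and flushes them
# periodically; the backlog emptying marks the end of a boost run.
watch -n 30 'mysql cacti -e "SELECT COUNT(*) AS boost_backlog FROM poller_output_boost;"'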

Looking back through the extensive changelog for 1.2.15, I see numerous issues that could have triggered this. I'll let you guys take a first pass at which issue this might be, since you'll probably zero in on it faster than I will. And I'm already spent on this.

Cheers!

c0sm1n commented 3 years ago

I'm experiencing the same issue after upgrading to 1.2.15. The remote poller processes all RRDs for about an hour and then stops without warning. Moving all managed devices to the main poller and back 'fixes' it temporarily:

2020/11/17 13:50:08 - SYSTEM STATS: Time:5.3593 Method:spine Processes:1 Threads:60 Hosts:13 HostsPerProcess:13 DataSources:1844 RRDsProcessed:0
2020/11/17 13:55:08 - SYSTEM STATS: Time:5.3588 Method:spine Processes:1 Threads:60 Hosts:13 HostsPerProcess:13 DataSources:1844 RRDsProcessed:0
2020/11/17 14:00:09 - SYSTEM STATS: Time:5.3579 Method:spine Processes:1 Threads:60 Hosts:13 HostsPerProcess:13 DataSources:1844 RRDsProcessed:0
2020/11/17 14:05:09 - SYSTEM STATS: Time:5.3673 Method:spine Processes:1 Threads:60 Hosts:13 HostsPerProcess:13 DataSources:1844 RRDsProcessed:0
2020/11/17 14:10:08 - SYSTEM STATS: Time:5.3485 Method:spine Processes:1 Threads:60 Hosts:13 HostsPerProcess:13 DataSources:1844 RRDsProcessed:0
2020/11/17 14:15:08 - SYSTEM STATS: Time:5.3571 Method:spine Processes:1 Threads:60 Hosts:13 HostsPerProcess:13 DataSources:1844 RRDsProcessed:0
2020/11/17 14:20:04 - SYSTEM STATS: Time:1.2630 Method:spine Processes:1 Threads:60 Hosts:15 HostsPerProcess:15 DataSources:0 RRDsProcessed:0
2020/11/17 14:25:04 - SYSTEM STATS: Time:1.2594 Method:spine Processes:1 Threads:60 Hosts:15 HostsPerProcess:15 DataSources:0 RRDsProcessed:0
2020/11/17 14:30:05 - SYSTEM STATS: Time:1.2619 Method:spine Processes:1 Threads:60 Hosts:15 HostsPerProcess:15 DataSources:0 RRDsProcessed:0
2020/11/17 14:35:04 - SYSTEM STATS: Time:1.2580 Method:spine Processes:1 Threads:60 Hosts:15 HostsPerProcess:15 DataSources:0 RRDsProcessed:0
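To catch the moment it flips without watching the log by hand, a small watch script works. A sketch only; the log path is an assumption for a typical install:

#!/bin/sh
# Follow the Cacti log and flag any poller cycle that reports zero
# data sources. Adjust LOG for your install.
LOG=/var/www/html/cacti/log/cacti.log
tail -F "$LOG" | grep --line-buffered 'SYSTEM STATS' | \
while read -r line; do
    case "$line" in
        *'DataSources:0 '*) echo "ALERT: zero data sources: $line" ;;
    esac
done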

eschoeller commented 3 years ago

Glad I'm not alone. Hopefully the fix is relatively easy.

TheWitness commented 3 years ago

I'll be looking at this over the weekend to try to reproduce. Day job is calling though.

eschoeller commented 3 years ago

Thanks!!

Eric.


TheWitness commented 3 years ago

I'm still tracing this. I'm not quite sure what is breaking it yet.

TheWitness commented 3 years ago

Okay, I think I have the last issue remediated.

c0sm1n commented 3 years ago

Still experiencing the same issue after applying the patch. I moved all devices from the remote poller to the main poller and back; the first poll cycle after the move picks up all hosts, then drops back down to 0 on the next cycle:

2020/11/23 13:05:09 - SYSTEM STATS: Time:6.3806 Method:spine Processes:1 Threads:60 Hosts:15 HostsPerProcess:15 DataSources:1872 RRDsProcessed:0
2020/11/23 13:10:03 - SYSTEM STATS: Time:0.1288 Method:spine Processes:1 Threads:60 Hosts:0 HostsPerProcess:0 DataSources:0 RRDsProcessed:0
2020/11/23 13:15:03 - SYSTEM STATS: Time:0.1306 Method:spine Processes:1 Threads:60 Hosts:0 HostsPerProcess:0 DataSources:0 RRDsProcessed:0
2020/11/23 13:20:03 - SYSTEM STATS: Time:0.1247 Method:spine Processes:1 Threads:60 Hosts:0 HostsPerProcess:0 DataSources:0 RRDsProcessed:0

TheWitness commented 3 years ago

Did you cherry-pick, or did you take the entire 1.2.x branch?

c0sm1n commented 3 years ago

Cherry-picked the modified files only.

TheWitness commented 3 years ago

Do a full update approximately this way:

[root@vmhost3 html]# cat refresh.sh
#!/bin/sh
# Pull a fresh copy of the 1.2.x branch, overlay it onto the live
# install, and restore web-server ownership.
rm -rf cacti-develop
git clone -b 1.2.x https://github.com/Cacti/cacti.git cacti-develop
/bin/cp -rpf cacti-develop/* cacti
chown -R apache:apache cacti

Make sure that your main poller can push out changes to the remotes automatically, then re-populate the poller cache.
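If you need to re-populate it from the command line, Cacti ships a CLI script for that; the path assumes a standard install:

# Rebuild the poller cache on the main poller, then let it sync out.
cd /var/www/html/cacti
php -q cli/rebuild_poller_cache.php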

c0sm1n commented 3 years ago

I applied the changes as instructed, and while the behavior is slightly different in that the log now shows the correct number of data sources, the remote poller is still not polling any OIDs:

2020/11/23 18:00:04 - SYSTEM STATS: Time:0.4570 Method:spine Processes:1 Threads:60 Hosts:0 HostsPerProcess:0 DataSources:1872 RRDsProcessed:0
2020/11/23 18:05:03 - SYSTEM STATS: Time:0.4588 Method:spine Processes:1 Threads:60 Hosts:0 HostsPerProcess:0 DataSources:1872 RRDsProcessed:0
2020/11/23 18:10:03 - SYSTEM STATS: Time:0.4363 Method:spine Processes:1 Threads:60 Hosts:0 HostsPerProcess:0 DataSources:1872 RRDsProcessed:0
2020/11/23 18:15:04 - SYSTEM STATS: Time:0.4906 Method:spine Processes:2 Threads:60 Hosts:0 HostsPerProcess:0 DataSources:1872 RRDsProcessed:0

TheWitness commented 3 years ago

Can you do a full sync on the pollers?

c0sm1n commented 3 years ago

Still no change after a full sync and a poller rebuild.

2020/11/24 06:15:04 - SYSTEM STATS: Time:1.3685 Method:spine Processes:2 Threads:60 Hosts:15 HostsPerProcess:8 DataSources:0 RRDsProcessed:0

TheWitness commented 3 years ago

Okay, maybe I missed a commit. I'll research later tonight.

TheWitness commented 3 years ago

Okay, just confirmed when it's happening. Now to figure out why it continues to happen.

TheWitness commented 3 years ago

Okay, take the latest lib/poller.php and do a full sync to your pollers. This should get it fixed.
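If you would rather grab the single file than overlay the whole branch, something like this should do it; the install path and ownership are assumptions for a standard setup:

# Pull just lib/poller.php from the 1.2.x branch and drop it in place,
# then trigger the full sync so the remotes pick it up.
cd /var/www/html/cacti
curl -fsSL -o lib/poller.php https://raw.githubusercontent.com/Cacti/cacti/1.2.x/lib/poller.php
chown apache:apache lib/poller.php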

TheWitness commented 3 years ago

Bump!

eschoeller commented 3 years ago

Unfortunately, I probably can't test this. I'm a little hesitant to go through the entire upgrade/downgrade process again; it took a lot of time. I can only make changes on a full production instance, and I have no real testing environment in place.

TheWitness commented 3 years ago

Okay, in that case, since I have reproduced and fixed the underlying bugs, I'll go ahead and close.

eschoeller commented 3 years ago

I think @c0sm1n should sign off on it, since it seems like he/she was able to readily verify functionality.

TheWitness commented 3 years ago

We can wait a few days, but given the gravity of the issue we have to move fast.

eschoeller commented 3 years ago

I agree. This is a serious bug, at least for people with distributed environments. I would almost consider pulling 1.2.15 entirely. What else is pending for a 1.2.16 release? If not much else, then maybe I would consider upgrading today.

TheWitness commented 3 years ago

Nothing major. Talked to @netniV via Slack today. We may do a weekend release Sunday. We've been trying for a few releases to get to the 1.3 release in earnest.

TheWitness commented 3 years ago

Need to do a bunch of testing.

c0sm1n commented 3 years ago

The fix worked for me, and the remote pollers have been running successfully since I pulled the latest poller.php. Thanks for verifying and correcting the issue.

TheWitness commented 3 years ago

Perfect, thanks.