IDO icinga_statehistory not completely written

Wintermute2k6 commented 4 years ago

Describe the bug

In larger setups, the core seems to skip writing statehistory entries to the IDO, which leaves state changes missing from the history of hosts and services.

This makes it difficult for SLA reporting to be accurate.

To Reproduce

Hard to reproduce in smaller setups, because there isn't enough happening to cause skips in the sequential writes to the statehistory. It needs a larger setup.

Expected behavior

Icinga 2 should write every state change to the statehistory sequentially, without skipping states.

Screenshots

none available

Your Environment

Include as many relevant details about the environment you experienced the problem in

Version used (icinga2 --version): Icinga 2 Version : 2.10.3
Operating System and version: CentOS Linux release 7.2.1511
Enabled features (icinga2 feature list): Enabled features: api checker command graphite ido-mysql influxdb livestatus mainlog notification
Icinga Web 2 version and modules (System - About): Packages: icingaweb2-2.7.1-1.el7.icinga.noarch php-5.4.16-36.1.el7_2.1.x86_64 httpd-2.4.6-40.el7.centos.1.x86_64

Icinga Web 2 Modules:
MODULE       VERSION  STATE    DESCRIPTION
director     1.6.2    enabled  Director - Config tool for Icinga 2
doc          2.7.1    enabled  Documentation module
fileshipper  1.0.1    enabled  Fileshipper for Icinga Director
monitoring   2.7.1    enabled  Icinga monitoring module

Additional context

ref/NC/666791 & ref/NC/673835

How big is that "larger setup" you're talking about? 6956 Hosts 57772 Services

How many objects are affected? Many. The customer is still gathering information, because it was a separate department that reported the problem.

Is this a permanent situation or did it happen just once? How often does it happen? Permanent situation, happens on a regular basis.

Is it possible to correlate database restarts, Icinga restarts or something else? The Icinga 2 service reloads on a fixed schedule via a cron job, but database reloads cannot be ruled out either. The customer is gathering information about this; a separate department handles the databases.

Is this a master master setup? Yes

Is the database replicated? Yes, Info about the replication is gathered by the customer atm.

Do you use HA for the IDO feature? IDO HA Feature is set to "enable_ha = false"

... The customer is still gathering additional info. Further information will be provided via the reference.

Are there other observations made which may have caused such scenario? The Icinga 2 system is running an older version which had API issues. An upgrade has already been advised but is still outstanding due to corona circumstances.

lippserd commented 4 years ago

Hi,

We need some more details here:

How big is that "larger setup" you're talking about?
How many objects are affected?
Is this a permanent situation or did it happen just once? How often does it happen?
Is it possible to correlate database restarts, Icinga restarts or something else?
Is this a master master setup?
Is the database replicated?
Do you use HA for the IDO feature?
...
Are there other observations made which may have caused such scenario?

Also, we need logs from the period when it was happening.

All the best, Eric

Wintermute2k6 commented 4 years ago

Info provided, see above.

Wintermute2k6 commented 4 years ago

ref/NC/666791 & ref/NC/673835

usmanC9 commented 3 years ago

We are also experiencing this issue. One thing we noticed is that when we deploy changes from Director, the IDO drops the pending writes to the statehistory table. Our environment has around 2000 hosts and 12000 services.

usmanC9 commented 3 years ago

We are using icinga2 version 2.11.0-1 (master-master setup).

NielsH commented 2 years ago

We have had the same issue for a very long time (I mentioned it earlier in https://github.com/Icinga/icinga2/issues/5702#event-1617666385, but I didn't follow up properly there). I can sort of correlate this with the times when we perform maintenance on Icinga: we may be making config changes, doing reloads or restarts, or doing maintenance on the backend database (which means a few queries might go to a backend server where MySQL is down before it fails over after a few seconds).

I guess it's understandable this could happen during those maintenances. Is there perhaps a way to periodically check if the current state (i.e. OK/Healthy) has a corresponding state transition and if not, re-trigger it?
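To make the idea of such a periodic check concrete, here is a minimal sketch in Python. It is not an existing Icinga feature. It assumes the current states and the statehistory rows have already been fetched from the IDO database (for example from `icinga_servicestatus` and `icinga_statehistory`); the row shapes and the function name `find_missing_transitions` are hypothetical and only serve the illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical row shape: (object_id, state_time as epoch seconds, state).
# The real IDO tables have more columns; nothing is queried here.
HistoryRow = Tuple[int, int, int]


def find_missing_transitions(
    current_states: Dict[int, int],
    history: List[HistoryRow],
) -> List[int]:
    """Return object IDs whose newest history entry disagrees with the current state."""
    latest: Dict[int, Tuple[int, int]] = {}  # object_id -> (state_time, state)
    for object_id, state_time, state in history:
        if object_id not in latest or state_time > latest[object_id][0]:
            latest[object_id] = (state_time, state)

    suspects = []
    for object_id, state in sorted(current_states.items()):
        last = latest.get(object_id)
        # Objects without any history are not flagged: they may never have changed state.
        if last is not None and last[1] != state:
            suspects.append(object_id)
    return suspects


if __name__ == "__main__":
    current = {1: 0, 2: 0, 3: 2}  # 0 = OK, 2 = CRITICAL
    rows = [(1, 100, 2), (1, 200, 0), (2, 100, 2), (3, 100, 2)]
    # Object 2 is OK now, but its last recorded transition is CRITICAL.
    print(find_missing_transitions(current, rows))  # prints [2]
```

A check like this only detects a missing last transition. It cannot tell when the lost transition actually happened, so any re-triggered history entry would carry an approximate timestamp.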

We're checking out the SLA reports now and we now see some incorrect uptime percentages, because a service might have missed the recovery state transition so according to the SLA was down for days, but in fact it was up all the time. Icingaweb showed it as up as well, it just missed the transition in the state history.
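To illustrate how much a single missed recovery can distort such a report, here is a small worked example in Python. It is a simplified sketch, not how any Icinga reporting module computes SLAs; the function `uptime_percent` and the sample timestamps are made up for the illustration.

```python
from typing import List, Tuple

OK = 0        # Icinga service state code for OK
CRITICAL = 2  # Icinga service state code for CRITICAL


def uptime_percent(
    initial_state: int,
    events: List[Tuple[int, int]],
    start: int,
    end: int,
) -> float:
    """Percentage of [start, end) spent in OK.

    `events` are (timestamp, new_state) transitions, sorted by timestamp
    and all lying within the reporting window.
    """
    up = 0
    state, since = initial_state, start
    for ts, new_state in events:
        if state == OK:
            up += ts - since
        state, since = new_state, ts
    if state == OK:
        up += end - since
    return 100.0 * up / (end - start)


if __name__ == "__main__":
    ten_days = 10 * 24 * 3600
    # A 10-minute outage, with both transitions recorded.
    complete = [(3600, CRITICAL), (4200, OK)]
    # The same outage, but the recovery never reached the statehistory.
    lost_recovery = [(3600, CRITICAL)]
    print(round(uptime_percent(OK, complete, 0, ten_days), 2))       # 99.93
    print(round(uptime_percent(OK, lost_recovery, 0, ten_days), 2))  # 0.42
```

With the recovery missing, the service counts as down from the outage until the end of the reporting window, even though it was down for only ten minutes.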

julianbrost commented 2 years ago

It's unlikely that this will get any more attention given that Icinga DB was released. Anyways, it's quite possible for the IDO to lose history on restarts. This should be fine with Icinga DB.

quangtamle commented 1 year ago

Hi @julianbrost and @lippserd, our system is currently facing the same problem as described above. We are using Icinga 2 version 2.13.2-1.

Answers to @lippserd's question list:

How big is that "larger setup" you're talking about?
- 2700 hosts and 102000 services
How many objects are affected?
- A lot. We cannot give you the exact number, but we spot-checked at random against events that happened on our devices.
Is this a permanent situation or did it happen just once? How often does it happen?
- We are still monitoring this problem. It certainly did not happen just once, and we first noticed it around 3-4 months ago.
Is it possible to correlate database restarts, Icinga restarts or something else?
- We also considered those cases, but unfortunately the statehistory loss is not correlated with any of those events.
Is this a master master setup?
- Yes, it is a master master setup
Is the database replicated?
- Yes, the database is using maxscale for HA and replication
Do you use HA for the IDO feature?
- Yes, we enabled IDO feature on all master nodes
Are there other observations made which may have caused such scenario?
- We considered all the possibilities we could think of but, as I said above, none of them can explain the losses.

Regarding the Icinga log check: I can confirm that Icinga had already run the checks, was aware that the events happened, and had sent the necessary notifications.

Regarding @julianbrost's comment: as in my answer to question 4, the statehistory loss is not correlated with Icinga restarts or any other restarts. Would you mind explaining the difference between IDO and Icinga DB?

Thank you in advance

julianbrost commented 1 year ago

> Would you mind explaining the difference between IDO and Icinga DB?

With Icinga DB, the history data is first sent to Redis which is quicker than your typical relational database, so the data is moved out of the scope of the reloading process faster. Also, there's better handling for flushing any still pending Redis writes on reload/shutdown.
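The "flush pending writes on reload/shutdown" idea can be sketched generically as follows. This is not Icinga DB's actual code and it does not talk to Redis. The class `BufferedHistoryWriter` and the `sink` callback are hypothetical names, and the sketch only shows the general pattern of draining a write queue before the process goes away.

```python
import queue
import threading
from typing import Callable, List


class BufferedHistoryWriter:
    """Queues history events and hands them to `sink` from a background thread."""

    _STOP = object()  # sentinel marking the end of the queue

    def __init__(self, sink: Callable[[dict], None]) -> None:
        self._sink = sink
        self._queue: "queue.Queue[object]" = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def write(self, event: dict) -> None:
        """Enqueue an event; returns immediately without waiting for the sink."""
        self._queue.put(event)

    def _run(self) -> None:
        while True:
            item = self._queue.get()
            if item is self._STOP:
                return
            self._sink(item)

    def close(self) -> None:
        """Flush: every event queued before close() is delivered before it returns."""
        self._queue.put(self._STOP)
        self._worker.join()


if __name__ == "__main__":
    stored: List[dict] = []
    writer = BufferedHistoryWriter(stored.append)
    for seq in range(1000):
        writer.write({"seq": seq})
    writer.close()  # without this, a daemon thread could be killed mid-queue
    print(len(stored))  # prints 1000
```

Because the queue is first-in, first-out, the sentinel is only reached after all earlier events have been handed to the sink. The faster the sink, the shorter the window in which a reload can hit a non-empty queue.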

quangtamle commented 1 year ago

Hi @julianbrost, so you mean that the IDO may drop/flush history data under some circumstances, and you recommend switching to Icinga DB. Is that what you recommend for us?

julianbrost commented 1 year ago

Yes. Not only because of that but also because Icinga DB is supposed to replace the IDO in general.

quangtamle commented 1 year ago

Thank you so much
