Closed Wintermute2k6 closed 2 years ago
Hi,
We need some more details here:
Also, we need logs from the period when it was happening.
All the best, Eric
Info provided see above..
ref/NC/666791 & ref/NC/673835
We are also experiencing the issue. One thing we noticed is that when we deploy changes from Director, the IDO drops the pending writes to the statehistory table. Our environment has around 2000 hosts and 12000 services.
We are using icinga2 version 2.11.0-1 (master-master setup).
We have had the same issue for a very long time (I mentioned it earlier at https://github.com/Icinga/icinga2/issues/5702#event-1617666385 but didn't follow up properly there). I can roughly correlate it with maintenance on Icinga, i.e. when we make config changes, reload or restart, or do maintenance on the backend database (which means a few queries might go to a backend server where MySQL is down before it fails over after a few seconds).
I guess it's understandable that this could happen during such maintenance. Is there perhaps a way to periodically check whether the current state (i.e. OK/Healthy) has a corresponding state transition and, if not, re-trigger it?
We're looking at the SLA reports now and see some incorrect uptime percentages: a service might have missed the recovery state transition, so according to the SLA report it was down for days when in fact it was up all the time. Icingaweb showed it as up as well; only the transition in the state history was missing.
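To make the effect on the SLA numbers concrete, here is a minimal Python sketch. It is not Icinga or IDO code: the `(timestamp, state)` tuples, the function names and the assumption that an object with no history starts in state 0 (OK) are all hypothetical, chosen only for illustration. It computes uptime from recorded transitions and flags objects whose live state disagrees with the last state written to history, which is roughly the kind of periodic check asked about above.

```python
from typing import Dict, List, Tuple

# Hypothetical data shape for illustration; this is NOT the IDO schema.
# A history entry is (timestamp_seconds, state), where state 0 means OK.
History = List[Tuple[int, int]]


def uptime_percent(history: History, start: int, end: int,
                   initial_state: int = 0) -> float:
    """Share of [start, end) spent in state 0 according to recorded transitions."""
    up = 0
    state = initial_state  # assumption: OK unless history says otherwise
    cursor = start
    for ts, new_state in sorted(history):
        if ts <= start:
            # Transition before the window only sets the starting state.
            state = new_state
            continue
        if ts >= end:
            break
        if state == 0:
            up += ts - cursor
        cursor = ts
        state = new_state
    if state == 0:
        up += end - cursor
    return 100.0 * up / (end - start)


def missing_transitions(current: Dict[str, int],
                        histories: Dict[str, History]) -> List[str]:
    """Names of objects whose live state differs from the last recorded state."""
    suspects = []
    for name, live_state in current.items():
        entries = histories.get(name, [])
        # Assumption: an object without any history is treated as OK (0).
        last_recorded = max(entries)[1] if entries else 0
        if live_state != last_recorded:
            suspects.append(name)
    return sorted(suspects)


# A service fails at t=100 and recovers at t=200 within a 1000 s window.
complete = [(100, 2), (200, 0)]
lost_recovery = [(100, 2)]  # the recovery row never reached the history

print(uptime_percent(complete, 0, 1000))
print(uptime_percent(lost_recovery, 0, 1000))
print(missing_transitions({"web": 0, "db": 0},
                          {"web": lost_recovery, "db": complete}))
```

With the recovery row missing, the same 100 seconds of real downtime are reported as 900 seconds, and the object shows up in the mismatch list because its live state is OK while the last recorded state is not.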
It's unlikely that this will get any more attention given that Icinga DB was released. Anyways, it's quite possible for the IDO to lose history on restarts. This should be fine with Icinga DB.
Hi @julianbrost and @lippserd, our system is currently facing the same problem as above. We are using Icinga2 version 2.13.2-1.
For @lippserd's question list:
How big is that "larger setup" you're talking about?
How many objects are affected?
Is this a permanent situation or did it happen just once? How often does it happen?
Is it possible to correlate database restarts, Icinga restarts or something else?
Is this a master master setup?
Is the database replicated?
Do you use HA for the IDO feature?
Are there other observations made which may have caused such scenario?
Regarding the Icinga log check: I can confirm that Icinga had already run the checks, was aware that the events happened, and had sent the required notifications.
Regarding @julianbrost's comment: as in my answer to question 4, the statehistory loss is not correlated with Icinga restarts or any other restarts. Would you mind explaining the difference between IDO and Icinga DB in more detail?
Thank you in advance
Would you mind explaining the difference between IDO and Icinga DB in more detail?
With Icinga DB, the history data is first sent to Redis which is quicker than your typical relational database, so the data is moved out of the scope of the reloading process faster. Also, there's better handling for flushing any still pending Redis writes on reload/shutdown.
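As a rough illustration of why flushing pending writes on shutdown matters, here is a minimal Python sketch of a buffered writer that drains its queue before stopping. It is not how Icinga DB or the IDO is implemented; the class name `HistoryWriter`, the `sink` callback and the event strings are hypothetical.

```python
import queue
import threading
from typing import Callable, List


class HistoryWriter:
    """Buffers history events and passes them to a sink from a worker thread."""

    def __init__(self, sink: Callable[[str], None]) -> None:
        self._sink = sink  # hypothetical backend write, e.g. a DB insert
        self._queue: "queue.Queue[object]" = queue.Queue()
        self._stop = object()  # sentinel marking the end of the stream
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def write(self, event: str) -> None:
        """Queue an event; returns immediately without waiting for the sink."""
        self._queue.put(event)

    def _run(self) -> None:
        while True:
            item = self._queue.get()
            if item is self._stop:
                return
            self._sink(item)

    def shutdown(self) -> None:
        """Block until every event queued before this call has been written."""
        # The queue is FIFO, so the sentinel is seen only after all
        # previously queued events have been handed to the sink.
        self._queue.put(self._stop)
        self._worker.join()


stored: List[str] = []
writer = HistoryWriter(stored.append)
for i in range(100):
    writer.write(f"state-change-{i}")
writer.shutdown()  # without this, queued events could be lost on exit
print(len(stored))
```

If the process exited without calling `shutdown()`, whatever was still sitting in the queue would never reach the backend, which is the kind of loss described in this thread for reloads and restarts.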
Hi @julianbrost, so you mean that the IDO may drop or flush history data in some circumstances, and you recommend switching to Icinga DB. Is that what you recommend to us?
Yes. Not only because of that but also because Icinga DB is supposed to replace the IDO in general.
Thank you so much
Describe the bug
In larger setups, the core seems to skip writing statehistory entries to the IDO, which leaves state changes missing from the history of hosts and services.
This makes it difficult for SLA reporting to be accurate.
To Reproduce
Hard to reproduce in smaller setups, because there isn't enough happening to cause skips in the sequential writes to the statehistory. It needs a larger setup.
Expected behavior
Icinga2 should write every state change to the statehistory sequentially, without skipping states.
Screenshots
none available
Your Environment
Include as many relevant details about the environment you experienced the problem in
icinga2 --version: Icinga 2 Version: 2.10.3
icinga2 feature list: Enabled features: api checker command graphite ido-mysql influxdb livestatus mainlog notification
Icinga Web 2 Modules:
MODULE       VERSION  STATE    DESCRIPTION
director     1.6.2    enabled  Director - Config tool for Icinga 2
doc          2.7.1    enabled  Documentation module
fileshipper  1.0.1    enabled  Fileshipper for Icinga Director
monitoring   2.7.1    enabled  Icinga monitoring module
Additional context
ref/NC/666791 & ref/NC/673835
How big is that "larger setup" you're talking about? 6956 Hosts 57772 Services
How many objects are affected? Many. The customer is still gathering information, because the information came from a separate department.
Is this a permanent situation or did it happen just once? How often does it happen? Permanent situation, happens on a regular basis.
Is it possible to correlate database restarts, Icinga restarts or something else? The Icinga2 service reloads on a fixed schedule via cron job, but database reloads cannot be fully ruled out. The customer is gathering information about this; a separate department handles the databases.
Is this a master master setup? Yes
Is the database replicated? Yes. Information about the replication is currently being gathered by the customer.
Do you use HA for the IDO feature? IDO HA Feature is set to "enable_ha = false"
... The customer is still gathering additional info; further information will be provided via the reference.
Are there other observations made which may have caused such scenario? The icinga2 system is running on an older version which has API issues. An upgrade has already been advised but is still outstanding due to the coronavirus situation.