The Deployment request takes less than a second up to a very few seconds, but the Deployment carried out by Icinga can take minutes. There is no safe way of knowing whether a deployment is still ongoing, has failed, or will never return at all. That's why Config Packages are deployed to isolated directories, each with a dedicated outcome and startup log. A client could remember the response to its request to director/config/deploy and compare it to director/config/deployment-status. This would of course require a coordinated way of steering those API requests. No chance of doing so when you have hundreds of systems firing lots of autonomous deploy requests. But that's an imaginary nightmare scenario, isn't it? :rofl: Long story short: Director is only the "transport" for these requests and shouldn't take the responsibility for deciding whether Icinga is in the right mood to accept a deployment or not ;-)
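For illustration, a minimal sketch of what such client-side coordination could look like: fire a deploy via director/config/deploy and then poll director/config/deployment-status. Both endpoints are mentioned above, but the base URL, credentials, and the response fields checked here are assumptions, not the documented schema:

```python
import time
import requests

BASE = "https://icingaweb.example.com/icingaweb2/director"  # placeholder URL
AUTH = ("director-api-user", "secret")                      # placeholder credentials
HEADERS = {"Accept": "application/json"}

def deploy_and_wait(timeout=300, poll_interval=10):
    """Trigger a deployment and poll until Director reports an outcome."""
    requests.post(f"{BASE}/config/deploy", auth=AUTH, headers=HEADERS).raise_for_status()

    deadline = time.time() + timeout
    while time.time() < deadline:
        r = requests.get(f"{BASE}/config/deployment-status", auth=AUTH, headers=HEADERS)
        r.raise_for_status()
        status = r.json()
        # "status" is an assumed field name; the real payload may differ.
        if status.get("status") in ("deployed", "failed"):
            return status
        time.sleep(poll_interval)
    raise TimeoutError("deployment outcome still unknown after %d seconds" % timeout)
```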
> There is no safe way of knowing whether a deployment is still ongoing, has failed, or will never return at all.
Huh? If you fire a deploy, isn't there a spinner that sooner or later turns into a tick/x, along with an Icinga validation log?
> Huh? If you fire a deploy, isn't there a spinner that sooner or later turns into a tick/x, along with an Icinga validation log?
Yes, and as long as it is spinning, the background daemon tries to collect related information again and again. However, sometimes Icinga completely "forgets" about a Deployment. Probably when deploying twice, when being restarted by other means (Puppet, Ansible), or whatever. Director would then wait forever, and older versions did exactly that.
Recent versions apply the following logic: in case a more recent configuration has been deployed, and Icinga shows an outcome (status, startup.log) for that deployment, all former ones (which still show no outcome) are considered lost. Director then marks them as such in its DB and wipes the orphaned stages via the Icinga API.
So, if Director stopped deploying as requested, the first lost Deployment would result in a deadlock situation.
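For illustration, a minimal sketch of that cleanup logic, assuming hypothetical stand-ins for the Director database layer and the Icinga 2 API client (none of these names come from the actual Director code base):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Deployment:
    stage_name: str
    timestamp: float
    outcome: Optional[str] = None  # e.g. status/startup.log contents; None = unknown

def mark_lost_deployments(deployments: List[Deployment], icinga_api, director_db):
    """Mark outcome-less deployments as lost once a newer one has an outcome.

    `icinga_api` and `director_db` are hypothetical stand-ins for the real
    Icinga 2 API client and the Director database layer.
    """
    newest_with_outcome = max(
        (d for d in deployments if d.outcome is not None),
        key=lambda d: d.timestamp,
        default=None,
    )
    if newest_with_outcome is None:
        return  # nothing has finished yet, keep waiting

    for d in deployments:
        if d.outcome is None and d.timestamp < newest_with_outcome.timestamp:
            director_db.mark_deployment_lost(d)                # assumed helper
            icinga_api.delete_stage("director", d.stage_name)  # wipe the orphaned stage
```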
What we see occasionally is that IDO is not able to finish syncing: hosts are present, hostgroups are present, and even though hosts are members of the hostgroup, the membership is not reflected in the interface. If we check the DBs, the Director one shows the correct membership, but the IDO one does not. To fix this, we simply remove the group membership from those hosts, do a deploy, then re-add the group to them.
For us the deploy takes ~32 secs, but the IDO / web interface is updated after ~3-4 min.
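For what it's worth, a rough sketch of that workaround against the Director REST API. The director/config/deploy endpoint is mentioned above and director/host is the usual host endpoint, but the exact payload semantics assumed here (POST with only the groups property modifying just that property), plus all URLs, credentials, host and group names, are assumptions to verify first:

```python
import requests

BASE = "https://icingaweb.example.com/icingaweb2/director"  # placeholder URL
AUTH = ("director-api-user", "secret")                      # placeholder credentials
HEADERS = {"Accept": "application/json"}

def set_host_groups(host, groups):
    # Assumption: POSTing a partial object to director/host?name=... modifies
    # only the given properties. Check against your Director version.
    r = requests.post(f"{BASE}/host", params={"name": host},
                      json={"groups": groups}, auth=AUTH, headers=HEADERS)
    r.raise_for_status()

def deploy():
    r = requests.post(f"{BASE}/config/deploy", auth=AUTH, headers=HEADERS)
    r.raise_for_status()

# The workaround described above: drop the membership, deploy, re-add, deploy again.
affected_hosts = ["host-a.example.com", "host-b.example.com"]  # placeholders
for h in affected_hosts:
    set_host_groups(h, [])
deploy()
for h in affected_hosts:
    set_host_groups(h, ["affected-hostgroup"])  # placeholder group name
deploy()
```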
The expected behaviour would be that a new deploy does not take place until all logical units of the previous one have completed.
@Al2Klimov: I'd propose to move this issue to the Icinga 2 issue tracker. It's not feasible for an API-based tool like the Director to check whether all add-ons (IDO, IcingaDB, graphing solutions) are in sync after a deployment. Director is not the only software out there triggering Icinga restarts.
Keeping its data in a consistent state is without doubt a task that can only be accomplished by the Icinga daemon itself.
Sounds like the sensible approach, @Thomas-Gelf. Forgive my ignorance, but I do not know the process for moving this issue over.
Adding some more details to the situation: we have seen deployments complete correctly (green checkmark), spaced 3 minutes apart, that still show the group membership behaviour. We're investigating further on our side, but this seems to be related to post-deployment IDO updates which do not finish in time / don't get applied before the next deployment or push comes along, and then remain in an inconsistent state.
Will bring more details as we troubleshoot and dig further.
I don't see any (let's call it) "missing mutex" problem in the Icinga 2 code. But let's assume there is such a problem for now...
@Thomas-Gelf Stupid question: even if Icinga 2 handled those (see above) properly, wouldn't this lead to more or less the same problems? I mean: user A adds+deploys host A, user B adds+deploys host B. Actually they deploy (let's call it) "merge base" + "changeset" as a whole. So "at best" either host A or B would get lost, wouldn't they? So wouldn't Director have to act more or less like a wiki here and say: hold on, user B, please "rebase" your changeset on the most recent "master"?
Please don't make assumptions without trying it out. There is no such thing as "Deploying one host", it's always all or nothing, and it is always a specific, well-defined, consistent version.
Has anyone actually looked at this in more detail? Like where exactly are things missing? Only in IDO, or also in icinga2 itself (i.e. are they missing in icinga2 API responses)? This sounds like it could also be an IDO bug triggered by a restart at the wrong moment, or something like that.
> There is no such thing as "Deploying one host", it's always all or nothing
That's exactly what I'm talking about!
> Actually they deploy (let's call it) "merge base" + "changeset" as a whole.
Unless Icinga refuses deployments (preferably with a dedicated error message and/or status code) while it is in a busy state, there is nothing we can do here, I guess. Closing the feature request for now; please do not hesitate to create a new one (or re-open this one) in case this changes.
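Purely for illustration, a hypothetical client-side sketch of what such a refusal could look like; neither Icinga nor Director currently returns the 503 "busy" status assumed here, and the URL and credentials are placeholders:

```python
import time
import requests

DEPLOY_URL = "https://icingaweb.example.com/icingaweb2/director/config/deploy"  # placeholder
AUTH = ("director-api-user", "secret")                                          # placeholder

def deploy_with_backoff(max_attempts=10, wait_seconds=30):
    """Retry a deployment while the server reports it is busy (hypothetical 503)."""
    for _ in range(max_attempts):
        r = requests.post(DEPLOY_URL, auth=AUTH, headers={"Accept": "application/json"})
        if r.status_code == 503:  # assumed "previous deployment still running" signal
            time.sleep(wait_seconds)
            continue
        r.raise_for_status()
        return r.json()
    raise RuntimeError("gave up: server stayed busy for too long")
```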
Expected Behavior
At best a deployment waits while others are running.
Current Behavior
Possible Solution
Steps to Reproduce (for bugs)
Your Environment