Icinga / icingaweb2-module-director

The Director aims to be your new favourite Icinga config deployment tool. Director is designed for those who want to automate their configuration deployment and those who want to grant their “point & click” users easy access to the configuration.
https://icinga.com/docs/director/latest
GNU General Public License v2.0

Prevent/handle concurrent deployments #2412

Closed: Al2Klimov closed this issue 2 years ago

Al2Klimov commented 3 years ago

Expected Behavior

Ideally, a deployment waits while others are still running.

Thomas-Gelf commented 3 years ago

The deployment request takes less than a second, up to a very few seconds, but the deployment carried out by Icinga can take minutes. There is no safe way of knowing whether a deployment is still ongoing, has failed, or will never return at all. That's why config packages are deployed to isolated directories, each with a dedicated outcome and startup log.

Long story short: Director is only the "transport" for these requests, and shouldn't take on the responsibility of deciding whether Icinga is in the right mood to accept a deployment or not ;-)
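
For illustration, here is a minimal sketch of how such an isolated stage and its outcome can be inspected through the Icinga 2 REST API (`/v1/config/stages` and `/v1/config/files` are documented API routes; the host, credentials, package and stage names below are placeholders):

```python
# Sketch only, not Director's code: poll a deployed stage's outcome through
# the Icinga 2 REST API. Host, credentials, package and stage are placeholders.
import time

import requests

API = "https://icinga-master:5665/v1"          # assumed API endpoint
AUTH = ("director", "secret")                  # assumed API user
PACKAGE, STAGE = "director", "example-stage"   # placeholder names


def stage_outcome(package: str, stage: str):
    """Return the stage's `status` file content, or None while still pending."""
    files = requests.get(f"{API}/config/stages/{package}/{stage}",
                         auth=AUTH, verify=False).json()["results"]
    if not any(f["name"] == "status" for f in files):
        return None  # Icinga wrote no outcome (yet - or ever)
    status = requests.get(f"{API}/config/files/{package}/{stage}/status",
                          auth=AUTH, verify=False)
    return status.text.strip()  # "0" on success


# Poll, but give up eventually: as explained above, a stage may never
# receive an outcome at all.
for _ in range(30):
    outcome = stage_outcome(PACKAGE, STAGE)
    if outcome is not None:
        print("validation", "succeeded" if outcome == "0" else "failed")
        break
    time.sleep(10)
else:
    print("no outcome after 5 minutes - possibly a lost deployment")
```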

Al2Klimov commented 3 years ago

There is no safe way of knowing whether a deployment is still ongoing, has failed, or will never return at all.

Huh? If you fire a deploy, isn't there a spinner that sooner or later turns into a tick/x, along with an Icinga validation log?

Thomas-Gelf commented 3 years ago

Huh? If you fire a deploy, isn't there a spinner that sooner or later turns into a tick/x, along with an Icinga validation log?

Yes, and as long as it is spinning, the background daemon tries to collect the related information again and again. However, sometimes Icinga completely "forgets" about a deployment, probably when deploying twice, when it is restarted by other means (Puppet, Ansible), or whatever. In that case Director would wait forever, and older versions did exactly that.

Recent versions apply the following logic: in case a more recent configuration has been deployed, and Icinga shows an outcome (status, startup.log) for that deployment, all former ones (which still show no outcome) are considered lost. Director then marks them as such in its DB and wipes the orphaned stages via the Icinga API.

So if Director stopped deploying while another deployment is still pending, as requested here, the first lost deployment would result in a deadlock situation.
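
A rough sketch of that heuristic, paraphrased in Python for illustration (Director itself is PHP; this is not its actual code):

```python
# Sketch of the lost-deployment heuristic described above: once a newer
# deployment has an outcome, older outcome-less ones are considered lost.
from dataclasses import dataclass
from typing import Optional


@dataclass
class Deployment:
    stage: str
    sent_at: float            # when Director pushed the stage
    outcome: Optional[str]    # "0"/"1" from the status file, None if pending


def lost_deployments(deployments: list[Deployment]) -> list[Deployment]:
    """All outcome-less deployments older than the newest decided one."""
    decided = [d for d in deployments if d.outcome is not None]
    if not decided:
        return []  # nothing decided yet, keep waiting
    newest = max(decided, key=lambda d: d.sent_at)
    return [d for d in deployments
            if d.outcome is None and d.sent_at < newest.sent_at]


# Director would then mark these as lost in its DB and delete the orphaned
# stages, e.g. via DELETE /v1/config/stages/<package>/<stage>.
```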

danielmicu commented 3 years ago

What we see occasionally is that IDO is not able to finish syncing: hosts are present, hostgroups are present, and even though hosts are members of the hostgroup, the membership is not reflected in the interface. If we check in the DB, the Director one shows the correct membership, but the IDO one does not. To fix this, we simply remove the group membership from the hosts, deploy, then re-add the group to those hosts.

For us the deploy takes ~32 secs, but the IDO / web interface is updated after ~3-4 min.

The expected behaviour would be that a new deploy does not take place until all logical units of the previous one have completed.
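
For illustration, the manual workaround described above could be scripted against Director's own REST API; a hedged sketch, where `director/host` and `director/config/deploy` are documented Director routes and everything else (base URL, credentials, object names) is a placeholder:

```python
# Sketch of the manual workaround above, scripted against Director's REST API.
import requests

WEB = "https://icingaweb/icingaweb2"       # assumed Icinga Web 2 base URL
AUTH = ("director-api", "secret")          # assumed API user
HEADERS = {"Accept": "application/json"}
HOST, GROUP = "app-server-01", "app-servers"  # placeholder objects


def set_groups(host: str, groups: list[str]) -> None:
    # POST modifies only the properties given in the body
    requests.post(f"{WEB}/director/host", params={"name": host},
                  json={"groups": groups}, auth=AUTH, headers=HEADERS)


def deploy() -> None:
    requests.post(f"{WEB}/director/config/deploy", auth=AUTH, headers=HEADERS)


set_groups(HOST, [])       # 1. drop the group membership
deploy()                   # 2. deploy without it
set_groups(HOST, [GROUP])  # 3. re-add the group for the next deployment
```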

Thomas-Gelf commented 3 years ago

@Al2Klimov: I'd propose to move this issue to the Icinga 2 issue tracker. It's not feasible for an API-based tool like Director to check whether all add-ons (IDO, IcingaDB, graphing solutions) are in sync after a deployment. Director is not the only software out there triggering Icinga restarts.

Keeping its data in a consistent state is without doubt a task that can only be accomplished by the Icinga daemon itself.

danielmicu commented 3 years ago

Sounds like the sensible approach, @Thomas-Gelf. Forgive my ignorance, but I don't know the process for moving this issue.

Adding some more details to the situation: we have seen deployments go ahead correctly (green checkmark), spaced 3 minutes apart, and still show the group membership behaviour. We're investigating further on our side, yet this seems to be related to post-deployment IDO updates which do not finish in time, or don't get applied before the next deployment/push; when the next deployment comes along, things remain in an inconsistent state.

We'll bring more details as we troubleshoot and dig further.

Al2Klimov commented 3 years ago

I don't see any (let's call it) "missing mutex" problem in the Icinga 2 code. But let's assume there is such a problem for now...

@Thomas-Gelf Stupid question: even if Icinga 2 handled those properly (see above), wouldn't this lead to more or less the same problems? I mean: user A adds and deploys host A, user B adds and deploys host B. Actually they deploy (let's call it) "merge base" + "changeset" as a whole. So "at best" either host A or B would get lost, wouldn't it? So wouldn't Director have to act more or less like a wiki here and say: hold on, user B, please "rebase" your changeset on the most recent "master"?
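
To make the analogy concrete, here is a hypothetical sketch of such a "rebase" guard; this is not an existing Director feature, just the compare-and-swap idea from the comment above:

```python
# Hypothetical sketch: a deploy request carries the checksum of the config
# version it was based on, and is rejected if someone deployed in between.
class StaleBaseError(Exception):
    pass


class DeploymentGate:
    def __init__(self, deployed_checksum: str):
        self.head = deployed_checksum  # checksum of the latest deployed config

    def deploy(self, based_on: str, new_checksum: str) -> None:
        if based_on != self.head:
            # the equivalent of: hold on, please rebase on the latest "master"
            raise StaleBaseError(
                f"config moved from {based_on} to {self.head}")
        self.head = new_checksum


gate = DeploymentGate("c0ffee")
gate.deploy(based_on="c0ffee", new_checksum="abc123")      # user A: accepted
try:
    gate.deploy(based_on="c0ffee", new_checksum="def456")  # user B: stale base
except StaleBaseError as err:
    print(err)
```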

Thomas-Gelf commented 3 years ago

Please don't make assumptions without trying it out. There is no such thing as "Deploying one host"; it's always all or nothing, and it is always a defined, specific, consistent version.

julianbrost commented 3 years ago

Has anyone actually looked at this in more detail? Like, where exactly are things missing? Only in IDO, or also in icinga2 itself (i.e. are they missing in icinga2 API responses)? This sounds like it could also be an IDO bug triggered by a restart at the wrong moment, or something like that.
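
One way to run that check is to ask the Icinga 2 API directly whether the core still sees the group membership; if it does while the web interface does not, the inconsistency sits in IDO. A sketch (the `/v1/objects/hosts` endpoint and the filter syntax are documented; the host, credentials, and group name are placeholders):

```python
# Sketch: query group membership straight from the Icinga 2 core to tell
# a core problem apart from an IDO sync problem. Placeholders throughout.
import requests

API = "https://icinga-master:5665/v1"  # assumed API endpoint
AUTH = ("director", "secret")          # assumed API user

resp = requests.post(
    f"{API}/objects/hosts",
    json={"filter": '"app-servers" in host.groups',  # placeholder group
          "attrs": ["groups"]},
    auth=AUTH, verify=False,  # sketch only: verify the CA in real use
    headers={"Accept": "application/json",
             "X-HTTP-Method-Override": "GET"},  # documented query pattern
)
for host in resp.json()["results"]:
    print(host["name"], host["attrs"]["groups"])
```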

Al2Klimov commented 3 years ago

There is no such thing as "Deploying one host", it's always all or nothing

That's exactly what I'm talking about!

Actually they deploy (let's call it) "merge base" + "changeset" as a whole.

Thomas-Gelf commented 2 years ago

Unless Icinga refuses deployments (preferably with a dedicated error message and/or status code) while being in a busy state, there is nothing we can do here, I guess. I'm closing this feature request for now; please do not hesitate to create a new one (or re-open this one) in case this changes.
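
If Icinga ever grew such a busy signal, the client side could look roughly like this; a hypothetical sketch, where the stage-upload endpoint is a real Icinga 2 API route but the 503 "busy" response it reacts to does not exist today:

```python
# Hypothetical sketch of the client side of the proposal above. Host and
# credentials are placeholders; the 503 handling is speculative.
import time

import requests

API = "https://icinga-master:5665/v1"  # assumed API endpoint
AUTH = ("director", "secret")          # assumed API user


def deploy_stage(package: str, files: dict[str, str]) -> str:
    """Upload a config stage, backing off while Icinga reports itself busy."""
    for attempt in range(5):
        resp = requests.post(f"{API}/config/stages/{package}",
                             json={"files": files}, auth=AUTH, verify=False)
        if resp.status_code == 503:    # hypothetical "still reloading" signal
            time.sleep(2 ** attempt)   # back off, then try again
            continue
        resp.raise_for_status()
        return resp.json()["results"][0]["stage"]
    raise RuntimeError("Icinga stayed busy, giving up")
```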