The Deployment request takes less than a second up to a very few seconds, but the Deployment carried out by Icinga can take minutes. There is no safe way of knowing whether a deployment is still ongoing, has failed, or will never return at all. That's why Config Packages are deployed to isolated directories, each with a dedicated outcome and startup log. A client could remember the response to its request to director/config/deploy and compare it to director/config/deployment-status. This would of course require a coordinated way of steering those API requests. No chance of doing so when you have hundreds of systems firing lots of autonomous deploy requests. But that's an imaginary nightmare scenario, isn't it? :rofl: Long story short: Director is only the "transport" for these requests and shouldn't take the responsibility for deciding whether Icinga is in the right mood to accept a deployment or not ;-)
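For illustration, a minimal sketch of what such client-side coordination could look like: fire a deploy via director/config/deploy and then poll director/config/deployment-status. Both endpoints are mentioned above, but the base URL, credentials, and the response fields checked here are assumptions, not the documented schema:

```python
import time
import requests

BASE = "https://icingaweb.example.com/icingaweb2/director"  # placeholder URL
AUTH = ("director-api-user", "secret")                      # placeholder credentials
HEADERS = {"Accept": "application/json"}

def deploy_and_wait(timeout=300, poll_interval=10):
    """Trigger a deployment and poll until Director reports an outcome."""
    requests.post(f"{BASE}/config/deploy", auth=AUTH, headers=HEADERS).raise_for_status()

    deadline = time.time() + timeout
    while time.time() < deadline:
        r = requests.get(f"{BASE}/config/deployment-status", auth=AUTH, headers=HEADERS)
        r.raise_for_status()
        status = r.json()
        # "status" is an assumed field name; the real payload may differ.
        if status.get("status") in ("deployed", "failed"):
            return status
        time.sleep(poll_interval)
    raise TimeoutError("deployment outcome still unknown after %d seconds" % timeout)
```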
> There is no safe way of knowing whether a deployment is still ongoing, has failed, or will never return at all.
Huh? If you fire a deploy, isn't there a spinner that sooner or later turns into a tick/x, along with an Icinga validation log?
> Huh? If you fire a deploy, isn't there a spinner that sooner or later turns into a tick/x, along with an Icinga validation log?
Yes, and as long as it is spinning, the background daemon tries to collect related information again and again. However, sometimes Icinga completely "forgets" about a Deployment. Probably when deploying twice, when being restarted by other means (Puppet, Ansible), or whatever. Director would then wait forever, and older versions did exactly that.
Recent versions apply the following logic: in case a more recent configuration has been deployed, and Icinga shows an outcome (status, startup.log) for that deployment, all former ones (which still show no outcome) are considered lost. Director then marks them as such in its DB and wipes the orphaned stages via the Icinga API.
So, if Director stopped deploying as requested, the first lost Deployment would result in a deadlock situation.
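For illustration, a minimal sketch of that cleanup logic, assuming hypothetical stand-ins for the Director database layer and the Icinga 2 API client (none of these names come from the actual Director code base):

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Deployment:
    stage_name: str
    timestamp: float
    outcome: Optional[str] = None  # e.g. status/startup.log contents; None = unknown

def mark_lost_deployments(deployments: List[Deployment], icinga_api, director_db):
    """Mark outcome-less deployments as lost once a newer one has an outcome.

    `icinga_api` and `director_db` are hypothetical stand-ins for the real
    Icinga 2 API client and the Director database layer.
    """
    newest_with_outcome = max(
        (d for d in deployments if d.outcome is not None),
        key=lambda d: d.timestamp,
        default=None,
    )
    if newest_with_outcome is None:
        return  # nothing has finished yet, keep waiting

    for d in deployments:
        if d.outcome is None and d.timestamp < newest_with_outcome.timestamp:
            director_db.mark_deployment_lost(d)                # assumed helper
            icinga_api.delete_stage("director", d.stage_name)  # wipe the orphaned stage
```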
What we see occasionally is that IDO is not able to finish syncing: hosts are present, hostgroups are present, and even though hosts are members of the hostgroup, the membership is not reflected in the interface. If we check the DBs, the Director one shows the correct membership, but the IDO one does not. To fix this, we simply remove the group membership from those hosts, do a deploy, then re-add the group to them.
For us the deploy takes ~32 secs, but the IDO / web interface is updated after ~3-4 min.
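For what it's worth, a rough sketch of that workaround against the Director REST API. The director/config/deploy endpoint is mentioned above and director/host is the usual host endpoint, but the exact payload semantics assumed here (POST with only the groups property modifying just that property), plus all URLs, credentials, host and group names, are assumptions to verify first:

```python
import requests

BASE = "https://icingaweb.example.com/icingaweb2/director"  # placeholder URL
AUTH = ("director-api-user", "secret")                      # placeholder credentials
HEADERS = {"Accept": "application/json"}

def set_host_groups(host, groups):
    # Assumption: POSTing a partial object to director/host?name=... modifies
    # only the given properties. Check against your Director version.
    r = requests.post(f"{BASE}/host", params={"name": host},
                      json={"groups": groups}, auth=AUTH, headers=HEADERS)
    r.raise_for_status()

def deploy():
    r = requests.post(f"{BASE}/config/deploy", auth=AUTH, headers=HEADERS)
    r.raise_for_status()

# The workaround described above: drop the membership, deploy, re-add, deploy again.
affected_hosts = ["host-a.example.com", "host-b.example.com"]  # placeholders
for h in affected_hosts:
    set_host_groups(h, [])
deploy()
for h in affected_hosts:
    set_host_groups(h, ["affected-hostgroup"])  # placeholder group name
deploy()
```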
The expected behaviour would be that a new deploy does not take place until all logical units of the previous one have completed.
@Al2Klimov: I'd propose to move this issue to the Icinga 2 issue tracker. It's not feasible for an API-based tool like the Director to check whether all add-ons (IDO, IcingaDB, graphing solutions) are in sync after a deployment. Director is not the only software out there triggering Icinga restarts.
Keeping its data in a consistent state is without doubt a task that can only be accomplished by the Icinga daemon itself.
Sounds like the sensible approach, @Thomas-Gelf. Forgive my ignorance, but I do not know the process for moving this issue over.
Adding some more details to the situation: we have seen deployments complete correctly (green checkmark), spaced 3 minutes apart, that still show the group membership behaviour. We're investigating further on our side, but this seems to be related to post-deployment IDO updates which do not finish in time / don't get applied before the next deployment or push comes along, and then remain in an inconsistent state.
Will bring more details as we troubleshoot and dig further.
I don't see any (let's call it) "missing mutex" problem in the Icinga 2 code. But let's assume there is such a problem for now...
@Thomas-Gelf Stupid question: even if Icinga 2 handled those (see above) properly, wouldn't this lead to more or less the same problems? I mean: user A adds+deploys host A, user B adds+deploys host B. Actually they deploy (let's call it) "merge base" + "changeset" as a whole. So "at best" either host A or B would get lost, wouldn't they? So wouldn't Director have to act more or less like a wiki here and say: hold on, user B, please "rebase" your changeset on the most recent "master"?
Please don't make assumptions without trying it out. There is no such thing as "Deploying one host", it's always all or nothing, and it is always a specific, well-defined, consistent version.
Has anyone actually looked at this in more detail? Like where exactly are things missing? Only in IDO, or also in icinga2 itself (i.e. are they missing in icinga2 API responses)? This sounds like it could also be an IDO bug triggered by a restart at the wrong moment, or something like that.
> There is no such thing as "Deploying one host", it's always all or nothing
That's exactly what I'm talking about!
> Actually they deploy (let's call it) "merge base" + "changeset" as a whole.
Unless Icinga refuses deployments (preferably with a dedicated error message and/or status code) while it is in a busy state, there is nothing we can do here, I guess. Closing the feature request for now; please do not hesitate to create a new one (or re-open this one) in case this changes.
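Purely for illustration, a hypothetical client-side sketch of what such a refusal could look like; neither Icinga nor Director currently returns the 503 "busy" status assumed here, and the URL and credentials are placeholders:

```python
import time
import requests

DEPLOY_URL = "https://icingaweb.example.com/icingaweb2/director/config/deploy"  # placeholder
AUTH = ("director-api-user", "secret")                                          # placeholder

def deploy_with_backoff(max_attempts=10, wait_seconds=30):
    """Retry a deployment while the server reports it is busy (hypothetical 503)."""
    for _ in range(max_attempts):
        r = requests.post(DEPLOY_URL, auth=AUTH, headers={"Accept": "application/json"})
        if r.status_code == 503:  # assumed "previous deployment still running" signal
            time.sleep(wait_seconds)
            continue
        r.raise_for_status()
        return r.json()
    raise RuntimeError("gave up: server stayed busy for too long")
```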
Expected Behavior
At best a deployment waits while others are running.
Current Behavior
Possible Solution
Steps to Reproduce (for bugs)
Your Environment