argoproj / argo-cd

Declarative Continuous Deployment for Kubernetes
https://argo-cd.readthedocs.io
Apache License 2.0
17.74k stars 5.41k forks

Application dependencies #7437

Open jessesuen opened 3 years ago

jessesuen commented 3 years ago

Summary

I was speaking with @jasonmorgan from Buoyant today about a missing feature in Argo CD for blocking application syncs based on required dependencies on other applications. The use case is:

  1. I need to deploy apps A and B
  2. B must not be deployed before A (because A has a mutating webhook which must be in place before B starts)
  3. I want to sync them all at the same time and don't want to think about clicking sync in some correct order

This is especially important for the bootstrapping use case where you're recreating a cluster from git, and you need to create many apps after a bunch of system-level add-ons are fully available. e.g. linkerd must be in place before any applications come up, because linkerd's mutating webhook needs to inject sidecars into application pods starting up.

The use case is very compelling and I'm convinced we should prioritize this. I think this feature, combined with ApplicationSets will really start to complete our bootstrapping story.

Motivation

Please give examples of your use case, e.g. when would you use this.

During cluster bootstrapping, cluster addons (especially ones with mutating webhooks) need to be in place before application pods can come up.

Proposal

How do you think this should be implemented?

It turns out, @jannfis already started some work on this, and the spec changes are close to what we need: https://github.com/argoproj/argo-cd/pull/3892

Given the age of the original PR, I'm filing an issue in case we abandon https://github.com/argoproj/argo-cd/pull/3892 for a new attempt, and targeting this for the tentative next milestone in case someone wants to pick it up.

leoluz commented 1 year ago

This is a highly voted proposal, and while I think the main use case (mutating webhook) makes some sense, I am also concerned about how this feature could promote anti-patterns in micro-service design.

The first example that comes to mind is the distributed monolith. Ideally (in a perfect world :) ), an application should be resilient enough to be deployed even when its dependencies aren't satisfied. A simple example is a service that depends on Prometheus infra to expose metrics. It doesn't really matter whether Prometheus is available on the cluster or not: the core functionality of the service is still available, and once the Prometheus infra is up, it will start scraping metrics without requiring the application to restart. If someone configures this service in Argo CD with a dependency on Prometheus, new syncs will be blocked whenever Prometheus is unavailable (maybe even when it is merely Degraded?), when they shouldn't be. This is a very simplistic example, but I am pretty sure there are many more ways this feature could be misused, which would make support much harder for Argo CD admins.

If the dependency graph is complex, with many apps and levels involved, how would users be able to visualize the dependency tree to understand what is causing their application to remain out-of-sync?

@jessesuen @jannfis

jannfis commented 1 year ago

If the dependency graph is complex, with many apps and levels involved, how would users be able to visualize the dependency tree to understand what is causing their application to remain out-of-sync?

In the most recent incarnation, if the sync is blocked by a dependency's state, it will be noted in the Application's .status field. So far, there are no plans on visualization, but the information is readily available in the Application CRs. The wait state will also be reflected in an Application's conditions, so the information is easily accessible from the UI.

leoluz commented 1 year ago

The wait state will also be reflected in an Application's conditions, so the information is easily accessible from the UI

I am sorry, but as far as I know the Application's status fields are not exposed in the UI. Am I wrong? It requires kubectl access to the cluster where the Applications are synced. Anyhow, let's put ourselves in the user's shoes: as a devops engineer, I pushed a change to git and my application remains out of sync. Even if I click the sync button, nothing happens. There is no place in the Argo CD UI that tells me why the application is not syncing. I have to call support, and the Argo CD admin must dig through the gigantic Application status field to find where the error is.

We are seeing many different support issues where the answer is in the resource's status field, but users just don't look at it. The direction we are going is to surface important status field data in the Argo CD UI to make it more user-friendly.

jannfis commented 1 year ago

@leoluz While waiting for any dependencies, it will look in the UI right now as follows:

[screenshots: the dependency wait state as shown in the Argo CD UI]

So no direct cluster access is required. Obviously, this information could be surfaced a little better. I'm open to suggestions, but I believe for an MVP this might be good enough.

shinebayar-g commented 1 year ago

I can use sync waves and App of App hierarchies to get everything to deploy in the right order when I bootstrap a cluster

Excuse me, how do you do this? I am using the App of Apps pattern and added argocd.argoproj.io/sync-wave: '-1' to the CRDs Application, but kube-prometheus-stack still started syncing before the CRDs were even installed.

apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  annotations:
    argocd.argoproj.io/sync-wave: '-1'
  name: kube-prometheus-stack-crds
  namespace: argocd
  finalizers:
    - resources-finalizer.argocd.argoproj.io

Edit: Found this really nice blog post that explains it. https://codefresh.io/blog/argo-cd-application-dependencies/

aiceball commented 11 months ago

@jannfis am I correct in understanding that your PR https://github.com/argoproj/argo-cd/pull/15280 would function for any application deployment strategy?

i.e. it would cover all of the following cases:

jannfis commented 11 months ago

@aiceball Yes, the dependency mechanism would be rather independent of the pattern you use to create/maintain your applications.

zs-dima commented 10 months ago

What about dependsOn for ApplicationSet elements?

apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: my-applications
  namespace: argocd
spec:
  generators:
    - list:
        elements:
          # Infrastructure
          - name: cert-manager
            path: infrastructure/networking/cert-manager
          - name: traefik
            path: infrastructure/networking/traefik
            dependsOn:
              - cert-manager
          - name: rancher
            path: infrastructure/system/rancher
            dependsOn:
              - traefik
          # Apps
          - name: n8n
            path: apps/n8n
            dependsOn:
              - traefik
  template:
    metadata:
      name: '{{name}}'
    spec:
      project: default
      source:
        repoURL: 'https://github.com/${GITHUB_USER}/${GITHUB_REPO}.git'
        targetRevision: HEAD
        path: '{{path}}'
      destination:
        server: 'https://kubernetes.default.svc'
        namespace: '{{name}}-system'

FluxCD has dependencies: https://fluxcd.io/flux/components/kustomize/kustomizations/#dependencies

Even Docker Compose and Docker Swarm have depends_on: https://docs.docker.com/compose/compose-file/compose-file-v3/#depends_on

christianh814 commented 10 months ago

@zs-dima There's already a way to do that with progressive syncs

https://argo-cd.readthedocs.io/en/stable/operator-manual/applicationset/Progressive-Syncs/

vvatlin commented 6 months ago

It's still impossible to guarantee ordering between apps. Sync waves don't work.

nneram commented 6 months ago

Hi @vvatlin, I can confirm that it's working, at least in the version I use, v2.8.4. I have an app-of-apps pattern with 11 applications (still growing) spread across nearly 7 waves. All you need is here: https://argo-cd.readthedocs.io/en/stable/operator-manual/cluster-bootstrapping/#app-of-apps-pattern. However, since v1.8 you need to add the Application health assessment (https://github.com/argoproj/argo-cd/issues/3781); otherwise it will not work.

For more information: https://argo-cd.readthedocs.io/en/stable/operator-manual/upgrading/1.7-1.8/. I think ApplicationSets are also an option, but I didn't look into that.

These are working solutions, but it would be easier with dependencies. I agree with that.

christianh814 commented 6 months ago

It's still impossible to guarantee ordering between apps. Sync waves don't work.

hey @vvatlin, I wrote a blog about getting sync waves working with App of Apps

vvatlin commented 6 months ago

I have app of apps and the health assessment as well, and my child apps still synchronize randomly. Argo CD 2.10.7.

chanakya-svt commented 2 months ago

Hi @vvatlin, I can confirm that it's working, at least in the version I use, v2.8.4. I have an app-of-apps pattern with 11 applications (still growing) spread across nearly 7 waves. All you need is here: https://argo-cd.readthedocs.io/en/stable/operator-manual/cluster-bootstrapping/#app-of-apps-pattern. However, since v1.8 you need to add the Application health assessment (#3781); otherwise it will not work.

For more information: https://argo-cd.readthedocs.io/en/stable/operator-manual/upgrading/1.7-1.8/. I think ApplicationSets are also an option, but I didn't look into that.

These are working solutions, but it would be easier with dependencies. I agree with that.

Hi @vvatlin, with the setup that's working for you, are you using ServerSideApply/ServerSideDiff in the ApplicationSet?

crenshaw-dev commented 2 weeks ago

I'm resistant to adding a dependsOn feature because I'm not confident that other features don't already cover the described use cases.

I've re-read every comment on this issue and a related issue to make sure I understand the use cases. Some are clear-cut, others are (currently) more vague. I'll group and address them.

Clear-cut use cases

CRDs need to be installed before CRs

Retries solve this. Apply both the CRD app and the CR app, and sync them both until the CR app's sync succeeds.
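For reference, a retry policy can be set per-Application under `syncPolicy.retry`. A minimal sketch (the app name and repo URL are hypothetical placeholders):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-crs            # hypothetical app containing the CRs
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://github.com/example/deploy.git   # placeholder repo
    targetRevision: HEAD
    path: crs
  destination:
    server: https://kubernetes.default.svc
    namespace: default
  syncPolicy:
    automated: {}
    retry:
      limit: 5            # give up after 5 attempts
      backoff:
        duration: 10s     # first retry after 10s
        factor: 2         # double the delay on each attempt
        maxDuration: 3m   # cap the delay at 3 minutes
```

Until the CRD app's sync has completed, the CR app's sync fails and is retried on this backoff schedule.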

Mutating webhooks need to exist before the resources they mutate

You can accomplish this by enabling the Application health check and using an app-of-apps with sync waves. The first wave installs the mutating webhook, and subsequent waves install the resources they mutate.
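Concretely, the Application health check is enabled with an `argocd-cm` entry (this is the Lua check from the Argo CD docs for restoring Application health assessment, which is off by default since v1.8):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  # Re-enable health assessment for the Application CRD
  resource.customizations.health.argoproj.io_Application: |
    hs = {}
    hs.status = "Progressing"
    hs.message = ""
    if obj.status ~= nil then
      if obj.status.health ~= nil then
        hs.status = obj.status.health.status
        if obj.status.health.message ~= nil then
          hs.message = obj.status.health.message
        end
      end
    end
    return hs
```

Each child Application in the app-of-apps then carries a wave annotation, e.g. `argocd.argoproj.io/sync-wave: "0"` on the webhook app and `"1"` on the apps containing resources to be mutated.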

You could also accomplish this by defining your apps in an AppSet and using progressive syncs to make sure the webhook app syncs before the apps that contain resources to be mutated.
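A sketch of that AppSet shape (names and repo URL are hypothetical; progressive syncs are alpha and must be enabled on the ApplicationSet controller):

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: addons                  # hypothetical
  namespace: argocd
spec:
  generators:
    - list:
        elements:
          - name: webhook
            tier: infra         # synced in step 1
          - name: workload
            tier: apps          # synced in step 2
  strategy:
    type: RollingSync
    rollingSync:
      steps:
        - matchExpressions:     # step 1: the mutating webhook
            - key: tier
              operator: In
              values: [infra]
        - matchExpressions:     # step 2: everything that gets mutated
            - key: tier
              operator: In
              values: [apps]
  template:
    metadata:
      name: '{{name}}'
      labels:
        tier: '{{tier}}'        # steps match on this label
    spec:
      project: default
      source:
        repoURL: https://github.com/example/deploy.git   # placeholder
        targetRevision: HEAD
        path: '{{name}}'
      destination:
        server: https://kubernetes.default.svc
        namespace: '{{name}}'
```

The steps match on labels stamped onto the generated Applications, so the `infra` app must finish syncing before the `apps` step begins.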

Resources need to be deleted before their controllers are deleted so that finalizers are handled

This can also be accomplished with an Application health check using an app-of-apps with sync waves. Apps are deleted in reverse-wave order. So if your controller is synced before the resources with finalizers, then deleting the parent app will delete the resource app(s) before deleting the controller app.

(Currently) vague use cases

X thing should be installed before Y thing (with no more specificity)

This is a more general form of the CRD, mutating webhooks, and finalizers use cases. But for each case I've read, it's unclear why 1) retries don't solve the problem, or 2) app-of-apps waves or AppSets+progressive syncs don't solve the problem.

Retries fail to solve the problem if and only if Argo CD can successfully create (or delete) the resources when they should not be created (or deleted). If a mutating webhook is missing, the mutated resources don't know they need to be mutated, so Argo CD successfully applies them when it should not. If a controller is missing, Argo CD will successfully delete resources even though they need their finalizers handled first.

In order for an "X before Y" problem to justify an ordering feature, we need an explanation of why retries fail, as they do for mutating webhooks and for finalizers.

And even then, to really justify a dependsOn feature, we need an explanation of why an app-of-apps or a progressively-synced AppSet would not solve the problem.

Changes to X should be synced before changes to Y

This style of use case requires order-of-syncing in addition to order-of-creation-and-deletion. Examples include "applying a database schema change in one app and using it in another" and "promoting a change from lower to higher environments."

This is exactly what the AppSet progressive syncs feature aims to do. But I don't think it's a silver bullet, and I'm not sure that GitOps is always (or even often) the best way to handle these use cases. For example, in the case of a database schema migration, if your schema change is decoupled from your code change, then your code must be compatible with both schemas - no amount of ordering in Argo CD can avoid downtime if you're running code that's incompatible with the deployed schema. So why rely on ordering at all?

If folks believe I'm missing something and that we need dependsOn for this style of use case, please explain the use case in depth.

Other reasons for dependsOn

Throughout this issue and others, there are a few other reasons given for adding dependsOn. I'll address each:

Argo CD doesn't do sync retries well

It's been argued that Argo CD doesn't handle retrying syncs very well, so eventual consistency isn't a realistic option. I don't think these are fundamental issues with the concept of retrying syncs, and I think we can improve our implementation to alleviate this problem.

The app-of-apps + sync waves experience is bad

Fair enough. You have to enable the app health check, craft an app-of-apps, and set numeric sync wave numbers. And after all that, the visualization of the ordering isn't great. The feature might also be buggy, and the docs aren't wonderful.

My recommendation is that, assuming the feature can actually solve the most common use cases, we should improve the feature's experience. We should consider enabling the health check by default, we should improve visualization, and we should write better docs. The experience may never be as good as dependsOn, but rather than risking two features with bad UX, we'd have one feature with an okay UX.

Flux has the feature

Flux is structured quite differently from Argo, and it's not clear to me that its implementation can be applied directly to Argo CD's design. Flux doesn't have the concepts of projects or apps-in-any-namespace, so the questions of how to construct, enforce, and visualize a dependency graph across tenancy units (problems we'd need to solve) don't arise there. In other words, "just do what Flux did" isn't really an option.

Summary

I'm not fundamentally opposed to the feature or to merging Jann's PR. But, if merged, the new feature will have bugs and usability issues, as all big new features do. If the only problems with the existing features are bugs and usability issues, I recommend we try to tackle those first, or explain why they fundamentally can't be solved.

rumstead commented 2 weeks ago

I agree that addressing the clear-cut use cases with retries, app-of-apps, and progressive syncs is practical. Providing clear, documented examples summarizing the many Slack threads, discussions, and issues would be very user-friendly, and could help surface any unhandled use cases that would justify the need for a dependsOn feature.

crenshaw-dev commented 2 weeks ago

Agreed, maybe the best way to demonstrate the need for dependsOn is to try to document how to get by without it and fix outstanding bugs in the other potential solutions.

crenshaw-dev commented 2 weeks ago

I've done a refresh of issues under the sync-waves label. https://github.com/argoproj/argo-cd/issues?q=is%3Aopen+is%3Aissue+label%3Async-waves+sort%3Areactions-%2B1-desc

kotyara85 commented 2 weeks ago

It's still impossible to guarantee ordering between apps. Sync waves don't work.

hey @vvatlin, I wrote a blog about getting sync waves working with App of Apps

I'm not 100% sure that's true. I did see apps in a specific wave go into a failed state, and Argo switched to the next wave anyway.

crenshaw-dev commented 2 weeks ago

@kotyara85 that sounds like a bug, not a fundamental issue with sync waves. If you can reproduce it, it would be great to have a new issue opened for that.

nathan-bowman commented 2 weeks ago

Hi @crenshaw-dev, within the context of "retries solve this" my only feedback is that many of my apps don't need retries. And, introducing retries might make it more difficult for me to spot problems with a deployment.

I can see myself missing a show-stopper issue because I assumed a bunch of apps were just retrying indefinitely, waiting for a line of other apps to finish their own retries and eventually turn green.

rouke-broersma commented 2 weeks ago

@nathan-bowman I would assume that retries would be configurable per application or even per resource because argocd can't know what is an acceptable retry.

crenshaw-dev commented 2 weeks ago

retries solve this

To be specific, retries solve the case of "thing A must exist before thing B can be successfully applied to the cluster." Retries don't solve all the cases described above.

many of my apps don't need retries

+1 to what @rouke-broersma said: retries are configurable on a per-app basis.

introducing retries might make it more difficult for me to spot problems with a deployment... assuming that a bunch of apps are just retrying indefinitely

That's a retry config / alerting issue. You can tune your alert to fire on the first failure, after N failures, after some timeout, etc. You'll definitely want to configure a max retry count or timeout.
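As a sketch, a Prometheus rule on Argo CD's sync metrics can implement the "after N failures" policy (the alert name and thresholds here are arbitrary choices, not anything Argo CD ships):

```yaml
groups:
  - name: argocd-sync
    rules:
      - alert: ArgoAppRetriesExhausting
        # argocd_app_sync_total counts sync attempts, labeled by result phase
        expr: |
          sum by (name) (
            increase(argocd_app_sync_total{phase=~"Error|Failed"}[15m])
          ) > 3
        labels:
          severity: warning
        annotations:
          summary: "App {{ $labels.name }} failed to sync more than 3 times in 15m"
```

Tightening the window or threshold moves the alert toward "fire on the first failure"; widening it tolerates longer retry loops.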

Both these problems also apply to a dependsOn feature. If "things are stuck" and you don't have good alerting for things being stuck, you'll likely miss that things are stuck.

nathan-bowman commented 2 weeks ago

@rouke-broersma This is sort of my heartburn with going the "retries" route. I have apps that should never have to retry, and if I go down this path I'm now saddled with chasing a moving target of "how many retries, and for how long" for each app.

crenshaw-dev commented 2 weeks ago

I'm just not sure how the task of configuring appropriate retries is significantly more challenging than figuring out which apps to configure with dependsOn and then figuring out how to properly alert when a dependency app is blocking a dependent app. You have to "find and configure the apps" either way - the only difference is what kind of configuration you apply once you find them.

nathan-bowman commented 2 weeks ago

For me personally, telling an app "wait until this other app(s) is healthy before you deploy" is vastly different from "now I need to make a bunch of apps retry for 5 minutes longer since I added this other app in the chain of requirements".

Hopefully that makes sense...

crenshaw-dev commented 2 weeks ago

Sure, I can understand how the experience would suffer if you need to configure, say, 50 apps with retries (which may not even be useful after the initial deployment). In that case, I think an app-of-apps with sync waves could be preferable. It's almost identical overhead. Just instead of setting dependsOn: [prereq-app] on each app, you'd set the sync wave annotation.

nathan-bowman commented 2 weeks ago

Sure, I can understand how the experience would suffer if you need to configure, say, 50 apps with retries (which may not even be useful after the initial deployment). In that case, I think an app-of-apps with sync waves could be preferable. It's almost identical overhead. Just instead of setting dependsOn: [prereq-app] on each app, you'd set the sync wave annotation.

I think I agree with you here, I'll have to do some testing with sync waves.

simonoff commented 1 week ago

Sure, I can understand how the experience would suffer if you need to configure, say, 50 apps with retries (which may not even be useful after the initial deployment). In that case, I think an app-of-apps with sync waves could be preferable. It's almost identical overhead. Just instead of setting dependsOn: [prereq-app] on each app, you'd set the sync wave annotation.

One question: I have a case with a few ApplicationSets that have dependencies between them. For example, before we deploy the main app we need to deploy a setup app, which creates secrets, buckets, etc. Sync waves are not working there at all. Or do I need to set quite large wave numbers on the apps?

rouke-broersma commented 1 week ago

One question: I have a case with a few ApplicationSets that have dependencies between them. For example, before we deploy the main app we need to deploy a setup app, which creates secrets, buckets, etc. Sync waves are not working there at all. Or do I need to set quite large wave numbers on the apps?

In that case why is retry and eventual consistency not sufficient? If a secret is missing, kubernetes will not schedule your pod. Once it is there, it will be scheduled. This is how kubernetes is designed and you should preferably architect your solutions to take full advantage of this. Hard dependencies with wait times are not the optimal solution.

simonoff commented 1 week ago

In that case why is retry and eventual consistency not sufficient? If a secret is missing, kubernetes will not schedule your pod. Once it is there, it will be scheduled. This is how kubernetes is designed and you should preferably architect your solutions to take full advantage of this. Hard dependencies with wait times are not the optimal solution.

The problem is that they run in parallel. The wave settings are just not working: even if I set wave -1 on the setup app and wave 10 on the main app, they still run at the same time. Yes, I know that I can check for secrets and configmaps, and I'm doing that, but it makes the deployment much longer due to the retries.

blakepettersson commented 1 week ago

I'd like to give my two cents.

I'm resistant to adding a dependsOn feature because I'm not confident that other features don't already cover the described use cases.

I totally understand your position being the lead and steward of Argo CD, and I understand not wanting to add a bunch of ad-hoc features to the codebase. I don't think this is one of them though.

The app-of-apps + sync waves experience is bad

Fair enough. You have to enable the app health check, craft an app-of-apps, and set numeric sync wave numbers. And after all that, the visualization of the ordering isn't great. The feature might also be buggy, and the docs aren't wonderful.

IMO I don't think "bad" suffices. Apart from what you have mentioned, there are now two ways of doing dependency management, and they aren't compatible with each other: which one applies depends on whether you use an ApplicationSet or a plain old Application. The way I like to think of ApplicationSets is that they are a superset of a normal Application: everything a normal Application can do, an ApplicationSet should be able to do. As far as I know this is true for all Argo CD features - except dependency management.

In addition, Progressive Syncs are currently an alpha feature, which means an Argo CD admin has to enable it before it can be used (not a major blocker, but something to keep in mind). Progressive Syncs being alpha does give a bit of an opening, though... for it to be replaced with a proper dependency management feature.

My recommendation is that, assuming the feature can actually solve the most common use cases, we should improve the feature's experience. We should consider enabling the health check by default, we should improve visualization, and we should write better docs. The experience may never be as good as dependsOn, but rather than risking two features with bad UX, we'd have one feature with an okay UX.

I think rather than having two features with bad UX, we would have one feature (or really two, since we can't get rid of sync waves) with a better UX. IMO Progressive Syncs should be removed in favor of this one. Sadly the ship has sailed on sync waves, but we could at least try to deprecate it and, in a future far, far away, maybe hope to remove it someday.

crenshaw-dev commented 1 week ago

The wave settings are just not working

@simonoff if the waves simply aren't being enforced, then I think we need a new issue to investigate that problem.

it causes that deployment is much longer due to the retries

The cost of ordered dependencies is T_a + T_b, where those are the times of the prerequisite and of the dependent resources, respectively. The cost with retries is T_a + T_r + T_b, where T_r is the time spent after the prerequisite is satisfied waiting for the next retry to occur. If T_a is pretty short and if your retries are configured with a not-too-aggressive backoff, then the additional time cost shouldn't be high.
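For example, a not-too-aggressive backoff keeps T_r small (a fragment of an Application spec; the numbers are illustrative):

```yaml
syncPolicy:
  retry:
    limit: 10
    backoff:
      duration: 5s     # first retry comes quickly...
      factor: 2
      maxDuration: 1m  # ...and T_r never exceeds one minute
```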

  • Can you use sync waves with ApplicationSets? ❌
  • Can you use progressive syncs with Applications? ❌
  • When using an ApplicationSet, can you depend on another ApplicationSet when using Progressive Syncs? ❌

Sync waves and progressive syncs are both context-specific (the context being either a parent app or an AppSet). dependsOn solves these three problems by being a global ordering mechanism. Of course, making ordering global also makes it more difficult to inspect, which is why I think dependsOn must be accompanied by a UI/CLI to calculate and visualize the DAG. It shouldn't take an hour and a whiteboard to determine "why isn't my app syncing?"

So I do see why having a unified, global ordering mechanism is desirable, especially if it can cover all the use cases of sync waves and possibly completely replace Progressive Syncs.

But it's not obvious to me that we have to solve these limitations of the existing contextual ordering mechanisms in order to solve the concrete use cases described above (CRDs, webhooks, and finalizers).

If there are other use cases which do require a global ordering mechanism, then those use cases need to be described in detail so we can evaluate the dependsOn feature's suitability.

And if it ultimately is solely about providing a better UX and not about solving some otherwise unsolved use case, then we have to weigh the costs and UX benefits of the new feature against the cost and UX benefits of improving the existing ones. I think we'd need a document that says:

  1. Here's how CRDs, webhooks, and finalizers are solved with retries, sync waves, and progressive syncs. Here are the ways in which the experiences are bad. Here are the ways we could improve those experiences and an estimate of the work involved.
  2. Here's how CRDs, webhooks, and finalizers would be solved with dependsOn. Here are the potential risks associated with that experience and the estimated cost to build and maintain the feature.

simonoff commented 1 week ago

Yeah, you're covering an "ideal" world where every app is developed to standards, best practices, and so on. But the real world is never 100%. By the way, if there were a possibility to make a dependency on a Kubernetes manifest, that would resolve all the issues: we start the app only once specific resources exist in Kubernetes, with specific fields matching (AND/OR). For example, I need to start pushing an app only when DNS records are ready, or when a certificate request is done and completed. That would be much more usable for me, and it would give the ability to implement "dependencies" based on existing Kubernetes resources. For example, the existence of an s3bucket resource in Kubernetes is not enough; you also need to check the status that it has actually been created on S3.

crenshaw-dev commented 1 week ago

Yeah, you're covering an "ideal" world where every app is developed to standards, best practices, and so on. But the real world is never 100%.

I don't think I'm assuming developers adhere to any particular best practices. I'm just trying to understand, "In what concrete scenarios do retries, sync waves, and progressive syncs all fail to solve the problem?", whether that scenario represents best practices or not. I could be missing something, but so far I haven't found a detailed articulation of a use case where global ordering is necessary, or even of why one of the existing contextual ordering mechanisms is unacceptably difficult to use. I accept that there could be such use cases. But I think that, in order to justify a large and potentially redundant feature, those use cases ought to be explained in some depth.

if there were a possibility to make a dependency on a Kubernetes manifest, that would resolve all the issues

This is already possible. Both sync waves and progressive syncs will block on an app being out of sync (Kubernetes resource not applied) or some resource's field not being what it's expected to be (health check failed).

For example, I need to start pushing an app only when DNS records are ready, or when a certificate request is done and completed... the existence of an s3bucket resource in Kubernetes is not enough; you also need to check the status that it has actually been created on S3.

There are already CRDs that represent things like DNS records, Certificate requests, and S3 buckets. With a properly-configured health check Argo CD will block sync waves and progressive syncs waiting on those resources to resolve.

Argo CD has built-in health checks for all three of these examples: DNS, Certs, and Buckets.
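For cases the built-in checks don't cover, a custom Lua health check can be added in argocd-cm. A sketch using cert-manager's Certificate as the example resource (Argo CD actually ships a built-in check for it; this just shows the pattern of gating on a Ready condition):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: argocd-cm
  namespace: argocd
data:
  resource.customizations.health.cert-manager.io_Certificate: |
    hs = {}
    if obj.status ~= nil and obj.status.conditions ~= nil then
      for _, c in ipairs(obj.status.conditions) do
        -- Healthy only once the issuer reports the cert as Ready
        if c.type == "Ready" and c.status == "True" then
          hs.status = "Healthy"
          hs.message = c.message
          return hs
        end
      end
    end
    hs.status = "Progressing"
    hs.message = "Waiting for certificate to be issued"
    return hs
```

While the check reports Progressing, the app stays in a progressing state and later sync waves (or progressive sync steps) are blocked.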

rouke-broersma commented 1 week ago

The wave settings are just not working

@simonoff if the waves simply aren't being enforced, then I think we need a new issue to investigate that problem.

Since they are talking about setting a large gap between sync wave numbers, I don't think they fully understand sync waves. Most likely their problem is that the resource is already synced but doesn't have a proper health check for their scenario, so the next wave starts too soon for their use case.

@simonoff Sync waves are an ordering only; there is no time delay based on the numbers you select.

Choosing waves -1 and 10, or 500 and 100000, is the same thing if those are your only sync waves: it just means that -1 comes before 10, or 500 comes before 100000. It is up to the wave itself to be in a Progressing state, via health checks, to block the next wave from starting. If your Application in wave 1 is Healthy and the sync finishes before your app is online and has created the secrets or buckets or whatever dependency you need, then the next wave will start before the dependencies are met.

There are already CRDs that represent things like DNS records, Certificate requests, and S3 buckets. With a properly-configured health check Argo CD will block sync waves and progressive syncs waiting on those resources to resolve.

Argo CD has built-in health checks for all three of these examples: DNS, Certs, and Buckets.

To be fair, these are only a tiny fraction of the potential ways of configuring such resources; the number of needed health checks is potentially near-infinite. We, for example, use externaldns with ingress-shim. An Ingress does not receive a status update from externaldns on whether or not DNS is configured, so I would potentially have to change the tools I use if my use case required DNS to be in place before the next wave starts.

Now, I don't personally agree that this is a way of working that should be supported in a Kubernetes ecosystem; in my opinion it is antithetical to GitOps and to using ArgoCD in the first place. If you use tools like ArgoCD (and, imo, Kubernetes), then you should always strive for reconciliation towards desired state over configuration. I personally prefer that this be enforced as much as possible and am very much against dependency configuration.

sidewinder12s commented 1 week ago

One of the biggest blockers for us using AppSets with Progressive Sync is the following:

We currently use App of Apps to install cluster components. There are dependencies between apps on each cluster, largely as articulated above: we need CRDs installed and/or resources from earlier apps in place before we can sync later apps. This can be somewhat solved with sync waves, but there are UX issues with sync waves that could be tackled to make this better, or at least better examples of how to tackle it could be documented.

We want to move to AppSets so that we can deploy each cluster component as a unit using the AppSet.

We had further been waiting for ProgressiveSync so that we could order the deployments and catch any issues in lower environments before we get to production. I believe this is possible now, but again, the feature is alpha and requires us to completely rework our applications, which is a lot to ask with no ability to map the dependency ordering within the cluster.

crenshaw-dev commented 1 week ago

> the number of needed health checks is potentially near infinite

Yep! The examples are just to show that these specific use cases aren't especially difficult to solve today.

And of course, dependsOn doesn't help us escape the need for health checks. The only way Argo CD knows "can I move on" is to know "are the dependencies synced and healthy."

crenshaw-dev commented 1 week ago

> This is one of our biggest issues, we have large clusters with apps tracking daemonsets where we are scheduling or scaling constantly and this action will block sync waves because they are continuously progressing

This is a common class of issue with app-of-apps + sync waves: the app health check doesn't handle X scenario. There's also reason to believe that the recommended default App health check should be improved.

For your use case, I'd explore building a custom App health check that can detect the stuck DaemonSet and, if appropriate, ignore it. There may or may not be enough information on the Application status to make that determination, but I think it's worth looking into.
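As a rough sketch of that idea, and assuming per-resource health is visible in the Application's `status.resources` (an assumption worth verifying for your Argo CD version), a custom Application health check in `argocd-cm` might look something like this (untested, illustrative only):

```yaml
# Sketch: treat an Application as Healthy when the only resources still
# Progressing are DaemonSets (e.g. constant scheduling on large clusters).
data:
  resource.customizations.health.argoproj.io_Application: |
    hs = { status = "Progressing", message = "" }
    if obj.status ~= nil and obj.status.health ~= nil then
      hs.status = obj.status.health.status
      hs.message = obj.status.health.message or ""
      if hs.status == "Progressing" and obj.status.resources ~= nil then
        local onlyDaemonSets = true
        for _, res in ipairs(obj.status.resources) do
          if res.health ~= nil and res.health.status == "Progressing"
             and res.kind ~= "DaemonSet" then
            onlyDaemonSets = false
          end
        end
        if onlyDaemonSets then
          hs.status = "Healthy"  -- ignore perpetually-progressing DaemonSets
        end
      end
    end
    return hs
```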

As you noted, the problems mentioned are things that need to be improved in sync waves / progressive syncs. They're problems that would apply equally to a dependsOn feature, because that feature would depend on the same health checks.

> We further had been waiting for ProgressiveSync so that we could order the deployments so that we're able to catch any issues in lower environments before we get to production

Environment promotion in GitOps is a nascent space. I find Progressive Syncs unsatisfying for most promotion use cases, because they rely on intentional drift (apps remain out of sync with git until the promotion is complete). A lot of tools are emerging for promotion (Telefonistka, Kargo, Codefresh Products, gitops-promoter), and I think it'll be a while before we have a clear idea of the best approach(es) to environment promotion in GitOps. If sync waves or progressive syncs work for some use cases, great! (We use progressive syncs to deploy the Argo CD metrics extension through environments at Intuit.) But we shouldn't consider promotion a core use case for those features yet.

sambonbonne commented 1 week ago

My problem with app-of-apps + sync waves for applications synchronization is more a UX problem: having a dependsOn field (or something like this) in applications would make it far easier to understand the dependencies between applications.

When I use sync waves, I write a comment above the annotation to explain why the resource (in this case, the Application) is in that sync wave (e.g. "sync before XXX but after YYY"), but it is not intuitive. I also find it error-prone: if I want to change the sync wave of one application, I have to check whether dependent applications also need a change, and if so, keep checking transitively down to the last dependent application.

I'm not sure a dependsOn field would be the best solution, but right now it would be more intuitive than sync waves for app-of-apps.
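For discussion's sake, a `dependsOn` field might look something like the following. This is purely hypothetical: no such field exists in Argo CD today, and the names are made up.

```yaml
# Hypothetical spec sketch only -- not a real Argo CD field.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: my-workload        # hypothetical app
spec:
  dependsOn:               # hypothetical field
    - name: linkerd        # must be synced and Healthy first
    - name: cert-manager
```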

zoop-btc commented 1 week ago

Just throwing an idea out here: what about a sync-wave CRD you could supply with an array of app names? ArgoCD could then use the CRD to create the sync waves, numbering and all, internally. This would alleviate the annotation juggling.
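Sketching that idea out, such a CRD might look like this. The kind and every field below are entirely made up; nothing like this exists in Argo CD.

```yaml
# Hypothetical CRD sketch for the idea above -- not a real resource.
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSyncOrder      # hypothetical kind
metadata:
  name: cluster-bootstrap
spec:
  waves:                        # each entry becomes one internal sync wave
    - apps: [cert-manager, linkerd]   # wave 1: addons with webhooks
    - apps: [app-a, app-b]            # wave 2: workloads
```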