Summary
Suspended Jobs, which are marked with .spec.suspend, became a stable feature in Kubernetes v1.24 (K8s Doc). The documentation describes them as follows:
When a Job is created, the Job controller will immediately begin creating Pods to satisfy the Job's requirements and will continue to do so until the Job is complete. However, you may want to temporarily suspend a Job's execution and resume it later, or start Jobs in suspended state and have a custom controller decide later when to start them.
To suspend a Job, you can update the .spec.suspend field of the Job to true; later, when you want to resume it again, update it to false. Creating a Job with .spec.suspend set to true will create it in the suspended state.
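As a rough illustration of the toggle described above, here is a minimal sketch that builds the patch body for flipping .spec.suspend. The helper name `suspend_patch` is hypothetical and only for illustration; the field path comes from the Kubernetes documentation quoted above.

```python
import json


def suspend_patch(suspend: bool) -> str:
    """Return a JSON merge-patch body that sets .spec.suspend on a Job.

    Hypothetical helper for illustration; it only builds the patch body
    and does not talk to a cluster.
    """
    return json.dumps({"spec": {"suspend": suspend}})


if __name__ == "__main__":
    # Body to resume a suspended Job.
    print(suspend_patch(False))
```

The resulting string could be passed to something like `kubectl patch job <name> --type=merge -p '<body>'`, which is roughly what "simply enabling the job" amounts to.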
Unfortunately, Argo does not report apps containing these Jobs as "Healthy"; it reports them as "Suspended". This has caused issues with some scripts, and also UX issues (why doesn't the app work?).
Motivation
In our particular case we are creating Jobs that our Ops team can use in DR or other scenarios. Argo is configured to ignoreDifferences on the suspend field, so if something comes up they can simply enable the Job. The Jobs depend on application code and are deployed as part of a Helm chart with the service.
As an example, imagine you are running some kind of database in Argo, say MySQL. The Job might be "take an S3 backup". It's part of the chart with MySQL, and just stays there in case the Ops team needs to run it. Monitors we have on Apps that aren't healthy, or scripts, or even users in Argo, just see the MySQL App, and see uh oh... it's suspended.
Proposal
I'm unclear about what the motivation was for changing the App health to Suspended. Based on #4838, #10600, #11603, and #11626 (which I only skimmed), it was a fix for the fact that by default these Jobs don't run to completion, which Argo was waiting for as part of a health check.
I want to push for one of two options; I'm not particularly opinionated about which:
Option 1: In my possibly naive view, the Suspended state is maybe never appropriate. As I see it, the Application is the set of resources that Argo is keeping in sync. I understand why Argo cares about Jobs completing for app health, for Jobs that should run. But here Argo successfully synced a Job whose manifest says "don't run"; something else is responsible for running it. This is precisely a use case that the K8s manual talks about (you or a custom controller flipping the state). I would (and again I have a limited perspective, so don't put too much weight on what I'm saying, I'm just offering an idea) maybe suggest that, as with the breaking change for Applications, Argo just removes Suspended. Honestly, the old behavior for apps makes more sense to me than the new one: some child resources being unhealthy (say an App in an App-of-Apps) doesn't bubble up, but a Job being suspended does bubble up to the App.
A small change to this view would be to change the voting algorithm such that an App is Suspended only if everything in it is Suspended; if there is one Healthy resource, then it should report Healthy.
A further argument that I had (that developed into an idea) is that this is as if Argo cared whether I deployed a ReplicaSet or a Deployment with a replica count of zero. These don't do anything, but it's exactly what was asked for, so Argo shouldn't care for health reasons. That made me wonder whether it might in fact make the Suspended state more useful (when coupled with the previous idea) to mark these kinds of resources as Suspended as well. So an App would be Suspended if, say, the Jobs are suspended and all the Deployments have zero replicas.
Option 2: Have an annotation that lets the Job be marked as Healthy even though it's Suspended. I'm aware that there are Lua custom health checks, but that's maybe more customization than we want to deal with at this time, and I don't see (and could be wrong) why people care about this state from an Argo CD perspective.
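To make the two options a bit more concrete, here is a minimal sketch of how the voting could work. Everything in it is an assumption for illustration: the severity order, the function names, and especially the annotation key, which is made up and not something Argo CD defines. It also generalizes the "one Healthy resource" idea slightly, in that a Suspended resource never outvotes a non-suspended one.

```python
from typing import Iterable, Mapping, Optional

# Assumed severity order, least to most severe. Illustrative only; not taken
# from Argo CD's source.
SEVERITY = ["Healthy", "Suspended", "Progressing", "Missing", "Degraded", "Unknown"]

# Hypothetical annotation key for option 2; Argo CD defines no such annotation.
TREAT_AS_HEALTHY = "example.com/suspended-is-healthy"


def effective_status(status: str, annotations: Optional[Mapping[str, str]] = None) -> str:
    """Option 2: report a Suspended resource as Healthy when it opts in."""
    if status == "Suspended" and (annotations or {}).get(TREAT_AS_HEALTHY) == "true":
        return "Healthy"
    return status


def aggregate_health(statuses: Iterable[str]) -> str:
    """Option 1 variant: Suspended wins only if every resource is Suspended."""
    statuses = list(statuses)
    if not statuses:
        return "Healthy"  # assumption: an empty app counts as Healthy
    active = [s for s in statuses if s != "Suspended"]
    if not active:
        return "Suspended"
    # Otherwise the most severe non-suspended status decides.
    return max(active, key=SEVERITY.index)
```

With this rule, the MySQL example (a Healthy database plus a suspended backup Job) would report Healthy, while an App containing only suspended Jobs would still report Suspended. The zero-replica idea would fit in by mapping such Deployments to "Suspended" before aggregation, which is not shown here.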