argoproj / argo-workflows

Workflow Engine for Kubernetes
https://argo-workflows.readthedocs.io/
Apache License 2.0

`workflow-controller` wrongly assumes it has missed a CronWorkflow execution #12625

Open partomatl opened 9 months ago

partomatl commented 9 months ago

Pre-requisites

What happened/what did you expect to happen?

We are running Argo Workflows on a GKE Autopilot cluster. We have two instances of the workflow-controller. Our CronWorkflows are configured with `startingDeadlineSeconds: 60` and `concurrencyPolicy: Replace`. Usually, everything works well.

But sometimes, some CronWorkflows start failing: they are terminated by the workflow-controller. What happens is this: the workflow-controller handles a CronWorkflow on schedule and starts a workflow. Moments later, the workflow-controller thinks it has missed the cron schedule and, since we are still within the `startingDeadlineSeconds` window, it launches a new workflow. Because the `concurrencyPolicy` is `Replace`, the controller kills the first workflow. It then starts a new workflow to replace the first one, but that one is also killed. In the end, the CronWorkflow never runs successfully.
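
For illustration, a minimal CronWorkflow shaped like ours is sketched below; the name, schedule, and container image are placeholders rather than our real (redacted) spec:

apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: cron-redacted            # placeholder name
  namespace: argo
spec:
  schedule: "0 9 * * *"          # placeholder; this one fires at 09:00 UTC
  concurrencyPolicy: Replace     # a newly started run replaces (terminates) any active run
  startingDeadlineSeconds: 60    # a missed schedule may still be started up to 60s late
  workflowSpec:
    entrypoint: main
    templates:
      - name: main
        container:
          image: alpine:3.19     # placeholder image
          command: [sh, -c, "echo hello && sleep 300"]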

For instance, this morning `cron-redacted` was supposed to run at 09:00 UTC. The workflow-controller logs for this CronWorkflow are attached below.

It seems this problem occurs following an automatic update on the GKE Autopilot cluster. Some CronWorkflows (~15% of the scheduled workflows) started failing after this automatic operation on our cluster:

[
  {
    "serviceName": "container.googleapis.com",
    "methodName": "google.container.internal.ClusterManagerInternal.PatchCluster",
    "metadata": {
      "operationType": "UPDATE_CLUSTER"
    }
  }
]

According to https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-upgrades, UPDATE_CLUSTER means an update not changing the Kubernetes control plane version and google.container.v1.ClusterManager.PatchCluster means a cluster configuration change.

It seems that following such an update, the controller loses track of which CronWorkflows have been executed or are currently running. Everything went back to normal after a workflow-controller restart. I imagine we could set `startingDeadlineSeconds: 0` for our CronWorkflows to avoid the issue, but I'd like to know if this problem rings any bells and if it's fixable.

Obviously this is hard to reproduce, but I'll be happy to provide any useful logs from our cluster.

Version

v3.4.6

Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.

N/A

Logs from the workflow controller

time="2024-02-05T08:59:03.335Z" level=info msg="Enforcing history limit for 'cron-redacted'" namespace=argo workflow=cron-redacted
time="2024-02-05T08:59:13.837Z" level=info msg="Enforcing history limit for 'cron-redacted'" namespace=argo workflow=cron-redacted
time="2024-02-05T08:59:24.313Z" level=info msg="Enforcing history limit for 'cron-redacted'" namespace=argo workflow=cron-redacted
time="2024-02-05T08:59:35.103Z" level=info msg="Enforcing history limit for 'cron-redacted'" namespace=argo workflow=cron-redacted
time="2024-02-05T08:59:45.620Z" level=info msg="Enforcing history limit for 'cron-redacted'" namespace=argo workflow=cron-redacted
time="2024-02-05T09:00:00.002Z" level=info msg="Running cron-redacted" namespace=argo workflow=cron-redacted
time="2024-02-05T09:00:01.839Z" level=info msg="Processing argo/cron-redacted" cronWorkflow=argo/cron-redacted
time="2024-02-05T09:00:07.038Z" level=info msg="cron-redacted missed an execution at Mon Feb 5 09:00:00 2024 and is within StartingDeadline" namespace=argo workflow=cron-redacted
time="2024-02-05T09:00:07.038Z" level=info msg="Running cron-redacted" namespace=argo workflow=cron-redacted
time="2024-02-05T09:00:08.697Z" level=info msg="Processing workflow" namespace=argo workflow=cron-redacted-1707123600
time="2024-02-05T09:00:09.894Z" level=info msg="Updated phase -> Running" namespace=argo workflow=cron-redacted-1707123600
time="2024-02-05T09:00:09.916Z" level=info msg="Created PDB resource for workflow." namespace=argo workflow=cron-redacted-1707123600
time="2024-02-05T09:00:09.917Z" level=info msg="Retry node cron-redacted-1707123600 initialized Running" namespace=argo workflow=cron-redacted-1707123600
time="2024-02-05T09:00:09.917Z" level=info msg="Steps node cron-redacted-1707123600-1510020045 initialized Running" namespace=argo workflow=cron-redacted-1707123600
time="2024-02-05T09:00:09.917Z" level=info msg="StepGroup node cron-redacted-1707123600-1621000785 initialized Running" namespace=argo workflow=cron-redacted-1707123600
time="2024-02-05T09:00:09.919Z" level=info msg="Retry node cron-redacted-1707123600-2552572314 initialized Running" namespace=argo workflow=cron-redacted-1707123600
time="2024-02-05T09:00:09.919Z" level=info msg="DAG node cron-redacted-1707123600-3917452609 initialized Running" namespace=argo workflow=cron-redacted-1707123600
time="2024-02-05T09:00:09.920Z" level=info msg="All of node cron-redacted-1707123600(0)[0].cron-redacted(0).redacted dependencies [] completed" namespace=argo workflow=cron-redacted-1707123600
time="2024-02-05T09:00:09.921Z" level=info msg="Retry node cron-redacted-1707123600-2043635159 initialized Running" namespace=argo workflow=cron-redacted-1707123600
time="2024-02-05T09:00:09.921Z" level=info msg="Pod node cron-redacted-1707123600-2684772554 initialized Pending" namespace=argo workflow=cron-redacted-1707123600
time="2024-02-05T09:00:09.992Z" level=info msg="Created pod: cron-redacted-1707123600(0)[0].cron-redacted(0).redacted(0) (cron-redacted-1707123600-default-2684772554)" namespace=argo workflow=cron-redacted-1707123600
time="2024-02-05T09:00:09.992Z" level=info msg="Workflow step group node cron-redacted-1707123600-1621000785 not yet completed" namespace=argo workflow=cron-redacted-1707123600
time="2024-02-05T09:00:09.993Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=cron-redacted-1707123600
time="2024-02-05T09:00:09.993Z" level=info msg=reconcileAgentPod namespace=argo workflow=cron-redacted-1707123600
time="2024-02-05T09:00:10.099Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=394021267 workflow=cron-redacted-1707123600
time="2024-02-05T09:00:10.391Z" level=info msg="Processing argo/cron-redacted" cronWorkflow=argo/cron-redacted
time="2024-02-05T09:00:10.988Z" level=info msg="cron-redacted missed an execution at Mon Feb 5 09:00:00 2024 and is within StartingDeadline" namespace=argo workflow=cron-redacted
time="2024-02-05T09:00:10.988Z" level=info msg="Running cron-redacted" namespace=argo workflow=cron-redacted
time="2024-02-05T09:00:12.290Z" level=info msg="Enforcing history limit for 'cron-redacted'" namespace=argo workflow=cron-redacted
time="2024-02-05T09:00:19.994Z" level=info msg="Processing workflow" namespace=argo workflow=cron-redacted-1707123600
time="2024-02-05T09:00:19.998Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=0 workflow=cron-redacted-1707123600
time="2024-02-05T09:00:19.998Z" level=info msg="node changed" namespace=argo new.message= new.phase=Running new.progress=0/1 nodeID=cron-redacted-1707123600-2684772554 old.message= old.phase=Pending old.progress=0/1 workflow=cron-redacted-1707123600
time="2024-02-05T09:00:20.000Z" level=info msg="Workflow step group node cron-redacted-1707123600-1621000785 not yet completed" namespace=argo workflow=cron-redacted-1707123600
time="2024-02-05T09:00:20.000Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=cron-redacted-1707123600
time="2024-02-05T09:00:20.000Z" level=info msg=reconcileAgentPod namespace=argo workflow=cron-redacted-1707123600
time="2024-02-05T09:00:20.018Z" level=info msg="Workflow update successful" namespace=argo phase=Running resourceVersion=394021420 workflow=cron-redacted-1707123600
time="2024-02-05T09:00:23.414Z" level=info msg="Enforcing history limit for 'cron-redacted'" namespace=argo workflow=cron-redacted
time="2024-02-05T09:00:23.426Z" level=info msg="Processing argo/cron-redacted" cronWorkflow=argo/cron-redacted
time="2024-02-05T09:00:23.510Z" level=info msg="cron-redacted missed an execution at Mon Feb 5 09:00:00 2024 and is within StartingDeadline" namespace=argo workflow=cron-redacted
time="2024-02-05T09:00:23.510Z" level=info msg="Running cron-redacted" namespace=argo workflow=cron-redacted
time="2024-02-05T09:00:23.588Z" level=info msg="cron-redacted has 'ConcurrencyPolicy: Replace' and has active Workflows" namespace=argo workflow=cron-redacted
time="2024-02-05T09:00:23.588Z" level=info msg="stopping 'cron-redacted-1707123600'" namespace=argo workflow=cron-redacted
time="2024-02-05T09:00:30.021Z" level=info msg="Processing workflow" namespace=argo workflow=cron-redacted-1707123600
time="2024-02-05T09:00:30.024Z" level=info msg="Task-result reconciliation" namespace=argo numObjs=0 workflow=cron-redacted-1707123600
time="2024-02-05T09:00:30.025Z" level=info msg="node unchanged" namespace=argo nodeID=cron-redacted-1707123600-2684772554 workflow=cron-redacted-1707123600
time="2024-02-05T09:00:30.025Z" level=info msg="Terminating pod as part of workflow shutdown" namespace=argo podName=cron-redacted-1707123600-default-2684772554 shutdownStrategy=Terminate workflow=cron-redacted-1707123600
time="2024-02-05T09:00:30.025Z" level=info msg="node cron-redacted-1707123600-2684772554 phase Running -> Failed" namespace=argo workflow=cron-redacted-1707123600
time="2024-02-05T09:00:30.025Z" level=info msg="node cron-redacted-1707123600-2684772554 message: workflow shutdown with strategy: Terminate" namespace=argo workflow=cron-redacted-1707123600
time="2024-02-05T09:00:30.025Z" level=info msg="node cron-redacted-1707123600-2684772554 finished: 2024-02-05 09:00:30.025353826 +0000 UTC" namespace=argo workflow=cron-redacted-1707123600
time="2024-02-05T09:00:30.027Z" level=info msg="Stopped with strategy 'Terminate'" namespace=argo workflow=cron-redacted-1707123600
time="2024-02-05T09:00:30.027Z" level=info msg="node cron-redacted-1707123600-2043635159 phase Running -> Failed" namespace=argo workflow=cron-redacted-1707123600
time="2024-02-05T09:00:30.027Z" level=info msg="node cron-redacted-1707123600-2043635159 message: Stopped with strategy 'Terminate'" namespace=argo workflow=cron-redacted-1707123600
time="2024-02-05T09:00:30.027Z" level=info msg="node cron-redacted-1707123600-2043635159 finished: 2024-02-05 09:00:30.027494846 +0000 UTC" namespace=argo workflow=cron-redacted-1707123600
time="2024-02-05T09:00:30.031Z" level=info msg="cleaning up pod" action=terminateContainers key=argo/cron-redacted-1707123600-default-2684772554/terminateContainers
time="2024-02-05T09:00:30.031Z" level=info msg="Skipped node cron-redacted-1707123600-3462471962 initialized Omitted (message: omitted: depends condition not met)" namespace=argo workflow=cron-redacted-1707123600
time="2024-02-05T09:00:30.031Z" level=info msg="Skipped node cron-redacted-1707123600-3864935480 initialized Omitted (message: omitted: depends condition not met)" namespace=argo workflow=cron-redacted-1707123600
time="2024-02-05T09:00:30.031Z" level=info msg="Outbound nodes of cron-redacted-1707123600-3917452609 set to [cron-redacted-1707123600-3864935480]" namespace=argo workflow=cron-redacted-1707123600
time="2024-02-05T09:00:30.031Z" level=info msg="node cron-redacted-1707123600-3917452609 phase Running -> Failed" namespace=argo workflow=cron-redacted-1707123600
time="2024-02-05T09:00:30.031Z" level=info msg="node cron-redacted-1707123600-3917452609 finished: 2024-02-05 09:00:30.031911911 +0000 UTC" namespace=argo workflow=cron-redacted-1707123600
time="2024-02-05T09:00:30.031Z" level=info msg="Checking daemoned children of cron-redacted-1707123600-3917452609" namespace=argo workflow=cron-redacted-1707123600
time="2024-02-05T09:00:30.031Z" level=info msg="https://redacted:443/api/v1/namespaces/argo/pods/cron-redacted-1707123600-default-2684772554/exec?command=%2Fvar%2Frun%2Fargo%2Fargoexec&command=kill&command=15&command=1&container=main&stderr=true&stdout=true&tty=false"
time="2024-02-05T09:00:30.032Z" level=info msg="Stopped with strategy 'Terminate'" namespace=argo workflow=cron-redacted-1707123600
time="2024-02-05T09:00:30.032Z" level=info msg="node cron-redacted-1707123600-2552572314 phase Running -> Failed" namespace=argo workflow=cron-redacted-1707123600
time="2024-02-05T09:00:30.032Z" level=info msg="node cron-redacted-1707123600-2552572314 message: Stopped with strategy 'Terminate'" namespace=argo workflow=cron-redacted-1707123600
time="2024-02-05T09:00:30.032Z" level=info msg="node cron-redacted-1707123600-2552572314 finished: 2024-02-05 09:00:30.032841684 +0000 UTC" namespace=argo workflow=cron-redacted-1707123600
time="2024-02-05T09:00:30.032Z" level=info msg="Step group node cron-redacted-1707123600-1621000785 deemed failed: child 'cron-redacted-1707123600-2552572314' failed" namespace=argo workflow=cron-redacted-1707123600
time="2024-02-05T09:00:30.032Z" level=info msg="node cron-redacted-1707123600-1621000785 phase Running -> Failed" namespace=argo workflow=cron-redacted-1707123600
time="2024-02-05T09:00:30.032Z" level=info msg="node cron-redacted-1707123600-1621000785 message: child 'cron-redacted-1707123600-2552572314' failed" namespace=argo workflow=cron-redacted-1707123600
time="2024-02-05T09:00:30.032Z" level=info msg="node cron-redacted-1707123600-1621000785 finished: 2024-02-05 09:00:30.032932183 +0000 UTC" namespace=argo workflow=cron-redacted-1707123600
time="2024-02-05T09:00:30.032Z" level=info msg="step group cron-redacted-1707123600-1621000785 was unsuccessful: child 'cron-redacted-1707123600-2552572314' failed" namespace=argo workflow=cron-redacted-1707123600
time="2024-02-05T09:00:30.032Z" level=info msg="Outbound nodes of cron-redacted-1707123600-2552572314 is [cron-redacted-1707123600-3917452609]" namespace=argo workflow=cron-redacted-1707123600
time="2024-02-05T09:00:30.033Z" level=info msg="Outbound nodes of cron-redacted-1707123600-1510020045 is [cron-redacted-1707123600-3917452609]" namespace=argo workflow=cron-redacted-1707123600
time="2024-02-05T09:00:30.033Z" level=info msg="node cron-redacted-1707123600-1510020045 phase Running -> Failed" namespace=argo workflow=cron-redacted-1707123600
time="2024-02-05T09:00:30.033Z" level=info msg="node cron-redacted-1707123600-1510020045 message: child 'cron-redacted-1707123600-2552572314' failed" namespace=argo workflow=cron-redacted-1707123600
time="2024-02-05T09:00:30.033Z" level=info msg="node cron-redacted-1707123600-1510020045 finished: 2024-02-05 09:00:30.033050646 +0000 UTC" namespace=argo workflow=cron-redacted-1707123600
time="2024-02-05T09:00:30.033Z" level=info msg="Checking daemoned children of cron-redacted-1707123600-1510020045" namespace=argo workflow=cron-redacted-1707123600
time="2024-02-05T09:00:30.033Z" level=info msg="Stopped with strategy 'Terminate'" namespace=argo workflow=cron-redacted-1707123600
time="2024-02-05T09:00:30.033Z" level=info msg="node cron-redacted-1707123600 phase Running -> Failed" namespace=argo workflow=cron-redacted-1707123600
time="2024-02-05T09:00:30.033Z" level=info msg="node cron-redacted-1707123600 message: Stopped with strategy 'Terminate'" namespace=argo workflow=cron-redacted-1707123600
time="2024-02-05T09:00:30.033Z" level=info msg="node cron-redacted-1707123600 finished: 2024-02-05 09:00:30.033443976 +0000 UTC" namespace=argo workflow=cron-redacted-1707123600
time="2024-02-05T09:00:30.033Z" level=info msg="TaskSet Reconciliation" namespace=argo workflow=cron-redacted-1707123600
time="2024-02-05T09:00:30.033Z" level=info msg=reconcileAgentPod namespace=argo workflow=cron-redacted-1707123600
time="2024-02-05T09:00:30.033Z" level=info msg="Updated phase Running -> Failed" namespace=argo workflow=cron-redacted-1707123600
time="2024-02-05T09:00:30.033Z" level=info msg="Updated message -> Stopped with strategy 'Terminate'" namespace=argo workflow=cron-redacted-1707123600
time="2024-02-05T09:00:30.033Z" level=info msg="Marking workflow completed" namespace=argo workflow=cron-redacted-1707123600
time="2024-02-05T09:00:30.043Z" level=info msg="Deleted PDB resource for workflow." namespace=argo workflow=cron-redacted-1707123600
time="2024-02-05T09:00:30.043Z" level=info msg="Checking daemoned children of " namespace=argo workflow=cron-redacted-1707123600
time="2024-02-05T09:00:30.049Z" level=info msg="cleaning up pod" action=deletePod key=argo/cron-redacted-1707123600-1340600742-agent/deletePod
time="2024-02-05T09:00:30.059Z" level=info msg="Workflow update successful" namespace=argo phase=Failed resourceVersion=394021560 workflow=cron-redacted-1707123600
time="2024-02-05T09:00:30.061Z" level=info msg="Queueing Failed workflow argo/cron-redacted-1707123600 for delete in 192h0m0s due to TTL"
time="2024-02-05T09:00:32.921Z" level=info msg="signaled container" container=main error="" namespace=argo pod=cron-redacted-1707123600-default-2684772554 stderr= stdout="killing 1 with terminated\n"
time="2024-02-05T09:00:32.922Z" level=info msg="https://redacted.1:443/api/v1/namespaces/argo/pods/cron-redacted-1707123600-default-2684772554/exec?command=%2Fvar%2Frun%2Fargo%2Fargoexec&command=kill&command=15&command=1&container=wait&stderr=true&stdout=true&tty=false"
time="2024-02-05T09:00:33.926Z" level=info msg="Enforcing history limit for 'cron-redacted'" namespace=argo workflow=cron-redacted
time="2024-02-05T09:00:33.936Z" level=info msg="Processing argo/cron-redacted" cronWorkflow=argo/cron-redacted
time="2024-02-05T09:00:34.016Z" level=info msg="cron-redacted missed an execution at Mon Feb 5 09:00:00 2024 and is within StartingDeadline" namespace=argo workflow=cron-redacted
time="2024-02-05T09:00:34.017Z" level=info msg="Running cron-redacted" namespace=argo workflow=cron-redacted
time="2024-02-05T09:00:34.676Z" level=info msg="signaled container" container=wait error="command terminated with exit code 137" namespace=argo pod=cron-redacted-1707123600-default-2684772554 stderr="" stdout=""
time="2024-02-05T09:00:35.124Z" level=info msg="cleaning up pod" action=deletePod key=argo/cron-redacted-1707123600-default-2684772554/deletePod
time="2024-02-05T09:00:44.455Z" level=info msg="Enforcing history limit for 'cron-redacted'" namespace=argo workflow=cron-redacted
time="2024-02-05T09:00:55.592Z" level=info msg="Enforcing history limit for 'cron-redacted'" namespace=argo workflow=cron-redacted
time="2024-02-05T09:01:04.676Z" level=info msg="cleaning up pod" action=killContainers key=argo/cron-redacted-1707123600-default-2684772554/killContainers
time="2024-02-05T09:01:06.109Z" level=info msg="Enforcing history limit for 'cron-redacted'" namespace=argo workflow=cron-redacted
time="2024-02-05T09:01:16.659Z" level=info msg="Enforcing history limit for 'cron-redacted'" namespace=argo workflow=cron-redacted
time="2024-02-05T09:01:27.171Z" level=info msg="Enforcing history limit for 'cron-redacted'" namespace=argo workflow=cron-redacted
time="2024-02-05T09:01:38.271Z" level=info msg="Enforcing history limit for 'cron-redacted'" namespace=argo workflow=cron-redacted
time="2024-02-05T09:01:48.773Z" level=info msg="Enforcing history limit for 'cron-redacted'" namespace=argo workflow=cron-redacted

Logs from your workflow's wait container

N/A
agilgur5 commented 9 months ago

According to https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-upgrades, UPDATE_CLUSTER means an update not changing the Kubernetes control plane version and google.container.v1.ClusterManager.PatchCluster means a cluster configuration change.

Or, in standard k8s terms, this seems to be a Node Pool upgrade, in a surge or blue-green fashion.

I'd be curious to know if this is specifically happening after the Controller's k8s node was rolled or during a roll of a different node.

The Controller should be resilient to restarts; in this case it seems not to have checked the start time of the running CronWorkflow(?) 🤔

We have two instances of the workflow-controller.

Also, are these two instances on two separate k8s nodes? I wonder if the hot-standby specifically is having an issue, i.e. whether the change of control results in some information loss.

I imagine we could set `startingDeadlineSeconds: 0` for our CronWorkflows to avoid the issue

This is a good workaround to note for anyone else experiencing this, thanks!
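
For anyone landing here later, that workaround is a one-field change on the CronWorkflow spec (a sketch only; merge it into your own manifest):

spec:
  startingDeadlineSeconds: 0   # don't start a "missed" execution late, so no duplicate catch-up run can be triggered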

but I'd like to know if this problem rings any bells and if it's fixable.

Doesn't ring any bells for me (idk about others). It should be fixable, though complicated race conditions take time to get assigned and fixed (they take a lot of time to investigate but, being races, do not necessarily have a large impact).

eduardodbr commented 9 months ago

Hi @partomatl,

can you check if the CronWorkflow status field `lastScheduledTime` is updated after the cron is executed for the first time?
For the example you gave, the command `kubectl get cronworkflow cron-redacted -oyaml` should show:

status:
  lastScheduledTime: "2024-02-05T09:00:00Z"
partomatl commented 9 months ago

can you check if the CronWorkflow status field `lastScheduledTime` is updated after the cron is executed for the first time? For the example you gave, the command `kubectl get cronworkflow cron-redacted -oyaml` should show:

status:
  lastScheduledTime: "2024-02-05T09:00:00Z"

Indeed, another CronWorkflow whose most recent scheduled run was at "2024-02-05T08:00:00Z" was launched on schedule but failed with the same problem. Its `lastScheduledTime` was not updated:

status:
  conditions: []
  lastScheduledTime: "2024-02-04T08:00:00Z"
eduardodbr commented 9 months ago

After every scheduled run, the CronWorkflow is updated with the most recent `lastScheduledTime`, and the executed Workflow is appended to the `active` list. It seems the problem is with the Patch k8s API request: it is not updating the cron status, so whenever the controller syncs again it retriggers the workflow. From your controller logs, I cannot see any `failed to update cron workflow` message, which probably means the request did not return an error, but it seems it didn't take effect either.
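
For reference, after a successful schedule the CronWorkflow status should look roughly like this (the workflow name is taken from the logs above; the uid is a placeholder):

status:
  active:
    - apiVersion: argoproj.io/v1alpha1
      kind: Workflow
      name: cron-redacted-1707123600
      namespace: argo
      uid: 00000000-0000-0000-0000-000000000000   # placeholder
  lastScheduledTime: "2024-02-05T09:00:00Z"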

agilgur5 commented 9 months ago

It seems the problem is with the Patch k8s API request: it is not updating the cron status, so whenever the controller syncs again it retriggers the workflow. From your controller logs, I cannot see any `failed to update cron workflow` message, which probably means the request did not return an error, but it seems it didn't take effect either.

Saw `Patch`, and that makes me wonder if there's a race condition here like the one I mentioned in https://github.com/argoproj/argo-workflows/pull/12596#issuecomment-1927397072 et al. I'm not too familiar with the CronWorkflow code, however.

tooptoop4 commented 1 week ago

Is only `concurrencyPolicy: Replace` affected?

https://github.com/argoproj/argo-workflows/blob/f73e7f9f6f70beafbbd0fbb49870336698f612af/workflow/cron/operator.go#L75-L77