Open partomatl opened 9 months ago
According to https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-upgrades, UPDATE_CLUSTER means "an update not changing the Kubernetes control plane version" and google.container.v1.ClusterManager.PatchCluster means "a cluster configuration change".
Or, in standard k8s terms, this seems to be a Node Pool upgrade, in a surge or blue-green fashion.
I'd be curious to know if this is specifically happening after the Controller's k8s node was rolled or during a roll of a different node.
The Controller should be resilient to restarts; in this case it seems to have not checked the start time of the running CronWorkflow(?) 🤔
We have two instances for the workflow-controller.
Also, are these two instances on two separate k8s nodes? I wonder if the hot-standby specifically is having an issue, i.e. if the change of control results in some information loss.
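One quick way to answer that is to list the controller pods together with the nodes they are scheduled on (a sketch; the argo namespace and the app=workflow-controller label are assumptions based on the default install manifests):

kubectl get pods -n argo -l app=workflow-controller -o wide

The NODE column shows whether the two replicas landed on different nodes, and the pod ages hint at whether either one was restarted around the time of the cluster operation.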
I imagine we could set startingDeadlineSeconds: 0 for our CronWorkflows to avoid the issue
This is a good workaround to note for anyone else experiencing this, thanks!
but I'd like to know if this problem rings any bell and if it's fixable.
Doesn't ring any bells for me (idk about others). It should be fixable, though complicated race conditions take time to get assigned and fixed: they take a lot of time to track down but, being races, do not necessarily have a large impact.
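For anyone wanting to apply the quoted workaround, it amounts to a one-line change in the CronWorkflow spec (an illustrative fragment, not the reporter's actual manifest; the schedule is only an example):

spec:
  schedule: "0 9 * * *"
  concurrencyPolicy: Replace
  startingDeadlineSeconds: 0  # a run that is not launched exactly on schedule is not retried late, so no catch-up run is started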
Hi @partomatl,
can you check if the CronWorkflow status field lastScheduledTime is updated after the cron is executed for the first time? For the example you gave, the command kubectl get cronworkflow cron-redacted -oyaml should show:

status:
  lastScheduledTime: "2024-01-05T09:00:00Z"
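If only that one field is of interest, it can also be pulled directly (same CronWorkflow name as in the example above; add -n <namespace> if it does not live in the current namespace):

kubectl get cronworkflow cron-redacted -o jsonpath='{.status.lastScheduledTime}{"\n"}'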
Indeed, another CronWorkflow whose most recent scheduled run was at "2024-02-05T08:00:00Z" was launched on schedule but failed with the same problem. Its lastScheduledTime was not updated:

status:
  conditions: []
  lastScheduledTime: "2024-02-04T08:00:00Z"
After every schedule, the CronWorkflow is updated to have the most recent lastScheduledTime and the executed WF is appended to the Active list. It seems that the problem is with the Patch k8s API request: it is not updating the cron status, so whenever it syncs again it retriggers the workflow. From your controller logs, I cannot see any "failed to update cron workflow" message, which probably means the request did not return an error, but it seems that it didn't take effect either.
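One way to double-check that on a live cluster is to filter the controller logs for that exact message (a debugging sketch; it assumes the controller runs as the workflow-controller Deployment in the argo namespace, and with two replicas both pods should be checked since deploy/ only picks one of them):

kubectl logs deploy/workflow-controller -n argo | grep "failed to update cron workflow"

If nothing shows up around the scheduled time while lastScheduledTime stays stale, that supports the theory that the Patch request returned success without the status change taking effect.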
Saw Patch mentioned above, and that makes me wonder if there's a race condition here like the one I mentioned in https://github.com/argoproj/argo-workflows/pull/12596#issuecomment-1927397072 et al. I'm not too familiar with the CronWorkflow code, however.

Is only concurrencyPolicy: Replace affected?
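For reference, the field in question is the one below; to my knowledge it follows the Kubernetes CronJob semantics, so the other values that could be tested are Allow and Forbid (illustrative fragment):

spec:
  concurrencyPolicy: Replace  # alternatives: Allow (run concurrently), Forbid (skip the new run if one is still running)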
Pre-requisites
What happened/what did you expect to happen?
We are running Argo Workflows on a GKE Autopilot cluster. We have two instances for the workflow-controller. Our CronWorkflows are configured with startingDeadlineSeconds: 60 and concurrencyPolicy: Replace. Usually, everything works well.

But sometimes, some CronWorkflows start failing: they are terminated by the workflow-controller. What happens is: the workflow-controller handles a CronWorkflow on schedule, and starts a workflow. Moments later, the workflow-controller thinks it has missed the cron schedule and, since we are still within the startingDeadlineSeconds, it launches a new workflow. The concurrencyPolicy is Replace, so the controller kills the first workflow. It then starts a new workflow to replace the first one, but this one is also killed. In the end, the CronWorkflow never successfully runs.

For instance, this morning cron-redacted was supposed to run at 09:00 UTC. The workflow-controller logs for this CronWorkflow are attached.

It seems this problem occurs following an automatic update on the GKE Autopilot cluster. Some CronWorkflows (~15% of the scheduled workflows) started failing after this automatic operation on our cluster. According to https://cloud.google.com/kubernetes-engine/docs/concepts/cluster-upgrades, UPDATE_CLUSTER means "an update not changing the Kubernetes control plane version" and google.container.v1.ClusterManager.PatchCluster means "a cluster configuration change".

It seems like following such an update, the controller loses track of which CronWorkflow was executed or is currently running. Everything went back to normal after a workflow-controller restart. I imagine we could set startingDeadlineSeconds: 0 for our CronWorkflows to avoid the issue, but I'd like to know if this problem rings any bell and if it's fixable.

Obviously this is hard to reproduce, but I'll be happy to provide any useful logs from our cluster.
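For context, a CronWorkflow configured the way this report describes would look roughly like the sketch below (the name, schedule, and container are illustrative, not the actual manifest from the affected cluster):

apiVersion: argoproj.io/v1alpha1
kind: CronWorkflow
metadata:
  name: cron-redacted
spec:
  schedule: "0 9 * * *"        # 09:00 UTC, as in the example above
  startingDeadlineSeconds: 60  # a missed start may still be launched up to 60s late
  concurrencyPolicy: Replace   # a newly launched run replaces (kills) a still-running one
  workflowSpec:
    entrypoint: main
    templates:
      - name: main
        container:
          image: alpine:3.19
          command: [sh, -c]
          args: ["echo hello"]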
Version
v3.4.6
Paste a small workflow that reproduces the issue. We must be able to run the workflow; don't enter a workflow that uses private images.
Logs from the workflow controller
Logs from your workflow's wait container