INFORMER_WRITE_BACK only controls whether changes written to the Kubernetes API server are also stored internally by the controller. When you write anything to an existing Kubernetes object you must tell the API server which resourceVersion of the object you are modifying, so any conflict in a write is resolved by one of the writes failing because it is attempting to modify the older resourceVersion.
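Roughly, from a client's point of view it looks like this (an illustrative sketch only; the clientset wiring and the function name are assumptions, not the controller's actual code):

package example

import (
	"context"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

	wfclientset "github.com/argoproj/argo-workflows/v3/pkg/client/clientset/versioned"
)

// unsuspend flips .spec.suspend to false using a plain read-modify-write.
// The Update call carries the resourceVersion the object was read at; if
// another writer has already produced a newer resourceVersion, the API
// server rejects this write with a Conflict error rather than merging it.
func unsuspend(ctx context.Context, c wfclientset.Interface, namespace, name string) error {
	wf, err := c.ArgoprojV1alpha1().Workflows(namespace).Get(ctx, name, metav1.GetOptions{})
	if err != nil {
		return err
	}
	suspend := false
	wf.Spec.Suspend = &suspend
	_, err = c.ArgoprojV1alpha1().Workflows(namespace).Update(ctx, wf, metav1.UpdateOptions{})
	if apierrors.IsConflict(err) {
		// another writer produced a newer resourceVersion first; the usual
		// fix is to re-read the latest object and re-apply the change
	}
	return err
}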
Are you sure your controller, which is setting .spec.suspend back to false, is working correctly?
There are delays through the informer, so on a busy system I would expect the workflow controller to be able to observe a workflow as still suspended at a later timestamp than when it was unsuspended by your controller. Does the controller NEVER see the workflow as unsuspended?
@Joibel Hi Joibel
This is the weird thing: I can use kubectl to confirm the workflow is already unsuspended, and my controller logged that it unsuspended the workflow normally. I built my controller with controller-runtime, so I am pretty sure it used the right resourceVersion. However, the workflow controller NEVER sees the workflow as unsuspended.
My opinion is based on the assumption that when two clients hold the resource at the same resourceVersion, and one writes the Status while the other writes the Spec, the API server will accept both changes.
I think the logical flow of the two controllers looks like this. First, the workflow controller holds the original workflow as woc.wf here:
https://github.com/argoproj/argo-workflows/blob/23292c429430e71fdc7c59c89878053ee821b80d/workflow/controller/operator.go#L737
Then my controller changes the workflow to unsuspended. The workflow controller's informer watches this event and syncs it into the Store. After that, the workflow controller updates the workflow. Based on my assumption, both writes succeed. Then the logic goes here, because I see no warning log in the workflow controller:
https://github.com/argoproj/argo-workflows/blob/23292c429430e71fdc7c59c89878053ee821b80d/workflow/controller/operator.go#L755C11-L755C11
and log "Workflow update successful" here
https://github.com/argoproj/argo-workflows/blob/23292c429430e71fdc7c59c89878053ee821b80d/workflow/controller/operator.go#L770
and writes back into the informer here:
https://github.com/argoproj/argo-workflows/blob/23292c429430e71fdc7c59c89878053ee821b80d/workflow/controller/operator.go#L772C20-L772C39
However, at that moment the unsuspended workflow has already been synced into the Store. So this operation overwrites the API server event, including my change, and causes the workflow controller to NEVER see the workflow as unsuspended, because the Store cache loses consistency with the API server due to this INFORMER_WRITE_BACK operation.
Any suggestion will be appreciated
@Joibel Sorry Joibel, it looks like I made things complicated.
Here is a simpler scenario which results in the same consequence. After the workflow controller updates the workflow, and before it writes into the informer, the CPU switches context to my controller. The previous update event goes through the API server and is received by my controller. I set the workflow to unsuspended. That event is eventually sent to the workflow controller and synced into the informer Store. At this point, the workflow controller writes the woc.wf it is holding into the informer, which overwrites my earlier sync.
This is an actual race, and the workflow controller's informer Store loses consistency with the API server. It also means I don't need to worry about whether the assumption in my second comment is true or not. The root cause is that the workflow controller does not make the update and the write into the informer atomic, so any resource change event synced into the informer Store in between will go missing.
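To make the overwrite concrete, here is a minimal self-contained sketch (fake object, made-up resourceVersions, not the controller's real informer) showing that a plain client-go cache.Store keeps whatever was written last, with no resourceVersion comparison:

package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/client-go/tools/cache"
)

// wf builds a fake Workflow object with the given resourceVersion and suspend flag.
func wf(rv string, suspend bool) *unstructured.Unstructured {
	u := &unstructured.Unstructured{Object: map[string]interface{}{}}
	u.SetAPIVersion("argoproj.io/v1alpha1")
	u.SetKind("Workflow")
	u.SetNamespace("default")
	u.SetName("exampleWorkflow")
	u.SetResourceVersion(rv)
	_ = unstructured.SetNestedField(u.Object, suspend, "spec", "suspend")
	return u
}

func main() {
	store := cache.NewStore(cache.MetaNamespaceKeyFunc)
	_ = store.Add(wf("y", true))     // informer syncs the controller's own update (y, suspended)
	_ = store.Update(wf("z", false)) // informer syncs the external unsuspend (z)
	_ = store.Update(wf("y", true))  // write-back puts the stale in-memory copy (y) into the store
	obj, _, _ := store.GetByKey("default/exampleWorkflow")
	u := obj.(*unstructured.Unstructured)
	fmt.Println(u.GetResourceVersion()) // prints "y": the unsuspend at z is gone from the cache
}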
@agilgur5 Sir, can you look at this please? This is actually a bug, and it is hard to investigate unless you have read both the informer code and the Argo Workflows code.
I don't think that we should be encouraging setting INFORMER_WRITE_BACK to false, whatever the outcome of this. We should instead be fixing the true path to work always.
So your suggestion is that the workflow is going through the following:
- the workflow controller is operating on the workflow at resourceVersion: x
- the workflow controller writes its update, resourceVersion: y, to the API
- your controller reads resourceVersion: y and unsuspends the workflow, creating resourceVersion: z
- the workflow controller writes resourceVersion: y to its informer cache and therefore never sees z
There is nothing in the workflow code that would prevent z from coming through to the workflow controller. There is this line which filters out the same version we already have, but it very definitely allows workflow resource versions to go backwards and forwards. Both y and z should come through that code path at some point, and both should cause the controller to attempt a reconcile. That line prevents us from attempting to reconcile y again, but would allow us to see any other version.
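For illustration only (this is the general client-go pattern, not a quote of the actual argo-workflows handler), the kind of check meant here looks like this, where only an update whose resourceVersion is unchanged gets dropped and everything else is enqueued:

package example

import (
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/client-go/tools/cache"
	"k8s.io/client-go/util/workqueue"
)

// addHandlers registers an update handler that skips only same-resourceVersion
// resyncs, so both y and z would be enqueued for reconciliation when they arrive.
func addHandlers(informer cache.SharedIndexInformer, queue workqueue.RateLimitingInterface) {
	informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
		UpdateFunc: func(oldObj, newObj interface{}) {
			oldWf := oldObj.(*unstructured.Unstructured)
			newWf := newObj.(*unstructured.Unstructured)
			if oldWf.GetResourceVersion() == newWf.GetResourceVersion() {
				return // periodic resync of the version we already processed
			}
			if key, err := cache.MetaNamespaceKeyFunc(newObj); err == nil {
				queue.Add(key)
			}
		},
	})
}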
The controller will also do a full resync every 20 minutes, so in the worst case I'd expect it to take that long for your change to be picked up.
Very simple test: If you restart the workflow controller when the workflow is in the unsuspended state does it start running it again?
Which version of workflows are you using, and which other versions have you experienced this with? There have been various tweaks at various stages.
@Joibel Hi Joibel. This is my log. 08:21:35.200Z is really close to 08:21:35.201, which is what concerns me.

2023-12-11T08:21:35.200Z INFO my controller turn workflow exampleWorkflow to unsuspended

{"level":"info","msg":"Processing workflow","namespace":"default","time":"2023-12-11T08:21:35.172Z","workflow":"exampleWorkflow"}
{"level":"info","msg":"Updated phase -\u003e Running","namespace":"default","time":"2023-12-11T08:21:35.176Z","workflow":"exampleWorkflow"}
{"level":"info","msg":"workflow suspended","namespace":"default","time":"2023-12-11T08:21:35.176Z","workflow":"exampleWorkflow"}
{"level":"info","msg":"Workflow update successful","namespace":"default","phase":"Running","resourceVersion":"288635653","time":"2023-12-11T08:21:35.201Z","workflow":"exampleWorkflow"}
{"level":"info","msg":"Processing workflow","namespace":"default","time":"2023-12-11T08:21:45.189Z","workflow":"exampleWorkflow"}
{"level":"info","msg":"Task-result reconciliation","namespace":"default","numObjs":0,"time":"2023-12-11T08:21:45.189Z","workflow":"exampleWorkflow"}
{"level":"info","msg":"workflow suspended","namespace":"default","time":"2023-12-11T08:21:45.189Z","workflow":"exampleWorkflow"}
{"level":"info","msg":"Processing workflow","namespace":"default","time":"2023-12-11T08:41:45.189Z","workflow":"exampleWorkflow"}
{"level":"info","msg":"Task-result reconciliation","namespace":"default","numObjs":0,"time":"2023-12-11T08:41:45.190Z","workflow":"exampleWorkflow"}
{"level":"info","msg":"workflow suspended","namespace":"default","time":"2023-12-11T08:41:45.190Z","workflow":"exampleWorkflow"}
This is exactly what I did when I thought it was a lost event. I restarted the workflow controller while the workflow was in the unsuspended state and got it running again, so I am sure the gap between the informer Store and the API server caused this. UpdateFunc only does the controller workqueue job, so it does not affect the informer cache job. The informer cache is a goroutine which continuously syncs Delta events into the informer Store, so my resourceVersion: z workflow could be synced before the workflow controller writes resourceVersion: y to its informer. Also, that informer Store update just makes the Store hold the item at resourceVersion: y, so the previous resourceVersion: z in the informer Store is lost.
Ah, that makes it clearer, thanks for your patience explaining it. I had not read that you thought z had made it into the workflow-controller already properly.
So the correct fix for this would be to ensure that when we do the Update in writeBackToInformer that the informer still has version x in it, so we're not writing blind?
I didn't think that far ahead for the fix :). I think a check before writeBackToInformer is good enough. If possible I'd like to open a PR for this, to get my first contribution to a CNCF project :)
That would be great, yes please. If you can assign the PR to me when you create it that would be good, as I've got the context.
I think a Store mutex should be acquired before this. The code should be something like this (I'm sure this would cause a deadlock as written), but Store does not provide a Lock interface. Any suggestion for this?
func (woc *wfOperationCtx) writeBackToInformer() error {
	un, err := wfutil.ToUnstructured(woc.wf)
	if err != nil {
		return fmt.Errorf("failed to convert workflow to unstructured: %w", err)
	}
	store := woc.controller.wfInformer.GetStore()
	key, err := cache.MetaNamespaceKeyFunc(un)
	if err != nil {
		return err
	}
	cur, exists, err := store.GetByKey(key)
	if err != nil {
		return err
	}
	// only write back if the store still holds the resourceVersion we started from,
	// otherwise a newer event has already been synced and must not be overwritten
	if exists && cur.(*unstructured.Unstructured).GetResourceVersion() == woc.orig.ResourceVersion {
		// pseudocode: cache.Store has no Lock/UnLock, and without them the check
		// above still races with the informer's own sync loop
		store.Lock()
		defer store.UnLock()
		if err := store.Update(un); err != nil {
			return fmt.Errorf("failed to update informer store: %w", err)
		}
	}
	return nil
}
If we could lock the store it would want to be done before the Store.Get, not after it. But, right now, I don't have a solution for locking it.
I think the Store is designed to be managed by the Informer, and the proper way to update data in the Store is to update the resource in the API server and let the informer sync it. The way to solve this bug is to turn WriteBackToInformer to false, or to write a customized Informer, which is really complicated. Any advice from the community and maintainers?
So far, I think making the default behavior of WriteBackToInformer false is good enough.
Also, we can improve the documentation: WriteBackToInformer improves performance when only the workflow controller is editing the resources. The docs should highlight the case where users apply their own changes. In any case, I think the default behavior should be false.
@Joibel Hi Joibel, please add a bug label so other contributors and maintainers can see this issue. I'd like to open a PR to change the default behavior of WriteBackToInformer to false if no further suggestions are received.
I would not want to switch us to defaulting to false. It generates a lot of Error updating workflow: Operation cannot be fulfilled on workflows.argoproj.io \"WF_NAME\": the object has been modified; please apply your changes to the latest version and try again Conflict messages and causes the controller to do extra work.
Despite this bug, most users are not affected by it - it was introduced in response to the set of issues around #3719, and has been the behavior for a few years.
Despite the lack of locking, the originally discussed check would improve the behavior here, and might be a better-than-nothing solution - I don't have a better answer at the moment.
I think documenting the downsides of it being enabled would be useful.
I get it: the error occurs because the informer syncs slower than the next reconcile, and the controller gets the earlier version. But with a 1-second sleep after the update, the informer is likely to have fetched the latest version, so I think the extra work caused by this error is not too much. When the informer is not catching up and the error happens, it is more often the case that something really did change the workflow and made the ResourceVersion outdated. And the Re-applying updates on latest version and retrying update path here usually fixes it well (at least I don't get any workflow stuck as suspended in the informer Store this time).
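For reference, the standard client-go pattern behind that "please apply your changes to the latest version and try again" message looks roughly like this for an external controller like mine (a sketch only; the clientset import path and names are assumptions):

package example

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/util/retry"

	wfclientset "github.com/argoproj/argo-workflows/v3/pkg/client/clientset/versioned"
)

// resumeWithRetry re-reads the latest Workflow and re-applies the change
// whenever the update hits a resourceVersion conflict.
func resumeWithRetry(ctx context.Context, c wfclientset.Interface, namespace, name string) error {
	return retry.RetryOnConflict(retry.DefaultRetry, func() error {
		wf, err := c.ArgoprojV1alpha1().Workflows(namespace).Get(ctx, name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		suspend := false
		wf.Spec.Suspend = &suspend
		_, err = c.ArgoprojV1alpha1().Workflows(namespace).Update(ctx, wf, metav1.UpdateOptions{})
		return err
	})
}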
Since I don't have much context about this write-back-to-informer behavior, I'll wait for more advice.
@agilgur5 Sir, it seems the contributors involved in #3719 haven't committed lately; can you take a look at this issue and give some suggestions?
I'm not sure what you mean, the issue was closed as completed and had PRs to close it 3+ years ago. I would not expect any recent activity on a 3+ year old issue and maintainers generally do not recommend commenting on old threads; new issues like this one are preferred. You already have made this issue.
To be clear, @Joibel is a Reviewer on the project like me and has the same permissions as me. Given his involvement here, I would defer anything to him.
@agilgur5 To be clear, I want some advice on fixing this issue; the solution discussed previously is not perfect. I am new to the open source community, so I don't know the conventions; if I caused any disturbance here, I am sorry. I am just trying to get more information from the contributors and maintainers.
I would continue discussing with @Joibel and wait for a response from him if you don't believe the previous proposal is sufficient.
Waiting a few days or even a week+ for a response is pretty common in open-source as most contributors (like me) are volunteering their time. I don't think expecting quick responses from volunteers is reasonable.
OK, I will follow that guidance. @ me anytime if things make progress or you need me.
@gzb1128 are u running with INFORMER_WRITE_BACK as false in prod? any issues?
@tooptoop4 No issues so far, and I think setting INFORMER_WRITE_BACK to true is currently not best practice for a client-go controller framework, since it changes the timing relationship of the k8s resource events.
@tooptoop4 What problems are you facing now? I visited your issues here and here. If you are running into a "lost workflow informer resource event" problem, you may take a look at my issue. You can confirm it by restarting the workflow controller pod: if it is caused by a lost informer event, your Workflow CR should start working again. This can be reproduced when you are watching and modifying the workflow spec in your own controller; you may also set INFORMER_WRITE_BACK to false to see if it avoids the races.
@Joibel @tooptoop4 I have a better solution now. The original purpose of INFORMER_WRITE_BACK was to avoid the controller doing extra work and logging Error updating workflow: Operation cannot be fulfilled on workflows.argoproj.io \"WF_NAME\": the object has been modified; please apply your changes to the latest version and try again Conflict. We can achieve the same goal without the "INFORMER_WRITE_BACK" behavior that causes this issue. A simple solution is to write a waitCacheObjectSync(lister, namespace, updatedObject) function that loops querying the cache until the resourceVersion of the cached object has changed, and throws an error otherwise. See the sketch below.
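A rough sketch of what I mean (a hypothetical helper: the name, timings, and the use of a plain cache.Store instead of a lister are all just for illustration):

package example

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/tools/cache"
)

// waitCacheObjectSync polls the informer store after a successful API update
// until the store has caught up with that write (here: until it reports the
// updated object's resourceVersion), returning an error on timeout.
func waitCacheObjectSync(ctx context.Context, store cache.Store, updated *unstructured.Unstructured) error {
	key, err := cache.MetaNamespaceKeyFunc(updated)
	if err != nil {
		return err
	}
	return wait.PollUntilContextTimeout(ctx, 100*time.Millisecond, 5*time.Second, true,
		func(ctx context.Context) (bool, error) {
			obj, exists, err := store.GetByKey(key)
			if err != nil || !exists {
				return false, err
			}
			// a concurrent third-party write could move the cache past our
			// resourceVersion; a real implementation would need to handle that too
			return obj.(*unstructured.Unstructured).GetResourceVersion() == updated.GetResourceVersion(), nil
		})
}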
Summary
I'd like to add some documentation for the environment variable "INFORMER_WRITE_BACK". As of 2023-12-12, the documented description is only: INFORMER_WRITE_BACK: "Whether to write back to informer instead of catching up."
Here is my scenario. I have a controller that turns the Workflow Spec.Suspend from true to false. In some race cases, I found my changes were not picked up by the Workflow Controller: in the Workflow Controller I see the "workflow suspended" log after my controller reported "change suspend from true to false". So I checked the workflow code at https://github.com/argoproj/argo-workflows/blob/23292c429430e71fdc7c59c89878053ee821b80d/workflow/controller/operator.go#L737. I believe my controller wrote the Spec and the Workflow Controller wrote the Status, so both were accepted by the Kubernetes API server, which is why the warning at https://github.com/argoproj/argo-workflows/blob/23292c429430e71fdc7c59c89878053ee821b80d/workflow/controller/operator.go#L739C2-L739C2 never appeared in the Workflow Controller log. Finally, my Spec event was delivered to the informer first and synced into the Store, and then the Workflow Controller wrote the workflow back into the informer, which overwrote my Spec changes: https://github.com/argoproj/argo-workflows/blob/23292c429430e71fdc7c59c89878053ee821b80d/workflow/controller/operator.go#L776C19-L776C19
Since I turned "INFORMER_WRITE_BACK" to false, I have seen no missing changes for 4 days, so I think my judgment is right. This environment variable can result in a loss of consistency between the Workflow Controller's Store cache and the Kubernetes API server in this race condition. Any suggestions from the community? If any improvement is to be made, I'd like to contribute and become a new contributor.