gzb1128 commented 11 months ago

Summary

I'd like to add some comments at environment variable "INFORMER_WRITE_BACK". documentation description now is INFORMER_WRITE_BACK: "Whether to write back to informer instead of catching up." until 2023-12-12.

here is my scene I have a controller that turn the Workflow Spec.Suspend from true to false. In some race cases, I found my changes were not caught by Workflow Controller, because in Workflow Controller, I see "workflow suspended" log after my controller reported "change suspend from true to false". so I checked workflow code, https://github.com/argoproj/argo-workflows/blob/23292c429430e71fdc7c59c89878053ee821b80d/workflow/controller/operator.go#L737. I believe my controller wrote Spec and Workflow Controller wrote status, so they were both accepted by kubernetes Apiserver. So this https://github.com/argoproj/argo-workflows/blob/23292c429430e71fdc7c59c89878053ee821b80d/workflow/controller/operator.go#L739C2-L739C2 Warning didn't caught by Workflow Controller log. Finally, my Spec event first reported to Informer and synced with Store, and then Workflow Controller write workflow back into Informer which cover my Spec changes https://github.com/argoproj/argo-workflows/blob/23292c429430e71fdc7c59c89878053ee821b80d/workflow/controller/operator.go#L776C19-L776C19

Since I turn "INFORMER_WRITE_BACK" to false, I see no missing change so far for 4 days, so I think my judging is right. This environment variable could result in the lost of consistency between Workflow Controller Store Cache and Kubernetes Apiserver in this race condition. Any suggestion from Community? If any improve will be made, I'd like to do some contribute and be a naive contributor

Use Cases

When would you use this?

Message from the maintainers:

Love this enhancement proposal? Give it a 👍. We prioritise the proposals with the most 👍.

Beyond this issue:

Joibel commented 11 months ago

INFORMER_WRITE_BACK only controls whether changes written to the kubernetes API server are also stored internally by the controller. When you write anything to an existing kubernetes object you must tell the API server which resourceVersion of the object you are modifying, so any conflict in a write is resolved by one of the writes failing because it is attempting to modify the older resourceVersion. Are you sure your controller which is setting .spec.suspend back to false is working correctly. There are delays through the informer, so I would expect on a busy system for workflows-controller to be able to observe a workflow as still suspended at a later timestamp than the workflow was unsuspended by your controller. Does the controller NEVER see the workflow as unsuspended?

gzb1128 commented 11 months ago

@Joibel Hi Joibel This is the weird thing, I can use kubectl to make sure workflow is unsuspended already, and my controller reported it unsuspendedworkflow normally in log. I build my controller with controller-runtime, which I pretty sure it contains right resourceVersion. However, workflow controller NEVER see the workflow as unsuspended. My opinion is based on the assumption that when different clients have the same resourceVersion resource, if one is writing Status and the other is writing Spec, Apiserver will accept both changes.

I think the logical flow of the two controllers looks like this. First workflow controller hold origin workflow woc.wf. here. https://github.com/argoproj/argo-workflows/blob/23292c429430e71fdc7c59c89878053ee821b80d/workflow/controller/operator.go#L737 Then my controller change workflow to unsuspended. Workflow controller's informer watch this event, and sync it into Store. After that, workflow controller update it. Based on my assumption, they both success. Then the logic goes here because I see no Warning log in workflow controller. https://github.com/argoproj/argo-workflows/blob/23292c429430e71fdc7c59c89878053ee821b80d/workflow/controller/operator.go#L755C11-L755C11 and log "Workflow update successful" here https://github.com/argoproj/argo-workflows/blob/23292c429430e71fdc7c59c89878053ee821b80d/workflow/controller/operator.go#L770 and write into Informer here. https://github.com/argoproj/argo-workflows/blob/23292c429430e71fdc7c59c89878053ee821b80d/workflow/controller/operator.go#L772C20-L772C39 However, at that moment, unsuspended workflow is already synced into Store. So this operation covered Apiserver event, also my change, also caused workflow controller NEVER see workflow as unsuspended. Because the Store cache lost consistent with Apiserver for this INFORMER_WRITE_BACK operation. Any suggestion will be appreciated

gzb1128 commented 11 months ago

@Joibel sorry Joibel, looks like I make things complicated. Here is a simple scene which result in same consequence. After workflow controller update workflow, and before write into informer, cpu change context to my controller. Previous updating event goes through Apiserver and received by my controller. I set workflow to unsuspended. Event finally sent to workflow controller and sync to Informer store. At this time, workflow controller write holding woc.wf into informer, which cover my first sync. This actually races, and workflow controller Informer Store lost consistency with Apiserver. Also, I don't need to confuse about whether my second comment assumption is true or not. The reason is that workflow controller doesn't make update and write into informer atomic, so any resource changing event which synced into Informer Store will be missing.

gzb1128 commented 11 months ago

@agilgur5 Sir can you look at this please. This actually raises a bug, and it is hard to investigate unless the guy had read informer code and argo workflow code.

Joibel commented 11 months ago

I don't think that we should be encouraging INFORMER_WRITE_BACK to false, whatever the outcome of this. We should instead be fixing the true path to work always.

So your suggestion is that the workflow is going through the following:

Workflow controller operates on resourceVersion: x
Workflow controller writes resourceVersion: y to the API
Your controller operates on resourceVersion: y and unsuspends the workflow
Your controller writes that as resourceVersion: z
Workflow controller writes resourceVersion: y to its informer cache and therefore never sees z

There is nothing in the workflow code that would prevent z from coming through to the workflow controller. There is this line which prevents the same version we already have, but it very definitely allows workflow resource versions to go backwards and forwards. Both y and z should come through that code path at some point, and both should cause the controller to attempt a reconcile.

That line prevents us from attempting to reconcile y again, but would allow us to see any other version.

The controller will also do a full resync every 20 minutes so I'd expect worst case it to take that long for your change to be picked up.

Very simple test: If you restart the workflow controller when the workflow is in the unsuspended state does it start running it again?

Which version of workflows are you using, and which other versions have you experienced this with? There have been various tweaks at various stages.

gzb1128 commented 11 months ago

@Joibel Hi Joibel. This is my log. 08:21:35.200Z is really close to 08:21:35.201, which result in my concern. 2023-12-11T08:21:35.200Z INFO my controller turn workflow exampleWorkflow to unsuspended

{"level":"info","msg":"Processing workflow","namespace":"default","time":"2023-12-11T08:21:35.172Z","workflow":"exampleWorkflow"} {"level":"info","msg":"Updated phase -\u003e Running","namespace":"default","time":"2023-12-11T08:21:35.176Z","workflow":"exampleWorkflow"} {"level":"info","msg":"workflow suspended","namespace":"default","time":"2023-12-11T08:21:35.176Z","workflow":"exampleWorkflow"} {"level":"info","msg":"Workflow update successful","namespace":"default","phase":"Running","resourceVersion":"288635653","time":"2023-12-11T08:21:35.201Z","workflow":"exampleWorkflow"}

{"level":"info","msg":"Processing workflow","namespace":"default","time":"2023-12-11T08:21:45.189Z","workflow":"exampleWorkflow"} {"level":"info","msg":"Task-result reconciliation","namespace":"default","numObjs":0,"time":"2023-12-11T08:21:45.189Z","workflow":"exampleWorkflow"} {"level":"info","msg":"workflow suspended","namespace":"default","time":"2023-12-11T08:21:45.189Z","workflow":"exampleWorkflow"}

{"level":"info","msg":"Processing workflow","namespace":"default","time":"2023-12-11T08:41:45.189Z","workflow":"exampleWorkflow"} {"level":"info","msg":"Task-result reconciliation","namespace":"default","numObjs":0,"time":"2023-12-11T08:41:45.190Z","workflow":"exampleWorkflow"} {"level":"info","msg":"workflow suspended","namespace":"default","time":"2023-12-11T08:41:45.190Z","workflow":"exampleWorkflow"}

gzb1128 commented 11 months ago

This is exactly what I did, when I thought it is a event lost. I restarted the workflow controller when the workflow is in the unsuspended state and get it running again. So I am sure it is the gap between Informer Store and Apiserver caused this. UpdateFuncdid controller workqueue job, so it is not affecting informer cache job. Informer cache is a go routine which continuously syncing Delta event into Informer Store, so my resourceVersion z workflow could be synced before workflow controller write resourceVersion: y to its informer. Also, this informer Store update just make Store update item as resourceVersion: y, so the previous resourceVersion: z in Informer Store was lost.

Joibel commented 11 months ago

Ah, that makes it clearer, thanks for your patience explaining it. I had not read that you thought z had made it into the workflow-controller already properly.

So the correct fix for this would be to ensure that when we do the Update in writeBackToInformer that the informer still has version x in it, so we're not writing blind?

gzb1128 commented 11 months ago

didn't think that far for the fix : ). I think a check before writeBackToInformer is good enough. If possible I'd like to open a PR for this, to get my first contribute for a CNCF project : )

Joibel commented 11 months ago

That would be great, yes please. If you can assign the PR to me when you create it that would be good, as I've got the context.

gzb1128 commented 11 months ago

I think a Store mutex should be acquired before this. code should be something like this (sure this cause deadlock), but Store do not provide a Lock Interface. Any suggestion for this?

func (woc *wfOperationCtx) writeBackToInformer() error {
    un, err := wfutil.ToUnstructured(woc.wf)
    if err != nil {
        return fmt.Errorf("failed to convert workflow to unstructured: %w", err)
    }
    if Store.Get(keyfunc(un)).ResourceVersion == woc.orig {
        Store.Lock()
        defer Store.UnLock()
        err = woc.controller.wfInformer.GetStore().Update(un)
        if err != nil {
            return fmt.Errorf("failed to update informer store: %w", err)
        }
    }
    return nil
}

Joibel commented 11 months ago

If we could lock the store it would want to be done before the Store.Get, not after it. But, right now, I don't have a solution for locking it.

gzb1128 commented 11 months ago

I think Store is designed to be managed by Informer, and proper way to update data in Store is to update resource in apiserver and caused informer to sync. The way to solve this bug is turn WriteBackToInformer to false, or write a customized Informer, which is really complicated. Any advice from the Community and maintainers?

gzb1128 commented 11 months ago

So far, I think make the default behavior of WriteBackToInformer to false is good enough.

gzb1128 commented 11 months ago

Also we can improve documentation, WriteBackToInformer improves performance when there is only workflow controller is editing resources. If user is applying their changes, docs should be highlighted. Surely, the default behavior should be false.

gzb1128 commented 11 months ago

@Joibel Hi Joibel, please add a bug label so other contributors and maintainers can see this issue. I'd like to open a PR to change default behavior of WriteBackToInformer to false, if no more suggestions was received.

Joibel commented 11 months ago

I would not want to switch us to defaulting to false. It generates a lot of Error updating workflow: Operation cannot be fulfilled on workflows.argoproj.io \""WF_NAME\"": the object has been modified; please apply your changes to the latest version and try again Conflict" messages and causes the controller to do extra work. Despite this bug, most users are not affected by it - it was introduced in response to the set of issues around #3719, and has been the behavior for a few years. Despite the lack of locking, the originally discussed check would improve the behavior here, and might be a better than nothing solution - I don't have a better answer at the moment. I think documenting the downsides of it being enabled would be useful.

gzb1128 commented 11 months ago

I get it, the error generates because the informer sync slower than next reconcile, and controller get the earlier version. But with a 1 sec sleep after update, informer is likely to fetch the latest version. So I think the extra work to do for this error is not too much. When informer is not catching up and error happens, more cases is that something indeed change the workflow and cause the ResourceVersion to be outdated. And a log Re-applying updates on latest version and retrying update here usually fix it well (at least I don't get any workflow suspended in Informer Store this time). Since I don't get too much context about this writing informer action, I'll wait for more advice.

gzb1128 commented 11 months ago

@agilgur5 Sir, seems like contributors involved in #3719 didn't submit commit lately, can you take a look at this issue and give some suggestions?

agilgur5 commented 11 months ago

@agilgur5 Sir, seems like contributors involved in #3719 didn't submit commit lately, can you take a look at this issue and give some suggestions?

I'm not sure what you mean, the issue was closed as completed and had PRs to close it 3+ years ago. I would not expect any recent activity on a 3+ year old issue and maintainers generally do not recommend commenting on old threads; new issues like this one are preferred. You already have made this issue.

To be clear, @Joibel is a Reviewer on the project like me and has the same permissions as me. Given his involvement here, I would defer anything to him.

gzb1128 commented 11 months ago

@agilgur5 To be clear, I want some advice to fix this issue, the solution discussed previously is not prefect. I am new to open source community, so i don't know the conventional rules, if any disturb here I am sorry. I just try to get more info from the contributor and maintainers.

agilgur5 commented 11 months ago

I would continue discussing with @Joibel and wait for a response from him if you don't believe the previous proposal is sufficient.

Waiting a few days or even a week+ for a response is pretty common in open-source as most contributors (like me) are volunteering their time. I don't think expecting quick responses from volunteers is reasonable.

gzb1128 commented 11 months ago

OK, I will follow the guide. At me anytime if things make progress or need me.

tooptoop4 commented 1 month ago

@gzb1128 are u running with INFORMER_WRITE_BACK as false in prod? any issues?

gzb1128 commented 1 week ago

@tooptoop4 No issues so far, and I think currently making INFORMER_WRITE_BACK as true is not the best practice for client-go controller framework. This changes the timing relationship of the k8s resource events.

gzb1128 commented 1 week ago

@tooptoop4 what problems are you facing now? I visited your issue here and here, If you are running into a "lost workflow informer resource event" problem, you may take a look in my issue, you can confirm that by restarting workflow controller pod, if it caused by infromer event lost, your workflow CR should start work again. This can be reproduced when you are watching and modifying workflow spec in your customized controller, also you may change INFORMER_WRITE_BACK as false to see if it avoids races.

gzb1128 commented 1 week ago

@Joibel @tooptoop4 I have a better solution now. The origin purpose is to avoid controller do extra work and logs Error updating workflow: Operation cannot be fulfilled on workflows.argoproj.io \""WF_NAME\"": the object has been modified; please apply your changes to the latest version and try again Conflict. We can achieve same goal without using "INFORMER_WRITE_BACK" here which cause this issue. A simple solution is to write a waitCacheObjectSync(lister, namespace, updatedObject) function, loop query cache until resourceVersion of cache object changed, throw error otherwise.

argoproj / argo-workflows

Add some doc about environment variable "INFORMER_WRITE_BACK" #12352

Summary

Use Cases