buehler / dotnet-operator-sdk

KubeOps is a kubernetes operator sdk in dotnet. Strongly inspired by kubebuilder.
https://buehler.github.io/dotnet-operator-sdk/
Apache License 2.0
237 stars 64 forks source link

[bug]: Re-triggering reconcile does not work when status is updated #554

Open buehler opened 1 year ago

buehler commented 1 year ago

Describe the bug

As discussed in #551, the timed reconcile return value is ignored when a status update is performed during the reconciliation loop.

To reproduce

  1. Create a controller
  2. Update the status of the instance
  3. Return with timed reconcile event
  4. No timed reconcile is called

Expected behavior

No response

Screenshots

No response

Additional Context

https://github.com/buehler/dotnet-operator-sdk/blob/3f42bbcb13d98a0811606b4b8ca83566c74a988a/src/KubeOps/Operator/Controller/EventQueue.cs#L44-L55

Karthik2893 commented 1 year ago

Following up on our discussion to get more context on this:

  1. What are the likely cases where duplicate events might be queued?
  2. As per my understanding, the "Reconciliation" loop should always check the state of the cluster and ensure if it is in the desired state or not. Ideally, controller should be handling duplicate events efficiently (resulting in a no-op for ex) and they should not be a problem. Correct me if I am wrong.

My understanding is that replacing "Switch" with "Merge" should fix this issue. I want to be sure that this won't cause any unintended side-effects.

https://github.com/buehler/dotnet-operator-sdk/blob/3f42bbcb13d98a0811606b4b8ca83566c74a988a/src/KubeOps/Operator/Controller/EventQueue.cs#L44-L55

Thank you.

prochnowc commented 1 year ago

@buehler I think we're also being hit by this bug - I couldnt debug our operator yet.

The issue is that after some reconcile loop's most of the resources are never reconciled again, even if the resource was updated in kubernetes (not the status). Strangely this does not happen for all resources.

After restarting the operator all resources are reconciled again for some time and stops working for most resources again.

I will open a new issue as it seems unrelated.

tomitesh commented 11 months ago

@buehler I think we're also being hit by this bug - I couldnt debug our operator yet.

The issue is that after some reconcile loop's most of the resources are never reconciled again, even if the resource was updated in kubernetes (not the status). Strangely this does not happen for all resources.

After restarting the operator all resources are reconciled again for some time and stops working for most resources again.

I will open a new issue as it seems unrelated.

@buehler

We encountered the same issue last week. We perform reconciliation for custom resources every minute, during which we update the status of each custom resource. There are seven resources undergoing reconciliation every minute. However, at random intervals, such as after 2 days or 5 days, we've noticed that the reconciliation process stops for some of these seven resources.

We did not observed any error/exception in logs.

We are using eks 1.24 and kubeops 7.6.1 library.

Is there any fix/workaround available?

Karthik2893 commented 11 months ago

@tomitesh Just wanted to double check that you are not returning "Null" for these objects (for which reconciliation process is stopping) as part of the reconcile loop. I think the issue that I opened has been fixed in the update made by @prochnowc above.

tomitesh commented 11 months ago

@tomitesh Just wanted to double check that you are not returning "Null" for these objects (for which reconciliation process is stopping) as part of the reconcile loop. I think the issue that I opened has been fixed in the update made by @prochnowc above.

Thanks @Karthik2893 for your reply.

We aim to reconcile every minute, opting for option 1 as outlined in the readme to achieve periodic reconciliation. However, after a random interval of 2-3-5 days, reconciliation ceases for certain resources.

While utilizing option 2, reconciliation occurs only once during startup (unless i am missing any configuration).

Note : we also update status as part of "// reconcile logic"

option 1

    public async Task<ResourceControllerResult?> ReconcileAsync(V1DemoEntity entity)
    {
        _logger.LogInformation($"entity {entity.Name()} called {nameof(ReconcileAsync)}.");
        await _finalizerManager.RegisterFinalizerAsync<DemoFinalizer>(entity);

      // reconcile logic

        return ResourceControllerResult.RequeueEvent(TimeSpan.FromSeconds(60));
    }

Option 2

    public async Task<ResourceControllerResult?> ReconcileAsync(V1DemoEntity entity)
    {
        _logger.LogInformation($"entity {entity.Name()} called {nameof(ReconcileAsync)}.");
        await _finalizerManager.RegisterFinalizerAsync<DemoFinalizer>(entity);

      // reconcile logic

        return null;
    }
Karthik2893 commented 11 months ago

@tomitesh Do you happen to know what you WatcherHttpTimeout is set to? Setting it to a higher value (>120min or so) resulted in failing to reconcile after a certain period of time. But if that is the case, it should fail to reconcile for all the entities and not a fraction of entities. For your case, I am suspecting one of your codepaths might be returning "null" for the entities that are failing to reconcile? If not, then we will have to ask others to look into it and it will be helpful if we have code snippets :)

emouawad commented 11 months ago

I can confirm, experiencing the same issue on (8.0.0-pre.29, 8.0.0-pre.34 - didn't try other versions) - during reconcile i update the status and publish events (not sure if related) then requeue. The reconcile is stopped for all my entity instances (only have 2 for now). If i delete those 2 and reapply the reconcile/requeue is restarted for some time and then just stops in less than an hour. Plz let me know if i can help with more logs or testing. Not sure for now - will test again but i am thinking the issue is in the watcher?

buehler commented 11 months ago

Hmm. Good question. It is such a weird behaviour. Do you experience this issue when you test locally (i.e. with docker Kubernetes or minikube or whatever) as well?

Events should not impact the watcher, since they are completely different objects in Kubernetes. However, status updates could impact the requeue cache.

But when you update the status and then requeue the resource, it should retrigger the reconcile.

I'll conduct a test of my own :)

sicavz commented 10 months ago

[Checked on v8.0.0-pre.38] Actually, the flow is pretty simple, and the behaviour is related to timing.

If in the ReconcileAsync one would simply

due to the asynchronous nature of the system, even if the entity status is changed first, the "Modified" event for it might occur AFTER the requeue. If the spec of the entity is not changed (and this is the case), something like "Entity "X" modification did not modify generation. Skip event." will be logged and the requeue will never happen on a scheduled basis (see the _queue.RemoveIfQueued(entity); in the ResourceWatcher::OnEvent - https://github.com/buehler/dotnet-operator-sdk/blob/9bc1efaa94355103e33dc0ceded1a0b666b69629/src/KubeOps.Operator/Watcher/ResourceWatcher%7BTEntity%7D.cs#L168C8-L168C39

You can check the attached screenshot (

screenshot

sicavz commented 10 months ago

It's even more obvious that the requeue is skipped if you look at the log below (after the KubernetesClient reconnects):

[14:38:51.216 - DBG - WebhookOperator.Controller.UscSystemEntityController - requeueing...
[14:38:51.216 - VRB - KubeOps.Abstractions.Queue.EntityRequeue - Requeue entity "X" in 5000ms.
[14:38:51.216 - DBG - WebhookOperator.Controller.XEntityController - requeued

[14:38:52.883 - DBG - KubeOps.Operator.Watcher.ResourceWatcher - The watcher was closed

[14:38:52.883 - DBG - KubeOps.Operator.Watcher.ResourceWatcher - Create watcher for entity of type "WebhookOperator.Entities.XEntity".
[14:38:52.888 - VRB - KubeOps.Operator.Watcher.ResourceWatcher - Received watch event "Added" for "X/alpha".
[14:38:52.888 - DBG - KubeOps.Operator.Watcher.ResourceWatcher - Entity "X/alpha" modification did not modify generation. Skip event.

NOTHING HAPPENS ANYMORE HERE EVEN IF THE THIRD LINE REQUEUED THE ENTITY!
buehler commented 9 months ago

This should be overhauled in v8.

sicavz commented 9 months ago

Unfortunately, my previous two comments were for v8.0.0-pre.38.

buehler commented 9 months ago

Hey @sicavz, so it does not work? when using the framework as documented, I could not reproduce the error. You need to use the returned entities from the client to have the updated resource versions and stuff. And status update does not update the resource version which should be fine.

sicavz commented 9 months ago

I've sketched the flow as I understand it (from the ResourceWatcher)

kubeops drawio (1)

As you see, the issue is dependent on timing, so it's a bit harder to reproduce... (Please review the screenshot I've posted with my first comment)

nullexceptiondev commented 9 months ago

I'm struggling with the same issue in v8. When running in VS locally, everything runs perfectly fine. When deployed to kubernetes I get maybe 2 or 3 reconcile calls before it stops