Open buehler opened 1 year ago
Following up on our discussion to get more context on this:
My understanding is that replacing "Switch" with "Merge" should fix this issue. I want to be sure that this won't cause any unintended side-effects.
Thank you.
@buehler I think we're also being hit by this bug - I couldnt debug our operator yet.
The issue is that after some reconcile loop's most of the resources are never reconciled again, even if the resource was updated in kubernetes (not the status). Strangely this does not happen for all resources.
After restarting the operator all resources are reconciled again for some time and stops working for most resources again.
I will open a new issue as it seems unrelated.
@buehler I think we're also being hit by this bug - I couldnt debug our operator yet.
The issue is that after some reconcile loop's most of the resources are never reconciled again, even if the resource was updated in kubernetes (not the status). Strangely this does not happen for all resources.
After restarting the operator all resources are reconciled again for some time and stops working for most resources again.
I will open a new issue as it seems unrelated.
@buehler
We encountered the same issue last week. We perform reconciliation for custom resources every minute, during which we update the status of each custom resource. There are seven resources undergoing reconciliation every minute. However, at random intervals, such as after 2 days or 5 days, we've noticed that the reconciliation process stops for some of these seven resources.
We did not observed any error/exception in logs.
We are using eks 1.24 and kubeops 7.6.1 library.
Is there any fix/workaround available?
@tomitesh Just wanted to double check that you are not returning "Null" for these objects (for which reconciliation process is stopping) as part of the reconcile loop. I think the issue that I opened has been fixed in the update made by @prochnowc above.
@tomitesh Just wanted to double check that you are not returning "Null" for these objects (for which reconciliation process is stopping) as part of the reconcile loop. I think the issue that I opened has been fixed in the update made by @prochnowc above.
Thanks @Karthik2893 for your reply.
We aim to reconcile every minute, opting for option 1 as outlined in the readme to achieve periodic reconciliation. However, after a random interval of 2-3-5 days, reconciliation ceases for certain resources.
While utilizing option 2, reconciliation occurs only once during startup (unless i am missing any configuration).
Note : we also update status as part of "// reconcile logic"
option 1
public async Task<ResourceControllerResult?> ReconcileAsync(V1DemoEntity entity)
{
_logger.LogInformation($"entity {entity.Name()} called {nameof(ReconcileAsync)}.");
await _finalizerManager.RegisterFinalizerAsync<DemoFinalizer>(entity);
// reconcile logic
return ResourceControllerResult.RequeueEvent(TimeSpan.FromSeconds(60));
}
Option 2
public async Task<ResourceControllerResult?> ReconcileAsync(V1DemoEntity entity)
{
_logger.LogInformation($"entity {entity.Name()} called {nameof(ReconcileAsync)}.");
await _finalizerManager.RegisterFinalizerAsync<DemoFinalizer>(entity);
// reconcile logic
return null;
}
@tomitesh Do you happen to know what you WatcherHttpTimeout is set to? Setting it to a higher value (>120min or so) resulted in failing to reconcile after a certain period of time. But if that is the case, it should fail to reconcile for all the entities and not a fraction of entities. For your case, I am suspecting one of your codepaths might be returning "null" for the entities that are failing to reconcile? If not, then we will have to ask others to look into it and it will be helpful if we have code snippets :)
I can confirm, experiencing the same issue on (8.0.0-pre.29, 8.0.0-pre.34 - didn't try other versions) - during reconcile i update the status and publish events (not sure if related) then requeue. The reconcile is stopped for all my entity instances (only have 2 for now). If i delete those 2 and reapply the reconcile/requeue is restarted for some time and then just stops in less than an hour. Plz let me know if i can help with more logs or testing. Not sure for now - will test again but i am thinking the issue is in the watcher?
Hmm. Good question. It is such a weird behaviour. Do you experience this issue when you test locally (i.e. with docker Kubernetes or minikube or whatever) as well?
Events should not impact the watcher, since they are completely different objects in Kubernetes. However, status updates could impact the requeue cache.
But when you update the status and then requeue the resource, it should retrigger the reconcile.
I'll conduct a test of my own :)
[Checked on v8.0.0-pre.38] Actually, the flow is pretty simple, and the behaviour is related to timing.
If in the ReconcileAsync one would simply
due to the asynchronous nature of the system, even if the entity status is changed first, the "Modified" event for it might occur AFTER the requeue.
If the spec of the entity is not changed (and this is the case), something like
"Entity "X" modification did not modify generation. Skip event."
will be logged and the requeue will never happen on a scheduled basis (see the _queue.RemoveIfQueued(entity);
in the ResourceWatcher::OnEvent - https://github.com/buehler/dotnet-operator-sdk/blob/9bc1efaa94355103e33dc0ceded1a0b666b69629/src/KubeOps.Operator/Watcher/ResourceWatcher%7BTEntity%7D.cs#L168C8-L168C39
You can check the attached screenshot (
It's even more obvious that the requeue is skipped if you look at the log below (after the KubernetesClient reconnects):
[14:38:51.216 - DBG - WebhookOperator.Controller.UscSystemEntityController - requeueing...
[14:38:51.216 - VRB - KubeOps.Abstractions.Queue.EntityRequeue - Requeue entity "X" in 5000ms.
[14:38:51.216 - DBG - WebhookOperator.Controller.XEntityController - requeued
[14:38:52.883 - DBG - KubeOps.Operator.Watcher.ResourceWatcher - The watcher was closed
[14:38:52.883 - DBG - KubeOps.Operator.Watcher.ResourceWatcher - Create watcher for entity of type "WebhookOperator.Entities.XEntity".
[14:38:52.888 - VRB - KubeOps.Operator.Watcher.ResourceWatcher - Received watch event "Added" for "X/alpha".
[14:38:52.888 - DBG - KubeOps.Operator.Watcher.ResourceWatcher - Entity "X/alpha" modification did not modify generation. Skip event.
NOTHING HAPPENS ANYMORE HERE EVEN IF THE THIRD LINE REQUEUED THE ENTITY!
This should be overhauled in v8.
Unfortunately, my previous two comments were for v8.0.0-pre.38.
Hey @sicavz, so it does not work? when using the framework as documented, I could not reproduce the error. You need to use the returned entities from the client to have the updated resource versions and stuff. And status update does not update the resource version which should be fine.
I've sketched the flow as I understand it (from the ResourceWatcher
)
As you see, the issue is dependent on timing, so it's a bit harder to reproduce... (Please review the screenshot I've posted with my first comment)
I'm struggling with the same issue in v8. When running in VS locally, everything runs perfectly fine. When deployed to kubernetes I get maybe 2 or 3 reconcile calls before it stops
Describe the bug
As discussed in #551, the timed reconcile return value is ignored when a status update is performed during the reconciliation loop.
To reproduce
Expected behavior
No response
Screenshots
No response
Additional Context
https://github.com/buehler/dotnet-operator-sdk/blob/3f42bbcb13d98a0811606b4b8ca83566c74a988a/src/KubeOps/Operator/Controller/EventQueue.cs#L44-L55