apache / camel-quarkus

Apache Camel Quarkus
https://camel.apache.org
Apache License 2.0

Kubernetes loses the leader due to the timeout and doesn't elect new one #6761

Open myroch opened 2 weeks ago

myroch commented 2 weeks ago

Bug description

From time to time we have a problem with our master election via Kubernetes. We are currently on Camel Quarkus 3.8.3 and Quarkus 3.8.6 LTS. There is no special configuration for the master election, just the defaults:

quarkus.camel.cluster.kubernetes.enabled=true

Initially the application works like a charm, but later it loses the leadership and no pod holds it anymore. At this point we see the following messages in the log:

pod1 (v26rk):
2024-11-04 13:07:12,199 INFO  [org.apa.cam.com.kub.clu.loc.TimedLeaderNotifier] (Camel (camel-1) thread #8 - CamelKubernetesLeaderNotifier) The cluster has a new leader: Optional.empty
2024-11-04 13:07:12,200 INFO  [org.apa.cam.com.qua.QuartzEndpoint] (Camel (camel-1) thread #8 - CamelKubernetesLeaderNotifier) Pausing trigger ...
2024-11-04 13:07:12,200 INFO  [org.apa.cam.com.qua.QuartzEndpoint] (Camel (camel-1) thread #8 - CamelKubernetesLeaderNotifier) Deleting job ...

There is nothing else relevant to Kubernetes in the pod1 log. The Camel routes are down from this moment on.

pod2 (t5bt5):
2024-11-04 13:12:13,190 WARN  [org.apa.cam.com.kub.clu.loc.KubernetesLeadershipController] (Camel (camel-1) thread #2 - CamelKubernetesLeadershipController) Pod[pod2-985884674-t5bt5] Unable to retrieve the current lease resource my-lease for group my-service from Kubernetes
2024-11-04 13:52:15,345 INFO  [org.apa.cam.com.kub.clu.loc.TimedLeaderNotifier] (Camel (camel-1) thread #8 - CamelKubernetesLeaderNotifier) The cluster has a new leader: Optional.empty
2024-11-04 13:52:15,355 INFO  [org.apa.cam.com.kub.clu.loc.TimedLeaderNotifier] (Camel (camel-1) thread #8 - CamelKubernetesLeaderNotifier) The cluster has a new leader: Optional[pod1-985884674-v26rk]
2024-11-04 14:07:12,343 WARN [org.apa.cam.com.kub.clu.loc.KubernetesLeadershipController] (Camel (camel-1) thread #2 - CamelKubernetesLeadershipController) Pod[srv-mdn-patientdelivery-dev-985884674-t5bt5] Unable to retrieve the current lease resource my-lease for group my-service from Kubernetes
2024-11-04 14:07:15,130 INFO  [org.apa.cam.com.kub.clu.loc.TimedLeaderNotifier] (Camel (camel-1) thread #8 - CamelKubernetesLeaderNotifier) The cluster has a new leader: Optional.empty
...

After a new deployment everything works again. I have absolutely no idea where the bug occurs, which is why I'm reporting it here. Any ideas? I would really appreciate it.

Thanks a lot Miro

jamesnetherton commented 1 week ago

Are you able to give any of the later Camel Quarkus releases a try? Like the 3.15 LTS?

jamesnetherton commented 1 week ago

Something else you could try to get more debugging info would be to turn up the logging on the Kubernetes component.

This configuration should reveal the exception behind "Unable to retrieve the current lease":

quarkus.log.category."org.apache.camel.component.kubernetes.cluster.lock".level=DEBUG

Or to log all debug messages from the Kubernetes component:

quarkus.log.category."org.apache.camel.component.kubernetes".level=DEBUG
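
If the exceptions turn out to be timing-related (e.g. slow API server responses around lease renewal), the lease timing of the cluster service can also be tuned. The property names below come from the Camel Quarkus kubernetes extension configuration; the values are purely illustrative, not recommendations:

```properties
# Illustrative values only - tune to your cluster's API server latency.
# How long an acquired lease is valid if it is not renewed in time.
quarkus.camel.cluster.kubernetes.lease-duration-millis=60000
# How long the current leader keeps trying to renew before giving up leadership.
quarkus.camel.cluster.kubernetes.renew-deadline-millis=45000
# How often the leader renews and the followers poll the lease.
quarkus.camel.cluster.kubernetes.retry-period-millis=9000
```

A longer renew deadline gives the leader more headroom to survive transient API server hiccups before it drops the lease.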

myroch commented 2 days ago

Hello James, the problem exists in LTS 3.15 as well. I've enabled DEBUG logs and I can see the following exceptions:

Error while closing watcher: io.fabric8.kubernetes.client.WatcherException: The resourceVersion for the provided watch is too old.
    at io.fabric8.kubernetes.client.dsl.internal.AbstractWatchManager.onStatus(AbstractWatchManager.java:401)
    at io.fabric8.kubernetes.client.dsl.internal.AbstractWatchManager.onMessage(AbstractWatchManager.java:369)

or

Error received during lease resource lock replace: io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: PUT at: https://100.68.0.1:443/apis/coordination.k8s.io/v1/namespaces/lab-mdc-leaderelection-dev/leases/lab-mdc-leaderelection-dev-mylease. Message: Operation cannot be fulfilled on leases.coordination.k8s.io "lab-mdc-leaderelection-dev-mylease": the object has been modified; please apply your changes to the latest version and try again. Received status: Status(apiVersion=v1, code=409, details=StatusDetails(causes=[], group=coordination.k8s.io, kind=leases, name=lab-mdc-leaderelection-dev-mylease, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=Operation cannot be fulfilled on leases.coordination.k8s.io "lab-mdc-leaderelection-dev-mylease": the object has been modified; please apply your changes to the latest version and try again, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=Conflict, status=Failure, additionalProperties={}).
    at io.fabric8.kubernetes.client.KubernetesClientException.copyAsCause(KubernetesClientException.java:238)
    at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.waitForResult(OperationSupport.java:507)
    at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleResponse(OperationSupport.java:524)

or

Exception thrown during lease resource lookup: io.fabric8.kubernetes.client.KubernetesClientException: The timeout period of 10000ms has been exceeded while executing GET /apis/coordination.k8s.io/v1/namespaces/lab-mdc-leaderelection-dev/leases/lab-mdc-leaderelection-dev-mylease for server null
    at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.waitForResult(OperationSupport.java:509)
    at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleResponse(OperationSupport.java:524)
    at io.fabric8.kubernetes.client.dsl.internal.OperationSupport.handleGet(OperationSupport.java:467)

Do you have any idea what I should change? Thanks a lot for helping! m.