The operator pod restarts frequently

ghost commented 8 months ago

Describe the bug

The operator pod restarts roughly every hour.

$ kubectl get pods -n theia-cloud-ns 
NAME                                                              READY   STATUS    RESTARTS   AGE
landing-page-deployment-56897bb6bb-flxz9                          1/1     Running   0          56d
operator-deployment-7f9f8664c6-wprmm                              1/1     Running   2102       56d
service-deployment-5696b678d9-wxdc4                               1/1     Running   0          56d

Log before operator pod restart:

05:55:19.123 [OkHttp https://10.222.0.1/...] INFO  org.eclipse.theia.cloud.operator.SpecWatch - [appdefinition-watch-] App Definition reconnecting
WARNING: sun.reflect.Reflection.getCallerClass is not supported. This will impact performance.
05:55:20.296 [OkHttp https://10.222.0.1/...] ERROR org.eclipse.theia.cloud.operator.SpecWatch - [appdefinition-watch-] App Definition watch closed because of an exception
io.fabric8.kubernetes.client.WatcherException: too old resource version: 1573109 (29184097)
    at io.fabric8.kubernetes.client.dsl.internal.AbstractWatchManager.onStatus(AbstractWatchManager.java:300) [operator-0.8.0-SNAPSHOT-jar-with-dependencies.jar:?]
    at io.fabric8.kubernetes.client.dsl.internal.AbstractWatchManager.onMessage(AbstractWatchManager.java:284) [operator-0.8.0-SNAPSHOT-jar-with-dependencies.jar:?]
    at io.fabric8.kubernetes.client.dsl.internal.WatcherWebSocketListener.onMessage(WatcherWebSocketListener.java:68) [operator-0.8.0-SNAPSHOT-jar-with-dependencies.jar:?]
    at io.fabric8.kubernetes.client.okhttp.OkHttpWebSocketImpl$BuilderImpl$1.onMessage(OkHttpWebSocketImpl.java:92) [operator-0.8.0-SNAPSHOT-jar-with-dependencies.jar:?]
    at okhttp3.internal.ws.RealWebSocket.onReadMessage(RealWebSocket.java:323) [operator-0.8.0-SNAPSHOT-jar-with-dependencies.jar:?]
    at okhttp3.internal.ws.WebSocketReader.readMessageFrame(WebSocketReader.java:219) [operator-0.8.0-SNAPSHOT-jar-with-dependencies.jar:?]
    at okhttp3.internal.ws.WebSocketReader.processNextFrame(WebSocketReader.java:105) [operator-0.8.0-SNAPSHOT-jar-with-dependencies.jar:?]
    at okhttp3.internal.ws.RealWebSocket.loopReader(RealWebSocket.java:274) [operator-0.8.0-SNAPSHOT-jar-with-dependencies.jar:?]
    at okhttp3.internal.ws.RealWebSocket$2.onResponse(RealWebSocket.java:214) [operator-0.8.0-SNAPSHOT-jar-with-dependencies.jar:?]
    at okhttp3.RealCall$AsyncCall.execute(RealCall.java:203) [operator-0.8.0-SNAPSHOT-jar-with-dependencies.jar:?]
    at okhttp3.internal.NamedRunnable.run(NamedRunnable.java:32) [operator-0.8.0-SNAPSHOT-jar-with-dependencies.jar:?]
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source) [?:?]
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source) [?:?]
    at java.lang.Thread.run(Unknown Source) [?:?]
Caused by: io.fabric8.kubernetes.client.KubernetesClientException: too old resource version: 1573109 (29184097)
    ... 14 more

Expected behavior

The operator pod should not restart frequently, just like the service and landing-page pods.

Cluster provider

No response

Version

theia-cloud 0.8.0

$ helm list -n theia-cloud-ns 
NAME                NAMESPACE       REVISION    UPDATED                                 STATUS      CHART                   APP VERSION
theia-cloud         theia-cloud-ns  1           2023-09-18 20:34:46.218045795 +0800 CST deployed    theia-cloud-0.8.0       0.8.0      
theia-cloud-base    theia-cloud-ns  1           2023-09-15 16:51:23.082517449 +0800 CST deployed    theia-cloud-base-0.8.0  0.8.0

Additional information

No response

jfaltermeier commented 8 months ago

Thank you for the report. I haven't seen this before on GKE or on a local deployment. At first sight it looks a bit like a bug in the fabric8 Kubernetes client, because I don't think a watch should just crash like this. In the upcoming 0.9 release we will update the fabric8 client from version 5.x to 6.x, so maybe this will help.

Besides that, we may try to restart the watch in onClose rather than stopping the application.
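
To illustrate the idea of restarting the watch in onClose, here is a minimal sketch. It does not use the fabric8 client and is not the operator's code: `RestartingWatch`, `Listener`, `WatchSource` and the `maxRestarts` limit are hypothetical names made up for this example, and the callback shape is only loosely modelled on a watcher with an event hook and a close hook.

```java
import java.util.function.Consumer;

/** Illustrative only: every type here is a hypothetical stand-in, not the fabric8 API. */
public class RestartingWatch {

    /** Callbacks for one watch connection. */
    public interface Listener {
        void onEvent(String event);

        /** cause is null when the watch was closed on purpose. */
        void onClose(Exception cause);
    }

    /** Opens a single watch connection and reports to the listener. */
    public interface WatchSource {
        void open(Listener listener);
    }

    private final WatchSource source;
    private final Consumer<String> handler;
    private final int maxRestarts;
    private int restarts = 0;

    public RestartingWatch(WatchSource source, Consumer<String> handler, int maxRestarts) {
        this.source = source;
        this.handler = handler;
        this.maxRestarts = maxRestarts;
    }

    /** Opens the watch; an exceptional close reopens it instead of exiting. */
    public void start() {
        source.open(new Listener() {
            @Override
            public void onEvent(String event) {
                handler.accept(event);
            }

            @Override
            public void onClose(Exception cause) {
                if (cause == null) {
                    return; // deliberate close, nothing to recover
                }
                if (restarts >= maxRestarts) {
                    // Give up so that whatever supervises the process can restart it.
                    throw new IllegalStateException("watch failed too often", cause);
                }
                restarts++;
                start(); // reopen the watch rather than stopping the application
            }
        });
    }

    public int restarts() {
        return restarts;
    }
}
```

This sketch does not replay events that occurred while the watch was closed, and it reopens immediately without any backoff; a real implementation would probably need to handle both.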

jfaltermeier commented 8 months ago

I think the initial reason for stopping the operator when we get an exception from a watch was that we might have missed events. On a restart we would check all resources from scratch.
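
The "check all resources from scratch" step could, in principle, also run inside the process after a watch failure. The following is a rough sketch of that idea under stated assumptions, not how the operator does it: `Resync` and `diff` are hypothetical names, and resources are reduced to a map from name to a version string. It compares the last known state with a fresh listing and derives the events that may have been missed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

/** Illustrative only: derives missed events by comparing known state with a fresh listing. */
public class Resync {

    /**
     * @param known  resource name to version, as last seen before the watch failed
     * @param listed resource name to version, from a fresh full listing
     * @return synthetic events, sorted by name within each group for stable output
     */
    public static List<String> diff(Map<String, String> known, Map<String, String> listed) {
        List<String> events = new ArrayList<>();
        // Anything listed but unknown was added; a changed version means modified.
        for (Map.Entry<String, String> entry : new TreeMap<>(listed).entrySet()) {
            String oldVersion = known.get(entry.getKey());
            if (oldVersion == null) {
                events.add("ADDED " + entry.getKey());
            } else if (!oldVersion.equals(entry.getValue())) {
                events.add("MODIFIED " + entry.getKey());
            }
        }
        // Anything known but no longer listed was deleted in the meantime.
        for (String name : new TreeSet<>(known.keySet())) {
            if (!listed.containsKey(name)) {
                events.add("DELETED " + name);
            }
        }
        return events;
    }
}
```

For example, with a known state of `{a=1, b=1, c=1}` and a fresh listing of `{a=1, b=2, d=1}`, `Resync.diff` returns `MODIFIED b`, `ADDED d` and `DELETED c`. Restarting the whole process achieves the same effect more simply, because it starts with no known state at all.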

sgraband commented 7 months ago

@qiaozhi92 Could you update your deployment to version 0.9.1 and check if this error still occurs? We experienced this before, but since the update we haven't had any issues.

github-actions[bot] commented 1 month ago

This issue is stale because it has been open for 180 days with no activity.

github-actions[bot] commented 2 weeks ago

This issue was closed because it has been inactive for 14 days since being marked as stale.
