eclipse-theia / theia-cloud

Eclipse Public License 2.0
61 stars 34 forks source link

Session Teardown not having effect on all K8s resources #346

Open iyannsch opened 2 months ago

iyannsch commented 2 months ago

Describe the bug

I noticed that pods of timeouted sessions are still existent on our cluster and are being recreated on deletion by the according replicaset and deployment. As the entry in the ingress is properly removed, they are not of use to the user anymore and block resources.

Expected behavior

During session startup, I expect the deployment, replicaset, and pods to be created as well as a proper entry in the ingress. This is all working perfectly already. During session teardown (due to timeout), I expect all that the system removes all those artifacts again. Currently, only the ingress configuration is removed.

Cluster provider

RKE2 1.27.9.

Version

Newest helm chart

Additional information

Using k get pods you can see the (correctly, not timeouted) session ending on aa0d-7757cf8mn85 and the incorrectly still running session ending on ebbb-77877chgb9t.

NAME                                                              READY   STATUS    RESTARTS         AGE
image-preloading-2bsrf                                            1/1     Running   0                15d
...
image-preloading-w8fcr                                            1/1     Running   0                15d
landing-page-deployment-687bffc9db-rjpp5                          1/1     Running   0                22h
operator-deployment-6df84cc7c6-68rgv                              1/1     Running   43 (21m ago)     2d21h
operator-deployment-6df84cc7c6-rgrnt                              1/1     Running   41 (3h30m ago)   2d21h
service-deployment-6cf6799c8b-26kkt                               1/1     Running   0                15d
session-yannik-schmidt-artemis-java-17-e41c0d24aa0d-7757cf8mn85   2/2     Running   0                8m
session-yannik-schmidt-artemis-java-17-e669b895ebbb-77877chgb9t   2/2     Running   0                42m

Describing the ingress, you can see that the old session has been properly removed.

Rules:
  Host                                      Path  Backends
  ----                                      ----  --------
  instance.theia.k8s.ase.cit.tum.de  
                                            /b599a576-a832-4da2-a8af-e41c0d24aa0d(/|$)(.*)   session-yannik-artemis-java-17-e41c0d24aa0d:3000 (198.18.121.51:5000)
                                            acme.cert-manager.io/http01-ingress-class: nginx
iyannsch commented 2 months ago

From other projects (e.g., Longhorn on RKE rancher), we have experienced troubles with deleting pods due to ConfigMaps somehow blocking their deletion - maybe it's something along those lines? But as the pod, the replicaset, and the deployment are easily deletable manually, I do not think so actually

jfaltermeier commented 2 months ago

So usually what we are doing is to set the ownerReferences on the created Deployment to the creating Session, e.g.

    ownerReferences:
    - apiVersion: theia.cloud/v1beta8
      kind: Session
      name: ws-theia-cloud-demo-name
      uid: 664bc325-606d-434b-9721-1c005e1b16b6

When a timeout is detected only the Session is deleted. (The deletion is picked up by the operator, which patches the Ingress)

The Kubernetes Garbage Collector should then clean up all "children/orphans" of the deleted resource. And when the deployment gets cleaned, the pods should go as well. (https://kubernetes.io/docs/concepts/architecture/garbage-collection/#orphaned-dependents) Does a deployment for the left-over pod still exist? How do the ownerReferences look like on the Deployment?

iyannsch commented 2 months ago

Thanks for the fast and detailed response, that's exactly what I was hoping for @jfaltermeier ! 😍 I was able to find the ownerReference by add -v=10 to the get command on a deployment of a running, non-timeouted session.

In there, the following information appeared to be useful:

"ownerReferences":[
   {"apiVersion":"theia.cloud/v1beta8","kind":"Session","name":"ws-artemis-java-17-yannik-tum-de-session","uid":"de3bc534-b24d-4bed-91b3-e37b3e539db8"}
],"managedFields":[
   {"manager":"fabric8-kubernetes-client","operation":"Update","apiVersion":"apps/v1","time":"2024-08-30T13:22:50Z","fieldsType":"FieldsV1","fieldsV1":{
      "f:metadata":{
         "f:ownerReferences":{
            ".":{},"k:{\"uid\":\"de3bc534-b24d-4bed-91b3-e37b3e539db8\"}":{}
         }
      },"f:spec":{"f:progressDeadlineSeconds":{},"f:replicas":{},

As we can see, the ownerReference is in fact set to a Session, but that seems to be the template session, right?

Using k describe pod session-yannik-artemis-java-17-e37b3e539db8-664445q4wcd yields the correct session data:

Name:             session-yannik-artemis-java-17-e37b3e539db8-664445q4wcd
Labels:           app=ws-artemis-java-17-yannik-tum-de-session-de3bc534-b24d
Controlled By:  ReplicaSet/session-yannik-artemis-java-17-e37b3e539db8-6644456b9f
      THEIACLOUD_SESSION_NAME:                     ws-artemis-java-17-yannik-tum-de-session
    Name:      session-proxy-yannik-artemis-jav-e37b3e539db8
    Name:        session-email-yannik-artemis-jav-e37b3e539db8
sgraband commented 2 months ago

I was able to find the ownerReference by add -v=10 to the get command on a deployment of a running, non-timeouted session.

Does this mean you cannot find the ownerReference of a running, "should-be-timeouted" session?

That would be really odd indeed. In that case i believe it would make sense to take a lot at all the Sessions that are on the cluster (k get session -n <your-namspace>). Can you still see multiple sessions of your user? You could then investigate the session with k get session <session-name> -n <your-namspace> -o yaml and check if there are any errors.

If you can find the ownerReference in a running, "should-be-timeouted" session, then you can just run k get session <name of ownerReference> -n <your-namspace> -o yaml and check if there any errors/weird values.

iyannsch commented 2 months ago

Hi @sgraband,

no I didn't mean to entail that. In a "should-be-timeouted" pod from 4d ago, the ownerReference looks like follows.

k get deployment session-yannik-schmidt-artemis-java-17-e37b3e539db8 -v 100:

"ownerReferences": [
   {
      "apiVersion": "theia.cloud/v1beta8",
      "kind": "Session",
      "name": "ws-artemis-java-17-yannik-schmidt-tum-de-session",
      "uid": "de3bc534-b24d-4bed-91b3-e37b3e539db8"
   }
],

Using k get session in the namespace that also runs the rest of theia, no resources are found. Thus, the deployment does not have an active owner anymore, right?

Maybe it has something to do that the ReplicaSet has the following ownerReference and could restrict the Deployment from being deleted?

k get replicaset session-yannik-schmidt-artemis-java-17-e37b3e539db8-6644456b9f -v 100:

"ownerReferences": [
   {
      "apiVersion": "apps/v1",
      "kind": "Deployment",
      "name": "session-yannik-schmidt-artemis-java-17-e37b3e539db8",
      "uid": "6a5834e5-39b8-426c-ac95-ebe1f1cda5fa",
      "controller": true,
      "blockOwnerDeletion": true
   }
],

When creating a new session via the landing-page, the pods, replicaset, deployment, and route in the ingress are properly setup - also the session is created.

k describe session ws-artemis-java-17-yannik-schmidt-tum-de-session:

Name:         ws-artemis-java-17-yannik-schmidt-tum-de-session
Namespace:    theia-yannik2
Labels:       <none>
Annotations:  <none>
API Version:  theia.cloud/v1beta8
Kind:         Session
Metadata:
  Creation Timestamp:  2024-09-03T07:37:33Z
  Generation:          1
  Resource Version:    248981774
  UID:                 638b6d76-7990-45c5-a1dc-4aa0a16935c7
Spec:
  App Definition:  artemis-java-17
  Env Vars:
  Env Vars From Config Maps:
  Env Vars From Secrets:
  Name:            ws-artemis-java-17-yannik-schmidt-tum-de-session
  Session Secret:  fabb370b-f169-4c68-9b9f-7f4ab3fd1006
  User:            yannik.schmidt@tum.de
Status:
  Last Activity:    0
  Operator Status:  HANDLED
  URL:              instance.theia-yannik.k8s.ase.cit.tum.de/638b6d76-7990-45c5-a1dc-4aa0a16935c7/
Events:             <none>

This time I noticed though, that the monitor connection from the operator to the session does not seem to properly work. Roughly 30secs after spawning the session, the operator notices that the MonitorActivityTracker cannot reach the pod (for whatever reasons, maybe my helm values are faulty in this regard) and stops the session. In the operator logs:

07:38:06.055 [pool-4-thread-1] TRACE org.eclipse.theia.cloud.operator.BasicTheiaCloudOperator - [3ea066da-4e6c-4504-8f93-638683b95c82] Checking ws-artemis-java-17-yannik-schmidt-tum-de-session. minutesSinceLastActivity: 0. limit: 20
07:38:06.056 [pool-4-thread-1] TRACE org.eclipse.theia.cloud.operator.BasicTheiaCloudOperator - [timeout-3ea066da-4e6c-4504-8f93-638683b95c82] Session ws-artemis-java-17-yannik-schmidt-tum-de-session will keep running until the limit of 20 is hit.
07:38:26.237 [pool-7-thread-1] DEBUG org.eclipse.theia.cloud.operator.plugins.MonitorActivityTracker - Pinging sessions: [name=ws-artemis-java-17-yannik-schmidt-tum-de-session version=248981774 value=SessionSpec [name=ws-artemis-java-17-yannik-schmidt-tum-de-session, appDefinition=artemis-java-17, user=yannik.schmidt@tum.de, workspace=null]]
07:38:26.324 [pool-7-thread-1] INFO  org.eclipse.theia.cloud.operator.plugins.MonitorActivityTracker - [ws-artemis-java-17-yannik-schmidt-tum-de-session] Pinging session at 198.19.97.73
07:38:26.325 [pool-7-thread-1] INFO  org.eclipse.theia.cloud.operator.plugins.MonitorActivityTracker - [ws-artemis-java-17-yannik-schmidt-tum-de-session] GET http://198.19.97.73:33000/monitor/activity/lastActivity
07:38:26.326 [pool-7-thread-1] INFO  org.eclipse.theia.cloud.operator.plugins.MonitorActivityTracker - [ws-artemis-java-17-yannik-schmidt-tum-de-session] REQUEST FAILED: GET http://198.19.97.73:33000/monitor/activity/lastActivity. Error: java.net.ConnectException: Failed to connect to /198.19.97.73:33000
07:38:26.327 [pool-7-thread-1] INFO  org.eclipse.theia.cloud.operator.plugins.MonitorActivityTracker - [ws-artemis-java-17-yannik-schmidt-tum-de-session] Last reported activity was: 1970-01-01T00:00:00.000Z (28755818 minutes ago)
07:38:26.394 [pool-7-thread-1] INFO  org.eclipse.theia.cloud.operator.messaging.MonitorMessagingServiceImpl - [ws-artemis-java-17-yannik-schmidt-tum-de-session] Could not send message to extension:java.net.ConnectException: Failed to connect to /198.19.97.73:33000
07:38:26.394 [pool-7-thread-1] INFO  org.eclipse.theia.cloud.operator.plugins.MonitorActivityTracker - [ws-artemis-java-17-yannik-schmidt-tum-de-session] Deleting session as timeout of 19 minutes was reached!

Then, the session resource is (correctly for a time-outed session) removed but the Deployment that correctly links to the session in its ownerReference is not removed.

sgraband commented 2 months ago

I cannot really spot something wrong with the resources you have sent. We are aware of a bug with the monitor, where, when the pod is not yet reachable, the lastActivity is set to 1.1.1970, leading to the pod being shutdown. Or to be more precise, the session resource is deleted, and therefore the pod should be deleted.

The order of ownerReferences should be:

Session resource -> Deployment -> ReplicaSet -> Pod "->" meaning that a resource has an ownerReference to the previous resource (e.g. Deployment has the Session resource as its ownerReference)

This looks correct on your end. Are all of the 4 resources in the same namespace (theia-yannik2)? Another thing that might happen, is that there are some sort of problems with the uids? Could you create a new valid, running session and check if the uid of the ownerRefernce in the deployment is correct (the same as in the session resource)?

iyannsch commented 2 months ago

Monitor configuration

Currently, I'm setting up the pod monitor in the AppDefinition to use some high port. This does not seem to work for the operator though. Do you have any insights in why this setup might not work for monitoring?

AppDefinition:

monitor:
    port: 33000
    activityTracker:
      timeoutAfter:
        19
      notifyAfter:
        15

theia-cloud-helm-values.yml:

monitor:
  enable: true
  activityTracker:
    enable: true
    interval: 1

OwnerReference

✔️ The session created has UID: e4347a58-1458-4ce1-b465-87b56db04fd8 in its metadata and does not have any ownerReference. ✔️ The deployment uses the session's correct name, and uid in its ownerReference ✔️ The replicaset uses the deployment's correct name and uid in its ownerReference. Moreover, controller and blockOwnerDeletion is set to true. Question: Is blockOwnerDeletion correct here? ✔️ The pods uses the replicaset's correct name and uid in its ownerReference. Moreover, controller and blockOwnerDeletion is set to true. Question: Is blockOwnerDeletion correct here?

Even though all ownerReferences seem to be properly set, the "teardown" chain does not start as the deployment stays existent even though the session disappears

sgraband commented 2 months ago

Alright, thank you for checking and the information. I will test this out locally and see if i can reproduce this. Could you answer me two more questions:

iyannsch commented 2 months ago

Thank you so much @sgraband! If it helps we can also schedule a quick call to debug that together? I'm currently not using any Monitoring Extension (and wouldn't know what you are concretely talking about - they never crossed my way). Do you have links for me here?

I configured the AppDefinition with a spec.timeout: 20, spec.monitor.activityTracker.timeoutAfter: 19, and spec.monitor.activityTracker.notifyAfter: 15. I reinstalled Theia Cloud in the namespace completely from scratch roughly 4 days ago. In tests >4d ago, the session was running for 20 minutes before not being properly teared down. Now, the bug makes the operator kill it in <1min.

I'm assuming that the spec.timeout is also used for killing the session after 20 min of inactivity and not after 20mins of working continuously, right?

sgraband commented 2 months ago

Let me quickly elaborate on the timeout/monitoring features. As you noticed we have 3 different values atm that are connected with shutting down the pod after a specified amount of time:

Please note that the latter two are only working with your application if you install the monitor extension. Either the vscode-extension version or the Theia extension. This is because the application needs to provide information about the activity. Basically, the extensions startup a REST service,that the operator can then communicate with. This is also the reason, why your request is failing (your application does not start the REST service, so the URL is not available).

If you do not have one of the two extensions installed you should not add spec.monitor to your appDefinitions. Then the operator will also not try to communicate with the pod.

This just as a little bit of background, i am not sure if this is really the problem you are currently experiencing, but you should definately change this. Maybe you can update this and report back if the issue still persists?

iyannsch commented 2 months ago

Oh that's very interesting @sgraband! I did not expect the spec.timeout to be unconditional and fixed, will keep away from that from now on :)

I'd like to use the VSC extension as it also offers the warning messages etc., is it correct that it has not yet been released to OpenVSX? Do you plan on releasing it there (would probably benefit almost all Theia Cloud installations, right? or would you mind if we (from TUM) do so for our use cases?

As we are building our blueprints in the theia-blueprint repo, it's quite some effort to load the theia-cloud repo and build it every time..

sgraband commented 2 months ago

The feature set of the Theia extension and the VSC extension should be the same. What do you mean with the warning messages?

In theory the theia extension offers more flexibility and is already published and consumable via npm.

Maybe @jfaltermeier could chime in here regarding the plan to publish the vsc extension?

iyannsch commented 2 months ago

to notify the user after an upcoming timeout

This is what I referred to as a "warning" :)

I'm fine using the extension too, when the features do not differ - the information on git about the Theia extension were just a bit scarce so I didn't know where to look for a published version. I'll simply add the extension to the package.json of all blueprints for now and report back with my findings.

Apart from that, I just found out that you are also building and releasing a version of the vsc extension in the theia-cloud repo's release artifacts

iyannsch commented 1 month ago

I added the metrics extension to all my blueprints via the package.json:

"dependencies": {
    "@eclipse-theiacloud/monitor-theia": "next",
    "@theia/core": "^1.53.2",
    "@theia/editor": "^1.53.2",
    "@theia/editor-preview": "^1.53.2",
    "@theia/electron": "^1.53.2",
    "@theia/filesystem": "^1.53.2"
}

The AppDefinition is configured like this and contains port 8081 which should be the default port, right?

monitor:
  port: 8081
  activityTracker:
    timeoutAfter:
      20
    notifyAfter:
      15

In the helm values of theia-cloud, I pick up the port and configure the monitor like this:

monitor:
  enable: true
  port: 8081
  activityTracker:
    enable: true
    interval: 1

According to the operator's log, the MonitorActivityTracker is still timing out and, thus, the session is stopped eventually.

INFO  org.eclipse.theia.cloud.operator.plugins.MonitorActivityTracker - [ws-artemis-java-17-yannik-schmidt-tum-de-session] REQUEST FAILED: GET http://198.19.97.73:8081/monitor/activity/lastActivity. Error: java.net.SocketTimeoutException: Connect timed out
INFO  org.eclipse.theia.cloud.operator.plugins.MonitorActivityTracker - [ws-artemis-java-17-yannik-schmidt-tum-de-session] Last reported activity was: 1970-01-01T00:00:00.000Z (28764843 minutes ago)
TRACE org.eclipse.theia.cloud.operator.BasicTheiaCloudOperator - [timeout-2b490e91-7a3b-462b-91dc-f35e51826654] Session ws-artemis-java-17-yannik-schmidt-tum-de-session will not be stopped automatically [NoTimeout].
INFO  org.eclipse.theia.cloud.operator.messaging.MonitorMessagingServiceImpl - [ws-artemis-java-17-yannik-schmidt-tum-de-session] Could not send message to extension:java.net.SocketTimeoutException: Connect timed out
INFO  org.eclipse.theia.cloud.operator.plugins.MonitorActivityTracker - [ws-artemis-java-17-yannik-schmidt-tum-de-session] Deleting session as timeout of 20 minutes was reached!
TRACE org.eclipse.theia.cloud.operator.util.SpecWatch - [session-watch-2214cdb6-44f5-410d-9d1e-5813ccf01796] Session ac5d2932-cc8c-44b7-8ed8-5dff5930e103 : received an event: DELETED

This opens two questions on my end: 1) Is this still the bug with 1970 you referred to earlier? 2) Why is the metrics endpoint not reachable? Is it sufficient to add the dependencies to the package.json without further configuration?

Apart from that, the session teardown still does not seem to work. The short period of time after starting up the session and having the operator killing it (described above), the deployment is correctly configured to have the session as ownerReference:

"ownerReferences": [
    {
        "apiVersion": "theia.cloud/v1beta8",
        "kind": "Session",
        "name": "ws-artemis-java-17-yannik-schmidt-tum-de-session",
        "uid": "ac5d2932-cc8c-44b7-8ed8-5dff5930e103"
    }
],

Still, when the session is successfully removed due to the monitor malfunction, the deployment stays existent. I think this is a separate problem from the monitor one and aligns more to the initial problem of this thread.

sgraband commented 1 month ago

The default ports are:

So you would need to use 3000. Then the endpoint should work, once the application is started. The bug with 1970 could still happen, but this only happens, when the first request is sent before the application is ready.

Yes sure, i will try to reproduce the issue with the ownerReferences locally and get back to you.

iyannsch commented 1 month ago

Could you quickly elaborate on which ports I need to use where please? In theia-cloud-helm-values.yml:

In AppDefinition:

Am I missing some configuration properties?

sgraband commented 1 month ago

In the newest theia-cloud-helm version there is no more monitor.port value, it has been completely moved to the appDefinition.

I would assume that you need to set both of them to 3000.

See also the demoApplication (demo appDefinition) part of the default values for the helm chart. Hope this helps.

iyannsch commented 1 month ago

Thanks for the clarification! I'm a bit confused about the two documentations regarding Theia: The README's and the "actual" docs (https://main--theia-cloud.netlify.app/documentation/setuptheiacloud/).

I used 3000 for both spec.port and monitor.port for the AppDefinition while removing the monitor.port from the helm values. Unfortunately, the error still persists and but changed from a TimeoutException to a REQUEST FAILED (404):

DEBUG org.eclipse.theia.cloud.operator.plugins.MonitorActivityTracker - Pinging sessions: [name=ws-artemis-java-17-yannik-schmidt-tum-de-session version=255981443 value=SessionSpec [name=ws-artemis-java-17-yannik-schmidt-tum-de-session, appDefinition=artemis-java-17, user=yannik.schmidt@tum.de, workspace=null]]
INFO  org.eclipse.theia.cloud.operator.plugins.MonitorActivityTracker - [ws-artemis-java-17-yannik-schmidt-tum-de-session] Pinging session at 198.19.105.223
INFO  org.eclipse.theia.cloud.operator.plugins.MonitorActivityTracker - [ws-artemis-java-17-yannik-schmidt-tum-de-session] GET http://198.19.105.223:3000/monitor/activity/lastActivity
TRACE org.eclipse.theia.cloud.operator.BasicTheiaCloudOperator - [timeout-205d39a7-d18b-4234-8ef5-723100bfa528] Session ws-artemis-java-17-yannik-schmidt-tum-de-session will not be stopped automatically [NoTimeout].
INFO  org.eclipse.theia.cloud.operator.plugins.MonitorActivityTracker - [ws-artemis-java-17-yannik-schmidt-tum-de-session] REQUEST FAILED (Returned 404: GET http://198.19.105.223:3000/monitor/activity/lastActivity
INFO  org.eclipse.theia.cloud.operator.plugins.MonitorActivityTracker - [ws-artemis-java-17-yannik-schmidt-tum-de-session] Last reported activity was: 1970-01-01T00:00:00.000Z (28766050 minutes ago)
INFO  org.eclipse.theia.cloud.operator.plugins.MonitorActivityTracker - [ws-artemis-java-17-yannik-schmidt-tum-de-session] Deleting session as timeout of 20 minutes was reached!

Can you think of any reason for this behavior? Can I support you in any way to solve this problem?

One more thing I noticed while building the blueprint is that the Docker Build appears to fail as soon as I add @theia/core to the package.json. Regardless of the package's version, the build fails with

120.2 native node modules are already rebuilt for browser
176.0 
176.0 <--- Last few GCs --->
176.0 
176.0 [10616:0x2c3d28b0]    54243 ms: Mark-sweep 4046.3 (4138.9) -> 4034.6 (4143.2) MB, 772.4 / 0.0 ms  (average mu = 0.537, current mu = 0.010) allocation failure; scavenge might not succeed
176.0 [10616:0x2c3d28b0]    55542 ms: Mark-sweep 4050.4 (4143.2) -> 4038.9 (4146.9) MB, 1292.7 / 0.0 ms  (average mu = 0.288, current mu = 0.005) allocation failure; scavenge might not succeed
176.0 
176.0 
176.0 <--- JS stacktrace --->
176.0 
176.0 FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
...
176.1 Aborted
176.1 
176.1 Error: webpack exited with an unexpected code: 134.

on both my 32GB Mac and our university's runners.

All other dependencies ("@eclipse-theiacloud/monitor-theia": "next", "@theia/editor": "^1.53.2", "@theia/editor-preview": "^1.53.2", "@theia/electron": "^1.53.2", "@theia/filesystem": "^1.53.2") do not stop the building process. The one other error is: Error: Patch file found for package monaco-editor-core which is not present at node_modules/@theia/monaco-editor-core

iyannsch commented 1 month ago

Hey @sgraband, small update from our end. Apart from the bugs and errors identified in our last discussion, I found that our cluster is pretty ill-configured too.. Cascading deleted do not seem to work for now across all - also non-theia - manifests. I'll get back to you with updates and look forward to fixes for the identified problems with the monitor 👍

sgraband commented 1 month ago

@iyannsch The described bug with the monitor logic in the operator has been fixed with https://github.com/eclipse-theia/theia-cloud/pull/354 and will be part of the next release.