iyannsch opened this issue 2 months ago
From other projects (e.g., Longhorn on RKE Rancher), we have experienced trouble with deleting pods due to ConfigMaps somehow blocking their deletion; maybe it's something along those lines? But as the pod, the replicaset, and the deployment are easily deletable manually, I actually do not think so.
So usually what we are doing is to set the `ownerReferences` on the created `Deployment` to the creating `Session`, e.g.:

```yaml
ownerReferences:
  - apiVersion: theia.cloud/v1beta8
    kind: Session
    name: ws-theia-cloud-demo-name
    uid: 664bc325-606d-434b-9721-1c005e1b16b6
```
When a timeout is detected, only the `Session` is deleted. (The deletion is picked up by the operator, which patches the Ingress.)
The Kubernetes Garbage Collector should then clean up all "children/orphans" of the deleted resource. And when the deployment gets cleaned, the pods should go as well. (https://kubernetes.io/docs/concepts/architecture/garbage-collection/#orphaned-dependents)
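For reference, a quick way to see exactly which `ownerReferences` the garbage collector acts on is to query the Deployment directly; a minimal sketch (resource and namespace names are placeholders, and `jq` is assumed to be available):

```sh
# Print the ownerReferences recorded on the session's Deployment.
kubectl get deployment <session-deployment-name> -n <your-namespace> \
  -o json | jq '.metadata.ownerReferences'
```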
Does a deployment for the left-over pod still exist? What do the `ownerReferences` look like on the `Deployment`?
Thanks for the fast and detailed response, that's exactly what I was hoping for @jfaltermeier ! 😍
I was able to find the `ownerReference` by adding `-v=10` to the `get` command on a deployment of a running, non-timeouted session. In there, the following information appeared to be useful:
"ownerReferences":[
{"apiVersion":"theia.cloud/v1beta8","kind":"Session","name":"ws-artemis-java-17-yannik-tum-de-session","uid":"de3bc534-b24d-4bed-91b3-e37b3e539db8"}
],"managedFields":[
{"manager":"fabric8-kubernetes-client","operation":"Update","apiVersion":"apps/v1","time":"2024-08-30T13:22:50Z","fieldsType":"FieldsV1","fieldsV1":{
"f:metadata":{
"f:ownerReferences":{
".":{},"k:{\"uid\":\"de3bc534-b24d-4bed-91b3-e37b3e539db8\"}":{}
}
},"f:spec":{"f:progressDeadlineSeconds":{},"f:replicas":{},
As we can see, the `ownerReference` is in fact set to a `Session`, but that seems to be the template session, right?
Using `k describe pod session-yannik-artemis-java-17-e37b3e539db8-664445q4wcd` yields the correct session data:

```
Name:           session-yannik-artemis-java-17-e37b3e539db8-664445q4wcd
Labels:         app=ws-artemis-java-17-yannik-tum-de-session-de3bc534-b24d
Controlled By:  ReplicaSet/session-yannik-artemis-java-17-e37b3e539db8-6644456b9f
THEIACLOUD_SESSION_NAME: ws-artemis-java-17-yannik-tum-de-session
Name:           session-proxy-yannik-artemis-jav-e37b3e539db8
Name:           session-email-yannik-artemis-jav-e37b3e539db8
```
> I was able to find the ownerReference by adding -v=10 to the get command on a deployment of a running, non-timeouted session.
Does this mean you cannot find the ownerReference of a running, "should-be-timeouted" session?
That would be really odd indeed. In that case I believe it would make sense to take a look at all the Sessions that are on the cluster (`k get session -n <your-namespace>`). Can you still see multiple sessions of your user? You could then investigate the session with `k get session <session-name> -n <your-namespace> -o yaml` and check if there are any errors.
If you can find the `ownerReference` in a running, "should-be-timeouted" session, then you can just run `k get session <name of ownerReference> -n <your-namespace> -o yaml` and check if there are any errors/weird values.
Hi @sgraband,
No, I didn't mean to imply that. In a "should-be-timeouted" pod from 4 days ago, the `ownerReference` looks as follows.
`k get deployment session-yannik-schmidt-artemis-java-17-e37b3e539db8 -v 100`:

```json
"ownerReferences": [
  {
    "apiVersion": "theia.cloud/v1beta8",
    "kind": "Session",
    "name": "ws-artemis-java-17-yannik-schmidt-tum-de-session",
    "uid": "de3bc534-b24d-4bed-91b3-e37b3e539db8"
  }
],
```
Using `k get session` in the namespace that also runs the rest of Theia, no resources are found. Thus, the deployment does not have an active owner anymore, right?
Maybe it has something to do with the ReplicaSet having the following `ownerReference`, which could restrict the Deployment from being deleted?
`k get replicaset session-yannik-schmidt-artemis-java-17-e37b3e539db8-6644456b9f -v 100`:

```json
"ownerReferences": [
  {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "name": "session-yannik-schmidt-artemis-java-17-e37b3e539db8",
    "uid": "6a5834e5-39b8-426c-ac95-ebe1f1cda5fa",
    "controller": true,
    "blockOwnerDeletion": true
  }
],
```
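One way to double-check what the garbage collector sees here is to compare the owner UID recorded on the Deployment against the Sessions that actually exist; a sketch with placeholder names:

```sh
# UID the Deployment claims as its owner:
kubectl get deployment <session-deployment-name> -n <your-namespace> \
  -o json | jq -r '.metadata.ownerReferences[0].uid'

# Name and UID of every Session that still exists; if the owner UID above
# is not in this list, the GC should eventually collect the Deployment:
kubectl get session -n <your-namespace> \
  -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.uid}{"\n"}{end}'
```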
When creating a new session via the landing page, the pods, replicaset, deployment, and route in the ingress are properly set up, and the session is created as well.
`k describe session ws-artemis-java-17-yannik-schmidt-tum-de-session`:
```
Name:         ws-artemis-java-17-yannik-schmidt-tum-de-session
Namespace:    theia-yannik2
Labels:       <none>
Annotations:  <none>
API Version:  theia.cloud/v1beta8
Kind:         Session
Metadata:
  Creation Timestamp:  2024-09-03T07:37:33Z
  Generation:          1
  Resource Version:    248981774
  UID:                 638b6d76-7990-45c5-a1dc-4aa0a16935c7
Spec:
  App Definition:  artemis-java-17
  Env Vars:
  Env Vars From Config Maps:
  Env Vars From Secrets:
  Name:            ws-artemis-java-17-yannik-schmidt-tum-de-session
  Session Secret:  fabb370b-f169-4c68-9b9f-7f4ab3fd1006
  User:            yannik.schmidt@tum.de
Status:
  Last Activity:    0
  Operator Status:  HANDLED
  URL:              instance.theia-yannik.k8s.ase.cit.tum.de/638b6d76-7990-45c5-a1dc-4aa0a16935c7/
Events:             <none>
```
This time I noticed, though, that the monitor connection from the operator to the session does not seem to work properly. Roughly 30 seconds after spawning the session, the operator notices that the `MonitorActivityTracker` cannot reach the pod (for whatever reason; maybe my Helm values are faulty in this regard) and stops the session.
In the operator logs:
```
07:38:06.055 [pool-4-thread-1] TRACE org.eclipse.theia.cloud.operator.BasicTheiaCloudOperator - [3ea066da-4e6c-4504-8f93-638683b95c82] Checking ws-artemis-java-17-yannik-schmidt-tum-de-session. minutesSinceLastActivity: 0. limit: 20
07:38:06.056 [pool-4-thread-1] TRACE org.eclipse.theia.cloud.operator.BasicTheiaCloudOperator - [timeout-3ea066da-4e6c-4504-8f93-638683b95c82] Session ws-artemis-java-17-yannik-schmidt-tum-de-session will keep running until the limit of 20 is hit.
07:38:26.237 [pool-7-thread-1] DEBUG org.eclipse.theia.cloud.operator.plugins.MonitorActivityTracker - Pinging sessions: [name=ws-artemis-java-17-yannik-schmidt-tum-de-session version=248981774 value=SessionSpec [name=ws-artemis-java-17-yannik-schmidt-tum-de-session, appDefinition=artemis-java-17, user=yannik.schmidt@tum.de, workspace=null]]
07:38:26.324 [pool-7-thread-1] INFO org.eclipse.theia.cloud.operator.plugins.MonitorActivityTracker - [ws-artemis-java-17-yannik-schmidt-tum-de-session] Pinging session at 198.19.97.73
07:38:26.325 [pool-7-thread-1] INFO org.eclipse.theia.cloud.operator.plugins.MonitorActivityTracker - [ws-artemis-java-17-yannik-schmidt-tum-de-session] GET http://198.19.97.73:33000/monitor/activity/lastActivity
07:38:26.326 [pool-7-thread-1] INFO org.eclipse.theia.cloud.operator.plugins.MonitorActivityTracker - [ws-artemis-java-17-yannik-schmidt-tum-de-session] REQUEST FAILED: GET http://198.19.97.73:33000/monitor/activity/lastActivity. Error: java.net.ConnectException: Failed to connect to /198.19.97.73:33000
07:38:26.327 [pool-7-thread-1] INFO org.eclipse.theia.cloud.operator.plugins.MonitorActivityTracker - [ws-artemis-java-17-yannik-schmidt-tum-de-session] Last reported activity was: 1970-01-01T00:00:00.000Z (28755818 minutes ago)
07:38:26.394 [pool-7-thread-1] INFO org.eclipse.theia.cloud.operator.messaging.MonitorMessagingServiceImpl - [ws-artemis-java-17-yannik-schmidt-tum-de-session] Could not send message to extension:java.net.ConnectException: Failed to connect to /198.19.97.73:33000
07:38:26.394 [pool-7-thread-1] INFO org.eclipse.theia.cloud.operator.plugins.MonitorActivityTracker - [ws-artemis-java-17-yannik-schmidt-tum-de-session] Deleting session as timeout of 19 minutes was reached!
```
Then, the `Session` resource is removed (correctly, for a timed-out session), but the `Deployment` that correctly links to the session in its `ownerReference` is not removed.
I cannot really spot anything wrong with the resources you have sent. We are aware of a bug with the monitor where, when the pod is not yet reachable, the `lastActivity` is set to 1970-01-01, leading to the pod being shut down. Or, to be more precise, the `Session` resource is deleted, and therefore the pod should be deleted.
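The enormous "minutes ago" figures in the log fit that picture: a `lastActivity` of epoch 0 makes the inactivity window span all of Unix time, so any limit is exceeded immediately. A quick sanity check of the arithmetic:

```sh
# Minutes elapsed since 1970-01-01T00:00:00Z; in 2024 this is roughly the
# "28755818 minutes ago" value the operator logs when lastActivity is 0.
echo $(( $(date +%s) / 60 ))
```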
The order of `ownerReferences` should be:

`Session` resource -> `Deployment` -> `ReplicaSet` -> `Pod`

with "->" meaning that a resource has an `ownerReference` to the previous resource (e.g., the `Deployment` has the `Session` resource as its `ownerReference`).
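A sketch for verifying that chain on a live cluster (placeholder names; each resource prints its owner, which should match the next resource up the chain):

```sh
# Walk the ownership chain bottom-up.
for res in pod/<pod-name> replicaset/<replicaset-name> deployment/<deployment-name>; do
  echo "== $res is owned by:"
  kubectl get "$res" -n <your-namespace> \
    -o jsonpath='{range .metadata.ownerReferences[*]}{.kind}{"/"}{.name}{"  uid="}{.uid}{"\n"}{end}'
done
```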
This looks correct on your end. Are all 4 resources in the same namespace (`theia-yannik2`)? Another thing that might happen is that there is some sort of problem with the uids. Could you create a new valid, running session and check if the uid of the `ownerReference` in the deployment is correct (i.e., the same as in the `Session` resource)?
Currently, I'm setting up the pod monitor in the `AppDefinition` to use some high port. This does not seem to work for the operator, though. Do you have any insights into why this setup might not work for monitoring?
AppDefinition:

```yaml
monitor:
  port: 33000
  activityTracker:
    timeoutAfter: 19
    notifyAfter: 15
```
theia-cloud-helm-values.yml:

```yaml
monitor:
  enable: true
  activityTracker:
    enable: true
    interval: 1
```
✔️ The `session` created has UID `e4347a58-1458-4ce1-b465-87b56db04fd8` in its metadata and does not have any `ownerReference`.
✔️ The `deployment` uses the session's correct `name` and `uid` in its `ownerReference`.
✔️ The `replicaset` uses the deployment's correct `name` and `uid` in its `ownerReference`. Moreover, `controller` and `blockOwnerDeletion` are set to `true`. Question: Is `blockOwnerDeletion` correct here?
✔️ The `pods` use the replicaset's correct `name` and `uid` in their `ownerReference`. Moreover, `controller` and `blockOwnerDeletion` are set to `true`. Question: Is `blockOwnerDeletion` correct here?

Even though all `ownerReferences` seem to be properly set, the teardown chain does not start: the deployment stays existent even though the session disappears (see the diagnostic sketch below).
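On the `blockOwnerDeletion` question: `controller: true` plus `blockOwnerDeletion: true` is exactly what Kubernetes controllers set by default, and `blockOwnerDeletion` only delays the owner's deletion during foreground cascading deletion, so it should not prevent normal background GC. To narrow down whether the GC itself is the problem, a diagnostic sketch (not part of the normal teardown) is to delete the Deployment with an explicit cascade mode:

```sh
# If foreground cascading works here but deleting the Session still leaves
# the Deployment behind, the failure is in the Session -> Deployment GC step.
kubectl delete deployment <session-deployment-name> -n <your-namespace> --cascade=foreground

# ReplicaSet and Pod should be gone shortly afterwards:
kubectl get replicaset,pod -n <your-namespace>
```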
Alright, thank you for checking and for the information. I will test this out locally and see if I can reproduce this. Could you answer two more questions for me:
Thank you so much @sgraband! If it helps we can also schedule a quick call to debug that together? I'm currently not using any Monitoring Extension (and wouldn't know what you are concretely talking about - they never crossed my way). Do you have links for me here?
I configured the `AppDefinition` with `spec.timeout: 20`, `spec.monitor.activityTracker.timeoutAfter: 19`, and `spec.monitor.activityTracker.notifyAfter: 15`. I reinstalled Theia Cloud in the namespace completely from scratch roughly 4 days ago. In tests more than 4 days ago, the session was running for 20 minutes before not being properly torn down. Now, the bug makes the operator kill it in under a minute.
I'm assuming that the `spec.timeout` is also used for killing the session after 20 minutes of inactivity and not after 20 minutes of continuous work, right?
Let me quickly elaborate on the timeout/monitoring features. As you noticed we have 3 different values atm that are connected with shutting down the pod after a specified amount of time:
- `spec.timeout`: This is a fixed timeout, meaning that the session will time out x minutes after its creation.
- `spec.monitor.activityTracker.timeoutAfter`: this is what you were describing, so after x minutes of inactivity the pod will be shut down.
- `spec.monitor.activityTracker.notifyAfter`: this is simply to notify the user of an upcoming timeout. So after x minutes a message is displayed that the pod is about to shut down.

Please note that the latter two only work with your application if you install the monitor extension, either the VS Code extension version or the Theia extension. This is because the application needs to provide information about the activity. Basically, the extensions start up a REST service that the operator can then communicate with. This is also the reason why your request is failing (your application does not start the REST service, so the URL is not available).
If you do not have one of the two extensions installed, you should not add `spec.monitor` to your appDefinitions. Then the operator will also not try to communicate with the pod.
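Removing the block could look like this; a sketch assuming the resource is reachable as `appdefinition` and that `spec.monitor` is currently set:

```sh
# Drop spec.monitor from the AppDefinition so the operator stops pinging the pod.
kubectl patch appdefinition <app-definition-name> -n <your-namespace> \
  --type=json -p='[{"op": "remove", "path": "/spec/monitor"}]'
```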
This is just a little bit of background; I am not sure if this is really the problem you are currently experiencing, but you should definitely change this. Maybe you can update this and report back whether the issue still persists?
Oh, that's very interesting @sgraband! I did not expect the `spec.timeout` to be unconditional and fixed; I will keep away from that from now on :)
I'd like to use the VSC extension as it also offers the warning messages etc. Is it correct that it has not yet been released to OpenVSX? Do you plan on releasing it there (it would probably benefit almost all Theia Cloud installations, right?), or would you mind if we (from TUM) do so for our use cases?
As we are building our blueprints in the theia-blueprint repo, it's quite some effort to load the theia-cloud repo and build it every time.
The feature set of the Theia extension and the VSC extension should be the same. What do you mean by the warning messages?
In theory the Theia extension offers more flexibility and is already published and consumable via npm.
Maybe @jfaltermeier could chime in here regarding the plan to publish the VSC extension?
> to notify the user of an upcoming timeout

This is what I referred to as a "warning" :)
I'm fine using the Theia extension too, when the features do not differ; the information on Git about the Theia extension was just a bit scarce, so I didn't know where to look for a published version. I'll simply add the extension to the package.json of all blueprints for now and report back with my findings.
Apart from that, I just found out that you are also building and releasing a version of the VSC extension in the theia-cloud repo's release artifacts.
I added the monitor extension to all my blueprints via the package.json:

```json
"dependencies": {
  "@eclipse-theiacloud/monitor-theia": "next",
  "@theia/core": "^1.53.2",
  "@theia/editor": "^1.53.2",
  "@theia/editor-preview": "^1.53.2",
  "@theia/electron": "^1.53.2",
  "@theia/filesystem": "^1.53.2"
}
```
The `AppDefinition` is configured like this and contains port 8081, which should be the default port, right?

```yaml
monitor:
  port: 8081
  activityTracker:
    timeoutAfter: 20
    notifyAfter: 15
```
In the helm values of theia-cloud, I pick up the port and configure the monitor like this:

```yaml
monitor:
  enable: true
  port: 8081
  activityTracker:
    enable: true
    interval: 1
```
According to the operator's log, the `MonitorActivityTracker` is still timing out and, thus, the session is eventually stopped.
```
INFO org.eclipse.theia.cloud.operator.plugins.MonitorActivityTracker - [ws-artemis-java-17-yannik-schmidt-tum-de-session] REQUEST FAILED: GET http://198.19.97.73:8081/monitor/activity/lastActivity. Error: java.net.SocketTimeoutException: Connect timed out
INFO org.eclipse.theia.cloud.operator.plugins.MonitorActivityTracker - [ws-artemis-java-17-yannik-schmidt-tum-de-session] Last reported activity was: 1970-01-01T00:00:00.000Z (28764843 minutes ago)
TRACE org.eclipse.theia.cloud.operator.BasicTheiaCloudOperator - [timeout-2b490e91-7a3b-462b-91dc-f35e51826654] Session ws-artemis-java-17-yannik-schmidt-tum-de-session will not be stopped automatically [NoTimeout].
INFO org.eclipse.theia.cloud.operator.messaging.MonitorMessagingServiceImpl - [ws-artemis-java-17-yannik-schmidt-tum-de-session] Could not send message to extension:java.net.SocketTimeoutException: Connect timed out
INFO org.eclipse.theia.cloud.operator.plugins.MonitorActivityTracker - [ws-artemis-java-17-yannik-schmidt-tum-de-session] Deleting session as timeout of 20 minutes was reached!
TRACE org.eclipse.theia.cloud.operator.util.SpecWatch - [session-watch-2214cdb6-44f5-410d-9d1e-5813ccf01796] Session ac5d2932-cc8c-44b7-8ed8-5dff5930e103 : received an event: DELETED
```
This opens two questions on my end:

1) Is this still the 1970 bug you referred to earlier?
2) Why is the monitor endpoint not reachable? Is it sufficient to add the dependencies to the package.json without further configuration?
Apart from that, the session teardown still does not seem to work. During the short period between starting the session and the operator killing it (described above), the deployment is correctly configured with the session as its `ownerReference`:

```json
"ownerReferences": [
  {
    "apiVersion": "theia.cloud/v1beta8",
    "kind": "Session",
    "name": "ws-artemis-java-17-yannik-schmidt-tum-de-session",
    "uid": "ac5d2932-cc8c-44b7-8ed8-5dff5930e103"
  }
],
```
Still, when the session is successfully removed due to the monitor malfunction, the deployment stays existent. I think this is a separate problem from the monitor one and aligns more with the initial problem of this thread.
The default ports are 8081 for the VS Code extension and 3000 for the Theia extension (which runs inside the Theia application itself). So you would need to use 3000. Then the endpoint should work once the application is started. The 1970 bug could still happen, but only when the first request is sent before the application is ready.
Yes, sure, I will try to reproduce the issue with the ownerReferences locally and get back to you.
Could you quickly elaborate on which ports I need to use where, please?

In `theia-cloud-helm-values.yml`:

- `monitor.port`

In the AppDefinition:

- `spec.port`
- `monitor.port`

Am I missing some configuration properties?
In the newest theia-cloud-helm version there is no `monitor.port` value anymore; it has been moved completely to the appDefinition.

- `spec.port` is the port where your application is running (Theia's default is 3000).
- `monitor.port` is the port where the monitor extension is running. For the Theia extension this is the same as `spec.port`, as it is running in the backend of the Theia application.

I would assume that you need to set both of them to 3000.
See also the `demoApplication` (demo appDefinition) part of the default values for the helm chart. Hope this helps.
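Applied to the AppDefinition from this thread, setting both ports to 3000 could look like this; a sketch using a merge patch, with the field layout following the snippets above:

```sh
# Set both the application port and the monitor port to Theia's default 3000.
kubectl patch appdefinition <app-definition-name> -n <your-namespace> \
  --type=merge -p '{"spec": {"port": 3000, "monitor": {"port": 3000}}}'
```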
Thanks for the clarification! I'm a bit confused about the two documentations regarding Theia: the READMEs and the "actual" docs (https://main--theia-cloud.netlify.app/documentation/setuptheiacloud/).
I used 3000 for both `spec.port` and `monitor.port` in the AppDefinition while removing `monitor.port` from the helm values. Unfortunately, the error still persists, but it changed from a `TimeoutException` to a `REQUEST FAILED (404)`:
```
DEBUG org.eclipse.theia.cloud.operator.plugins.MonitorActivityTracker - Pinging sessions: [name=ws-artemis-java-17-yannik-schmidt-tum-de-session version=255981443 value=SessionSpec [name=ws-artemis-java-17-yannik-schmidt-tum-de-session, appDefinition=artemis-java-17, user=yannik.schmidt@tum.de, workspace=null]]
INFO org.eclipse.theia.cloud.operator.plugins.MonitorActivityTracker - [ws-artemis-java-17-yannik-schmidt-tum-de-session] Pinging session at 198.19.105.223
INFO org.eclipse.theia.cloud.operator.plugins.MonitorActivityTracker - [ws-artemis-java-17-yannik-schmidt-tum-de-session] GET http://198.19.105.223:3000/monitor/activity/lastActivity
TRACE org.eclipse.theia.cloud.operator.BasicTheiaCloudOperator - [timeout-205d39a7-d18b-4234-8ef5-723100bfa528] Session ws-artemis-java-17-yannik-schmidt-tum-de-session will not be stopped automatically [NoTimeout].
INFO org.eclipse.theia.cloud.operator.plugins.MonitorActivityTracker - [ws-artemis-java-17-yannik-schmidt-tum-de-session] REQUEST FAILED (Returned 404: GET http://198.19.105.223:3000/monitor/activity/lastActivity
INFO org.eclipse.theia.cloud.operator.plugins.MonitorActivityTracker - [ws-artemis-java-17-yannik-schmidt-tum-de-session] Last reported activity was: 1970-01-01T00:00:00.000Z (28766050 minutes ago)
INFO org.eclipse.theia.cloud.operator.plugins.MonitorActivityTracker - [ws-artemis-java-17-yannik-schmidt-tum-de-session] Deleting session as timeout of 20 minutes was reached!
```
Can you think of any reason for this behavior? Can I support you in any way to solve this problem?
One more thing I noticed while building the blueprint is that the Docker build appears to fail as soon as I add `@theia/core` to the package.json. Regardless of the package's version, the build fails with
```
120.2 native node modules are already rebuilt for browser
176.0
176.0 <--- Last few GCs --->
176.0
176.0 [10616:0x2c3d28b0]    54243 ms: Mark-sweep 4046.3 (4138.9) -> 4034.6 (4143.2) MB, 772.4 / 0.0 ms  (average mu = 0.537, current mu = 0.010) allocation failure; scavenge might not succeed
176.0 [10616:0x2c3d28b0]    55542 ms: Mark-sweep 4050.4 (4143.2) -> 4038.9 (4146.9) MB, 1292.7 / 0.0 ms  (average mu = 0.288, current mu = 0.005) allocation failure; scavenge might not succeed
176.0
176.0
176.0 <--- JS stacktrace --->
176.0
176.0 FATAL ERROR: Reached heap limit Allocation failed - JavaScript heap out of memory
...
176.1 Aborted
176.1
176.1 Error: webpack exited with an unexpected code: 134.
```
on both my 32GB Mac and our university's runners.
All other dependencies ("@eclipse-theiacloud/monitor-theia": "next", "@theia/editor": "^1.53.2", "@theia/editor-preview": "^1.53.2", "@theia/electron": "^1.53.2", "@theia/filesystem": "^1.53.2") do not stop the building process.
The one other error is: `Error: Patch file found for package monaco-editor-core which is not present at node_modules/@theia/monaco-editor-core`.
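For the heap crash specifically, a common remedy is to raise Node's heap limit for the webpack step; a sketch (an assumption about your build setup, not a confirmed fix for this particular failure):

```sh
# Give Node more heap for the webpack bundling step (8 GB here; adjust to
# what the build machine can spare), then rerun the blueprint build.
export NODE_OPTIONS="--max-old-space-size=8192"
yarn build   # or whatever build command the blueprint's Dockerfile invokes
```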
Hey @sgraband, small update from our end. Apart from the bugs and errors identified in our last discussion, I found that our cluster is pretty ill-configured too: cascading deletes do not seem to work at the moment across all manifests, including non-Theia ones. I'll get back to you with updates and look forward to fixes for the identified problems with the monitor 👍
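For what it's worth, cluster-wide GC behavior can be smoke-tested without Theia at all; a minimal sketch with two ConfigMaps, assuming a throwaway namespace may be created:

```sh
# Create an owner and a dependent that references it.
kubectl create namespace gc-test
kubectl create configmap owner -n gc-test
OWNER_UID=$(kubectl get configmap owner -n gc-test -o jsonpath='{.metadata.uid}')

cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: dependent
  namespace: gc-test
  ownerReferences:
    - apiVersion: v1
      kind: ConfigMap
      name: owner
      uid: ${OWNER_UID}
EOF

# Deleting the owner should make the garbage collector remove "dependent".
kubectl delete configmap owner -n gc-test
sleep 10
kubectl get configmap dependent -n gc-test   # expect "NotFound" on a healthy cluster
```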
@iyannsch The described bug with the monitor logic in the operator has been fixed with https://github.com/eclipse-theia/theia-cloud/pull/354 and will be part of the next release.
**Describe the bug**

I noticed that pods of timed-out sessions still exist on our cluster and are recreated on deletion by the corresponding replicaset and deployment. As the entry in the ingress is properly removed, they are of no use to the user anymore and block resources.

**Expected behavior**

During session startup, I expect the deployment, replicaset, and pods to be created, as well as a proper entry in the ingress. This all works perfectly already. During session teardown (due to timeout), I expect the system to remove all those artifacts again. Currently, only the ingress configuration is removed.

**Cluster provider**

RKE2 1.27.9

**Version**

Newest helm chart

**Additional information**

Using `k get pods` you can see the (correctly not timed-out) session ending in `aa0d-7757cf8mn85` and the incorrectly still-running session ending in `ebbb-77877chgb9t`. Describing the ingress, you can see that the old session has been properly removed.