SeleniumHQ / selenium

A browser automation framework and ecosystem.
https://selenium.dev
Apache License 2.0

Dynamic Selenium 4 grid on kubernetes #9845

Closed - gazal-k closed this issue 9 months ago

gazal-k commented 3 years ago

🚀 Feature Proposal

Just like the dynamic Selenium 4 grid using Docker, having a similar k8s "pod factory" (or something along those lines) would be nice.

https://github.com/zalando/zalenium does that. Perhaps some of that can be ported to Grid 4.

diemol commented 2 years ago

We are happy to discuss approaches, what do you have in mind, @gazal-k?

gazal-k commented 2 years ago

Sorry, I'm not really familiar with the selenium grid codebase. I imagine this: https://github.com/SeleniumHQ/selenium/blob/trunk/java/src/org/openqa/selenium/grid/node/docker/DockerSessionFactory.java has some of the logic to dynamically create browser nodes that join the grid. It would be nice to have something similar that creates k8s Pods, so that the Kubernetes Selenium 4 grid scales based on the test load as opposed to creating a static number of browser nodes.

Again, sorry that I don't have something more solid to contribute.

sahajamit commented 2 years ago

I have attempted to build something similar for Kubernetes with Selenium Grid 3. More details here: https://link.medium.com/QQMCXLqQMjb

pearj commented 2 years ago

I have some thoughts about how the Kubernetes support could be implemented. I remember having a look at the Grid 4 codebase in December 2018, and I wrote up my thoughts in this ticket over in Zalenium when someone asked if we planned to support Grid 4: https://github.com/zalando/zalenium/issues/1028#issuecomment-522230092 This was largely based on my ideas on how to add High-Availability support to Zalenium for Kubernetes: https://github.com/zalando/zalenium/issues/484#issue-305907701 from early 2018.

So assuming the grid architecture is still the same as it was in 2018, i.e. router, sessionMap and distributor, I think my original ideas are still valid.

The crux of it was to implement the sessionMap as annotations (metadata) on a Kubernetes pod, so that Selenium Grid didn't need to maintain the session state, which means that you could scale it and make it highly available much more easily.

So it means you could run multiple copies of the router, and you probably just want one distributor as you'd get into race conditions when creating new selenium pods. The sessionMap would end up just being a shared module/library that the router and distributor used to talk to the Kubernetes API server.
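As a rough illustration of that idea (nothing that exists in the Grid code today; the annotation keys and values below are entirely hypothetical), the session-to-node mapping could live on the node pod itself:

apiVersion: v1
kind: Pod
metadata:
  name: selenium-node-chrome-abc12
  labels:
    app: selenium-node
  annotations:
    # Hypothetical keys: the sessionMap entry is stored as pod metadata, so the
    # router/distributor can rebuild it from the Kubernetes API server instead
    # of keeping it in memory.
    selenium/session-id: "7f6f7d50-0000-0000-0000-000000000000"
    selenium/session-uri: "http://10.0.3.17:5555"
spec:
  containers:
    - name: chrome
      image: selenium/node-chrome:4
      ports:
        - containerPort: 5555

A router replica could then resolve a session by listing pods with a matching annotation, which is what removes the need for a separate stateful sessionMap service.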

LukeIGS commented 2 years ago

If we wanted a purer k8s solution: if there were metrics exposed around how many selenium sessions are in the queue, how long they've been waiting, or even the rate of queue processing, it would be possible to configure a horizontal pod autoscaler (HPA) around the node deployment itself to target a given rate of message processing, as sketched below.
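For example, a minimal sketch of such an HPA, assuming some custom/external metrics adapter already exposes a hypothetical external metric named selenium_sessions_queued (Kubernetes does not provide this out of the box):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: selenium-node-chrome
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: selenium-node-chrome
  minReplicas: 1
  maxReplicas: 40
  metrics:
    - type: External
      external:
        metric:
          # Hypothetical metric name exposed by a metrics adapter
          name: selenium_sessions_queued
        target:
          # Aim for roughly one queued session per node replica
          type: AverageValue
          averageValue: "1"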

Warxcell commented 2 years ago

There is https://keda.sh/docs/2.4/scalers/selenium-grid-scaler/ which can autoscale nodes, and it's working fine - the problem is with tearing down a node. Since it doesn't keep track of which nodes are working, it could kill a test in progress, and it seems the Chrome Node doesn't handle that gracefully.

MissakaI commented 2 years ago

I tried another approach: implementing an application that intercepts the docker-engine calls from the selenium node-docker component, translates those calls to k8s calls, and then calls the Kubernetes API. It works properly for creating and stopping browser nodes depending on the calls from node-docker. But it has a major problem, because node-docker doesn't support concurrency. It can only create a single browser node, run a test, destroy it, and then move on to the next one. (I will be creating a separate issue for the node-docker concurrency problem.)

From what I noticed, node-docker binds those browser nodes to itself and exposes them as sessions of the node-docker to the distributor. So all that the distributor sees is the node-docker and not the browser node. I think this approach is not appropriate for concurrent execution, as it is a single point of failure and could end all the sessions routed through the node-docker.

Therefore I think KEDA Selenium-Grid-AutoScaler is a much better approach.

MissakaI commented 2 years ago

The crux of it was to implement the sessionMap as annotations (metadata) on a Kubernetes pod, so that Selenium Grid didn't need to maintain the session state, which means that you could scale it and make it highly available much more easily.

There is a slight issue with this, as it will make Grid 4 dependent on Kubernetes. That results in two different implementations of the Grid: one specific to K8s and one that is not dependent on Kubernetes. I think a much better approach is to make the Grid HA by other means, like sharing the current state across all instances of a particular grid component type.

quarckster commented 2 years ago

There is a slight issue with this, as it will make Grid 4 dependent on Kubernetes.

It's already dependent on Docker. Perhaps there should be some middleware for different environments.

qalinn commented 2 years ago

@MissakaI I have tested the KEDA Selenium-Grid-AutoScaler and it scales up as many nodes as you need based on the session queue, which is ok. The problem is with the video part, because it doesn't work in Kubernetes. I have managed to deploy the video container in the same pod, but the video file is not saved until the video container is stopped gracefully, and you also cannot set the name of the video for every test; it records all the time until it is closed.

LukeIGS commented 2 years ago

There is a slight issue with this, as it will make Grid 4 dependent on Kubernetes.

It's already dependent on Docker. Perhaps there should be some middleware for different environments.

The selenium repository is currently dependent on Ruby, Python, dotnet, and quite a few other things that it probably shouldn't be; there's certainly an argument for a lot of that to be split out into separate modules, but that's probably a conversation for another issue.

tomkerkhove commented 2 years ago

We had a note in the KEDA standup meeting to see if we can help with Selenium & video. Is the person who added it part of this thread? If so, please open a discussion on how we can help: https://github.com/kedacore/keda/discussions/new

LukeIGS commented 2 years ago

Will do, issue in question is https://github.com/SeleniumHQ/selenium/issues/10018

These two are pretty intertwined.

qalinn commented 2 years ago

@tomkerkhove I am the one who added the note for your standup meeting. Please see also this issue: #10018

tomkerkhove commented 2 years ago

Tracking this in https://github.com/kedacore/keda/discussions/2494

msvticket commented 2 years ago

As was mentioned in kedacore/keda#2494 you can either use KEDA to scale a deployment or jobs. I'm thinking that scaling jobs might be more fitting, but you then need to make sure that the container exits when it's done with a session. On the other hand, you don't have the problem of Kubernetes trying to delete a pod that is still executing a test.

To make a node exit after a session is done you need to add a property to the node section of config.toml:

implementation=org.openqa.selenium.grid.node.k8s.OneShotNode
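For context, a minimal sketch of how that could be supplied to a node pod via a ConfigMap; everything here other than the implementation key is illustrative, and the value is quoted per TOML syntax:

apiVersion: v1
kind: ConfigMap
metadata:
  name: selenium-node-config
data:
  config.toml: |
    [node]
    # Only takes effect if the class is actually on the server's classpath
    # (see the ClassNotFoundException discussion further down)
    implementation = "org.openqa.selenium.grid.node.k8s.OneShotNode"

The ConfigMap would then be mounted over /opt/selenium/config.toml, the path the official images pass to the node via --config (visible in the preStop command later in this thread).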

With the official docker images this isn't enough, since supervisord would still be running. So for that case you would need to add a supervisord event listener that shuts down supervisord together with its subprocesses.

One good thing with this approach is that, combined with the video feature, you get one video per session. Regarding graceful shutdown: in the dynamic grid code any video container is stopped before the node/browser container. So I guess the video file gets corrupted if Xvfb exits before ffmpeg is done saving the file. The event listener described above should therefore shut down the supervisord in the video container before shutting down the one in its own container.

For shutting down supervisord, you can use the unix_http_server and supervisorctl features of supervisord. That works between containers in the pod as well.

I've also been thinking about how to have the video file uploaded to S3 (or similar) automatically. The tricky part is supplying the pod with the URL to upload the file to. I have some ideas, but that has to wait until the basic solution is implemented.

MissakaI commented 2 years ago

I have managed to deploy the video container in the same pod, but the video file is not saved until the video container is stopped gracefully, and you also cannot set the name of the video for every test; it records all the time until it is closed.

I think this case should be followed up in the thread dedicated to it, which @LukeIGS mentioned:

Will do, issue in question is #10018

MissakaI commented 2 years ago

Also, we need a way to implement liveness and readiness probes, because I ran into a few instances where the selenium process was killed but the pod continued to run, which means Kubernetes never terminated the crashed pod and brought up a new one in its place.
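A sketch of what such probes could look like on the node container, assuming the node's HTTP /status endpoint on port 5555 is reachable from the kubelet (all values are illustrative):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: selenium-node-firefox
spec:
  ...
  template:
    ...
    spec:
      containers:
      - name: selenium-node-firefox
        image: selenium/node-firefox:4.1.1
        readinessProbe:
          httpGet:
            path: /status
            port: 5555
          initialDelaySeconds: 10
          periodSeconds: 10
        livenessProbe:
          httpGet:
            path: /status
            port: 5555
          initialDelaySeconds: 30
          periodSeconds: 30
          failureThreshold: 3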

MissakaI commented 2 years ago

As was mentioned in kedacore/keda#2494 you can either use KEDA to scale a deployment or jobs. I'm thinking that scaling jobs might be more fitting, but you then need to make sure that the container exits when it's done with a session. On the other hand, you don't have the problem of Kubernetes trying to delete a pod that is still executing a test.

Thank you for this. I have been using deployments and thought of raising an issue with KEDA to add the annotation controller.kubernetes.io/pod-deletion-cost: -999, which makes the replication controller delete the pods with the lowest cost and leave the others.
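For illustration, the annotation would sit on the individual node pod and would have to be kept up to date by something that knows whether a session is running; the values below are arbitrary:

apiVersion: v1
kind: Pod
metadata:
  name: selenium-node-chrome-xyz12
  annotations:
    # Lower cost = deleted first when the ReplicaSet scales down.
    # An idle node could carry a strongly negative cost, while a node with an
    # active session would be patched to a high positive value.
    controller.kubernetes.io/pod-deletion-cost: "-999"
spec:
  containers:
  - name: selenium-node-chrome
    image: selenium/node-chrome:4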

To make a node exit after a session is done you need to add a property to the node section of config.toml:

implementation=org.openqa.selenium.grid.node.k8s.OneShotNode

Also, can you point me to where this is covered in the Selenium documentation, if it is documented?

msvticket commented 2 years ago

As was mentioned in kedacore/keda#2494 you can either use KEDA to scale a deployment or jobs. I'm thinking that scaling jobs might be more fitting, but you then need to make sure that the container exits when it's done with a session. On the other hand, you don't have the problem of Kubernetes trying to delete a pod that is still executing a test.

Thank you for this. I have been using deployments and thought of raising an issue with KEDA to add the annotation controller.kubernetes.io/pod-deletion-cost: -999, which makes the replication controller delete the pods with the lowest cost and leave the others.

I don't see how that would help. You could put that cost in the manifest to begin with. But in any case you end up having to remove/update that annotation when the test is done, and KEDA doesn't know when that is.

There is a recent proposal for Kubernetes to let the pod inform Kubernetes on which pods to delete through a probe: kubernetes/kubernetes#107598. Until something like that is implemented either the node itself or maybe the distributor would need to update the annotation.

To make a node exit after a session is done you need to add a property to the node section of config.toml: implementation=org.openqa.selenium.grid.node.k8s.OneShotNode

Also, can you point me to where this is covered in the Selenium documentation, if it is documented?

I haven't found anything about it in the documentation. I stumbled upon org.openqa.selenium.grid.node.k8s.OneShotNode when I was looking in the selenium code. It then took a while for me to find out how to make use of the class. That's implemented here: https://github.com/SeleniumHQ/selenium/blob/2decee49816aa611ce7bbad4e52fd1b29629b1df/java/src/org/openqa/selenium/grid/node/config/NodeOptions.java#L148

On the other hand I haven't tested it, so who knows if OneShotNode still works...

This is where it should be documented: https://www.selenium.dev/documentation/grid/configuration/toml_options/

MissakaI commented 2 years ago

I don't see how that would help. You could put that cost in the manifest to begin with. But in any case you end up having to remove/update that annotation when the test is done, and KEDA doesn't know when that is.

I was intending to either write an application that monitors the test sessions along with their respective pods, or write a custom KEDA scaler that does what I mentioned previously.

msvticket commented 2 years ago

There is an issue about shutting down the node container when the node server has exited: SeleniumHQ/docker-selenium#1435

MissakaI commented 2 years ago

On the other hand I haven't tested it, so who knows if OneShotNode still works...

It seems like, even though the code is available in the repo, it causes a ClassNotFoundException after adding it to the config.toml. Extracting selenium-server-4.1.1.jar revealed that the k8s package is missing entirely.

java.lang.reflect.InvocationTargetException
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:566)
        at org.openqa.selenium.grid.Bootstrap.runMain(Bootstrap.java:77)
        at org.openqa.selenium.grid.Bootstrap.main(Bootstrap.java:70)
Caused by: org.openqa.selenium.grid.config.ConfigException: java.lang.ClassNotFoundException: org.openqa.selenium.grid.node.k8s.OneShotNode
        at org.openqa.selenium.grid.config.MemoizedConfig.getClass(MemoizedConfig.java:115)
        at org.openqa.selenium.grid.node.config.NodeOptions.getNode(NodeOptions.java:148)
        at org.openqa.selenium.grid.node.httpd.NodeServer.createHandlers(NodeServer.java:127)
        at org.openqa.selenium.grid.node.httpd.NodeServer.asServer(NodeServer.java:183)
        at org.openqa.selenium.grid.node.httpd.NodeServer.execute(NodeServer.java:230)
        at org.openqa.selenium.grid.TemplateGridCommand.lambda$configure$4(TemplateGridCommand.java:129)
        at org.openqa.selenium.grid.Main.launch(Main.java:83)
        at org.openqa.selenium.grid.Main.go(Main.java:57)
        at org.openqa.selenium.grid.Main.main(Main.java:42)
        ... 6 more
Caused by: java.lang.ClassNotFoundException: org.openqa.selenium.grid.node.k8s.OneShotNode
        at java.base/jdk.internal.loader.BuiltinClassLoader.loadClass(BuiltinClassLoader.java:581)
        at java.base/jdk.internal.loader.ClassLoaders$AppClassLoader.loadClass(ClassLoaders.java:178)
        at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
        at java.base/java.lang.Class.forName0(Native Method)
        at java.base/java.lang.Class.forName(Class.java:398)
        at org.openqa.selenium.grid.config.ClassCreation.callCreateMethod(ClassCreation.java:35)
        at org.openqa.selenium.grid.config.MemoizedConfig.lambda$getClass$4(MemoizedConfig.java:100)
        at java.base/java.util.concurrent.ConcurrentHashMap.computeIfAbsent(ConcurrentHashMap.java:1737)
        at org.openqa.selenium.grid.config.MemoizedConfig.getClass(MemoizedConfig.java:95)
        ... 14 more

The docker image that i used was selenium/node-firefox.

msvticket commented 2 years ago

Well, the selenium project is a bit confusing. Apparently the selenium build system excludes the package org.openqa.selenium.grid.node.k8s from selenium-server.jar. Here I found bazel build configurations for building docker images: https://github.com/SeleniumHQ/selenium/tree/trunk/deploys/docker

The firefox_node and chrome_node images are there declared to include a layer (called one-shot) that contains a library with that class. But these images and the library don't seem to be published publicly anywhere.

In https://github.com/SeleniumHQ/selenium/tree/trunk/deploys/k8s you can see how that library is utilized: https://github.com/SeleniumHQ/selenium/blob/451fc381325437942bc953e3f79facee9f2a3c22/deploys/k8s/firefox-node.yaml#L19-L44

It seems like the idea is that you check out the code to build and deploy these images and k8s manifests to your local infrastructure.

diemol commented 2 years ago

Thank you all for sharing your thoughts and offering paths to move forward. I will reply to the comments below.

diemol commented 2 years ago

There is https://keda.sh/docs/2.4/scalers/selenium-grid-scaler/ which can autoscale nodes, and it's working fine - the problem is with tearing down a node. Since it doesn't keep track of which nodes are working, it could kill a test in progress, and it seems the Chrome Node doesn't handle that gracefully.

Something new in Grid 4 is the "Drain Node" feature. With it, you can start draining the Node so no new sessions are accepted, and when the last session is completed the Node shuts down. It gets tricky when the Node is inside a Docker container, because supervisor does not exit, which is the point of https://github.com/SeleniumHQ/docker-selenium/issues/1435. I have not had the time to implement it, but I hope someone can contribute it.

diemol commented 2 years ago

As was mentioned in kedacore/keda#2494 you can either use KEDA to scale a deployment or jobs. I'm thinking that scaling jobs might be more fitting, but you then need to make sure that the container exits when it's done with a session. On the other hand, you don't have the problem of Kubernetes trying to delete a pod that is still executing a test.

To make a node exit after a session is done you need to add a property to the node section of config.toml:

implementation=org.openqa.selenium.grid.node.k8s.OneShotNode

With the official docker images this isn't enough, since supervisord would still be running. So for that case you would need to add a supervisord event listener that shuts down supervisord together with its subprocesses.

One good thing with this approach is that, combined with the video feature, you get one video per session. Regarding graceful shutdown: in the dynamic grid code any video container is stopped before the node/browser container. So I guess the video file gets corrupted if Xvfb exits before ffmpeg is done saving the file. The event listener described above should therefore shut down the supervisord in the video container before shutting down the one in its own container.

For shutting down supervisord, you can use the unix_http_server and supervisorctl features of supervisord. That works between containers in the pod as well.

I've also been thinking about how to have the video file uploaded to S3 (or similar) automatically. The tricky part is supplying the pod with the URL to upload the file to. I have some ideas, but that has to wait until the basic solution is implemented.

I believe this comment has most of the information needed for this issue.

To shut down the Node, we can either go with draining or try to use the OneShotNode. The OneShotNode was an experiment and has not been tested thoroughly. Either way, if we end up using OneShotNode, we can see how to include it in the server jar.

Probably the things that need to be tackled are:

Bjego commented 2 years ago

Is there any activity on this issue? What's the recommendation? Is the keda bug still there?

Bjego commented 2 years ago

@diemol how about your suggestion here: https://github.com/SeleniumHQ/selenium/issues/7243 to document how to scale the nodes with a Kubernetes CLI? I guess people could spin up their own sidecar/cronjob to check the endpoints and scale nodes. JS/TS has a pretty good Kubernetes client lib, which should be good enough to scale pods.

diemol commented 2 years ago

The containers now exit when the Node shuts down. We still need to add a flag to the Node so it exits after X sessions.

Bjego commented 2 years ago

@diemol well, I did some investigation into the current k8s setup and the APIs (Queue, Drain, Status), and I figured out that it's pretty easy to scale a dynamic grid with the current chrome nodes (we are testing on Chrome only).

The key facts are:

Write a scaler tool:

I'll share some example code and write a blog article on Medium later today.

I think, if the selenium project doesn't want to deal with the k8s APIs, it would be much easier for your users if you could call a webhook on a service when the hub queue fills up and tell a "manager" service which kind of node with which capability has been requested. Your users could write and deploy a "manager" service which handles the k8s pod deployments. For draining nodes, you could either send a second webhook or drain the nodes automatically.

Then managing would be much easier as a "manager" service wouldn't need to poll the queue and the status all the time. I'll run my manager in a cron job now.
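For anyone wanting to copy that setup, a sketch of the cron job wiring, assuming a hypothetical manager image that polls the Queue/Status endpoints and patches the node Deployment via the Kubernetes API (the image name, schedule and env values are placeholders):

apiVersion: batch/v1
kind: CronJob
metadata:
  name: selenium-grid-manager
spec:
  schedule: "*/1 * * * *"          # poll the grid once a minute
  concurrencyPolicy: Forbid        # never run two managers at the same time
  jobTemplate:
    spec:
      template:
        spec:
          # Needs RBAC permissions to get/patch the node Deployment
          serviceAccountName: selenium-grid-manager
          containers:
          - name: manager
            # Hypothetical image: a small JS/TS app using a Kubernetes client lib
            image: example.org/selenium-grid-manager:latest
            env:
            - name: GRID_URL
              value: http://selenium-hub:4444
          restartPolicy: Never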

Hope this helps other people who want to scale their selenium hub.

diemol commented 2 years ago

@Bjego yes, that is conceptually something I wanted to have, so we are aligned in that sense. Leveraging the tools that Kubernetes offers sounds like a more maintainable way of doing this.

Did you have any thoughts regarding video recording?

Bjego commented 2 years ago

@diemol I didn't try video recording - I just tried the session viewer. But is there an API where you can download recorded sessions from a node? I mean, we do have all the information in the status API (nodeid). Maybe a download API would be good.

So the manager could - before draining the node - call the download-session-recordings API on the hub and persist the videos somewhere. And finally your test suite could download the videos from the manager instead of the hub/node.

But honestly, I'm not yet into the topic of video downloading!

Bjego commented 2 years ago

@diemol or maybe an easier solution: have persistent storage on the hub. The hub automatically downloads the videos from each session and stores them on the persistent storage. Have an API which can be called from my test suite to download the videos from the centralised storage of the hub (or any other service of the grid). This could work in a k8s environment as well as in a "classic" VM-based environment.

LukeIGS commented 2 years ago

@diemol I was doing some experimenting with video recording using selenosis https://github.com/castone22/selenosis, and I ended up figuring out that the cause of the video corruption in ffmpeg is the container it's recording against being terminated before it. Adding a preStop hook to the selenium container to sleep for a few seconds completely fixes it. I'm guessing it's got something to do with attempting to record from a non-existent source for a quarter of a second before it gets its TERM signal.
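A sketch of that fix, assuming the recorder runs as a sidecar in the same pod and a few seconds is enough for ffmpeg to finish writing (image tags and the sleep duration are illustrative):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: selenium-node-chrome-video
spec:
  ...
  template:
    ...
    spec:
      terminationGracePeriodSeconds: 60
      containers:
      - name: selenium-node-chrome
        image: selenium/node-chrome:4
        lifecycle:
          preStop:
            exec:
              # Keep the browser/Xvfb alive briefly so the recorder is not left
              # capturing a source that has already disappeared
              command: ["/bin/sh", "-c", "sleep 10"]
      - name: video
        # Assumption: docker-selenium's ffmpeg recorder image as a sidecar
        image: selenium/video:latest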

msvticket commented 2 years ago

I've tried to think of a way to implement storage of video recording as simply as possible in the selenium codebase.

My idea is to add a capability, say videoUploadURI, that the node stores somewhere. Probably just in a predefined file. When the video recording is done the file could then be uploaded to this URI. The upload would preferably be done by a separate container to easily leverage existing docker images. I could for example use public.ecr.aws/bitnami/aws-cli:2 to upload the file to S3.
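A possible shape for that, sketched under the assumption that the recorder and the uploader share an emptyDir and that the node (or recorder) drops the proposed videoUploadURI value into a file once the recording is finished; the file names and the signalling mechanism are made up for illustration, and AWS credentials (env vars or IRSA) are assumed to be configured:

apiVersion: v1
kind: Pod
metadata:
  name: selenium-node-with-upload
spec:
  volumes:
  - name: videos
    emptyDir: {}
  containers:
  # (the selenium node container is omitted for brevity)
  - name: video
    # Recorder writes the finished video into the shared volume
    image: selenium/video:latest
    volumeMounts:
    - name: videos
      mountPath: /videos
  - name: uploader
    image: public.ecr.aws/bitnami/aws-cli:2
    command: ["/bin/sh", "-c"]
    args:
    - |
      # Hypothetical contract: /videos/upload-uri appears (containing the
      # videoUploadURI capability value) only after recording has finished.
      while [ ! -f /videos/upload-uri ]; do sleep 5; done
      aws s3 cp /videos/video.mp4 "$(cat /videos/upload-uri)"
    volumeMounts:
    - name: videos
      mountPath: /videos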

I could flesh out the details for a possible implementation if you like.

Bjego commented 2 years ago

As promised - here is my article: https://bw861987.medium.com/dynamic-scaling-selenium-grid-in-kubernetes-f642ae2dc561 and the code: https://github.com/Bjego/Selenium-dynamic-kubernetes-grid

Wolfe1 commented 2 years ago

@Bjego @diemol Thought I would chime in as I have been working on getting a good autoscaling grid setup in Selenium Grid 4, and I think the community may find it helpful, as we can use the out-of-the-box HPA in k8s with a custom metrics server (KEDA).

KEDA runs the autoscaling of the pods based on the session queue, thanks to a scaler they have written for Selenium Grid: https://keda.sh/docs/2.5/scalers/selenium-grid-scaler/

This is configured with a "scaled-object":

Scaled-Object.yaml:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: selenium-grid-chrome-scaledobject
  namespace: legion-selenium-grid
  labels:
    deploymentName: selenium-chrome-node-deployment
spec:
  minReplicaCount: 0
  maxReplicaCount: 80
  scaleTargetRef:
    name: selenium-chrome-node-deployment
  triggers:
    - type: selenium-grid
      metadata:
        url: 'URL_FOR_GRID/graphql'
        browserName: 'chrome'

And yes, this setup does allow us to scale down to 0 pods. 😁

This of course still had the issue of scaling down the pods at random, resulting in some tests being killed. To combat this I used a preStop command (coupled with a longer terminationGracePeriodSeconds) to watch the chromedriver, drain once it's finished, watch the node session, and then terminate.

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: selenium-node-chrome
  name: selenium-chrome-node-deployment
  namespace: legion-selenium-grid
spec:
  ...
  template:
    ...
    spec:
      terminationGracePeriodSeconds: 3600
      containers:
      - lifecycle:
          preStop:
            exec:
              command: ["/bin/sh", "-c", "tail --pid=$(pgrep -f '[o]pt/selenium/chromedriver') -f /dev/null; curl --request POST 'localhost:5555/se/grid/node/drain' --header 'X-REGISTRATION-SECRET;'; tail --pid=$(pgrep -f '[n]ode --bind-host false --config /opt/selenium/config.toml') -f /dev/null"]

Full steps:

  1. Install keda: kubectl apply -f https://github.com/kedacore/keda/releases/download/v2.6.1/keda-2.6.1.yaml
  2. Create a file for the scaled-object.yaml, save the code.
  3. kubectl apply -f ./scaled-object.yaml --namespace=NAMESPACE_OF_BROWSER
  4. Add a longer terminationGracePeriodSeconds to your browser deployment; this is to allow your test to finish before the pod is forced to terminate.
  5. Add the preStop code to your browser deployment. If you don't use Chrome, the name/path of the driver will change of course.
  6. That should be it; your grid will start scaling up and down based on your session queue. When pods are asked to terminate they will have terminationGracePeriodSeconds to finish up the test.
  7. You can view the running HPA with kubectl describe hpa keda-hpa-selenium-grid-chrome-scaledobject -n=NAMESPACE_OF_BROWSER

More details explained here: https://github.com/sahajamit/selenium-grid-autoscaler/issues/2 Keda docs: https://keda.sh/docs/2.6/ Keda docs (for selenium grid): https://keda.sh/docs/2.5/scalers/selenium-grid-scaler/

Bjego commented 2 years ago

@Wolfe1 KEDA is interesting as well, but I'd expect such a "plugin" in KEDA to handle the shutdown of the nodes correctly. I guess this isn't part of the design of KEDA though, as it seems to be built on top of the horizontal pod autoscaler. I guess this works okay, but the selenium APIs are quite easy to use. I would think that such a scaler plugin could handle the APIs properly, so you wouldn't need the shutdown workaround with the grace period of 1 hour, or to run the drain on the node rather than on the hub.

diemol commented 2 years ago

I've tried to think of a way to implement storage of video recording as simply as possible in the selenium codebase.

My idea is to add a capability, say videoUploadURI, that the node stores somewhere. Probably just in a predefined file. When the video recording is done the file could then be uploaded to this URI. The upload would preferably be done by a separate container to easily leverage existing docker images. I could for example use public.ecr.aws/bitnami/aws-cli:2 to upload the file to S3.

I could flesh out the details for a possible implementation if you like.

A probably easier approach is to use persistent storage, as mentioned by @Bjego. However, this would be a Kubernetes-only solution. Most likely the right way to do it is to actually upload the video and metadata to a given URL, as suggested by @msvticket, because that approach would work for both Kubernetes and plain Docker.

diemol commented 2 years ago

@Wolfe1 KEDA is interesting as well, but I'd expect such a "plugin" in KEDA to handle the shutdown of the nodes correctly. I guess this isn't part of the design of KEDA though, as it seems to be built on top of the horizontal pod autoscaler. I guess this works okay, but the selenium APIs are quite easy to use. I would think that such a scaler plugin could handle the APIs properly, so you wouldn't need the shutdown workaround with the grace period of 1 hour, or to run the drain on the node rather than on the hub.

@Wolfe1, regarding this, have you checked the draining endpoint for Nodes? That would exit the Node and therefore exit the container in a clean way (if a session is running it will wait until it stops and then exits). The draining endpoint could avoid having tests being killed.

Wolfe1 commented 2 years ago

@Wolfe1 KEDA is interesting as well, but I'd expect such a "plugin" in KEDA to handle the shutdown of the nodes correctly. I guess this isn't part of the design of KEDA though, as it seems to be built on top of the horizontal pod autoscaler. I guess this works okay, but the selenium APIs are quite easy to use. I would think that such a scaler plugin could handle the APIs properly, so you wouldn't need the shutdown workaround with the grace period of 1 hour, or to run the drain on the node rather than on the hub.

@Wolfe1, regarding this, have you checked the draining endpoint for Nodes? That would exit the Node and therefore exit the container in a clean way (if a session is running it will wait until it stops and then exits). The draining endpoint could avoid having tests being killed.

@Bjego That is correct; KEDA's function (from my understanding) is to extend the base HPA in k8s. The scaler is still relatively new (July 2021) so I am not sure how much testing it has had from the selenium community. I am sure there may be ways to enhance such functionality, but I am not sure how we get past the random termination of pods on scale down.

@diemol Yes, I have looked into that endpoint (I use it in the preStop command) and it does indeed do its job. The issue though is that when the HPA scales down it will do so at random, without consideration of whether a pod is running a test or not. I could drain specific nodes before termination, but I would still need to reduce the replica count, which wouldn't necessarily pick the correct pod to remove.

The preStop code at least allows the pod to terminate more gracefully if a test is running. If a test is running we wait and then terminate; if no test is running it terminates immediately, as the drain takes just a second, so no real drawback from what I can see (aside from a hung pod taking a while to terminate).

Bjego commented 2 years ago

@Wolfe1 that's how I understood it as well. When you have a look at the external scalers - those are only for upscaling in KEDA. So there is not yet a built-in way to scale down objects in k8s based on an API, as I would want to do with selenium nodes: https://keda.sh/docs/1.4/concepts/external-scalers/ But I guess if your shutdown script works for you, that's a solution as well.

Bjego commented 2 years ago

@Wolfe1 just wanted to give you a heads up - the chrome nodes "selenium/node-chrome:4" don't have curl installed. So the shutdown command should be: wget -O /dev/null --method=POST --header='X-REGISTRATION-SECRET:' http://localhost:5555/se/grid/node/drain

Wolfe1 commented 2 years ago

@Bjego Hmm odd, seems to be included in selenium/node-chrome:latest


MissakaI commented 2 years ago

kind: ScaledObject

I find it more appropriate to use ScaledJob instead of ScaledObject. Because ScaledObjects scale down in the order the pods were scaled up, my instances terminated pods with active sessions whenever a session ended in one of the other pods and KEDA decided to scale down.

So instead I started using ScaledJob, because the pod is terminated after the session ends.

apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: selenium-firefox-node
spec:
  pollingInterval: 30                         # Optional. Default: 30 seconds
  successfulJobsHistoryLimit: 1               # Optional. Default: 100. How many completed jobs should be kept.
  failedJobsHistoryLimit: 1                   # Optional. Default: 100. How many failed jobs should be kept.
  envSourceContainerName: selenium-firefox-node    # Optional. Default: .spec.JobTargetRef.template.spec.containers[0]
  maxReplicaCount: 2                          # Optional. Default: 100
  triggers:
    - type: selenium-grid
      metadata:
        url: 'http://selenium-hub.NAMESPACE.svc.cluster.local:4444/graphql'
        browserName: 'firefox'   
  jobTargetRef:
    parallelism: 2                            # [max number of desired pods](https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/#controlling-parallelism)
    completions: 1                            # [desired number of successfully finished pods](https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/#controlling-parallelism)
    activeDeadlineSeconds: 600                #  Specifies the duration in seconds relative to the startTime that the job may be active before the system tries to terminate it; value must be positive integer
    backoffLimit: 6                           # Specifies the number of retries before marking this job failed. Defaults to 6
    template:
      spec:
        volumes:
        - name: dshm
          emptyDir: { "medium": "Memory" }
        containers:
        - name: selenium-firefox-node
          image: selenium/node-firefox:4.1.1
          resources:
            limits:
              memory: "500Mi"
              cpu: "500m"
          volumeMounts:
            - name: dshm
              mountPath: /dev/shm
          env:
            - name: SE_EVENT_BUS_HOST
              value: selenium-hub
            - name: SE_EVENT_BUS_PUBLISH_PORT
              value: '4442'
            - name: SE_EVENT_BUS_SUBSCRIBE_PORT
              value: '4443'
          ports:
            - containerPort: 5553
              protocol: TCP
        restartPolicy: Never

Hope this helps. There are still some hiccups, which I will include here as I find them. As for the videos, I haven't started on fixing that yet, but I am aware that videos don't get saved properly.

Bjego commented 2 years ago

@MissakaI thanks for the snippet - but how does KEDA notice that the session ended on a node? Is that handled by KEDA? Or how did you configure your firefox node? @diemol said that single-session nodes are not yet implemented.

Maybe there is the opportunity to override the start command of a chrome node to just accept a single session and then shut down?

Warxcell commented 2 years ago

Maybe there is the opportunity to override the start command of a chrome node to just accept a single session and then shut down?

That will lead to a crash loop backoff, I believe.

JontyMC commented 2 years ago

How are people scaling up their Kubernetes clusters for this? We use Azure AKS, and it seems like the ideal scenario would be to use virtual nodes (container instances) and have a single test per container, so ideally a test run would take as long as the slowest test + the time to spin up a container. Is the plan to support this setup? Does anyone do anything like this now? If not, are you using other metrics like CPU to add new Kubernetes nodes?

Bjego commented 2 years ago

How are people scaling up their Kubernetes clusters for this? We use Azure AKS, and it seems like the ideal scenario would be to use virtual nodes (container instances) and have a single test per container, so ideally a test run would take as long as the slowest test + the time to spin up a container. Is the plan to support this setup? Does anyone do anything like this now? If not, are you using other metrics like CPU to add new Kubernetes nodes?

I think you could have a look into the code I've shared and extend the manager to call the Azure APIs to scale the AKS up and down as well. I guess this could be done by calling the Azure CLI from Node, or by using the JavaScript/TypeScript Azure SDKs.