SeleniumHQ / selenium

A browser automation framework and ecosystem.
https://selenium.dev
Apache License 2.0

Dynamic Selenium 4 grid on kubernetes #9845

Closed gazal-k closed 10 months ago

gazal-k commented 3 years ago

🚀 Feature Proposal

Just like the dynamic Selenium 4 Grid using Docker, having a similar k8s "pod factory" (or something along those lines) would be nice.

https://github.com/zalando/zalenium does that. Perhaps some of it can be ported to Grid 4.

dalgibbard commented 2 years ago

How are people scaling up their Kubernetes clusters for this?

Is the Cluster Autoscaler not what you're looking for? https://docs.microsoft.com/en-us/azure/aks/cluster-autoscaler

JontyMC commented 2 years ago

Yeah, this is probably the best option for now, because Dynamic Grid can't work with Azure Container Instances.

withinboredom commented 2 years ago

@Bjego

how does keda notice that the session ended on a node?

It looks like it just drains the node a few seconds after the queue shrinks. One side effect of that is that you can't use VNC from the Grid UI because the node is draining. Seems like a bug.

baflQA commented 2 years ago

Hi. Wouldn't https://github.com/wrike/callisto be a nice replacement for now? In general, what's your opinion about their approach?

LukeIGS commented 2 years ago

@baflQA The thing that comes to mind is that it's a Selenoid implementation, which has its own set of problems running on k8s; this one at least doesn't try mounting the Docker socket, though.

The biggest issue is that, at a glance, it doesn't seem to have any way to stand up a video recorder.

diemol commented 2 years ago

The linked commit adds a feature where the Node will be drained after X sessions have been started. It defaults to zero, and any value higher than that enables the feature. For this issue's use case, setting the value to 1 will make the Node run one session and, when it ends, shut down, which will cause the container to shut down as well.

This will be part of 4.1.4 or 4.2.0, whichever comes first. We will also add a friendly flag to the Docker images so it can be easily set as an environment variable.
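
As a rough sketch (not an official example), once that flag is exposed, a node container in a Kubernetes manifest could enable the behaviour with an environment variable along these lines, using the DRAIN_AFTER_SESSION_COUNT name that appears later in this thread:

env:
  - name: SE_NODE_MAX_SESSIONS
    value: "1"
  # Drain the Node after a single session so the container exits with it
  - name: DRAIN_AFTER_SESSION_COUNT
    value: "1"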

diemol commented 2 years ago

The "drain after X sessions" feature has been released: https://github.com/SeleniumHQ/docker-selenium/releases/tag/4.1.4-20220427 We still need to add it to the Grid docs; that will happen soon.

diemol commented 2 years ago

Regarding the previous comment, this is already part of the Docker Selenium docs and the Grid docs.

kmcrawford commented 2 years ago

I tried a ScaledObject with DRAIN_AFTER_SESSION_COUNT=1, but because it's a Deployment, the pod restarts after it completes. So I assumed the best approach was a ScaledJob with DRAIN_AFTER_SESSION_COUNT=1, yet the ScaledJob wouldn't scale out correctly (many jobs in the queue and not enough pods).

I also noticed I wasn't able to VNC with this setup.

Below is my ScaledJob:

apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: selenium-chrome-node
  namespace: selenium
spec:
  pollingInterval: 15                         # Optional. Default: 30 seconds
  successfulJobsHistoryLimit: 1               # Optional. Default: 100. How many completed jobs should be kept.
  failedJobsHistoryLimit: 1                   # Optional. Default: 100. How many failed jobs should be kept.
  envSourceContainerName: selenium-chrome-node    # Optional. Default: .spec.JobTargetRef.template.spec.containers[0]
  maxReplicaCount: 80                          # Optional. Default: 100
  triggers:
    - type: selenium-grid
      metadata:
        url: 'http://selenium-router.selenium:4444/graphql'
        browserName: 'chrome'
  jobTargetRef:
    parallelism: 80                            # [max number of desired pods](https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/#controlling-parallelism)
    completions: 1                            # [desired number of successfully finished pods](https://kubernetes.io/docs/concepts/workloads/controllers/jobs-run-to-completion/#controlling-parallelism)
    activeDeadlineSeconds: 600                #  Specifies the duration in seconds relative to the startTime that the job may be active before the system tries to terminate it; value must be positive integer
    backoffLimit: 6                           # Specifies the number of retries before marking this job failed. Defaults to 6
    template:
      metadata:
        labels:
          app: selenium-chrome-node
          name: selenium-chrome-node
          component: "selenium-grid-4"
      spec:
        volumes:
        - name: dshm
          emptyDir: { "medium": "Memory" }
        containers:
        - name: selenium-chrome-node
          image: selenium/node-chrome:101.0
          resources:
            requests:
              memory: "500Mi"
              cpu: "500m"
            limits:
              memory: "2Gi"
              cpu: "2"
          volumeMounts:
            - name: dshm
              mountPath: /dev/shm
          env:
            - name: SE_EVENT_BUS_HOST
              value: "selenium-event-bus"
            - name: SE_EVENT_BUS_PUBLISH_PORT
              value: "4442"
            - name: SE_EVENT_BUS_SUBSCRIBE_PORT
              value: "4443"
            - name: SE_NODE_MAX_SESSIONS
              value: "1"
            - name: VNC_NO_PASSWORD
              value: "1"
            - name: VNC_VIEW_ONLY
              value: "1"
            - name: SE_NODE_OVERRIDE_MAX_SESSIONS
              value: "true"
            - name: SE_SESSION_RETRY_INTERVAL
              value: "2"
            - name: SE_SESSION_REQUEST_TIMEOUT
              value: "500"
            - name: DRAIN_AFTER_SESSION_COUNT
              value: "1"
          livenessProbe:
            exec:
              command:
                - pgrep
                - Xvfb
            initialDelaySeconds: 5
            periodSeconds: 5
          readinessProbe:
            exec:
              command:
                - pgrep
                - Xvfb
            initialDelaySeconds: 5
            periodSeconds: 5
          ports:
            - containerPort: 5553
              protocol: TCP
        restartPolicy: Never

Fuut2000 commented 2 years ago

With a KEDA ScaledJob it works for me with parallelism set to 1.
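
For reference, that corresponds to changing the jobTargetRef of the ScaledJob above roughly as follows (a sketch based on kmcrawford's manifest, with the other fields left unchanged):

  jobTargetRef:
    parallelism: 1                            # one pod per job, so each job maps to a single session
    completions: 1
    activeDeadlineSeconds: 600
    backoffLimit: 6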

StellarNear commented 2 years ago

If I'm not mistaken, there is currently a bug with the new DRAIN_AFTER_SESSION_COUNT feature. If the session times out (with the client sleeping for longer than the timeout setting, crashing, or similar), Selenium closes the session as normal but the drain feature is not activated.

A small log example to show this case:

09:36:38.703 INFO [SessionSlot.stop] - Stopping session d93160d7cd8372910f87d3d97ca37f22

Then nothing more happens; the node stays there but is not drained (even though its status is draining, it won't accept new sessions but stays alive).

By contrast, this is what happens when browser.quit() is called normally from the test side:

11:48:18.200 INFO [SessionSlot.stop] - Stopping session 594df9451dd3811cc95b6fe5f841a96e
11:48:18.200 INFO [LocalNode.stop] - Node draining complete!
11:48:19.205 INFO [NodeServer.lambda$createHandlers$3] - Shutting down

deepakguna commented 2 years ago

If we wanted a more purely k8s solution: if there were metrics exposed around how many Selenium sessions are in the queue, how long they've been waiting, or even the rate of queue processing, it would be possible to configure a Horizontal Pod Autoscaler (HPA) around the node Deployment itself to target a given rate of message processing.

Has anyone found success with this approach?

unitrade commented 2 years ago

Has anyone found success with this approach?

I need this too.

diemol commented 2 years ago

@StellarNear good catch, thank you. This has been fixed in the linked commit above. It will be available in 4.2.

LukeIGS commented 2 years ago

@deepakguna I tried this very early on and found that the HPA didn't really have a great way to determine when a node was "ready" to be scaled down. If you can set up KEDA, it will be less of an uphill battle.
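
For anyone coming from the HPA idea, here is a minimal KEDA ScaledObject sketch using the same selenium-grid trigger shown earlier in this thread (the Deployment name is a placeholder, and the Deployment-restart caveat kmcrawford described still applies):

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: selenium-chrome-node
  namespace: selenium
spec:
  scaleTargetRef:
    name: selenium-chrome-node        # placeholder: a Deployment of chrome nodes
  minReplicaCount: 0
  maxReplicaCount: 80
  triggers:
    - type: selenium-grid
      metadata:
        url: 'http://selenium-router.selenium:4444/graphql'
        browserName: 'chrome'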

quarckster commented 2 years ago

I believe one of the possible solutions could be a Kubernetes Operator for Selenium. It could contain the required logic for deploying Selenium Grid as well as the scaling logic.

https://kubernetes.io/docs/concepts/extend-kubernetes/operator/

Wolfe1 commented 2 years ago

Finally got around to documenting how I got this working with KEDA in case it helps anyone: https://www.linkedin.com/pulse/scaling-kubernetes-selenium-grid-keda-brandon-wolfe/

chuongnguyen5397 commented 2 years ago

I tried to deploy Dynamic Grid to my EKS cluster, but I got stuck on how to get the container to know its IP for each pod.

15:29:41.771 INFO [UnboundZmqEventBus.<init>] - Connecting to tcp://selenium-hub.selenium-grid.svc.cluster.local:4442 and tcp://selenium-hub.selenium-grid.svc.cluster.local:4443
15:29:41.875 INFO [UnboundZmqEventBus.<init>] - Sockets created
15:29:42.878 INFO [UnboundZmqEventBus.<init>] - Event bus ready
15:29:43.034 INFO [NodeServer.createHandlers] - Reporting self as: http://selenium-dynamic-node.selenium-grid.svc.cluster.local:6900
15:29:43.062 INFO [NodeOptions.getSessionFactories] - Detected 3 available processors
15:29:43.770 INFO [V141Docker.isContainerPresent] - Checking if container is present: selenium-dynamic-node-667476fcfd-hs4hp
....
15:29:44.150 INFO [NodeServer$1.start] - Starting registration process for Node http://selenium-dynamic-node.selenium-grid.svc.cluster.local:6900
15:29:44.152 INFO [NodeServer.execute] - Started Selenium node 4.2.1 (revision ac4d0fdd4a): http://selenium-dynamic-node.selenium-grid.svc.cluster.local:6900
15:29:44.178 INFO [NodeServer$1.lambda$start$1] - Sending registration event...
15:29:54.189 INFO [NodeServer$1.lambda$start$1] - Sending registration event...
15:30:04.199 INFO [NodeServer$1.lambda$start$1] - Sending registration event...
15:30:14.208 INFO [NodeServer$1.lambda$start$1] - Sending registration event...
15:30:24.215 INFO [NodeServer$1.lambda$start$1] - Sending registration event...

Here is the config.toml:

    [docker]
    # Configs have a mapping between the Docker image to use and the capabilities that need to be matched to
    # start a container with the given image.
    configs = [
        "selenium/standalone-firefox:4.2.1-20220531", '{"browserName": "firefox", "platformName": "linux"}',
        "selenium/standalone-chrome:4.2.1-20220531", '{"browserName": "chrome", "platformName": "linux"}',
        "selenium/standalone-edge:4.2.1-20220531", '{"browserName": "MicrosoftEdge", "platformName": "linux"}'
        ]

    # URL for connecting to the docker daemon
    # host.docker.internal works for macOS and Windows.
    # Linux could use --net=host in the `docker run` instruction or 172.17.0.1 in the URI below.
    # To have Docker listening through tcp on macOS, install socat and run the following command
    # socat -4 TCP-LISTEN:2375,fork UNIX-CONNECT:/var/run/docker.sock
    url = "http://127.0.0.1:2375"
    # Docker image used for video recording
    video-image = "selenium/video:ffmpeg-4.3.1-20220531"

    # Uncomment the following section if you are running the node on a separate VM
    # Fill out the placeholders with appropriate values
    [server]
    host = "selenium-dynamic-node.selenium-grid.svc.cluster.local"
    port = 6900

As far as I can see, the blocker is that I cannot get the IP address of the pods, so I cannot register with the hub, as the hub logs show:

15:29:42.215 INFO [BoundZmqEventBus.<init>] - XPUB binding to [binding to tcp://*:4442, advertising as tcp://10.5.5.142:4442], XSUB binding to [binding to tcp://*:4443, advertising as tcp://10.5.5.142:4443]
15:29:42.334 INFO [UnboundZmqEventBus.<init>] - Connecting to tcp://10.5.5.142:4442 and tcp://10.5.5.142:4443
15:29:42.410 INFO [UnboundZmqEventBus.<init>] - Sockets created
15:29:43.412 INFO [UnboundZmqEventBus.<init>] - Event bus ready
15:29:45.186 INFO [Hub.execute] - Started Selenium Hub 4.2.1 (revision ac4d0fdd4a): http://10.5.5.142:4444
15:30:03.807 INFO [Node.<init>] - Binding additional locator mechanisms: id, name, relative
15:30:04.599 INFO [Node.<init>] - Binding additional locator mechanisms: relative, name, id
15:30:13.763 INFO [Node.<init>] - Binding additional locator mechanisms: id, relative, name
15:30:14.224 INFO [Node.<init>] - Binding additional locator mechanisms: name, relative, id
15:30:23.768 INFO [Node.<init>] - Binding additional locator mechanisms: id, name, relative
15:30:24.228 INFO [Node.<init>] - Binding additional locator mechanisms: name, id, relative

Has anyone worked around this before?

diemol commented 2 years ago

@chuongnguyen5397 please check all the issue comments. Dynamic Grid does not work in Kubernetes.

chuongnguyen5397 commented 2 years ago

@diemol Really? I think Dynamic Grid would be awesome if I could use it in EKS. Will it be possible in the future, or should I go with KEDA? Because, as I understand it, if setting host = "selenium-dynamic-node.selenium-grid.svc.cluster.local" in config.toml used forwarding instead of nslookup, I could register all dynamic-node pods with the hub.

diemol commented 2 years ago

@StellarNear I do not know; please do not hijack the issue with questions, as this derails the thread. Other channels for questions are available: https://www.selenium.dev/support/

StellarNear commented 2 years ago

Sorry, I thought it was relevant to a Kubernetes configuration :)

prashanth-volvocars commented 2 years ago

Hi all,

Is anyone still facing issues related to the KEDA implementation? I was the one who originally added the scaler to KEDA. I haven't been following it much due to my other assignments. We have now started setting up the grid in EKS on Fargate, and it seems to work fine for us. I have yet to work on retrieving the browser console logs and network logs. Any help on that would be greatly appreciated.

prashanth-volvocars commented 2 years ago

With regards to video recording, the problem we face is that it records a single video for the whole lifetime of the pod, so if multiple sessions are handled by the same pod, there is just one video for all of those sessions. Also, even if the pod handles just a single session, the video keeps recording until the pod is killed, which is 300 seconds by default with the HPA. So even for a test that runs for just a few seconds we get a video that's 5 minutes or longer. Is there a way to control this behaviour?

msvticket commented 2 years ago

My idea for solving that (which I haven't tested yet) is to use the scaling-jobs feature in KEDA to run Selenium nodes. These nodes should then be configured with DRAIN_AFTER_SESSION_COUNT=1, so after the session has finished the Selenium container will exit. The remaining problem is to make the video container exit as well. This could be solved by harnessing features of supervisord:

If the supervisord of the video container has unix_http_server enabled, then the supervisord of the Selenium container could use supervisorctl to stop the video container, in a similar way as here: https://github.com/SeleniumHQ/docker-selenium/commit/281e5c40dda87b29d9754fbcde5508fd11541b2e

A somewhat tricky part would be making that supervisorctl call only when there actually is a video container to stop.

NickWemekamp commented 2 years ago

An alternative would be to have the Selenium node container kill the pod it belongs to via the kube API server in the pre-stop hook (stackoverflow delete pod). I have not tested this. The pre-stop hook of the video container can then upload the video to remote storage. The problem then is that the video container does not know the session identifier of the last test run by the Selenium node container, which would make a practical filename in the remote storage.
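
A rough, untested sketch of what those hooks might look like in the pod spec, following the comment above (the upload script and video path are hypothetical, and deleting the pod requires a ServiceAccount with permission to delete pods):

containers:
  - name: selenium-node
    image: selenium/node-chrome:101.0
    env:
      - name: POD_NAME
        valueFrom:
          fieldRef:
            fieldPath: metadata.name
      - name: POD_NAMESPACE
        valueFrom:
          fieldRef:
            fieldPath: metadata.namespace
    lifecycle:
      preStop:
        exec:
          command:
            - /bin/sh
            - -c
            # Delete the owning pod via the API server so the video container also stops
            - >
              curl -s -X DELETE
              --cacert /var/run/secrets/kubernetes.io/serviceaccount/ca.crt
              -H "Authorization: Bearer $(cat /var/run/secrets/kubernetes.io/serviceaccount/token)"
              https://kubernetes.default.svc/api/v1/namespaces/${POD_NAMESPACE}/pods/${POD_NAME}
  - name: video
    image: selenium/video:ffmpeg-4.3.1-20220531
    lifecycle:
      preStop:
        exec:
          command:
            - /bin/sh
            - -c
            # Hypothetical upload script; as noted above, the last session id is not known here
            - /opt/bin/upload-video.sh /videos/video.mp4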

prashanth-volvocars commented 2 years ago

I solved it by adding ffmpeg directly into the browser node Docker image and recording a video for every session. It works great. I will share the whole setup soon.

qalinn commented 2 years ago

@prashanth-volvocars Hi! Great to hear this. Please don't forget to share with us the setup. Thank you!

josesimon commented 2 years ago

@prashanth-volvocars we would love to receive your feedback :)

prashanth-volvocars commented 2 years ago

My setup is more oriented towards AWS, but it has worked great for us so far. I need some help with sharing it. I have made some changes to NodeBase and added a new script to upload the videos and logs directly to S3. Would it be OK to have this as part of this repo, or should I just share it in a separate repo, since it's more oriented towards AWS?

gazal-k commented 2 years ago

My setup is more oriented towards AWS, but it has worked great for us so far. I need some help with sharing it. I have made some changes to NodeBase and added a new script to upload the videos and logs directly to S3. Would it be OK to have this as part of this repo, or should I just share it in a separate repo, since it's more oriented towards AWS?

I think using S3 as opposed to block storage was an excellent choice. Perhaps parts of that logic can be made generic using something like https://github.com/google/go-cloud in the future. But for a lot of us who want to set up a Selenium 4 grid on AWS, I think your contribution would be excellent. Perhaps it can be turned on based on some env params?

prashanth-volvocars commented 2 years ago

Hey all,

Apologies for the delay in sharing it. I was unsure of how to do it, but I am taking a first step now: https://github.com/prashanth-volvocars/docker-selenium/tree/auto-scaling/charts/selenium-grid

Remember that you need to install KEDA before installing the chart. The chart is configured to work with the default namespace; if you are installing it in another namespace, make sure to update the hpa.url. Please direct any questions to me here or on Slack.
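
For example, if the chart were installed in a namespace other than default, the override would be a small values snippet along these lines (the service name and namespace are placeholders following the router URL pattern used earlier in the thread):

hpa:
  url: http://selenium-router.my-namespace:4444/graphql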

prashanth-volvocars commented 2 years ago

You can grab all information about the setup here.

In a nutshell, it can:

  1. Auto-scale browser nodes up and down.
  2. Record videos and store them, named by session id.
  3. Capture logs and store them, named by session id.
  4. Upload the captured logs and videos to S3.

krmahadevan commented 1 year ago

@diemol - Do you think that this was one of the use cases for building the reference implementation of org.openqa.selenium.grid.node.k8s.OneShotNode? Maybe we could consider exposing this as a Maven artifact so that we can pass in a reference to this implementation via --node-implementation?

diemol commented 1 year ago

@krmahadevan there is a solution in a PR in the docker-selenium project; have you checked it? @prashanth-volvocars was kind enough to submit it.

krmahadevan commented 1 year ago

@diemol - No, I wasn't aware of the PR. I went back and checked https://github.com/SeleniumHQ/docker-selenium/pull/1714

Even though I don't understand a lot of the k8s lingo yet, I kind of got the idea of what it is doing, and it looks like that should suffice for the k8s requirement of an autoscaling grid.

diemol commented 1 year ago

Yes, once we merge that, we can close this issue.

msvticket commented 1 year ago

I have made a new PR, SeleniumHQ/docker-selenium#1854 (based on SeleniumHQ/docker-selenium#1714). It has a few more features, including automatic installation of KEDA and autoscaling with jobs.

I have also supplied a Helm repo where you can get the chart so you can test this before the PR is merged.

aaron070596 commented 1 year ago

Hello team, I would like to know if this implementation will include a solution for the video recording feature in a distributed k8s setup, and whether there is any ETA for when we would be able to use these new components.

subin-krishna-test commented 1 year ago

You can grab all information about the setup here.

In a nutshell, it can:

  1. Auto-scale browser nodes up and down.
  2. Record videos and store them, named by session id.
  3. Capture logs and store them, named by session id.
  4. Upload the captured logs and videos to S3.

@prashanth-volvocars

What if I want to upload the videos to the S3 bucket with a specific name instead of the session-id-based .mp4 name? Or how can I identify which video file corresponds to a given test?

msvticket commented 1 year ago

In your test you know the session id, so you also know the file name. Specifying which file name to use is not possible with this solution.

subin-krishna-test commented 1 year ago

In your test you know the session id, so you also know the file name. Specifying which file name to use is not possible with this solution.

@msvticket Thanks for your response. I am using selenium-side-runner to connect to the remote Selenium Grid, passing the .side file as an argument to the side runner. How can I get the correct session id for each test if I am running multiple tests simultaneously?

tppalani commented 1 year ago

I have attempted to build something similar for Kubernetes with Selenium Grid 3. More details here: https://link.medium.com/QQMCXLqQMjb

Hi @sahajamit, I have seen your Medium post on configuring Selenium Grid inside the EKS cluster.

As per your instructions, I have created the Selenium Grid hub and I am able to access it via the ingress controller, but when I try to configure the Chrome node, it is not able to register with the Selenium hub.

In your post you mentioned k8s_host. What value are you referring to: the EKS cluster endpoint URL or something else?

diemol commented 10 months ago

We now have KEDA integrated in the chart, and video recording is also there. Closing this.

github-actions[bot] commented 9 months ago

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.