SeleniumHQ / docker-selenium

Provides a simple way to run Selenium Grid with Chrome, Firefox, and Edge using Docker, making it easier to perform browser automation
http://www.selenium.dev/docker-selenium/

[🐛 Bug]: After upgrading to Selenium version 4.16.1 and Edge 120, some of the edge nodes are being placed in a queue. #2113

Closed chandupranayp closed 1 month ago

chandupranayp commented 7 months ago

What happened?

After upgrading to Selenium version 4.16.1 and Edge 120, we have encountered an issue where some of the Edge nodes are being placed in a queue. Previously, we were using version 4.13.0 and Edge 117, and did not experience this problem. It seems that this issue is specific to Edge, as Chrome is functioning properly.

For example, when we trigger 5 Edge and 5 Chrome scripts, all 5 Chrome nodes start but only 4 Edge nodes do. The fifth Edge session request stays in the queue, even though maxReplicaCount is set to 50.
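
One way to confirm which requests are stuck is to ask the Grid itself. The sketch below is a minimal, hedged example: it assumes the Grid 4 GraphQL query `{ sessionsInfo { sessionQueueRequests } }` returns each queued request as a JSON-encoded capabilities string, and the URL passed to `fetch_queue` is a placeholder for your own `/graphql` endpoint. The function names are hypothetical helpers, not part of Selenium or KEDA.

```python
import json
import urllib.request
from collections import Counter


def fetch_queue(graphql_url):
    """POST the queue query to the Grid GraphQL endpoint and return the parsed JSON."""
    body = json.dumps({"query": "{ sessionsInfo { sessionQueueRequests } }"}).encode("utf-8")
    request = urllib.request.Request(
        graphql_url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)


def count_queued_by_browser(graphql_payload):
    """Count queued new-session requests per browserName.

    Assumption: every queue entry is a JSON string of the requested capabilities.
    """
    queued = graphql_payload["data"]["sessionsInfo"]["sessionQueueRequests"]
    counts = Counter()
    for raw in queued:
        capabilities = json.loads(raw)
        counts[capabilities.get("browserName", "<unset>")] += 1
    return dict(counts)
```

Calling `count_queued_by_browser(fetch_queue(url))` while the tests run shows whether the waiting request really carries the Edge browser name that the scaler is configured to match.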

Command used to start Selenium Grid with Docker (or Kubernetes)

Below are the yml files:

values.yml
global:
  seleniumGrid:
    imageRegistry: crazcdaks.azurecr.io
    imageTag: 4.16.1-20231219
    nodesImageTag: 4.16.1-20231219
    imagePullSecret: ""

basicAuth:
  enabled: false

isolateComponents: false

ingress:
  enabled: true
  className: ""
  annotations: {}
  hostname: selenium-grid.local
  tls: []

busConfigMap:
  name: selenium-event-bus-config
  annotations: {}

components:

  router:
    imageName: selenium/router

    imagePullPolicy: IfNotPresent
    imagePullSecret: ""

    annotations: {}
    port: 4444
    livenessProbe:
      enabled: true
      path: /readyz
      initialDelaySeconds: 10
      failureThreshold: 10
      timeoutSeconds: 10
      periodSeconds: 10
      successThreshold: 1
    readinessProbe:
      enabled: true
      path: /readyz
      initialDelaySeconds: 12
      failureThreshold: 10
      timeoutSeconds: 10
      periodSeconds: 10
      successThreshold: 1
    resources: {}
    serviceType: ClusterIP
    loadBalancerIP: ""
    serviceAnnotations: {}
    tolerations: []
    nodeSelector: {}
    priorityClassName: ""

  distributor:
    imageName: selenium/distributor

    imagePullPolicy: IfNotPresent
    imagePullSecret: ""

    annotations: {}
    port: 5553
    resources: {}
    serviceType: ClusterIP
    serviceAnnotations: {}
    tolerations: []
    nodeSelector: {}
    priorityClassName: ""

  eventBus:
    imageName: selenium/event-bus

    imagePullPolicy: IfNotPresent
    imagePullSecret: ""

    annotations: {}
    port: 5557
    publishPort: 4442
    subscribePort: 4443
    resources: {}
    serviceType: ClusterIP
    serviceAnnotations: {}
    tolerations: []
    nodeSelector: {}
    priorityClassName: ""

  sessionMap:
    imageName: selenium/sessions

    imagePullPolicy: IfNotPresent
    imagePullSecret: ""

    annotations: {}
    port: 5556
    resources: {}
    serviceType: ClusterIP
    serviceAnnotations: {}
    tolerations: []
    nodeSelector: {}
    priorityClassName: ""

  sessionQueue:
    imageName: selenium/session-queue

    imagePullPolicy: IfNotPresent
    imagePullSecret: ""

    annotations: {}
    port: 5559
    resources: {}
    serviceType: ClusterIP
    serviceAnnotations: {}
    tolerations: []
    nodeSelector: {}
    priorityClassName: ""

  extraEnvironmentVariables:

  extraEnvFrom:

hub:
  imageName: selenium/hub
  imagePullPolicy: IfNotPresent
  imagePullSecret: ""

  annotations: {}
  labels: {}
  publishPort: 4442
  subscribePort: 4443
  port: 4444
  livenessProbe:
    enabled: true
    path: /readyz
    initialDelaySeconds: 10
    failureThreshold: 10
    timeoutSeconds: 10
    periodSeconds: 10
    successThreshold: 1
  readinessProbe:
    enabled: true
    path: /readyz
    initialDelaySeconds: 12
    failureThreshold: 10
    timeoutSeconds: 10
    periodSeconds: 10
    successThreshold: 1
  extraEnvironmentVariables:
    - name: SE_SESSION_REQUEST_TIMEOUT
      value: "300"
    - name: SE_NODE_SESSION_TIMEOUT
      value: "600"
  extraEnvFrom:
  resources: {}
  serviceType: ClusterIP
  loadBalancerIP: ""
  serviceAnnotations: {}
  tolerations: []
  nodeSelector: {}
  priorityClassName: ""

chromeNode:
  enabled: true

  deploymentEnabled: true

  replicas: 0
  imageName: selenium/node-chrome
  imagePullPolicy: IfNotPresent
  imagePullSecret: ""

  ports:
    - 5555
  seleniumPort: 5900
  seleniumServicePort: 6900
  annotations: {}
  labels: {}
  resources:
    requests:
      memory: "1Gi"
      cpu: "0.25"
    limits:
      memory: "2Gi"
      cpu: "1"
  tolerations: []
  nodeSelector: {}
  hostAliases:
  extraEnvironmentVariables:
    - name: SE_SCREEN_WIDTH
      value: "1920"
    - name: SE_SCREEN_HEIGHT
      value: "1080"
    - name: SE_SESSION_REQUEST_TIMEOUT
      value: "300"
    - name: SE_NODE_SESSION_TIMEOUT
      value: "600"
  extraEnvFrom:
  service:
    enabled: true
    type: ClusterIP
    annotations: {}
  dshmVolumeSizeLimit: 2Gi
  priorityClassName: ""

  startupProbe: {}
  terminationGracePeriodSeconds: 3600
  lifecycle:
    preStop:
      exec:
        command: ["/bin/sh", "-c", "curl --request POST 'localhost:5555/se/grid/node/drain' --header 'X-REGISTRATION-SECRET;'; tail --pid=$(pgrep -f '[n]ode --bind-host false --config /opt/selenium/config.toml') -f /dev/null; sleep 30s"]

  extraVolumeMounts: []

  extraVolumes: []

firefoxNode:
  enabled: true

  deploymentEnabled: true

  replicas: 0
  imageName: selenium/node-firefox
  imagePullPolicy: IfNotPresent
  imagePullSecret: ""

  ports:
    - 5555
  seleniumPort: 5900
  seleniumServicePort: 6900
  annotations: {}
  labels: {}
  tolerations: []
  nodeSelector: {}
  resources:
    requests:
      memory: "1Gi"
      cpu: "0.25"
    limits:
      memory: "2Gi"
      cpu: "1"
  hostAliases:
  extraEnvironmentVariables:
    - name: SE_SCREEN_WIDTH
      value: "1920"
    - name: SE_SCREEN_HEIGHT
      value: "1080"
    - name: SE_SESSION_REQUEST_TIMEOUT
      value: "300"
    - name: SE_NODE_SESSION_TIMEOUT
      value: "600"
  extraEnvFrom:
  service:
    enabled: true
    type: ClusterIP
    annotations: {}
  dshmVolumeSizeLimit: 2Gi
  priorityClassName: ""

  startupProbe: {}
  terminationGracePeriodSeconds: 3600
  lifecycle:
    preStop:
      exec:
        command: ["/bin/sh", "-c", "curl --request POST 'localhost:5555/se/grid/node/drain' --header 'X-REGISTRATION-SECRET;'; tail --pid=$(pgrep -f '[n]ode --bind-host false --config /opt/selenium/config.toml') -f /dev/null; sleep 30s"]

  extraVolumeMounts: []

  extraVolumes: []

edgeNode:
  enabled: true

  deploymentEnabled: true

  replicas: 0
  imageName: selenium/node-edge
  imagePullPolicy: IfNotPresent
  imagePullSecret: ""

  ports:
    - 5555
  seleniumPort: 5900
  seleniumServicePort: 6900
  annotations: {}
  labels: {}
  tolerations: []
  nodeSelector: {}
  resources:
    requests:
      memory: "1Gi"
      cpu: "0.25"
    limits:
      memory: "2Gi"
      cpu: "1"
  hostAliases:
  extraEnvironmentVariables:
    - name: SE_SCREEN_WIDTH
      value: "1920"
    - name: SE_SCREEN_HEIGHT
      value: "1080"
    - name: SE_SESSION_REQUEST_TIMEOUT
      value: "300"
    - name: SE_NODE_SESSION_TIMEOUT
      value: "600"
  extraEnvFrom:
  service:
    enabled: true
    type: ClusterIP
    annotations:
      hello: world
  dshmVolumeSizeLimit: 2Gi
  priorityClassName: ""

  startupProbe: {}
  terminationGracePeriodSeconds: 3600
  lifecycle:
    preStop:
      exec:
        command: ["/bin/sh", "-c", "curl --request POST 'localhost:5555/se/grid/node/drain' --header 'X-REGISTRATION-SECRET;'; tail --pid=$(pgrep -f '[n]ode --bind-host false --config /opt/selenium/config.toml') -f /dev/null; sleep 30s"]

  extraVolumeMounts: []

  extraVolumes: []

customLabels: {}

Keda-seleniumtriggers.yml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: selenium-grid-chrome-scaledobject
  namespace: selenium-dev
  labels:
    deploymentName: selenium-chrome-node
spec:
  maxReplicaCount: 50
  scaleTargetRef:
    name: selenium-chrome-node
  triggers:
    - type: selenium-grid
      metadata:
        url: 'https://selenium.***.in.***.dev/graphql'
        browserName: 'chrome'
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: selenium-grid-firefox-scaledobject
  namespace: selenium-dev
  labels:
    deploymentName: selenium-firefox-node
spec:
  maxReplicaCount: 5
  scaleTargetRef:
    name: selenium-firefox-node
  triggers:
    - type: selenium-grid
      metadata:
        url: 'https://selenium.***.in.***.dev/graphql'
        browserName: 'firefox'
---
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: selenium-grid-edge-scaledobject
  namespace: selenium-dev
  labels:
    deploymentName: selenium-edge-node
spec:
  maxReplicaCount: 50 
  scaleTargetRef:
    name: selenium-edge-node
  triggers:
    - type: selenium-grid
      metadata:
        url: 'https://selenium.***.in.***.dev/graphql'
        browserName: 'MicrosoftEdge'
        sessionBrowserName: 'msedge'
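
The Edge trigger above sets two names because an Edge request is queued as `MicrosoftEdge` but, once running, the session reports `msedge`. The following is a simplified illustration of why both are needed, not KEDA's actual implementation: it assumes the scaler adds queued requests matching `browserName` to active sessions matching `sessionBrowserName`, and it ignores details such as sessions per node. `desired_nodes` is a hypothetical helper.

```python
import json


def desired_nodes(queued, active, browser_name, session_browser_name=None):
    """Return queued requests plus active sessions that match the trigger.

    `queued` and `active` are lists of JSON-encoded capabilities strings
    (an assumed shape, mirroring the Grid GraphQL response).
    """
    # Fall back to browser_name when no separate session name is configured.
    session_name = session_browser_name or browser_name
    queued_matches = sum(
        1 for raw in queued if json.loads(raw).get("browserName") == browser_name
    )
    active_matches = sum(
        1 for raw in active if json.loads(raw).get("browserName") == session_name
    )
    return queued_matches + active_matches


# Scenario from this report: 4 Edge sessions running, 1 request still queued.
queued = ['{"browserName": "MicrosoftEdge"}']
active = ['{"browserName": "msedge"}'] * 4
with_session_name = desired_nodes(queued, active, "MicrosoftEdge", "msedge")
without_session_name = desired_nodes(queued, active, "MicrosoftEdge")
```

Under these assumptions the first call counts all five Edge workloads, while the second counts only the queued one, because the running sessions no longer match `MicrosoftEdge`.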

Relevant log output

To reproduce the issue, run the test below 5 times in parallel. You will see only 4 active Edge nodes, with the fifth session request left in the queue.

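        using System;
        using Microsoft.VisualStudio.TestTools.UnitTesting;
        using OpenQA.Selenium;
        using OpenQA.Selenium.Edge;
        using OpenQA.Selenium.Remote;

        [TestClass]
        public class BrowserTests
        {
            // Assumed declarations; the original snippet did not show them.
            private static readonly string docker_execution = "Edge";
            private IWebDriver driver;

            // MSTest test methods must be public, non-static instance methods.
            [TestMethod]
            public void Browser_Initialization()
            {
                // No empty catch block here: a failed session request should fail the test.
                if (docker_execution.Equals("Edge"))
                {
                    EdgeOptions options = new EdgeOptions();
                    // 5-minute command timeout, the same length as SE_SESSION_REQUEST_TIMEOUT=300 in values.yml.
                    driver = new RemoteWebDriver(new Uri("https://selenium.***.in.***.dev/wd/hub"), options.ToCapabilities(), TimeSpan.FromMinutes(5));
                    driver.Manage().Window.Maximize();
                }
            }

            [TestCleanup]
            public void Cleanup() => driver?.Quit();
        }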
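
For anyone without the C# project, the "5 times in parallel" step can be approximated with a small script. This is a sketch under assumptions: it uses the third-party `selenium` Python package (imported lazily so the helper can be exercised without it), the Grid URL is whatever your `/wd/hub` endpoint is, and `hold_seconds` is an invented parameter that keeps each session open so the requests overlap.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def open_edge_session(grid_url, hold_seconds=60):
    """Open one remote Edge session, hold it briefly, and return its session id."""
    from selenium import webdriver  # third-party dependency, imported on demand

    driver = webdriver.Remote(command_executor=grid_url, options=webdriver.EdgeOptions())
    try:
        time.sleep(hold_seconds)  # keep the node busy so the sessions overlap
        return driver.session_id
    finally:
        driver.quit()


def run_parallel(open_session, grid_url, count=5):
    """Start `count` session requests at once and return (ok, detail) per request."""

    def attempt(_):
        try:
            return True, open_session(grid_url)
        except Exception as exc:  # report the failure instead of hiding it
            return False, repr(exc)

    with ThreadPoolExecutor(max_workers=count) as pool:
        return list(pool.map(attempt, range(count)))
```

Running `run_parallel(open_edge_session, grid_url)` against the Grid and watching the Grid UI at the same time should show whether one of the five requests remains queued.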

Operating System

Kubernetes version: 1.26.6

Docker Selenium version (image tag)

Selenium version 4.16.1 and Edge 120

Selenium Grid chart version (chart version)

No response

github-actions[bot] commented 7 months ago

@chandupranayp, thank you for creating this issue. We will troubleshoot it as soon as we can.


VietND96 commented 7 months ago

Hi @chandupranayp, is autoscaling running on top of your own existing KEDA installation? If so, which KEDA version are you using?

chandupranayp commented 7 months ago

Hi @VietND96, yes, autoscaling runs on top of our existing KEDA installation, and the KEDA version we are using is 2.9.3.

VietND96 commented 7 months ago

In that case, I suggest upgrading KEDA to a recent version (currently 2.13.0) to test and confirm. The KEDA changelog https://github.com/kedacore/keda/blob/main/CHANGELOG.md lists a few fixes for the Selenium Grid scaler between 2.9.3 and 2.13.0.

chandupranayp commented 7 months ago

@vietnd96, Thanks for your quick feedback.

I need to wait a couple more weeks before updating KEDA because of some other dependencies, so I can only test and confirm after that. However, after reviewing the changelog you provided, I didn't find any fixes related to the Edge issue I am currently facing. Can you suggest any other possible causes that I can investigate and test before proceeding with the KEDA upgrade?

VietND96 commented 7 months ago

Ah yes, you mentioned this started after upgrading to 4.16.1. For that version, chart 0.26.3 changed the default value of autoscaling.scalingStrategy.strategy from accurate to default. If you are using scalingType: job and facing this issue, can you try changing it back to accurate? Note: in the latest chart, 0.27.0, the default has already been changed back to accurate. If you are using scalingType: deployment, the strategy is not relevant.

chandupranayp commented 7 months ago

@VietND96, we are using scalingType: deployment. I upgraded Selenium from 4.16.1 to 4.17, but the issue persists. In a few days, we will upgrade KEDA and retest. Meanwhile, please let me know if you can recommend any other fixes. I greatly appreciate your time and feedback.

chandupranayp commented 6 months ago

@VietND96, we have now upgraded our infrastructure to the versions below. However, the issue remains the same after the upgrade: some of our Edge session requests still end up in the queue. Can you please assist with this issue?

Kubernetes version: 1.27.7
KEDA: 2.12.1
Selenium Grid: 4.18.1
Edge: 122
Chrome: 122

chandupranayp commented 6 months ago

@VietND96 Can you please assist with this? Please let me know if you need any further information from my end.

VietND96 commented 6 months ago

Hi @chandupranayp, I will get back to you on this once I have a lead. Besides this issue, some other instabilities related to autoscaling are also under investigation.

chandupranayp commented 6 months ago

Hello @VietND96, thank you so much for the acknowledgment.

chandupranayp commented 2 months ago

Hello @VietND96, is there any update on this issue, please?

VietND96 commented 1 month ago

@chandupranayp, the exact root cause has yet to be identified. However, 2 fixes are available from the Grid server:

https://github.com/SeleniumHQ/selenium/pull/14272 (delivered in 4.23)
https://github.com/SeleniumHQ/selenium/pull/14282 (will be delivered in 4.23)

We will continue to keep track of this issue.

VietND96 commented 1 month ago

FYI, image tag 4.23.0-20240727 and chart version 0.33.0 contain the fixes mentioned above. Kindly verify and provide feedback if it is the right fix for this issue.

github-actions[bot] commented 1 week ago

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.