[🐛 Bug]: Each test-execution starts multiple jobs

maxnitze commented 1 year ago

What happened?

When I start selenium tests using the grid there are always two jobs started.

One is started immediately. Once it is up and running, a second is scheduled. The second job is then used for the test. After the test is done, only the second job finishes. The other keeps running (doing nothing). Today I stopped one that had been in the running state for the whole weekend.

The second is only started as soon as the first is ready. I saw this when the first was scheduled on a node that did not have the image yet. It took about 2:30 min to pull it. Only after that was done did the second job get scheduled. At first I thought this might have something to do with a timeout, because it took too long to pull the image. But it also happens if the image is available and the first job only takes seconds to get ready.

Command used to start Selenium Grid with Docker

I installed the Grid from the Helm chart using an existing KEDA installation.

selenium-grid:
  ingress:
    enabled: true
    [ ... ]

  hub:
    extraEnvironmentVariables:
      - name: TZ
        value: Europe/Berlin
    resources:
      limits:
        memory: 2Gi
      requests:
        cpu: 50m
        memory: 2Gi

  autoscaling:
    enableWithExistingKEDA: true
    scalingType: job

  chromeNode:
    enabled: true
    maxReplicaCount: 16
    extraEnvironmentVariables:
      - name: TZ
        value: Europe/Berlin
  firefoxNode:
    enabled: true
    maxReplicaCount: 8
    extraEnvironmentVariables:
      - name: TZ
        value: Europe/Berlin
  edgeNode:
    enabled: false

My Kubernetes cluster runs version 1.23.

Relevant log output

I only put the KEDA log in the form, as I could not see any interesting output in the Grid logs.

2023-07-30T12:56:19Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-firefox-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of running Jobs": 0}
2023-07-30T12:56:19Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-firefox-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of pending Jobs ": 0}
2023-07-30T12:56:19Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of running Jobs": 0}
2023-07-30T12:56:19Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of pending Jobs ": 0}
2023-07-30T12:56:19Z    INFO    scaleexecutor   Creating jobs   {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Effective number of max jobs": 1}
2023-07-30T12:56:19Z    INFO    scaleexecutor   Creating jobs   {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of jobs": 1}
2023-07-30T12:56:19Z    INFO    scaleexecutor   Created jobs    {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of jobs": 1}
2023-07-30T12:56:29Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-firefox-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of running Jobs": 0}
2023-07-30T12:56:29Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-firefox-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of pending Jobs ": 0}
2023-07-30T12:56:29Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of running Jobs": 1}
2023-07-30T12:56:29Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of pending Jobs ": 0}
2023-07-30T12:56:29Z    INFO    scaleexecutor   Creating jobs   {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Effective number of max jobs": 1}
2023-07-30T12:56:29Z    INFO    scaleexecutor   Creating jobs   {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of jobs": 1}
2023-07-30T12:56:29Z    INFO    scaleexecutor   Created jobs    {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of jobs": 1}
2023-07-30T12:56:39Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of running Jobs": 2}
2023-07-30T12:56:39Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of pending Jobs ": 0}
2023-07-30T12:56:39Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-firefox-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of running Jobs": 0}
2023-07-30T12:56:39Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-firefox-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of pending Jobs ": 0}
2023-07-30T12:56:49Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-firefox-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of running Jobs": 0}
2023-07-30T12:56:49Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-firefox-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of pending Jobs ": 0}
2023-07-30T12:56:49Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of running Jobs": 2}
2023-07-30T12:56:49Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-chrome-node", "scaledJob.Namespace": "selenium-grid-keda", "Number of pending Jobs ": 0}

Operating System

Kubernetes 1.23 on Flatcar Linux

Docker Selenium version (tag)

4.10.0-20230607

github-actions[bot] commented 1 year ago

@maxnitze, thank you for creating this issue. We will troubleshoot it as soon as we can.

Info for maintainers

Triage this issue by using labels.

If information is missing, add a helpful comment and then I-issue-template label.

If the issue is a question, add the I-question label.

If the issue is valid but there is no time to troubleshoot it, consider adding the help wanted label.

If the issue requires changes or fixes from an external project (e.g., ChromeDriver, GeckoDriver, MSEdgeDriver, W3C), add the applicable G-* label, and it will provide the correct link and auto-close the issue.

After troubleshooting the issue, please add the R-awaiting answer label.

Thank you!

diemol commented 1 year ago

Can you share the test script you are using to see this behavior?

maxnitze commented 1 year ago

Hey @diemol ,

I asked the KEDA project as well. And it seems the issue is with the scalingStrategy. When I set it to default it works.

See here: https://github.com/kedacore/keda/issues/4833

Is there any specific reason the default is set to accurate in this Chart? In the issue @JurTurFer mentioned:

I don't think that you will have any trouble with the change. TBH, IDK why they set accurate. We suggest using accurate only in the case of knowing that we job is completed just at the end and not in the meantime. Docs explain how they work (a bit below) but the main difference is how both strategies take into account the current jobs.

https://github.com/kedacore/keda/issues/4833#issuecomment-1658887078

maxnitze commented 1 year ago

Can you share the test script you are using to see this behavior?

To answer your question: I have Geb tests for some of my applications. To connect to the Grid I use the RemoteWebDriver from org.seleniumhq.selenium:selenium-remote-driver:3.141.59.

diemol commented 1 year ago

Hey @diemol ,

I asked the KEDA project as well. And it seems the issue is with the scalingStrategy. When I set it to default it works.

See here: kedacore/keda#4833

Is there any specific reason the default is set to accurate in this Chart? In the issue @JurTurFer mentioned:

I don't think that you will have any trouble with the change. TBH, IDK why they set accurate. We suggest using accurate only in the case of knowing that we job is completed just at the end and not in the meantime. Docs explain how they work (a bit below) but the main difference is how both strategies take into account the current jobs.

kedacore/keda#4833 (comment)

@msvticket do you know?

maxnitze commented 1 year ago

For reference: It was set to accurate right from the beginning: https://github.com/SeleniumHQ/docker-selenium/commit/f0bbfe02c318ac58b8875f8f26c607ca86b9cf42

I could not find any discussion about the strategy in the PR.

amardeep2006 commented 1 year ago

I am seeing similar behavior with scalingType: deployment. Kubernetes version is 1.23 for me as well. Observation: the wdio framework gets a 504 gateway timeout error. A session is started on the node, but the browser does nothing. A few sessions are shown as pending in the queue as well.

I will try the following and share results:

Increase timeout on ingress.
Increase default connect timeout in wdio framework.

maxnitze commented 1 year ago

We currently experience a problem with the default strategy as well: it expects the sessions to stay in the queue while they are being worked on. The calculation for the scaled jobs basically checks whether more jobs are running than there are sessions in the queue. If that's the case, no new job is scheduled.

Maybe that is what the accurate strategy was meant to fix? We are currently checking if and how we can implement a custom strategy instead.
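To make the calculation described above concrete, here is a minimal sketch in Python. It assumes the default strategy computes "jobs needed for the queue minus running jobs" and clamps negative results to zero, which is my reading of the KEDA docs and not a copy of KEDA's Go code. The function and parameter names are made up for illustration.

```python
import math


def max_scale(queue_length: int, target_per_job: int, max_replicas: int) -> int:
    """Jobs needed for the reported queue length, capped at max_replicas."""
    return min(math.ceil(queue_length / target_per_job), max_replicas)


def default_new_jobs(queue_length: int, running_jobs: int,
                     max_replicas: int, target_per_job: int = 1) -> int:
    """New jobs per polling cycle under the assumed 'default' formula.

    Assumption: running jobs are subtracted from the queue-based target,
    and a negative result means "create nothing".
    """
    needed = max_scale(queue_length, target_per_job, max_replicas)
    return max(0, needed - running_jobs)


# Three sessions are already in progress (and have left the Grid queue),
# two new sessions are waiting: nothing is created.
print(default_new_jobs(queue_length=2, running_jobs=3, max_replicas=16))

# Only once the queue grows beyond the running jobs is a new job created.
print(default_new_jobs(queue_length=4, running_jobs=3, max_replicas=16))
```

Under these assumptions the first call yields no new jobs even though two sessions are waiting, which matches the under-scaling described in this comment.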

amardeep2006 commented 1 year ago

I am seeing similar behavior with scalingType: deployment. Kubernetes version is 1.23 for me as well. Observation: the wdio framework gets a 504 gateway timeout error. A session is started on the node, but the browser does nothing. A few sessions are shown as pending in the queue as well.

I will try the following and share results:

Increase timeout on ingress.

Increase default connect timeout in wdio framework.

Update with scalingType: deployment: we have seen improvements after increasing the timeouts in the ingress. The pending sessions are gone.

msvticket commented 1 year ago

Hey @diemol ,

I asked the KEDA project as well. And it seems the issue is with the scalingStrategy. When I set it to default it works.

Your mileage may vary, apparently. For me it worked much better with accurate. The scale-up was way too slow with default. I suppose it depends on your priorities: if you want a fast scaling response, choose accurate; if you want to be sure you don't end up with too many pods, choose default.

maxnitze commented 1 year ago

That might be another issue (we did not have issues with too slow startup though).

A bigger problem is the calculation of the scaling itself. I dug deeper into the KEDA code and found out that the default strategy assumes that "locked messages" (the ones that are already in progress) stay in the queue. That is not the case in the Selenium Grid. As a result, new jobs are only started once the queue length exceeds the number of currently running jobs.

This issue is exactly what the accurate strategy solves:

If the scaler returns queueLength (number of items in the queue) that does not include the number of locked messages, this strategy is recommended.

see https://keda.sh/docs/2.11/concepts/scaling-jobs/
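As a rough illustration of why accurate can still produce the extra job from the original report, here is a small Python sketch. The formula follows my reading of the KEDA scaling-jobs docs and is not KEDA's actual Go code. The queue lengths in the replay are assumptions (the KEDA log above does not print them): the session request is assumed to still be queued at the second poll because the first node had not picked it up yet.

```python
def accurate_new_jobs(max_scale: int, running: int, pending: int,
                      max_replicas: int) -> int:
    """Jobs to create in one polling cycle under the assumed 'accurate' formula.

    Running jobs only matter when the replica cap would be exceeded;
    otherwise only pending (not yet running) jobs are subtracted.
    """
    if max_scale + running > max_replicas:
        return max_replicas - running
    return max_scale - pending


MAX_REPLICAS = 16  # chromeNode.maxReplicaCount from the values above

# Running/pending counts are taken from the chrome-node log lines at
# 12:56:19, 12:56:29 and 12:56:39; the queue lengths are assumed.
polls = [
    {"queue": 1, "running": 0, "pending": 0},
    {"queue": 1, "running": 1, "pending": 0},
    {"queue": 0, "running": 2, "pending": 0},
]

created = [
    accurate_new_jobs(min(p["queue"], MAX_REPLICAS),
                      p["running"], p["pending"], MAX_REPLICAS)
    for p in polls
]
print(created)  # [1, 1, 0]
```

If the assumptions hold, the second poll creates another job because the first job already counts as running instead of pending while the request is still in the queue, which would match the two "Created jobs" entries in the log.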

maxnitze commented 1 year ago

I suppose it depends on your priorities: if you want a fast scaling response, choose accurate; if you want to be sure you don't end up with too many pods, choose default.

The issue was not only that we started too many pods, but rather, that additional jobs were started which never finished. I had this in a test setup with only a single session, though. I'm not sure whether another session might later be picked up by the additional job. Do you have any experience there?

msvticket commented 1 year ago

I suppose it depends on your priorities: if you want a fast scaling response, choose accurate; if you want to be sure you don't end up with too many pods, choose default.

The issue was not only that we started too many pods, but rather, that additional jobs were started which never finished.

Which is the same thing.

I had this in a test setup with only a single session, though. I'm not sure whether another session might later be picked up by the additional job. Do you have any experience there?

Yes it would.

msvticket commented 1 year ago

That might be another issue (we did not have issues with too slow startup though).

A bigger problem is the calculation of the scaling itself. I dug deeper into the KEDA code and found out that the default strategy assumes that "locked messages" (the ones that are already in progress) stay in the queue. That is not the case in the Selenium Grid. As a result, new jobs are only started once the queue length exceeds the number of currently running jobs.

This issue is exactly what the accurate strategy solves:

Exactly. Which is why I chose accurate as the default strategy in the chart.

amardeep2006 commented 11 months ago

I have been experimenting with both scaling types (job/deployment) and am seeing multiple jobs getting triggered. In one occurrence it started 16 jobs for just two test cases. For now I am sticking with deployment and will wait to hear more from others about this behavior. I tried KEDA 2.12.0 as well.

cr-liorholtzman commented 9 months ago

Any update on this one? We also started having this issue after upgrading KEDA from 2.11.1 to 2.12.0.

maxnitze commented 9 months ago

Fortunately (or unfortunately for you) we don't have the problem anymore. It was happening when we had a test setup that only one application used at the time. When we scaled this up, it just went away. We are running hundreds of jobs daily now and have had no issues with "extra spawned jobs" so far.

Sorry, that I cannot be of more help :/
