SeleniumHQ / selenium

A browser automation framework and ecosystem.
https://selenium.dev
Apache License 2.0
30.75k stars 8.19k forks source link

[🐛 Bug]: On grid, session is started but never used #11881

Closed bhecquet closed 1 year ago

bhecquet commented 1 year ago

What happened?

I encounter an issue when trying to execute more tests than a Selenium Grid can handle

My setup:

The script adds a wait to force new sessions to be queued So when starting the 4 instances (methods test1, test2, test3 and test4), I can see them run, or queued, but at some point, a session is created but not used: on the node, browser is started, but URL not loaded. In that case, we need to wait 5 mins (default session timeout) for next test to be started

From what I observed, I would say that there is a race condition between the session request timeout and the session creation thread, as the "stalled" session is logged as timed out in test script logs: Cannot create driver: Could not start a new session. Response code 500. Message: Could not start a new session. New session request timed out

Thanks for your help

How can we reproduce the issue?

package company;

import org.openqa.selenium.By;
import org.openqa.selenium.Capabilities;
import org.openqa.selenium.MutableCapabilities;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.chrome.ChromeDriver;
import org.openqa.selenium.chrome.ChromeOptions;
import org.openqa.selenium.remote.RemoteWebDriver;

import java.net.MalformedURLException;
import java.net.URL;

public class Test {

    @org.testng.annotations.Test
    public void testMethod(String name) throws MalformedURLException, InterruptedException {
        for (int i = 0; i < 10; i++) {
            MutableCapabilities caps = new ChromeOptions();
            caps.setCapability("cs:test", String.format(name + "-" + i));
            try {
                System.out.println(String.format("Starting session %s: %d", name, i));
                WebDriver driver = new RemoteWebDriver(new URL("http://selenium-server.company.com:4448/wd/hub"), caps);
                System.out.println(String.format("Started session %s: %d", name, i));
                try {
                    int wait = (int) (Math.random() * 30000);
                    driver.get("https://www.selenium.dev/");
                    driver.manage().window().maximize();
                    driver.findElement(By.linkText("Documentation")).click();
                    Thread.sleep(wait);
                    System.out.println("Test OK");
                } finally {
                    driver.quit();
                }
            } catch (Exception e) {
                System.out.println("Cannot create driver: " + e.getMessage());
            }
        }
    }

    @org.testng.annotations.Test
    public void test() throws MalformedURLException, InterruptedException {
        testMethod("test1");
    }
    @org.testng.annotations.Test
    public void test2() throws MalformedURLException, InterruptedException {
        testMethod("test2");
    }
    @org.testng.annotations.Test
    public void test3() throws MalformedURLException, InterruptedException {
        testMethod("test3");
    }
    @org.testng.annotations.Test
    public void test4() throws MalformedURLException, InterruptedException {
        testMethod("test4");
    }
}

Relevant log output

Hub: 
15:47:37.325 INFO [LocalDistributor.newSession] - Session created by the Distributor. Id: 056d017f54d5e55bc24fc8dd03b06316
 Caps: Capabilities {acceptInsecureCerts: false, browserName: chrome, browserVersion: 104.0.5112.102, chrome: {chromedriverVersion: 104.0.5112.79 (3cf3e8c8a07d..., userDataDir: C:\Users\ZSSELE~1\AppData\L...}, cs:test: test1-2, goog:chromeOptions: {debuggerAddress: localhost:54467}, networkConnectionEnabled: false, pageLoadStrategy: normal, platformName: Windows 10, proxy: {}, se:bidiEnabled: false, se:cdp: ws://selenium-server..., se:cdpVersion: 104.0.5112.102, setWindowRect: true, strictFileInteractability: false, timeouts: {implicit: 0, pageLoad: 300000, script: 30000}, unhandledPromptBehavior: dismiss and notify, webauthn:extension:credBlob: true, webauthn:extension:largeBlob: true, webauthn:virtualAuthenticators: true}
15:47:39.333 WARN [SeleniumSpanExporter$1.lambda$export$3] - {"traceId": "4467cec474a962ae70f33c9b68cba777","eventTime": 1681220859332195377,"eventName": "HTTP request execution complete","attributes": {"http.flavor": 1,"http.handler_class": "org.openqa.selenium.grid.sessionqueue.local.LocalNewSessionQueue","http.host": "selenium-server.company.com:4448","http.method": "POST","http.request_content_length": "421","http.scheme": "HTTP","http.status_code": 500,"http.target": "\u002fsession","http.user_agent": "selenium\u002f4.7.2 (java windows)"}}

15:48:05.054 WARN [SeleniumSpanExporter$1.lambda$export$3] - {"traceId": "c7dbb6ffaea0c77307032001fd43f947","eventTime": 1681220885053873705,"eventName": "HTTP request execution complete","attributes": {"http.flavor": 1,"http.handler_class": "org.openqa.selenium.grid.sessionqueue.local.LocalNewSessionQueue","http.host": "selenium-server.company.com:4448","http.method": "POST","http.request_content_length": "421","http.scheme": "HTTP","http.status_code": 500,"http.target": "\u002fsession","http.user_agent": "selenium\u002f4.7.2 (java windows)"}}

Moreover an other test showed the following events:

Operating System

Hub: RHE 7.9, Node: Windows 10

Selenium version

4.7.2 (script), 4.8.3 (Hub & node)

What are the browser(s) and version(s) where you see this issue?

chrome 104

What are the browser driver(s) and version(s) where you see this issue?

chromedriver 104

Are you using Selenium Grid?

4.8.3

github-actions[bot] commented 1 year ago

@bhecquet, thank you for creating this issue. We will troubleshoot it as soon as we can.


Info for maintainers

Triage this issue by using labels.

If information is missing, add a helpful comment and then I-issue-template label.

If the issue is a question, add the I-question label.

If the issue is valid but there is no time to troubleshoot it, consider adding the help wanted label.

If the issue requires changes or fixes from an external project (e.g., ChromeDriver, GeckoDriver, MSEdgeDriver, W3C), add the applicable G-* label, and it will provide the correct link and auto-close the issue.

After troubleshooting the issue, please add the R-awaiting answer label.

Thank you!

extravio commented 1 year ago

I've got a similar issue. Environment: Grid deployed with Helm (4.8.1 - fully distributed - no extra config) - 2 chrome nodes.

5 concurrent tests, each test has a 200 second delay. result: 1 failure

response = {'status': 500, 'value': '{\n  "value": {\n    "error": "timeout",\n    "message": "java.net.http.HttpTimeoutException...n\\tat org.openqa.selenium.remote.http.jdk.JdkHttpClient.execute(JdkHttpClient.java:292)\\n\\t... 36 more\\n"\n  }\n}'}
[...]
=========================== short test summary info ============================
FAILED selenium-test.py::test_qahss03_4 - selenium.common.exceptions.TimeoutE...
=================== 1 failed, 4 passed in 404.93s (0:06:44) ====================

To avoid the timeout, I add the following configuration

extraEnvironmentVariables:
    - name: SE_SESSION_REQUEST_TIMEOUT
      value: "1200"
    - name: SE_SESSION_RETRY_INTERVAL
      value: "5"`

result: 4 tests PASSED, no response is received for the 5th test and the test suite never completes

All our test suites that use concurrent tests are affected by timeout issues: [TestNG] org.openqa.selenium.SessionNotCreatedException: Could not start a new session. Response code 500. Message: java.util.concurrent.TimeoutException

bhecquet commented 1 year ago

An other information that may help to resolve this bug: if I remove the "--session-request-timeout 30" parameter from hub, I don't have the bug anymore. But this is not a solution for me as I have 2 hubs in parallel (for redundancy purpose) and I don't want my new session request being stuck on one of them

diemol commented 1 year ago

I do not see the issue. There are three timeouts into play here:

@bhecquet In your case, you are setting 30 seconds for the session request queue timeout, therefore any new session request will stave because you are leaving the running sessions idle and waiting for the session timeout to happen.

I recommend to check the different flags for configuration here https://www.selenium.dev/documentation/grid/configuration/cli_options/, also the first steps here https://www.selenium.dev/documentation/grid/getting_started/

If you have more questions, I will leave some links below you can use to join our user chat community.

github-actions[bot] commented 1 year ago

💬 Please ask questions at:

bhecquet commented 1 year ago

Hello,

thanks for your reply. I think you missed something: I do not let the session idle, in the problematic case, the client request is "aborted" whereas, at the same time, the session is created on the node, so, there is no way to use the created session

diemol commented 1 year ago

The reason is the same, you need to figure out what timeouts work for you. If the client times out before the queue times out, then you will have orphan sessions. That is why we left it configurable and not hard coded.

bhecquet commented 1 year ago

So I do not understand why I get a staled session on the node, corresponding to a client request which has already timeout (after 30 secs only).

Tell me if I understand correctly the process:

  1. when client needs a session, this session is put in session queue
  2. At regular (and configurable) intervals, distributor ask the session queue for pending sessions, if it can handle at least one
  3. The first pending session is pulled from session queue by distributor which creates the session (I don't know here if it's really removed or not from session queue). I observed that this process can take 2 seconds
  4. The session is created on node

Is there a moment where session is still in session queue and also created on node ?

diemol commented 1 year ago

That seems correct, but what you are seeing:

Cannot create driver: Could not start a new session. Response code 500. Message: Could not start a new session. New session request timed out 

is because the session request was in the queue for over 30 seconds and no Node slot was released in the meantime.

bhecquet commented 1 year ago

Yes, I agree with that, but, in my test code (provided in the issue) I set a custom capability ("cs:test") that help me identify which session is currently started on node, and what I observe is:

That's what surprise me

diemol commented 1 year ago

I do not think you can correlate that because you are executing a loop, 10 times, for each test. So you do not have a single request with cs:test3, you are actually creating 10.

bhecquet commented 1 year ago

caps.setCapability("cs:test", String.format(name + "-" + i)); I simplified :) the capability holds the test name and the loop index

I looked at the "LocalNewSessionQueue" code and see that while distributor takes effectively the session request to create it, it's removed from the queue, so theorically, what I observe is not possible :(

I will do more tests to understand better what happens and let you know

Thanks for your time

bhecquet commented 1 year ago

Hello @diemol

I think I've found the root cause of my problem and it's in LocalNewSessionQueue class

In fact, when session request arrives, a wait is performed https://github.com/SeleniumHQ/selenium/blob/440a388d0bdf0e3b6f977a5048c591d3865c02e9/java/src/org/openqa/selenium/grid/sessionqueue/local/LocalNewSessionQueue.java#L199

During this wait, LocalDistributor calls https://github.com/SeleniumHQ/selenium/blob/440a388d0bdf0e3b6f977a5048c591d3865c02e9/java/src/org/openqa/selenium/grid/sessionqueue/local/LocalNewSessionQueue.java#L304 to get sessions in queue

But nothing prevents the session request to timeout while the session is being created

I've added more logs to grid code (whole file attached) logs_reduced.log

08:48:44.243 INFO [LocalNewSessionQueue.addToQueue] - 7f03040b-a214-48bd-94b3-47df03cc21c5: Adding session request to queue
08:48:44.243 INFO [LocalNewSessionQueue.injectIntoQueue] - 7f03040b-a214-48bd-94b3-47df03cc21c5: Adding request to queue
08:48:44.243 INFO [LocalNewSessionQueue.addToQueue] - 7f03040b-a214-48bd-94b3-47df03cc21c5: Waiting for session to be created
...
08:49:13.947 INFO [LocalNewSessionQueue.lambda$getNextAvailable$7] - available request: 7f03040b-a214-48bd-94b3-47df03cc21c5
...
08:49:14.244 INFO [LocalNewSessionQueue.addToQueue] - 7f03040b-a214-48bd-94b3-47df03cc21c5: Session request timed out
08:49:14.244 INFO [LocalNewSessionQueue.addToQueue] - 7f03040b-a214-48bd-94b3-47df03cc21c5: session creation error sent to client 
...
08:49:15.786 INFO [LocalDistributor.newSession] - Session created by the Distributor. Id: 934ed7006d17484db7ca3b67695672bf [request 7f03040b-a214-48bd-94b3-47df03cc21c5] 
 Caps: Capabilities {acceptInsecureCerts: false, browserName: chrome, browserVersion: 104.0.5112.102, chrome: {chromedriverVersion: 104.0.5112.79 (3cf3e8c8a07d..., userDataDir: C:\Users\ZSSELE~1\AppData\L...}, cs:test: test3-1, goog:chromeOptions: {debuggerAddress: localhost:57386}, networkConnectionEnabled: false, pageLoadStrategy: normal, platformName: Windows 10, proxy: {}, se:bidiEnabled: false, se:cdp: ws://selenium-server-dev.co..., se:cdpVersion: 104.0.5112.102, setWindowRect: true, strictFileInteractability: false, timeouts: {implicit: 0, pageLoad: 300000, script: 30000}, unhandledPromptBehavior: dismiss and notify, webauthn:extension:credBlob: true, webauthn:extension:largeBlob: true, webauthn:virtualAuthenticators: true}

You can look at the modified code (logs added) here: https://github.com/bhecquet/selenium/blob/grid_concurrency/java/src/org/openqa/selenium/grid/sessionqueue/local/LocalNewSessionQueue.java

github-actions[bot] commented 11 months ago

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.