SeleniumHQ / selenium

A browser automation framework and ecosystem.
https://selenium.dev
Apache License 2.0

[🐛 Bug]: org.openqa.selenium.NoSuchSessionException: Unable to find session with ID #14322

Open rishabhjain-qait opened 1 month ago

rishabhjain-qait commented 1 month ago

What happened?

I am intermittently getting org.openqa.selenium.NoSuchSessionException: Unable to find session with ID.

I have Selenium Grid version 4.21.0-20240517 up and running, with the following properties set for the browser pods:

TZ: "Asia/Kolkata"
SE_NODE_MAX_SESSIONS: "1"
SE_NODE_SESSION_TIMEOUT: "10800"
SE_NODE_OVERRIDE_MAX_SESSIONS: "true"
SE_SCREEN_HEIGHT: "1080"
SE_SCREEN_WIDTH: "1920"
SE_OPTS: "--log-level FINEST"

I am running one browser node per Kubernetes pod, and I have autoscaling in place for the browser pods.

Autoscaling works absolutely fine, both upscaling and downscaling. The issue I am facing is not frequent, but it does occur sometimes, and I am not sure why.

I am unable to reproduce it on demand. It is intermittent and not tied to a particular test: whenever it is observed, it can show up in a different test.

I have integrated Jaeger with my Selenium Grid to look at traces and catch this kind of issue. In the traces for this issue, however, I don't see any localSessionMap.remove command being sent.

All I see is that at some point it suddenly threw a SessionNotAvailable exception. The session was working fine and was able to click on an element, and the next thing the trace shows is "Unable to find session with ID". I am adding screenshots of what I see in Jaeger.

[Screenshots: two Jaeger trace views, captured 2024-07-30 at 12:13:18 PM and 12:12:42 PM]

Could you please help check what might be causing this? Is there a particular setting that needs to be changed to avoid this kind of issue? Thanks in advance.

How can we reproduce the issue?

I am adding the exception I see in my test output, along with the stack trace of the exception I see in Jaeger.

Relevant log output

Test Exception

Unable to find session with ID: 303f6c17713ba2fe4988d4ecd00194f5
Build info: version: '4.21.0', revision: '79ed462ef4'
System info: os.name: 'Linux', os.arch: 'amd64', os.version: '6.1.58+', java.version: '17.0.11'
Driver info: driver.version: unknown
Build info: version: '4.21.0', revision: '79ed462ef4'
System info: os.name: 'Linux', os.arch: 'amd64', os.version: '5.14.0-362.24.2.el9_3.x86_64', java.version: '11.0.12'
Driver info: org.openqa.selenium.remote.RemoteWebDriver
Command: [303f6c17713ba2fe4988d4ecd00194f5, get {url=https://space-prod0-automation.sprinklr.com/logout}]
Capabilities {acceptInsecureCerts: true, browserName: chrome, browserVersion: 125.0.6422.60, chrome: {chromedriverVersion: 125.0.6422.60 (3ac3319bff9f..., userDataDir: /tmp/.org.chromium.Chromium...}, fedcm:accounts: true, goog:chromeOptions: {debuggerAddress: localhost:34867}, goog:loggingPrefs: {browser: ALL}, networkConnectionEnabled: false, pageLoadStrategy: none, platformName: linux, proxy: Proxy(), se:bidiEnabled: false, se:cdp: wss://qa6-selenium-grid-soc..., se:cdpVersion: 125.0.6422.60, se:name: Governance_UI_Macro_Tests/164, se:vnc: wss://qa6-selenium-grid-soc..., se:vncEnabled: true, se:vncLocalAddress: ws://10.102.33.70:7900, setWindowRect: true, strictFileInteractability: false, timeouts: {implicit: 0, pageLoad: 300000, script: 30000}, unhandledPromptBehavior: accept, webauthn:extension:credBlob: true, webauthn:extension:largeBlob: true, webauthn:extension:minPinLength: true, webauthn:extension:prf: true, webauthn:virtualAuthenticators: true}
Session ID: 303f6c17713ba2fe4988d4ecd00194f5

java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
org.openqa.selenium.remote.ErrorCodec.decode(ErrorCodec.java:167)
org.openqa.selenium.remote.codec.w3c.W3CHttpResponseCodec.decode(W3CHttpResponseCodec.java:138)
org.openqa.selenium.remote.codec.w3c.W3CHttpResponseCodec.decode(W3CHttpResponseCodec.java:50)
org.openqa.selenium.remote.HttpCommandExecutor.execute(HttpCommandExecutor.java:190)
org.openqa.selenium.remote.TracedCommandExecutor.execute(TracedCommandExecutor.java:51)
org.openqa.selenium.remote.RemoteWebDriver.execute(RemoteWebDriver.java:518)
org.openqa.selenium.remote.RemoteWebDriver.get(RemoteWebDriver.java:300)

Jaeger Exception

event: exception
exception.message: Unable to execute request for an existing session: Unable to find session with ID: 303f6c17713ba2fe4988d4ecd00194f5
Build info: version: '4.21.0', revision: '79ed462ef4'
System info: os.name: 'Linux', os.arch: 'amd64', os.version: '6.1.58+', java.version: '17.0.11'
Driver info: driver.version: unknown
exception.stacktrace:
org.openqa.selenium.NoSuchSessionException: Unable to find session with ID: 303f6c17713ba2fe4988d4ecd00194f5
Build info: version: '4.21.0', revision: '79ed462ef4'
System info: os.name: 'Linux', os.arch: 'amd64', os.version: '6.1.58+', java.version: '17.0.11'
Driver info: driver.version: unknown
    at org.openqa.selenium.grid.sessionmap.local.LocalSessionMap.get(LocalSessionMap.java:132)
    at org.openqa.selenium.grid.sessionmap.SessionMap.getUri(SessionMap.java:84)
    at org.openqa.selenium.grid.router.HandleSession.lambda$loadSessionId$4(HandleSession.java:223)
    at io.opentelemetry.context.Context.lambda$wrap$2(Context.java:224)
    at org.openqa.selenium.grid.router.HandleSession.execute(HandleSession.java:180)
    at org.openqa.selenium.remote.http.Route$PredicatedRoute.handle(Route.java:397)
    at org.openqa.selenium.remote.http.Route.execute(Route.java:69)
    at org.openqa.selenium.remote.http.Route$CombinedRoute.handle(Route.java:360)
    at org.openqa.selenium.remote.http.Route.execute(Route.java:69)
    at org.openqa.selenium.grid.router.Router.execute(Router.java:87)
    at org.openqa.selenium.grid.web.EnsureSpecCompliantResponseHeaders.lambda$apply$0(EnsureSpecCompliantResponseHeaders.java:34)
    at org.openqa.selenium.remote.http.Filter$1.execute(Filter.java:63)
    at org.openqa.selenium.remote.http.Route$CombinedRoute.handle(Route.java:360)
    at org.openqa.selenium.remote.http.Route.execute(Route.java:69)
    at org.openqa.selenium.remote.AddWebDriverSpecHeaders.lambda$apply$0(AddWebDriverSpecHeaders.java:35)
    at org.openqa.selenium.remote.ErrorFilter.lambda$apply$0(ErrorFilter.java:44)
    at org.openqa.selenium.remote.http.Filter$1.execute(Filter.java:63)
    at org.openqa.selenium.remote.ErrorFilter.lambda$apply$0(ErrorFilter.java:44)
    at org.openqa.selenium.remote.http.Filter$1.execute(Filter.java:63)
    at org.openqa.selenium.netty.server.SeleniumHandler.lambda$channelRead0$0(SeleniumHandler.java:44)
    at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
    at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)

Operating System

macOS

Selenium version

4.21.0-20240517

What are the browser(s) and version(s) where you see this issue?

Chrome

What are the browser driver(s) and version(s) where you see this issue?

ChromeDriver

Are you using Selenium Grid?

4.21.0-20240517

github-actions[bot] commented 1 month ago

@rishabhjain-qait, thank you for creating this issue. We will troubleshoot it as soon as we can.


rishabhjain-qait commented 1 month ago

cc: @rookieInTraining

diemol commented 1 month ago

@VietND96 do you know?

VietND96 commented 1 month ago

> Autoscaling works absolutely fine, both upscaling and downscaling

May I know if it is a ScaledObject or a ScaledJob? If it is a ScaledObject, is the pod preStop hook executed to gracefully shut down the Node? If yes, how long is terminationGracePeriodSeconds set to, and is it long enough for the pod to stay in Terminating until the session completes?

VietND96 commented 1 month ago

A similar error is also discussed here: https://github.com/SeleniumHQ/docker-selenium/issues/2129#issuecomment-1948127335

rishabhjain-qait commented 1 month ago

Hey @VietND96, thanks for looking at the issue.

I am not using KEDA for autoscaling. I have written a small Spring Boot application that does this work for me.

I am draining the node in order to scale down if any of the nodes of the Selenium Grid has 0 sessions running. From the Drain Node section of https://www.selenium.dev/documentation/grid/advanced_features/endpoints/: "Node drain command is for graceful node shutdown. Draining a Node stops the Node after all the ongoing sessions are complete. However, it does not accept any new session requests."

curl --request POST 'http://localhost:4444/se/grid/distributor/node/<node-id>/drain' --header 'X-REGISTRATION-SECRET;'
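
For a scaler written in Java, the drain call above could be issued roughly as follows. This is a minimal sketch, not the poster's actual code: the class name `DrainNode`, its method names, and the `gridUrl`, `nodeId`, and `registrationSecret` parameters are hypothetical, and only the endpoint path and the `X-REGISTRATION-SECRET` header come from the command quoted in this comment. It assumes Java 11 or later for `java.net.http`.

```java
import java.io.IOException;
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

final class DrainNode {
    // Builds the POST request for the Grid drain endpoint.
    // gridUrl, nodeId and registrationSecret are illustrative parameters.
    static HttpRequest drainRequest(String gridUrl, String nodeId, String registrationSecret) {
        return HttpRequest.newBuilder()
                .uri(URI.create(gridUrl + "/se/grid/distributor/node/" + nodeId + "/drain"))
                .header("X-REGISTRATION-SECRET", registrationSecret)
                .POST(HttpRequest.BodyPublishers.noBody())
                .build();
    }

    // Sends the drain request and returns the HTTP status code.
    static int drain(String gridUrl, String nodeId, String registrationSecret)
            throws IOException, InterruptedException {
        HttpResponse<String> response = HttpClient.newHttpClient()
                .send(drainRequest(gridUrl, nodeId, registrationSecret),
                        HttpResponse.BodyHandlers.ofString());
        return response.statusCode();
    }
}
```

A caller would invoke `DrainNode.drain("http://localhost:4444", nodeId, secret)` once it has decided that a specific node is idle. Keeping request construction separate from sending makes the URL and header easy to check without a running Grid.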

VietND96 commented 1 month ago

Also, can you try upgrading the Docker image to tag 4.23.0-20240727 (Helm chart 0.33.0)? It contains the fix https://github.com/SeleniumHQ/selenium/pull/14282 for a race condition in which a session can be assigned to a Node in DRAINING status.

VietND96 commented 1 month ago

> I am draining the node in order to scale down if any of the nodes of the Selenium Grid has 0 sessions running

Do you guard against the case where, at some point in time, there are 0 sessions running and a node drain is triggered, but new requests suddenly come in? Or where node draining and new requests arrive at the same time?

VietND96 commented 1 month ago

Also, I assume you rely on the GraphQL endpoint to get the number of running sessions. Suppose there is a glitch and the response returns an error or something similar. In that case, how does the script make its decision? Does it assume 0 and trigger the scale-down, or does it retry before making a decision?

rishabhjain-qait commented 1 month ago

> I am draining the node in order to scale down if any of the nodes of the Selenium Grid has 0 sessions running

> Do you guard against the case where, at some point in time, there are 0 sessions running and a node drain is triggered, but new requests suddenly come in? Or where node draining and new requests arrive at the same time?

https://www.selenium.dev/documentation/grid/advanced_features/endpoints/ As mentioned here, once the node is set to drained, no new requests should come to that particular node. Ideally, once the session is finished, a new node would spawn and take new requests, if any are present in the session queue, as per the autoscaling logic written.

Ideally, the node that is set to drained should not take up any new requests and should be killed as soon as the current session is completed.

> Also, I assume you rely on the GraphQL endpoint to get the number of running sessions. Suppose there is a glitch and the response returns an error or something similar. In that case, how does the script make its decision? Does it assume 0 and trigger the scale-down, or does it retry before making a decision?

Also, if the GraphQL endpoint returns an error, which I haven't observed so far, the script would not treat it as 0 and scale down. Instead, it breaks out of the logic and hits the same GraphQL endpoint again after 10 seconds to get the status, then decides accordingly whether it needs to scale up or down.
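
The decision rule described here, where a failed GraphQL poll is never treated as "0 sessions", can be sketched as below. This is an illustration under stated assumptions rather than the poster's code: the class `ScaleDownGuard` and its members are hypothetical, and requiring several consecutive idle polls before draining is an extra safeguard added for the sketch, not something the thread says the scaler does.

```java
import java.util.OptionalInt;

final class ScaleDownGuard {
    private final int requiredIdlePolls;
    private int consecutiveIdlePolls = 0;

    // requiredIdlePolls: how many idle readings in a row are needed before draining.
    ScaleDownGuard(int requiredIdlePolls) {
        if (requiredIdlePolls < 1) {
            throw new IllegalArgumentException("requiredIdlePolls must be >= 1");
        }
        this.requiredIdlePolls = requiredIdlePolls;
    }

    // Feed one poll result for a node. An empty value means the query failed
    // or could not be parsed; it resets the counter instead of counting as 0.
    boolean shouldDrain(OptionalInt activeSessions) {
        if (!activeSessions.isPresent() || activeSessions.getAsInt() > 0) {
            consecutiveIdlePolls = 0;
            return false;
        }
        consecutiveIdlePolls++;
        return consecutiveIdlePolls >= requiredIdlePolls;
    }
}
```

With a 10-second polling interval, `new ScaleDownGuard(2)` would wait for two clean idle readings in a row, so a single glitch or a briefly idle node does not trigger a drain. This does not close the Hub-side race in which a session is assigned while a drain is in flight.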

VietND96 commented 1 month ago

> As mentioned here, once the node is set to drained, no new requests should come to that particular node.

I think the scaler is not able to guard against this, since the Hub decides where to assign a session. So please try the new fix I mentioned to see whether it prevents a DRAINING node from picking up a new session.

> Ideally, once the session is finished, a new node would spawn and take new requests, if any are present in the session queue, as per the autoscaling logic written.

Again, a question about the scaler. Once the session is finished, how does the scaler scale down? Does it determine exactly which pod will be scaled down, or is one selected at random?

rishabhjain-qait commented 1 month ago

Hey @VietND96,

Yes, the scaler determines exactly which pod is to be scaled down; it does not select randomly.

For the pod that needs to be scaled down, I update only that pod's deletion cost with the payload below: String payload = "{ \"metadata\": { \"annotations\": { \"controller.kubernetes.io/pod-deletion-cost\": \"-1\" } } }";

I then scale down, which ensures that the correct pod is scaled down and not any other.
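
One way such a payload might be sent to the Kubernetes API is sketched below. The thread does not show how the poster actually applies it, so everything except the payload string is an assumption: the class `PodDeletionCost`, the `apiServer`, `namespace`, `podName`, and `bearerToken` parameters, and the choice of a JSON merge patch against the standard pod resource path. It assumes Java 11 or later.

```java
import java.net.URI;
import java.net.http.HttpRequest;

final class PodDeletionCost {
    // Payload quoted in the comment above: marks the pod as the cheapest to delete.
    static final String PAYLOAD =
            "{ \"metadata\": { \"annotations\": { "
            + "\"controller.kubernetes.io/pod-deletion-cost\": \"-1\" } } }";

    // Builds a PATCH request that merges the annotation into one specific pod.
    // apiServer, namespace, podName and bearerToken are illustrative parameters.
    static HttpRequest patchRequest(String apiServer, String namespace,
                                    String podName, String bearerToken) {
        return HttpRequest.newBuilder()
                .uri(URI.create(apiServer + "/api/v1/namespaces/" + namespace + "/pods/" + podName))
                .header("Authorization", "Bearer " + bearerToken)
                .header("Content-Type", "application/merge-patch+json")
                .method("PATCH", HttpRequest.BodyPublishers.ofString(PAYLOAD))
                .build();
    }
}
```

Kubernetes documents pod deletion cost as honored on a best-effort basis, so the annotation expresses a preference for which pod is removed on scale-down rather than a guarantee.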

joerg1985 commented 1 month ago

@rishabhjain-qait Is this happening shortly after the session is started? A small delay in processing the NodeRestartedEvent might cause this trouble.

edsherwin commented 2 weeks ago

@rishabhjain-qait have you resolved your issue with KEDA? If yes, can you please share how? Thanks.