Open rishabhjain-qait opened 1 month ago
@rishabhjain-qait, thank you for creating this issue. We will troubleshoot it as soon as we can.
Triage this issue by using labels.
If information is missing, add a helpful comment and then the I-issue-template label.
If the issue is a question, add the I-question label.
If the issue is valid but there is no time to troubleshoot it, consider adding the help wanted label.
If the issue requires changes or fixes from an external project (e.g., ChromeDriver, GeckoDriver, MSEdgeDriver, W3C), add the applicable G-* label, and it will provide the correct link and auto-close the issue.
After troubleshooting the issue, please add the R-awaiting answer label.
Thank you!
cc: @rookieInTraining
@VietND96 do you know?
autoscaling works absolutely fine, both upscaling and downscaling,
May I know if it is ScaledObject or ScaledJob?
If it is ScaledObject, is the pod preStop hook executed to gracefully shut down the Node? If yes, how long is terminationGracePeriodSeconds set to, and is it long enough for the pod to stay in Terminating while it waits for the session to complete?
A similar error is also discussed here: https://github.com/SeleniumHQ/docker-selenium/issues/2129#issuecomment-1948127335
hey @VietND96 thanks for looking at the issue,
I am not using KEDA for the autoscaling part; I have written a small Spring Boot application that does this work for me.
I drain a node in order to scale down whenever any node of the Selenium Grid has 0 sessions running. Drain Node (https://www.selenium.dev/documentation/grid/advanced_features/endpoints/): Node drain command is for graceful node shutdown. Draining a Node stops the Node after all the ongoing sessions are complete. However, it does not accept any new session requests.
cURL --request POST 'http://localhost:4444/se/grid/distributor/node/
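For readers following along from a Java scaler, the drain call could be built roughly as in the sketch below. It assumes the path suffix `/drain` and the `X-REGISTRATION-SECRET` header described on the linked Grid endpoints page; the class name, method name, and parameters are hypothetical and not taken from the scaler discussed in this thread. The request is only built here, not sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper: builds the drain request for one Node.
class DrainNodeRequest {

    // gridUrl            e.g. "http://localhost:4444" (assumption: Distributor reachable via the Grid URL)
    // nodeId             the Node id as reported by the Grid
    // registrationSecret value for the X-REGISTRATION-SECRET header
    static HttpRequest build(String gridUrl, String nodeId, String registrationSecret) {
        URI uri = URI.create(gridUrl + "/se/grid/distributor/node/" + nodeId + "/drain");
        return HttpRequest.newBuilder(uri)
                .header("X-REGISTRATION-SECRET", registrationSecret)
                .POST(HttpRequest.BodyPublishers.noBody()) // drain takes no request body
                .build();
    }
}
```

The returned request can be passed to `java.net.http.HttpClient#send` when the scaler decides to drain a Node.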
Also, can you try upgrading the docker image to tag 4.23.0-20240727 (helm chart 0.33.0), which contains the fix https://github.com/SeleniumHQ/selenium/pull/14282 for a race condition in which a session can be assigned to a Node in status DRAINING?
I am Draining the node in order to scale down if any of the nodes of sel grid is having 0 sessions running
Do you guard against the case where, at some point in time, 0 sessions are running and node draining is triggered, but new requests suddenly arrive? Or where draining and new requests happen at the same time?
Also, I assume you rely on the GraphQL endpoint to get the number of running sessions. Suppose there is a glitch and the response returns an error. How does the script decide in that case? Does it treat the count as 0 and trigger the scale-down, or does it retry before making a decision?
I am Draining the node in order to scale down if any of the nodes of sel grid is having 0 sessions running
Do you guard against the case where, at some point in time, 0 sessions are running and node draining is triggered, but new requests suddenly arrive? Or where draining and new requests happen at the same time?
https://www.selenium.dev/documentation/grid/advanced_features/endpoints/ As mentioned here, once a node is set to drain, no new requests should reach that particular node. Ideally, once the session is finished, a new node would spawn and take any new requests present in the session queue, as per the autoscaling logic I have written.
Ideally, the node that is set to drain should not take up any new requests and should be killed as soon as the current session is completed.
Also, I assume you rely on the GraphQL endpoint to get the number of running sessions. Suppose there is a glitch and the response returns an error. How does the script decide in that case? Does it treat the count as 0 and trigger the scale-down, or does it retry before making a decision?
Also, if the GraphQL endpoint returns an error, which I haven't observed so far, the script would not treat it as 0 and scale down. Instead, it breaks out of the logic, hits the same GraphQL endpoint again after another 10 seconds to get the status, and then decides whether it needs to scale up or down.
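The behaviour described above (never treating a failed status query as "0 sessions") can be sketched as a small decision function. This is only an illustration under assumptions: the enum values, the use of `OptionalInt` to represent a failed GraphQL read, and the simple rules for scaling up or down are hypothetical and are not the actual scaler code.

```java
import java.util.OptionalInt;

// Hypothetical sketch of a scaler decision that distinguishes "0 sessions" from "status unknown".
class ScaleDecision {

    enum Action { SCALE_UP, SCALE_DOWN, NONE, RETRY_LATER }

    // An empty OptionalInt means the GraphQL query failed or returned an error.
    static Action decide(OptionalInt activeSessions, OptionalInt queuedRequests) {
        if (!activeSessions.isPresent() || !queuedRequests.isPresent()) {
            // Unknown state: do not scale, poll again on the next cycle (10 seconds in this thread).
            return Action.RETRY_LATER;
        }
        if (queuedRequests.getAsInt() > 0) {
            return Action.SCALE_UP;   // pending requests need capacity
        }
        if (activeSessions.getAsInt() == 0) {
            return Action.SCALE_DOWN; // idle and nothing queued
        }
        return Action.NONE;
    }
}
```

Keeping "unknown" as a separate outcome means a transient GraphQL error can never trigger a drain on its own.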
As mentioned here, once a node is set to drain, no new requests should reach that particular node.
I think the scaler is not able to guard against this, since the Hub decides where to assign a session. So try the new fix I mentioned to see whether it prevents a DRAINING node from picking up a new session.
Ideally, once the session is finished, a new node would spawn and take any new requests present in the session queue, as per the autoscaling logic I have written.
Again, a question about the scaler. Once the session is finished, how does the scaler scale down? Does it choose exactly which pod will be scaled down, or is the pod selected randomly?
hey @VietND96
Yes, the scaler chooses exactly which pod to scale down; it does not select one randomly.
For the pod that needs to be scaled down, I update only that pod's deletion cost with the payload below: String payload = "{ \"metadata\": { \"annotations\": { \"controller.kubernetes.io/pod-deletion-cost\": \"-1\" } } }";
and then scale down, so that the correct pod is removed and not any other.
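A sketch of how that payload might be sent to the Kubernetes API is shown below. The payload string is the one quoted above; everything else is an assumption for illustration: the class and parameter names are hypothetical, and the request targets the standard pod endpoint with a JSON merge patch and a bearer token. The request is only built, not sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper: builds a PATCH that lowers one pod's deletion cost.
class PodDeletionCostPatch {

    // Payload as quoted in this thread.
    static final String PAYLOAD =
            "{ \"metadata\": { \"annotations\": { \"controller.kubernetes.io/pod-deletion-cost\": \"-1\" } } }";

    // apiServer, namespace, podName, and bearerToken are assumptions about the caller's environment.
    static HttpRequest build(String apiServer, String namespace, String podName, String bearerToken) {
        URI uri = URI.create(apiServer + "/api/v1/namespaces/" + namespace + "/pods/" + podName);
        return HttpRequest.newBuilder(uri)
                .header("Content-Type", "application/merge-patch+json") // JSON merge patch
                .header("Authorization", "Bearer " + bearerToken)
                .method("PATCH", HttpRequest.BodyPublishers.ofString(PAYLOAD))
                .build();
    }
}
```

Note that Kubernetes documents pod deletion cost as a best-effort hint to the ReplicaSet controller, so it may be worth confirming the pod is actually idle and drained before reducing the replica count.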
@rishabhjain-qait Is this happening shortly after the session is started?
A small delay in processing the NodeRestartedEvent might cause this trouble.
@rishabhjain-qait have you resolved your issue with KEDA? If yes, can you please share the solution as well? Thanks
What happened?
I am intermittently getting org.openqa.selenium.NoSuchSessionException: Unable to find session with ID:
I have sel grid version 4.21.0-20240517 up and running, with the below properties in place for the browser pods:
TZ: "Asia/Kolkata"
SE_NODE_MAX_SESSIONS: "1"
SE_NODE_SESSION_TIMEOUT: "10800"
SE_NODE_OVERRIDE_MAX_SESSIONS: "true"
SE_SCREEN_HEIGHT: "1080"
SE_SCREEN_WIDTH: "1920"
SE_OPTS: "--log-level FINEST"
I am running one browser node per k8s pod, and I have autoscaling in place for the browser pods.
Autoscaling works absolutely fine, both upscaling and downscaling. The issue I am facing is not very frequent, but it does occur sometimes, and I am not sure why.
I am unable to reproduce this issue on my own. It is intermittent: sometimes it occurs, sometimes it does not. It is also not related to a particular test; it does not occur with the same test every time and can show up in different tests.
I have integrated Jaeger with my sel grid to look at the traces and catch these kinds of issues, but when I look at the traces for this issue, I don't see any localSessionMap.remove command being sent; nothing of the sort is visible in Jaeger.
All I see is that at some point it suddenly threw a SessionNotAvailable Exception. It was working fine and was able to click on the element, and the next thing it shows is Unable to Find Session Id. Adding screenshots of what I see in Jaeger.
Please help check what the reason for this issue could be, and whether there is a particular setting that needs to be changed to avoid these kinds of issues. Thanks in advance.
How can we reproduce the issue?
Relevant log output
Operating System
macOS
Selenium version
4.21.0-20240517
What are the browser(s) and version(s) where you see this issue?
Chrome
What are the browser driver(s) and version(s) where you see this issue?
ChromeDriver
Are you using Selenium Grid?
4.21.0-20240517