Closed amardeep2006 closed 9 months ago
Can you also check the keda-operator logs from that time to see whether any deployment was scaled down by the scaler? Looking at your logs, this seems similar to an issue still open for discussion around autoscaling: https://github.com/SeleniumHQ/docker-selenium/issues/2129
Thanks @VietND96 for the response. I looked at the keda-operator logs and they are full of unrelated errors. It is failing to get the firefox scaled object because I have set firefox:enabled to false. Maybe this is another area of improvement in the Helm chart (unrelated to this issue). I am using chrome nodes.
I looked at the issue you mentioned and it seems almost identical to mine. I tried disabling autoscaling and all tests worked fine, so the issue is definitely related to autoscaling. What do you suggest as further troubleshooting?
I have another question: looking at the chrome-node logs, what do you think? Let's assume the KEDA operator requested the downscaling. Why was the preStop script not able to hold the DRAINING node until the tests were complete? Could there be a bug in the preStop script logic?
edit: I just saw one more issue that looks similar in nature: https://github.com/SeleniumHQ/docker-selenium/issues/2155
I will share further details tomorrow, as I feel it could be Kubernetes killing the pods for any of several reasons.
chromeNode.terminationGracePeriodSeconds=30. Could it be playing some role here? I see SIGTERM in the logs after exactly 30 seconds. Shouldn't it be inherited from my values file as 3600?
@VietND96 I applied the chromeNode.terminationGracePeriodSeconds=3600 setting and the issue has disappeared. I see the following issues that may need a fix in the Helm chart:
firefoxNode:
  enabled: true
  imagePullPolicy: Always
  # /dev/shm volume
  dshmVolumeSizeLimit: "2Gi"
  # Resources for firefox-node container
  resources:
    requests:
      memory: "1Gi"
      cpu: "1"
    limits:
      memory: "2Gi"
      cpu: "2"
  extraEnvironmentVariables:
    # - name: "SE_VNC_NO_PASSWORD"
    #   value: "1"
    - name: "SE_VNC_VIEW_ONLY"
      value: "1"
autoscaling:
  scaledOptions:
    minReplicaCount: 0
    maxReplicaCount: 3
  terminationGracePeriodSeconds: 3600
edit: The downside is that a few pods may stay in the Terminating state for up to 3600 seconds and still do processing. I can live with that for now.
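For anyone hitting the same problem, the workaround described above amounts to a values override along these lines. This is only a minimal sketch: the key chromeNode.terminationGracePeriodSeconds and the value 3600 come from this thread, and the rest of the values file is omitted.

```yaml
# Values override (sketch): set the grace period on the node itself
# instead of relying on the autoscaling-level default being inherited.
chromeNode:
  terminationGracePeriodSeconds: 3600
```

The same key can be passed on the command line as chromeNode.terminationGracePeriodSeconds=3600, which is the form used earlier in this thread.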
Ok, let me check whether there is a regression, since in the README I documented that the default of 3600 will be applied to all nodes if there is no individual config override.
It was actually a defect: the logic is handled, but the template is not called in the Node spec YAML, so the value 30 in each node is picked up directly. I just fixed it and added a template test to verify it.
I am considering other fixes before releasing a new chart version soon.
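To illustrate why the 30-second value mattered: Kubernetes runs a preStop hook inside the pod's termination grace period, so the hook can only hold a draining node for as long as terminationGracePeriodSeconds allows. Below is a minimal sketch of a node pod spec. The pod name, container name, image tag, and preStop script path are hypothetical placeholders, not values taken from the chart.

```yaml
# Sketch of a node pod spec; the names, image tag, and preStop script path
# are hypothetical placeholders, not values taken from the chart.
apiVersion: v1
kind: Pod
metadata:
  name: selenium-chrome-node-example
spec:
  # Pod-level grace period. The preStop hook runs inside this window, so
  # with a value of 30 a draining node gets roughly 30 seconds before the
  # kubelet stops waiting and terminates the container.
  terminationGracePeriodSeconds: 3600
  containers:
    - name: chrome-node
      image: selenium/node-chrome:4.18.1
      lifecycle:
        preStop:
          exec:
            # Placeholder for a script that waits until the node has drained.
            command: ["/bin/sh", "-c", "/path/to/preStop.sh"]
```

With the rendered value at 3600, a hook of this kind has up to an hour to let a running session finish, which matches the behavior reported after the workaround was applied.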
What happened?
I upgraded to Selenium Grid 4.18.1 from 4.14 this week and am observing that some test scripts throw errors like:
invalid session id: Unable to find session with ID: 2e5824bb7eb7f83dc8d83774e8c7c539
Deployed on: Kubernetes
Autoscaling: enabled (Deployment)
terminationGracePeriodSeconds: 3600 (set in autoscaling)
Do we know the reasons why the chrome node goes into draining mode?
Additional info: I run around 30 tests in parallel, with 1 browser per pod. Test duration ranges from 10 to 35 minutes. The KEDA version was previously 2.12.0 and is now 2.13.1 with this upgrade.
Command used to start Selenium Grid with Docker (or Kubernetes)
Relevant log output
Operating System
kubernetes 1.23.14
Docker Selenium version (image tag)
4.18.1
Selenium Grid chart version (chart version)
0.28.1