Doofus100500 opened 1 year ago
@Doofus100500, thank you for creating this issue. We will troubleshoot it as soon as we can.

Triage this issue by using labels.
- If information is missing, add a helpful comment and then the I-issue-template label.
- If the issue is a question, add the I-question label.
- If the issue is valid but there is no time to troubleshoot it, consider adding the help wanted label.
- If the issue requires changes or fixes from an external project (e.g., ChromeDriver, GeckoDriver, MSEdgeDriver, W3C), add the applicable G-* label, and it will provide the correct link and auto-close the issue.
- After troubleshooting the issue, please add the R-awaiting answer label.

Thank you!
This normally happens because those Nodes are in a "new" section of the cluster, and the Distributor's DNS has no information about those IP addresses. Can you double check that?
But after restarting the distributor, it once again starts "seeing" the nodes created after its restart.
> Can you double check that?

How can I check this?
Yes, because the DNS information is updated, I believe. You know your environment better; can you check that?
While studying the issue, I discovered an interesting behavior: if at least one node is connected to the grid, the distributor keeps registering newly created nodes correctly (though this has not yet been confirmed over an extended period of time). Also, I don't understand where, at the network level, the distributor gets the list of nodes from; as far as I could observe, the EventBus doesn't write anything to it. The node sends information to the EventBus, and after that I have a gap in my understanding of what happens. Could you please explain in more detail what is happening?
https://www.selenium.dev/documentation/grid/getting_started/#node-and-hub-on-different-machines
Distributor interacts with New Session Queue, Session Map, Event Bus, and the Node(s).
Well, based on this description, it's not clear. Does the Distributor itself have to reach out to the Nodes? "Interacts" is too loose a description.
This is the part I linked:
> Hub and Nodes talk to each other via HTTP and the Event Bus (the Event Bus lives inside the Hub). A Node sends a message to the Hub via the Event Bus to start the registration process. When the Hub receives the message, it reaches out to the Node via HTTP to confirm its existence.
>
> To successfully register a Node to a Hub, it is important to expose the Event Bus ports (4442 and 4443 by default) on the Hub machine. This also applies to the Node port. With that, both Hub and Node will be able to communicate.
>
> If the Hub is using the default ports, the --hub flag can be used to register the Node.
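To make the port requirement concrete, here is a minimal docker-compose sketch of a Hub and one Node (service names and the image tag are illustrative, not taken from this thread):

```yaml
# Minimal Hub + Node pairing. The Event Bus ports 4442/4443 on the Hub
# must be reachable from the Node, or registration never starts.
services:
  selenium-hub:
    image: selenium/hub:4.18.0
    ports:
      - "4442:4442"   # Event Bus publish port
      - "4443:4443"   # Event Bus subscribe port
      - "4444:4444"   # Router / Grid UI
  chrome-node:
    image: selenium/node-chrome:4.18.0
    depends_on:
      - selenium-hub
    environment:
      - SE_EVENT_BUS_HOST=selenium-hub   # where the Node sends its registration event
      - SE_EVENT_BUS_PUBLISH_PORT=4442
      - SE_EVENT_BUS_SUBSCRIBE_PORT=4443
```

After the registration event is published, the Hub still has to reach the Node over HTTP (port 5555 by default), which is exactly the step this issue is about.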
> Yes, because the DNS information is updated, I believe. You know your environment better; can you check that?
There is definitely no DNS involved. The externalUri confirms this: it contains an IP address, not a DNS name.
Let's go back to my hypothesis: is it possible that the Distributor at some point stops "searching" for nodes because there are none, and never resumes the search?
The hypothesis has been confirmed: if there is at least one node in the grid, the distributor continues to reliably register new nodes. I have been testing since my last comment. Thank you for the responses.
I am not 100% sure about the hypothesis, because the Distributor registers a Node when it can reach it via HTTP. It might be that, for some reason, in your environment it takes longer than 2 minutes for the message to reach the Distributor.
There are also --register-period and --register-cycle (from https://www.selenium.dev/documentation/grid/configuration/cli_options/#node), which determine how long and how often a Node tries to register.
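If you want to experiment with those flags in a containerized Node, one way (a sketch, not a recommendation; the 300s/5s values are arbitrary) is the SE_OPTS pass-through that docker-selenium images support:

```yaml
# Fragment of a Node container spec: SE_OPTS forwards extra CLI options
# to the Selenium server process inside the container.
containers:
  - name: selenium-node-chrome
    image: selenium/node-chrome:4.18.0
    env:
      - name: SE_EVENT_BUS_HOST
        value: "selenium-hub"
      - name: SE_OPTS
        # retry registration for 5 minutes, attempting every 5 seconds
        value: "--register-period 300 --register-cycle 5"
```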
But then how do you explain that everything is working now?
There is nothing in the code that can confirm that hypothesis; that is why I'm not sure about it. What I am sure of is that other people have reported very similar issues, and it was due to network connectivity between the Distributor and the Nodes. I cannot troubleshoot your environment.
I can also report this happened on our deployment, which is similar to what has been reported. We use NodePools, and the issue started when a new node was added to the pool and chrome pods were deployed on it while the rest of the components were on the first node. Restarting also did the trick and solved it for now...
And it happened again today when a new node was added to the cluster's nodepool.
> We use NodePools, and the issue started when a new node was added to the pool and chrome pods were deployed on it while the rest of the components were on the first node. Restarting also did the trick and solved it for now...
I apologize for being off-topic, but what do you mean by "NodePools"? Are you referring to Selenium nodes? And how can this be done?
Hi @Doofus100500, the nodepools are GKE (Google Kubernetes) node pools. This might require diving a little bit into k8s.
We have a GKE nodepool which is configured with the k8s cluster autoscaler. So when Selenium needs to spin up more chrome instances, it triggers the k8s node autoscaler, which automatically creates a new k8s node and spins up a Selenium chrome instance on the new node. And when this happens, we hit the issue described here. Restarting the chrome instance pod "resolves the issue".
Basically this would be a valid use case for Selenium with KEDA and the k8s autoscaler; see the sketch below. Otherwise just keep as many chrome instances as possible running, and that's it 😃
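For readers unfamiliar with that setup, a scale-from-zero configuration with KEDA's selenium-grid scaler looks roughly like this (names and limits are illustrative):

```yaml
# ScaledObject that grows/shrinks the chrome Node Deployment based on
# the session queue reported by the Grid's GraphQL endpoint.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: selenium-chrome-scaler
spec:
  scaleTargetRef:
    name: selenium-node-chrome   # the Node Deployment to scale
  minReplicaCount: 0             # scale to zero when nothing is queued
  maxReplicaCount: 20
  triggers:
    - type: selenium-grid
      metadata:
        url: "http://selenium-hub:4444/graphql"   # Grid GraphQL endpoint
        browserName: "chrome"
```

Each new replica may also trigger the cluster autoscaler to add a k8s node first, which is exactly the moment the registration problem shows up in this thread.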
I read through all the comments and have something to add.
The httpGet startup probe checks that the Node endpoint /status is reachable. However, when the container is up, the Node's /status becomes reachable quickly, while the Node could still be in the process of sending its registration events to the Hub/Router; this means the startup probe can pass even though the Node wasn't registered successfully. I saw this problem and tried to enhance the startup probe by using an exec command with a script that double-checks the nodeId is present in the Hub/Router before the startup probe passes (PR #2139). Hope this makes the startup probe more reliable; a rough sketch of the idea follows.
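To illustrate the idea (this is not the actual nodeProbe.sh from PR #2139; the JSON field names, the jq dependency, and the hub address are assumptions):

```yaml
# Fragment of a Node container spec: pass the startup probe only once the
# Node's own ID shows up in the Hub's /status, not merely when the local
# /status endpoint starts answering.
startupProbe:
  exec:
    command:
      - sh
      - -c
      - |
        NODE_ID=$(curl -sf http://localhost:5555/status | jq -r '.value.node.nodeId')
        curl -sf http://selenium-hub:4444/status \
          | jq -e --arg id "$NODE_ID" '.value.nodes[] | select(.id == $id)' > /dev/null
  periodSeconds: 5
  failureThreshold: 12   # give the Node ~60s to appear in the Hub's view
```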
I saw that by default --register-period is 120s. I believe there is a case where a Node pod takes over 120s and cannot register successfully; the pod will then stay Running & Ready but cannot take any session, which impacts the number of replicas in autoscaling.
In a recent change (https://github.com/SeleniumHQ/docker-selenium/commit/85b708f8445bd6472a15f13bd5c4ed4c19032f7b), the env var SE_NODE_REGISTER_PERIOD was added to pass a value to the option --register-period. I believe this can be a workaround in case latency in the environment causes Node registration to take more time.
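As a sketch, the workaround then reduces to setting one env var on the Node container (300 is an arbitrary example value, in seconds):

```yaml
# Fragment of a Node container spec using the new pass-through variable
# instead of hand-rolling SE_OPTS.
env:
  - name: SE_NODE_REGISTER_PERIOD
    value: "300"   # keep attempting registration for up to 5 minutes
```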
Both of the above will be available in the new release image tags and chart version, on top of SE 4.18.0.
Thank you for the new, unquestionably useful features, but my issue is not related to the node; it's the distributor. After it is restarted, nodes immediately register successfully. In other words, if you start the grid and do not connect any nodes to it, then after some time the distributor stops accepting registration requests from nodes.
Yes, I think we can observe some endpoint or signal to check the health of the Distributor, and then rely on that to implement a liveness probe that restarts the container if it cannot recover by itself within a given period.
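A naive starting point could be an httpGet probe on the Distributor's /status (5553 is the default Distributor port in a fully distributed Grid). Note that /status may keep answering even while registration is stuck, which is why the check that eventually landed in PR #2272 instead watches whether queued session requests get picked up:

```yaml
# Fragment of a Distributor container spec: restart the container when
# /status stops responding. A queue-based check (as in PR #2272) catches
# more failure modes than this sketch does.
livenessProbe:
  httpGet:
    path: /status
    port: 5553
  initialDelaySeconds: 30
  periodSeconds: 30
  failureThreshold: 3
```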
Is it expected that SE_GRID_URL in the nodeProbe.sh script is empty when SE_NODE_GRID_URL points to the ingress hostname? I get an endless "Node ID: ${NODE_ID} is not found in the Grid. The registration could be in progress.". (Selenium Helm Chart v0.28)
Hi, may I know how your SE_NODE_GRID_URL value is rendered in your deployment?
It points to the hostname and path defined in ingress.hostname and ingress.path. The protocol is probably derived from the tls.enabled state (I use tls.enabled=true). Example value: https://se-grid.mycompany.com/selenium. With version 0.27 everything works fine.
Thank you for your feedback; there was indeed a bug in nodeProbe.sh. In the meantime you can work around it by disabling the startup probe on the node. I will provide a patch ASAP.
@aafeltowicz, chart 0.28.1 is out. Can you retry and confirm?
If you still have an issue, an alternative config allows switching back to the default startup probe method httpGet by setting global.seleniumGrid.defaultNodeStartupProbe: httpGet (or leaving it blank) in your own override YAML, e.g.:
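```yaml
# values override: fall back to the plain httpGet startup probe
global:
  seleniumGrid:
    defaultNodeStartupProbe: httpGet
```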
v0.28.1 works like a charm, thx :)
BTW, I forgot to mention that I also had to set global.K8S_PUBLIC_IP to the external host DNS name to make this setup work; otherwise the nodes have problems communicating with the other components.
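For anyone reproducing this setup, the combined override would look roughly like the following (hostname and keys mirror what was mentioned in this thread; treat it as a sketch, not chart documentation):

```yaml
# values override: serve the Grid behind an ingress with TLS and let the
# components advertise the external DNS name instead of cluster-internal IPs.
global:
  K8S_PUBLIC_IP: "se-grid.mycompany.com"   # external host DNS name
ingress:
  enabled: true
  hostname: "se-grid.mycompany.com"
tls:
  enabled: true
```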
I don't know if the original issue has been resolved recently. However, a proactive approach was added via PR #2272: a liveness probe in K8s checks that the Distributor is healthy and restarts it if session requests sitting in the queue are not being picked up.
What happened?
Selenium nodes stop connecting to the grid after a certain period of time (one night), but if the distributor is restarted at that moment, they start connecting again. We are scaling the Node deployments with KEDA from 0 to N. No useful information could be found in the logs. Could you please suggest where to look?
Command used to start Selenium Grid with Docker
Relevant log output
Operating System
k8s
Docker Selenium version (tag)
4.11.0-20230801