Doofus100500 opened 1 year ago
@Doofus100500, thank you for creating this issue. We will troubleshoot it as soon as we can.

Triage this issue by using labels.
- If information is missing, add a helpful comment and then the I-issue-template label.
- If the issue is a question, add the I-question label.
- If the issue is valid but there is no time to troubleshoot it, consider adding the help wanted label.
- If the issue requires changes or fixes from an external project (e.g., ChromeDriver, GeckoDriver, MSEdgeDriver, W3C), add the applicable G-* label, and it will provide the correct link and auto-close the issue.
- After troubleshooting the issue, please add the R-awaiting answer label.

Thank you!
This normally happens because those Nodes are in a "new" section of the cluster, and the Distributor's DNS has no information about those IP addresses. Can you double check that?
But after restarting the distributor, it once again starts "seeing" the nodes created after its restart.
> Can you double check that?

How can I check this?
Yes, because the DNS information is updated, I believe. You know your environment better; can you check that?
While studying the issue, I discovered an interesting behavior: if at least one node is connected to the grid, the distributor keeps registering newly created nodes correctly (though this has not yet been confirmed over an extended period of time). Also, I don't understand where, at the network level, the distributor gets the list of nodes from; as far as I could observe, the EventBus doesn't write anything to it. The node sends information to the EventBus, and after that I have a gap in my understanding of what happens. Could you please explain in more detail what is happening?
https://www.selenium.dev/documentation/grid/getting_started/#node-and-hub-on-different-machines
Distributor interacts with New Session Queue, Session Map, Event Bus, and the Node(s).
Well, based on this description, it's not clear. Does the Distributor itself have to reach out to the Nodes? "Interacts" is too loose a description.
This is the part I linked:
> Hub and Nodes talk to each other via HTTP and the Event Bus (the Event Bus lives inside the Hub). A Node sends a message to the Hub via the Event Bus to start the registration process. When the Hub receives the message, it reaches out to the Node via HTTP to confirm its existence.
>
> To successfully register a Node to a Hub, it is important to expose the Event Bus ports (4442 and 4443 by default) on the Hub machine. This also applies to the Node port. With that, both Hub and Node will be able to communicate.
>
> If the Hub is using the default ports, the --hub flag can be used to register the Node.
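To make the port requirement concrete, here is a minimal docker-compose sketch of a Hub and one Node (service names and the image tag are illustrative, not taken from this thread):

```yaml
# Minimal Hub + Node pairing. The Event Bus ports 4442/4443 on the Hub
# must be reachable from the Node, or registration never starts.
services:
  selenium-hub:
    image: selenium/hub:4.18.0
    ports:
      - "4442:4442"   # Event Bus publish port
      - "4443:4443"   # Event Bus subscribe port
      - "4444:4444"   # Router / Grid UI
  chrome-node:
    image: selenium/node-chrome:4.18.0
    depends_on:
      - selenium-hub
    environment:
      - SE_EVENT_BUS_HOST=selenium-hub   # where the Node sends its registration event
      - SE_EVENT_BUS_PUBLISH_PORT=4442
      - SE_EVENT_BUS_SUBSCRIBE_PORT=4443
```

After the registration event is published, the Hub still has to reach the Node over HTTP (port 5555 by default), which is exactly the step this issue is about.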
> Yes, because the DNS information is updated, I believe. You know your environment better; can you check that?
There is definitely no DNS involved. The externalUri confirms this: it contains an IP address, not a DNS name.
Let's go back to my hypothesis: is it possible that the Distributor at some point stops "searching" for nodes because there are none, and never resumes the search?
The hypothesis has been confirmed: if there is at least one node in the grid, the distributor continues to reliably register new nodes. I have been testing since my last comment. Thank you for the responses.
I am not 100% sure about the hypothesis, because the Distributor registers a Node when it can reach it via HTTP. It might be that, for some reason, in your environment it takes longer than 2 minutes for the message to reach the Distributor.
There are also --register-period and --register-cycle (from https://www.selenium.dev/documentation/grid/configuration/cli_options/#node), which determine how long and how often a Node tries to register.
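If you want to experiment with those flags in a containerized Node, one way (a sketch, not a recommendation; the 300s/5s values are arbitrary) is the SE_OPTS pass-through that docker-selenium images support:

```yaml
# Fragment of a Node container spec: SE_OPTS forwards extra CLI options
# to the Selenium server process inside the container.
containers:
  - name: selenium-node-chrome
    image: selenium/node-chrome:4.18.0
    env:
      - name: SE_EVENT_BUS_HOST
        value: "selenium-hub"
      - name: SE_OPTS
        # retry registration for 5 minutes, attempting every 5 seconds
        value: "--register-period 300 --register-cycle 5"
```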
But then how do you explain that everything is working now?
There is nothing in the code that can confirm that hypothesis; that is why I'm not sure about it. What I am sure of is that other people have reported very similar issues, and it was due to network connectivity between the Distributor and the Nodes. I cannot troubleshoot your environment.
I can also report this happened on our deployment, which is similar to what has been reported. We use NodePools, and the issue started when a new node was added to the pool and chrome pods were deployed on it while the rest of the components were on the first node. Restarting also did the trick and solved it for now...
And it happened again today when a new node was added to the cluster's nodepool.
> We use NodePools, and the issue started when a new node was added to the pool and chrome pods were deployed on it while the rest of the components were on the first node. Restarting also did the trick and solved it for now...
I apologize for being off-topic, but what do you mean by "NodePools"? Are you referring to Selenium nodes? And how can this be done?
Hi @Doofus100500, the nodepools are GKE (Google Kubernetes) node pools. This might require diving a little bit into k8s.
We have a GKE nodepool which is configured with the k8s cluster autoscaler. So when Selenium needs to spin up more chrome instances, it triggers the k8s node autoscaler, which automatically creates a new k8s node and spins up a Selenium chrome instance on the new node. And when this happens, we hit the issue described here. Restarting the chrome instance pod "resolves the issue".
Basically this would be a valid use case for Selenium with KEDA and the k8s autoscaler; see the sketch below. Otherwise just keep as many chrome instances as possible running, and that's it 😃
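For readers unfamiliar with that setup, a scale-from-zero configuration with KEDA's selenium-grid scaler looks roughly like this (names and limits are illustrative):

```yaml
# ScaledObject that grows/shrinks the chrome Node Deployment based on
# the session queue reported by the Grid's GraphQL endpoint.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: selenium-chrome-scaler
spec:
  scaleTargetRef:
    name: selenium-node-chrome   # the Node Deployment to scale
  minReplicaCount: 0             # scale to zero when nothing is queued
  maxReplicaCount: 20
  triggers:
    - type: selenium-grid
      metadata:
        url: "http://selenium-hub:4444/graphql"   # Grid GraphQL endpoint
        browserName: "chrome"
```

Each new replica may also trigger the cluster autoscaler to add a k8s node first, which is exactly the moment the registration problem shows up in this thread.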
I read through all the comments and have something to add.
The httpGet startup probe checks that the Node endpoint /status is reachable. However, when the container is up, the Node's /status becomes reachable quickly, while the Node could still be in the process of sending its registration events to the Hub/Router; this means the startup probe can pass even though the Node wasn't registered successfully. I saw this problem and tried to enhance the startup probe by using an exec command with a script that double-checks the nodeId is present in the Hub/Router before the startup probe passes (PR #2139). Hope this makes the startup probe more reliable; a rough sketch of the idea follows.
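To illustrate the idea (this is not the actual nodeProbe.sh from PR #2139; the JSON field names, the jq dependency, and the hub address are assumptions):

```yaml
# Fragment of a Node container spec: pass the startup probe only once the
# Node's own ID shows up in the Hub's /status, not merely when the local
# /status endpoint starts answering.
startupProbe:
  exec:
    command:
      - sh
      - -c
      - |
        NODE_ID=$(curl -sf http://localhost:5555/status | jq -r '.value.node.nodeId')
        curl -sf http://selenium-hub:4444/status \
          | jq -e --arg id "$NODE_ID" '.value.nodes[] | select(.id == $id)' > /dev/null
  periodSeconds: 5
  failureThreshold: 12   # give the Node ~60s to appear in the Hub's view
```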
I saw that by default --register-period is 120s. I believe there is a case where a Node pod takes over 120s and cannot register successfully; the pod will then stay Running & Ready but cannot take any session, which impacts the number of replicas in autoscaling.
In a recent change (https://github.com/SeleniumHQ/docker-selenium/commit/85b708f8445bd6472a15f13bd5c4ed4c19032f7b), the env var SE_NODE_REGISTER_PERIOD was added to pass a value to the option --register-period. I believe this can be a workaround in case latency in the environment causes Node registration to take more time.
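As a sketch, the workaround then reduces to setting one env var on the Node container (300 is an arbitrary example value, in seconds):

```yaml
# Fragment of a Node container spec using the new pass-through variable
# instead of hand-rolling SE_OPTS.
env:
  - name: SE_NODE_REGISTER_PERIOD
    value: "300"   # keep attempting registration for up to 5 minutes
```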
Both of the above will be available in the new release image tags and chart version, on top of SE 4.18.0.
Thank you for the new, unquestionably useful features, but my issue is not related to the node; it's the distributor. After it is restarted, nodes immediately register successfully. In other words, if you start the grid and do not connect any nodes to it, then after some time the distributor stops accepting registration requests from nodes.
Yes, I think we can observe some endpoint or signal to check the health of the Distributor, and then rely on that to implement a liveness probe that restarts the container if it cannot recover by itself within a given period.
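A naive starting point could be an httpGet probe on the Distributor's /status (5553 is the default Distributor port in a fully distributed Grid). Note that /status may keep answering even while registration is stuck, which is why the check that eventually landed in PR #2272 instead watches whether queued session requests get picked up:

```yaml
# Fragment of a Distributor container spec: restart the container when
# /status stops responding. A queue-based check (as in PR #2272) catches
# more failure modes than this sketch does.
livenessProbe:
  httpGet:
    path: /status
    port: 5553
  initialDelaySeconds: 30
  periodSeconds: 30
  failureThreshold: 3
```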
Is it expected that SE_GRID_URL in the nodeProbe.sh script is empty when SE_NODE_GRID_URL points to the ingress hostname? I get an endless "Node ID: ${NODE_ID} is not found in the Grid. The registration could be in progress.". (Selenium Helm Chart v0.28)
Hi, may I know how your SE_NODE_GRID_URL value is rendered in your deployment?
It points to the hostname and path defined in ingress.hostname and ingress.path. The protocol is probably derived from the tls.enabled state (I use tls.enabled=true). Example value: https://se-grid.mycompany.com/selenium. With version 0.27 everything works fine.
Thank you for your feedback; there was indeed a bug in nodeProbe.sh. In the meantime you can work around it by disabling the startup probe on the node. I will provide a patch ASAP.
@aafeltowicz, chart 0.28.1 is out. Can you retry and confirm?
If you still have an issue, an alternative config allows switching back to the default startup probe method httpGet by setting global.seleniumGrid.defaultNodeStartupProbe: httpGet (or leaving it blank) in your own override YAML, e.g.:
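```yaml
# values override: fall back to the plain httpGet startup probe
global:
  seleniumGrid:
    defaultNodeStartupProbe: httpGet
```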
v0.28.1 works like a charm, thx :)
BTW, I forgot to mention that I also had to set global.K8S_PUBLIC_IP to the external host DNS name to make this setup work; otherwise the nodes have problems communicating with the other components.
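For anyone reproducing this setup, the combined override would look roughly like the following (hostname and keys mirror what was mentioned in this thread; treat it as a sketch, not chart documentation):

```yaml
# values override: serve the Grid behind an ingress with TLS and let the
# components advertise the external DNS name instead of cluster-internal IPs.
global:
  K8S_PUBLIC_IP: "se-grid.mycompany.com"   # external host DNS name
ingress:
  enabled: true
  hostname: "se-grid.mycompany.com"
tls:
  enabled: true
```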
I don't know if the original issue has been resolved recently. However, a proactive approach was added via PR #2272: a liveness probe in K8s checks that the Distributor is healthy and restarts it if session requests sitting in the queue are not being picked up.
What happened?
Selenium nodes stop connecting to the grid after a certain period of time (one night), but if the distributor is restarted at that moment, they start connecting again. We are scaling the Node deployments with KEDA from 0 to N. No useful information could be found in the logs. Could you please suggest where to look?
Command used to start Selenium Grid with Docker
Relevant log output
Operating System
k8s
Docker Selenium version (tag)
4.11.0-20230801