[🐛 Bug]: Nodes Disconnecting from Hub after AKS Deployment with Helm Chart #2065

michaelmowry opened 9 months ago

michaelmowry commented 9 months ago

What happened?

Our team has deployed Selenium Grid to AKS using the Helm templates in the repository. The nodes connect to the hub very briefly and are visible in the UI, then disappear and do not show up again. In the logs below we can see that the registration event between the node and hub is not successful. We are attempting to use a basic hub/node architecture with isolateComponents=false. We have disabled ingress and basic auth and are using Istio. We are able to access the Selenium Grid UI on the Hub and we are able to queue tests, but they time out as no nodes are available for processing. Thanks in advance for any help on resolving this.

Command used to start Selenium Grid with Docker (or Kubernetes)

Relevant log output

Logs from the chrome node that is not able to register with the hub:

2023-12-14 10:25:15,457 INFO Included extra file "/etc/supervisor/conf.d/selenium.conf" during parsing
2023-12-14 10:25:15,460 INFO RPC interface 'supervisor' initialized
2023-12-14 10:25:15,460 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2023-12-14 10:25:15,460 INFO supervisord started with pid 8
2023-12-14 10:25:16,462 INFO spawned: 'xvfb' with pid 10
2023-12-14 10:25:16,464 INFO spawned: 'vnc' with pid 11
2023-12-14 10:25:16,465 INFO spawned: 'novnc' with pid 12
2023-12-14 10:25:16,467 INFO spawned: 'selenium-node' with pid 13
2023-12-14 10:25:16,484 INFO success: selenium-node entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
Generating Selenium Config
Configuring server...
Setting up SE_NODE_HOST...
Setting up SE_NODE_PORT...
2023-12-14 10:25:17,538 INFO success: xvfb entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-12-14 10:25:17,538 INFO success: vnc entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2023-12-14 10:25:17,538 INFO success: novnc entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
Tracing is disabled
Selenium Grid Node configuration:
publish = "tcp://selenium-hub:4442"
subscribe = "tcp://selenium-hub:4443"

grid-url = http://selenium-hub.seleniumgridpoc:4444
session-timeout = "300"
override-max-sessions = false
detect-drivers = false
drain-after-session-count = 0
max-sessions = 1

display-name = "chrome"
stereotype = '{"browserName": "chrome", "browserVersion": "118.0", "platformName": "Linux"}'
max-sessions = 1

Starting Selenium Grid Node...
Dec 14, 2023 10:25:17 AM org.openqa.selenium.grid.Bootstrap createExtendedClassLoader
WARNING: Extension file or directory does not exist: /opt/selenium/selenium-http-jdk-client.jar
10:25:18.527 INFO [LoggingOptions.configureLogEncoding] - Using the system default encoding
10:25:18.538 INFO [OpenTelemetryTracer.createTracer] - Using OpenTelemetry for tracing
10:25:18.935 INFO [UnboundZmqEventBus.<init>] - Connecting to tcp://selenium-hub:4442 and tcp://selenium-hub:4443
10:25:19.133 INFO [UnboundZmqEventBus.<init>] - Sockets created
10:25:20.136 INFO [UnboundZmqEventBus.<init>] - Event bus ready
10:25:20.320 INFO [NodeServer.createHandlers] - Reporting self as:
10:25:20.339 INFO [NodeOptions.getSessionFactories] - Detected 1 available processors
10:25:20.437 INFO [] - Adding chrome for {"browserName": "chrome","browserVersion": "118.0","platformName": "linux","se:noVncPort": 7900,"se:vncEnabled": true} 1 times
10:25:20.452 INFO [Node.<init>] - Binding additional locator mechanisms: relative
10:25:20.756 INFO [NodeServer$1.start] - Starting registration process for Node
10:25:20.758 INFO [NodeServer.execute] - Started Selenium node 4.14.1 (revision 03f8ede370):
10:25:20.777 INFO [NodeServer$1.lambda$start$1] - Sending registration event...
10:25:30.781 INFO [NodeServer$1.lambda$start$1] - Sending registration event...
10:25:40.782 INFO [NodeServer$1.lambda$start$1] - Sending registration event...
10:25:50.785 INFO [NodeServer$1.lambda$start$1] - Sending registration event...
10:26:00.787 INFO [NodeServer$1.lambda$start$1] - Sending registration event...
10:26:10.789 INFO [NodeServer$1.lambda$start$1] - Sending registration event...
10:26:20.794 INFO [NodeServer$1.lambda$start$1] - Sending registration event...
10:26:30.796 INFO [NodeServer$1.lambda$start$1] - Sending registration event...
10:26:40.798 INFO [NodeServer$1.lambda$start$1] - Sending registration event...
10:26:50.800 INFO [NodeServer$1.lambda$start$1] - Sending registration event...
10:27:00.802 INFO [NodeServer$1.lambda$start$1] - Sending registration event...
10:27:10.804 INFO [NodeServer$1.lambda$start$1] - Sending registration event...
10:27:20.761 INFO [NodeServer$1.lambda$start$1] - Sending registration event...

Operating System


Docker Selenium version (image tag)


Selenium Grid chart version (chart version)


VietND96 commented 9 months ago

Hi @michaelmowry, can you try kubectl describe on selenium-node-config to see what SE_NODE_GRID_URL is set to there?

michaelmowry commented 9 months ago

Thanks for the reply. SE_NODE_GRID_URL = http://selenium-hub.seleniumgridpoc:4444

VietND96 commented 9 months ago

@michaelmowry, can you enable FINE logs on the Node to see what is going wrong behind "Sending registration event..."?

    - name: SE_OPTS
      value: "--log-level FINE"

If there is no dependency holding you back, can you try the latest chart, 0.26.3, and pass --set global.seleniumGrid.logLevel=FINE when installing it? That simply enables FINE logs for all components.

michaelmowry commented 9 months ago


We upgraded to 0.26.3 and still get the same issue with the chrome node not connecting. The only items to note are:

  1. We disable basic auth
  2. We use Istio for traffic control (see line 39 in values.yaml)
  3. IsolateComponents = false and we disable Edge, Firefox, Video, and scaling just to work on connectivity with chrome nodes
  4. We set a hostname on line 75 but disable ingress because we are using istio. I don't think this is an issue because we are able to access the selenium grid web console at
  5. We updated logging to FINEST

The updated values files and logs are attached. We also validated connectivity between the chrome node and the hub via curl and have attached the logs with the failed registration for the chrome node. We still get a timeout on "Sending registration event...". We can queue tests for execution, but they also time out because no chrome nodes are available. We have tried quite a few things but haven't been able to solve this, so we would appreciate any ideas.
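For anyone reading a node log like the ones in this issue, a small script can summarize how long the node kept retrying. This is only a sketch: it assumes the log format shown here (HH:MM:SS.mmm timestamps followed by "Sending registration event...") and that the node stops logging that line once registration succeeds. The function names are made up for illustration, and it ignores logs that cross midnight.

```python
import re
from datetime import datetime, timedelta

# Matches node log lines such as:
# 10:25:20.777 INFO [NodeServer$1.lambda$start$1] - Sending registration event...
ATTEMPT = re.compile(r"^(\d{2}:\d{2}:\d{2}\.\d{3})\s+INFO\s+\[.*\] - Sending registration event")


def registration_attempts(log_text):
    """Return the timestamps of every registration attempt found in the log."""
    times = []
    for line in log_text.splitlines():
        match = ATTEMPT.match(line.strip())
        if match:
            times.append(datetime.strptime(match.group(1), "%H:%M:%S.%f"))
    return times


def summarize(log_text):
    """Return (number of attempts, time between first and last attempt)."""
    times = registration_attempts(log_text)
    if not times:
        return 0, timedelta(0)
    return len(times), times[-1] - times[0]
```

If the count keeps growing for the whole retry window, the hub most likely never acknowledged the node, which matches the symptom described here.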

values.yaml.txt values-istio.yaml.txt chrome-node-argocd-logs.txt chrome-node-curl-logs

VietND96 commented 9 months ago

Honestly, I don't have much experience with Istio. Let me look around for any clues. How about other kinds of service deployment without Istio, such as NodePort or Ingress?

VietND96 commented 9 months ago

@michaelmowry, there is another ticket that mentions the same problem when the Node registers. A comment there said it can be resolved by disabling the Java OpenTelemetry feature on the Selenium process. Can you try adding the config below under chromeNode?

    - name: SE_JAVA_OPTS
      value: "-Dotel.javaagent.enabled=false -Dotel.metrics.exporter=none -Dotel.sdk.disabled=true"

michaelmowry commented 9 months ago

@vietnd96 thank you for your continued support. I tried adding the SE_JAVA_OPTS above and still no change to the connectivity issue. I will also look for a response from @eowoyn in the comment linked above.

amardeep2006 commented 9 months ago

@michaelmowry What role does Istio play in your Kubernetes cluster? Can it block traffic between pods within a Kubernetes namespace? I faced a different issue of a similar nature due to a Calico network policy. Calico is zero trust by default in my setup, so I had to apply the appropriate network policy so that the Node and Hub could talk to each other.

michaelmowry commented 9 months ago

Istio is a traffic manager within our cluster. It can block traffic within the namespace; however, we have it configured to allow all traffic within the namespace.

Calico is disabled in our namespace.

The chrome node and hub run on separate pods and have different IPs. From the chrome node log snippet below, it appears that selenium-hub is accessible on 4442 and 4443 as the sockets are created. Can anyone tell us more about how the registration event works? What port does it occur on and what endpoint does it use to register with the hub? It is strange that the 4442/4443 connection works but the registration does not, right?

10:15:27.108 INFO [UnboundZmqEventBus.<init>] - Connecting to tcp://selenium-hub:4442 and tcp://selenium-hub:4443
10:15:27.293 INFO [UnboundZmqEventBus.<init>] - Sockets created
10:15:28.303 INFO [UnboundZmqEventBus.<init>] - Event bus ready
10:15:28.514 INFO [NodeServer.createHandlers] - Reporting self as:
10:15:28.585 INFO [NodeOptions.getSessionFactories] - Detected 1 available processors
10:15:28.710 INFO [] - Adding chrome for {"browserName": "chrome","browserVersion": "120.0","goog:chromeOptions": {"binary": "\u002fusr\u002fbin\u002fgoogle-chrome"},"platformName": "linux","se:noVncPort": 7900,"se:vncEnabled": true} 1 times
10:15:28.796 INFO [Node.<init>] - Binding additional locator mechanisms: relative
10:15:29.214 INFO [NodeServer$1.start] - Starting registration process for Node
10:15:29.216 INFO [NodeServer.execute] - Started Selenium node 4.16.1 (revision 9b4c83354e):
10:15:29.280 INFO [NodeServer$1.lambda$start$1] - Sending registration event...
10:15:39.283 INFO [NodeServer$1.lambda$start$1] - Sending registration event...
10:15:49.289 INFO [NodeServer$1.lambda$start$1] - Sending registration event...

amardeep2006 commented 9 months ago

Not specific to Kubernetes, but this link may be helpful on the ports used in registration. A few things I would try in your situation, assuming you are using hub mode:

  1. Try enabling DEBUG mode via the Helm chart to see if that prints more details around registration.
  2. Exec into the hub/node containers and check whether the pods can connect on the desired ports via the Kubernetes services.
  3. Check the Istio logs/UI. Does Istio offer an interactive UI where you can see the traffic?
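The port check in step 2 can be sketched in a few lines of Python, run from inside the node container if a Python interpreter is available there. The host name selenium-hub and ports 4442, 4443, and 4444 come from the node config earlier in this thread; the function names are hypothetical.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers DNS failures, refused connections, and timeouts.
        return False


def check_hub_ports(host, ports=(4442, 4443, 4444)):
    """Map each port to whether a TCP connection could be opened."""
    return {port: can_connect(host, port) for port in ports}


# Example (assumes the hub Service is named "selenium-hub"):
# for port, ok in check_hub_ports("selenium-hub").items():
#     print(port, "reachable" if ok else "unreachable")
```

Note that a successful connect only shows that something accepted the connection. With a sidecar proxy in the path, it may not prove that traffic reaches the hub end to end, so treat a positive result as necessary rather than sufficient.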

Thomas-Personal commented 8 months ago

Hi, I am continuing Michael's effort from our team. The issue is still not resolved. I tried disabling the OpenTelemetry feature as mentioned in the comment, but it didn't work.

I am also attaching the responses from the hub and the nodes when running curl from one to the other. Please let me know if this rings a bell as to a possible cause.

Hub to Node:

[image: hub to node]

Node to Hub:

[image: node to hub]

VietND96 commented 8 months ago

The Node also needs to reach the EventBus (ports 4442 and 4443) inside the Hub; that communication is done via TCP. Can you check if that is enabled?

Thomas-Personal commented 6 months ago

Hi everyone, I am able to register the nodes by passing the environment variables of Pod names.

I have another question about https:// calls inside nodes.

When I trigger a test using my Selenium Grid on AKS, the web pages under test are by default routed to http:// instead of https://.

Can you please help me understand the root cause of this issue?

VietND96 commented 6 months ago

Hi @Thomas-Personal, could you share the details of passing the environment variables with Pod names? Which env vars are they, and which component do they belong to? With Istio (service mesh), does it not work when using Service names?

VietND96 commented 6 months ago

I just tried to understand Istio and service mesh. It looks like there is one proxy sidecar per pod, so I guess that's the reason Pod names are needed for communication between components. Currently, by default, the chart uses Service names only. So I am thinking about how to extend the support so that we can simplify this kind of deployment.

Thomas-Personal commented 6 months ago

Hi @VietND96, we updated the node env to use the service names. By default, it was using the Pod IP to register the nodes. When we passed the service names, the nodes registered.
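One way to test whether the address a node advertises is usable is to request the node's /status endpoint from the hub pod, once with the Pod IP and once with the service name. This is a sketch under assumptions: it presumes the hub has to reach the node over HTTP at the advertised address before listing it, that the node listens on port 5555 (the docker-selenium default), and the function name and example addresses are hypothetical.

```python
import json
import urllib.error
import urllib.request


def node_status_reachable(node_uri, timeout=5.0):
    """GET <node_uri>/status and report whether a JSON body came back with HTTP 200."""
    url = node_uri.rstrip("/") + "/status"
    # Ignore HTTP(S)_PROXY environment variables so the request goes straight to the node.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    try:
        with opener.open(url, timeout=timeout) as resp:
            json.loads(resp.read().decode("utf-8"))
            return resp.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Connection errors, HTTP error statuses, timeouts, or a non-JSON body.
        return False


# Example, run from the hub pod (both addresses are placeholders):
# node_status_reachable("http://10.0.0.12:5555")              # Pod IP
# node_status_reachable("http://selenium-chrome-node:5555")   # Service name
```

If the Pod IP form fails while the service name form succeeds, that would be consistent with the behaviour described in this comment.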

Thomas-Personal commented 6 months ago

@VietND96, can you please let me know from which release the service names are used by default? Passing the service names in the extra env variables causes some issues during autoscaled jobs. I am using 0.26.3, but it seems to use the Pod IP for registration.

VietND96 commented 6 months ago

Hi @Thomas-Personal, you can check the chart version 0.28.0 onwards

Thomas-Personal commented 6 months ago

Thank you @VietND96. I have issues with autoscaling. When the queue size is 2, two scaled jobs are triggered for the chrome node, but only one node succeeded: one test case was picked up and run, and the other test case failed. I could see only one node in the UI, even though the other node's log also says the node registration was successful.

I am not sure what the error was. Is it because both scaled jobs use the same port? Do we need to change any configuration so that both queued test cases are picked up successfully?

Thomas-Personal commented 6 months ago

Hi @VietND96, in an Istio mesh, the Pod IP based node registration seems to be causing the problem, so I added the below to helpers.tpl.

Node registration is successful after including this part, but I couldn't get more than one node registered. Could you please help me with this issue?

VietND96 commented 6 months ago

@Thomas-Personal, I have not tried it this way yet. Let me try it to look for clues and get back to you.

Thomas-Personal commented 6 months ago

Thank you so much. Please let me know the results once you have tried it. I am trying to implement this with an Istio mesh for the organization I work for.

Thomas-Personal commented 5 months ago

Hi @VietND96, I set clusterIP: None on the node service, which makes it a headless service without a cluster IP, and the nodes started registering without issues.

I have tried the KEDA autoscaler and am facing two issues:

  1. After completion, the sidecar proxy (istio-proxy) is not terminated, so the pod continues to exist.
  2. If test cases time out before the pod spins up, the jobs do not terminate the container.

Please help me with the above two issues.

kakliniew commented 1 month ago

@VietND96 any updates on this?