SeleniumHQ / docker-selenium

Provides a simple way to run Selenium Grid with Chrome, Firefox, and Edge using Docker, making it easier to perform browser automation
http://www.selenium.dev/docker-selenium/

[🐛 Bug]: Chrome nodes stuck on Termination state #2168

Closed Aymen-Ben-S closed 3 months ago

Aymen-Ben-S commented 8 months ago

What happened?

We have a consistent behavior where Chrome nodes get stuck in the Terminating state.

I'm not sure I can provide exact steps to reproduce, but I'm happy to share logs from a system where this is happening.

Command used to start Selenium Grid with Docker (or Kubernetes)

global:
  seleniumGrid:
    imageRegistry: {{ fvt_image_registry }}/selenium
    imagePullSecret: xxx
hub:
  imageTag: 4.18.1-20240224
chromeNode:
  imageTag: 122.0-20240224
  resources:
    requests:
      cpu: "0.1"
firefoxNode:
  enabled: false
edgeNode:
  enabled: false
autoscaling:
  enabled: true
  scalingType: job
  scaledOptions:
    maxReplicaCount: 999
  scaledJobOptions:
    scalingStrategy:
      strategy: default
ingress:
  hostname: selenium-grid.local
  path: /selenium
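
For reference, a values file like the above is applied with the chart's Helm repo. A minimal sketch, assuming the release name "selenium-grid" and namespace "selenium" seen in the pod output below, and a local values.yaml holding these overrides:

# Add the docker-selenium chart repo and install/upgrade the release
helm repo add docker-selenium https://www.selenium.dev/docker-selenium
helm repo update
helm upgrade --install selenium-grid docker-selenium/selenium-grid \
  --namespace selenium --create-namespace \
  -f values.yaml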

Relevant log output

NAME                                              READY   STATUS        RESTARTS      AGE     IP               NODE           NOMINATED NODE   READINESS GATES
keda-operator-d44bc8ffc-f7rzk                     1/1     Running       0             161m    172.30.107.176   10.74.145.3    <none>           <none>
keda-operator-metrics-apiserver-b994566dc-8b59f   1/1     Running       1 (14h ago)   14h     172.30.180.80    10.74.145.8    <none>           <none>
selenium-grid-selenium-chrome-node-25f9x-5b6dz    1/1     Terminating   0             3h24m   172.30.93.77     10.48.76.225   <none>           <none>
selenium-grid-selenium-chrome-node-5cx4x-j9clf    1/1     Terminating   0             174m    172.30.180.158   10.74.145.8    <none>           <none>
selenium-grid-selenium-chrome-node-5hpg8-m2xdz    1/1     Running       0             160m    172.30.202.92    10.48.76.223   <none>           <none>
selenium-grid-selenium-chrome-node-7mt5f-gwgm6    1/1     Running       0             160m    172.30.139.72    10.74.145.20   <none>           <none>
selenium-grid-selenium-chrome-node-cmggr-vvl2l    1/1     Running       0             160m    172.30.180.178   10.74.145.8    <none>           <none>
selenium-grid-selenium-chrome-node-fmq9j-qlvmj    1/1     Terminating   0             174m    172.30.107.131   10.74.145.3    <none>           <none>
selenium-grid-selenium-chrome-node-fxgnj-hb8qf    1/1     Terminating   0             174m    172.30.93.97     10.48.76.225   <none>           <none>
selenium-grid-selenium-chrome-node-gsrsp-h9tzz    1/1     Terminating   0             3h16m   172.30.93.83     10.48.76.225   <none>           <none>
selenium-grid-selenium-chrome-node-xd72s-vws74    1/1     Terminating   0             3h24m   172.30.202.89    10.48.76.223   <none>           <none>
selenium-grid-selenium-chrome-node-xm8h6-69wl8    1/1     Terminating   0             3h24m   172.30.139.32    10.74.145.20   <none>           <none>
selenium-grid-selenium-chrome-node-xqt4h-hkt67    1/1     Terminating   0             3h16m   172.30.139.102   10.74.145.20   <none>           <none>
selenium-grid-selenium-hub-5f49c8fc47-vzmt6       1/1     Running       0             14h     172.30.202.81    10.48.76.223   <none>           <none>

Chrome node log
2024-03-11 18:58:48,902 INFO Included extra file "/etc/supervisor/conf.d/selenium.conf" during parsing
2024-03-11 18:58:48,905 INFO RPC interface 'supervisor' initialized
2024-03-11 18:58:48,905 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2024-03-11 18:58:48,906 INFO supervisord started with pid 8
2024-03-11 18:58:49,909 INFO spawned: 'xvfb' with pid 9
2024-03-11 18:58:49,912 INFO spawned: 'vnc' with pid 10
2024-03-11 18:58:49,915 INFO spawned: 'novnc' with pid 11
2024-03-11 18:58:49,917 INFO spawned: 'selenium-node' with pid 12
2024-03-11 18:58:49,938 INFO success: selenium-node entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
E: [pulseaudio] main.c: Daemon startup failed.
Home directory not accessible: Permission denied
No PulseAudio daemon running, or not running as session daemon.
Home directory not accessible: Permission denied
No PulseAudio daemon running, or not running as session daemon.
Home directory not accessible: Permission denied
No PulseAudio daemon running, or not running as session daemon.
Appending Selenium options: --session-timeout 300
Appending Selenium options: --register-period 60
Appending Selenium options: --register-cycle 5
Appending Selenium options: --heartbeat-period 30
Appending Selenium options: --log-level INFO
Generating Selenium Config
Setting up SE_NODE_HOST...
Tracing is disabled
Selenium Grid Node configuration: 
[events]
publish = "tcp://selenium-grid-selenium-hub.selenium:4442"
subscribe = "tcp://selenium-grid-selenium-hub.selenium:4443"

[server]
port = "5555"
[node]
grid-url = "http://admin:admin@selenium-grid-selenium-hub.selenium:4444"
session-timeout = "300"
override-max-sessions = false
detect-drivers = false
drain-after-session-count = 1
max-sessions = 1

[[node.driver-configuration]]
display-name = "chrome"
stereotype = '{"browserName": "chrome", "browserVersion": "122.0", "platformName": "Linux", "goog:chromeOptions": {"binary": "/usr/bin/google-chrome"}}'
max-sessions = 1

Starting Selenium Grid Node...
2024-03-11 18:58:51,777 INFO success: xvfb entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-03-11 18:58:51,777 INFO success: vnc entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-03-11 18:58:51,777 INFO success: novnc entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
18:58:51.977 INFO [LoggingOptions.configureLogEncoding] - Using the system default encoding
18:58:51.985 INFO [OpenTelemetryTracer.createTracer] - Using OpenTelemetry for tracing
18:58:52.554 INFO [UnboundZmqEventBus.<init>] - Connecting to tcp://selenium-grid-selenium-hub.selenium:4442 and tcp://selenium-grid-selenium-hub.selenium:4443
18:58:52.767 INFO [UnboundZmqEventBus.<init>] - Sockets created
18:58:53.781 INFO [UnboundZmqEventBus.<init>] - Event bus ready
18:58:54.166 INFO [NodeServer.createHandlers] - Reporting self as: http://172.30.93.77:5555
18:58:54.252 INFO [NodeOptions.getSessionFactories] - Detected 1 available processors
18:58:54.763 INFO [NodeOptions.report] - Adding chrome for {"browserName": "chrome","browserVersion": "122.0","goog:chromeOptions": {"binary": "\u002fusr\u002fbin\u002fgoogle-chrome"},"platformName": "linux","se:noVncPort": 7900,"se:vncEnabled": true} 1 times
2024-03-11T18:58:54UTC [Probe.Startup] - Wait for the Node to report its status
18:58:54.884 INFO [Node.<init>] - Binding additional locator mechanisms: relative
18:58:55.373 INFO [NodeServer$1.start] - Starting registration process for Node http://172.30.93.77:5555
18:58:55.375 INFO [NodeServer.execute] - Started Selenium node 4.18.1 (revision b1d3319b48): http://172.30.93.77:5555
18:58:55.464 INFO [NodeServer$1.lambda$start$1] - Sending registration event...
18:58:55.984 INFO [NodeServer.lambda$createHandlers$2] - Node has been added
18:58:57.964 INFO [LocalNode.checkSessionCount] - Draining Node, configured sessions value (1) has been reached.
18:58:57.972 INFO [LocalNode.newSession] - Session created by the Node. Id: b03b291c5d5108416cf0ac1327aeeda8, Caps: Capabilities {acceptInsecureCerts: true, browserName: chrome-headless-shell, browserVersion: 122.0.6261.69, chrome: {chromedriverVersion: 122.0.6261.69 (81bc525b6a36..., userDataDir: /tmp/.org.chromium.Chromium...}, fedcm:accounts: true, goog:chromeOptions: {debuggerAddress: localhost:43491}, networkConnectionEnabled: false, pageLoadStrategy: normal, platformName: linux, proxy: Proxy(), se:bidiEnabled: false, se:cdp: ws://admin:admin@selenium-g..., se:cdpVersion: 122.0.6261.69, se:vnc: ws://admin:admin@selenium-g..., se:vncEnabled: true, se:vncLocalAddress: ws://172.30.93.77:7900, setWindowRect: true, strictFileInteractability: false, timeouts: {implicit: 0, pageLoad: 300000, script: 30000}, unhandledPromptBehavior: dismiss and notify, webauthn:extension:credBlob: true, webauthn:extension:largeBlob: true, webauthn:extension:minPinLength: true, webauthn:extension:prf: true, webauthn:virtualAuthenticators: true}
2024-03-11T18:58:58UTC [Probe.Startup] - Node responds the ID: 81a24be5-1a59-402b-9407-62e5be087a72 with status: UP
2024-03-11T18:58:58UTC [Probe.Startup] - Grid responds a matched Node ID: 81a24be5-1a59-402b-9407-62e5be087a72
2024-03-11T18:58:58UTC [Probe.Startup] - Node ID: 81a24be5-1a59-402b-9407-62e5be087a72 is found in the Grid. Node is ready.
19:03:34.310 INFO [SessionSlot.stop] - Stopping session b03b291c5d5108416cf0ac1327aeeda8
19:03:34.350 INFO [LocalNode.stopTimedOutSession] - Node draining complete!
19:03:35.357 INFO [NodeServer.lambda$createHandlers$3] - Shutting down
2024-03-11 19:03:35,722 INFO exited: selenium-node (exit status 0; expected)
2024-03-11 19:03:35,722 WARN received SIGINT indicating exit request
2024-03-11 19:03:35,723 INFO waiting for xvfb, vnc, novnc to die
2024-03-11 19:03:37,727 INFO stopped: novnc (terminated by SIGTERM)
2024-03-11 19:03:38,730 INFO stopped: vnc (terminated by SIGTERM)
2024-03-11 19:03:38,731 INFO waiting for xvfb to die
2024-03-11 19:03:39,732 INFO stopped: xvfb (terminated by SIGTERM)

Chrome node yml:
Name:                      selenium-grid-selenium-chrome-node-25f9x-5b6dz
Namespace:                 selenium
Priority:                  0
Service Account:           selenium-grid-selenium-serviceaccount
Node:                      10.48.76.225/10.48.76.225
Start Time:                Mon, 11 Mar 2024 18:58:22 +0000
Labels:                    app=selenium-grid-selenium-chrome-node
                           app.kubernetes.io/component=selenium-grid-4.18.1-20240224
                           app.kubernetes.io/instance=selenium-grid
                           app.kubernetes.io/managed-by=helm
                           app.kubernetes.io/name=selenium-grid-selenium-chrome-node
                           app.kubernetes.io/version=4.18.1-20240224
                           controller-uid=449ef28c-fc3f-4da4-8b3f-31469fb86d9d
                           helm.sh/chart=selenium-grid-0.28.4
                           job-name=selenium-grid-selenium-chrome-node-25f9x
                           scaledjob.keda.sh/name=selenium-grid-selenium-chrome-node
Annotations:               checksum/event-bus-configmap: 2698802d0bbf358d1634b47dff1ef36c5fc2501a27a9d2eef02c7874eb9496f8
                           checksum/logging-configmap: 7f721b250f90c8a5877dc9217b97f0a14392b420edf0e5af105a60944d2b9dc3
                           checksum/node-configmap: 3b6c0fffa6e6a10d57e5455ce21e1e7ee55e0638f15ff521b8c96fe8c10d8e91
                           checksum/server-configmap: ac6520a86bfffa04b4946bbce02ac8f1be341d800f4ab09f4e7cf274f74d3770
                           cni.projectcalico.org/containerID: a880de0585776e16db52c8eb9c290406e1b69dfd510cd15fc0c421cd6a9ed1dc
                           cni.projectcalico.org/podIP: 
                           cni.projectcalico.org/podIPs: 
                           k8s.v1.cni.cncf.io/network-status:
                             [{
                                 "name": "k8s-pod-network",
                                 "ips": [
                                     "172.30.93.77"
                                 ],
                                 "default": true,
                                 "dns": {}
                             }]
                           k8s.v1.cni.cncf.io/networks-status:
                             [{
                                 "name": "k8s-pod-network",
                                 "ips": [
                                     "172.30.93.77"
                                 ],
                                 "default": true,
                                 "dns": {}
                             }]
                           openshift.io/scc: restricted-v2
                           seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status:                    Terminating (lasts 3h15m)
Termination Grace Period:  30s
SeccompProfile:            RuntimeDefault
IP:                        172.30.93.77
IPs:
  IP:           172.30.93.77
Controlled By:  Job/selenium-grid-selenium-chrome-node-25f9x
Containers:
  selenium-grid-selenium-chrome-node:
    Container ID:   cri-o://f177006210af63a402465588e62b42ca96387427102f59824d4a6b29b197ab21
    Image:          docker-na-private.artifactory.swg-devops.com/wiotp-docker-local/selenium/node-chrome:122.0-20240224
    Image ID:       docker-na-private.artifactory.swg-devops.com/wiotp-docker-local/selenium/node-chrome@sha256:3b50643ff9885215c9142fefecace7b17efdde64a8c863ef702bd4a6c3e6a378
    Port:           5555/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Mon, 11 Mar 2024 18:58:48 +0000
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     1
      memory:  1Gi
    Requests:
      cpu:     100m
      memory:  1Gi
    Startup:   exec [bash -c /opt/selenium/nodeProbe.sh Startup >> /proc/1/fd/1] delay=0s timeout=60s period=5s #success=1 #failure=12
    Environment Variables from:
      selenium-grid-selenium-event-bus       ConfigMap  Optional: false
      selenium-grid-selenium-node-config     ConfigMap  Optional: false
      selenium-grid-selenium-logging-config  ConfigMap  Optional: false
      selenium-grid-selenium-server-config   ConfigMap  Optional: false
      selenium-grid-selenium-secrets         Secret     Optional: false
    Environment:
      SE_OTEL_SERVICE_NAME:     selenium-grid-selenium-chrome-node
      SE_NODE_PORT:             5555
      SE_NODE_REGISTER_PERIOD:  60
      SE_NODE_REGISTER_CYCLE:   5
    Mounts:
      /dev/shm from dshm (rw)
      /opt/selenium/nodePreStop.sh from selenium-grid-selenium-node-config (rw,path="nodePreStop.sh")
      /opt/selenium/nodeProbe.sh from selenium-grid-selenium-node-config (rw,path="nodeProbe.sh")
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-n6b64 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  selenium-grid-selenium-node-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      selenium-grid-selenium-node-config
    Optional:  false
  dshm:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  1Gi
  kube-api-access-n6b64:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason             Age   From     Message
  ----     ------             ----  ----     -------
  Warning  FailedPreStopHook  165m  kubelet  Exec lifecycle hook ([bash -c /opt/selenium/nodePreStop.sh >> /proc/1/fd/1]) for Container "selenium-grid-selenium-chrome-node" in Pod "selenium-grid-selenium-chrome-node-25f9x-5b6dz_selenium(8ff39bf5-08ec-4467-b609-0db98b02ff8c)" failed - error: rpc error: code = Unknown desc = command error: time="2024-03-11T14:37:35-05:00" level=fatal msg="nsexec-1[153181]: failed to open /proc/187153/ns/ipc: No such file or directory"
time="2024-03-11T14:37:35-05:00" level=fatal msg="nsexec-0[153169]: failed to sync with stage-1: next state: Invalid argument"
time="2024-03-11T14:37:35-05:00" level=error msg="exec failed: unable to start container process: error executing setns process: exit status 1"
, stdout: , stderr: , exit code -1, message: ""
  Warning  FailedPreStopHook  135m  kubelet  Exec lifecycle hook ([bash -c /opt/selenium/nodePreStop.sh >> /proc/1/fd/1]) for Container "selenium-grid-selenium-chrome-node" in Pod "selenium-grid-selenium-chrome-node-25f9x-5b6dz_selenium(8ff39bf5-08ec-4467-b609-0db98b02ff8c)" failed - error: rpc error: code = Unknown desc = command error: time="2024-03-11T15:08:06-05:00" level=fatal msg="nsexec-1[358906]: failed to open /proc/187153/ns/ipc: No such file or directory"
time="2024-03-11T15:08:06-05:00" level=fatal msg="nsexec-0[358893]: failed to sync with stage-1: next state: Invalid argument"
time="2024-03-11T15:08:06-05:00" level=error msg="exec failed: unable to start container process: error executing setns process: exit status 1"
, stdout: , stderr: , exit code -1, message: ""
  Warning  FailedPreStopHook  104m  kubelet  Exec lifecycle hook ([bash -c /opt/selenium/nodePreStop.sh >> /proc/1/fd/1]) for Container "selenium-grid-selenium-chrome-node" in Pod "selenium-grid-selenium-chrome-node-25f9x-5b6dz_selenium(8ff39bf5-08ec-4467-b609-0db98b02ff8c)" failed - error: rpc error: code = Unknown desc = command error: time="2024-03-11T15:38:36-05:00" level=fatal msg="nsexec-1[149417]: failed to open /proc/187153/ns/ipc: No such file or directory"
time="2024-03-11T15:38:36-05:00" level=fatal msg="nsexec-0[149388]: failed to sync with stage-1: next state: Invalid argument"
time="2024-03-11T15:38:36-05:00" level=error msg="exec failed: unable to start container process: error executing setns process: exit status 1"
, stdout: , stderr: , exit code -1, message: ""
  Warning  FailedPreStopHook  74m  kubelet  Exec lifecycle hook ([bash -c /opt/selenium/nodePreStop.sh >> /proc/1/fd/1]) for Container "selenium-grid-selenium-chrome-node" in Pod "selenium-grid-selenium-chrome-node-25f9x-5b6dz_selenium(8ff39bf5-08ec-4467-b609-0db98b02ff8c)" failed - error: rpc error: code = Unknown desc = command error: time="2024-03-11T16:09:06-05:00" level=fatal msg="nsexec-1[454447]: failed to open /proc/187153/ns/ipc: No such file or directory"
time="2024-03-11T16:09:06-05:00" level=fatal msg="nsexec-0[454427]: failed to sync with stage-1: next state: Invalid argument"
time="2024-03-11T16:09:06-05:00" level=error msg="exec failed: unable to start container process: error executing setns process: exit status 1"
, stdout: , stderr: , exit code -1, message: ""
  Warning  FailedPreStopHook  43m  kubelet  Exec lifecycle hook ([bash -c /opt/selenium/nodePreStop.sh >> /proc/1/fd/1]) for Container "selenium-grid-selenium-chrome-node" in Pod "selenium-grid-selenium-chrome-node-25f9x-5b6dz_selenium(8ff39bf5-08ec-4467-b609-0db98b02ff8c)" failed - error: rpc error: code = Unknown desc = command error: time="2024-03-11T16:39:37-05:00" level=fatal msg="nsexec-1[259528]: failed to open /proc/187153/ns/ipc: No such file or directory"
time="2024-03-11T16:39:37-05:00" level=fatal msg="nsexec-0[259505]: failed to sync with stage-1: next state: Invalid argument"
time="2024-03-11T16:39:37-05:00" level=error msg="exec failed: unable to start container process: error executing setns process: exit status 1"
, stdout: , stderr: , exit code -1, message: ""
  Normal   Killing            13m (x7 over 3h16m)  kubelet  Stopping container selenium-grid-selenium-chrome-node
  Warning  FailedKillPod      13m (x6 over 165m)   kubelet  error killing pod: [failed to "KillContainer" for "selenium-grid-selenium-chrome-node" with KillContainerError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded", failed to "KillPodSandbox" for "8ff39bf5-08ec-4467-b609-0db98b02ff8c" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
  Warning  FailedPreStopHook  13m                  kubelet  Exec lifecycle hook ([bash -c /opt/selenium/nodePreStop.sh >> /proc/1/fd/1]) for Container "selenium-grid-selenium-chrome-node" in Pod "selenium-grid-selenium-chrome-node-25f9x-5b6dz_selenium(8ff39bf5-08ec-4467-b609-0db98b02ff8c)" failed - error: rpc error: code = Unknown desc = command error: time="2024-03-11T17:10:07-05:00" level=fatal msg="nsexec-1[51553]: failed to open /proc/187153/ns/ipc: No such file or directory"
time="2024-03-11T17:10:07-05:00" level=fatal msg="nsexec-0[51532]: failed to sync with stage-1: next state: Invalid argument"
time="2024-03-11T17:10:07-05:00" level=error msg="exec failed: unable to start container process: error executing setns process: exit status 1"
, stdout: , stderr: , exit code -1, message: ""
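
For pods wedged like this, a common manual workaround is to force-delete the stuck pod; note this only clears the symptom, since the root cause here is the exec into the container failing:

# Force removal of a stuck pod (oc behaves the same way on OpenShift)
kubectl delete pod selenium-grid-selenium-chrome-node-25f9x-5b6dz -n selenium --grace-period=0 --force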

Operating System

Openshift 4.12.49

Docker Selenium version (image tag)

4.18.1

Selenium Grid chart version (chart version)

0.28.4

github-actions[bot] commented 8 months ago

@Aymen-Ben-S, thank you for creating this issue. We will troubleshoot it as soon as we can.


Info for maintainers

Triage this issue by using labels.

If information is missing, add a helpful comment and then I-issue-template label.

If the issue is a question, add the I-question label.

If the issue is valid but there is no time to troubleshoot it, consider adding the help wanted label.

If the issue requires changes or fixes from an external project (e.g., ChromeDriver, GeckoDriver, MSEdgeDriver, W3C), add the applicable G-* label, and it will provide the correct link and auto-close the issue.

After troubleshooting the issue, please add the R-awaiting answer label.

Thank you!

alequint commented 8 months ago

Just to add some additional context about the problem above... it is the same one mentioned here: https://github.com/SeleniumHQ/docker-selenium/issues/2129#issuecomment-1970446915

At that time @VietND96 thought preStop was stuck and suggested getting some logs from there. Are these logs the same ones we have in the chrome node (obtained via kubectl logs), or is there another log persisted inside the container and not exposed? We are using 4.18 images as mentioned above; the chrome node has no errors in its log, but the pod does not terminate even after its processes receive the SIGTERM (the pods stay Running and/or Terminating, if we restart hub/keda).

We do not see this behavior consistently, but once the first node hangs, all the others will hang. It seems the cluster/grid enters a state where the selenium processes cannot be killed and the nodes live forever.
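
One quick way to check for logs persisted inside a node container is to exec in and inspect the filesystem; the paths below are assumptions (these images appear to send everything to stdout, which is what kubectl logs returns):

# Look for any on-disk logs inside a stuck node (paths are guesses)
oc exec -it selenium-grid-selenium-chrome-node-25f9x-5b6dz -n selenium -- ls -la /var/log
# If the exec itself fails with the same nsexec error seen in the preStop events,
# that itself points at the container runtime rather than the image.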

VietND96 commented 8 months ago

Your K8s env is using the containerd runtime, right? I am asking because, looking into the message failed to open /proc/187153/ns/ipc: No such file or directory, I found something similar mentioned in https://github.com/containerd/containerd/issues/9160
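
A quick way to confirm the runtime is the CONTAINER-RUNTIME column of the node listing; the Container ID prefix in the describe output above (cri-o://...) already suggests CRI-O rather than containerd here:

kubectl get nodes -o wide   # CONTAINER-RUNTIME column, e.g. cri-o://1.25.x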

alequint commented 8 months ago

@VietND96 our runtime really runs in containers in the same cluster, connected to the hub via remote WebDriver (Java client). We are not using containerd, though. I did not notice the failure message about /proc/187153/ns/ipc: No such file or directory; it is something @Aymen-Ben-S was investigating yesterday. So if it happens randomly, it seems this file/directory is sometimes found and sometimes not... I am not sure what the /proc/187153/ns/ipc directory really means, or how it is used or required for our nodes to stop. I will investigate more along that line.

Thanks for the link to the other issue... overall the way we reach the problem is different, though the absence of that file/dir is much the same... will keep an eye on that as well.

VietND96 commented 8 months ago

Can you also try to set global.seleniumGrid.stdoutProbeLog=false? The idea behind this config is to redirect stdout to PID 1 in the container, which is why you see /opt/selenium/nodeProbe.sh preStop >> /proc/1/fd/1. It seems /proc/1/fd/1 is a trick that won't work on all platforms, e.g. in this case OpenShift.
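
A minimal sketch of applying that toggle to the existing release, reusing the current values (release and namespace names taken from the output above):

helm upgrade selenium-grid docker-selenium/selenium-grid \
  --namespace selenium \
  --reuse-values \
  --set global.seleniumGrid.stdoutProbeLog=false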

alequint commented 8 months ago

Just to add two more examples... I have in my live env one chrome node in the Running state and one chrome node in the Terminating state... both are stuck (usually the "running" stuck pods turn to "terminating" when we restart the hub/keda pods). I do not see events similar to the ones posted above by @Aymen-Ben-S, but I see the preStop failure anyway. The pod describe outputs follow:

Stuck Running Node

oc describe pod selenium-grid-selenium-chrome-node-vcr64-kvjtq
Name:         selenium-grid-selenium-chrome-node-vcr64-kvjtq
Namespace:    selenium
Priority:     0
Node:         10.177.41.210/10.177.41.210
Start Time:   Wed, 13 Mar 2024 17:25:14 -0300
Labels:       app=selenium-grid-selenium-chrome-node
              app.kubernetes.io/component=selenium-grid-4.18.1-20240224
              app.kubernetes.io/instance=selenium-grid
              app.kubernetes.io/managed-by=helm
              app.kubernetes.io/name=selenium-grid-selenium-chrome-node
              app.kubernetes.io/version=4.18.1-20240224
              controller-uid=6c5573d2-3e1e-4887-9dc4-5b71e2a6bd19
              helm.sh/chart=selenium-grid-0.28.4
              job-name=selenium-grid-selenium-chrome-node-vcr64
              scaledjob.keda.sh/name=selenium-grid-selenium-chrome-node
Annotations:  checksum/event-bus-configmap: 2698802d0bbf358d1634b47dff1ef36c5fc2501a27a9d2eef02c7874eb9496f8
              checksum/logging-configmap: 7f721b250f90c8a5877dc9217b97f0a14392b420edf0e5af105a60944d2b9dc3
              checksum/node-configmap: 3b6c0fffa6e6a10d57e5455ce21e1e7ee55e0638f15ff521b8c96fe8c10d8e91
              checksum/server-configmap: ac6520a86bfffa04b4946bbce02ac8f1be341d800f4ab09f4e7cf274f74d3770
              cni.projectcalico.org/containerID: a3c4689d38376d4277486296c7c87aa32f59794f9f401bbaffe0cf60b1cf3ea1
              cni.projectcalico.org/podIP: 172.30.201.59/32
              cni.projectcalico.org/podIPs: 172.30.201.59/32
              k8s.v1.cni.cncf.io/network-status:
                [{
                    "name": "k8s-pod-network",
                    "ips": [
                        "172.30.201.59"
                    ],
                    "default": true,
                    "dns": {}
                }]
              openshift.io/scc: restricted-v2
              seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status:       Running
IP:           172.30.201.59
IPs:
  IP:           172.30.201.59
Controlled By:  Job/selenium-grid-selenium-chrome-node-vcr64
Containers:
  selenium-grid-selenium-chrome-node:
    Container ID:   cri-o://e06640a839fa0ec541ca954e5639f23d68802128187557a5ee83e8634aa796b9
    Image:          docker-na-private.artifactory.swg-devops.com/wiotp-docker-local/selenium/node-chrome:122.0-20240224
    Image ID:       docker-na-private.artifactory.swg-devops.com/wiotp-docker-local/selenium/node-chrome@sha256:3b50643ff9885215c9142fefecace7b17efdde64a8c863ef702bd4a6c3e6a378
    Port:           5555/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Wed, 13 Mar 2024 17:25:16 -0300
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     1
      memory:  1Gi
    Requests:
      cpu:     100m
      memory:  1Gi
    Startup:   exec [bash -c /opt/selenium/nodeProbe.sh Startup >> /proc/1/fd/1] delay=0s timeout=60s period=5s #success=1 #failure=12
    Environment Variables from:
      selenium-grid-selenium-event-bus       ConfigMap  Optional: false
      selenium-grid-selenium-node-config     ConfigMap  Optional: false
      selenium-grid-selenium-logging-config  ConfigMap  Optional: false
      selenium-grid-selenium-server-config   ConfigMap  Optional: false
      selenium-grid-selenium-secrets         Secret     Optional: false
    Environment:
      SE_OTEL_SERVICE_NAME:     selenium-grid-selenium-chrome-node
      SE_NODE_PORT:             5555
      SE_NODE_REGISTER_PERIOD:  60
      SE_NODE_REGISTER_CYCLE:   5
    Mounts:
      /dev/shm from dshm (rw)
      /opt/selenium/nodePreStop.sh from selenium-grid-selenium-node-config (rw,path="nodePreStop.sh")
      /opt/selenium/nodeProbe.sh from selenium-grid-selenium-node-config (rw,path="nodeProbe.sh")
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-q48g8 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  selenium-grid-selenium-node-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      selenium-grid-selenium-node-config
    Optional:  false
  dshm:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  1Gi
  kube-api-access-q48g8:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:                      <none>

What is interesting about this running pod is that it has no Events associated, but if we look at its logs, the selenium processes were supposed to have terminated, since they got SIGTERM:

2024-03-13T20:25:25UTC [Probe.Startup] - Node responds the ID: e3578a60-9ae9-4c10-8266-06d4a2dd4859 with status: DRAINING
2024-03-13T20:25:25UTC [Probe.Startup] - Grid responds a matched Node ID: e3578a60-9ae9-4c10-8266-06d4a2dd4859
2024-03-13T20:25:25UTC [Probe.Startup] - Node ID: e3578a60-9ae9-4c10-8266-06d4a2dd4859 is found in the Grid. Node is ready.
20:29:45.411 INFO [SessionSlot.stop] - Stopping session 600b097a806e938d7f3d27b44b3f96ca
20:29:45.413 INFO [LocalNode.stopTimedOutSession] - Node draining complete!
20:29:46.416 INFO [NodeServer.lambda$createHandlers$3] - Shutting down
2024-03-13 20:29:46,780 INFO exited: selenium-node (exit status 0; expected)
2024-03-13 20:29:46,781 WARN received SIGINT indicating exit request
2024-03-13 20:29:46,781 INFO waiting for xvfb, vnc, novnc to die
2024-03-13 20:29:48,783 INFO stopped: novnc (terminated by SIGTERM)
2024-03-13 20:29:49,814 INFO stopped: vnc (terminated by SIGTERM)
2024-03-13 20:29:49,814 INFO waiting for xvfb to die
2024-03-13 20:29:50,816 INFO stopped: xvfb (terminated by SIGTERM)

Stuck Terminating Node

oc describe pod selenium-grid-selenium-chrome-node-wqv2b-pbk8g
Name:                      selenium-grid-selenium-chrome-node-wqv2b-pbk8g
Namespace:                 selenium
Priority:                  0
Node:                      10.177.41.210/10.177.41.210
Start Time:                Wed, 13 Mar 2024 16:10:44 -0300
Labels:                    app=selenium-grid-selenium-chrome-node
                           app.kubernetes.io/component=selenium-grid-4.18.1-20240224
                           app.kubernetes.io/instance=selenium-grid
                           app.kubernetes.io/managed-by=helm
                           app.kubernetes.io/name=selenium-grid-selenium-chrome-node
                           app.kubernetes.io/version=4.18.1-20240224
                           controller-uid=bf32cbc7-b431-453c-bfac-5b29d45fc739
                           helm.sh/chart=selenium-grid-0.28.4
                           job-name=selenium-grid-selenium-chrome-node-wqv2b
                           scaledjob.keda.sh/name=selenium-grid-selenium-chrome-node
Annotations:               checksum/event-bus-configmap: 2698802d0bbf358d1634b47dff1ef36c5fc2501a27a9d2eef02c7874eb9496f8
                           checksum/logging-configmap: 7f721b250f90c8a5877dc9217b97f0a14392b420edf0e5af105a60944d2b9dc3
                           checksum/node-configmap: 3b6c0fffa6e6a10d57e5455ce21e1e7ee55e0638f15ff521b8c96fe8c10d8e91
                           checksum/server-configmap: ac6520a86bfffa04b4946bbce02ac8f1be341d800f4ab09f4e7cf274f74d3770
                           cni.projectcalico.org/containerID: a4f6e5164e05ac11a7074e67b29d85a0c9f6a75e78cd1231c3673a965d6eb5f2
                           cni.projectcalico.org/podIP: 
                           cni.projectcalico.org/podIPs: 
                           k8s.v1.cni.cncf.io/network-status:
                             [{
                                 "name": "k8s-pod-network",
                                 "ips": [
                                     "172.30.200.215"
                                 ],
                                 "default": true,
                                 "dns": {}
                             }]
                           openshift.io/scc: restricted-v2
                           seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status:                    Terminating (lasts 13h)
Termination Grace Period:  30s
IP:                        172.30.200.215
IPs:
  IP:           172.30.200.215
Controlled By:  Job/selenium-grid-selenium-chrome-node-wqv2b
Containers:
  selenium-grid-selenium-chrome-node:
    Container ID:   cri-o://e4c63c9a396b286577e6168d241485097be77263c6b782a59d6157ec52ebeb32
    Image:          docker-na-private.artifactory.swg-devops.com/wiotp-docker-local/selenium/node-chrome:122.0-20240224
    Image ID:       docker-na-private.artifactory.swg-devops.com/wiotp-docker-local/selenium/node-chrome@sha256:3b50643ff9885215c9142fefecace7b17efdde64a8c863ef702bd4a6c3e6a378
    Port:           5555/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Wed, 13 Mar 2024 16:11:06 -0300
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     1
      memory:  1Gi
    Requests:
      cpu:     100m
      memory:  1Gi
    Startup:   exec [bash -c /opt/selenium/nodeProbe.sh Startup >> /proc/1/fd/1] delay=0s timeout=60s period=5s #success=1 #failure=12
    Environment Variables from:
      selenium-grid-selenium-event-bus       ConfigMap  Optional: false
      selenium-grid-selenium-node-config     ConfigMap  Optional: false
      selenium-grid-selenium-logging-config  ConfigMap  Optional: false
      selenium-grid-selenium-server-config   ConfigMap  Optional: false
      selenium-grid-selenium-secrets         Secret     Optional: false
    Environment:
      SE_OTEL_SERVICE_NAME:     selenium-grid-selenium-chrome-node
      SE_NODE_PORT:             5555
      SE_NODE_REGISTER_PERIOD:  60
      SE_NODE_REGISTER_CYCLE:   5
    Mounts:
      /dev/shm from dshm (rw)
      /opt/selenium/nodePreStop.sh from selenium-grid-selenium-node-config (rw,path="nodePreStop.sh")
      /opt/selenium/nodeProbe.sh from selenium-grid-selenium-node-config (rw,path="nodeProbe.sh")
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-7dv5t (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  selenium-grid-selenium-node-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      selenium-grid-selenium-node-config
    Optional:  false
  dshm:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  1Gi
  kube-api-access-7dv5t:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason             Age                 From     Message
  ----     ------             ----                ----     -------
  Normal   Killing            20m (x27 over 13h)  kubelet  Stopping container selenium-grid-selenium-chrome-node
  Warning  FailedPreStopHook  20m (x27 over 13h)  kubelet  PreStopHook failed
  Warning  FailedKillPod      20m (x26 over 13h)  kubelet  error killing pod: [failed to "KillContainer" for "selenium-grid-selenium-chrome-node" with KillContainerError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded", failed to "KillPodSandbox" for "c4bffdc1-fc0b-46b2-9819-be49a0a47725" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"]
$ oc describe pod selenium-grid-selenium-chrome-node-wqv2b-pbk8g | grep stop
$ oc describe pod selenium-grid-selenium-chrome-node-wqv2b-pbk8g | grep Stop
      /opt/selenium/nodePreStop.sh from selenium-grid-selenium-node-config (rw,path="nodePreStop.sh")
  Normal   Killing            22m (x27 over 13h)  kubelet  Stopping container selenium-grid-selenium-chrome-node
  Warning  FailedPreStopHook  22m (x27 over 13h)  kubelet  PreStopHook failed

alequint commented 8 months ago

@VietND96 just to give some feedback: we did not notice any node hanging so far, after:

We will keep an eye on our next test runs anyway, but so far we have had very good results compared with previous runs.

alequint commented 8 months ago

@VietND96 @Aymen-Ben-S noticed hanging in one of our environments. Some details follow:

List of Pods in selenium namespace sorted by Created Date:

[screenshot]

Only the top chrome node is really running; the one below it, even though it shows as Running, hung. The describe output from this pod follows (I noticed a startup probe warning, but that is all):

oc describe pod selenium-grid-selenium-chrome-node-qx8s5-4g2bg -n selenium

Name:         selenium-grid-selenium-chrome-node-qx8s5-4g2bg
Namespace:    selenium
Priority:     0
Node:         10.113.225.224/10.113.225.224
Start Time:   Fri, 29 Mar 2024 14:28:54 -0300
Labels:       app=selenium-grid-selenium-chrome-node
              app.kubernetes.io/component=selenium-grid-4.18.1-20240224
              app.kubernetes.io/instance=selenium-grid
              app.kubernetes.io/managed-by=helm
              app.kubernetes.io/name=selenium-grid-selenium-chrome-node
              app.kubernetes.io/version=4.18.1-20240224
              controller-uid=4b3c4ecc-c78a-446b-8128-6e67bb9957cc
              helm.sh/chart=selenium-grid-0.28.4
              job-name=selenium-grid-selenium-chrome-node-qx8s5
              scaledjob.keda.sh/name=selenium-grid-selenium-chrome-node
Annotations:  checksum/event-bus-configmap: 2698802d0bbf358d1634b47dff1ef36c5fc2501a27a9d2eef02c7874eb9496f8
              checksum/logging-configmap: 7f721b250f90c8a5877dc9217b97f0a14392b420edf0e5af105a60944d2b9dc3
              checksum/node-configmap: 3b6c0fffa6e6a10d57e5455ce21e1e7ee55e0638f15ff521b8c96fe8c10d8e91
              checksum/server-configmap: ac6520a86bfffa04b4946bbce02ac8f1be341d800f4ab09f4e7cf274f74d3770
              cni.projectcalico.org/containerID: 4004c884b70042cf36f09bf1be0b7b7b4ee1014df73f9d075c9bb4f4a6ad4d1b
              cni.projectcalico.org/podIP: 172.30.103.212/32
              cni.projectcalico.org/podIPs: 172.30.103.212/32
              k8s.v1.cni.cncf.io/network-status:
                [{
                    "name": "k8s-pod-network",
                    "ips": [
                        "172.30.103.212"
                    ],
                    "default": true,
                    "dns": {}
                }]
              k8s.v1.cni.cncf.io/networks-status:
                [{
                    "name": "k8s-pod-network",
                    "ips": [
                        "172.30.103.212"
                    ],
                    "default": true,
                    "dns": {}
                }]
              openshift.io/scc: restricted-v2
              seccomp.security.alpha.kubernetes.io/pod: runtime/default
Status:       Running
IP:           172.30.103.212
IPs:
  IP:           172.30.103.212
Controlled By:  Job/selenium-grid-selenium-chrome-node-qx8s5
Containers:
  selenium-grid-selenium-chrome-node:
    Container ID:   cri-o://9b1c71402416095c52dadada09becb2fef04d27172052fee60b07421f810a197
    Image:          docker-na-private.artifactory.swg-devops.com/wiotp-docker-local/selenium/node-chrome:122.0-20240224
    Image ID:       docker-na-private.artifactory.swg-devops.com/wiotp-docker-local/selenium/node-chrome@sha256:3b50643ff9885215c9142fefecace7b17efdde64a8c863ef702bd4a6c3e6a378
    Port:           5555/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Fri, 29 Mar 2024 14:28:56 -0300
    Ready:          True
    Restart Count:  0
    Limits:
      cpu:     1
      memory:  1Gi
    Requests:
      cpu:     100m
      memory:  1Gi
    Startup:   exec [bash -c /opt/selenium/nodeProbe.sh Startup ] delay=0s timeout=60s period=5s #success=1 #failure=12
    Environment Variables from:
      selenium-grid-selenium-event-bus       ConfigMap  Optional: false
      selenium-grid-selenium-node-config     ConfigMap  Optional: false
      selenium-grid-selenium-logging-config  ConfigMap  Optional: false
      selenium-grid-selenium-server-config   ConfigMap  Optional: false
      selenium-grid-selenium-secrets         Secret     Optional: false
    Environment:
      SE_OTEL_SERVICE_NAME:     selenium-grid-selenium-chrome-node
      SE_NODE_PORT:             5555
      SE_NODE_REGISTER_PERIOD:  60
      SE_NODE_REGISTER_CYCLE:   5
    Mounts:
      /dev/shm from dshm (rw)
      /opt/selenium/nodePreStop.sh from selenium-grid-selenium-node-config (rw,path="nodePreStop.sh")
      /opt/selenium/nodeProbe.sh from selenium-grid-selenium-node-config (rw,path="nodeProbe.sh")
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-h99rz (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             True 
  ContainersReady   True 
  PodScheduled      True 
Volumes:
  selenium-grid-selenium-node-config:
    Type:      ConfigMap (a volume populated by a ConfigMap)
    Name:      selenium-grid-selenium-node-config
    Optional:  false
  dshm:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     Memory
    SizeLimit:  1Gi
  kube-api-access-h99rz:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
    ConfigMapName:           openshift-service-ca.crt
    ConfigMapOptional:       <nil>
QoS Class:                   Burstable
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/memory-pressure:NoSchedule op=Exists
                             node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason          Age   From               Message
  ----     ------          ----  ----               -------
  Normal   Scheduled       24m   default-scheduler  Successfully assigned selenium/selenium-grid-selenium-chrome-node-qx8s5-4g2bg to 10.113.225.224
  Normal   AddedInterface  24m   multus             Add eth0 [172.30.103.212/32] from k8s-pod-network
  Normal   Pulled          24m   kubelet            Container image "docker-na-private.artifactory.swg-devops.com/wiotp-docker-local/selenium/node-chrome:122.0-20240224" already present on machine
  Normal   Created         24m   kubelet            Created container selenium-grid-selenium-chrome-node
  Normal   Started         24m   kubelet            Started container selenium-grid-selenium-chrome-node
  Warning  Unhealthy       24m   kubelet            Startup probe failed: 2024-03-29T17:29:00UTC [Probe.Startup] - Wait for the Node to report its status

Logs from that same pod follow:

oc logs -f selenium-grid-selenium-chrome-node-qx8s5-4g2bg -n selenium 

2024-03-29 17:28:56,709 INFO Included extra file "/etc/supervisor/conf.d/selenium.conf" during parsing
2024-03-29 17:28:56,712 INFO RPC interface 'supervisor' initialized
2024-03-29 17:28:56,712 CRIT Server 'unix_http_server' running without any HTTP authentication checking
2024-03-29 17:28:56,713 INFO supervisord started with pid 8
2024-03-29 17:28:57,716 INFO spawned: 'xvfb' with pid 9
2024-03-29 17:28:57,719 INFO spawned: 'vnc' with pid 10
2024-03-29 17:28:57,721 INFO spawned: 'novnc' with pid 11
2024-03-29 17:28:57,725 INFO spawned: 'selenium-node' with pid 12
2024-03-29 17:28:57,742 INFO success: selenium-node entered RUNNING state, process has stayed up for > than 0 seconds (startsecs)
E: [pulseaudio] main.c: Daemon startup failed.
Home directory not accessible: Permission denied
No PulseAudio daemon running, or not running as session daemon.
Home directory not accessible: Permission denied
No PulseAudio daemon running, or not running as session daemon.
Home directory not accessible: Permission denied
No PulseAudio daemon running, or not running as session daemon.
Appending Selenium options: --session-timeout 300
Appending Selenium options: --register-period 60
Appending Selenium options: --register-cycle 5
Appending Selenium options: --heartbeat-period 30
Appending Selenium options: --log-level INFO
Generating Selenium Config
Setting up SE_NODE_HOST...
Tracing is disabled
Selenium Grid Node configuration: 
[events]
publish = "tcp://selenium-grid-selenium-hub.selenium:4442"
subscribe = "tcp://selenium-grid-selenium-hub.selenium:4443"

[server]
port = "5555"
[node]
grid-url = "http://admin:admin@selenium-grid-selenium-hub.selenium:4444"
session-timeout = "300"
override-max-sessions = false
detect-drivers = false
drain-after-session-count = 1
max-sessions = 1

[[node.driver-configuration]]
display-name = "chrome"
stereotype = '{"browserName": "chrome", "browserVersion": "122.0", "platformName": "Linux", "goog:chromeOptions": {"binary": "/usr/bin/google-chrome"}}'
max-sessions = 1

Starting Selenium Grid Node...
2024-03-29 17:28:58,819 INFO success: xvfb entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-03-29 17:28:58,819 INFO success: vnc entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-03-29 17:28:58,819 INFO success: novnc entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
17:28:59.716 INFO [LoggingOptions.configureLogEncoding] - Using the system default encoding
17:28:59.799 INFO [OpenTelemetryTracer.createTracer] - Using OpenTelemetry for tracing
17:29:00.494 INFO [UnboundZmqEventBus.<init>] - Connecting to tcp://selenium-grid-selenium-hub.selenium:4442 and tcp://selenium-grid-selenium-hub.selenium:4443
17:29:00.714 INFO [UnboundZmqEventBus.<init>] - Sockets created
17:29:01.728 INFO [UnboundZmqEventBus.<init>] - Event bus ready
17:29:02.106 INFO [NodeServer.createHandlers] - Reporting self as: http://172.30.103.212:5555
17:29:02.204 INFO [NodeOptions.getSessionFactories] - Detected 1 available processors
17:29:02.393 INFO [NodeOptions.report] - Adding chrome for {"browserName": "chrome","browserVersion": "122.0","goog:chromeOptions": {"binary": "\u002fusr\u002fbin\u002fgoogle-chrome"},"platformName": "linux","se:noVncPort": 7900,"se:vncEnabled": true} 1 times
17:29:02.500 INFO [Node.<init>] - Binding additional locator mechanisms: relative
17:29:03.042 INFO [NodeServer$1.start] - Starting registration process for Node http://172.30.103.212:5555
17:29:03.043 INFO [NodeServer.execute] - Started Selenium node 4.18.1 (revision b1d3319b48): http://172.30.103.212:5555
17:29:03.106 INFO [NodeServer$1.lambda$start$1] - Sending registration event...
17:29:03.600 INFO [NodeServer.lambda$createHandlers$2] - Node has been added
17:29:06.692 INFO [LocalNode.checkSessionCount] - Draining Node, configured sessions value (1) has been reached.
17:29:06.703 INFO [LocalNode.newSession] - Session created by the Node. Id: 3ff9d1895a0d96e49cc1655a0908a244, Caps: Capabilities {acceptInsecureCerts: true, browserName: chrome, browserVersion: 122.0.6261.69, chrome: {chromedriverVersion: 122.0.6261.69 (81bc525b6a36..., userDataDir: /tmp/.org.chromium.Chromium...}, fedcm:accounts: true, goog:chromeOptions: {debuggerAddress: localhost:39059}, networkConnectionEnabled: false, pageLoadStrategy: normal, platformName: linux, proxy: Proxy(), se:bidiEnabled: false, se:cdp: ws://admin:admin@selenium-g..., se:cdpVersion: 122.0.6261.69, se:vnc: ws://admin:admin@selenium-g..., se:vncEnabled: true, se:vncLocalAddress: ws://172.30.103.212:7900, setWindowRect: true, strictFileInteractability: false, timeouts: {implicit: 0, pageLoad: 300000, script: 30000}, unhandledPromptBehavior: dismiss and notify, webauthn:extension:credBlob: true, webauthn:extension:largeBlob: true, webauthn:extension:minPinLength: true, webauthn:extension:prf: true, webauthn:virtualAuthenticators: true}
17:34:32.624 INFO [LocalNode.stopTimedOutSession] - Session id 3ff9d1895a0d96e49cc1655a0908a244 timed out, stopping...
17:34:32.755 INFO [SessionSlot.stop] - Stopping session 3ff9d1895a0d96e49cc1655a0908a244
17:34:32.758 INFO [LocalNode.stopTimedOutSession] - Node draining complete!
17:34:33.764 INFO [NodeServer.lambda$createHandlers$3] - Shutting down
2024-03-29 17:34:34,110 INFO exited: selenium-node (exit status 0; expected)
2024-03-29 17:34:34,111 WARN received SIGINT indicating exit request
2024-03-29 17:34:34,112 INFO waiting for xvfb, vnc, novnc to die
2024-03-29 17:34:36,115 INFO stopped: novnc (terminated by SIGTERM)
2024-03-29 17:34:37,118 INFO stopped: vnc (terminated by SIGTERM)
2024-03-29 17:34:37,119 INFO waiting for xvfb to die
2024-03-29 17:34:37,119 INFO stopped: xvfb (terminated by SIGTERM)

I have also noticed an error in keda-operator-metrics; take a look:

2024/03/29 09:37:07 maxprocs: Updating GOMAXPROCS=1: determined from CPU quota
I0329 09:37:07.382355       1 welcome.go:34] keda_metrics_adapter "msg"="Starting metrics server" 
I0329 09:37:07.382406       1 welcome.go:35] keda_metrics_adapter "msg"="KEDA Version: 2.13.0" 
I0329 09:37:07.382428       1 welcome.go:36] keda_metrics_adapter "msg"="Git Commit: 367fcd32f1e72be504ada53165d07c33d75fc0f7" 
I0329 09:37:07.382448       1 welcome.go:37] keda_metrics_adapter "msg"="Go Version: go1.21.6" 
I0329 09:37:07.382468       1 welcome.go:38] keda_metrics_adapter "msg"="Go OS/Arch: linux/amd64" 
I0329 09:37:07.382520       1 welcome.go:39] keda_metrics_adapter "msg"="Running on Kubernetes 1.25" "version"={"major":"1","minor":"25","gitVersion":"v1.25.16+6df2177","gitCommit":"f5b7c3e8faedd51935d77828a5fc72c7540236f4","gitTreeState":"clean","buildDate":"2024-01-18T12:53:29Z","goVersion":"go1.19.13 X:strictfipsruntime","compiler":"gc","platform":"linux/amd64"}
I0329 09:37:07.382557       1 welcome.go:43] keda_metrics_adapter "msg"="WARNING: KEDA 2.13.0 hasn't been tested on Kubernetes v1.25.16+6df2177" 
I0329 09:37:07.382682       1 welcome.go:44] keda_metrics_adapter "msg"="You can check recommended versions on https://keda.sh" 
I0329 09:37:07.384318       1 main.go:125] keda_metrics_adapter "msg"="Connecting Metrics Service gRPC client to the server" "address"="keda-operator.selenium.svc.cluster.local:9666"
I0329 09:37:07.401590       1 provider.go:58] keda_metrics_adapter/provider "msg"="starting" 
I0329 09:37:07.401682       1 main.go:259] keda_metrics_adapter "msg"="starting adapter..." 
I0329 09:37:07.412258       1 client.go:88] keda_metrics_adapter/provider "msg"="Waiting for establishing a gRPC connection to KEDA Metrics Server" 
I0329 09:37:07.412288       1 main.go:176] keda_metrics_adapter "msg"="starting /metrics server endpoint" 
I0329 09:37:07.662522       1 secure_serving.go:213] Serving securely on [::]:6443
I0329 09:37:07.662721       1 requestheader_controller.go:169] Starting RequestHeaderAuthRequestController
I0329 09:37:07.662771       1 shared_informer.go:311] Waiting for caches to sync for RequestHeaderAuthRequestController
I0329 09:37:07.662838       1 dynamic_serving_content.go:132] "Starting controller" name="serving-cert::/certs/tls.crt::/certs/tls.key"
I0329 09:37:07.663031       1 tlsconfig.go:240] "Starting DynamicServingCertificateController"
I0329 09:37:07.663302       1 dynamic_cafile_content.go:157] "Starting controller" name="client-ca-bundle::/certs/ca.crt"
I0329 09:37:07.663460       1 configmap_cafile_content.go:202] "Starting controller" name="client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file"
I0329 09:37:07.663491       1 shared_informer.go:311] Waiting for caches to sync for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
I0329 09:37:07.763034       1 shared_informer.go:318] Caches are synced for RequestHeaderAuthRequestController
I0329 09:37:07.764422       1 shared_informer.go:318] Caches are synced for client-ca::kube-system::extension-apiserver-authentication::requestheader-client-ca-file
W0329 09:37:08.583147       1 logging.go:59] [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "keda-operator.selenium.svc.cluster.local:9666", ServerName: "keda-operator.selenium.svc.cluster.local:9666", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 172.21.71.118:9666: connect: connection refused"
W0329 09:37:10.631072       1 logging.go:59] [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "keda-operator.selenium.svc.cluster.local:9666", ServerName: "keda-operator.selenium.svc.cluster.local:9666", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 172.21.71.118:9666: connect: connection refused"
W0329 09:37:13.447047       1 logging.go:59] [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "keda-operator.selenium.svc.cluster.local:9666", ServerName: "keda-operator.selenium.svc.cluster.local:9666", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 172.21.71.118:9666: connect: connection refused"
W0329 09:37:16.775172       1 logging.go:59] [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "keda-operator.selenium.svc.cluster.local:9666", ServerName: "keda-operator.selenium.svc.cluster.local:9666", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 172.21.71.118:9666: connect: connection refused"
W0329 09:37:22.407104       1 logging.go:59] [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "keda-operator.selenium.svc.cluster.local:9666", ServerName: "keda-operator.selenium.svc.cluster.local:9666", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 172.21.71.118:9666: connect: connection refused"
I0329 09:37:29.938514       1 provider.go:66] keda_metrics_adapter/provider "msg"="Connection to KEDA Metrics Service gRPC server has been successfully established" "server"="keda-operator.selenium.svc.cluster.local:9666"
E0329 10:54:03.649759       1 provider.go:84] keda_metrics_adapter/provider "msg"="error converting Selector to Labels Map" "error"="invalid selector: [mas.ibm.com/instanceId]" 
E0329 10:54:03.650386       1 status.go:71] apiserver received an error that is not an metav1.Status: &errors.errorString{s:"invalid selector: [mas.ibm.com/instanceId]"}: invalid selector: [mas.ibm.com/instanceId]
http2: server: error reading preface from client 172.30.141.15:40478: read tcp 172.30.109.75:6443->172.30.141.15:40478: read: connection reset by peer

And here, just for reference, is the beginning of the keda-operator log:

2024/03/29 17:28:34 maxprocs: Updating GOMAXPROCS=1: determined from CPU quota
2024-03-29T17:28:35Z    INFO    setup   Starting manager
2024-03-29T17:28:35Z    INFO    setup   KEDA Version: 2.13.0
2024-03-29T17:28:35Z    INFO    setup   Git Commit: 367fcd32f1e72be504ada53165d07c33d75fc0f7
2024-03-29T17:28:35Z    INFO    setup   Go Version: go1.21.6
2024-03-29T17:28:35Z    INFO    setup   Go OS/Arch: linux/amd64
2024-03-29T17:28:35Z    INFO    setup   Running on Kubernetes 1.25  {"version": "v1.25.16+6df2177"}
2024-03-29T17:28:35Z    INFO    setup   WARNING: KEDA 2.13.0 hasn't been tested on Kubernetes v1.25.16+6df2177
2024-03-29T17:28:35Z    INFO    setup   You can check recommended versions on https://keda.sh
2024-03-29T17:28:35Z    INFO    starting server {"kind": "health probe", "addr": "[::]:8081"}
I0329 17:28:35.306819       1 leaderelection.go:250] attempting to acquire leader lease selenium/operator.keda.sh...
I0329 17:28:52.723265       1 leaderelection.go:260] successfully acquired lease selenium/operator.keda.sh
2024-03-29T17:28:52Z    INFO    Starting EventSource    {"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "source": "kind source: *v1alpha1.ScaledObject"}
2024-03-29T17:28:52Z    INFO    Starting EventSource    {"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "source": "kind source: *v2.HorizontalPodAutoscaler"}
2024-03-29T17:28:52Z    INFO    Starting Controller {"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject"}
2024-03-29T17:28:52Z    INFO    Starting EventSource    {"controller": "triggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "TriggerAuthentication", "source": "kind source: *v1alpha1.TriggerAuthentication"}
2024-03-29T17:28:52Z    INFO    Starting Controller {"controller": "triggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "TriggerAuthentication"}
2024-03-29T17:28:52Z    INFO    Starting EventSource    {"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob", "source": "kind source: *v1alpha1.ScaledJob"}
2024-03-29T17:28:52Z    INFO    Starting Controller {"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob"}
2024-03-29T17:28:52Z    INFO    Starting EventSource    {"controller": "cloudeventsource", "controllerGroup": "eventing.keda.sh", "controllerKind": "CloudEventSource", "source": "kind source: *v1alpha1.CloudEventSource"}
2024-03-29T17:28:52Z    INFO    Starting Controller {"controller": "cloudeventsource", "controllerGroup": "eventing.keda.sh", "controllerKind": "CloudEventSource"}
2024-03-29T17:28:52Z    INFO    Starting EventSource    {"controller": "clustertriggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "ClusterTriggerAuthentication", "source": "kind source: *v1alpha1.ClusterTriggerAuthentication"}
2024-03-29T17:28:52Z    INFO    Starting Controller {"controller": "clustertriggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "ClusterTriggerAuthentication"}
2024-03-29T17:28:52Z    INFO    cert-rotation   starting cert rotator controller
2024-03-29T17:28:52Z    INFO    Starting EventSource    {"controller": "cert-rotator", "source": "kind source: *v1.Secret"}
2024-03-29T17:28:52Z    INFO    Starting EventSource    {"controller": "cert-rotator", "source": "kind source: *unstructured.Unstructured"}
2024-03-29T17:28:52Z    INFO    Starting EventSource    {"controller": "cert-rotator", "source": "kind source: *unstructured.Unstructured"}
2024-03-29T17:28:52Z    INFO    Starting Controller {"controller": "cert-rotator"}
2024-03-29T17:28:52Z    INFO    cert-rotation   no cert refresh needed
2024-03-29T17:28:52Z    INFO    cert-rotation   certs are ready in /certs
2024-03-29T17:28:52Z    INFO    Starting workers    {"controller": "cert-rotator", "worker count": 1}
2024-03-29T17:28:52Z    INFO    cert-rotation   no cert refresh needed
2024-03-29T17:28:52Z    ERROR   cert-rotation   Webhook not found. Unable to update certificate.    {"name": "keda-admission", "gvk": "admissionregistration.k8s.io/v1, Kind=ValidatingWebhookConfiguration", "error": "ValidatingWebhookConfiguration.admissionregistration.k8s.io \"keda-admission\" not found"}
github.com/open-policy-agent/cert-controller/pkg/rotator.(*ReconcileWH).ensureCerts
    /workspace/vendor/github.com/open-policy-agent/cert-controller/pkg/rotator/rotator.go:816
github.com/open-policy-agent/cert-controller/pkg/rotator.(*ReconcileWH).Reconcile
    /workspace/vendor/github.com/open-policy-agent/cert-controller/pkg/rotator/rotator.go:785
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile
    /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:119
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler
    /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:316
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem
    /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:266
sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2
    /workspace/vendor/sigs.k8s.io/controller-runtime/pkg/internal/controller/controller.go:227
2024-03-29T17:28:52Z    INFO    cert-rotation   Ensuring CA cert    {"name": "v1beta1.external.metrics.k8s.io", "gvk": "apiregistration.k8s.io/v1, Kind=APIService", "name": "v1beta1.external.metrics.k8s.io", "gvk": "apiregistration.k8s.io/v1, Kind=APIService"}
2024-03-29T17:28:53Z    INFO    Starting workers    {"controller": "cloudeventsource", "controllerGroup": "eventing.keda.sh", "controllerKind": "CloudEventSource", "worker count": 1}
2024-03-29T17:28:53Z    INFO    Starting workers    {"controller": "clustertriggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "ClusterTriggerAuthentication", "worker count": 1}
2024-03-29T17:28:53Z    INFO    Starting workers    {"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob", "worker count": 1}
2024-03-29T17:28:53Z    INFO    Reconciling ScaledJob   {"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob", "ScaledJob": {"name":"selenium-grid-selenium-chrome-node","namespace":"selenium"}, "namespace": "selenium", "name": "selenium-grid-selenium-chrome-node", "reconcileID": "7d69bdf2-a351-4a66-82cb-6822d0b51e39"}
2024-03-29T17:28:53Z    INFO    Starting workers    {"controller": "scaledobject", "controllerGroup": "keda.sh", "controllerKind": "ScaledObject", "worker count": 5}
2024-03-29T17:28:53Z    INFO    Starting workers    {"controller": "triggerauthentication", "controllerGroup": "keda.sh", "controllerKind": "TriggerAuthentication", "worker count": 1}
2024-03-29T17:28:53Z    INFO    RolloutStrategy: immediate, Deleting jobs owned by the previous version of the scaledJob    {"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob", "ScaledJob": {"name":"selenium-grid-selenium-chrome-node","namespace":"selenium"}, "namespace": "selenium", "name": "selenium-grid-selenium-chrome-node", "reconcileID": "7d69bdf2-a351-4a66-82cb-6822d0b51e39", "numJobsToDelete": 1}
2024-03-29T17:28:54Z    INFO    Initializing Scaling logic according to ScaledJob Specification {"controller": "scaledjob", "controllerGroup": "keda.sh", "controllerKind": "ScaledJob", "ScaledJob": {"name":"selenium-grid-selenium-chrome-node","namespace":"selenium"}, "namespace": "selenium", "name": "selenium-grid-selenium-chrome-node", "reconcileID": "7d69bdf2-a351-4a66-82cb-6822d0b51e39"}
2024-03-29T17:28:54Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-grid-selenium-chrome-node", "scaledJob.Namespace": "selenium", "Number of running Jobs": 0}
2024-03-29T17:28:54Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-grid-selenium-chrome-node", "scaledJob.Namespace": "selenium", "Number of pending Jobs ": 0}
2024-03-29T17:28:54Z    INFO    scaleexecutor   Creating jobs   {"scaledJob.Name": "selenium-grid-selenium-chrome-node", "scaledJob.Namespace": "selenium", "Effective number of max jobs": 1}
2024-03-29T17:28:54Z    INFO    scaleexecutor   Creating jobs   {"scaledJob.Name": "selenium-grid-selenium-chrome-node", "scaledJob.Namespace": "selenium", "Number of jobs": 1}
2024-03-29T17:28:54Z    INFO    scaleexecutor   Created jobs    {"scaledJob.Name": "selenium-grid-selenium-chrome-node", "scaledJob.Namespace": "selenium", "Number of jobs": 1}
2024-03-29T17:28:54Z    INFO    cert-rotation   CA certs are injected to webhooks
2024-03-29T17:28:54Z    INFO    grpc_server Starting Metrics Service gRPC Server    {"address": ":9666"}
2024-03-29T17:29:04Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-grid-selenium-chrome-node", "scaledJob.Namespace": "selenium", "Number of running Jobs": 1}
2024-03-29T17:29:04Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-grid-selenium-chrome-node", "scaledJob.Namespace": "selenium", "Number of pending Jobs ": 0}
2024-03-29T17:29:14Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-grid-selenium-chrome-node", "scaledJob.Namespace": "selenium", "Number of running Jobs": 1}
2024-03-29T17:29:14Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-grid-selenium-chrome-node", "scaledJob.Namespace": "selenium", "Number of pending Jobs ": 0}

... and here are some of the last lines:

2024-03-29T17:49:24Z    INFO    scaleexecutor   Creating jobs   {"scaledJob.Name": "selenium-grid-selenium-chrome-node", "scaledJob.Namespace": "selenium", "Number of jobs": 0}
2024-03-29T17:49:24Z    INFO    scaleexecutor   Created jobs    {"scaledJob.Name": "selenium-grid-selenium-chrome-node", "scaledJob.Namespace": "selenium", "Number of jobs": 0}
2024-03-29T17:49:34Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-grid-selenium-chrome-node", "scaledJob.Namespace": "selenium", "Number of running Jobs": 2}
2024-03-29T17:49:34Z    INFO    scaleexecutor   Scaling Jobs    {"scaledJob.Name": "selenium-grid-selenium-chrome-node", "scaledJob.Namespace": "selenium", "Number of pending Jobs ": 0}
2024-03-29T17:49:34Z    INFO    scaleexecutor   Creating jobs   {"scaledJob.Name": "selenium-grid-selenium-chrome-node", "scaledJob.Namespace": "selenium", "Effective number of max jobs": 0}
2024-03-29T17:49:34Z    INFO    scaleexecutor   Creating jobs   {"scaledJob.Name": "selenium-grid-selenium-chrome-node", "scaledJob.Namespace": "selenium", "Number of jobs": 0}
2024-03-29T17:49:34Z    INFO    scaleexecutor   Created jobs    {"scaledJob.Name": "selenium-grid-selenium-chrome-node", "scaledJob.Namespace": "selenium", "Number of jobs": 0}
VietND96 commented 8 months ago

@alequint, thanks for your continuous feedback. In chart 0.29.1, I just updated the preStop script with the expectation that it would prevent the node from getting stuck somewhere. Besides that, I also suspect K8s behavior, since when the scaling type job is enabled, the deployment has terminationGracePeriodSeconds: 30 by default, which means the Pod should be killed once the 30s expires.
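
For reference, a values override along those lines might look like the sketch below. This is only an illustration: it assumes the chart exposes terminationGracePeriodSeconds under the node section, so verify the exact key against the chart's values.yaml before relying on it.

chromeNode:
  # Assumed key; check values.yaml for your chart version.
  # Default is 30s; draining a session plus uploading video may need longer.
  terminationGracePeriodSeconds: 120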

alequint commented 8 months ago

Thanks @VietND96... we plan to upgrade our images to the most recent release and try again.

Considering we are using the job scaling type, do you think it is worth setting terminationGracePeriodSeconds higher than 30s?

Doofus100500 commented 7 months ago

Hi @VietND96, I'm having this problem too, but in k8s with chart 0.30.0, selenium/node-chrome:124.0 & selenium/video:ffmpeg-7.0-20240425. Relevant log output from the node:

Starting Selenium Grid Node...
2024-05-02 13:46:43,603 INFO spawned: 'browserleftoverscleanup' with pid 98
2024-05-02 13:46:43,604 INFO success: xvfb entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-05-02 13:46:43,604 INFO success: vnc entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-05-02 13:46:43,604 INFO success: novnc entered RUNNING state, process has stayed up for > than 1 seconds (startsecs)
2024-05-02 13:46:43,672 INFO exited: browserleftoverscleanup (exit status 0; not expected)
13:46:44.110 INFO [LoggingOptions.configureLogEncoding] - Using the system default encoding
13:46:44.175 INFO [OpenTelemetryTracer.createTracer] - Using OpenTelemetry for tracing
13:46:45.677 INFO [UnboundZmqEventBus.<init>] - Connecting to tcp://selenium-grid-selenium-event-bus.selenium-test:4442 and tcp://selenium-grid-selenium-event-bus.selenium-test:4443
2024-05-02 13:46:45,683 INFO spawned: 'browserleftoverscleanup' with pid 115
2024-05-02 13:46:45,696 INFO exited: browserleftoverscleanup (exit status 0; not expected)
13:46:45.802 INFO [UnboundZmqEventBus.<init>] - Sockets created
13:46:46.805 INFO [UnboundZmqEventBus.<init>] - Event bus ready
13:46:46.913 INFO [NodeServer.createHandlers] - Reporting self as: https://10.233.121.210:5555
13:46:46.988 INFO [NodeOptions.getSessionFactories] - Detected 2 available processors
13:46:47.090 INFO [NodeOptions.report] - Adding chrome for {"browserName": "chrome","browserVersion": "124.0","goog:chromeOptions": {"binary": "\u002fusr\u002fbin\u002fgoogle-chrome"},"platformName": "linux","se:noVncPort": 7900,"se:vncEnabled": true} 1 times
13:46:47.102 INFO [Node.<init>] - Binding additional locator mechanisms: relative
13:46:48.199 INFO [NodeServer$1.start] - Starting registration process for Node https://10.233.121.210:5555
13:46:48.200 INFO [NodeServer.execute] - Started Selenium node 4.20.0 (revision 866c76ca80): https://10.233.121.210:5555
13:46:48.265 INFO [NodeServer$1.lambda$start$1] - Sending registration event...
13:46:49.088 INFO [NodeServer.lambda$createHandlers$2] - Node has been added
2024-05-02 13:46:49,092 INFO spawned: 'browserleftoverscleanup' with pid 131
2024-05-02 13:46:49,105 INFO exited: browserleftoverscleanup (exit status 0; not expected)
2024-05-02 13:46:50,107 INFO gave up: browserleftoverscleanup entered FATAL state, too many start retries too quickly
13:46:52.512 INFO [LocalNode.newSession] - Session created by the Node. Id: 8dc50864ff272dae7876a994e5e72cf8, Caps: Capabilities {acceptInsecureCerts: false, browserName: chrome, browserVersion: 124.0.6367.78, chrome: {chromedriverVersion: 124.0.6367.78 (a087f2dd364d..., userDataDir: /tmp/.org.chromium.Chromium...}, fedcm:accounts: true, goog:chromeOptions: {debuggerAddress: localhost:7289}, grid:teamName: idevops, networkConnectionEnabled: false, pageLoadStrategy: normal, platformName: linux, proxy: Proxy(), se:bidiEnabled: false, se:cdp: wss://grid-test...., se:cdpVersion: 124.0.6367.78, se:vnc: wss://grid-test...., se:vncEnabled: true, se:vncLocalAddress: ws://10.233.121.210:7900, setWindowRect: true, strictFileInteractability: false, timeouts: {implicit: 0, pageLoad: 300000, script: 30000}, unhandledPromptBehavior: dismiss and notify, webauthn:extension:credBlob: true, webauthn:extension:largeBlob: true, webauthn:extension:minPinLength: true, webauthn:extension:prf: true, webauthn:virtualAuthenticators: true}
13:52:17.194 INFO [LocalNode.stopTimedOutSession] - Session id 8dc50864ff272dae7876a994e5e72cf8 timed out, stopping...
13:52:17.616 INFO [SessionSlot.stop] - Stopping session 8dc50864ff272dae7876a994e5e72cf8
13:57:44.698 INFO [LocalNode.drain] - Firing node drain complete message
13:57:44.743 INFO [LocalNode.drain] - Firing node drain complete message
Trapped SIGTERM/SIGINT/x so shutting down supervisord...
13:57:45.706 INFO [NodeServer.lambda$createHandlers$3] - Shutting down
2024-05-02 13:57:45,707 WARN received SIGTERM indicating exit request
2024-05-02 13:57:45,708 INFO waiting for xvfb, vnc, novnc, selenium-node to die
2024-05-02 13:57:46,162 INFO stopped: selenium-node (terminated by SIGTERM)
2024-05-02 13:57:47,164 INFO stopped: novnc (terminated by SIGTERM)
2024-05-02 13:57:48,167 INFO stopped: vnc (terminated by SIGTERM)
2024-05-02 13:57:49,169 INFO stopped: xvfb (terminated by SIGTERM)
Shutdown complete

video container:

Stopping to record video
Add to pipe a signal Uploading video to s3-selenium-video:selenium-video/video//8dc50864ff272dae7876a994e5e72cf8.mp4
Uploading /videos/8dc50864ff272dae7876a994e5e72cf8.mp4 to s3-selenium-video:selenium-video/video/
[out#0/mp4 @ 0x556b0c2a3c80] video:1380kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: 1.445701%
frame= 4874 fps= 15 q=-1.0 Lsize=    1400kB time=00:05:24.86 bitrate=  35.3kbits/s speed=   1x    
[libx264 @ 0x556b0c2a6280] frame I:20    Avg QP: 8.60  size: 15852
[libx264 @ 0x556b0c2a6280] frame P:4854  Avg QP:11.03  size:   226
[libx264 @ 0x556b0c2a6280] mb I  I16..4: 100.0%  0.0%  0.0%
[libx264 @ 0x556b0c2a6280] mb P  I16..4:  0.0%  0.0%  0.0%  P16..4:  1.4%  0.0%  0.0%  0.0%  0.0%    skip:98.5%
[libx264 @ 0x556b0c2a6280] coded y,uvDC,uvAC intra: 2.2% 2.5% 2.4% inter: 0.1% 1.2% 0.0%
[libx264 @ 0x556b0c2a6280] i16 v,h,dc,p: 92%  8%  0%  0%
[libx264 @ 0x556b0c2a6280] i8c dc,h,v,p: 95%  4%  1%  0%
[libx264 @ 0x556b0c2a6280] kb/s:34.77
Exiting normally, received signal 2.
Transferred:           1.367 MiB / 1.367 MiB, 100%, 0 B/s, ETA -
Checks:                 1 / 1, 100%
Deleted:                1 (files), 0 (dirs)
Renamed:                1
Transferred:            1 / 1, 100%
Elapsed time:         0.2s
Trapped SIGTERM/SIGINT/x so shutting down supervisord...
2024-05-02 13:57:44,959 WARN received SIGTERM indicating exit request
2024-05-02 13:57:44,960 INFO waiting for video-recording, video-ready, video-upload to die
Node API is not responding, exiting.
Waiting for force exit file to be consumed by external upload container
Uploader is ready to shutdown
2024-05-02 13:57:46,648 INFO stopped: video-upload (exit status 0)
Traceback (most recent call last):
  File "/opt/bin/video_ready.py", line 19, in <module>
    httpd.serve_forever()
  File "/usr/lib/python3.10/socketserver.py", line 232, in serve_forever
    ready = selector.select(poll_interval)
  File "/usr/lib/python3.10/selectors.py", line 416, in select
    fd_event_list = self._selector.poll(timeout)
KeyboardInterrupt
2024-05-02 13:57:46,672 INFO stopped: video-ready (terminated by SIGINT)
Waiting for force exit file to be consumed by external upload container
2024-05-02 13:57:48,649 INFO waiting for video-recording to die
Waiting for force exit file to be consumed by external upload container
Waiting for force exit file to be consumed by external upload container
Waiting for force exit file to be consumed by external upload container
2024-05-02 13:57:51,656 INFO waiting for video-recording to die
Waiting for force exit file to be consumed by external upload container
Ready to shutdown the recorder
Ready to shutdown the recorder
2024-05-02 13:57:52,711 WARN received SIGINT indicating exit request
2024-05-02 13:57:53,713 INFO stopped: video-recording (exit status 0)
2024-05-02 13:57:53,714 WARN received SIGINT indicating exit request
Shutdown complete
Stream closed EOF for selenium-test/selenium-grid-selenium-chrome-node-v124-5ff489654-m9bxd (video)


VietND96 commented 7 months ago

@Doofus100500, may I know some info? What was the previous version that worked, before upgrading? Can the video recording output be opened and viewed?

Doofus100500 commented 7 months ago

@VietND96

  1. The version that works ~98% of the time is node-chrome:122.0 & video:ffmpeg-6.1-20240224
  2. The video recording output can't be opened; it is not empty, but all players show 0:00 minutes
  3. This problem does not always reproduce, so it is difficult to say for sure. I'll try to downgrade the video container and watch.
Doofus100500 commented 7 months ago

@VietND96 Going through different image versions didn't lead to anything; the issue of video still being recorded after the file has been sent remains, if I understood the problem correctly.

VietND96 commented 7 months ago

@Doofus100500, how about the K8s version you are using? I tried v1.30.0 and saw that the video file was broken, as you mentioned. I rechecked with the docker-compose setup without autoscaling, and the video file is viewable.
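
For anyone who wants to run the same cross-check, a minimal docker-compose setup without autoscaling would look roughly like this (a sketch modeled on the project's README examples; image tags here match the ones discussed in this thread):

services:
  selenium-hub:
    image: selenium/hub:4.20.0
    ports:
      - "4442:4442"
      - "4443:4443"
      - "4444:4444"
  chrome:
    image: selenium/node-chrome:124.0
    shm_size: 2gb
    depends_on:
      - selenium-hub
    environment:
      - SE_EVENT_BUS_HOST=selenium-hub
      - SE_EVENT_BUS_PUBLISH_PORT=4442
      - SE_EVENT_BUS_SUBSCRIBE_PORT=4443
  chrome-video:
    image: selenium/video:ffmpeg-7.0-20240425
    depends_on:
      - chrome
    volumes:
      # Recordings land on the host for inspection.
      - ./videos:/videos
    environment:
      # Name of the browser container whose display gets recorded.
      - DISPLAY_CONTAINER_NAME=chrome
      - FILE_NAME=chrome_video.mp4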

Doofus100500 commented 7 months ago

@VietND96 K8s Rev: v1.25.6. It seems that this is not important.

Doofus100500 commented 7 months ago

@VietND96 Can you try v1.30.0 with chart 0.29.1?

VietND96 commented 7 months ago

Sure, I am in the process of reproducing and figuring out the issue. I will provide a patch if anything can be fixed, or roll back the ffmpeg version.

Doofus100500 commented 7 months ago

@VietND96 An interesting observation: the problem does not occur if there are plenty of resources in the namespace and sessions do not wait in the queue. Maybe it will be useful for you.

VietND96 commented 7 months ago

@Doofus100500, it looks like after rolling back to ffmpeg 6.1.1, the video files are stable. I also added video integrity checks to catch instability when upgrading dependencies. A new image tag and chart will be released soon.

VietND96 commented 7 months ago

@Doofus100500, a new image tag and chart 0.30.1 are out. I added tests to verify video file integrity and saw them pass in CI. Can you try and confirm? Btw, the video file name can now be set from the language binding - https://github.com/SeleniumHQ/docker-selenium?tab=readme-ov-file#video-recording-with-dynamic-file-name-based-on-metadata-in-tests. Hope it can improve the traceability of artifacts.

Doofus100500 commented 7 months ago

@VietND96 With the video files everything is good now, but the pods still remain in the Terminating status.

Btw, the video file name can now be set from the language binding

That's wonderful, thank you, it will be helpful!

VietND96 commented 7 months ago

@Doofus100500, can you try enabling global.seleniumGrid.stdoutProbeLog: true? The preStop script will then print its logs to kubectl logs, so we can see where something might be getting stuck.
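
In values terms that is simply the following (flag name as given above; per this discussion it is available from chart 0.30.x):

global:
  seleniumGrid:
    # Print preStop/probe script output to container stdout,
    # so it shows up in kubectl logs.
    stdoutProbeLog: true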

Doofus100500 commented 7 months ago

@VietND96

kubectl logs -f -l app=selenium-grid-selenium-chrome-node --all-containers=true --namespace selenium-test

2024-05-06 09:14:24,239 INFO exited: selenium-node (exit status 0; expected)
2024-05-06 09:14:24,239 WARN received SIGINT indicating exit request
2024-05-06 09:14:24,239 INFO waiting for xvfb, vnc, novnc to die
Trapped SIGTERM/SIGINT/x so shutting down supervisord...
2024-05-06 09:14:27,243 INFO stopped: novnc (terminated by SIGTERM)
2024-05-06 09:14:27,243 INFO waiting for xvfb, vnc to die
2024-05-06 09:14:27,243 WARN received SIGTERM indicating exit request
2024-05-06 09:14:28,244 INFO stopped: vnc (terminated by SIGTERM)
2024-05-06 09:14:29,246 INFO stopped: xvfb (terminated by SIGTERM)
Shutdown complete
2024-05-06T09:09:56UTC [video.recorder] - Video recording in progress
2024-05-06T09:09:58UTC [video.recorder] - Video recording in progress
2024-05-06T09:09:59UTC [video.recorder] - Video recording in progress
2024-05-06T09:10:00UTC [video.recorder] - Video recording in progress
2024-05-06T09:10:01UTC [video.recorder] - Video recording in progress
2024-05-06T09:10:02UTC [video.recorder] - Video recording in progress
2024-05-06T09:10:03UTC [video.recorder] - Stopping to record video
2024-05-06T09:10:03UTC [video.recorder] - Add to pipe a signal Uploading video to s3-selenium-video:selenium-video/video//8c2a18c7b195c1a712118a6d68556f60.mp4
Uploading /videos/8c2a18c7b195c1a712118a6d68556f60.mp4 to s3-selenium-video:idevops-ops-selenium-video/video/
2024/05/06 09:10:03 Failed to create file system for "s3-selenium-video:idevops-ops-selenium-video/video/": didn't find section in config file
2024-05-06 09:13:33,515 INFO waiting for video-recording to die
2024-05-06T09:13:33UTC [video.recorder] - Video recording in progress
2024-05-06T09:13:34UTC [video.recorder] - Video recording in progress
2024-05-06T09:13:35UTC [video.recorder] - Video recording in progress
2024-05-06 09:13:36,654 INFO waiting for video-recording to die
2024-05-06T09:13:36UTC [video.recorder] - Video recording in progress
2024-05-06T09:13:37UTC [video.recorder] - Video recording in progress
2024-05-06 09:13:38,747 WARN killing 'video-recording' (9) with SIGKILL
2024-05-06 09:13:39,749 INFO stopped: video-recording (terminated by SIGKILL)
Shutdown complete
08:53:32.686 INFO [SessionSlot.stop] - Stopping session 24f17200728ab42ea8628dbf10de07d8
08:53:33.881 INFO [LocalNode.newSession] - Session created by the Node. Id: 65b4e0dfed660ceab008a019b0b8f8a7, Caps: Capabilities {acceptInsecureCerts: false, browserName: chrome, browserVersion: 124.0.6367.118, chrome: {chromedriverVersion: 124.0.6367.91 (51df0e5e17a8..., userDataDir: /tmp/.org.chromium.Chromium...}, fedcm:accounts: true, goog:chromeOptions: {debuggerAddress: localhost:13055}, grid:teamName: idevops, networkConnectionEnabled: false, pageLoadStrategy: normal, platformName: linux, proxy: Proxy(), se:bidiEnabled: false, se:cdp: wss://grid-test.testkontur...., se:cdpVersion: 124.0.6367.118, se:vnc: wss://grid-test.testkontur...., se:vncEnabled: true, se:vncLocalAddress: ws://10.233.65.58:7900, setWindowRect: true, strictFileInteractability: false, timeouts: {implicit: 0, pageLoad: 300000, script: 30000}, unhandledPromptBehavior: dismiss and notify, webauthn:extension:credBlob: true, webauthn:extension:largeBlob: true, webauthn:extension:minPinLength: true, webauthn:extension:prf: true, webauthn:virtualAuthenticators: true}
08:59:02.571 INFO [LocalNode.stopTimedOutSession] - Session id 65b4e0dfed660ceab008a019b0b8f8a7 timed out, stopping...
08:59:02.699 INFO [SessionSlot.stop] - Stopping session 65b4e0dfed660ceab008a019b0b8f8a7
08:59:03.772 INFO [LocalNode.newSession] - Session created by the Node. Id: 443851e31a2148d36dc790fe7423bde9, Caps: Capabilities {acceptInsecureCerts: false, browserName: chrome, browserVersion: 124.0.6367.118, chrome: {chromedriverVersion: 124.0.6367.91 (51df0e5e17a8..., userDataDir: /tmp/.org.chromium.Chromium...}, fedcm:accounts: true, goog:chromeOptions: {debuggerAddress: localhost:9937}, grid:teamName: idevops, networkConnectionEnabled: false, pageLoadStrategy: normal, platformName: linux, proxy: Proxy(), se:bidiEnabled: false, se:cdp: wss://grid-test.testkontur...., se:cdpVersion: 124.0.6367.118, se:vnc: wss://grid-test.testkontur...., se:vncEnabled: true, se:vncLocalAddress: ws://10.233.65.58:7900, setWindowRect: true, strictFileInteractability: false, timeouts: {implicit: 0, pageLoad: 300000, script: 30000}, unhandledPromptBehavior: dismiss and notify, webauthn:extension:credBlob: true, webauthn:extension:largeBlob: true, webauthn:extension:minPinLength: true, webauthn:extension:prf: true, webauthn:virtualAuthenticators: true}
09:04:32.571 INFO [LocalNode.stopTimedOutSession] - Session id 443851e31a2148d36dc790fe7423bde9 timed out, stopping...
09:04:32.702 INFO [SessionSlot.stop] - Stopping session 443851e31a2148d36dc790fe7423bde9
09:04:33.921 INFO [LocalNode.newSession] - Session created by the Node. Id: 8c2a18c7b195c1a712118a6d68556f60, Caps: Capabilities {acceptInsecureCerts: false, browserName: chrome, browserVersion: 124.0.6367.118, chrome: {chromedriverVersion: 124.0.6367.91 (51df0e5e17a8..., userDataDir: /tmp/.org.chromium.Chromium...}, fedcm:accounts: true, goog:chromeOptions: {debuggerAddress: localhost:9075}, grid:teamName: idevops, networkConnectionEnabled: false, pageLoadStrategy: normal, platformName: linux, proxy: Proxy(), se:bidiEnabled: false, se:cdp: wss://grid-test.testkontur...., se:cdpVersion: 124.0.6367.118, se:vnc: wss://grid-test.testkontur...., se:vncEnabled: true, se:vncLocalAddress: ws://10.233.65.58:7900, setWindowRect: true, strictFileInteractability: false, timeouts: {implicit: 0, pageLoad: 300000, script: 30000}, unhandledPromptBehavior: dismiss and notify, webauthn:extension:credBlob: true, webauthn:extension:largeBlob: true, webauthn:extension:minPinLength: true, webauthn:extension:prf: true, webauthn:virtualAuthenticators: true}
09:10:02.571 INFO [LocalNode.stopTimedOutSession] - Session id 8c2a18c7b195c1a712118a6d68556f60 timed out, stopping...
09:10:02.692 INFO [SessionSlot.stop] - Stopping session 8c2a18c7b195c1a712118a6d68556f60

And in k9s in video container i see:

2024-05-06T09:12:45UTC [video.recorder] - Video recording in progress 
2024-05-06T09:12:46UTC [video.recorder] - Video recording in progress 
2024-05-06T09:12:47UTC [video.recorder] - Video recording in progress 
2024-05-06T09:12:48UTC [video.recorder] - Video recording in progress 
2024-05-06T09:12:49UTC [video.recorder] - Video recording in progress 
2024-05-06T09:12:50UTC [video.recorder] - Video recording in progress 
2024-05-06T09:12:51UTC [video.recorder] - Video recording in progress 
2024-05-06T09:12:52UTC [video.recorder] - Video recording in progress 
2024-05-06T09:12:53UTC [video.recorder] - Video recording in progress 
2024-05-06T09:12:54UTC [video.recorder] - Video recording in progress 
2024-05-06T09:12:55UTC [video.recorder] - Video recording in progress 
2024-05-06T09:12:56UTC [video.recorder] - Video recording in progress 
2024-05-06T09:12:57UTC [video.recorder] - Video recording in progress 
2024-05-06T09:12:58UTC [video.recorder] - Video recording in progress 
2024-05-06T09:12:59UTC [video.recorder] - Video recording in progress 
2024-05-06T09:13:00UTC [video.recorder] - Video recording in progress 
2024-05-06T09:13:01UTC [video.recorder] - Video recording in progress 
2024-05-06T09:13:02UTC [video.recorder] - Video recording in progress 
2024-05-06T09:13:04UTC [video.recorder] - Video recording in progress 
2024-05-06T09:13:05UTC [video.recorder] - Video recording in progress 
2024-05-06T09:13:06UTC [video.recorder] - Video recording in progress 
2024-05-06T09:13:07UTC [video.recorder] - Video recording in progress 
2024-05-06T09:13:08UTC [video.recorder] - Video recording in progress 
2024-05-06T09:13:09UTC [video.recorder] - Video recording in progress 
2024-05-06T09:13:10UTC [video.recorder] - Video recording in progress 
2024-05-06T09:13:11UTC [video.recorder] - Video recording in progress 
2024-05-06T09:13:12UTC [video.recorder] - Video recording in progress 
2024-05-06T09:13:13UTC [video.recorder] - Video recording in progress 
2024-05-06T09:13:14UTC [video.recorder] - Video recording in progress 
2024-05-06T09:13:15UTC [video.recorder] - Video recording in progress 
2024-05-06T09:13:16UTC [video.recorder] - Video recording in progress 
Trapped SIGTERM/SIGINT/x so shutting down supervisord...
2024-05-06 09:13:17,640 WARN received SIGTERM indicating exit request
2024-05-06 09:13:17,640 INFO waiting for video-recording, video-ready, video-upload to die
2024-05-06T09:13:17UTC [video.recorder] - Video recording in progress 
2024-05-06T09:13:18UTC [video.recorder] - Video recording in progress 
2024-05-06T09:13:19UTC [video.recorder] - Video recording in progress 
2024-05-06 09:13:20,781 INFO waiting for video-recording, video-ready, video-upload to die
2024-05-06T09:13:20UTC [video.recorder] - Video recording in progress 
2024-05-06T09:13:21UTC [video.recorder] - Video recording in progress 
2024-05-06T09:13:22UTC [video.recorder] - Video recording in progress 
2024-05-06 09:13:23,924 INFO waiting for video-recording, video-ready, video-upload to die
2024-05-06T09:13:23UTC [video.recorder] - Video recording in progress 
2024-05-06T09:13:25UTC [video.recorder] - Video recording in progress 
2024-05-06T09:13:26UTC [video.recorder] - Video recording in progress 
2024-05-06 09:13:27,082 INFO waiting for video-recording, video-ready, video-upload to die
2024-05-06T09:13:27UTC [video.recorder] - Video recording in progress 
2024-05-06 09:13:28,144 WARN killing 'video-upload' (11) with SIGKILL
2024-05-06T09:13:28UTC [video.recorder] - Video recording in progress 
2024-05-06 09:13:28,189 INFO stopped: video-upload (terminated by SIGKILL)
Traceback (most recent call last):
  File "/opt/bin/video_ready.py", line 19, in <module>
    httpd.serve_forever()
  File "/usr/lib/python3.10/socketserver.py", line 232, in serve_forever
    ready = selector.select(poll_interval)
  File "/usr/lib/python3.10/selectors.py", line 416, in select
    fd_event_list = self._selector.poll(timeout)
KeyboardInterrupt
2024-05-06 09:13:28,200 INFO stopped: video-ready (terminated by SIGINT)
2024-05-06T09:13:29UTC [video.recorder] - Stopping to record video
2024-05-06T09:13:29UTC [video.recorder] - Add to pipe a signal Uploading video to s3-selenium-video:selenium-video/video//177aeb02fd19ec3fecd70775f045f4de.mp4
2024-05-06T09:13:29UTC [video.recorder] - Waiting for force exit file to be consumed by external upload container
2024-05-06 09:13:30,198 INFO waiting for video-recording to die
2024-05-06T09:13:30UTC [video.recorder] - Ready to shutdown the recorder
2024-05-06T09:13:30UTC [video.recorder] - Session: 177aeb02fd19ec3fecd70775f045f4de is created
2024-05-06 09:13:30,252 WARN received SIGINT indicating exit request
2024-05-06T09:13:30UTC [video.recorder] - Start recording: true, video file name: 177aeb02fd19ec3fecd70775f045f4de.mp4
{
  "data": {
    "session": {
      "id": "177aeb02fd19ec3fecd70775f045f4de",
      "capabilities": "{\n  \"acceptInsecureCerts\": false,\n  \"browserName\": \"chrome\",\n  \"browserVersion\": \"124.0.6367.118\",\n  \"chrome\": {\n    \"chromedriverVersion\": \"124.0.6367.91 (51df0e5e17a8b0a4f281c1665dbd1b8a0c6b46af-refs\\u002fbranch-heads\\u002f6367@{#984})\",\n    \"userDataDir\": \"\\u002ftmp\\u002f.org.chromium.Chromium.1bHXJ5\"\n  },\n  \"fedcm:accounts\": true,\n  \"goog:chromeOptions\": {\n    \"debuggerAddress\": \"localhost:27607\"\n  },\n  \"grid:teamName\": \"idevops\",\n  \"networkConnectionEnabled\": false,\n  \"pageLoadStrategy\": \"normal\",\n  \"platformName\": \"linux\",\n  \"proxy\": {\n  },\n  \"se:bidiEnabled\": false,\n  \"se:cdp\": \"wss:\\u002f\\u002fgrid-test.testkontur.ru\\u002fstaging\\u002fsession\\u002f177aeb02fd19ec3fecd70775f045f4de\\u002fse\\u002fcdp\",\n  \"se:cdpVersion\": \"124.0.6367.118\",\n  \"se:vnc\": \"wss:\\u002f\\u002fgrid-test.testkontur.ru\\u002fstaging\\u002fsession\\u002f177aeb02fd19ec3fecd70775f045f4de\\u002fse\\u002fvnc\",\n  \"se:vncEnabled\": true,\n  \"se:vncLocalAddress\": \"ws:\\u002f\\u002f10.233.108.59:7900\",\n  \"setWindowRect\": true,\n  \"strictFileInteractability\": false,\n  \"timeouts\": {\n    \"implicit\": 0,\n    \"pageLoad\": 300000,\n    \"script\": 30000\n  },\n  \"unhandledPromptBehavior\": \"dismiss and notify\",\n  \"webauthn:extension:credBlob\": true,\n  \"webauthn:extension:largeBlob\": true,\n  \"webauthn:extension:minPinLength\": true,\n  \"webauthn:extension:prf\": true,\n  \"webauthn:virtualAuthenticators\": true\n}",
      "startTime": "06/05/2024 09:08:54",
      "uri": "https://10.233.108.59:5555",
      "nodeId": "3daa4aa3-1410-41c5-8923-2ffed4cf00f2",
      "nodeUri": "https://10.233.108.59:5555",
      "sessionDurationMillis": "276345",
      "slot": {
        "id": "9e59f70d-5bc7-421b-a73c-de2e1e72740a",
        "stereotype": "{\n  \"browserName\": \"chrome\",\n  \"browserVersion\": \"124.0\",\n  \"goog:chromeOptions\": {\n    \"binary\": \"\\u002fusr\\u002fbin\\u002fgoogle-chrome\"\n  },\n  \"platformName\": \"linux\",\n  \"se:noVncPort\": 7900,\n  \"se:vncEnabled\": true\n}",
        "lastStarted": "06/05/2024 09:08:54"
      }
    }
  }
}
2024-05-06T09:13:30UTC [video.recorder] - Starting to record video
2024-05-06T09:13:30UTC [video.recorder] - Video recording started
[x11grab @ 0x55fbeb7e3900] Stream #0: not enough frames to estimate rate; consider increasing probesize
2024-05-06T09:13:32UTC [video.recorder] - Video recording in progress 
2024-05-06 09:13:33,515 INFO waiting for video-recording to die
2024-05-06T09:13:33UTC [video.recorder] - Video recording in progress 
2024-05-06T09:13:34UTC [video.recorder] - Video recording in progress 
2024-05-06T09:13:35UTC [video.recorder] - Video recording in progress 
2024-05-06 09:13:36,654 INFO waiting for video-recording to die
2024-05-06T09:13:36UTC [video.recorder] - Video recording in progress 
2024-05-06T09:13:37UTC [video.recorder] - Video recording in progress 
2024-05-06 09:13:38,747 WARN killing 'video-recording' (9) with SIGKILL
2024-05-06 09:13:39,749 INFO stopped: video-recording (terminated by SIGKILL)
Shutdown complete
Stream closed EOF for selenium-test/selenium-grid-selenium-chrome-node-78c7fb5788-9gpz6 (video)
VietND96 commented 7 months ago

Looks interesting; the recorder container was terminated by SIGKILL, while the Node container carried on with a few more sessions. Let me check on this further.

isc-aray commented 6 months ago

I'm seeing this issue on 0.30.1. I set global.seleniumGrid.stdoutProbeLog: true, but the only thing in the log from preStop is:

[lifecycle.preStop] - Detected full distributed mode: true. Since SE_DISTRIBUTOR_HOST and SE_DISTRIBUTOR_PORT are set in Node ConfigMap
[lifecycle.preStop] - Node is already drained. Shutting down gracefully!
[lifecycle.preStop] - Exiting Node preStop...

Followed by:

Trapped SIGTERM/SIGINT/x so shutting down supervisord...
18:22:08.533 INFO [UnboundZmqEventBus.<init>] - Event bus ready
2024-05-15 18:22:08,534 WARN received SIGTERM indicating exit request
2024-05-15 18:22:08,535 INFO waiting for xvfb, vnc, novnc, selenium-node to die
18:22:08.623 INFO [NodeServer.createHandlers] - Reporting self as: http://10.244.2.178:5555
Exception in thread "Thread-0" 2024-05-15 18:22:08,628 INFO stopped: selenium-node (terminated by SIGTERM)
2024-05-15 18:22:09,630 INFO stopped: novnc (terminated by SIGTERM)
2024-05-15 18:22:10,632 INFO stopped: vnc (terminated by SIGTERM)
2024-05-15 18:22:11,635 INFO stopped: xvfb (terminated by SIGTERM)
Shutdown complete

But the pod never actually terminates.

twhart commented 4 months ago

We're still facing this issue as well; are there any updates?

barkep commented 3 months ago

We have the same problem.

We have done a preliminary analysis: the problem with closing the video container occurs only when an error occurs in the tests. When the test completes correctly, everything works fine.

After setting videoRecorder: enabled: false the problem disappears, but this solution does not suit us, since we are then left without recordings from the tests.
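
As a values-file snippet, the workaround above would be roughly the following (key quoted as written in this comment; the exact nesting may differ between chart versions):

videoRecorder:
  # Disables the video sidecar entirely, so no recordings are produced.
  enabled: false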

Do you have a fix for this?

VietND96 commented 3 months ago

Recently, there was an update in video container termination. Can you try the new video image tag and confirm?

barkep commented 3 months ago

I updated to 4.23.1 and the problems started. I went back to 4.18.1 and it works fine.

VietND96 commented 3 months ago

Ah ok, let me try to reproduce it based on your analysis and provide a fix.

VietND96 commented 3 months ago

@barkep, can you provide a few details about your setup?

barkep commented 3 months ago

Autoscaling: job. The Job is stuck with status NotReady 1/2. Right now I can't run describe on the pod, because I've downgraded to a lower version.

VietND96 commented 3 months ago

Yes, right, I could reproduce the NotReady 1/2 status. However, after 30s (via terminationGracePeriodSeconds: 30), it moved to Completed and disappeared. I am still checking further.

github-actions[bot] commented 2 months ago

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.