SeleniumHQ / docker-selenium

Provides a simple way to run Selenium Grid with Chrome, Firefox, and Edge using Docker, making it easier to perform browser automation
http://www.selenium.dev/docker-selenium/
Other
7.86k stars 2.51k forks source link

[🐛 Bug]: Grid on Kubernetes with "homemade" container can not start sessions #1713

Closed parhedberg closed 1 year ago

parhedberg commented 1 year ago

What happened?

I have a setup with 5 nodes on a grid, all setup in kubernetes.

2 of the grids uses a special "home made" container that applies a certificate to login to our testenvironment (setup as the example here https://github.com/SeleniumHQ/docker-selenium#install-certificates-for-chromium-based-browsers) . This container works whitout problem locally. And so it does at first also in the grid environment on kubernetes. But after a while (not sure if it is timebased or after some sessions) the session request hangs, and does not get sent to the (available) node.

Maybe this is related to https://github.com/SeleniumHQ/docker-selenium/issues/1605 or https://github.com/SeleniumHQ/docker-selenium/issues/1701

The most strange thing is that the nodes that use "vanilla" images to build the pod (i.e. selenium/node-chrome from hub.docker.com and not the image that i am making with certificates that is based also based on selenium/node-chrome) does not have this problem. Som only the two nodes with added certificates have the problem.

Command used to start Selenium Grid with Docker

*** THIS IS THE DOCKERFILE FOR BUILDING THE IMAGE AND ADDING THE CERTIFICATE

FROM selenium/node-chrome:4.5.3
USER root

#=== Autoselect whatever cert that popups in chrome
COPY cert_policy.json /etc/opt/chrome/policies/managed/cert_policy.json

#=========
# libnss3-tools
# Network Security Service tools
# Manage certificates in nssdb (certutil, pk12util, modutil, shlibsign, signtool, ssltap)
#=========
RUN apt-get -qqy update && apt-get -qqy install libnss3-tools

#=== Copy cert to user
USER seluser
COPY siths_root_ca_v1_pp.cer /home/seluser
COPY cert.p12 /home/seluser
COPY pass.txt /home/seluser
COPY pngtest.png /home/seluser

WORKDIR /home/seluser

#=== Run commands to insert certs in database
RUN mkdir -p -m755 /home/seluser/.pki/nssdb 
RUN certutil -d sql:/home/seluser/.pki/nssdb -N --empty-password 
RUN certutil -d sql:/home/seluser/.pki/nssdb -A -t TC -n "SITHS CA v1 PP" -i siths_root_ca_v1_pp.cer 
RUN pk12util -d sql:/home/seluser/.pki/nssdb -i cert.p12 -w pass.txt

#=== Expose needed ports
EXPOSE 4442
EXPOSE 4443
EXPOSE 4444
EXPOSE 5555

****** THIS IS THE SCRIPT IN KUBERNETES TO SETUP NODE:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: selenium-node-chrome-personal1
  labels:
    app: selenium-node-chrome-personal1
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      app: selenium-node-chrome-personal1
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate      
  template:
    metadata:
      labels:
        app: selenium-node-chrome-personal1
    spec:
      volumes:
      - name: dshm
        emptyDir:
          medium: Memory
      containers:
      - name: selenium-node-chrome-personal1
        image: <our local registry>/tnmt/chromenode-personal1
        imagePullPolicy: Always
        ports:
          - containerPort: 5555
          - containerPort: 5901          
        volumeMounts:
          - mountPath: /dev/shm
            name: dshm
        env:
          - name: SE_EVENT_BUS_HOST
            value: "selenium-hub"
          - name: SE_EVENT_BUS_SUBSCRIBE_PORT
            value: "4443"
          - name: SE_EVENT_BUS_PUBLISH_PORT
            value: "4442"
          - name: SE_NODE_OVERRIDE_MAX_SESSIONS
            value: "true"
          - name: SE_NODE_MAX_SESSIONS
            value: "1"
          - name: SE_NODE_SESSION_TIMEOUT
            value: "1000"                 
          - name: SE_NODE_STEREOTYPE
            value: "{\"browserName\":\"chrome\",\"platformName\": \"Linux\", \"networkname:applicationName\":\"personal1\", \"nodename:applicationName\":\"personal1\" }"              
          - name: SE_VNC_PORT
            value: "5901"            
          - name: TZ
            value: "Europe/Stockholm"
          - name: SE_LOG_LEVEL
            value: "ALL"
        resources:
          limits:
            memory: "1000Mi"
            cpu: "1"
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File            
      dnsPolicy: ClusterFirst
      imagePullSecrets:
      - name: docker-credentials
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30

Relevant log output

****** LOGS AT NODE ON KUBERNETES

Nothing relevant found here. No requests is forwarded from hub?

****** LOGS AT HUB ON KUBERNETES

15:28:34.445 WARN [SeleniumSpanExporter$1.lambda$export$3] - {"traceId": "982df960bf3807e515e73912aed29308","eventTime": 1667226514445460946,"eventName": "HTTP request execution complete","attributes": {"http.client_ip": "10.6.3.75","http.flavor": 1,"http.handler_class": "org.openqa.selenium.grid.sessionqueue.local.LocalNewSessionQueue","http.host": "<URL TO HUB>","http.method": "POST","http.request_content_length": "2109","http.scheme": "HTTP","http.status_code": 500,"http.target": "\u002fsession","http.user_agent": "selenium\u002f4.5.0 (ruby linux)"}}

15:29:04.503 WARN [SeleniumSpanExporter$1.lambda$export$3] - {"traceId": "a12c3dbcbd4282822abf92ce71c75b13","eventTime": 1667226544502494533,"eventName": "HTTP request execution complete","attributes": {"http.client_ip": "10.6.3.75","http.flavor": 1,"http.handler_class": "org.openqa.selenium.grid.sessionqueue.local.LocalNewSessionQueue","http.host": "<URL TO HUB>","http.method": "POST","http.request_content_length": "2109","http.scheme": "HTTP","http.status_code": 500,"http.target": "\u002fsession","http.user_agent": "selenium\u002f4.5.0 (ruby linux)"}}

15:29:34.566 WARN [SeleniumSpanExporter$1.lambda$export$3] - {"traceId": "3d0204b11dfbf942f09ecdce9454908c","eventTime": 1667226574565691599,"eventName": "HTTP request execution complete","attributes": {"http.client_ip": "10.6.3.75","http.flavor": 1,"http.handler_class": "org.openqa.selenium.grid.sessionqueue.local.LocalNewSessionQueue","http.host": "<URL TO HUB>","http.method": "POST","http.request_content_length": "2109","http.scheme": "HTTP","http.status_code": 500,"http.target": "\u002fsession","http.user_agent": "selenium\u002f4.5.0 (ruby linux)"}}

15:30:04.619 WARN [SeleniumSpanExporter$1.lambda$export$3] - {"traceId": "e615ad3ff79734da6518acf71481b829","eventTime": 1667226604618668712,"eventName": "HTTP request execution complete","attributes": {"http.client_ip": "10.6.3.75","http.flavor": 1,"http.handler_class": "org.openqa.selenium.grid.sessionqueue.local.LocalNewSessionQueue","http.host": "<URL TO HUB>","http.method": "POST","http.request_content_length": "2109","http.scheme": "HTTP","http.status_code": 500,"http.target": "\u002fsession","http.user_agent": "selenium\u002f4.5.0 (ruby linux)"}}

****** LOGS AT DRIVER OUTPUT

Could not start a new session. New session request timed out 
      Host info: host: 'selenium-hub-dff49c44-lf86x', ip: '10.57.1.77'
      Build info: version: '4.5.3', revision: '4b786a1e430'
      System info: os.name: 'Linux', os.arch: 'amd64', os.version: '5.4.0-88-generic', java.version: '11.0.16'
      Driver info: driver.version: unknown (Selenium::WebDriver::Error::SessionNotCreatedError)
      [remote server] org.openqa.selenium.grid.sessionqueue.local.LocalNewSessionQueue(LocalNewSessionQueue.java):202:in `addToQueue'
      [remote server] org.openqa.selenium.grid.sessionqueue.NewSessionQueue(NewSessionQueue.java):69:in `lambda$new$0'
      [remote server] org.openqa.selenium.remote.http.Route$TemplatizedRoute(Route.java):192:in `handle'
      [remote server] org.openqa.selenium.remote.http.Route(Route.java):68:in `execute'
      [remote server] org.openqa.selenium.remote.http.Route$CombinedRoute(Route.java):336:in `handle'
      [remote server] org.openqa.selenium.remote.http.Route(Route.java):68:in `execute'
      [remote server] org.openqa.selenium.grid.sessionqueue.NewSessionQueue(NewSessionQueue.java):123:in `execute'
      [remote server] org.openqa.selenium.remote.tracing.SpanWrappedHttpHandler(SpanWrappedHttpHandler.java):86:in `execute'
      [remote server] org.openqa.selenium.remote.http.Filter$1(Filter.java):64:in `execute'
      [remote server] org.openqa.selenium.remote.http.Route$CombinedRoute(Route.java):336:in `handle'
      [remote server] org.openqa.selenium.remote.http.Route(Route.java):68:in `execute'
      [remote server] org.openqa.selenium.grid.router.Router(Router.java):91:in `execute'
      [remote server] org.openqa.selenium.grid.web.EnsureSpecCompliantResponseHeaders(EnsureSpecCompliantResponseHeaders.java):34:in `lambda$apply$0'
      [remote server] org.openqa.selenium.remote.http.Filter$1(Filter.java):64:in `execute'
      [remote server] org.openqa.selenium.remote.http.Route$CombinedRoute(Route.java):336:in `handle'
      [remote server] org.openqa.selenium.remote.http.Route(Route.java):68:in `execute'
      [remote server] org.openqa.selenium.remote.http.Route$CombinedRoute(Route.java):336:in `handle'
      [remote server] org.openqa.selenium.remote.http.Route(Route.java):68:in `execute'
      [remote server] org.openqa.selenium.remote.AddWebDriverSpecHeaders(AddWebDriverSpecHeaders.java):35:in `lambda$apply$0'
      [remote server] org.openqa.selenium.remote.ErrorFilter(ErrorFilter.java):44:in `lambda$apply$0'
      [remote server] org.openqa.selenium.remote.http.Filter$1(Filter.java):64:in `execute'
      [remote server] org.openqa.selenium.remote.ErrorFilter(ErrorFilter.java):44:in `lambda$apply$0'
      [remote server] org.openqa.selenium.remote.http.Filter$1(Filter.java):64:in `execute'
      [remote server] org.openqa.selenium.netty.server.SeleniumHandler(SeleniumHandler.java):44:in `lambda$channelRead0$0'
      [remote server] java.util.concurrent.Executors$RunnableAdapter(Executors.java):515:in `call'
      [remote server] java.util.concurrent.FutureTask(FutureTask.java):264:in `run'
      [remote server] java.util.concurrent.ThreadPoolExecutor(ThreadPoolExecutor.java):1128:in `runWorker'
      [remote server] java.util.concurrent.ThreadPoolExecutor$Worker(ThreadPoolExecutor.java):628:in `run'
      [remote server] java.lang.Thread(Thread.java):829:in `run'
      <internal:kernel>:90:in `tap'
      ./features/step_definitions/personal_steps.rb:29:in `logga_in_personal'
      ./features/step_definitions/personal_admin.rb:6:in `"att jag är inloggad som Nationell administratör"'
      features/flode_webbtidbokstjanst.feature:18:in `att jag är inloggad som Nationell administratör'

Operating System

Windows 11 and Kubernetes

Docker Selenium version (tag)

4.5.3 both nodes and hub

parhedberg commented 1 year ago

Still seeing this in 4.6.0 hub and nodes.

diemol commented 1 year ago

What the log shows is that the session request was sitting for a long time in the queue and it expired. Nothing possible related to certificates is present in the logs.

Maybe check how long sessions are running and if some of them are getting stuck using resources.

parhedberg commented 1 year ago

I am curius about that "http.status_code: 500" that I see in the logs.. Is that what happens when it is waiting to long?

There is no other sessions running when this is happening.. All that is visible in sessions page is a queue of 1 session. No active ones.

I also did setup Edge nodes, just to test but same problem there. And still only issues with the nodes running "edited images" as in added certificates.

What session timeouts should i test with?

diemol commented 1 year ago

The error comes from the queue. It responds with a 500 when the session was not created for any reason. Maybe too many sessions in parallel? What if you run them sequentially, then with two threads, and so on?

parhedberg commented 1 year ago

They are running seqentially from start. There are no more than 1 concurrent session, and actually no more than one node active at the time.

I have tried upping the possible concurrent browsers on the nodes, to see if the queued session tries to start a new one, but it does not.

I have also tried shortening the session timeout to 10s, but all that does is "failing faster" at starting a new session.

It is like the node does not get the request. As if communications between hub and pod stops.. In the nodes logs i dont see any requests either...

To test if it it the node that freezes i open a terminal on the node and it is responsive, so the node it-self has not died. And curling to it from the hub-node also gets a response that looks ok.

diemol commented 1 year ago

Would it be possible to use only the "homemade" nodes? It is probably hard to track the issue when they are mixed.

parhedberg commented 1 year ago

The tests I run when I try this is only requesting sessions with the two "homemade" nodes.. If i run other tests that requests sessions at the "vanilla" nodes, there is no issue. I could probably for testing pourposes kill all the other nodes and only have the two "homemade" nodes active.

Will see if I can do that easily..

parhedberg commented 1 year ago

Tried now with only 2 "homemade" nodes active on the hub. The tests alternate between these two, and works well at first. But after the test has made a logout (that includes a page.driver.quit command) it is not able to recover and start another session on that node.

Now I also see this in the logs for the hub:

16:38:37.770 WARN [SeleniumSpanExporter$1.lambda$export$3] - {"traceId": "48284a7dbc098bcf5c76ce11878a5206","eventTime": 1667835517768345486,"eventName": "exception","attributes": {"exception.message": "Unable to execute request for an existing session: Unable to find session with ID: f4906669f1f9cb7c6f15d191d7816d9d\nBuild info: version: '4.6.0', revision: '79f1c02ae20'\nSystem info: os.name: 'Linux', os.arch: 'amd64', os.version: '5.4.0-88-generic', java.version: '11.0.16'\nDriver info: driver.version: unknown","exception.stacktrace": "org.openqa.selenium.NoSuchSessionException: Unable to find session with ID: f4906669f1f9cb7c6f15d191d7816d9d\nBuild info: version: '4.6.0', revision: '79f1c02ae20'\nSystem info: os.name: 'Linux', os.arch: 'amd64', os.version: '5.4.0-88-generic', java.version: '11.0.16'\nDriver info: driver.version: unknown\n\tat org.openqa.selenium.grid.sessionmap.local.LocalSessionMap.get(LocalSessionMap.java:129)\n\tat org.openqa.selenium.grid.router.HandleSession.lambda$loadSessionId$4(HandleSession.java:159)\n\tat io.opentelemetry.context.Context.lambda$wrap$2(Context.java:224)\n\tat org.openqa.selenium.grid.router.HandleSession.execute(HandleSession.java:122)\n\tat org.openqa.selenium.remote.http.Route$PredicatedRoute.handle(Route.java:373)\n\tat org.openqa.selenium.remote.http.Route.execute(Route.java:68)\n\tat org.openqa.selenium.remote.http.Route$CombinedRoute.handle(Route.java:336)\n\tat org.openqa.selenium.remote.http.Route.execute(Route.java:68)\n\tat org.openqa.selenium.grid.router.Router.execute(Router.java:91)\n\tat org.openqa.selenium.grid.web.EnsureSpecCompliantResponseHeaders.lambda$apply$0(EnsureSpecCompliantResponseHeaders.java:34)\n\tat org.openqa.selenium.remote.http.Filter$1.execute(Filter.java:64)\n\tat org.openqa.selenium.remote.http.Route$CombinedRoute.handle(Route.java:336)\n\tat org.openqa.selenium.remote.http.Route.execute(Route.java:68)\n\tat org.openqa.selenium.remote.http.Route$CombinedRoute.handle(Route.java:336)\n\tat org.openqa.sele nium.remote.http.Route.execute(Route.java:68)\n\tat org.openqa.selenium.remote.AddWebDriverSpecHeaders.lambda$apply$0(AddWebDriverSpecHeaders.java:35)\n\tat org.openqa.selenium.remote.ErrorFilter.lambda$apply$0(ErrorFilter.java:44)\n\tat org.openqa.selenium.remote.http.Filter$1.execute(Filter.java:64)\n\tat org.openqa.selenium.remote.ErrorFilter.lambda$apply$0(ErrorFilter.java:44)\n\tat org.openqa.selenium.remote.http.Filter$1.execute(Filter.java:64)\n\tat org.openqa.selenium.netty.server.SeleniumHandler.lambda$channelRead0$0(SeleniumHandler.java:44)\n\tat java.base\u002fjava.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)\n\tat java.base\u002fjava.util.concurrent.FutureTask.run(FutureTask.java:264)\n\tat java.base\u002fjava.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)\n\tat java.base\u002fjava.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)\n\tat java.base\u002fjava.lang.Thread.run(Thread.java:829)\n","exception.type": "org.openqa.selenium.NoSuchSessionException","http.client_ip": "10.6.3.5","http.flavor": 1,"http.handler_class": "org.openqa.selenium.grid.router.HandleSession","http.host": "selenium-hub.tnmt.nordicmedtest.se","http.method": "DELETE","http.scheme": "HTTP","http.target": "\u002fsession\u002ff4906669f1f9cb7c6f15d191d7816d9d\u002fcookie","http.user_agent": "selenium\u002f4.6.1 (ruby linux)","session.id": "f4906669f1f9cb7c6f15d191d7816d9d"}}

I am not sure if I saw that earlier, but i get one of these for each of the old 500 errors..

parhedberg commented 1 year ago

I think that I have identified the problem.

Before I did set the capabilites for the two "homemade" nodes to "personal1" and "personal2".

When i change them to this:

"networkname:applicationName" => "personal-bjorn",
"nodename:applicationName" => "personal-bjorn"

and

"networkname:applicationName" => "personal-karin",
"nodename:applicationName" => "personal-karin"

I have no more issues...

diemol commented 1 year ago

The error from above means that in your test you are trying to access a session that does not exist anymore, probably the one you just quit is having some sort of retry on the client side?

I am glad you are not having more issues, however, I did not get how it got solved, just by changing a custom capability value? Maybe you can please explain that a bit in case someone bumps into a similar issue and they can learn from your solution.

parhedberg commented 1 year ago

Yes.. Changing the custom capability when registering the node did the trick..

I went from:

- name: SE_NODE_STEREOTYPE
value: "{\"browserName\":\"chrome\",\"platformName\": \"Linux\", \"networkname:applicationName\":\"personal1\", \"nodename:applicationName\":\"personal1\" }"              

to:

- name: SE_NODE_STEREOTYPE
value: "{\"browserName\":\"chrome\",\"platformName\": \"Linux\", \"networkname:applicationName\":\"personal-kerstin\", \"nodename:applicationName\":\"personal-kerstin\" }"     

and now it works...

I could not really believe that this was the only thing, so I tried changing the edge browsers first, and changing them fixed it. So i went ahead and tried also the chrome browsers. I first thought that i had changed something else, or that the redeploy only made it work, but to test that i did a full re-deploy of the chrome nodes before i changed their names, and that did not solve it. So it is clearly the capability name change that did it.

diemol commented 1 year ago

OK, thanks for the explanation. I just checked the code and I cannot see why that was the issue. But happy you got it working. I will close this issue then.