SeleniumHQ / docker-selenium

Provides a simple way to run Selenium Grid with Chrome, Firefox, and Edge using Docker, making it easier to perform browser automation
http://www.selenium.dev/docker-selenium/

[🐛 Bug]: Cannot connect through NoVNC #2045

Closed Earlopain closed 11 months ago

Earlopain commented 11 months ago

What happened?

When upgrading standalone-chrome from 4.15.0-20231108 to 4.15.0-20231110, the NoVNC web interface is no longer able to connect. The issue exists on the latest version, 4.15.0-20231129, as well; I only tested to find the version in which it started.

It's perpetually stuck on the "Connecting..." screen; the websocket being opened is not receiving any data.

The only difference between these two versions is the upgrade from Focal to Jammy in PR #1923

Command used to start Selenium Grid with Docker (or Kubernetes)

version: "3"

services:
  selenium:
    image: selenium/standalone-chrome:4.15.0-20231110
    environment:
      - SE_VNC_NO_PASSWORD=1
    shm_size: 2gb
    ports:
      - ${EXPOSED_VNC_PORT:-7900}:7900

Relevant log output

None that I can see

Operating System

Arch Linux

Docker Selenium version (tag or chart version)

4.15.0-20231110

github-actions[bot] commented 11 months ago

@Earlopain, thank you for creating this issue. We will troubleshoot it as soon as we can.


Info for maintainers

Triage this issue by using labels.

If information is missing, add a helpful comment and then I-issue-template label.

If the issue is a question, add the I-question label.

If the issue is valid but there is no time to troubleshoot it, consider adding the help wanted label.

If the issue requires changes or fixes from an external project (e.g., ChromeDriver, GeckoDriver, MSEdgeDriver, W3C), add the applicable G-* label, and it will provide the correct link and auto-close the issue.

After troubleshooting the issue, please add the R-awaiting answer label.

Thank you!

VietND96 commented 11 months ago

Hi @Earlopain, I used the same docker compose file that you shared but could not reproduce the problem. The image tag I used is 4.15.0-20231129. Can you check in DevTools whether there is any request error?

Earlopain commented 11 months ago

Hi there,

no error in the console. The websocket is being opened but is not receiving any data. The first two packets seem to be some kind of ping/pong type of deal, but again, that's just not happening.

I did just now test on another machine, Windows this time, and I had no trouble getting it to work there. I tested with Firefox running on the host, and with Firefox running through WSL as well because why not. Both worked no problem.

I'm going to set up a fresh linux vm and check how it behaves there.

zhaoyaohui0 commented 11 months ago

I have encountered the same problem as you, but I encountered it on K8S. My VNC interface is blank, but my request can run normally. This is the same from the previous version 20231110 to this version 20231129.

zhaoyaohui0 commented 11 months ago

I don't understand why the container shows the VNC port as 7900, but the service opens port 6900:5900. Could you please explain it, @vietnd96?

Earlopain commented 11 months ago

Hi @vietnd96,

I have installed docker in a fresh linux vm with https://endeavouros.com/ installed. After setting up docker with the following commands and starting the selenium image I observe the same symptoms as in my initial report:

yay -S docker
yay -S docker-compose
sudo systemctl start docker

It may have something to do with Arch/EndeavourOS being a rolling release and as such always having the latest versions, or it may be Linux specific. I'm not sure which host OS you were testing with.

For the record, here are the docker/compose versions in use:

$ docker compose version
Docker Compose version 2.23.3
$ docker -v
Docker version 24.0.7, build afdd53b4e3

VietND96 commented 11 months ago

I don't understand why the container shows the VNC port as 7900, but the service opens port 6900:5900. Could you please explain it, @vietnd96?

Hi @zhaoyaohui0, as I understand it, 5900 is the container port for VNC (you can connect with any tool that supports the VNC protocol, e.g. VNC Viewer, Remmina, TigerVNC Viewer, etc.), and 7900 is the container port for NoVNC (which streams over a websocket for the live preview in the Grid UI).

diemol commented 11 months ago

@Earlopain @zhaoyaohui0 where is this failing? Which environments? The report is very ambiguous.

Earlopain commented 11 months ago

@diemol I have provided additional information in my follow-up comment, is that not enough? I unfortunately don't have more than "install this OS, set up docker there and try again". How would I go about gathering more useful information for you, or what are you looking for?

diemol commented 11 months ago

You also mention Kubernetes at the beginning of the issue. Hence my question.

Also, how popular is that OS? I mean, we try to provide something that works on most OSes, but if it fails on a few and the user base is small, we won't troubleshoot that because we are a small team, and we try to focus on the common use cases.

Having said that, do you see the same with Ubuntu? macOS? Windows?

Earlopain commented 11 months ago

Kubernetes was the other person, I'm just using it through docker. EndeavourOS is Arch with a GUI installer; it ships exactly the same software plus some small GUI applications on top. I used it because it is convenient and easy to set up, unlike setting up Arch on your own.

I did test on Windows and had no trouble there. I don't own an Apple device so nothing for me to do there.

I can try out Ubuntu in a bit when I'm at my home PC. I will install latest docker versions, see how that turns out and let you know then.

Earlopain commented 11 months ago

I gave it a try with Ubuntu 23.10 and it just worked as well.

Ended up installing plain Arch instead of EndeavourOS just to make sure and it doesn't work with that.

Here are some other findings: I enabled stdout logging for the other services and, as expected, NoVNC is trying to establish a connection. I accidentally left it open while testing and after a whopping 2.5 minutes it actually managed to connect.

selenium-1  | 172.23.0.1 - - [05/Dec/2023 16:32:44] 172.23.0.1: Plain non-SSL (ws://) WebSocket connection
selenium-1  | 172.23.0.1 - - [05/Dec/2023 16:32:44] 172.23.0.1: Path: '/websockify'
selenium-1  | 172.23.0.1 - - [05/Dec/2023 16:32:44] connecting to: localhost:5900
selenium-1  | 05/12/2023 16:35:18 Got connection from client 127.0.0.1

After establishing a connection once, future connections still take the 2.5 minutes to establish.

It doesn't seem to have anything to do with NoVNC. I exposed port 5900, wanting to connect with a local client, and that takes this long as well. I did a few runs, and the duration seems consistent. For 5 runs, it always took 154 seconds.

I don't know what one would do with this information though. This all seems very nonsensical to me, especially considering it works with other OSes and it's just docker in the end.

VietND96 commented 11 months ago

I have encountered the same problem as you, but I encountered it on K8S. My VNC interface is blank, but my request can run normally. This is the same from the previous version 20231110 to this version 20231129.

For K8s, the URL you are using to access the Grid UI uses the http:// scheme, right? If so, can you try https:// instead (ignore the insecure warning, if any)? The live preview should be accessible that way.

Earlopain commented 11 months ago

I've started reducing the docker image and with a majority of the selenium things removed I still run into this issue.

At this point I'm almost certain it has nothing to do with anything in this repo, so feel free to close this issue, from my side at least. I'll continue to investigate myself and make the report for this at the proper place, if I manage to actually find it.

diemol commented 11 months ago

Thank you for your troubleshooting. I will close this based on your comments but feel free to add your findings in additional comments.

Earlopain commented 11 months ago

I did some digging and have found the root cause. Inside the docker container ulimit -n is incredibly high for some reason. ulimit -n => 1073741816

This code in libvncserver enumerates every file descriptor up to that limit, taking up huge amounts of CPU time. I hadn't noticed the CPU spinning before. https://github.com/LibVNC/libvncserver/blob/784cccbb724517ee4e36d9938f93b9ee168a29e7/src/libvncserver/sockets.c#L508-L527
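
To put rough numbers on that: assuming the connection delay is proportional to the descriptor limit (an assumption, not something verified against the libvncserver code), the 154 seconds measured at a limit of 1073741816 work out to roughly 143 ns per descriptor. A small sketch of that arithmetic, where `estimate_scan_ms` is only an illustrative helper name:

```shell
#!/usr/bin/env bash
# Illustrative helper (not part of libvncserver or docker-selenium).
# Assumes the delay scales linearly with RLIMIT_NOFILE, using the numbers
# reported in this thread: 154 seconds at a limit of 1073741816.
estimate_scan_ms() {
  awk -v limit="$1" 'BEGIN { printf "%.1f\n", limit * 154 / 1073741816 * 1000 }'
}

# Container limit, the host limit mentioned in this thread, and the workaround value.
for limit in 1073741816 524288 65536; do
  echo "nofile=$limit -> ~$(estimate_scan_ms "$limit") ms"
done
```

Under that assumption, a limit of 65536 puts the estimated delay at about 9 ms instead of 2.5 minutes.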

The temporary solution is quite simple: set the ulimit for docker manually:

version: "3"

services:
  selenium:
    image: selenium/standalone-chrome:4.15.0-20231110
    environment:
      - SE_VNC_NO_PASSWORD=1
    shm_size: 2gb
    ports:
      - ${EXPOSED_VNC_PORT:-7900}:7900
    ulimits:
      nofile:
        soft: 65536
        hard: 65536

I don't know why these limits would differ from the host; the documentation states they are inherited. My host value is just a measly 524288, but it is what it is.

As for why it worked with focal but not with jammy, perhaps this codepath wasn't hit before. The limit is still high inside docker, what do I know.

Here's some prior art: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=920913

Here's a thread on the arch forum where I'm going to probably talk a bit more about this: https://bbs.archlinux.org/viewtopic.php?id=290863

Here's an issue on the docker engine repo which I think is most relevant: https://github.com/moby/moby/issues/44547

And a PR that supposedly fixes this but hasn't been part of a release yet: https://github.com/containerd/containerd/pull/8924

Here's an issue I made in libvncserver talking about the consequences of having an incredibly high RLIMIT_NOFILE: https://github.com/LibVNC/libvncserver/issues/600

diemol commented 11 months ago

Wow, great troubleshooting! Thanks for sharing.

Abdillah commented 11 months ago

I verified the fix above.

In my case, both VNC and noVNC took a very long time to connect, next to forever. On rare occasions it reached the password prompt, but it still waited afterward and then timed out.

Can we put this in the README's troubleshooting section?

Earlopain commented 11 months ago

I'm not so sure on the value of that. This only happens when distros use the prepackaged systemd unit files with very recent docker and systemd versions, which in reality not very many actually do.

Once upstream releases versions that contain a fix, this section would pretty much become obsolete. You seem to have found this through issues just fine, I think that is good enough.

VietND96 commented 11 months ago

I saw that a few Dockerfiles display a warning if ulimit -n is too high when running Docker. I also tried adding one to notify the user: https://github.com/SeleniumHQ/docker-selenium/commit/acda753acb9745935531407628eee27a503d98b4

@Earlopain, do you think a workaround like the one below would work while waiting for upstream to fix that?

[program:vnc]
priority=5
command=ulimit -n 65536 && /opt/bin/start-vnc.sh

Earlopain commented 11 months ago

The idea is there, yes. However if ulimit is already set to a lower value in the container then trying to set it to something higher will return a non-zero exit code, at least for an unprivileged user. That needs to be accounted for.

In addition, TIL that ulimit is a shell builtin and supervisord seems to only start actual binaries (so I think && would not work either). It needs to be part of the start script.

After doing both of those, it works fine for me. Nice that a workaround is being considered here (:
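
A minimal sketch of what that could look like at the top of a start script, assuming bash. The function name `cap_nofile` and the 65536 default are illustrative and not taken from the actual fix. It only ever lowers the limit, so it never hits the non-zero exit code that comes from trying to raise it as an unprivileged user:

```shell
#!/usr/bin/env bash
# Hypothetical helper: lower the open-file limit to a target value,
# but leave it alone when it is already at or below that value.
cap_nofile() {
  local target="${1:-65536}"
  local current
  current="$(ulimit -n)"   # prints the current soft limit

  # Only act when the limit is "unlimited" or numerically above the target.
  if [ "$current" = "unlimited" ] || [ "$current" -gt "$target" ]; then
    # Without -S or -H, bash sets both the soft and the hard limit.
    ulimit -n "$target"
  fi
}

cap_nofile 65536
echo "nofile limit is now $(ulimit -n)"
# The VNC server would be started after this point, inheriting the limit.
```

Because the hard limit is lowered too, an unprivileged process cannot raise it again afterwards, which is fine here since the VNC server only needs a sane upper bound.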

hirowatari commented 11 months ago

    ulimits:
      nofile:
        soft: 65536
        hard: 65536

Thank you. This fixed my issue as well. selenium/standalone-chrome:118.0 worked, but selenium/standalone-chrome:119.0 and 120 needed this fix.

Earlopain commented 11 months ago

New releases will contain a workaround, so a section in the readme for this shouldn't be needed anymore. See #2058

github-actions[bot] commented 10 months ago

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.