BOINC / boinc

Open-source software for volunteer computing and grid computing.
https://boinc.berkeley.edu
GNU Lesser General Public License v3.0

[CRITICAL] boinc daemon hangs if port TCP/6006 is not an X11 server. #3405

Open Technologov opened 4 years ago

Technologov commented 4 years ago

Describe the bug

Tilps says: I got a complete strace. BOINC happily does things for a while. Then it looks for a domain socket, doesn't find it, and tries to connect to port 6000: no answer. It looks for a different domain socket, doesn't find it, then tries port 6001. It repeats this sequence until it gets to port 6006, finds it can connect, and then hangs. 6006 is the port TensorBoard uses. It appears to be searching for an X Window session, but since there isn't one it keeps searching until it hits the TensorBoard port. So either we disable its need to find an X Window session (for whatever reason it has one), or we give it an X Window session to find, or we reconfigure TensorBoard to use a non-default port number.

In short: BOINC is searching for an X Window session, and when it finds that port 6006 is open but does not respond the way an X server would, it hangs.
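
The scan described above can be sketched roughly as follows. This is an illustration only, not BOINC's actual code: the function names are hypothetical, and the only facts taken from the thread are that display n is probed on TCP port 6000 + n and that displays 0 through 6 are tried. It shows why a successful connect() alone cannot tell an X server from any other listener, such as TensorBoard on 6006.

```python
import socket

X11_BASE_PORT = 6000  # display :n conventionally maps to TCP port 6000 + n


def x11_tcp_port(display):
    """TCP port that X display number `display` would listen on."""
    return X11_BASE_PORT + display


def accepts_tcp(host, port, timeout=1.0):
    """Return True if *anything* accepts a TCP connection on host:port.

    A True result says nothing about what is listening there.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, ...
        return False


def first_accepting_display(max_display=6, host="127.0.0.1"):
    """Walk displays 0..max_display and return the first whose port accepts."""
    for display in range(max_display + 1):
        if accepts_tcp(host, x11_tcp_port(display)):
            return display
    return None
```

On the server from this report, a scan like this would presumably stop at display 6, because TensorBoard accepts the connection on port 6006.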

Steps To Reproduce

  1. Run the boinc daemon on a CLI-only server (no GUI, no X11).
  2. Run TensorFlow with TensorBoard (which uses port TCP/6006 by default), or any other software that listens on port TCP/6006.

Steps to reproduce (BOINC side):

    # apt-get install boinc-client
    cd /var/lib/boinc-client/
    boinccmd --project_attach http://www.worldcommunitygrid.org/  $KEY
    boinccmd --set_network_mode always
    boinccmd --set_run_mode always
    boinccmd --set_gpu_mode never
    # service boinc-client restart

What actually happens?

    root@rampage-107:~# boinccmd --read_global_prefs_override
    Operation failed: read() failed
    root@rampage-107:~#

At this stage the boinc daemon is stuck, and no more work units get processed.

====================================

Expected behavior

    root@rampage-107:~# boinccmd --read_global_prefs_override
    root@rampage-107:~#

boinccmd must run without errors.

Additional context

In practice, on any CLI-only Linux server running both deep learning workloads (TensorFlow) and BOINC, boinc gets stuck after about 30 minutes or so.

This server has enough RAM and disk space, so those issues can be ruled out:

root@rampage-107:~# uptime
 00:31:18 up 44 days,  2:37, 16 users,  load average: 57.85, 59.09, 57.01

root@rampage-107:~# free -h
              total        used        free      shared  buff/cache   available
Mem:           125G         36G        882M        1.0G         88G         87G
Swap:          8.0G        100M        7.9G

root@rampage-107:~# df -h
Filesystem      Size  Used Avail Use% Mounted on
udev             63G     0   63G   0% /dev
tmpfs            13G  2.8M   13G   1% /run
/dev/sda2       916G  358G  512G  42% /
tmpfs            63G  100K   63G   1% /dev/shm
tmpfs           5.0M     0  5.0M   0% /run/lock
tmpfs            63G     0   63G   0% /sys/fs/cgroup
/dev/loop2       90M   90M     0 100% /snap/core/8039
tmpfs            13G     0   13G   0% /run/user/1000
/dev/loop0       90M   90M     0 100% /snap/core/8213
tmpfs            13G     0   13G   0% /run/user/0

-Technologov, 17.12.2019.

Tilps commented 4 years ago

Tail of the strace that led me to the diagnosis above, in case it's useful.

socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_TCP) = 7
setsockopt(7, SOL_TCP, TCP_NODELAY, [1], 4) = 0
setsockopt(7, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
connect(7, {sa_family=AF_INET, sin_port=htons(6005), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 ECONNREFUSED (Connection refused)
close(7)                                = 0
socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0) = 7
connect(7, {sa_family=AF_UNIX, sun_path=@"/tmp/.X11-unix/X6"}, 20) = -1 ECONNREFUSED (Connection refused)
close(7)                                = 0
socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0) = 7
getsockopt(7, SOL_SOCKET, SO_SNDBUF, [212992], [4]) = 0
connect(7, {sa_family=AF_UNIX, sun_path="/tmp/.X11-unix/X6"}, 110) = -1 ENOENT (No such file or directory)
close(7)                                = 0
stat("/etc/resolv.conf", {st_mode=S_IFREG|0644, st_size=715, ...}) = 0
openat(AT_FDCWD, "/etc/hosts", O_RDONLY|O_CLOEXEC) = 7
fstat(7, {st_mode=S_IFREG|0644, st_size=226, ...}) = 0
read(7, "127.0.0.1 localhost\n127.0.1.1 ra"..., 4096) = 226
read(7, "", 4096)                       = 0
close(7)                                = 0
socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_TCP) = 7
setsockopt(7, SOL_TCP, TCP_NODELAY, [1], 4) = 0
setsockopt(7, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
connect(7, {sa_family=AF_INET, sin_port=htons(6006), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
getpeername(7, {sa_family=AF_INET, sin_port=htons(6006), sin_addr=inet_addr("127.0.0.1")}, [124->16]) = 0
uname({sysname="Linux", nodename="rampage-107", ...}) = 0
access("/var/lib/boinc-client/.Xauthority", R_OK) = -1 ENOENT (No such file or directory)
fcntl(7, F_GETFL)                       = 0x2 (flags O_RDWR)
fcntl(7, F_SETFL, O_RDWR|O_NONBLOCK)    = 0
fcntl(7, F_SETFD, FD_CLOEXEC)           = 0
poll([{fd=7, events=POLLIN|POLLOUT}], 1, -1) = 1 ([{fd=7, revents=POLLOUT}])
writev(7, [{iov_base="l\0\v\0\0\0\0\0\0\0\0\0", iov_len=12}, {iov_base="", iov_len=0}], 2) = 12
recvfrom(7, 0x55ff4fe26060, 8, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=7, events=POLLIN}], 1, -1)    = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
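
The last few calls in the trace are an X11 connection setup: a 12-byte request is written, then the client waits for a reply with poll() and a timeout of -1, that is, forever. A minimal sketch of the same handshake with a bounded wait is shown below. It is an illustration under assumptions, not BOINC's or Xlib's code: the function name and the default timeout are hypothetical. The request bytes are copied from the writev() call above, and the accepted first reply bytes (0 = Failed, 1 = Success, 2 = Authenticate) come from the X11 protocol's connection setup.

```python
import socket

# The 12-byte setup request seen in the writev() call above:
# 'l' = little-endian byte order, protocol 11.0, no authorization data.
X11_SETUP_REQUEST = b"l\x00\x0b\x00\x00\x00\x00\x00\x00\x00\x00\x00"

# First byte of an X11 setup reply: 0 = Failed, 1 = Success, 2 = Authenticate.
X11_REPLY_CODES = (0, 1, 2)


def looks_like_x11(host, port, timeout=2.0):
    """Send an X11 setup request and wait at most `timeout` seconds for a reply.

    Returns False if the connection fails, the peer stays silent, or the
    reply does not start like an X11 setup response.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.settimeout(timeout)  # bound the wait, unlike poll(..., -1)
            sock.sendall(X11_SETUP_REQUEST)
            reply = sock.recv(8)
    except OSError:  # includes refusal and timeout
        return False
    return len(reply) >= 1 and reply[0] in X11_REPLY_CODES
```

With a check like this, a listener that never answers the setup request would be skipped after `timeout` seconds instead of blocking the caller indefinitely.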

davidpanderson commented 4 years ago

Currently the client tries ports 6000..6006. In theory X is allocated 6000..6063: https://www.iana.org/assignments/service-names-port-numbers/service-names-port-numbers.xhtml?search=x11 Since there's already a reported non-X11 use of 6003, I propose changing the client to check only 6000..6002.

BTW, does anyone know if X11-based idle detection even works? If not we may as well remove it.

Technologov commented 4 years ago

What if tomorrow (or in 5 years) someone writes a real-world server application on port TCP/6000? Will it break BOINC again? Not a good design decision. Why does the boinc daemon even try to do that? A more logical design is for a separate app / executable like boinc_gui_client to attempt to detect an X11 session, and only if the end user starts it manually. The boinc daemon should not even attempt such a check, because it is supposed to be a background service (server): it is supposed to listen on TCP sockets, not connect to them. So it can listen on both port TCP/X and a local UNIX domain socket as a server. Does that make sense from an architecture point of view?

davidpanderson commented 4 years ago

It's a bad design decision to use a port that's officially allocated to X11.

SETIguy commented 4 years ago

I would presume that there is a standard way to open an X port and, upon success, check whether an X server is running there.

If there is a service that uses an X TCP port (6000:6063) that causes such a check to hang, that service is broken.

So the question is, are we not detecting X properly, or is TensorFlow broken?

On Tue, Dec 17, 2019 at 2:02 PM David Anderson notifications@github.com wrote:

Currently the client tries ports 6000..6006. In theory X is allocated 6000..6063:

https://www.iana.org/assignments/service-names-port-numbers/service-names-port-numbers.xhtml?search=x11 Since there's already a reported non-X11 use of 6003, I propose changing the client to check only 6000..6002.

BTW, does anyone know if X11-based idle detection even works? If not we may as well remove it.


-- Eric Korpela korpela@ssl.berkeley.edu AST:7731^29u18e3

ifettich commented 3 years ago

I'm not sure I understand why the client would even bother to detect an X server... But my non-understanding aside: if it does, there should be some config data that documents the ports that would be used (or at least the fact that an X server detection is going to happen). Try out what happens right now if the ports used for detection are NOT used by another service, BUT blocked via firewall: due to the timeouts, the client will appear to hang for loooooooooong periods. Go figure that you need something like iptables -I INPUT -p tcp -s 127.0.0.1 -j ACCEPT. On a headless server somewhere in the cloud, this is NOT necessarily a default setting that you'll "just have in place anyway".

makeasnek commented 1 year ago

A bounty has been placed on this issue by the Science Commons Initiative starting at $25. The bounty will continue to increase until it is claimed. You can contribute to the bounty, check its total amount, or claim it at our repo.

makeasnek commented 9 months ago

Bounty has been increased to $75