Open Technologov opened 4 years ago
Tail of the strace that led me to the diagnosis above, in case it's useful.
socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_TCP) = 7
setsockopt(7, SOL_TCP, TCP_NODELAY, [1], 4) = 0
setsockopt(7, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
connect(7, {sa_family=AF_INET, sin_port=htons(6005), sin_addr=inet_addr("127.0.0.1")}, 16) = -1 ECONNREFUSED (Connection refused)
close(7) = 0
socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0) = 7
connect(7, {sa_family=AF_UNIX, sun_path=@"/tmp/.X11-unix/X6"}, 20) = -1 ECONNREFUSED (Connection refused)
close(7) = 0
socket(AF_UNIX, SOCK_STREAM|SOCK_CLOEXEC, 0) = 7
getsockopt(7, SOL_SOCKET, SO_SNDBUF, [212992], [4]) = 0
connect(7, {sa_family=AF_UNIX, sun_path="/tmp/.X11-unix/X6"}, 110) = -1 ENOENT (No such file or directory)
close(7) = 0
stat("/etc/resolv.conf", {st_mode=S_IFREG|0644, st_size=715, ...}) = 0
openat(AT_FDCWD, "/etc/hosts", O_RDONLY|O_CLOEXEC) = 7
fstat(7, {st_mode=S_IFREG|0644, st_size=226, ...}) = 0
read(7, "127.0.0.1 localhost\n127.0.1.1 ra"..., 4096) = 226
read(7, "", 4096) = 0
close(7) = 0
socket(AF_INET, SOCK_STREAM|SOCK_CLOEXEC, IPPROTO_TCP) = 7
setsockopt(7, SOL_TCP, TCP_NODELAY, [1], 4) = 0
setsockopt(7, SOL_SOCKET, SO_KEEPALIVE, [1], 4) = 0
connect(7, {sa_family=AF_INET, sin_port=htons(6006), sin_addr=inet_addr("127.0.0.1")}, 16) = 0
getpeername(7, {sa_family=AF_INET, sin_port=htons(6006), sin_addr=inet_addr("127.0.0.1")}, [124->16]) = 0
uname({sysname="Linux", nodename="rampage-107", ...}) = 0
access("/var/lib/boinc-client/.Xauthority", R_OK) = -1 ENOENT (No such file or directory)
fcntl(7, F_GETFL) = 0x2 (flags O_RDWR)
fcntl(7, F_SETFL, O_RDWR|O_NONBLOCK) = 0
fcntl(7, F_SETFD, FD_CLOEXEC) = 0
poll([{fd=7, events=POLLIN|POLLOUT}], 1, -1) = 1 ([{fd=7, revents=POLLOUT}])
writev(7, [{iov_base="l\0\v\0\0\0\0\0\0\0\0\0", iov_len=12}, {iov_base="", iov_len=0}], 2) = 12
recvfrom(7, 0x55ff4fe26060, 8, 0, NULL, NULL) = -1 EAGAIN (Resource temporarily unavailable)
poll([{fd=7, events=POLLIN}], 1, -1) = ? ERESTART_RESTARTBLOCK (Interrupted by signal)
Currently the client tries ports 6000..6006. In theory X is allocated 6000..6063: https://www.iana.org/assignments/service-names-port-numbers/service-names-port-numbers.xhtml?search=x11 Since there's already a reported non-X11 use of 6003, I propose changing the client to check only 6000..6002.
BTW, does anyone know if X11-based idle detection even works? If not we may as well remove it.
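For reference, the strace tail suggests that each display number N maps to three endpoints, tried in this order: an abstract UNIX socket, a filesystem UNIX socket, and TCP port 6000+N on localhost. The Python sketch below only enumerates those endpoints for a given display range. The function names and the tuple layout are made up for illustration and are not taken from the BOINC source.

```python
X11_SOCKET_DIR = "/tmp/.X11-unix"   # path seen in the strace
X11_TCP_BASE = 6000                 # display N listens on TCP 6000+N


def x11_endpoints(display):
    """Endpoints for one display, in the order visible in the strace tail."""
    path = f"{X11_SOCKET_DIR}/X{display}"
    return [
        ("unix-abstract", "\0" + path),   # shown as @"/tmp/.X11-unix/X6" by strace
        ("unix", path),
        ("tcp", ("127.0.0.1", X11_TCP_BASE + display)),
    ]


def candidate_endpoints(first_display=0, last_display=2):
    """All endpoints for displays first..last (0..2 is the range proposed above)."""
    endpoints = []
    for display in range(first_display, last_display + 1):
        endpoints.extend(x11_endpoints(display))
    return endpoints


if __name__ == "__main__":
    for kind, address in candidate_endpoints():
        print(kind, repr(address))
```

With the proposed 6000..6002 range, display 6 (and therefore TCP 6006) would never be probed.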
What if tomorrow (or in 5 years) someone writes a real-world server application that listens on TCP/6000? Will it break BOINC again? That is not a good design decision. Why does the boinc daemon even try to do that? It would be more logical for a separate executable, such as boinc_gui_client, to attempt to detect an X11 session, and only if the end user starts it manually. The boinc daemon should not attempt such a check at all: it is supposed to be a background service (a server), so it is supposed to listen on TCP sockets, not connect to them. It can listen on both TCP port X and a local UNIX domain socket as a server. Does that make sense from an architecture point of view?
It's a bad design decision to use a port that's officially allocated to X11.
I would presume that there is a standard way to open an X port and, upon success, check whether an X server is running there.
If there is a service that uses an X TCP port (6000:6063) that causes such a check to hang, that service is broken.
So the question is, are we not detecting X properly, or is TensorFlow broken?
On Tue, Dec 17, 2019 at 2:02 PM David Anderson notifications@github.com wrote:
Currently the client tries ports 6000..6006. In theory X is allocated 6000..6063:
https://www.iana.org/assignments/service-names-port-numbers/service-names-port-numbers.xhtml?search=x11 Since there's already a reported non-X11 use of 6003, I propose changing the client to check only 6000..6002.
BTW, does anyone know if X11-based idle detection even works? If not we may as well remove it.
-- Eric Korpela korpela@ssl.berkeley.edu AST:7731^29u18e3
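One way to make the check described above robust is to bound the wait for the server's reply and to validate that reply. The Python sketch below sends the same 12-byte connection setup request that appears in the strace (`l\0\v\0...`), then waits a limited time for the 8-byte reply header. The helper name `looks_like_x11` and the 2-second default timeout are assumptions for illustration. This is not how the BOINC client does it (the strace shows a `poll` with an infinite timeout).

```python
import socket
import struct

# X11 connection setup request, identical to the 12 bytes in the strace:
# byte order 'l' (little-endian), pad, protocol 11.0, no auth name/data, pad.
X11_SETUP_REQUEST = struct.pack("<BxHHHHxx", ord("l"), 11, 0, 0, 0)


def looks_like_x11(host, port, timeout=2.0):
    """Return True only if the peer answers like an X server within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(X11_SETUP_REQUEST)
            header = b""
            while len(header) < 8:
                chunk = sock.recv(8 - len(header))   # raises on timeout
                if not chunk:                        # peer closed early
                    return False
                header += chunk
    except OSError:                                  # refused, timed out, etc.
        return False
    status, major = struct.unpack("<BxH4x", header)
    if status == 2:                                  # Authenticate
        return True
    # Failed (0) or Success (1) both carry the protocol major version.
    return status in (0, 1) and major == 11


if __name__ == "__main__":
    print(looks_like_x11("127.0.0.1", 6006, timeout=1.0))
```

A service that accepts the connection but never answers, or answers with something that is not an X11 reply, makes the probe return False instead of hanging.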
I'm not sure I understand why the client would even bother to detect an X server... But my non-understanding aside: if it does, there should be some config data that documents the ports that would be used (or at least the fact that an X server detection is going to happen). Try out what happens right now if the ports used for detection are NOT used by another service, BUT blocked via firewall: due to the timeouts, the client will appear to hang for loooooooooong periods. Go figure that you need something like iptables -I INPUT -p tcp -s 127.0.0.1 -j ACCEPT. On a headless server somewhere in the cloud, this is NOT necessarily a default setting that you'll "just have in place anyway".
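To illustrate the firewall point: if packets are silently dropped, each connect attempt waits for its full timeout, so a scan over several ports needs an overall time budget as well as a per-port timeout. The Python sketch below shows one way to cap both. The function name, the parameter names and the default values are hypothetical and do not correspond to any existing BOINC setting.

```python
import socket
import time


def first_open_port(host, ports, per_port_timeout=0.5, total_budget=3.0):
    """Return the first port that accepts a TCP connection, or None.

    The whole scan stops once `total_budget` seconds have elapsed, so a
    firewall that drops packets cannot stall the caller indefinitely.
    """
    deadline = time.monotonic() + total_budget
    for port in ports:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            timeout = min(per_port_timeout, remaining)
            with socket.create_connection((host, port), timeout=timeout):
                return port
        except OSError:          # refused, unreachable, or timed out
            continue
    return None


if __name__ == "__main__":
    print(first_open_port("127.0.0.1", range(6000, 6007)))
```

An open port is not necessarily an X server (6006 in this issue is TensorBoard), so a caller would still need to validate whatever answers on it.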
A bounty has been placed on this issue by the Science Commons Initiative starting at $25. The bounty will continue to increase until it is claimed. You can contribute to the bounty, check its total amount, or claim it at our repo.
Bounty has been increased to $75
Describe the bug
Tilps says: I got a complete strace. "Boinc" happily does things for a while, then it looks for a domain socket and doesn't find it, tries to connect to port 6000 and gets no answer, looks for a different domain socket and doesn't find it, then tries port 6001. It repeats this sequence until it gets to port 6006, finds it can connect, and then hangs. 6006 is the port TensorBoard uses. It appears to be searching for an X Window session, but since there isn't one it keeps searching until it hits the TensorBoard port. So either we disable its need to find an X Window session, or have an X Window session for it to find, or reconfigure TensorBoard to use a non-default port number.
BOINC is searching for an X Window session. When it finds port 6006 open but not responding the way an X server would, it hangs.
Steps To Reproduce
At this stage the "boinc" daemon gets stuck, and no more work units are processed.
====================================
System Information
Additional context
In practice, on any CLI-only Linux server running deep learning (TensorFlow) alongside BOINC, boinc gets stuck after about 30 minutes.
This server has enough RAM and disk space, so those issues can be ruled out:
-Technologov, 17.12.2019.