epics-base / pvxs

PVA protocol client/server library and utilities.
https://mdavidsaver.github.io/pvxs/

server: correctly adjudicate collision bind() of specific port #82

Closed mdavidsaver closed 3 weeks ago

mdavidsaver commented 1 month ago

Attempts to address #81.

On Linux (at least), SO_REUSEADDR allows a new listener to bind while an existing socket is in FIN-WAIT. Apparently this allows any number of sockets to bind(), but only one listen() to succeed.

Further, on Linux there is a known documented race condition which can result in all listen() failing. It isn't clear how to handle this case without a potentially infinite loop, so ignore it. If this happens, then eg. no PVA server will get port 5075.

So when probing for another listener, it is necessary to enter the listening state. When this fails, the socket is no longer usable for another bind(), so it is necessary to allocate another for the next attempt.
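The probing approach described above can be sketched in Python. This is only an illustration of the idea, not the code in this PR: the helper name `listen_on` and the fallback to an OS-chosen port (port 0) are assumptions made for the example.

```python
import errno
import socket

def listen_on(host, port, backlog=4):
    """Return a listening TCP socket on (host, port), or on an
    OS-chosen port if the requested one is already taken.
    Hypothetical helper, for illustration only."""
    for candidate in (port, 0):  # port 0 asks the OS for any free port
        # A fresh socket per attempt: one that failed bind()/listen()
        # is not reused for the next try.
        sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
            sock.bind((host, candidate))
            # With SO_REUSEADDR the collision may only surface here.
            sock.listen(backlog)
            return sock
        except OSError as exc:
            sock.close()
            if exc.errno != errno.EADDRINUSE or candidate == 0:
                raise
    raise AssertionError("unreachable")
```

The point is that success is only declared after listen() returns, and each retry starts from a newly allocated socket.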

mdavidsaver commented 1 month ago

The OSX build CI failures will be resolved by https://github.com/epics-base/setuptools_dso/pull/35

anjohnson commented 1 month ago

It looks like this is only for stream sockets and TCP, so if no server gets port 5075 that won't prevent UDP search packets from being received and distributed to other servers via the localhost loopback. Is that correct?

Have you given any thought to how much work might be needed for a server to accept its sockets from inetd (stdin/stdout) or systemd via sd_listen_fds()? That might be a useful option for embedded servers, although socket activation seems less likely to make sense for IOCs at least.

mdavidsaver commented 1 month ago

It looks like this is only for stream sockets and TCP ...

Correct. As I understand it, this laziness of bind() is specific to TCP sockets where SO_REUSEADDR is set (and so specific to *nix). eg.


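from socket import socket, AF_INET, SOCK_STREAM, SOL_SOCKET, SO_REUSEADDR

# Without SO_REUSEADDR: the second bind() fails immediately.
S1=socket(AF_INET, SOCK_STREAM)
S2=socket(AF_INET, SOCK_STREAM)
S1.bind(('127.0.0.1', 5000))
S2.bind(('127.0.0.1', 5000)) # fails!  (EADDRINUSE)

# With SO_REUSEADDR: both bind() calls succeed, the collision only shows up at listen().
S1=socket(AF_INET, SOCK_STREAM)
S2=socket(AF_INET, SOCK_STREAM)
S1.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
S2.setsockopt(SOL_SOCKET, SO_REUSEADDR, 1)
S1.bind(('127.0.0.1', 5000))
S2.bind(('127.0.0.1', 5000)) # succeeds!
S1.listen(4)
S2.listen(4) # fails! (EADDRINUSE)

mdavidsaver commented 1 month ago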

Have you given any thought to ... sd_listen_fds()?

No, not really. It seems like a lot of work for not much benefit, with a high probability of mis-configured .socket files (plural!) causing chaos.

What I have thought about is calling sd_notify() with IOC lifecycle changes. Primarily using initHookAfterIocRunning to emit READY=1, so that dependent units don't race CA/PVA server startup.
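For reference, the readiness notification itself is small. Below is a rough Python sketch of what sending READY=1 amounts to on the wire, assuming the standard systemd convention of a datagram sent to the socket named by the NOTIFY_SOCKET environment variable. The function name `notify_ready` is made up for this example; an IOC would call sd_notify() from C in an init hook instead.

```python
import os
import socket

def notify_ready(env=os.environ):
    """Send READY=1 to the service manager, if one is listening.
    Returns False when NOTIFY_SOCKET is not set (not run under systemd).
    Hypothetical helper, for illustration only."""
    path = env.get("NOTIFY_SOCKET")
    if not path:
        return False
    if path.startswith("@"):
        # A leading '@' denotes a Linux abstract-namespace socket.
        path = "\0" + path[1:]
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.sendto(b"READY=1", path)
    return True
```

Dependent units would then only be started once this message has been sent, which is the ordering guarantee wanted here.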

mdavidsaver commented 1 month ago

... of course usage of sd_* is moot so long as procServ is involved.

mdavidsaver commented 1 month ago

fyi. my attempt at provoking this race was not successful. I guess a shell loop is too slow with so many fork()s.

# Quote the delimiter so the shell leaves $(P=) alone for IOC macro expansion
cat > tick.db << 'EOF'
record(calc, "$(P=)cnt") {
  field(INPA, "$(P=)cnt")
  field(CALC, "A+1")
  field(SCAN, "1 second")
}
EOF

for n in `seq 1 100`; do sh -c "softIocPVX -m P=$n: -d tick.db -S </dev/null &" ; done

Followed by

for n in `seq 1 100`; do echo $n:cnt; done | xargs pvxget

Will complete without timeout if all PVA servers started.

cleanup

killall softIocPVX

mdavidsaver commented 1 month ago

fyi. my attempt at provoking this race was not successful ...

It can, sometimes, eventually. Looping through iocBomb.sh gets a timeout on one or two PVs within a couple of minutes on my laptop without this PR. With this PR applied, I eventually got bored.

while sh iocBomb.sh; do date; done

I am satisfied with this result.