Closed KathrynBaker closed 2 weeks ago
On INES I killed just the external gateway.exe process, but on restart it did not serve PVs externally. When I looked closely at the log, it had bound to the wrong network interface: it had bound to loopback, as the block gateway should. I had to restart all of the IBEX server to get this to work. The gateway is managed by procServ; I don't know whether there is some interference between the external and block gateways that causes their settings to be merged in some cases.
We tried a full restart of the NGEM PC earlier on, mid-afternoon (which would have restarted IBEX), but this still did not clear things up. Not sure if this is significant.
There are gateways on both the NGEM PC and NDXINES, and these gateways only handle incoming connections. Restarting the NGEM PC would have cleared up any issue with NDXINES viewing data on the NGEM PC, but in this case it was the NGEM PC needing to view run numbers on NDXINES, so only the gateway on NDXINES itself was involved.
[Fri May 10 12:30:12 2024] @@@ Restarting child "GWEXT"
[Fri May 10 12:30:16 2024] EPICS_CA_ADDR_LIST=127.255.255.255
[Fri May 10 12:30:16 2024] EPICS_CAS_INTF_ADDR_LIST=127.0.0.1
[Fri May 10 12:30:16 2024] EPICS_CAS_IGNORE_ADDR_LIST=127.0.0.1
[Fri May 10 12:30:16 2024] EPICS_CAS_BEACON_ADDR_LIST=Not specified
[Fri May 10 12:30:16 2024] Statistics PV prefix is IN:INES:CS:GATEWAY:EXTERNAL
[Wed Apr 24 10:37:27 2024] @@@ Restarting child "GWEXT"
[Wed Apr 24 10:37:34 2024] EPICS_CA_ADDR_LIST=127.255.255.255
[Wed Apr 24 10:37:34 2024] EPICS_CAS_INTF_ADDR_LIST=130.246.54.235
[Wed Apr 24 10:37:34 2024] EPICS_CAS_IGNORE_ADDR_LIST=130.246.54.235
[Wed Apr 24 10:37:34 2024] EPICS_CAS_BEACON_ADDR_LIST=130.246.55.255
[Wed Apr 24 10:37:34 2024] Statistics PV prefix is IN:INES:CS:GATEWAY:EXTERNAL
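The difference between the two restarts above (loopback on 10 May, the instrument address on 24 April) could be checked mechanically rather than by eye. The following is only a minimal sketch: it assumes the log lines keep the `[timestamp] text` layout shown above, and the function name `find_loopback_restarts` is hypothetical, not part of IBEX or procServ.

```python
import ipaddress
import re

# Matches "[Fri May 10 12:30:12 2024] rest of line" as seen in the excerpts above.
LINE_RE = re.compile(r"^\[(?P<ts>[^\]]+)\]\s+(?P<body>.*)$")

# Shortened copy of the two restarts quoted in this ticket.
SAMPLE_LOG = """\
[Fri May 10 12:30:12 2024] @@@ Restarting child "GWEXT"
[Fri May 10 12:30:16 2024] EPICS_CAS_INTF_ADDR_LIST=127.0.0.1
[Wed Apr 24 10:37:27 2024] @@@ Restarting child "GWEXT"
[Wed Apr 24 10:37:34 2024] EPICS_CAS_INTF_ADDR_LIST=130.246.54.235
"""


def find_loopback_restarts(log_text, child="GWEXT"):
    """Return timestamps of restarts of `child` that bound to a loopback address."""
    flagged = []
    current_restart = None  # timestamp of the restart we are currently inside
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        body = match.group("body")
        if body.startswith("@@@ Restarting child"):
            # Only track restarts of the child we were asked about.
            current_restart = match.group("ts") if f'"{child}"' in body else None
        elif current_restart and body.startswith("EPICS_CAS_INTF_ADDR_LIST="):
            value = body.split("=", 1)[1].strip()
            try:
                addresses = [ipaddress.ip_address(a) for a in value.split()]
            except ValueError:
                continue  # not a plain address list, e.g. "Not specified"
            if any(a.is_loopback for a in addresses):
                flagged.append(current_restart)
    return flagged


if __name__ == "__main__":
    for timestamp in find_loopback_restarts(SAMPLE_LOG):
        print("External gateway bound to loopback after restart at", timestamp)
```

Run against the real procServ log, a check like this might help spot affected instruments before anyone reports missing PVs.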
EPICS_CAS_INTF_ADDR_LIST gets set incorrectly: basically, the external gateway starts up but uses the block gateway settings. This is set as an argument to gateway.exe, but EPICS_CAS_BEACON_ADDR_LIST is just an inherited environment variable that is set just before spawning each process. I therefore conclude that this is a race condition where sometimes the two gateway processes are spawned too quickly and interfere with each other. I propose a simple solution: add a short delay between the two spawns, so that the first fork() will have happened before the second procServ is started.
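As an illustration of the proposed delay, here is a minimal sketch in Python. It is not the actual IBEX start-up script. The child name `GWBLOCK`, the stand-in child command, the settle time, and the assumption that the block gateway leaves the beacon variable unset are all hypothetical. The sketch also gives each child its own copy of the environment, which goes beyond the delay proposed above and is shown only as a possible extra safeguard.

```python
import os
import subprocess
import sys
import time

VAR = "EPICS_CAS_BEACON_ADDR_LIST"

# Stand-in for a gateway process: it just reports the value it inherited.
CHILD_CODE = (
    "import os, sys; "
    "print(sys.argv[1], os.environ.get('EPICS_CAS_BEACON_ADDR_LIST', 'Not specified'))"
)


def launch(name, beacon_addr, settle_seconds):
    """Spawn one child with a private environment copy, then wait briefly."""
    env = os.environ.copy()  # private copy, so later changes cannot leak in
    if beacon_addr is None:
        env.pop(VAR, None)  # assumption: this child should not see the variable
    else:
        env[VAR] = beacon_addr
    proc = subprocess.Popen(
        [sys.executable, "-c", CHILD_CODE, name],
        env=env,
        stdout=subprocess.PIPE,
        text=True,
    )
    # The short delay proposed above: let this child finish starting
    # before the next one is spawned.
    time.sleep(settle_seconds)
    return proc


def launch_gateways(settle_seconds=0.5):
    """Start the external child, wait, then start the block child."""
    external = launch("GWEXT", "130.246.55.255", settle_seconds)
    block = launch("GWBLOCK", None, 0)  # hypothetical name for the block gateway
    return [p.communicate()[0].strip() for p in (external, block)]


if __name__ == "__main__":
    for report in launch_gateways():
        print(report)
```

Whether a fixed delay is enough depends on how long the first spawn really takes on a loaded instrument PC, so the settle time here should be treated as a tunable guess.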
Where?
Multiple instruments have seen this, some with impacts on other local devices, where the PVs seem to be unavailable from other instruments.
How?
We do not know how this came about, but too many instruments have seen it during the most recent cycle, starting with SURF and including INES and a number of others (not always noticed by anyone other than ourselves). Currently it is usually resolved by restarting the IBEX server on the instrument, but this can take a number of tries.
Reproducible?
No
Acceptance criteria
How to Test
verbose instructions for reviewer to test changes (Add before making a PR)
time in planning 01:55 23/5/24