ISISComputingGroup / IBEX

Top level repository for IBEX stories

Gateways: why are they failing? [Timebox: 2.5 days] #8371

Closed KathrynBaker closed 2 weeks ago

KathrynBaker commented 1 month ago

Where?

Multiple instruments have seen this. In some cases other local devices are affected, with their PVs apparently unavailable from other instruments.

How?

We do not know how this came about, but too many instruments have seen this during the most recent cycle, starting with SURF and including INES and a number of others (not always noticed by anyone other than ourselves). Currently it is usually resolved by restarting the IBEX server on the instrument, but this can take several attempts.

Reproducible?

No

Acceptance criteria

How to Test

verbose instructions for reviewer to test changes (Add before making a PR)

time in planning 01:55 23/5/24

FreddieAkeroyd commented 1 month ago

On INES I killed just the external gateway.exe process, but on restart it did not serve PVs externally. When I looked closely at the log, it had bound to the wrong network interface; in fact it had bound to loopback, as the block gateway should. I had to restart the whole IBEX server to get this to work. The gateway is managed by procServ; I don't know whether there is some interference between the external and block gateways causing their settings to be merged in some cases.

ChrisM-S commented 1 month ago

We tried a full restart on the NGEM PC earlier on, mid-afternoon, but this still did not clear things up (although it would have restarted IBEX). Not sure if this is significant.

FreddieAkeroyd commented 1 month ago

There are gateways on both the NGEM PC and NDXINES; these gateways only handle incoming connections. Restarting the NGEM PC would have cleared up any issue with NDXINES viewing data on the NGEM PC, but in this case it was the NGEM PC needing to view run numbers on NDXINES, so only the gateway on NDXINES itself was involved.

FreddieAkeroyd commented 1 month ago

Failing gateway

[Fri May 10 12:30:12 2024] @@@ Restarting child "GWEXT"
[Fri May 10 12:30:16 2024] EPICS_CA_ADDR_LIST=127.255.255.255
[Fri May 10 12:30:16 2024] EPICS_CAS_INTF_ADDR_LIST=127.0.0.1
[Fri May 10 12:30:16 2024] EPICS_CAS_IGNORE_ADDR_LIST=127.0.0.1
[Fri May 10 12:30:16 2024] EPICS_CAS_BEACON_ADDR_LIST=Not specified
[Fri May 10 12:30:16 2024] Statistics PV prefix is IN:INES:CS:GATEWAY:EXTERNAL

Working gateway

[Wed Apr 24 10:37:27 2024] @@@ Restarting child "GWEXT"
[Wed Apr 24 10:37:34 2024] EPICS_CA_ADDR_LIST=127.255.255.255
[Wed Apr 24 10:37:34 2024] EPICS_CAS_INTF_ADDR_LIST=130.246.54.235
[Wed Apr 24 10:37:34 2024] EPICS_CAS_IGNORE_ADDR_LIST=130.246.54.235
[Wed Apr 24 10:37:34 2024] EPICS_CAS_BEACON_ADDR_LIST=130.246.55.255
[Wed Apr 24 10:37:34 2024] Statistics PV prefix is IN:INES:CS:GATEWAY:EXTERNAL
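The difference between the two excerpts can be checked automatically. The following is a minimal sketch, not part of IBEX: it assumes the log lines keep the `[timestamp] NAME=value` shape shown above, and the function names `parse_gateway_settings` and `bound_to_loopback` are hypothetical helpers introduced for illustration.

```python
import ipaddress
import re

# Matches lines such as "[Fri May 10 12:30:16 2024] EPICS_CAS_INTF_ADDR_LIST=127.0.0.1"
SETTING = re.compile(r"^\[[^\]]*\]\s+(EPICS_\w+)=(.*)$")

FAILING = """\
[Fri May 10 12:30:12 2024] @@@ Restarting child "GWEXT"
[Fri May 10 12:30:16 2024] EPICS_CA_ADDR_LIST=127.255.255.255
[Fri May 10 12:30:16 2024] EPICS_CAS_INTF_ADDR_LIST=127.0.0.1
[Fri May 10 12:30:16 2024] EPICS_CAS_BEACON_ADDR_LIST=Not specified
"""

WORKING = """\
[Wed Apr 24 10:37:27 2024] @@@ Restarting child "GWEXT"
[Wed Apr 24 10:37:34 2024] EPICS_CA_ADDR_LIST=127.255.255.255
[Wed Apr 24 10:37:34 2024] EPICS_CAS_INTF_ADDR_LIST=130.246.54.235
[Wed Apr 24 10:37:34 2024] EPICS_CAS_BEACON_ADDR_LIST=130.246.55.255
"""


def parse_gateway_settings(log_text):
    """Collect the EPICS_* settings reported in a log excerpt (last value wins)."""
    settings = {}
    for line in log_text.splitlines():
        match = SETTING.match(line.strip())
        if match:
            settings[match.group(1)] = match.group(2).strip()
    return settings


def bound_to_loopback(settings):
    """Return True if any EPICS_CAS_INTF_ADDR_LIST entry is a loopback address."""
    for entry in settings.get("EPICS_CAS_INTF_ADDR_LIST", "").split():
        try:
            if ipaddress.ip_address(entry).is_loopback:
                return True
        except ValueError:
            # Not a plain IP address (e.g. a host name); ignore it here.
            continue
    return False


if __name__ == "__main__":
    for name, text in (("failing", FAILING), ("working", WORKING)):
        print(name, bound_to_loopback(parse_gateway_settings(text)))
```

Run against the external gateway log after each restart, a check like this would flag the bad state without anyone having to read the log by hand.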

EPICS_CAS_INTF_ADDR_LIST gets set incorrectly: the external gateway starts up but uses the block gateway settings. This is set via arguments to gateway.exe, but EPICS_CAS_BEACON_ADDR_LIST is just an inherited environment variable that is set just before spawning each process. I therefore conclude that this is a race condition where sometimes the two gateway processes are spawned too quickly in succession and they interfere. I propose a simple solution: add a short delay between the two spawns so that the first fork() will have happened before the second procServ is started.
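The proposed delay could look roughly like the sketch below. It is an illustration only: the real gateways are started through procServ rather than a Python launcher, and `spawn_staggered` and the `DEMO_GATEWAY_ROLE` variable are hypothetical names. Besides the pause, each child is given its own copy of the environment, so a setting intended for one process cannot be picked up by the other.

```python
import os
import subprocess
import sys
import time


def spawn_staggered(jobs, delay_s=2.0):
    """Start each (command, env_overrides) job in turn, pausing between spawns.

    Every child receives a private copy of the current environment with its
    own overrides applied, instead of sharing variables set in the parent.
    """
    children = []
    for index, (command, env_overrides) in enumerate(jobs):
        if index > 0:
            time.sleep(delay_s)  # let the previous child start before spawning the next
        child_env = dict(os.environ)
        child_env.update(env_overrides)
        children.append(
            subprocess.Popen(command, env=child_env, stdout=subprocess.PIPE, text=True)
        )
    return children


if __name__ == "__main__":
    # Stand-in child process: prints the (hypothetical) role it was given.
    show_role = [sys.executable, "-c", "import os; print(os.environ['DEMO_GATEWAY_ROLE'])"]
    jobs = [
        (show_role, {"DEMO_GATEWAY_ROLE": "block"}),
        (show_role, {"DEMO_GATEWAY_ROLE": "external"}),
    ]
    for child in spawn_staggered(jobs, delay_s=0.5):
        print(child.communicate()[0].strip())
```

A fixed delay narrows the race window rather than removing it; passing the settings explicitly per process, as the `env=` argument does here, is the more robust half of the sketch.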