JeffersonLab / epics2web

EPICS CA Web Gateway
https://epicsweb.jlab.org/epics2web
MIT License

Investigate Correct Unresponsive IOC Handling #3

Closed slominskir closed 5 years ago

slominskir commented 5 years ago

We need to research how best to handle the scenario where an IOC becomes unresponsive and the client API (CAJ) throws ContextVirtualCircuitException with status 60:

EPICS CA Context Virtual Circuit Exception: Status: gov.aps.jca.CAStatus[UNRESPTMO=60,WARNING=0]=Virtual circuit connection unresponsive

Currently epics2web will attempt to reset the entire context in this scenario: context.destroy() is called and then a new CAJContext is created and all monitors are re-created.

However, the EPICS CA protocol specification (https://epics.anl.gov/docs/CAproto.html#secVCUnresponsive) says a disconnect should be avoided in this scenario. Empirically, I've observed that CAJ does not recover automatically from status 60, but more research is needed, as this could just be due to bugs in epics2web, especially if multiple IOCs are rebooted simultaneously. Recreating this scenario has proven tricky: killing a running Java CAS server results in status 24, which the CAJ client can recover from automatically once the server is restarted. I'm seeing status 60 when an RTEMS IOC is restarted in production (maybe it crashes).

Note: an unresponsive IOC is different from a disconnected IOC. A disconnected IOC results in status 24:

EPICS CA Context Virtual Circuit Exception: Status: gov.aps.jca.CAStatus[DISCONN=24,WARNING=0]=Virtual circuit disconnect

In the case of CA status 24, epics2web defers to the underlying CAJ library to watch for the IOC to come back online and automatically retry the connection. In other words, epics2web does nothing and ignores a ContextVirtualCircuitException with status 24.
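The handling described so far (ignore status 24, reset the context on status 60) can be summarized as a small dispatch. This is only a sketch: `CircuitStatusPolicy`, its `Action` enum, and the `LOG_ONLY` fallback for other statuses are hypothetical names and assumptions, not epics2web code, and the numeric status would first have to be extracted from the event's `CAStatus`.

```java
/** Hypothetical policy for virtual circuit exceptions; not taken from epics2web. */
final class CircuitStatusPolicy {
    // CA status values quoted in this issue.
    static final int DISCONN = 24;    // Virtual circuit disconnect
    static final int UNRESPTMO = 60;  // Virtual circuit connection unresponsive

    enum Action { DEFER_TO_CAJ, RESET_CONTEXT, LOG_ONLY }

    /** Maps a numeric CA status to the action described in this issue. */
    static Action actionFor(int statusValue) {
        switch (statusValue) {
            case DISCONN:
                // CAJ watches for the IOC and reconnects on its own.
                return Action.DEFER_TO_CAJ;
            case UNRESPTMO:
                // Behavior at the time of writing; the open question is
                // whether this should also be DEFER_TO_CAJ.
                return Action.RESET_CONTEXT;
            default:
                // Assumption: any other status is only logged.
                return Action.LOG_ONLY;
        }
    }
}
```

Switching status 60 to "do nothing" would then be a one-line change in the `UNRESPTMO` case.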

slominskir commented 5 years ago

Two additional points:

  1. It looks like there is a bug in epics2web where, if multiple IOCs become unresponsive simultaneously (i.e. multiple status 60 virtual circuit exceptions), a context reset could interfere with an already in-progress context reset. A fix for this is already committed. However, it still isn't clear whether resetting the context is needed. Perhaps "do nothing", as we do with status 24, is the correct thing to do? Aside: in load tests where we overwhelm the epics2web server, an unresponsive exception is sometimes triggered, and this may be because the server can't process messages fast enough. Doing nothing in this case is probably fine too.
  2. What about notifying epics2web clients about connection issues? EDM, for example, usually displays a white box around widgets when their PVs are disconnected / unavailable. If epics2web were to notify WebSocket clients about connectivity problems (after an initial successful connection), I believe we would need to use the information contained in the ContextVirtualCircuitExceptionEvent object to determine which CAChannels, and in turn which WebSocket clients, to notify. Unless I'm misunderstanding the CAJ API, this appears very clumsy and costly:
public void contextVirtualCircuitException(ContextVirtualCircuitExceptionEvent ev) {
    LOGGER.log(Level.SEVERE, "EPICS CA Context Virtual Circuit Exception: Status: {0}, Address: {1}, Fatal: {2}", new Object[]{ev.getStatus(), ev.getVirtualCircuit(), ev.getStatus().isFatal()});

    Transport[] transports = context.getTransportRegistry().toArray();

    for (Transport t : transports) {
        // No port in getVirtualCircuit().  Hope there aren't multiple CA servers at that IP!
        if (ev.getVirtualCircuit().equals(t.getRemoteAddress().getAddress())) {
            CATransport cat = (CATransport) t;

            Channel[] channels = context.getChannels();

            for (Channel c : channels) {

                CAJChannel cac = (CAJChannel) c;

                if (cat.equals(cac.getTransport())) {
                    //TODO: Now that we know the channel, we can lookup the WebSocket(s)!
                }
            }
        }
    }
}
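
The nested scan above walks every transport and every channel on each exception. If client notification were pursued, the final lookup step (channel to WebSocket clients) could be served by an index maintained at subscribe time. The sketch below is a hypothetical illustration, not epics2web code: `ChannelSessionIndex` and its methods are invented names, and it assumes channels are keyed by PV name and WebSocket clients by an opaque session id.

```java
import java.util.Collections;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical index from PV name to subscribed WebSocket session ids. */
final class ChannelSessionIndex {
    private final Map<String, Set<String>> sessionsByPv = new ConcurrentHashMap<>();

    /** Records that a session is monitoring a PV. */
    void subscribe(String pvName, String sessionId) {
        // compute() keeps create-and-add atomic with respect to unsubscribe().
        sessionsByPv.compute(pvName, (k, sessions) -> {
            if (sessions == null) {
                sessions = ConcurrentHashMap.newKeySet();
            }
            sessions.add(sessionId);
            return sessions;
        });
    }

    /** Removes a session; drops the PV entry once nobody is listening. */
    void unsubscribe(String pvName, String sessionId) {
        sessionsByPv.computeIfPresent(pvName, (k, sessions) -> {
            sessions.remove(sessionId);
            return sessions.isEmpty() ? null : sessions;
        });
    }

    /** Returns a read-only view of the sessions to notify for a PV. */
    Set<String> sessionsFor(String pvName) {
        Set<String> sessions = sessionsByPv.get(pvName);
        return sessions == null
                ? Collections.<String>emptySet()
                : Collections.unmodifiableSet(sessions);
    }
}
```

With such an index, the TODO in the inner loop would reduce to a single `sessionsFor(...)` call per affected channel. Matching transports to channels would still need the CAJ calls shown above.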

slominskir commented 5 years ago

Opting to "do nothing". It seems the gov.aps.jca.event.ConnectionListener will notify ChannelMonitors when connectivity issues occur. The code that reset the whole context was removed.
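
A rough sketch of how a per-channel connection callback could be relayed to clients follows. It deliberately avoids the JCA types so that it stands alone: `ConnectionStateRelay` and its `notifier` callback are hypothetical, and it assumes the real listener's `connectionChanged` callback would pass the event's connected flag into `onConnectionChanged`. It is not the epics2web ChannelMonitor implementation.

```java
import java.util.function.BiConsumer;

/** Hypothetical relay that forwards connection state changes for one PV. */
final class ConnectionStateRelay {
    private final String pvName;
    private final BiConsumer<String, Boolean> notifier;
    private Boolean lastState; // null until the first callback arrives

    ConnectionStateRelay(String pvName, BiConsumer<String, Boolean> notifier) {
        this.pvName = pvName;
        this.notifier = notifier;
    }

    /**
     * Intended to be called from the channel's connection listener.
     * Repeated callbacks with an unchanged state are suppressed so clients
     * only hear about actual transitions.
     */
    synchronized void onConnectionChanged(boolean connected) {
        if (lastState != null && lastState.booleanValue() == connected) {
            return;
        }
        lastState = connected;
        notifier.accept(pvName, connected);
    }
}
```

Because the callback arrives per channel, this approach needs no transport-to-channel matching when an IOC drops or returns.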