JeffersonLab / epics2web

EPICS CA Web Gateway
https://epicsweb.jlab.org/epics2web
MIT License
16 stars 5 forks source link

Fine tune lock handling / timeout / async code #14

Closed slominskir closed 9 months ago

slominskir commented 9 months ago

Initial production running with https://github.com/JeffersonLab/epics2web/issues/12 suggests that a 10 second acquire lock timeout is not long enough when adding a new channel, specifically on a heavily loaded server during a restart. Actually rejecting addChannel requests during a restart would be fine, as clients could retry with browser reload (todo: automate this in client JavaScript lib) , but on closer code inspection I noticed a larger issue I think is the non-transactional nature of the addPvs method now due to separate locks for the (async) registering of a new channel with monitorLock and a separate clientLock for some book-keeping metadata. In particular if a set of PVs is requested and one near the end of the set fails we end up in a pretty bad state because the client bookkeeping isn't incrementally updated, but only is updated if all Pvs in the set are added. This wasn't really a concern before as the threads would wait indefinitely. Having timeout is reasonable, but only if we're careful about aborting with book-keeping in a good place. Consolidating the clientLock and monitorLock should resolve that concern. PR incoming. I'm specifically referring to this code:

https://github.com/JeffersonLab/epics2web/blob/956894699ef1b303907a04720aeb50260ffa72b1/src/main/java/org/jlab/epics2web/epics/ChannelManager.java#L203-L246

slominskir commented 9 months ago

Note: the client side retry todo is specifically needed for the case when web socket connection is fine, but server-side internal lock acquisition timeout occurs. In this case the client is not explicitly notified of a problem. On web socket issues the client IS notified and automatic retries are already in-place. A timer on the client side to confirm channels are being updated / really connected would be nice and could also catch other cases more general than just lock acquisition timeout.