fermi-ad / controls

Central repo for reporting bugs, making feature requests, managing RFCs, and requesting seminar topics.
https://www-bd.fnal.gov/controls/

CLX40 hanging for NuMI LCW parameters #43

Closed by awattsFNAL 6 months ago

awattsFNAL commented 7 months ago

NuMI folks noticed that Ops has needed to restart the Erlang process on CLX40 more frequently lately; it has been hanging and not reporting data to MACALC for some important calculated LCW parameters.

awattsFNAL commented 7 months ago

https://www-bd.fnal.gov/Elog/?orEntryId=250374

kengell commented 7 months ago

INVESTIGATION:

Spoke w/ Ops...

MACALC devices in question are: E:62WV02, E:62WLVL, and E:62WV01.

These devices are homed on MACALC but rely on data provided by the CLX40E devices E:62WL02, E:62WL01, and E:62WL01, respectively.
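For reference, the dependency described above can be written down as a small lookup. This is only an illustrative sketch in Python: the device names come from this comment, while `MACALC_SOURCES` and `affected_by` are hypothetical names and not part of any controls library.

```python
# Calculated MACALC device -> CLX40E source device, as listed in this comment.
# The dictionary and helper names are hypothetical, for illustration only.
MACALC_SOURCES = {
    "E:62WV02": "E:62WL02",
    "E:62WLVL": "E:62WL01",
    "E:62WV01": "E:62WL01",
}


def affected_by(source):
    """Return the calculated devices that go stale if `source` stops updating."""
    return sorted(dev for dev, src in MACALC_SOURCES.items() if src == source)
```

If E:62WL01 stops updating, both E:62WLVL and E:62WV01 lose their input, which would match Ops seeing several calculated parameters drop out together.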

CLX40E gets readbacks for E:62WL02 and E:62WL01 from the PLC mi62-lcw-plc.

The PLC Direct protocol is UDP-based (port 28784). UDP is connectionless, so there are no log entries for reconnections.

My working theory is that the PLC stopped producing packets for the CLX40E front-end to consume. Rebooting CLX40E resends the 'join' message and thereby restarts the data flow.

Will likely loop in G. Brown to investigate UDP join counters on the Cisco switches to see if we can determine how traffic flow stopped.
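Since UDP leaves no reconnection trail, one way to test the theory above would be to timestamp the datagram stream and flag when it goes quiet. The following is a minimal Python sketch under stated assumptions, not the CLX40E Erlang code: `StalenessWatchdog`, `watch`, the 30 s threshold, and the bind address are all hypothetical, and it does not send or understand the protocol's 'join' message.

```python
import socket
import time


class StalenessWatchdog:
    """Tracks the time since the last datagram and flags a stalled stream."""

    def __init__(self, max_silence_s, clock=time.monotonic):
        self.max_silence_s = max_silence_s
        self._clock = clock
        self._last_seen = clock()

    def packet_received(self):
        """Record that a datagram just arrived."""
        self._last_seen = self._clock()

    def silence_s(self):
        """Seconds elapsed since the last recorded datagram."""
        return self._clock() - self._last_seen

    def is_stale(self):
        return self.silence_s() > self.max_silence_s


def watch(bind_addr, max_silence_s=30.0, poll_s=1.0):
    """Listen for UDP datagrams and return once the stream has gone quiet.

    bind_addr is a (host, port) tuple; which host and port actually see the
    PLC traffic is an assumption to be checked against the real setup.
    """
    watchdog = StalenessWatchdog(max_silence_s)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind(bind_addr)
        sock.settimeout(poll_s)  # wake up regularly even when nothing arrives
        while True:
            try:
                sock.recvfrom(65535)
                watchdog.packet_received()
            except socket.timeout:
                pass  # no datagram during this poll interval
            if watchdog.is_stale():
                print(f"no datagrams for {watchdog.silence_s():.0f} s")
                return
```

Something like `watch(("0.0.0.0", 28784))` on a test host would log the moment the flow stops, which could then be lined up against the switch counters. It cannot run alongside another process already bound to that port on the same machine.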


kengell commented 7 months ago

CLX40E rebooted by Ops again on Dec-7 @ 0200.


kengell commented 7 months ago

Looping in Tom Zuchnik (keeper of mi62-lcw-plc) to see if he has any insight on connection problems w/ CLX40E.

kengell commented 7 months ago

No reboot of CLX40E overnight. Tom had a TCP client attached to the mi62-lcw-plc overnight and reports no network disconnect. Will continue to monitor.

kengell commented 7 months ago

CLX40E is w/o reboot for 5 days. I believe installing the correct (updated) configuration file may have helped.

awattsFNAL commented 7 months ago

Latest from Keith is that Ops still hasn't needed to restart the CLX40E Erlang process since the updated configuration file was installed. We'll keep this issue open through the end of the month just in case, but it's looking good so far.