fermi-ad / controls

Central repo for reporting bugs, making feature requests, managing RFCs, and requesting seminar topics.
https://www-bd.fnal.gov/controls/

UCD failure #44

Open awattsFNAL opened 11 months ago

awattsFNAL commented 11 months ago

Issue documented by Ops involving failed switch-over between UCDA and UCDB: https://www-bd.fnal.gov/Elog/?orEntryId=250402

rneswold commented 11 months ago

I rebooted UCDB, which fixed the problem.

We have several systems that were designed to "fail over" (UCDs, FIRUS, STATES). None of them does it well: the fail-over logic sometimes gets confused even when everything is fine. That's worse than having a ready back-up.

I propose, for now, that we power down UCDB. If UCDA has a hardware failure, we can power up UCDB. I'm not aware of any logging for when they switch. It would be interesting to know whether they switch occasionally without getting into a stuck state where they fight each other.

The big problem is that these are 68k machines and it may be difficult to impossible to fix the code because our VxWorks development environment has moved to Linux and we don't have 68k support. Briegel was in the process of upgrading to MVME-5500 processor boards and VxWorks 6.x, but he never finished.

There was talk (by Mark Austin and Dan MacArthur) about making an Ethernet-based TCLK decoder. If we had that, I could modify my TCLK monitor service to be the replacement for the UCD front-ends.

The TCLK monitor currently listens to the multicast (generated by the UCD front-ends) and forwards the events to Bobby's redis servers. It also provides a gRPC API so clients can register to receive TCLK events. If the monitor were to listen to the Ethernet decoder instead, it could generate the multicast itself and still perform its other tasks. The expensive, VME-based UCD front-ends would be retired.
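To make the data flow above concrete, here is a minimal Python sketch of the listen-and-fan-out pattern. It is not the actual TCLK monitor code. The multicast group, port, and packet layout (one event number per byte) are placeholders I made up for illustration, and plain callbacks stand in for the gRPC registration and the redis forwarding.

```python
import socket
import struct
from typing import Callable, Dict, List

# Hypothetical values: the real multicast group, port, and packet layout
# are not stated in this thread.
TCLK_GROUP = "239.0.0.1"
TCLK_PORT = 50000


def open_multicast_socket(group: str = TCLK_GROUP, port: int = TCLK_PORT) -> socket.socket:
    """Open a UDP socket and join the given IPv4 multicast group."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    # struct ip_mreq: group address followed by the local interface (any).
    mreq = struct.pack("4s4s", socket.inet_aton(group), socket.inet_aton("0.0.0.0"))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    return sock


def decode_events(payload: bytes) -> List[int]:
    """Assumed layout: each byte of the datagram is one TCLK event number."""
    return list(payload)


class TclkFanout:
    """Delivers decoded events to the callbacks registered for them."""

    def __init__(self) -> None:
        self._subscribers: Dict[int, List[Callable[[int], None]]] = {}

    def register(self, event: int, callback: Callable[[int], None]) -> None:
        """Ask to be called whenever `event` is seen."""
        self._subscribers.setdefault(event, []).append(callback)

    def publish(self, payload: bytes) -> int:
        """Decode one datagram and notify subscribers; returns deliveries made."""
        delivered = 0
        for event in decode_events(payload):
            for callback in self._subscribers.get(event, []):
                callback(event)
                delivered += 1
        return delivered


def run(fanout: TclkFanout, sock: socket.socket) -> None:
    """Blocking receive loop: one datagram in, zero or more callbacks out."""
    while True:
        payload, _addr = sock.recvfrom(2048)
        fanout.publish(payload)
```

With an Ethernet-based decoder, only the input side of this sketch would change: `run` would read from the decoder rather than from the UCD multicast, and the fan-out to clients would stay as it is.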

awattsFNAL commented 11 months ago

UCDB has been powered off and its alarm bypassed. Moving this to the back burner: we still need to understand why the switchover doesn't work and come up with a plan to replace the VME UCDs with something more modern and maintainable (MFTU variant?)

rneswold commented 11 months ago

These are MVME-162 processors. We don't have the compilation tools to fix the bug (aside from trying to port the code to an MVME-5500 and a newer version of VxWorks).

Replacing the UCD front-ends with an MFTU-type decoder is the better solution because, if we port the code to a different VME processor, we are still stuck with an expensive, single-core, PowerPC-based VME solution. An Ethernet-based MFTU decoder lets us use commodity, 20-core Linux boxes at a fraction of the cost and with far more performance.