fermi-ad / controls

Central repo for reporting bugs, making feature requests, managing RFCs, and requesting seminar topics.
https://www-bd.fnal.gov/controls/

Front end clock drift #20

Open awattsFNAL opened 9 months ago

awattsFNAL commented 9 months ago

https://www-bd.fnal.gov/Elog/?orEntryId=248251

awattsFNAL commented 9 months ago

Slack discussion

beauremus commented 9 months ago

I want to reiterate explicitly here that Ops, in the Slack thread, suggested that data logging the J: values would be useful in debugging. This is something that Controls should look into since it will likely require instantiating a new logger. Let me know if I should turn this into a feature request.

beauremus commented 9 months ago

Doing some searches across front-end config files, I find that these are the clock modules for the FEs mentioned in Slack.

beauremus commented 9 months ago

For posterity, here's the command I ran to parse the FE config files:

```shell
grep -lr "sld\|ucd" /fecode-bd/vxworks_boot/fe | grep -E "\.(startup|cmd|vx|login)$" | xargs grep -H "sld\|ucd" > ~/fe_clk_module.txt
```

From @rneswold, here are the available types:

- SLD -> "sld-*.out"
- IP-UCD -> "libiptrig-*.out"
- PMCUCD -> "libpmctrig-*.out"
- VUCD -> "libvucdtrig-*.o*"
- Multicast TCLK -> "libmctrig-*.o*"

Unfortunately, not all FE maintainers follow this convention, so doing a similar search in the relevant directory in /fecode-bd/vxworks_boot/fe can get you what you need.
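As an illustration of that search, here's a self-contained sketch that runs the same pipeline against a scratch directory. The real config files live under /fecode-bd/vxworks_boot/fe, which is only reachable on-site, so the directory names and file contents below are invented for the example:

```shell
# Build a scratch tree that mimics the FE boot layout (hypothetical names).
tmp=$(mktemp -d)
mkdir -p "$tmp/mcr01" "$tmp/cmtil1"
echo 'ld < sld-2.1.out'       > "$tmp/mcr01/mcr01.startup"
echo 'ld < libvucdtrig-1.0.o' > "$tmp/cmtil1/cmtil1.startup"

# Same pipeline as above, pointed at the scratch tree:
# files mentioning sld/ucd -> keep only startup-style files -> show matches.
grep -lr "sld\|ucd" "$tmp" | grep -E "\.(startup|cmd|vx|login)$" \
  | xargs grep -H "sld\|ucd"

rm -rf "$tmp"
```

Each output line pairs a config file with the clock-decoder library it loads, which is enough to classify the FE against the table above.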

Thanks a ton to @rneswold for getting me 98% of the way there. 🦾

awattsFNAL commented 8 months ago

From @kengell:

I’m working on a clock drift problem where the VME-based front-ends (e.g. MCR01) exhibit a once-per-hour clock reset of about 30 milliseconds (see DATALOGGER plot). I received a call from the MCR, and they reported a ‘Network Clock Storm’. By that, the MCR means that a whole lot of J: devices (e.g. J:MCR01, J:CLX44E) alert them to slow FE response times. My understanding is that the MONITR FE (Java-based) pings the VME and ACSys FEs periodically and reports latency in the J: devices. What I find odd is that the VME-based FEs appear to have a 1-hour clock drift/reset (see plot). I do not observe that behavior on the ACSys FEs (J:CLX44E). Does anyone know why the VME-based FEs exhibit this one-hour rise/fall of ping latency? Thanks.

https://files.slack.com/files-pri/THF7S17RV-F063TUA6YQ2/screenshot_2023-11-01_at_11.01.41_am.png

awattsFNAL commented 8 months ago

Rich: It’s hard to generalize. The ACSys front-ends are on Linux, which uses NTP to keep the clocks in sync. NTP is only supposed to speed up or slow down the clock interrupt, so time is always increasing and eventually stays in sync with the time server. In VxWorks, we use the $8F (GPS) event to sync our system’s one-second boundary. This means time can briefly go backwards if the clock was ahead. Each FE syncs its time differently, based on which TCLK decoder library it’s using. In addition, VxWorks startup scripts may start a periodic background task that syncs with another system. The 6.x kernels have a simple NTP client function that simply sets the system time, so it can jump forwards or backwards. Your plot looks as though the system time drifts and you have a once-an-hour task that resets it.
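To visualize the step-style sync described above, here's a toy sketch (all numbers assumed for illustration, not measured) of a clock that accumulates ~30 ms of drift over the hour and is stepped back to zero by an hourly sync task. This produces the sawtooth shape seen in the DATALOGGER plot, whereas an NTP-slewed Linux clock would hold near zero:

```shell
# Hypothetical model: 30 ms of drift accumulated per hour, reset on the hour.
awk 'BEGIN {
  drift_per_min = 30.0 / 60;            # assumed drift rate, ms per minute
  for (m = 0; m < 180; m++) {
    offset = (m % 60) * drift_per_min;  # steps back to 0 each hour (no slew)
    if (m % 30 == 0)
      printf "t=%3d min  offset=%5.1f ms\n", m, offset;
  }
}'
```

The offset climbs to ~30 ms and snaps back each hour, which is consistent with a periodic task that sets (rather than slews) the system time.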

Dennis Nicklaus: It isn't VME in general; it is more about MCR01 in particular. Here's a similar plot with a different VME front end, CMTIL1, with a much more stable clock. I don't know off the top of my head if it is a 5.4 thing, a 6040/162 thing, or just how MCR01 is configured. https://files.slack.com/files-pri/THF7S17RV-F063NF85UDC/screenshot_2023-11-01_at_2.31.58_pm.png