Improve monitoring of how long actions are taking

ChrisM-S commented 3 years ago

ENDs on MERLIN have been unpredictably slowing from a script, variable times but often taking up to 10 minutes before the END actually takes effect and is logged by the DAE. Restarting DAE and IOC had no effect on the problem. Before restarting the IBEX server the client was restarted just to check there was nothing going on there.

After a client restart the BEGINS and ENDS behaved normally. The behaviour developed during 6-8hours of the client's use, possibly as a result of a plot or error or some other issue. ICP and DAE IOC logs all looked fine while these problems were occurring. The delays appeared both in the scripted END and when doing a non scripted END on the console.

Acceptance Criteria

ENDs (and BEGINS etc.) in the client perform a handshake with the server and detect any significant delay in pasing on a control command via a timeout.
Genie Python reports on and possibly abandons/errors or retries. the command - depending on what is most helpful.
The client is pragmatic and if necessary reports on its poor performance and suggests a restart if it might help.

Notes 1) The scientist had had multiple DAE issues over the last few days so it would have been really useful if this had been clearly shown up as a different sort of issue. 2) It might also be useful to also make this separation with other equipment IOCs as problems in client to server communication could affect scripts and timings in a similar way. the DAE was collecting 43uAmps of data instead of the 40 specified in the waitfor command. 3) If there is an issue with the client, this may also need fixing in the script server

ChrisM-S commented 3 years ago

+1 on HRPD

There is a little more information from HRPD about the workaround to this problem the scientist had stopped IBEX Server & Client more than once to try to and fix very slow BEGINS, UPDATESTOREs and ENDs, again just long pauses - all verified that the ICP itself was not logging any DAE activity during these pauses.

To fix the issue it was necessary not just to restart the Client but also it was necessary to connect to the DAE IOC and restart this (which was done on MERLIN too). Why the complete restart of the Server on HRPD did not include the DAE restart which could fix this is puzzling.

Whilst BEGINS/ENDs have become almost instant to show on the dashboard after the Client/DAE restart, the scientist felt that the overall time for BEGINS/ENDS was slower than normal for the standard settings. What I would concurr is that the ICP log file did still show some pauses - they could have been things going on in the background, but usually the logging it a bit more continuous on DAE actions so the gaps in logging might be more due to pauses in states changing in the ICP? For example, in the fragment of log file below, there was a 47s delay between log entries which seemed unusual and was probably what made the whole command slower than normal from the scientists point of view.

2021-05-02 12:15:00  (520044) 1 DAQ periods of size 5074090 words
2021-05-02 12:15:00  (520044) *** Timing information for CRPTProxy::zeroHistRawData: real(0.016604) thread(kernel=0.015625, user=0) process(kernel=0.015625, user=0)
2021-05-02 12:15:47  (520044) *** Timing information for UPDATE: real(46.9141) thread(kernel=0.015625, user=0) process(kernel=0.28125, user=0.40625)
2021-05-02 12:15:47  (520044) Average DAE transfer rate = 432 KByte / s

FreddieAkeroyd commented 3 years ago

The 47s timestamp gap would be expected, the update of dae data took real(46.9) seconds. However the transfer rate reported (432 kb/s) is 1/10 of what it was previously. So their slowness is related to a drop in DAE connection speed, but i am not sure what is causing the actual delay/hang in ibex. Maybe if we solve one we can solve the other. I also note that MERLIN is running slower than before with DAE access (it is a DAE2 rather than a DAE3), but the common factor between MERLIN and HPRD seem to be a drop in dae response speed

FreddieAkeroyd commented 3 years ago

HRPD issue solved by power cycling NDH computer and replacing network switch, communication speed now back to normal.

FreddieAkeroyd commented 3 years ago

If this is a pure client restart fix then it is not the same as HRPD, but symptoms are similar and they are similarly seeing a slower DAE performance. I don't know if https://github.com/ISISComputingGroup/IBEX/issues/6470 might help

DominicOram commented 3 years ago

See https://github.com/ISISComputingGroup/IBEX/issues/5447 for general better logging of genie_python

ISISComputingGroup / IBEX

Improve monitoring of how long actions are taking #6469