ISISComputingGroup / IBEX

Top level repository for IBEX stories
5 stars 2 forks source link

Mercury: EMU issues reoccurred on the 22nd and 23rd of May 2021 #6720

Open JamesKingWork opened 3 years ago

JamesKingWork commented 3 years ago

As an EMU scientist, I would like my mercury to run consistently. The problems from https://github.com/ISISComputingGroup/IBEX/issues/6286 reoccurred on the 22nd and 23rd of May 2021, these problems should be investigated and fixed.

Acceptance criteria:

Notes

JamesKingWork commented 3 years ago

EMU scientists will set up with a cryostat to tests for this ticket and also https://github.com/ISISComputingGroup/IBEX/issues/6721

FreddieAkeroyd commented 3 years ago

The issue that was resolved previously was lock timeout - multiple simultaneous separate scans were initiated and they all need to start before lockTimeout, which they didn't before. The logs on the day mentioned above are all reply timeout, in some cvases the device is taking 4 seconds to respond, but the replyTimeout is 2 seconds. Maybe it gets overloaded with the number of commands? what timeout did labview use? I suspect it may have silently retried once. There seem to be periods of timouts, often hours apart. increasing replyTimout to 5 seconds may help. Probably need to increase lockTimeout too. Moving to a single scan loop will not resolve a replyTimeout, it would just make it less likely to ever hit a lockTimeout.

ChrisM-S commented 3 years ago

Just a quick note. Several of these EMU/Mercury tickets have come up and it is important to keep remembering that the physical electrical environment (long cable, electrically noisy environment and high baud rate of 57k) of this beamline setup is pretty well guaranteed to produce random transmission errors during cycle. This is sort of good for testing resilience and making the comms. robust but it is only likely to be properly fixed by dropping the baud rate (possibly back to 9600) or moving the MOXA close to the kit - not so easy.

ChrisM-S commented 3 years ago

OK, just checked on a chance meeting with muon scientists. The long cable was only set up at the time for our tests (on this issue) connected to a cryostat (which was on the MUSR side of the beamline). The connection to the EMU MOXA from the mercury was short during the cycle (in the EMU cryogenics rack).

It does appear however that the muon Mercury units have more cards in them than those used by anyone else (~10 cards?). IBEX is scanning all parameters roughly once every 5s which might be reasonable for temperature but is likely to be excessive for things like Helium levels, PID & MXA power values. It is possible that this rate of scanning for all parameters on all cards is more than the internal hardware bus is comfortable to synchronise. Note: Historically, on MUSR we unbundled individual instruments from the ISObus because of the traffic limitations of all the aggregated readings for the dilution fridge. Maybe this is a bit of history repeating itself - maybe there is similar hardware to the ISObus internally to the Mercurys...?

May be worth deliberately pushing the system into predictable failure by increasing the read frequency on all channels and plotting it as it starts to fail - it might then become clear under what conditions the tail of this issue can be avoided. The solution can then be to back off non-essential reads to a lower frequency.

FreddieAkeroyd commented 3 years ago

In the ioc log we do not see corrupt replies, instead we get a reply timeout and then the next command seems to get the reply meant for the previous command. So it looks like the device becomes periodically "busy" and doesn't reply as quickly a usual. I haven't looked to see if these "busy" periods correlate with anything. The muon mercury certainly has more cards, but I don't know how a mercury works internally as to how amount of polling v mercury internal processing may compete. From the log it looks like we always eventually get a reply, it is not corrupt or missing, so waiting a bit longer may be all we can do. It would be worth checking what the labview did - it may have had a longer timeout, or it may have done one retry before signalling an error (effectively doubling the timeout)

ChrisM-S commented 3 years ago

Perhaps there is flow control happening on the device but we are not aware of it i.e. if our write did not complete because it was x-off ed, we would expect to only do the read when the write had completed and the result was ready. As it is, I guess we might be delivering another command to the input buffer of the mercury before it is ready - which just sits there but what we needed to do first was read the result of the previous one. That way, I think we could get one command behind when the device lagged. It would happen only when we timed out because of some internal delay in the mercury (perhaps whilst it caught up with it's own processing, or serviced an internal queue).

So, given we are not using flow control (which would be holding back our writes to the mercury), what I think I'm saying is that we should be holding back our writes (queries, instructions) by waiting until we get output from the previous one - the timeout should be much longer so we don't ever get ahead of the device. Just set so that after a long time of having no bytes back - e.g. 30s , we might assume something has got disconnected and start a recovery. Our strategy does depend on all writes/commands giving an acknowledgment, if we can send commands without a corresponding result coming back, we could get a long way ahead of ourselves.

FreddieAkeroyd commented 3 years ago

We don't set xon/xoff out side, if the mercury did send it then we might see a strange charatcter in the log, but I'm not sure exactly what would happen. So currrently all commands do write/read, could the write get buffered by the moxa? what we see is a reply timeout on read. asyn always clears the input buffer before writing the next command, so it seems the reply to the previous command happens after the second command is sent. So the internal state machine of the mercury could get out of step. waiting longer (replyTimeout) would likely fix most things we have seen. Also actually enabling flow control as per #6546 would be interesting, does the mercury know it is busy and ask for no writes? Would be interesting to see - if it does then i guess we would see a write timeout rather than a read timeout.

JamesKingWork commented 2 years ago

Initial plan for investigation based on the comments above:

Possible solutions to consider suggested above:

JamesKingWork commented 2 years ago

Initial test:

The signature of the errors in the logs from this test and the 22nd and 23rd seem very similar:

[2021-09-17 09:56:39] 2021/09/17 09:56:39.908985 L0 IN:EMU:MERCURY_01:2:HEATER:CURR: No reply within 100 ms to "READ:DEV:DB1.SH_H:HTR:SIG:CURR<0a>"
[2021-09-17 09:56:39] 2021/09/17 09:56:39.923557 L0 IN:EMU:MERCURY_01:2:FLOW: Input "STAT:DEV:DB1.SH_H:HTR:SIG:CURR:-0.0000A..."
[2021-09-17 09:56:39] 2021/09/17 09:56:39.923573 L0 IN:EMU:MERCURY_01:2:FLOW: mismatch after 18 bytes "...:DB1.SH_H:"
[2021-05-22 18:23:29] 2021/05/22 18:23:29.371839 L0 IN:EMU:MERCURY_01:PRESSURE:1:HEATER: No reply within 2000 ms to "READ:DEV:DB5.P1:PRES:LOOP:HSET<0a>"
[2021-05-22 18:23:29] 2021/05/22 18:23:29.373383 L0 IN:EMU:MERCURY_01:PRESSURE:1:HEATER:VOLT_LIMIT: Input "STAT:DEV:DB8.T1:TEMP:LOOP:P:1.0"
[2021-05-22 18:23:29] 2021/05/22 18:23:29.373398 L0 IN:EMU:MERCURY_01:PRESSURE:1:HEATER:VOLT_LIMIT: mismatch after 16 bytes "...EV:DB8.T1:"
[2021-05-23 10:26:20] 2021/05/23 10:26:20.820881 L0 IN:EMU:MERCURY_01:1:HEATER: No reply within 2000 ms to "READ:DEV:MB1.T1:TEMP:LOOP:HSET<0a>"
[2021-05-23 10:26:21] 2021/05/23 10:26:21.019627 L0 IN:EMU:MERCURY_01:1:HEATER:VOLT_LIMIT: Input "STAT:DEV:MB1.T1:TEMP:LOOP:HSET:41.9001..."
[2021-05-23 10:26:21] 2021/05/23 10:26:21.019646 L0 IN:EMU:MERCURY_01:1:HEATER:VOLT_LIMIT: mismatch after 16 bytes "...EV:MB1.T1:"
JamesKingWork commented 2 years ago

Configuration changes show changes to mercury daughter boards on 26th of May 2021:

<ioc autostart="true" name="MERCURY_01" remotePvPrefix="" restart="true" simlevel="none">
+       <ioc name="MERCURY_01" autostart="true" restart="true" remotePvPrefix="" simlevel="none">
                <macros>
-                       <macro name="PRESSURE_2" value="DB5.P1"/>
-                       <macro name="PRESSURE_1" value="DB2.P1"/>
+                       <macro name="PRESSURE_2" value="DB3.P1"/>
+                       <macro name="PRESSURE_1" value="DB5.P1"/>
                        <macro name="PORT" value="COM5"/>
-                       <macro name="TEMP_3" value="DB6.T1"/>
+                       <macro name="TEMP_3" value="DB7.T1"/>
                        <macro name="TEMP_4" value="DB8.T1"/>
                        <macro name="TEMP_1" value="MB1.T1"/>
-                       <macro name="LEVEL_1" value="DB1.L1"/>
+                       <macro name="LEVEL_1" value="DB2.L1"/>
                        <macro name="LEVEL_2" value=""/>
-                       <macro name="TEMP_2" value="DB7.T1"/>
+                       <macro name="TEMP_2" value="DB6.T1"/>
 <macro name="SPC_GAIN" value="" description="Software pressure control offset gain for extra pressure based on how far away from set point the temperature is." pattern="^[0-9]+\.?[0-9]*$" hasDefault="no"/>
+                       <macro name="SPC_MAX_PRESSURE" value="" description="Software pressure control maximum pressure allowed." pattern="^[0-9]+\.?[0-9]*$" hasDefault="no"/>
+                       <macro name="SPC_MIN_PRESSURE" value="" description="Software pressure control minimum pressure allowed." pattern="^[0-9]+\.?[0-9]*$" defaultValue="0.0" hasDefault="yes"/>
+                       <macro name="SPC_OFFSET" value="" description="Software pressure control offset (extra pressuer to add before ramp down)." pattern="^[0-9]+\.?[0-9]*$" hasDefault="no"/>
+                       <macro name="SPC_OFFSET_DURATION" value="" description="Software pressure control offset duration in minutes (time over which to ramp down)." pattern="^[0-9]+\.?[0-9]*$" hasDefault="no"/>
+                       <macro name="SPC_PRESSURE_1" value="" description="The index of the pressure card to control with software pressure control for temp1. Blank for no control." pattern="^[1,2]$" defaultValue="" hasDefault="yes"/>
+                       <macro name="SPC_PRESSURE_2" value="" description="The index of the pressure card to control with software pressure control for temp2. Blank for no control." pattern="^[1,2]$" defaultValue="" hasDefault="yes"/>
+                       <macro name="SPC_PRESSURE_3" value="" description="The index of the pressure card to control with software pressure control for temp3. Blank for no control." pattern="^[1,2]$" defaultValue="" hasDefault="yes"/>
+                       <macro name="SPC_PRESSURE_4" value="" description="The index of the pressure card to control with software pressure control for temp4. Blank for no control." pattern="^[1,2]$" defaultValue="" hasDefault="yes"/>
+                       <macro name="SPC_TABLE_FILE" value="" description="File to load to related temperature to pressure from calibration directory other_devices." pattern="^\.*$" defaultValue="little_blue_cryostat.txt" hasDefault="yes"/>
+                       <macro name="SPC_TEMP_DEADBAND" value="" description="Software pressure control deadband for the temperature and setpoint." pattern="^[0-9]+\.?[0-9]*$" hasDefault="no"/>
                </macros>
                <pvs/>
                <pvsets/>
        </ioc>

This does not increase the number of daughter boards in use, but does change the addresses.

JamesKingWork commented 2 years ago

Based on logs from EMU on the 22nd:

[2021-05-22 18:23:15] 2021/05/22 18:23:15.403984 L0 IN:EMU:MERCURY_01:3:CALFILE: No reply within 2000 ms to "READ:DEV:DB7.T1:TEMP:CAL:FILE<0a>"
[2021-05-22 18:23:15] 2021/05/22 18:23:15.434824 L0 IN:EMU:MERCURY_01:4:NAME: Input "STAT:DEV:DB7.T1:TEMP:CAL:FILE:X135268_SR..."
[2021-05-22 18:23:15] 2021/05/22 18:23:15.434841 L0 IN:EMU:MERCURY_01:4:NAME: mismatch after 11 bytes "...TAT:DEV:DB"

This is the first in a bunch of reply timeouts and protocol mismatches, there is a reply timeout so the CALFILE is no longer looking for a reply but 30,000 microseconds (0.03 seconds) later NAME gets the CALFILE response and deems it a mismatch. This also occurs in other places with varying reply lengths:

[2021-05-22 17:18:31] 2021/05/22 17:18:31.701989 L0 IN:EMU:MERCURY_01:1:PID:AUTO: No reply within 2000 ms to "READ:DEV:MB1.T1:TEMP:LOOP:PIDT<0a>"
[2021-05-22 17:18:31] 2021/05/22 17:18:31.851828 L0 IN:EMU:MERCURY_01:1:HEATER:MODE: Input "...EV:MB1.T1:TEMP:LOOP:PIDT:OFF"
[2021-05-22 17:18:31] 2021/05/22 17:18:31.851848 L0 IN:EMU:MERCURY_01:1:HEATER:MODE: mismatch after 26 bytes "...TEMP:LOOP:"

This is timeout by a tiny amount in comparison to the 2 seconds we are waiting.

On the 23rd it is doing the same thing:

[2021-05-23 10:26:18] 2021/05/23 10:26:18.749374 L0 IN:EMU:MERCURY_01:1:NAME: No reply within 2000 ms to "READ:DEV:MB1.T1:TEMP:NICK<0a>"
[2021-05-23 10:26:18] 2021/05/23 10:26:18.818685 L0 IN:EMU:MERCURY_01:1:D: Input "...TAT:DEV:MB1.T1:TEMP:NICK:MB1.VTI_T"
[2021-05-23 10:26:18] 2021/05/23 10:26:18.818704 L0 IN:EMU:MERCURY_01:1:D: mismatch after 21 bytes "...1.T1:TEMP:"

We are seeing the same thing with using 150ms as replyTimeout today:

[2021-09-17 11:07:38] 2021/09/17 11:07:38.497727 L0 IN:EMU:MERCURY_01:LEVEL:1:NITROGEN: No reply within 150 ms to "READ:DEV:DB2.L1:LVL:SIG:NIT:LEV<0a>"
[2021-09-17 11:07:38] 2021/09/17 11:07:38.498235 L0 IN:EMU:MERCURY_01:PRESSURE:1:PRESSURE:SP:RBV: Input "STAT:DEV:DB2.L1:LVL:SIG:NIT:LEV:0.7390%..."
[2021-09-17 11:07:38] 2021/09/17 11:07:38.498248 L0 IN:EMU:MERCURY_01:PRESSURE:1:PRESSURE:SP:RBV: mismatch after 11 bytes "...TAT:DEV:DB"

It seems that a reply timeout trips up the IOC until it gets time to send the commands again and get a correct response. Increasing the reply timeout slightly would fix this, but we should investigate why on the 23rd and 22nd of May this was more prominent.

FreddieAkeroyd commented 2 years ago

could we use @replytimeout to resend the original command?

JamesKingWork commented 2 years ago

could we use @replytimeout to resend the original command?

What would this look like in the proto? I'm not sure I understand how we would use it

JamesKingWork commented 2 years ago

In LabVIEW a command is sent and then it waits for bytes to be at the port every 5 milliseconds (anything other than no bytes is accepted) for 1000 milliseconds. If the timeout occurs, there has been no error report and it hasn't found a terminator then it reports an error reading from the device. This timeout is shorter than the 2000ms we were using on the 22nd and 23rd of may and the same length of time before https://github.com/ISISComputingGroup/IBEX/issues/6286. The VI retries the write and subsequent read up to 6 times unless it passes without error, effectively becoming a 6000 milliseconds timeout. If there is an error reported prior to first write, it closes and reopens the reference to the port and reinitialises comms. When doing the send, response and retries the LabVIEW will block so the rest of the program won't be trying to send and get responses to other commands.

JamesKingWork commented 2 years ago

Questions about the EPICS protocol:

Possible solutions:

If we make the comms more robust in a way other than increasing the reply timeout we should attempt to test the robustness of the comms by decreasing the reply timeout i.e. if we do the retry have the comms set to 150ms to test that the retry means we see less errors.

We should also ask the scientists if there was anything different about the setup of the mercury on these two days and why the daughter board addresses were changed on the 26th as it could be related.

JamesKingWork commented 2 years ago

We should also ask the scientists if there was anything different about the setup of the mercury on these two days and why the daughter board addresses were changed on the 26th as it could be related.

In reponse to our questions the IS has said:

Can you confirm that the errors you saw only came back on those two days? Yes, I think that’s the case Or if not do you know when else the errors occurred? N/A Do we know if anything was different on those two days? Maybe the cabling had been changed or a different mercury was being used? We can see on the 26th of May the addresses of the daughter boards were changed. PRESSURE_2 changed from DB5.P1 to DB3.P1, PRESSURE_1 from DB2.P1 to DB5.P1, TEMP_3 from DB6.T1 to DB7.T1, LEVEL_1 from DB1.L1 to DB2.L1 and TEMP_2 from DB7.T1 to DB6.T1. Do you remember what this change of daughter board addresses was for? Maybe the mercury was swapped out for another one or some of the daughter boards were replaced or moved around? We have two Mercury controllers, one that controls the NV based on HX pressure and the other that controls the NV based on the set and current temperatures. We use the former when running with a DR and 3He insert, otherwise the latter. However, both are populated with a similar set of cards. There was an extended period ~13/5 – 25/5 when we were running with the latter setup (normal 4He setup), but changed to a DR setup on the 26/5. That would be why you saw the change in cards (just to note, that the control scripts we’re asking for will avoid the need to change controllers). Otherwise, no cabling was changed. I’m certain the errors didn’t appear/disappear with the switch of controllers, but I suppose it’s possible that the controller used on those two days might be more sensitive to errors (e.g. configuration of cards or firmware version), but that’s a complete guess. At present you’re using a different Mercury from the one that would have been used when we saw the errors in the Cycle.

JamesKingWork commented 2 years ago

With the replyTimeout set to 175ms there are still errors being given every few seconds by the mercury. We have added an @replyTimeout handler to all of the get functions in the protocol that simply repeats the protocol function. The plan is to compare the logs produced for fifteen minutes of running without the @replyTimeout handler and fifteen minutes of running with the @replyTimeout handler. We can then consider whether we have sufficiently increased the robustness of the comms.

Example of a protocol function without the @replyTimeout handler:

getCatalog {
    out "READ:SYS:CAT";
    in "STAT:SYS:CAT:%1024c";
}

Example with the @replyTimeout handler:

getCatalog {
    out "READ:SYS:CAT";
    in "STAT:SYS:CAT:%1024c";
    @replyTimeout {
        out "READ:SYS:CAT";
        in "STAT:SYS:CAT:%1024c";
    }
}

Logs when not using @replyTimeout handler from 16:47:41 (end of IOC start) to 17:02 (around 15 minutes) seem to include 10 separate groups where there was no reply and a timeout occurred which resulted in mismatches. The total of the time spent in error was 60.747844 seconds (= 6.531672 + 11.041188 + 3.752574 + 6.520043 + 5.976647 + 6.137069 + 6.218394 + 4.839099 + 4.751997 + 4.979161). Here is the log from that time::

[2021-09-17 16:47:41] 2021/09/17 16:47:41.212126 L0 IN:EMU:MERCURY_01:3:HEATER:VOLT: No reply within 175 ms to "READ:DEV:None:HTR:SIG:VOLT<0a>"
[2021-09-17 16:47:41] 2021/09/17 16:47:41.263862 L0 IN:EMU:MERCURY_01:4:TEMP:SP:RBV: Input "STAT:DEV:None:N/A"
[2021-09-17 16:47:41] 2021/09/17 16:47:41.263878 L0 IN:EMU:MERCURY_01:4:TEMP:SP:RBV: mismatch after 9 bytes "STAT:DEV:"
[2021-09-17 16:47:41] 2021/09/17 16:47:41.263886 L0 IN:EMU:MERCURY_01:4:TEMP:SP:RBV: got "None:N/A" where "DB8.T1:TEMP:LOOP:TSET:" was expected
[2021-09-17 16:47:41] 2021/09/17 16:47:41.295359 L0 IN:EMU:MERCURY_01:4:RESISTANCE: Input "...TAT:DEV:DB8.T1:TEMP:LOOP:TSET:10.0000K"
[2021-09-17 16:47:41] 2021/09/17 16:47:41.295377 L0 IN:EMU:MERCURY_01:4:RESISTANCE: mismatch after 21 bytes "...8.T1:TEMP:"
[2021-09-17 16:47:41] 2021/09/17 16:47:41.295386 L0 IN:EMU:MERCURY_01:4:RESISTANCE: got "LOOP:TSET:..." where "SIG:RES:" was expected
[2021-09-17 16:47:47] 2021/09/17 16:47:47.722443 L0 IN:EMU:MERCURY_01:4:TEMP:SP:RBV: Match passed again
[2021-09-17 16:47:47] 2021/09/17 16:47:47.743798 L0 IN:EMU:MERCURY_01:4:RESISTANCE: Match passed again

[2021-09-17 16:47:55] 2021/09/17 16:47:55.233098 L0 IN:EMU:MERCURY_01:1:CALFILE: No reply within 175 ms to "READ:DEV:MB1.T1:TEMP:CAL:FILE<0a>"
[2021-09-17 16:47:55] 2021/09/17 16:47:55.244910 L0 IN:EMU:MERCURY_01:2:NAME: Input "STAT:DEV:MB1.T1:TEMP:CAL:FILE:X127679_HX..."
[2021-09-17 16:47:55] 2021/09/17 16:47:55.244926 L0 IN:EMU:MERCURY_01:2:NAME: mismatch after 9 bytes "STAT:DEV:"
[2021-09-17 16:47:55] 2021/09/17 16:47:55.244934 L0 IN:EMU:MERCURY_01:2:NAME: got "MB1.T1:TEM..." where "DB6.T1:TEMP:NICK:" was expected
[2021-09-17 16:47:55] 2021/09/17 16:47:55.253916 L0 IN:EMU:MERCURY_01:2:HTRCHAN: Input "...TAT:DEV:DB6.T1:TEMP:NICK:DB6.SH_T"
[2021-09-17 16:47:55] 2021/09/17 16:47:55.253930 L0 IN:EMU:MERCURY_01:2:HTRCHAN: mismatch after 21 bytes "...6.T1:TEMP:"
[2021-09-17 16:47:55] 2021/09/17 16:47:55.253939 L0 IN:EMU:MERCURY_01:2:HTRCHAN: got "NICK:DB6.S..." where "LOOP:HTR:" was expected
[2021-09-17 16:47:55] 2021/09/17 16:47:55.265682 L0 IN:EMU:MERCURY_01:1:TEMP: Input "STAT:DEV:DB6.T1:TEMP:LOOP:HTR:DB1.SH_H..."
[2021-09-17 16:47:55] 2021/09/17 16:47:55.265696 L0 IN:EMU:MERCURY_01:1:TEMP: mismatch after 9 bytes "STAT:DEV:"
[2021-09-17 16:47:55] 2021/09/17 16:47:55.265705 L0 IN:EMU:MERCURY_01:1:TEMP: got "DB6.T1:TEM..." where "MB1.T1:TEMP:SIG:TEMP:" was expected
[2021-09-17 16:47:55] 2021/09/17 16:47:55.280757 L0 IN:EMU:MERCURY_01:2:AUXCHAN: Input "STAT:DEV:MB1.T1:TEMP:SIG:TEMP:293.8221K..."
[2021-09-17 16:47:55] 2021/09/17 16:47:55.280774 L0 IN:EMU:MERCURY_01:2:AUXCHAN: mismatch after 9 bytes "STAT:DEV:"
[2021-09-17 16:47:55] 2021/09/17 16:47:55.280782 L0 IN:EMU:MERCURY_01:2:AUXCHAN: got "MB1.T1:TEM..." where "DB6.T1:TEMP:LOOP:AUX:" was expected
[2021-09-17 16:47:55] 2021/09/17 16:47:55.323289 L0 IN:EMU:MERCURY_01:1:TEMP:SP:RBV: Input "STAT:DEV:DB6.T1:TEMP:LOOP:AUX:None..."
[2021-09-17 16:47:55] 2021/09/17 16:47:55.323307 L0 IN:EMU:MERCURY_01:1:TEMP:SP:RBV: mismatch after 9 bytes "STAT:DEV:"
[2021-09-17 16:47:55] 2021/09/17 16:47:55.323316 L0 IN:EMU:MERCURY_01:1:TEMP:SP:RBV: got "DB6.T1:TEM..." where "MB1.T1:TEMP:LOOP:TSET:" was expected
[2021-09-17 16:47:59] 2021/09/17 16:47:59.563663 L0 IN:EMU:MERCURY_01:1:TEMP: Match passed again
[2021-09-17 16:48:00] 2021/09/17 16:48:00.003902 L0 IN:EMU:MERCURY_01:1:TEMP:SP:RBV: Match passed again
[2021-09-17 16:48:06] 2021/09/17 16:48:06.232135 L0 IN:EMU:MERCURY_01:2:NAME: Match passed again
[2021-09-17 16:48:06] 2021/09/17 16:48:06.252618 L0 IN:EMU:MERCURY_01:2:HTRCHAN: Match passed again
[2021-09-17 16:48:06] 2021/09/17 16:48:06.274286 L0 IN:EMU:MERCURY_01:2:AUXCHAN: Match passed again

[2021-09-17 16:48:48] 2021/09/17 16:48:48.211355 L0 IN:EMU:MERCURY_01:LEVEL:1:HELIUM: No reply within 175 ms to "READ:DEV:DB2.L1:LVL:SIG:HEL:LEV<0a>"
[2021-09-17 16:48:48] 2021/09/17 16:48:48.220928 L0 IN:EMU:MERCURY_01:LEVEL:1:NITROGEN: Input "...:DEV:DB2.L1:LVL:SIG:HEL:LEV:-7.7999%"
[2021-09-17 16:48:48] 2021/09/17 16:48:48.220946 L0 IN:EMU:MERCURY_01:LEVEL:1:NITROGEN: mismatch after 24 bytes "...1:LVL:SIG:"
[2021-09-17 16:48:48] 2021/09/17 16:48:48.220959 L0 IN:EMU:MERCURY_01:LEVEL:1:NITROGEN: got "HEL:LEV:-7..." where "NIT:LEV:" was expected
[2021-09-17 16:48:48] 2021/09/17 16:48:48.266887 L0 IN:EMU:MERCURY_01:PRESSURE:1:PRESSURE:SP:RBV: Input "STAT:DEV:DB2.L1:LVL:SIG:NIT:LEV:0.7645%..."
[2021-09-17 16:48:48] 2021/09/17 16:48:48.266905 L0 IN:EMU:MERCURY_01:PRESSURE:1:PRESSURE:SP:RBV: mismatch after 11 bytes "...TAT:DEV:DB"
[2021-09-17 16:48:48] 2021/09/17 16:48:48.266914 L0 IN:EMU:MERCURY_01:PRESSURE:1:PRESSURE:SP:RBV: got "2.L1:LVL:S..." where "5.P1:PRES:LOOP:PRST:" was expected
[2021-09-17 16:48:51] 2021/09/17 16:48:51.944153 L0 IN:EMU:MERCURY_01:LEVEL:1:NITROGEN: Match passed again
[2021-09-17 16:48:51] 2021/09/17 16:48:51.963929 L0 IN:EMU:MERCURY_01:PRESSURE:1:PRESSURE:SP:RBV: Match passed again

[2021-09-17 16:49:01] 2021/09/17 16:49:01.237137 L0 IN:EMU:MERCURY_01:3:TEMP:SP:RBV: No reply within 175 ms to "READ:DEV:DB7.T1:TEMP:LOOP:TSET<0a>"
[2021-09-17 16:49:01] 2021/09/17 16:49:01.252552 L0 IN:EMU:MERCURY_01:3:RESISTANCE: Input "...TAT:DEV:DB7.T1:TEMP:LOOP:TSET:0.0000K"
[2021-09-17 16:49:01] 2021/09/17 16:49:01.252569 L0 IN:EMU:MERCURY_01:3:RESISTANCE: mismatch after 21 bytes "...7.T1:TEMP:"
[2021-09-17 16:49:01] 2021/09/17 16:49:01.252577 L0 IN:EMU:MERCURY_01:3:RESISTANCE: got "LOOP:TSET:..." where "SIG:RES:" was expected
[2021-09-17 16:49:01] 2021/09/17 16:49:01.279067 L0 IN:EMU:MERCURY_01:3:P: Input "...TAT:DEV:DB7.T1:TEMP:SIG:RES:10000000.000..."
[2021-09-17 16:49:01] 2021/09/17 16:49:01.279084 L0 IN:EMU:MERCURY_01:3:P: mismatch after 21 bytes "...7.T1:TEMP:"
[2021-09-17 16:49:01] 2021/09/17 16:49:01.279093 L0 IN:EMU:MERCURY_01:3:P: got "SIG:RES:10..." where "LOOP:P:" was expected
[2021-09-17 16:49:07] 2021/09/17 16:49:07.732128 L0 IN:EMU:MERCURY_01:3:RESISTANCE: Match passed again
[2021-09-17 16:49:07] 2021/09/17 16:49:07.757180 L0 IN:EMU:MERCURY_01:3:P: Match passed again

[2021-09-17 16:49:53] 2021/09/17 16:49:53.490585 L0 IN:EMU:MERCURY_01:PRESSURE:2:HEATER:VOLT: No reply within 175 ms to "READ:DEV:None:HTR:SIG:VOLT<0a>"
[2021-09-17 16:49:53] 2021/09/17 16:49:53.541296 L0 IN:EMU:MERCURY_01:1:PID:AUTO: Input "STAT:DEV:None:N/A"
[2021-09-17 16:49:53] 2021/09/17 16:49:53.541315 L0 IN:EMU:MERCURY_01:1:PID:AUTO: mismatch after 9 bytes "STAT:DEV:"
[2021-09-17 16:49:53] 2021/09/17 16:49:53.541323 L0 IN:EMU:MERCURY_01:1:PID:AUTO: got "None:N/A" where "MB1.T1:TEMP:LOOP:PIDT:" was expected
[2021-09-17 16:49:53] 2021/09/17 16:49:53.581526 L0 IN:EMU:MERCURY_01:1:HEATER:MODE: Input "...EV:MB1.T1:TEMP:LOOP:PIDT:OFF"
[2021-09-17 16:49:53] 2021/09/17 16:49:53.581542 L0 IN:EMU:MERCURY_01:1:HEATER:MODE: mismatch after 26 bytes "...TEMP:LOOP:"
[2021-09-17 16:49:53] 2021/09/17 16:49:53.581550 L0 IN:EMU:MERCURY_01:1:HEATER:MODE: got "PIDT:OFF" where "ENAB:" was expected
[2021-09-17 16:49:59] 2021/09/17 16:49:59.450282 L0 IN:EMU:MERCURY_01:1:PID:AUTO: Match passed again
[2021-09-17 16:49:59] 2021/09/17 16:49:59.467232 L0 IN:EMU:MERCURY_01:1:HEATER:MODE: Match passed again

[2021-09-17 16:52:50] 2021/09/17 16:52:50.493311 L0 IN:EMU:MERCURY_01:1:HEATER:VOLT: No reply within 175 ms to "READ:DEV:MB0.VTI_H:HTR:SIG:VOLT<0a>"
[2021-09-17 16:52:50] 2021/09/17 16:52:50.509807 L0 IN:EMU:MERCURY_01:1:HEATER:CURR: Input "...V:MB0.VTI_H:HTR:SIG:VOLT:0.0002V"
[2021-09-17 16:52:50] 2021/09/17 16:52:50.509823 L0 IN:EMU:MERCURY_01:1:HEATER:CURR: mismatch after 27 bytes "...H:HTR:SIG:"
[2021-09-17 16:52:50] 2021/09/17 16:52:50.509832 L0 IN:EMU:MERCURY_01:1:HEATER:CURR: got "VOLT:0.000..." where "CURR:" was expected
[2021-09-17 16:52:50] 2021/09/17 16:52:50.521668 L0 IN:EMU:MERCURY_01:1:FLOW: Input "STAT:DEV:MB0.VTI_H:HTR:SIG:CURR:-0.0000A..."
[2021-09-17 16:52:50] 2021/09/17 16:52:50.521684 L0 IN:EMU:MERCURY_01:1:FLOW: mismatch after 19 bytes "...MB0.VTI_H:"
[2021-09-17 16:52:50] 2021/09/17 16:52:50.521693 L0 IN:EMU:MERCURY_01:1:FLOW: got "HTR:SIG:CU..." where "AUX:SIG:PERC:" was expected
[2021-09-17 16:52:50] 2021/09/17 16:52:50.538799 L0 IN:EMU:MERCURY_01:2:TEMP:SP:RBV: Input "STAT:DEV:None:N/A"
[2021-09-17 16:52:50] 2021/09/17 16:52:50.538814 L0 IN:EMU:MERCURY_01:2:TEMP:SP:RBV: mismatch after 9 bytes "STAT:DEV:"
[2021-09-17 16:52:50] 2021/09/17 16:52:50.538822 L0 IN:EMU:MERCURY_01:2:TEMP:SP:RBV: got "None:N/A" where "DB6.T1:TEMP:LOOP:TSET:" was expected
[2021-09-17 16:52:50] 2021/09/17 16:52:50.580455 L0 IN:EMU:MERCURY_01:2:RESISTANCE: Input "...TAT:DEV:DB6.T1:TEMP:LOOP:TSET:0.1000K"
[2021-09-17 16:52:50] 2021/09/17 16:52:50.580471 L0 IN:EMU:MERCURY_01:2:RESISTANCE: mismatch after 21 bytes "...6.T1:TEMP:"
[2021-09-17 16:52:50] 2021/09/17 16:52:50.580480 L0 IN:EMU:MERCURY_01:2:RESISTANCE: got "LOOP:TSET:..." where "SIG:RES:" was expected
[2021-09-17 16:52:56] 2021/09/17 16:52:56.516953 L0 IN:EMU:MERCURY_01:1:HEATER:CURR: Match passed again
[2021-09-17 16:52:56] 2021/09/17 16:52:56.536367 L0 IN:EMU:MERCURY_01:1:FLOW: Input "STAT:DEV:None:N/A"
[2021-09-17 16:52:56] 2021/09/17 16:52:56.536381 L0 IN:EMU:MERCURY_01:1:FLOW: mismatch after 14 bytes "...:DEV:None:"
[2021-09-17 16:52:56] 2021/09/17 16:52:56.536390 L0 IN:EMU:MERCURY_01:1:FLOW: got "N/A" where "AUX:SIG:PERC:" was expected
[2021-09-17 16:52:56] 2021/09/17 16:52:56.573506 L0 IN:EMU:MERCURY_01:2:TEMP:SP:RBV: Match passed again
[2021-09-17 16:52:56] 2021/09/17 16:52:56.630380 L0 IN:EMU:MERCURY_01:2:RESISTANCE: Match passed again

[2021-09-17 16:54:20] 2021/09/17 16:54:20.735488 L0 IN:EMU:MERCURY_01:2:I: No reply within 175 ms to "READ:DEV:DB6.T1:TEMP:LOOP:I<0a>"
[2021-09-17 16:54:20] 2021/09/17 16:54:20.748331 L0 IN:EMU:MERCURY_01:2:D: Input "...EV:DB6.T1:TEMP:LOOP:I:2.000000"
[2021-09-17 16:54:20] 2021/09/17 16:54:20.748347 L0 IN:EMU:MERCURY_01:2:D: mismatch after 26 bytes "...TEMP:LOOP:"
[2021-09-17 16:54:20] 2021/09/17 16:54:20.748356 L0 IN:EMU:MERCURY_01:2:D: got "I:2.000000" where "D:" was expected
[2021-09-17 16:54:20] 2021/09/17 16:54:20.817469 L0 IN:EMU:MERCURY_01:2:HEATER: Input "...EV:DB6.T1:TEMP:LOOP:D:0.000000"
[2021-09-17 16:54:20] 2021/09/17 16:54:20.817488 L0 IN:EMU:MERCURY_01:2:HEATER: mismatch after 26 bytes "...TEMP:LOOP:"
[2021-09-17 16:54:20] 2021/09/17 16:54:20.817497 L0 IN:EMU:MERCURY_01:2:HEATER: got "D:0.000000" where "HSET:" was expected
[2021-09-17 16:54:26] 2021/09/17 16:54:26.932299 L0 IN:EMU:MERCURY_01:2:D: Match passed again
[2021-09-17 16:54:26] 2021/09/17 16:54:26.953882 L0 IN:EMU:MERCURY_01:2:HEATER: Match passed again

[2021-09-17 16:58:07] 2021/09/17 16:58:07.736360 L0 IN:EMU:MERCURY_01:PRESSURE:2:FLOW: No reply within 175 ms to "READ:DEV:None:AUX:SIG:PERC<0a>"
[2021-09-17 16:58:07] 2021/09/17 16:58:07.751763 L0 IN:EMU:MERCURY_01:1:PID:AUTO: Input "STAT:DEV:None:N/A"
[2021-09-17 16:58:07] 2021/09/17 16:58:07.751780 L0 IN:EMU:MERCURY_01:1:PID:AUTO: mismatch after 9 bytes "STAT:DEV:"
[2021-09-17 16:58:07] 2021/09/17 16:58:07.751788 L0 IN:EMU:MERCURY_01:1:PID:AUTO: got "None:N/A" where "MB1.T1:TEMP:LOOP:PIDT:" was expected
[2021-09-17 16:58:07] 2021/09/17 16:58:07.777064 L0 IN:EMU:MERCURY_01:1:HEATER:MODE: Input "...EV:MB1.T1:TEMP:LOOP:PIDT:OFF"
[2021-09-17 16:58:07] 2021/09/17 16:58:07.777080 L0 IN:EMU:MERCURY_01:1:HEATER:MODE: mismatch after 26 bytes "...TEMP:LOOP:"
[2021-09-17 16:58:07] 2021/09/17 16:58:07.777088 L0 IN:EMU:MERCURY_01:1:HEATER:MODE: got "PIDT:OFF" where "ENAB:" was expected
[2021-09-17 16:58:12] 2021/09/17 16:58:12.541744 L0 IN:EMU:MERCURY_01:1:PID:AUTO: Match passed again
[2021-09-17 16:58:12] 2021/09/17 16:58:12.575459 L0 IN:EMU:MERCURY_01:1:HEATER:MODE: Match passed again

[2021-09-17 16:58:31] 2021/09/17 16:58:31.983621 L0 IN:EMU:MERCURY_01:4:HEATER:POWER: No reply within 175 ms to "READ:DEV:None:HTR:SIG:POWR<0a>"
[2021-09-17 16:58:32] 2021/09/17 16:58:32.074135 L0 IN:EMU:MERCURY_01:LEVEL:1:HELIUM: Input "STAT:DEV:None:N/A"
[2021-09-17 16:58:32] 2021/09/17 16:58:32.074153 L0 IN:EMU:MERCURY_01:LEVEL:1:HELIUM: mismatch after 9 bytes "STAT:DEV:"
[2021-09-17 16:58:32] 2021/09/17 16:58:32.074162 L0 IN:EMU:MERCURY_01:LEVEL:1:HELIUM: got "None:N/A" where "DB2.L1:LVL:SIG:HEL:LEV:" was expected
[2021-09-17 16:58:36] 2021/09/17 16:58:36.735618 L0 IN:EMU:MERCURY_01:LEVEL:1:HELIUM: Match passed again

[2021-09-17 16:59:10] 2021/09/17 16:59:10.987756 L0 IN:EMU:MERCURY_01:2:FLOW: No reply within 175 ms to "READ:DEV:None:AUX:SIG:PERC<0a>"
[2021-09-17 16:59:10] 2021/09/17 16:59:10.997036 L0 IN:EMU:MERCURY_01:3:TEMP:SP:RBV: Input "STAT:DEV:None:N/A"
[2021-09-17 16:59:10] 2021/09/17 16:59:10.997050 L0 IN:EMU:MERCURY_01:3:TEMP:SP:RBV: mismatch after 9 bytes "STAT:DEV:"
[2021-09-17 16:59:10] 2021/09/17 16:59:10.997058 L0 IN:EMU:MERCURY_01:3:TEMP:SP:RBV: got "None:N/A" where "DB7.T1:TEMP:LOOP:TSET:" was expected
[2021-09-17 16:59:11] 2021/09/17 16:59:11.029635 L0 IN:EMU:MERCURY_01:3:RESISTANCE: Input "...TAT:DEV:DB7.T1:TEMP:LOOP:TSET:0.0000K"
[2021-09-17 16:59:11] 2021/09/17 16:59:11.029650 L0 IN:EMU:MERCURY_01:3:RESISTANCE: mismatch after 21 bytes "...7.T1:TEMP:"
[2021-09-17 16:59:11] 2021/09/17 16:59:11.029658 L0 IN:EMU:MERCURY_01:3:RESISTANCE: got "LOOP:TSET:..." where "SIG:RES:" was expected
[2021-09-17 16:59:15] 2021/09/17 16:59:15.944810 L0 IN:EMU:MERCURY_01:3:TEMP:SP:RBV: Match passed again
[2021-09-17 16:59:15] 2021/09/17 16:59:15.966917 L0 IN:EMU:MERCURY_01:3:RESISTANCE: Match passed again

Logs when using @replyTimeout handler from 16:28:44 (end of IOC start) to 16:44 (around 15 minutes) seem to include 7 separate groups where there was no reply and a timeout occurred which resulted in mismatches. The total of the time spent in error was 48.815173 seconds (= 6.215212 + 6.734763 + 5.172614 + 15.742833 + 4.748808 + 5.009762 + 5.191181). Here is the log from that time:

[2021-09-17 16:28:44] 2021/09/17 16:28:44.218742 L0 IN:EMU:MERCURY_01:PRESSURE:2:HEATER:POWER: No reply within 175 ms to "READ:DEV:None:HTR:SIG:POWR<0a>"
[2021-09-17 16:28:44] 2021/09/17 16:28:44.298443 L0 IN:EMU:MERCURY_01:1:PID:AUTO: Input "STAT:DEV:None:N/A"
[2021-09-17 16:28:44] 2021/09/17 16:28:44.298460 L0 IN:EMU:MERCURY_01:1:PID:AUTO: mismatch after 9 bytes "STAT:DEV:"
[2021-09-17 16:28:44] 2021/09/17 16:28:44.298469 L0 IN:EMU:MERCURY_01:1:PID:AUTO: got "None:N/A" where "MB1.T1:TEMP:LOOP:PIDT:" was expected
[2021-09-17 16:28:44] 2021/09/17 16:28:44.330822 L0 IN:EMU:MERCURY_01:1:HEATER:MODE: Input "...EV:MB1.T1:TEMP:LOOP:PIDT:OFF"
[2021-09-17 16:28:44] 2021/09/17 16:28:44.330840 L0 IN:EMU:MERCURY_01:1:HEATER:MODE: mismatch after 26 bytes "...TEMP:LOOP:"
[2021-09-17 16:28:44] 2021/09/17 16:28:44.330849 L0 IN:EMU:MERCURY_01:1:HEATER:MODE: got "PIDT:OFF" where "ENAB:" was expected
[2021-09-17 16:28:50] 2021/09/17 16:28:50.414617 L0 IN:EMU:MERCURY_01:1:PID:AUTO: Match passed again
[2021-09-17 16:28:50] 2021/09/17 16:28:50.433954 L0 IN:EMU:MERCURY_01:1:HEATER:MODE: Match passed again

[2021-09-17 16:31:53] 2021/09/17 16:31:53.727365 L0 IN:EMU:MERCURY_01:PRESSURE:2:I: No reply within 175 ms to "READ:DEV:DB3.P1:PRES:LOOP:I<0a>"
[2021-09-17 16:31:53] 2021/09/17 16:31:53.902714 L0 IN:EMU:MERCURY_01:PRESSURE:2:I: No reply within 175 ms to "READ:DEV:DB3.P1:PRES:LOOP:I<0a>"
[2021-09-17 16:31:53] 2021/09/17 16:31:53.951891 L0 IN:EMU:MERCURY_01:PRESSURE:2:D: Input "...EV:DB3.P1:PRES:LOOP:I:1.0"
[2021-09-17 16:31:53] 2021/09/17 16:31:53.951945 L0 IN:EMU:MERCURY_01:PRESSURE:2:D: mismatch after 26 bytes "...PRES:LOOP:"
[2021-09-17 16:31:53] 2021/09/17 16:31:53.951955 L0 IN:EMU:MERCURY_01:PRESSURE:2:D: got "I:1.0" where "D:" was expected
[2021-09-17 16:31:53] 2021/09/17 16:31:53.993563 L0 IN:EMU:MERCURY_01:PRESSURE:2:HEATER: Input "...EV:DB3.P1:PRES:LOOP:I:1.0"
[2021-09-17 16:31:53] 2021/09/17 16:31:53.993579 L0 IN:EMU:MERCURY_01:PRESSURE:2:HEATER: mismatch after 26 bytes "...PRES:LOOP:"
[2021-09-17 16:31:53] 2021/09/17 16:31:53.993587 L0 IN:EMU:MERCURY_01:PRESSURE:2:HEATER: got "I:1.0" where "HSET:" was expected
[2021-09-17 16:32:00] 2021/09/17 16:32:00.422295 L0 IN:EMU:MERCURY_01:PRESSURE:2:D: Match passed again
[2021-09-17 16:32:00] 2021/09/17 16:32:00.462128 L0 IN:EMU:MERCURY_01:PRESSURE:2:HEATER: Match passed again

[2021-09-17 16:34:51] 2021/09/17 16:34:51.483258 L0 IN:EMU:MERCURY_01:3:TEMP: No reply within 175 ms to "READ:DEV:DB7.T1:TEMP:SIG:TEMP<0a>"
[2021-09-17 16:34:51] 2021/09/17 16:34:51.506090 L0 IN:EMU:MERCURY_01:4:TEMP: Input "STAT:DEV:DB7.T1:TEMP:SIG:TEMP:0.0000K..."
[2021-09-17 16:34:51] 2021/09/17 16:34:51.506106 L0 IN:EMU:MERCURY_01:4:TEMP: mismatch after 11 bytes "...TAT:DEV:DB"
[2021-09-17 16:34:51] 2021/09/17 16:34:51.506114 L0 IN:EMU:MERCURY_01:4:TEMP: got "7.T1:TEMP:..." where "8.T1:TEMP:SIG:TEMP:" was expected
[2021-09-17 16:34:51] 2021/09/17 16:34:51.517097 L0 IN:EMU:MERCURY_01:PRESSURE:1:PRESSURE: Input "STAT:DEV:DB8.T1:TEMP:SIG:TEMP:0.0000K..."
[2021-09-17 16:34:51] 2021/09/17 16:34:51.517112 L0 IN:EMU:MERCURY_01:PRESSURE:1:PRESSURE: mismatch after 11 bytes "...TAT:DEV:DB"
[2021-09-17 16:34:51] 2021/09/17 16:34:51.517121 L0 IN:EMU:MERCURY_01:PRESSURE:1:PRESSURE: got "8.T1:TEMP:..." where "5.P1:PRES:SIG:PRES:" was expected
[2021-09-17 16:34:51] 2021/09/17 16:34:51.531788 L0 IN:EMU:MERCURY_01:PRESSURE:2:PRESSURE: Input "STAT:DEV:DB5.P1:PRES:SIG:PRES:989.6612mB..."
[2021-09-17 16:34:51] 2021/09/17 16:34:51.531804 L0 IN:EMU:MERCURY_01:PRESSURE:2:PRESSURE: mismatch after 11 bytes "...TAT:DEV:DB"
[2021-09-17 16:34:51] 2021/09/17 16:34:51.531812 L0 IN:EMU:MERCURY_01:PRESSURE:2:PRESSURE: got "5.P1:PRES:..." where "3.P1:PRES:SIG:PRES:" was expected
[2021-09-17 16:34:51] 2021/09/17 16:34:51.573974 L0 IN:EMU:MERCURY_01:1:TEMP:SP:RBV: Input "STAT:DEV:DB3.P1:PRES:SIG:PRES:993.6565mB..."
[2021-09-17 16:34:51] 2021/09/17 16:34:51.573990 L0 IN:EMU:MERCURY_01:1:TEMP:SP:RBV: mismatch after 9 bytes "STAT:DEV:"
[2021-09-17 16:34:51] 2021/09/17 16:34:51.573999 L0 IN:EMU:MERCURY_01:1:TEMP:SP:RBV: got "DB3.P1:PRE..." where "MB1.T1:TEMP:LOOP:TSET:" was expected
[2021-09-17 16:34:55] 2021/09/17 16:34:55.246246 L0 IN:EMU:MERCURY_01:4:TEMP: Match passed again
[2021-09-17 16:34:55] 2021/09/17 16:34:55.266308 L0 IN:EMU:MERCURY_01:PRESSURE:1:PRESSURE: Match passed again
[2021-09-17 16:34:55] 2021/09/17 16:34:55.307764 L0 IN:EMU:MERCURY_01:PRESSURE:2:PRESSURE: Match passed again
[2021-09-17 16:34:56] 2021/09/17 16:34:56.655872 L0 IN:EMU:MERCURY_01:1:TEMP:SP:RBV: Match passed again

[2021-09-17 16:36:44] 2021/09/17 16:36:44.721388 L0 IN:EMU:MERCURY_01:PRESSURE:2:HEATER:VOLT: No reply within 175 ms to "READ:DEV:None:HTR:SIG:VOLT<0a>"
[2021-09-17 16:36:44] 2021/09/17 16:36:44.770144 L0 IN:EMU:MERCURY_01:1:PID:AUTO: Input "STAT:DEV:None:N/A"
[2021-09-17 16:36:44] 2021/09/17 16:36:44.770161 L0 IN:EMU:MERCURY_01:1:PID:AUTO: mismatch after 9 bytes "STAT:DEV:"
[2021-09-17 16:36:44] 2021/09/17 16:36:44.770169 L0 IN:EMU:MERCURY_01:1:PID:AUTO: got "None:N/A" where "MB1.T1:TEMP:LOOP:PIDT:" was expected
[2021-09-17 16:36:44] 2021/09/17 16:36:44.811608 L0 IN:EMU:MERCURY_01:1:HEATER:MODE: Input "...EV:MB1.T1:TEMP:LOOP:PIDT:OFF"
[2021-09-17 16:36:44] 2021/09/17 16:36:44.811624 L0 IN:EMU:MERCURY_01:1:HEATER:MODE: mismatch after 26 bytes "...TEMP:LOOP:"
[2021-09-17 16:36:44] 2021/09/17 16:36:44.811633 L0 IN:EMU:MERCURY_01:1:HEATER:MODE: got "PIDT:OFF" where "ENAB:" was expected
[2021-09-17 16:36:47] 2021/09/17 16:36:47.727122 L0 IN:EMU:MERCURY_01:3:TEMP:SP:RBV: No reply within 175 ms to "READ:DEV:DB7.T1:TEMP:LOOP:TSET<0a>"
[2021-09-17 16:36:47] 2021/09/17 16:36:47.750852 L0 IN:EMU:MERCURY_01:3:RESISTANCE: Input "...TAT:DEV:DB7.T1:TEMP:LOOP:TSET:0.0000K"
[2021-09-17 16:36:47] 2021/09/17 16:36:47.750869 L0 IN:EMU:MERCURY_01:3:RESISTANCE: mismatch after 21 bytes "...7.T1:TEMP:"
[2021-09-17 16:36:47] 2021/09/17 16:36:47.750877 L0 IN:EMU:MERCURY_01:3:RESISTANCE: got "LOOP:TSET:..." where "SIG:RES:" was expected
[2021-09-17 16:36:47] 2021/09/17 16:36:47.763124 L0 IN:EMU:MERCURY_01:3:P: Input "...TAT:DEV:DB7.T1:TEMP:SIG:RES:10000000.000..."
[2021-09-17 16:36:47] 2021/09/17 16:36:47.763140 L0 IN:EMU:MERCURY_01:3:P: mismatch after 21 bytes "...7.T1:TEMP:"
[2021-09-17 16:36:47] 2021/09/17 16:36:47.763148 L0 IN:EMU:MERCURY_01:3:P: got "SIG:RES:10..." where "LOOP:P:" was expected
[2021-09-17 16:36:47] 2021/09/17 16:36:47.774260 L0 IN:EMU:MERCURY_01:3:I: Input "...EV:DB7.T1:TEMP:LOOP:P:1.0"
[2021-09-17 16:36:47] 2021/09/17 16:36:47.774276 L0 IN:EMU:MERCURY_01:3:I: mismatch after 26 bytes "...TEMP:LOOP:"
[2021-09-17 16:36:47] 2021/09/17 16:36:47.774284 L0 IN:EMU:MERCURY_01:3:I: got "P:1.0" where "I:" was expected
[2021-09-17 16:36:47] 2021/09/17 16:36:47.805056 L0 IN:EMU:MERCURY_01:3:D: Input "...EV:DB7.T1:TEMP:LOOP:I:1.0"
[2021-09-17 16:36:47] 2021/09/17 16:36:47.805071 L0 IN:EMU:MERCURY_01:3:D: mismatch after 26 bytes "...TEMP:LOOP:"
[2021-09-17 16:36:47] 2021/09/17 16:36:47.805079 L0 IN:EMU:MERCURY_01:3:D: got "I:1.0" where "D:" was expected
[2021-09-17 16:36:49] 2021/09/17 16:36:49.504377 L0 IN:EMU:MERCURY_01:1:PID:AUTO: Match passed again
[2021-09-17 16:36:49] 2021/09/17 16:36:49.529008 L0 IN:EMU:MERCURY_01:1:HEATER:MODE: Match passed again
[2021-09-17 16:36:50] 2021/09/17 16:36:50.735806 L0 IN:EMU:MERCURY_01:3:CALFILE: No reply within 175 ms to "READ:DEV:DB7.T1:TEMP:CAL:FILE<0a>"
[2021-09-17 16:36:50] 2021/09/17 16:36:50.774702 L0 IN:EMU:MERCURY_01:4:NAME: Input "STAT:DEV:DB7.T1:TEMP:CAL:FILE:X129878 K2..."
[2021-09-17 16:36:50] 2021/09/17 16:36:50.774720 L0 IN:EMU:MERCURY_01:4:NAME: mismatch after 11 bytes "...TAT:DEV:DB"
[2021-09-17 16:36:50] 2021/09/17 16:36:50.774729 L0 IN:EMU:MERCURY_01:4:NAME: got "7.T1:TEMP:..." where "8.T1:TEMP:NICK:" was expected
[2021-09-17 16:36:52] 2021/09/17 16:36:52.733434 L0 IN:EMU:MERCURY_01:3:RESISTANCE: Match passed again
[2021-09-17 16:36:52] 2021/09/17 16:36:52.750040 L0 IN:EMU:MERCURY_01:3:P: Match passed again
[2021-09-17 16:36:52] 2021/09/17 16:36:52.775121 L0 IN:EMU:MERCURY_01:3:I: Match passed again
[2021-09-17 16:36:52] 2021/09/17 16:36:52.810216 L0 IN:EMU:MERCURY_01:3:D: Match passed again
[2021-09-17 16:37:00] 2021/09/17 16:37:00.464221 L0 IN:EMU:MERCURY_01:4:NAME: Match passed again

[2021-09-17 16:37:27] 2021/09/17 16:37:27.729619 L0 IN:EMU:MERCURY_01:2:HEATER:CURR: No reply within 175 ms to "READ:DEV:DB1.SH_H:HTR:SIG:CURR<0a>"
[2021-09-17 16:37:27] 2021/09/17 16:37:27.753304 L0 IN:EMU:MERCURY_01:2:FLOW: Input "STAT:DEV:DB1.SH_H:HTR:SIG:CURR:-0.0000A..."
[2021-09-17 16:37:27] 2021/09/17 16:37:27.753322 L0 IN:EMU:MERCURY_01:2:FLOW: mismatch after 18 bytes "...:DB1.SH_H:"
[2021-09-17 16:37:27] 2021/09/17 16:37:27.753331 L0 IN:EMU:MERCURY_01:2:FLOW: got "HTR:SIG:CU..." where "AUX:SIG:PERC:" was expected
[2021-09-17 16:37:27] 2021/09/17 16:37:27.761586 L0 IN:EMU:MERCURY_01:3:TEMP:SP:RBV: Input "STAT:DEV:None:N/A"
[2021-09-17 16:37:27] 2021/09/17 16:37:27.761599 L0 IN:EMU:MERCURY_01:3:TEMP:SP:RBV: mismatch after 9 bytes "STAT:DEV:"
[2021-09-17 16:37:27] 2021/09/17 16:37:27.761607 L0 IN:EMU:MERCURY_01:3:TEMP:SP:RBV: got "None:N/A" where "DB7.T1:TEMP:LOOP:TSET:" was expected
[2021-09-17 16:37:27] 2021/09/17 16:37:27.776727 L0 IN:EMU:MERCURY_01:3:RESISTANCE: Input "...TAT:DEV:DB7.T1:TEMP:LOOP:TSET:0.0000K"
[2021-09-17 16:37:27] 2021/09/17 16:37:27.776742 L0 IN:EMU:MERCURY_01:3:RESISTANCE: mismatch after 21 bytes "...7.T1:TEMP:"
[2021-09-17 16:37:27] 2021/09/17 16:37:27.776761 L0 IN:EMU:MERCURY_01:3:RESISTANCE: got "LOOP:TSET:..." where "SIG:RES:" was expected
[2021-09-17 16:37:27] 2021/09/17 16:37:27.819857 L0 IN:EMU:MERCURY_01:3:P: Input "...TAT:DEV:DB7.T1:TEMP:SIG:RES:10000000.000..."
[2021-09-17 16:37:27] 2021/09/17 16:37:27.819873 L0 IN:EMU:MERCURY_01:3:P: mismatch after 21 bytes "...7.T1:TEMP:"
[2021-09-17 16:37:27] 2021/09/17 16:37:27.819881 L0 IN:EMU:MERCURY_01:3:P: got "SIG:RES:10..." where "LOOP:P:" was expected
[2021-09-17 16:37:32] 2021/09/17 16:37:32.285545 L0 IN:EMU:MERCURY_01:2:FLOW: Input "STAT:DEV:None:N/A"
[2021-09-17 16:37:32] 2021/09/17 16:37:32.285560 L0 IN:EMU:MERCURY_01:2:FLOW: mismatch after 14 bytes "...:DEV:None:"
[2021-09-17 16:37:32] 2021/09/17 16:37:32.285569 L0 IN:EMU:MERCURY_01:2:FLOW: got "N/A" where "AUX:SIG:PERC:" was expected
[2021-09-17 16:37:32] 2021/09/17 16:37:32.322301 L0 IN:EMU:MERCURY_01:3:TEMP:SP:RBV: Match passed again
[2021-09-17 16:37:32] 2021/09/17 16:37:32.442117 L0 IN:EMU:MERCURY_01:3:RESISTANCE: Match passed again
[2021-09-17 16:37:32] 2021/09/17 16:37:32.478427 L0 IN:EMU:MERCURY_01:3:P: Match passed again

[2021-09-17 16:40:27] 2021/09/17 16:40:27.706516 L0 IN:EMU:MERCURY_01:3:HEATER:CURR: No reply within 175 ms to "READ:DEV:None:HTR:SIG:CURR<0a>"
[2021-09-17 16:40:27] 2021/09/17 16:40:27.736893 L0 IN:EMU:MERCURY_01:4:TEMP:SP:RBV: Input "STAT:DEV:None:N/A"
[2021-09-17 16:40:27] 2021/09/17 16:40:27.736909 L0 IN:EMU:MERCURY_01:4:TEMP:SP:RBV: mismatch after 9 bytes "STAT:DEV:"
[2021-09-17 16:40:27] 2021/09/17 16:40:27.736918 L0 IN:EMU:MERCURY_01:4:TEMP:SP:RBV: got "None:N/A" where "DB8.T1:TEMP:LOOP:TSET:" was expected
[2021-09-17 16:40:27] 2021/09/17 16:40:27.747757 L0 IN:EMU:MERCURY_01:4:RESISTANCE: Input "...TAT:DEV:DB8.T1:TEMP:LOOP:TSET:10.0000K"
[2021-09-17 16:40:27] 2021/09/17 16:40:27.747773 L0 IN:EMU:MERCURY_01:4:RESISTANCE: mismatch after 21 bytes "...8.T1:TEMP:"
[2021-09-17 16:40:27] 2021/09/17 16:40:27.747781 L0 IN:EMU:MERCURY_01:4:RESISTANCE: got "LOOP:TSET:..." where "SIG:RES:" was expected
[2021-09-17 16:40:27] 2021/09/17 16:40:27.759888 L0 IN:EMU:MERCURY_01:4:P: Input "...TAT:DEV:DB8.T1:TEMP:SIG:RES:10000000.000..."
[2021-09-17 16:40:27] 2021/09/17 16:40:27.759903 L0 IN:EMU:MERCURY_01:4:P: mismatch after 21 bytes "...8.T1:TEMP:"
[2021-09-17 16:40:27] 2021/09/17 16:40:27.759912 L0 IN:EMU:MERCURY_01:4:P: got "SIG:RES:10..." where "LOOP:P:" was expected
[2021-09-17 16:40:27] 2021/09/17 16:40:27.767310 L0 IN:EMU:MERCURY_01:4:I: Input "...EV:DB8.T1:TEMP:LOOP:P:10.000000"
[2021-09-17 16:40:27] 2021/09/17 16:40:27.767347 L0 IN:EMU:MERCURY_01:4:I: mismatch after 26 bytes "...TEMP:LOOP:"
[2021-09-17 16:40:27] 2021/09/17 16:40:27.767356 L0 IN:EMU:MERCURY_01:4:I: got "P:10.00000..." where "I:" was expected
[2021-09-17 16:40:27] 2021/09/17 16:40:27.782782 L0 IN:EMU:MERCURY_01:4:D: Input "...EV:DB8.T1:TEMP:LOOP:I:100.000000"
[2021-09-17 16:40:27] 2021/09/17 16:40:27.782797 L0 IN:EMU:MERCURY_01:4:D: mismatch after 26 bytes "...TEMP:LOOP:"
[2021-09-17 16:40:27] 2021/09/17 16:40:27.782805 L0 IN:EMU:MERCURY_01:4:D: got "I:100.0000..." where "D:" was expected
[2021-09-17 16:40:27] 2021/09/17 16:40:27.823595 L0 IN:EMU:MERCURY_01:4:HEATER: Input "...EV:DB8.T1:TEMP:LOOP:D:0.000000"
[2021-09-17 16:40:27] 2021/09/17 16:40:27.823614 L0 IN:EMU:MERCURY_01:4:HEATER: mismatch after 26 bytes "...TEMP:LOOP:"
[2021-09-17 16:40:27] 2021/09/17 16:40:27.823622 L0 IN:EMU:MERCURY_01:4:HEATER: got "D:0.000000" where "HSET:" was expected
[2021-09-17 16:40:32] 2021/09/17 16:40:32.470211 L0 IN:EMU:MERCURY_01:4:TEMP:SP:RBV: Match passed again
[2021-09-17 16:40:32] 2021/09/17 16:40:32.491435 L0 IN:EMU:MERCURY_01:4:RESISTANCE: Match passed again
[2021-09-17 16:40:32] 2021/09/17 16:40:32.510367 L0 IN:EMU:MERCURY_01:4:P: Match passed again
[2021-09-17 16:40:32] 2021/09/17 16:40:32.548900 L0 IN:EMU:MERCURY_01:4:I: Match passed again
[2021-09-17 16:40:32] 2021/09/17 16:40:32.678181 L0 IN:EMU:MERCURY_01:4:D: Match passed again
[2021-09-17 16:40:32] 2021/09/17 16:40:32.716278 L0 IN:EMU:MERCURY_01:4:HEATER: Match passed again

[2021-09-17 16:41:22] 2021/09/17 16:41:22.977353 L0 IN:EMU:MERCURY_01:3:FLOW: No reply within 175 ms to "READ:DEV:None:AUX:SIG:PERC<0a>"
[2021-09-17 16:41:23] 2021/09/17 16:41:23.018836 L0 IN:EMU:MERCURY_01:4:TEMP:SP:RBV: Input "STAT:DEV:None:N/A"
[2021-09-17 16:41:23] 2021/09/17 16:41:23.018853 L0 IN:EMU:MERCURY_01:4:TEMP:SP:RBV: mismatch after 9 bytes "STAT:DEV:"
[2021-09-17 16:41:23] 2021/09/17 16:41:23.018862 L0 IN:EMU:MERCURY_01:4:TEMP:SP:RBV: got "None:N/A" where "DB8.T1:TEMP:LOOP:TSET:" was expected
[2021-09-17 16:41:28] 2021/09/17 16:41:28.168534 L0 IN:EMU:MERCURY_01:4:TEMP:SP:RBV: Match passed again

Although 48.815173 is an 11.932671-second improvement on 60.747844, I'm not sure it's a big enough improvement to suggest the method of using the @replyTimeout handler we are using makes the handling of reply timeout issues significantly more robust. It may be that we are not using @replyTimeout to its full potential, we can also try using ExtraInput=Ignore and running some longer tests.

JamesKingWork commented 2 years ago

One more thing to look into next week: were we sending lots of commands to the mercury at the time when the logs show the errors most prevalently? If so can we simulate that again?

I have tried simulating this using:

from genie_python import genie as g
from genie_python import BLOCK_NAMES as b
import time
from datetime import datetime, timedelta

temps_to_set = [27, 27.5, 28]
ps_to_set = [0, 0.1, 0.2]
is_to_set = [0.3, 0.4, 0.5]
ds_to_set = [6.4, 6.5, 6.6]

def main(seconds_for_running = 400, temps_to_set = temps_to_set, ps_to_set = ps_to_set, is_to_set = is_to_set, ds_to_set = ds_to_set, sleep_time = 0.1):
    start_time = datetime.now()
    i = 0
    while timedelta(seconds=seconds_for_running) + start_time > datetime.now():
        temp_to_set = temps_to_set[i % len(temps_to_set)]
        p_to_set = ps_to_set[i % len(ps_to_set)]
        i_to_set = is_to_set[i % len(is_to_set)]
        d_to_set = ds_to_set[i % len(ds_to_set)]
        print(f"Setting temp_sp={temp_to_set}, p={p_to_set}, i={i_to_set}, d={d_to_set}")
        g.cset(b.Temp_SP, temp_to_set)
        g.cset(b.Temp_P, p_to_set)
        g.cset(b.Temp_I, i_to_set)
        g.cset(b.Temp_D, d_to_set)
        print(f"Sleeping for {sleep_time} seconds")
        time.sleep(sleep_time)
        i += 1

I did not spot any errors initially running this, but eventually, there were a few errors in the log that seemed to occur every few minutes (with a 1000ms reply timeout):

[2021-09-20 12:00:20] 2021/09/20 12:00:20.401394 L0 IN:EMU:MERCURY_01:1:I:SP: No reply within 1000 ms to "SET:DEV:MB1.T1:TEMP:LOOP:I:0.400000<0a>"
[2021-09-20 12:00:20] 2021/09/20 12:00:20.911752 L0 IN:EMU:MERCURY_01:1:D:SP: Input "...EV:MB1.T1:TEMP:LOOP:I:0.400000:VALID"
[2021-09-20 12:00:20] 2021/09/20 12:00:20.911772 L0 IN:EMU:MERCURY_01:1:D:SP: mismatch after 30 bytes "...TEMP:LOOP:"
[2021-09-20 12:00:20] 2021/09/20 12:00:20.911780 L0 IN:EMU:MERCURY_01:1:D:SP: got "I:0.400000..." where "D:" was expected
[2021-09-20 12:00:20] 2021/09/20 12:00:20.926796 L0 IN:EMU:MERCURY_01:1:TEMP:SP: Input "...EV:MB1.T1:TEMP:LOOP:D:6.500000:VALID"
[2021-09-20 12:00:20] 2021/09/20 12:00:20.926836 L0 IN:EMU:MERCURY_01:1:TEMP:SP: mismatch after 30 bytes "...TEMP:LOOP:"
[2021-09-20 12:00:20] 2021/09/20 12:00:20.926850 L0 IN:EMU:MERCURY_01:1:TEMP:SP: got "D:6.500000..." where "TSET:" was expected
[2021-09-20 12:00:21] 2021/09/20 12:00:21.232192 L0 IN:EMU:MERCURY_01:1:TEMP:SP:RBV: Input "STAT:SET:DEV:MB1.T1:TEMP:LOOP:TSET:27.00..."
[2021-09-20 12:00:21] 2021/09/20 12:00:21.232211 L0 IN:EMU:MERCURY_01:1:TEMP:SP:RBV: mismatch after 5 bytes "STAT:"
[2021-09-20 12:00:21] 2021/09/20 12:00:21.232219 L0 IN:EMU:MERCURY_01:1:TEMP:SP:RBV: got "SET:DEV:MB..." where "DEV:MB1.T1:TEMP:LOOP:TSET:" was expected
[2021-09-20 12:00:25] 2021/09/20 12:00:25.568673 L0 IN:EMU:MERCURY_01:1:D:SP: Match passed again
[2021-09-20 12:00:25] 2021/09/20 12:00:25.609078 L0 IN:EMU:MERCURY_01:1:TEMP:SP: Match passed again
[2021-09-20 12:00:26] 2021/09/20 12:00:26.858656 L0 IN:EMU:MERCURY_01:1:TEMP:SP:RBV: Match passed again
[2021-09-20 12:01:45] 2021/09/20 12:01:45.801414 L0 IN:EMU:MERCURY_01:1:P:SP: No reply within 1000 ms to "SET:DEV:MB1.T1:TEMP:LOOP:P:0.100000<0a>"
[2021-09-20 12:01:46] 2021/09/20 12:01:46.706087 L0 IN:EMU:MERCURY_01:1:I:SP: Input "...EV:MB1.T1:TEMP:LOOP:P:0.100000:VALID"
[2021-09-20 12:01:46] 2021/09/20 12:01:46.706104 L0 IN:EMU:MERCURY_01:1:I:SP: mismatch after 30 bytes "...TEMP:LOOP:"
[2021-09-20 12:01:46] 2021/09/20 12:01:46.706113 L0 IN:EMU:MERCURY_01:1:I:SP: got "P:0.100000..." where "I:" was expected
[2021-09-20 12:01:46] 2021/09/20 12:01:46.751918 L0 IN:EMU:MERCURY_01:1:TEMP: Input "STAT:SET:DEV:MB1.T1:TEMP:LOOP:I:0.400000..."
[2021-09-20 12:01:46] 2021/09/20 12:01:46.751935 L0 IN:EMU:MERCURY_01:1:TEMP: mismatch after 5 bytes "STAT:"
[2021-09-20 12:01:46] 2021/09/20 12:01:46.751947 L0 IN:EMU:MERCURY_01:1:TEMP: got "SET:DEV:MB..." where "DEV:MB1.T1:TEMP:SIG:TEMP:" was expected
[2021-09-20 12:01:53] 2021/09/20 12:01:53.452557 L0 IN:EMU:MERCURY_01:1:I:SP: Match passed again
[2021-09-20 12:01:53] 2021/09/20 12:01:53.509999 L0 IN:EMU:MERCURY_01:1:TEMP: Match passed again

With a 500ms reply timeout these timeouts occur every few seconds when running the tests but don't occur when not running the tests. Proving the device fails to reply as fast in busy periods.

JamesKingWork commented 2 years ago

I have added a retry mechanism for protocols in https://github.com/ISISComputingGroup/EPICS-StreamDevice/tree/Ticket6720_add_max_retries_for_mercury_issues. I will deploy this to EMU and test it on the mercury, by copying stream.dll from the build into the directory containing the mercury executable. We will need to consider how this affects the lock timeout, as other commands will be locked out for a longer period of time with retries.

JamesKingWork commented 2 years ago

Impeded on the deployment of the new NDXEMU

JamesKingWork commented 2 years ago

Awaiting on the completion of windows 10 testing