Stuck COM ports on various instruments

Tom-Willemsen commented 5 years ago

MOXA COM ports appear to get "stuck" on various beamlines, in various situations. The symptom is that access to the port is denied even though no process is using it, which makes the COM port effectively useless once the issue occurs. You cannot even connect to the port in this state: Windows commands such as MODE COM5 just return "Access is denied".

This has (recently) happened on:

RIKENFE Danfysiks, https://github.com/ISISComputingGroup/IBEX/issues/4376
ZOOM a labview Mercury ITC driver was the last to use the port, https://github.com/ISISComputingGroup/IBEX/issues/4421
MERLIN TPG x2, https://github.com/ISISComputingGroup/IBEX/issues/3527
Others

In this situation, rebooting the MOXA has no effect. The only ways we have found to solve the problem are:

Reboot the NDX control PC
Move the affected device to a different physical port

This ticket is to investigate the problem further and see whether we can reproduce it under various conditions. If we can, we should either fix our drivers or escalate the problem to MOXA support, as appropriate.

Tom-Willemsen commented 5 years ago

First test: turn on both hardware and software flow control on the real COM port via the NPort, disconnect the device, and see whether that causes this state.

ChrisM-S commented 5 years ago

As part of this ticket (this seems the best place for the moment, unless anyone knows better) we need to step back a little from the immediate problem and look at the likely causes. This should help in generating, countering and testing failure scenarios. The first thing to note is that, without a preceding communication "protocol" failure in any of these cases, we would probably not be in the position we are. Why is that?

Broadly speaking, we have two ends of the communication: 1) The program on the PC end, which we control and which can read and write as and when it likes (asynchronously). 2) The device at the far end, which (usually) expects some form of recognised, hopefully documented, command syntax and possibly a formal or informal handshaking protocol (command/response, command/response-if-recognised, etc.). There may also be other constraints in the device: real processing time for commands, expectations about the mode of use, and so on. For example, many devices simply expect a human to type commands, absorbing them character by character, followed by a terminator '<CR/LF>', but others are designed to buffer without interpretation until a terminator arrives.

To get (1) and (2) talking reliably, we need to implement some approximation to a protocol, which is largely defined by the device end of things. The device behaviour is essentially fixed; we are either guessing and trying or, ideally, following a protocol document. A good device is likely to be robust, but not all are!

So what are some likely faults? The device might expect a whole buffer within 500 ms of receiving the first character, and to be holding a command like "ASR5" at that point. That is a near certainty for a device developed against a local PC (transmitting all the data on a <CR/LF> from HyperTerminal). In our environment, however, a stall in the network during transmission might deliver "AS" and then, later, "R5". Both are potentially rubbish to the device, possibly worth two error messages. Who knows? It is probably OK: we usually retry commands, the device will probably know it has received rubbish, and the next transmission is smooth (no network jitter this time).
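The "AS" then "R5" case can be sketched with a toy model. Everything here is hypothetical: `TimeoutDevice` does not model any real instrument, and the 0.5 s window is simply the 500 ms figure used as an example above. It only shows how a stall between two fragments turns one valid command into an error plus a piece of rubbish.

```python
from dataclasses import dataclass, field


@dataclass
class TimeoutDevice:
    """Hypothetical device: a command must be completed by CR LF within
    `window` seconds of its first character, otherwise the partial
    buffer is discarded and counted as an error."""
    window: float = 0.5
    buffer: str = ""
    started_at: float = 0.0
    commands: list = field(default_factory=list)
    errors: int = 0

    def receive(self, chunk: str, now: float) -> None:
        for ch in chunk:
            # Discard a stale partial command before accepting more input.
            if self.buffer and now - self.started_at > self.window:
                self.errors += 1
                self.buffer = ""
            if not self.buffer:
                self.started_at = now
            self.buffer += ch
            if self.buffer.endswith("\r\n"):
                self.commands.append(self.buffer[:-2])
                self.buffer = ""


# One write, one arrival time: the device sees the whole command.
smooth = TimeoutDevice()
smooth.receive("ASR5\r\n", now=0.0)

# A network stall splits the same command into two pieces 0.8 s apart.
stalled = TimeoutDevice()
stalled.receive("AS", now=0.0)
stalled.receive("R5\r\n", now=0.8)

print(smooth.commands, smooth.errors)
print(stalled.commands, stalled.errors)
```

In this model the stalled device throws away "AS" as an error and then treats "R5" as a complete command, which is the kind of rubbish a retry would normally clear.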

No flow control example (does happen in real life)

But what if we send two commands too quickly, one right after the other, and overrun the device's input buffer? This is easy to do on a Thurlby Thandar PSU because some commands take a second or two to finish. We crash the device completely: comms failure, unhappy users, and resetting the device is the only option.

Same with flow control (no idea whether this happens with the Thurlby)

We do the same thing. The first command arrives and the device inverts the signal on its RTS line (our CTS line), which should ensure we buffer our data until it is ready for more. We don't: we send the bytes, the buffer overruns, and the device crashes, leaving RTS low. We keep trying to send things. What happens? Do any of our signals change? What does the MOXA do? It is a proxy for the hardware signal, so it has to respond to it in real time (or does it have to check back with the control machine to know what to do?). Is a deadlock possible, given that we are engaging in hardware flow control? The computer end suffers a lockup, although the device might recover!

The solution? Overall, match the device better: find out whether we send commands too quickly or in the wrong pieces, and ensure we don't give the device anything controversial (to it). Ensure we get a response before we submit the next command. Test for broken comms during development. Be very aware of network latency and the possibility of stalling, losing or fragmenting a command (sending all the text in one API call will generally ensure it goes in one Ethernet packet rather than several). Ideally, always use no flow control but keep the device happy; this will stop local port lockups. Local ports don't lock up with no flow control!
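Two of the points above (one write per command, and a response before the next command) can be sketched as follows. This is an illustration under assumptions, not how any existing driver works: `send_command` and `EchoDevice` are invented names, and `port` is assumed to be any object with `write(bytes)` and `read(size)` methods, where `read` returns `b""` if nothing arrives within the port's own timeout.

```python
import time


def send_command(port, command, terminator=b"\r\n", timeout=2.0):
    """Send one command and wait for its terminated reply."""
    # One write call for the whole command, so it is not handed to the
    # network layer in pieces.
    port.write(command + terminator)

    reply = b""
    deadline = time.monotonic() + timeout
    while not reply.endswith(terminator):
        if time.monotonic() > deadline:
            raise TimeoutError(f"no complete reply to {command!r}")
        reply += port.read(1)  # assumed to return b"" if nothing is waiting
    return reply[: -len(terminator)]


class EchoDevice:
    """Hypothetical in-memory stand-in for a device, used only for the demo."""

    def __init__(self):
        self.writes = []
        self._pending = b""

    def write(self, data):
        self.writes.append(data)
        self._pending += b"OK " + data

    def read(self, size):
        chunk, self._pending = self._pending[:size], self._pending[size:]
        return chunk


device = EchoDevice()
# The second command is not written until the first reply has been read.
replies = [send_command(device, cmd) for cmd in (b"ASR5", b"STATUS?")]
print(replies)
print(device.writes)
```

Because the caller blocks until a reply or a timeout, a device that needs a second or two per command is never handed the next command early, and a device that has stopped responding surfaces as a `TimeoutError` rather than a silent pile-up of writes.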

Tom-Willemsen commented 2 years ago

+1 INTER

We have had multiple components lose comms to IBEX, with no light flashing on the Moxa port specified in the IBEX component. Moving the cable to a different port and changing the address in IBEX fixed the communication, which suggests an issue with the specific port and potentially that these Moxas need replacing.

Moxa on the wall: Jamie Nutter had to move a Galil from port 1 to port 5 and a second Galil from port 2 to port 6. Moxa under the long travel: I had to move the pressure gauge connection from port 9 to port 10.

Example log from relevant IOC:

[2022-07-11 09:25:07]     drvAsynSerialPortConfigure("L0", "COM13", 0, 0, 0, 0)
[2022-07-11 09:25:07] 2022/07/11 09:25:07.262 L0 -1 autoConnect could not connect: \\.\COM13 Can't open: Access is denied.
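Lines like these can be picked out of IOC logs automatically to list which ports are affected. The sketch below assumes other logs use the same layout as the two lines quoted above; the pattern and the helper name `stuck_ports` are illustrative and not part of IBEX.

```python
import re

# Matches the "could not connect ... Access is denied" message, based only
# on the layout of the log lines quoted above (an assumption for other logs).
DENIED = re.compile(
    r"(?P<asyn_port>\S+) -1 autoConnect could not connect: "
    r"\\\\\.\\(?P<com_port>COM\d+) Can't open: Access is denied"
)


def stuck_ports(log_lines):
    """Return {COM port: asyn port name} for every access-denied line."""
    found = {}
    for line in log_lines:
        match = DENIED.search(line)
        if match:
            found[match["com_port"]] = match["asyn_port"]
    return found


log = [
    r'[2022-07-11 09:25:07]     drvAsynSerialPortConfigure("L0", "COM13", 0, 0, 0, 0)',
    r"[2022-07-11 09:25:07] 2022/07/11 09:25:07.262 L0 -1 autoConnect could not connect: \\.\COM13 Can't open: Access is denied.",
]
print(stuck_ports(log))
```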

ISISComputingGroup / IBEX

Stuck COM ports on various instruments #4420
