Reports of serial protocol failures under load

dankamongmen commented 5 years ago

@barulicm was working on a field Jrk Friday, and reported errors when using the serial protocol. He's worked around these errors in his branch with some extra terminal setup and control functionality, none of which I understand particularly well.

Try to reproduce the problem @barulicm saw with a loop around the Poller object. If it can be reproduced, try to understand exactly what the problem is. I'm thinking that strace ought make things quickly apparent. Once it's understood, build up a minimal fix. If it can't be reproduced in our code, figure out why that is the case.

Either way, this makes me think that moving exclusively to USB transfers instead of the serial protocol would be a good call in the long term. We probably want to fix #12 before heading down that path, though.

dankamongmen commented 5 years ago

I was able to reproduce the failure using

https://github.com/dankamongmen/pololuJrkUSB/blob/serial-load-test/test/test.cpp

thanks to @barulicm for providing the core of this demonstration! Now to fix it.

dankamongmen commented 5 years ago

With a usleep(1000), we don't see the failure. With usleep(100), we do, after two runthroughs of the command suite. With usleep(500), we do, after about 5 runthroughs of the command suite (and three sets of returns).

dankamongmen commented 5 years ago

What I'm seeing @barulicm is EAGAIN when attempting to write the command, not on the read. Might that match up with what you were seeing? I thought you were seeing the EAGAIN on read() following poll(). Even if you were, that doesn't mean we're not reproducing the same general behavior.

dankamongmen commented 5 years ago

nanosleep({tv_sec=0, tv_nsec=100000}, NULL) = 0
write(3, "\241", 1)                     = 1
write(3, "\243", 1)                     = 1
write(3, "\245", 1)                     = 1
write(3, "\247", 1)                     = 1
write(3, "\251", 1)                     = 1
write(3, "\253", 1)                     = 1
write(3, "\255", 1)                     = 1
write(3, "\265", 1)                     = 1
write(1, "wrote command suite\n", 20)   = 20
nanosleep({tv_sec=0, tv_nsec=100000}, NULL) = 0
write(3, "\241", 1)                     = 1
write(3, "\243", 1)                     = 1
write(3, "\245", 1)                     = 1
write(3, "\247", 1)                     = 1
write(3, "\251", 1)                     = 1
write(3, "\253", 1)                     = 1
write(3, "\255", 1)                     = 1
write(3, "\265", 1)                     = -1 EAGAIN (Resource temporarily unavailable)
write(2, "terminate called after throwing "..., 48terminate called after throwing an instance of ') = 48
write(2, "std::runtime_error", 18std::runtime_error)      = 18
write(2, "'\n", 2'
)                      = 2
write(2, "  what():  ", 11  what():  )             = 11
write(2, "error writing command: Resource "..., 55error writing command: Resource temporarily unavailable) = 55
write(2, "\n", 1
)                       = 1
rt_sigprocmask(SIG_UNBLOCK, [ABRT], NULL, 8) = 0
rt_sigprocmask(SIG_BLOCK, ~[RTMIN RT_1], [], 8) = 0
getpid()                                = 20174
gettid()                                = 20174
tgkill(20174, 20174, SIGABRT)           = 0
rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0
--- SIGABRT {si_signo=SIGABRT, si_code=SI_TKILL, si_pid=20174, si_uid=1000} ---
+++ killed by SIGABRT +++
[schwarzgerat](0) $

dankamongmen commented 5 years ago

If I don't throw on the write() failure, I reproduce the condition where the Jrk sets error bits, and no further commands are read off the serial line.

@barulicm , I'm starting to suspect that some of those tcdrain() and tcflush() operations might be hiding the root problem. I'm now going to start bringing in some of your proposed fix, and see what changes what.

Also, I want to see what happens if I mix in some SetTarget calls. That's the only one which sends a multibyte command. Imagine that the pipe has 4096 bytes of buffer, and there are 4095 bytes of command backed up. You then try to write a SetTarget. The first byte will be written, with a return value of 1. The next will return with EAGAIN. Abandoning the effort at this point leaves half a command in the pipe (though admittedly any following byte, if written in time, would be read as a valid parameter).

dankamongmen commented 5 years ago

My code is sending 0xb5 to read errors, which doesn't clear any latched ones. Yet a secondary run doesn't seem to see error flags set, despite no 0xb3 having been sent. If they're not being cleared some other way, they ought persist across runs, no? Interesting...perhaps when we read them, we're reading garbage? That seems likely, actually, since we tend to see a whole bunch of error bits...

dankamongmen commented 5 years ago

If I put a tcdrain() in after my write(), the problem goes away entirely, even without a usleep(). If I then redirect to /dev/null, allowing it to truly bang on the serial line (in a way we're unlikely to do in practice), I eventually lock up, strace showing:

[schwarzgerat](0) $ strace -p `pidof sertest`
strace: Process 29852 attached
futex(0x7fffc721df00, FUTEX_WAKE_PRIVATE, 1) = 1
write(3, "\247", 1)                     = 1
ioctl(3, TCSBRK, 1

with that ioctl(TCSBRK) corresponding to tcdrain(). So at this point, the Jrk is no longer reading off the serial line, which makes sense given that if I restart sertest, I get the following state:

Error bits: AwaitingCmd InputDisconn FdbckDisconn

I suspect that this is the same behavior we'd see if I replaced the nonblocking write() + tcdrain() with a blocking write(). This is indeed the reason why I'm using nonblocking I/O -- so that a device fault can't hang us in I/O. The tcdrain() undoes that design goal.

What we really want to do is not write commands so quickly that we fill up the Jrk's input buffer. Once we do, there are no particularly great solutions. Still, it seems the Jrk ought eventually churn down through the outstanding commands, at which point fresh write()s ought succeed. Instead, we see a persistent failure, though not usually across restarts. Hrmm.

dankamongmen commented 5 years ago

I suspect that this is the same behavior we'd see if I replaced the nonblocking write() + tcdrain() with a blocking write(). This is indeed the reason why I'm using nonblocking I/O -- so that a device fault can't hang us in I/O. The tcdrain() undoes that design goal.

Indeed, if I loop and retry on EAGAIN for the write(), eventually the Jrk totally stops responding until I reset the port (by starting a new pololu process), at which point it returns error flags until I clear them. These were all 1-byte, presumably atomic, transfers. So fundamentally, it seems that writing too much to the serial port can lock up the Jrk's serial interface. It's not the first time you get EAGAIN -- when I spin on EAGAIN, that solves things for a bit -- but eventually the Jrk vomits.

It is also possible that, at such load, we eventually get a bad bit on the serial connection, and since we're not using CRC capabilities (see #3 ), that upsets the Jrk. If this is the case, it's also plausible that @barulicm 's termio changes had an effect here. It might be possible to determine this with USB sniffing, but I think this is all unlikely. Enabling CRC, and testing with it in use, ought lock this out. This would be nice if it's the case.

Instead, I suspect we need to manage the incoming queue. This is easy enough to do in our Poller by putting a max on the send_cmds queue. Better still, we could move to USB exchanges, and I bet this problem goes away.

dankamongmen commented 5 years ago

It is also possible that, at such load, we eventually get a bad bit on the serial connection, and since we're not using CRC capabilities (see #3 ), that upsets the Jrk. If this is the case, it's also plausible that @barulicm 's termio changes had an effect here. It might be possible to determine this with USB sniffing, but I think this is all unlikely. Enabling CRC, and testing with it in use, ought lock this out. This would be nice if it's the case.

in @barulicm 's code, it's still nonblocking at the read() and write() levels, but blocking at the command level -- i.e., a write() is issued, then we wait for a read(). So that means there's never more than one outstanding command. In that case, the issue we're seeing here isn't in play, and we're not reproducing the true failure he saw, one that can happen with only one outstanding command. Which would indeed correspond to his termio fixes.

dankamongmen commented 5 years ago

When I freshly insert the Jrk, I see:

[schwarzgerat](0) $ sudo stty -a -F /dev/ttyACM0
speed 9600 baud; rows 0; columns 0; line = 0;
intr = ^C; quit = ^\; erase = ^?; kill = ^U; eof = ^D; eol = <undef>;
eol2 = <undef>; swtch = <undef>; start = ^Q; stop = ^S; susp = ^Z; rprnt = ^R;
werase = ^W; lnext = ^V; discard = ^O; min = 1; time = 0;
-parenb -parodd -cmspar cs8 hupcl -cstopb cread clocal -crtscts
-ignbrk -brkint -ignpar -parmrk -inpck -istrip -inlcr -igncr icrnl ixon -ixoff
-iuclc -ixany -imaxbel -iutf8
opost -olcuc -ocrnl onlcr -onocr -onlret -ofill -ofdel nl0 cr0 tab0 bs0 vt0 ff0
isig icanon iexten echo echoe echok -echonl -noflsh -xcase -tostop -echoprt
echoctl echoke -flusho -extproc
[schwarzgerat](0) $

9600 here matches the actual baud setting of the Jrk as defined in its config, though maybe that only refers to the UART? After I've used my tool, or the reference tool, we get:

[schwarzgerat](1) $ sudo stty -a -F /dev/ttyACM0
speed 9600 baud; rows 0; columns 0; line = 0;
intr = ^C; quit = ^\; erase = ^?; kill = ^U; eof = ^D; eol = <undef>;
eol2 = <undef>; swtch = <undef>; start = ^Q; stop = ^S; susp = ^Z; rprnt = ^R;
werase = ^W; lnext = ^V; discard = ^O; min = 1; time = 0;
-parenb -parodd -cmspar cs8 hupcl -cstopb cread clocal -crtscts
-ignbrk -brkint -ignpar -parmrk -inpck -istrip -inlcr -igncr icrnl ixon -ixoff
-iuclc -ixany -imaxbel -iutf8
opost -olcuc -ocrnl onlcr -onocr -onlret -ofill -ofdel nl0 cr0 tab0 bs0 vt0 ff0
-isig -icanon iexten -echo -echoe echok -echonl -noflsh -xcase -tostop -echoprt
echoctl echoke -flusho -extproc
[schwarzgerat](0) $

Here we've disabled canonical mode, signal generation, echo, and echoe. This corresponds to what we expect from my code. I can do thousands of cycles per second if i loop on EAGAIN. If I don't loop on EAGAIN, or if I try for millions of cycles per second, I throw the Jrk into an error state. Copying @barulicm 's termio changes in do not change any of this. For the record, the diff there relative to the output above is:

[schwarzgerat](1) $ diff -ur b c
--- b   2019-03-02 19:59:19.714575715 -0500
+++ c   2019-03-02 20:09:02.299458822 -0500
@@ -1,10 +1,10 @@
 speed 9600 baud; rows 0; columns 0; line = 0;
 intr = ^C; quit = ^\; erase = ^?; kill = ^U; eof = ^D; eol = <undef>;
 eol2 = <undef>; swtch = <undef>; start = ^Q; stop = ^S; susp = ^Z; rprnt = ^R;
-werase = ^W; lnext = ^V; discard = ^O; min = 1; time = 0;
+werase = ^W; lnext = ^V; discard = ^O; min = 1; time = 1;
 -parenb -parodd -cmspar cs8 hupcl -cstopb cread clocal -crtscts
--ignbrk -brkint -ignpar -parmrk -inpck -istrip -inlcr -igncr icrnl ixon -ixoff
+-ignbrk -brkint -ignpar -parmrk -inpck -istrip -inlcr -igncr -icrnl -ixon -ixoff
 -iuclc -ixany -imaxbel -iutf8
-opost -olcuc -ocrnl onlcr -onocr -onlret -ofill -ofdel nl0 cr0 tab0 bs0 vt0 ff0
+-opost -olcuc -ocrnl onlcr -onocr -onlret -ofill -ofdel nl0 cr0 tab0 bs0 vt0 ff0
 -isig -icanon iexten -echo -echoe echok -echonl -noflsh -xcase -tostop -echoprt
 echoctl echoke -flusho -extproc
[schwarzgerat](1) $

which again is what we would expect from @barulicm 's code (save the baud rate still being 9600). He's turned off IXON, ICRNL, and OPOST relative to us, which all sounds quite sensible, and indeed possibly a solid fix -- I'm going to import that.

@barulicm also sets VTIME to 1 from 0. Here's the details on VMIN/VTIME:

       input  mode is switched to canonical.  The settings of MIN (c_cc[VMIN])
       and TIME (c_cc[VTIME]) determine the circumstances in which  a  read(2)
       completes; there are four distinct cases:

       MIN == 0, TIME == 0 (polling read)
              If  data  is  available,  read(2)  returns immediately, with the
              lesser of the number of bytes available, or the number of  bytes
              requested.  If no data is available, read(2) returns 0.

       MIN > 0, TIME == 0 (blocking read)
              read(2)  blocks until MIN bytes are available, and returns up to
              the number of bytes requested.

       MIN == 0, TIME > 0 (read with timeout)
              TIME specifies the limit for a timer in tenths of a second.  The
              timer is started when read(2) is called.  read(2) returns either
              when at least one byte of data is available, or when  the  timer
              expires.  If the timer expires without any input becoming avail‐
              able, read(2) returns 0.  If data is already  available  at  the
              time of the call to read(2), the call behaves as though the data
              was received immediately after the call.
              able, read(2) returns 0.  If data is already  available  at  the
              time of the call to read(2), the call behaves as though the data
              was received immediately after the call.

       MIN > 0, TIME > 0 (read with interbyte timeout)
              TIME specifies the limit for a timer  in  tenths  of  a  second.
              Once  an  initial  byte of input becomes available, the timer is
              restarted after each further byte is received.  read(2)  returns
              when any of the following conditions is met:

              *  MIN bytes have been received.

              *  The interbyte timer expires.

              *  The  number  of bytes requested by read(2) has been received.
                 (POSIX does not specify this termination  condition,  and  on
                 some  other  implementations  read(2) does not return in this
                 case.)

              Because the timer is started only after the initial byte becomes
              available,  at  least one byte will be read.  If data is already
              available at the time of the call to read(2), the  call  behaves
              as though the data was received immediately after the call.

       POSIX  does not specify whether the setting of the O_NONBLOCK file sta‐
       tus flag takes precedence over the MIN and TIME  settings.   If  O_NON‐
       BLOCK  is  set,  a read(2) in noncanonical mode may return immediately,
       regardless of the setting of MIN or TIME.  Furthermore, if no  data  is
       available,  POSIX  permits a read(2) in noncanonical mode to return ei‐
       ther 0, or -1 with errno set to EAGAIN.

So we're using NONBLOCK, so I doubt the VTIME setting is being picked up. Either way, I don't think it should be necessary, unless @barulicm knows something I don't. So I'm gonna pick up the IXON. ICRNL, and OPOST, but I'm unsure of the VTIME, and outright dubious of the baud=115200. Most of the other changes there are no-ops.

dankamongmen commented 5 years ago

I verified that the same default settings came up on Ubuntu 18.04 for stty on the device, so that's not the source of any weirdness.

dankamongmen commented 5 years ago

Update: I went and looked at @barulicm 's code, and the error flags decoding there was wonky (it was working on ErrorSum, not the error register). So that report is kinda questionable, but I still see things here worth thinking about.

barulicm commented 5 years ago

Just so you know how I ended up where I did, I noticed odd things happening in the sniffed logs of the raw serial communication (weird extra bytes coming back from the servo seemingly un-requested). This felt similar to behavior I've seen with other devices when baud rates aren't exactly matched. While the JRK documentation says that baud rate on the PC side shouldn't have any effect in the dual-port UART mode, I went ahead and googled a refresher on how to set it, just to check. I ended up at this SO post that explains some good details about some nuances in nonblocking code when it comes to serial. https://stackoverflow.com/questions/25996171/linux-blocking-vs-non-blocking-serial-read/26006680#26006680

barulicm commented 5 years ago

As for the question of serial v native USB, I agree native USB is the better long term solution. The serial protocol on the JRK has no packet boundary specifiers in feedback from the servo, so it is difficult to detect communication errors on the PC side if any extra or lost bytes pop up in the feedback stream.

dankamongmen commented 5 years ago

Just so you know how I ended up where I did, I noticed odd things happening in the sniffed logs of the raw serial communication (weird extra bytes coming back from the servo seemingly un-requested). This felt similar to behavior I've seen with other devices when baud rates aren't exactly matched. While the JRK documentation says that baud rate on the PC side shouldn't have any effect in the dual-port UART mode, I went ahead and googled a refresher on how to set it, just to check. I ended up at this SO post that explains some good details about some nuances in nonblocking code when it comes to serial. https://stackoverflow.com/questions/25996171/linux-blocking-vs-non-blocking-serial-read/26006680#26006680

Cool, I'll look into it! I've still been unable to reproduce the problem you see. BTW, how were you sniffing the serial communication? Were you using a usbmon device, or a named pipe proxy, or some technique of which I'm unaware?

I'd have thought that a baud disagreement would lead to general anarchy, but I can't say I have strong reasoning for that.

As said, I don't think any of your changes are bad, but most of them are definitely no-ops against the default POSIX terminal settings, and I'd like to understand this baud thing better. I do think that the 9600 I've seen referenced in the Jrk config applies to the UART-based communication, not USB dual-port, as you noted above. Either way, I'll continue looking for this issue.

What are your thoughts on turning on CRC7 mode? And beyond that, what do you think of moving to libusb-based transfers, rather than serial emulation? Especially if you've already accepted what's essentially a blocking mode (given the tcdrain() and tcflush(), as well as the read() following the write()), we could replace it with blocking usb control transfers very easily, which I believe might be more robust. Just a thought.

barulicm commented 5 years ago

I used socat to create a proxy path for the tty device path. It spits out timestamped logs of bytes sent and which direction they were sent.

socat -x /dev/ttyACM0,raw PTY,link=$HOME/myserial,raw,echo=0 2>logfile.txt

dankamongmen commented 5 years ago

socat -x /dev/ttyACM0,raw PTY,link=$HOME/myserial,raw,echo=0 2>logfile.txt

Good man! Were you aware of this technique before I mentioned it in slack?

dankamongmen commented 5 years ago

As for the question of serial v native USB, I agree native USB is the better long term solution. The serial protocol on the JRK has no packet boundary specifiers in feedback from the servo, so it is difficult to detect communication errors on the PC side if any extra or lost bytes pop up in the feedback stream.

I couldn't agree more. I think I'm going to go ahead and move to a pure libusb model in this tree. Let me ask: what interface would make things easiest to integrate into greenzie stuff? I see that you're currently using a send-and-wait model; if you'd prefer that to send-and-callback, I can make it available. I'll probably just do things as a send-and-callback core with a send-and-wait synchronous wrapper atop that, so you can do either.

dankamongmen / pololuJrkUSB

Reports of serial protocol failures under load #17