Read blocks indefinitely on linux even with timeout

soypat commented 2 years ago

Tested with raspberry Pi and my desktop:

Steps to reproduce

Modify ReadClose verification program to be as follows

```go package main import ( "flag" "fmt" "io" "log" "time" "github.com/distributed/sers" ) func main() { err := Main() if err != nil { log.Fatal(err) } } func readpart(r io.ReadCloser) error { go func() { time.Sleep(50 * time.Second) fmt.Printf("============================> close now\n") err := r.Close() fmt.Printf("close err %v\n", err) }() _, err := r.Read(make([]byte, 128)) if err != nil { return err } return nil } func Main() error { flag.Parse() if len(flag.Args()) < 1 { return fmt.Errorf("please provide a serial file name") } fn := flag.Args()[0] f, err := sers.Open(fn) if err != nil { return err } err = f.SetReadParams(0, 1) if err != nil { return err } return readpart(f) } ```

Program now blocks for 50 seconds even though timeout is 1 second.

I am available to help resolving this issue but am unfamiliar with this repo and CGo.

soypat commented 2 years ago

After reading some termios stuff it seems this is not expected functionality: If VMIN is zero then the function should be a purely timed function and return after 1 second.

distributed commented 2 years ago

Hey there! Good to see another sers user. Thanks for taking the time to provide a repro case. I do see the same behavior that you are getting, i.e. the program takes 50 seconds to terminate, not 1 second as the timeout/VTIME setting might lead one to believe.

My answer below goes into some detail, the TL;DR is this: Your issue is real, but it is unlikely that I will make it go away. There is a workaround that will give you the VMIN/VTIME behavior you want. However there are most likely better ways to achieve your goals. Finally I back up my arguments with some straces.

The issue

Unfortunately, you stumbled into the darkest corner of the sers library. As one line in the manual hints at, I am considering to drop support for SetReadParams in future releases of the library. There are two reasons for this.

The first is that the interface mirrors termios, but sers aims to support non-termios platforms as well and it is not always easy to emulate VMIN/VTIME on other systems.

The second is that I would like to support the standard set of Set{,Read,Write}Deadline methods that are often required by protocol implementations that work over different cases of stream transports and are the standard for I/O timeouts in the go standard library. The easiest route to do this is by letting the go runtime handle the I/O for the serial port file descriptor, using nonblocking I/O with epoll/kqueue etc. behind the scenes. Apart from deadline support this also means that I/O on serial ports does not tie up an OS thread but is scheduled by the runtime, as it is for sockets, which is a nice bonus.

The termios man page writes the following about the VMIN/VTIME settings in conjunction with nonblocking I/O:

       POSIX does not specify whether the setting of the O_NONBLOCK file
       status flag takes precedence over the MIN and TIME settings.  If
       O_NONBLOCK is set, a [read(2)](https://www.man7.org/linux/man-pages/man2/read.2.html) in noncanonical mode may return
       immediately, regardless of the setting of MIN or TIME.

Thus using nonblocking I/O as I wish leaves VMIN/VTIME behavior unspecified. As you found out, nonblocking I/O seems to override your settings. There are two traces attached below to support this.

A workaround

You can get around this, if you force blocking I/O on the file descriptor that underlies the SerialPort. For instance you can modify the program to do the following.

    osf, err := os.OpenFile(fn, os.O_RDWR, 0666)
    if err != nil {
        return err
    }

    f, err := sers.TakeOver(osf)
    if err != nil {
        return err
    }

As the documentation points out, TakeOver calls osf.Fd() which will cause the Go runtime to not do nonblocking I/O on the file descriptor because it does not know what other code might use the file descriptor in incompatible ways. On my machine, this program terminates after 1 second with error timeout, as you initially expected.

The workaround works in the current version of sers and will most likely continue to work in future versions that offer SetReadParams, but I am not guaranteeing this. Note that the workaround means that a Close() does not unblock readers and you have to make your readers check for shutdown in some other manner.

A solution?

First I have to start with a question: What do you want to achieve?

Do you want to be able to unblock serial port readers in case you shut down the code that talks to serial port? In this case it is enough to call Close() on the SerialPort to unblock readers.

Do you want to have a timeout on Read()? In that case SetReadDeadline would fit the bill. I have an untested implementation lying around on my machine for literally years and I could expedite merging it if this would help you. Please get in touch if this is the case.

Some traces

In order to get a grip on the behavior, I used strace which prints out a trace ("a log") of all the syscalls a program makes. I ran the program as strace -tt -f -v ./readclose [ttyfilename].

Your case

``` [pid 8828] 09:02:32.955015 openat(AT_FDCWD, "/dev/ttyS4", O_RDWR|O_NOCTTY|O_NONBLOCK [pid 8832] 09:02:32.955034 futex(0xc000053148, FUTEX_WAIT_PRIVATE, 0, NULL) = -1 EAGAIN (Resource temporarily unavailable) [pid 8830] 09:02:32.955064 <... clone resumed> child_stack=0x7f9fdca13fb0, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0x7f9fdca149d0, tls=0x7f9fdca14700, child_tidptr=0x7f9fdca149d0) = 8833 [pid 8832] 09:02:32.955085 futex(0xc000088148, FUTEX_WAKE_PRIVATE, 1 [pid 8830] 09:02:32.955105 rt_sigprocmask(SIG_SETMASK, [], [pid 8833] 09:02:32.955127 <... futex resumed> ) = 0 [pid 8832] 09:02:32.955143 <... futex resumed> ) = 1 [pid 8833] 09:02:32.955159 futex(0xc000088148, FUTEX_WAIT_PRIVATE, 0, NULL [pid 8832] 09:02:32.955178 futex(0xc000053148, FUTEX_WAIT_PRIVATE, 0, NULL [pid 8830] 09:02:32.955197 <... rt_sigprocmask resumed> NULL, 8) = 0 [pid 8829] 09:02:32.955212 sigaltstack({ss_sp=0xc000066000, ss_flags=0, ss_size=32768}, [pid 8830] 09:02:32.955234 futex(0xc000052948, FUTEX_WAIT_PRIVATE, 0, NULL [pid 8829] 09:02:32.955252 <... sigaltstack resumed> NULL) = 0 [pid 8829] 09:02:32.955273 rt_sigprocmask(SIG_SETMASK, [], NULL, 8) = 0 [pid 8829] 09:02:32.955306 gettid() = 8829 [pid 8829] 09:02:32.955336 nanosleep({tv_sec=0, tv_nsec=20000}, [pid 8828] 09:02:32.955391 <... openat resumed> ) = 3 [pid 8828] 09:02:32.955410 ioctl(3, TCGETS, {c_iflags=0, c_oflags=0x4, c_cflags=0xcbd, c_lflags=0xa30, c_line=0, c_cc[VMIN]=0, c_cc[VTIME]=10, c_cc="\x03\x1c\x7f\x15\x04\x0a\x00\x00\x11\x13\x1a\x00\x12\x0f\x17\x16\x00\x00\x00"}) = 0 [pid 8829] 09:02:32.955444 <... nanosleep resumed> NULL) = 0 [pid 8828] 09:02:32.955457 ioctl(3, TCGETS [pid 8829] 09:02:32.955472 nanosleep({tv_sec=0, tv_nsec=20000}, [pid 8828] 09:02:32.955486 <... ioctl resumed> , {c_iflags=0, c_oflags=0x4, c_cflags=0xcbd, c_lflags=0xa30, c_line=0, c_cc[VMIN]=0, c_cc[VTIME]=10, c_cc="\x03\x1c\x7f\x15\x04\x0a\x00\x00\x11\x13\x1a\x00\x12\x0f\x17\x16\x00\x00\x00"}) = 0 [pid 8828] 09:02:32.955508 ioctl(3, SNDCTL_TMR_START or TCSETS, {c_iflags=0, c_oflags=0x4, c_cflags=0xcbd, c_lflags=0xa30, c_line=0, c_cc[VMIN]=1, c_cc[VTIME]=0, c_cc="\x03\x1c\x7f\x15\x04\x00\x01\x00\x11\x13\x1a\x00\x12\x0f\x17\x16\x00\x00\x00"}) = 0 [pid 8828] 09:02:32.955542 ioctl(3, TCGETS, {c_iflags=0, c_oflags=0x4, c_cflags=0xcbd, c_lflags=0xa30, c_line=0, c_cc[VMIN]=1, c_cc[VTIME]=0, c_cc="\x03\x1c\x7f\x15\x04\x00\x01\x00\x11\x13\x1a\x00\x12\x0f\x17\x16\x00\x00\x00"}) = 0 [pid 8829] 09:02:32.955571 <... nanosleep resumed> NULL) = 0 [pid 8828] 09:02:32.955583 fcntl(3, F_GETFL [pid 8829] 09:02:32.955598 nanosleep({tv_sec=0, tv_nsec=20000}, [pid 8828] 09:02:32.955613 <... fcntl resumed> ) = 0x8802 (flags O_RDWR|O_NONBLOCK|O_LARGEFILE) [pid 8828] 09:02:32.955629 epoll_create1(EPOLL_CLOEXEC) = 4 [pid 8828] 09:02:32.955660 pipe2([5, 6], O_NONBLOCK|O_CLOEXEC) = 0 [pid 8829] 09:02:32.955693 <... nanosleep resumed> NULL) = 0 [pid 8828] 09:02:32.955704 epoll_ctl(4, EPOLL_CTL_ADD, 5, {EPOLLIN, {u32=7825232, u64=7825232}} [pid 8829] 09:02:32.955722 nanosleep({tv_sec=0, tv_nsec=20000}, [pid 8828] 09:02:32.955737 <... epoll_ctl resumed> ) = 0 [pid 8828] 09:02:32.955756 epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=3692930888, u64=140324569458504}}) = 0 [pid 8828] 09:02:32.955790 fcntl(3, F_GETFL) = 0x8802 (flags O_RDWR|O_NONBLOCK|O_LARGEFILE) [pid 8829] 09:02:32.955815 <... nanosleep resumed> NULL) = 0 [pid 8828] 09:02:32.955827 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK|O_LARGEFILE [pid 8829] 09:02:32.955841 nanosleep({tv_sec=0, tv_nsec=20000}, [pid 8828] 09:02:32.955856 <... fcntl resumed> ) = 0 [pid 8828] 09:02:32.955875 write(1, "setting read params\n", 20setting read params ) = 20 [pid 8829] 09:02:32.955927 <... nanosleep resumed> NULL) = 0 [pid 8828] 09:02:32.955939 ioctl(3, TCGETS [pid 8829] 09:02:32.955953 nanosleep({tv_sec=0, tv_nsec=20000}, [pid 8828] 09:02:32.955967 <... ioctl resumed> , {c_iflags=0, c_oflags=0x4, c_cflags=0xcbd, c_lflags=0xa30, c_line=0, c_cc[VMIN]=1, c_cc[VTIME]=0, c_cc="\x03\x1c\x7f\x15\x04\x00\x01\x00\x11\x13\x1a\x00\x12\x0f\x17\x16\x00\x00\x00"}) = 0 [pid 8828] 09:02:32.955985 ioctl(3, TCGETS, {c_iflags=0, c_oflags=0x4, c_cflags=0xcbd, c_lflags=0xa30, c_line=0, c_cc[VMIN]=1, c_cc[VTIME]=0, c_cc="\x03\x1c\x7f\x15\x04\x00\x01\x00\x11\x13\x1a\x00\x12\x0f\x17\x16\x00\x00\x00"}) = 0 [pid 8828] 09:02:32.956017 ioctl(3, SNDCTL_TMR_START or TCSETS, {c_iflags=0, c_oflags=0x4, c_cflags=0xcbd, c_lflags=0xa30, c_line=0, c_cc[VMIN]=0, c_cc[VTIME]=10, c_cc="\x03\x1c\x7f\x15\x04\x0a\x00\x00\x11\x13\x1a\x00\x12\x0f\x17\x16\x00\x00\x00"}) = 0 [pid 8829] 09:02:32.956046 <... nanosleep resumed> NULL) = 0 [pid 8828] 09:02:32.956057 ioctl(3, TCGETS [pid 8829] 09:02:32.956071 nanosleep({tv_sec=0, tv_nsec=20000}, [pid 8828] 09:02:32.956085 <... ioctl resumed> , {c_iflags=0, c_oflags=0x4, c_cflags=0xcbd, c_lflags=0xa30, c_line=0, c_cc[VMIN]=0, c_cc[VTIME]=10, c_cc="\x03\x1c\x7f\x15\x04\x0a\x00\x00\x11\x13\x1a\x00\x12\x0f\x17\x16\x00\x00\x00"}) = 0 [pid 8828] 09:02:32.956103 write(1, "done\n", 5done ) = 5 [pid 8828] 09:02:32.956141 futex(0xc000052948, FUTEX_WAKE_PRIVATE, 1 [pid 8829] 09:02:32.956162 <... nanosleep resumed> NULL) = 0 [pid 8828] 09:02:32.956173 <... futex resumed> ) = 1 [pid 8830] 09:02:32.956184 <... futex resumed> ) = 0 [pid 8828] 09:02:32.956195 read(3, [pid 8830] 09:02:32.956209 futex(0xc000053148, FUTEX_WAKE_PRIVATE, 1 [pid 8828] 09:02:32.956222 <... read resumed> 0xc00001e100, 128) = -1 EAGAIN (Resource temporarily unavailable) [pid 8832] 09:02:32.956234 <... futex resumed> ) = 0 [pid 8828] 09:02:32.956247 epoll_pwait(4, [pid 8832] 09:02:32.956261 epoll_pwait(4, [pid 8828] 09:02:32.956274 <... epoll_pwait resumed> [{EPOLLOUT, {u32=3692930888, u64=140324569458504}}], 128, 0, NULL, 2) = 1 [pid 8832] 09:02:32.956292 <... epoll_pwait resumed> [], 128, 0, NULL, 0) = 0 [pid 8828] 09:02:32.956304 epoll_pwait(4, [pid 8832] 09:02:32.956318 futex(0xc000053148, FUTEX_WAIT_PRIVATE, 0, NULL [pid 8830] 09:02:32.956331 <... futex resumed> ) = 1 [pid 8829] 09:02:32.956341 nanosleep({tv_sec=0, tv_nsec=20000}, [pid 8830] 09:02:32.956485 write(6, "\0", 1 [pid 8828] 09:02:32.956512 <... epoll_pwait resumed> [{EPOLLIN, {u32=7825232, u64=7825232}}], 128, -1, NULL, 0) = 1 [pid 8830] 09:02:32.956527 <... write resumed> ) = 1 [pid 8828] 09:02:32.956538 read(5, [pid 8830] 09:02:32.956552 futex(0xc000052948, FUTEX_WAIT_PRIVATE, 0, NULL [pid 8828] 09:02:32.956565 <... read resumed> "\0", 16) = 1 [pid 8829] 09:02:32.956578 <... nanosleep resumed> NULL) = 0 [pid 8828] 09:02:32.956589 epoll_pwait(4, [pid 8829] 09:02:32.956603 nanosleep({tv_sec=0, tv_nsec=20000}, [pid 8828] 09:02:32.956617 <... epoll_pwait resumed> [], 128, 0, NULL, 0) = 0 [pid 8828] 09:02:32.956633 epoll_pwait(4, [pid 8829] 09:02:32.956695 <... nanosleep resumed> NULL) = 0 [pid 8829] 09:02:32.956753 futex(0x748558, FUTEX_WAIT_PRIVATE, 0, {tv_sec=3, tv_nsec=999597962}) = -1 ETIMEDOUT (Connection timed out) [pid 8829] 09:02:36.956774 nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 [pid 8829] 09:02:36.957187 nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 [pid 8829] 09:02:36.957499 nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 [pid 8829] 09:02:36.957756 nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 [pid 8829] 09:02:36.958102 nanosleep({tv_sec=0, tv_nsec=20000}, [pid 8828] 09:02:36.958223 <... epoll_pwait resumed> [], 128, 3999, NULL, 96916898964330) = 0 [pid 8829] 09:02:36.958318 <... nanosleep resumed> NULL) = 0 [pid 8828] 09:02:36.958381 futex(0xc000052948, FUTEX_WAKE_PRIVATE, 1 [pid 8829] 09:02:36.958456 nanosleep({tv_sec=0, tv_nsec=20000}, [pid 8828] 09:02:36.958538 <... futex resumed> ) = 1 [pid 8830] 09:02:36.958775 <... futex resumed> ) = 0 [pid 8828] 09:02:36.958858 write(1, "============================> cl"..., 40============================> close now [pid 8830] 09:02:36.958989 epoll_pwait(4, [pid 8828] 09:02:36.959051 <... write resumed> ) = 40 [pid 8830] 09:02:36.959091 <... epoll_pwait resumed> [], 128, 0, NULL, 0) = 0 [pid 8829] 09:02:36.959131 <... nanosleep resumed> NULL) = 0 [pid 8828] 09:02:36.959208 epoll_ctl(4, EPOLL_CTL_DEL, 3, 0xc000061d1c [pid 8830] 09:02:36.959300 futex(0xc000052948, FUTEX_WAIT_PRIVATE, 0, NULL [pid 8828] 09:02:36.959359 <... epoll_ctl resumed> ) = 0 [pid 8829] 09:02:36.959401 getpid( [pid 8828] 09:02:36.959449 close(3 ```

The gist is this: (Timeout in the closing goroutine set to 4 seconds, not 50)

[pid  8828] 09:02:32.955015 openat(AT_FDCWD, "/dev/ttyS4", O_RDWR|O_NOCTTY|O_NONBLOCK <unfinished ...>
[pid  8828] 09:02:32.955391 <... openat resumed> ) = 3
[pid  8828] 09:02:32.955756 epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=3692930888, u64=140324569458504}}) = 0
[pid  8828] 09:02:32.956017 ioctl(3, SNDCTL_TMR_START or TCSETS, {c_iflags=0, c_oflags=0x4, c_cflags=0xcbd, c_lflags=0xa30, c_line=0, c_cc[VMIN]=0, c_cc[VTIME]=10, c_cc="\x03\x1c\x7f\x15\x04\x0a\x00\x00\x11\x13\x1a\x00\x12\x0f\x17\x16\x00\x00\x00"}) = 0
[pid  8828] 09:02:32.956195 read(3,  <unfinished ...>
[pid  8828] 09:02:32.956222 <... read resumed> 0xc00001e100, 128) = -1 EAGAIN (Resource temporarily unavailable)
[pid  8828] 09:02:36.958858 write(1, "============================> cl"..., 40============================> close now
[pid  8828] 09:02:36.959449 close(3 <unfinished ...>

The file /dev/ttyS4 is opened and assigned fd 3. The Go runtime decides to do nonblocking I/O and registers the fd with its epoll instance to get notifications. Then you see the VTIME setting taking place, VTIME=10. The Go runtime reads from the FD, which is in nonblocking mode, and immediately gets EAGAIN, i.e. no data. This already hints at VTIME being ignored. No I/O happens on the file descriptor as epoll never notifies the runtime of anything. In other words, epoll does not respect the VTIME setting. After the SerialPort was closed, the FD 3 is closed as well.

Read and epoll seem to ignore VTIME in nonblocking mode, which explains the behavior you are seeing.

With workaround

``` [pid 10109] 09:08:24.899707 openat(AT_FDCWD, "/dev/ttyS4", O_RDWR|O_CLOEXEC [pid 10115] 09:08:24.899811 gettid( [pid 10111] 09:08:24.899892 futex(0xc000053148, FUTEX_WAKE_PRIVATE, 1 [pid 10115] 09:08:24.899962 <... gettid resumed> ) = 10115 [pid 10114] 09:08:24.900031 <... futex resumed> ) = 0 [pid 10115] 09:08:24.900100 futex(0x776978, FUTEX_WAIT_PRIVATE, 0, NULL [pid 10114] 09:08:24.900189 futex(0xc000053148, FUTEX_WAIT_PRIVATE, 0, NULL [pid 10111] 09:08:24.900263 <... futex resumed> ) = 1 [pid 10110] 09:08:24.900315 <... nanosleep resumed> NULL) = 0 [pid 10111] 09:08:24.900367 futex(0xc000052948, FUTEX_WAIT_PRIVATE, 0, NULL [pid 10110] 09:08:24.900452 nanosleep({tv_sec=0, tv_nsec=20000}, [pid 10109] 09:08:24.900539 <... openat resumed> ) = 3 [pid 10109] 09:08:24.900623 epoll_create1(EPOLL_CLOEXEC [pid 10110] 09:08:24.900694 <... nanosleep resumed> NULL) = 0 [pid 10109] 09:08:24.900743 <... epoll_create1 resumed> ) = 4 [pid 10109] 09:08:24.900841 pipe2( [pid 10110] 09:08:24.900912 nanosleep({tv_sec=0, tv_nsec=20000}, [pid 10109] 09:08:24.900982 <... pipe2 resumed> [5, 6], O_NONBLOCK|O_CLOEXEC) = 0 [pid 10109] 09:08:24.901060 epoll_ctl(4, EPOLL_CTL_ADD, 5, {EPOLLIN, {u32=7825232, u64=7825232}} [pid 10110] 09:08:24.901139 <... nanosleep resumed> NULL) = 0 [pid 10109] 09:08:24.901189 <... epoll_ctl resumed> ) = 0 [pid 10110] 09:08:24.901249 nanosleep({tv_sec=0, tv_nsec=20000}, [pid 10109] 09:08:24.901344 epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=3761674056, u64=139950976046920}} [pid 10110] 09:08:24.901473 <... nanosleep resumed> NULL) = 0 [pid 10109] 09:08:24.901528 <... epoll_ctl resumed> ) = 0 [pid 10110] 09:08:24.901579 epoll_pwait(4, [pid 10109] 09:08:24.901642 fcntl(3, F_GETFL [pid 10110] 09:08:24.901726 <... epoll_pwait resumed> [{EPOLLOUT, {u32=3761674056, u64=139950976046920}}], 128, 0, NULL, 0) = 1 [pid 10109] 09:08:24.901793 <... fcntl resumed> ) = 0x8002 (flags O_RDWR|O_LARGEFILE) [pid 10110] 09:08:24.901851 nanosleep({tv_sec=0, tv_nsec=20000}, [pid 10109] 09:08:24.901921 fcntl(3, F_SETFL, O_RDWR|O_NONBLOCK|O_LARGEFILE) = 0 [pid 10110] 09:08:24.902036 <... nanosleep resumed> NULL) = 0 [pid 10109] 09:08:24.902088 fcntl(3, F_GETFL [pid 10110] 09:08:24.902150 nanosleep({tv_sec=0, tv_nsec=20000}, [pid 10109] 09:08:24.902219 <... fcntl resumed> ) = 0x8802 (flags O_RDWR|O_NONBLOCK|O_LARGEFILE) [pid 10109] 09:08:24.902286 fcntl(3, F_SETFL, O_RDWR|O_LARGEFILE [pid 10110] 09:08:24.902352 <... nanosleep resumed> NULL) = 0 [pid 10109] 09:08:24.902401 <... fcntl resumed> ) = 0 [pid 10110] 09:08:24.902451 nanosleep({tv_sec=0, tv_nsec=20000}, [pid 10109] 09:08:24.902521 write(1, "setting read params\n", 20setting read params ) = 20 [pid 10110] 09:08:24.902664 <... nanosleep resumed> NULL) = 0 [pid 10109] 09:08:24.902719 ioctl(3, TCGETS [pid 10110] 09:08:24.902785 nanosleep({tv_sec=0, tv_nsec=20000}, [pid 10109] 09:08:24.902855 <... ioctl resumed> , {c_iflags=0, c_oflags=0x4, c_cflags=0xcbd, c_lflags=0xa30, c_line=0, c_cc[VMIN]=0, c_cc[VTIME]=10, c_cc="\x03\x1c\x7f\x15\x04\x0a\x00\x00\x11\x13\x1a\x00\x12\x0f\x17\x16\x00\x00\x00"}) = 0 [pid 10109] 09:08:24.902946 ioctl(3, TCGETS [pid 10110] 09:08:24.903015 <... nanosleep resumed> NULL) = 0 [pid 10109] 09:08:24.903065 <... ioctl resumed> , {c_iflags=0, c_oflags=0x4, c_cflags=0xcbd, c_lflags=0xa30, c_line=0, c_cc[VMIN]=0, c_cc[VTIME]=10, c_cc="\x03\x1c\x7f\x15\x04\x0a\x00\x00\x11\x13\x1a\x00\x12\x0f\x17\x16\x00\x00\x00"}) = 0 [pid 10110] 09:08:24.903131 nanosleep({tv_sec=0, tv_nsec=20000}, [pid 10109] 09:08:24.903248 ioctl(3, SNDCTL_TMR_START or TCSETS, {c_iflags=0, c_oflags=0x4, c_cflags=0xcbd, c_lflags=0xa30, c_line=0, c_cc[VMIN]=0, c_cc[VTIME]=10, c_cc="\x03\x1c\x7f\x15\x04\x0a\x00\x00\x11\x13\x1a\x00\x12\x0f\x17\x16\x00\x00\x00"} [pid 10110] 09:08:24.903347 <... nanosleep resumed> NULL) = 0 [pid 10109] 09:08:24.903399 <... ioctl resumed> ) = 0 [pid 10110] 09:08:24.903454 nanosleep({tv_sec=0, tv_nsec=20000}, [pid 10109] 09:08:24.903527 ioctl(3, TCGETS, {c_iflags=0, c_oflags=0x4, c_cflags=0xcbd, c_lflags=0xa30, c_line=0, c_cc[VMIN]=0, c_cc[VTIME]=10, c_cc="\x03\x1c\x7f\x15\x04\x0a\x00\x00\x11\x13\x1a\x00\x12\x0f\x17\x16\x00\x00\x00"}) = 0 [pid 10110] 09:08:24.903681 <... nanosleep resumed> NULL) = 0 [pid 10109] 09:08:24.903726 write(1, "done\n", 5done [pid 10110] 09:08:24.903815 nanosleep({tv_sec=0, tv_nsec=20000}, [pid 10109] 09:08:24.903874 <... write resumed> ) = 5 [pid 10109] 09:08:24.903976 futex(0xc000052948, FUTEX_WAKE_PRIVATE, 1 [pid 10110] 09:08:24.904052 <... nanosleep resumed> NULL) = 0 [pid 10109] 09:08:24.904097 <... futex resumed> ) = 1 [pid 10111] 09:08:24.904140 <... futex resumed> ) = 0 [pid 10109] 09:08:24.904183 read(3, [pid 10111] 09:08:24.904237 futex(0xc000053148, FUTEX_WAKE_PRIVATE, 1 [pid 10110] 09:08:24.904296 nanosleep({tv_sec=0, tv_nsec=20000}, [pid 10114] 09:08:24.904367 <... futex resumed> ) = 0 [pid 10111] 09:08:24.904416 <... futex resumed> ) = 1 [pid 10114] 09:08:24.904473 futex(0xc000053148, FUTEX_WAIT_PRIVATE, 0, NULL [pid 10111] 09:08:24.904541 futex(0xc000053148, FUTEX_WAKE_PRIVATE, 1 [pid 10114] 09:08:24.904598 <... futex resumed> ) = -1 EAGAIN (Resource temporarily unavailable) [pid 10111] 09:08:24.904653 <... futex resumed> ) = 0 [pid 10114] 09:08:24.904709 epoll_pwait(4, [pid 10111] 09:08:24.904775 futex(0xc000052948, FUTEX_WAIT_PRIVATE, 0, NULL [pid 10114] 09:08:24.904835 <... epoll_pwait resumed> [{EPOLLOUT, {u32=3761674056, u64=139950976046920}}], 128, 3999, NULL, 97268847104150) = 1 [pid 10110] 09:08:24.904911 <... nanosleep resumed> NULL) = 0 [pid 10114] 09:08:24.904968 epoll_pwait(4, [pid 10110] 09:08:24.905034 nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 [pid 10110] 09:08:24.905355 nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 [pid 10110] 09:08:24.905574 nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 [pid 10110] 09:08:24.905768 nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 [pid 10110] 09:08:24.905957 nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 [pid 10110] 09:08:24.906343 nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 [pid 10110] 09:08:24.906637 nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 [pid 10110] 09:08:24.906957 nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 [pid 10110] 09:08:24.907199 nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 [pid 10110] 09:08:24.907519 nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 [pid 10110] 09:08:24.907787 nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 [pid 10110] 09:08:24.907983 nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 [pid 10110] 09:08:24.908176 nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 [pid 10110] 09:08:24.908367 nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 [pid 10110] 09:08:24.908558 nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 [pid 10110] 09:08:24.908749 nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 [pid 10110] 09:08:24.908941 nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 [pid 10110] 09:08:24.909132 nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 [pid 10110] 09:08:24.909324 nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 [pid 10110] 09:08:24.909570 nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 [pid 10110] 09:08:24.909780 nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 [pid 10110] 09:08:24.909969 nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 [pid 10110] 09:08:24.910163 nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 [pid 10110] 09:08:24.910354 nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 [pid 10110] 09:08:24.910543 nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 [pid 10110] 09:08:24.910734 nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 [pid 10110] 09:08:24.910923 nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 [pid 10110] 09:08:24.911111 nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 [pid 10110] 09:08:24.911303 nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 [pid 10110] 09:08:24.911491 nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 [pid 10110] 09:08:24.911679 nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 [pid 10110] 09:08:24.911868 nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 [pid 10110] 09:08:24.912057 nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 [pid 10110] 09:08:24.912245 nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 [pid 10110] 09:08:24.912441 nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 [pid 10110] 09:08:24.912638 nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 [pid 10110] 09:08:24.912827 nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 [pid 10110] 09:08:24.913015 nanosleep({tv_sec=0, tv_nsec=40000}, NULL) = 0 [pid 10110] 09:08:24.913223 nanosleep({tv_sec=0, tv_nsec=80000}, NULL) = 0 [pid 10110] 09:08:24.913505 nanosleep({tv_sec=0, tv_nsec=160000}, NULL) = 0 [pid 10110] 09:08:24.914033 nanosleep({tv_sec=0, tv_nsec=320000}, NULL) = 0 [pid 10110] 09:08:24.914812 nanosleep({tv_sec=0, tv_nsec=640000}, NULL) = 0 [pid 10110] 09:08:24.915828 nanosleep({tv_sec=0, tv_nsec=20000}, NULL) = 0 [pid 10110] 09:08:24.916186 futex(0x748558, FUTEX_WAIT_PRIVATE, 0, {tv_sec=3, tv_nsec=988330494} [pid 10109] 09:08:25.909699 <... read resumed> "", 128) = 0 ```

Again the gist:

[pid 10109] 09:08:24.899707 openat(AT_FDCWD, "/dev/ttyS4", O_RDWR|O_CLOEXEC <unfinished ...>
[pid 10109] 09:08:24.900539 <... openat resumed> ) = 3
[pid 10109] 09:08:24.901344 epoll_ctl(4, EPOLL_CTL_ADD, 3, {EPOLLIN|EPOLLOUT|EPOLLRDHUP|EPOLLET, {u32=3761674056, u64=139950976046920}} <unfinished ...>
[pid 10109] 09:08:24.901528 <... epoll_ctl resumed> ) = 0
[pid 10109] 09:08:24.903527 ioctl(3, TCGETS, {c_iflags=0, c_oflags=0x4, c_cflags=0xcbd, c_lflags=0xa30, c_line=0, c_cc[VMIN]=0, c_cc[VTIME]=10, c_cc="\x03\x1c\x7f\x15\x04\x0a\x00\x00\x11\x13\x1a\x00\x12\x0f\x17\x16\x00\x00\x00"}) = 0
[pid 10109] 09:08:24.904183 read(3,  <unfinished ...>
[pid 10109] 09:08:25.909699 <... read resumed> "", 128) = 0

The file is opnened and added to epoll by the Go runtime as well. However, after calling .Fd() calls to .Read() do do read()s directly, without waiting for readiness notification by epoll. You can see this in the read() call taking place uncoditionally and taking 1 second to complete, matching the VTIME setting.

soypat commented 2 years ago

I'm sorry for not answering sooner, how rude of me.

I'll be back and working with this repo sometime within the next 20ish days. My use case is the following: I have an actuator that must be initialized and then is ready to take commands via Write and reply status via Read. Below is a sketch showing the current implementation:

There's surely a better way to do this heh ```go // SendCommand writes the recieved command through UART and reads in r // the recieved information (if any). // Calls to this function are blocking. func (p *Servo) SendCommand(command string, r []byte) (err error) { p.comm_mutex.Lock() defer p.comm_mutex.Unlock() c := make(chan error) go func() { //Read goRoutine var r_error error = nil if r != nil { _, r_error = p.comm.Read(r[:]) //Reads port until all bytes are read } c <- r_error }() go func() { //Write goRoutine var w_error error = nil if command != "" { w := []byte(command) _, w_error = p.comm.Write(w[:]) //Writes port until all bytes are sent } c <- w_error }() err1, err2 := <-c, <-c if err1 != nil { return err1 } else if err2 != nil { return err2 } else { return nil } } ```

If the actuator at any point is disconnected or there is no actuator at initialization the whole program hangs indefinitely. I have only detected this issue when trying to run the program with a disconnected actuator. If the actuator is connected the program works perfectly fine.

My worry is that if during a mission this actuator for one reason or another stops replying a goroutine is leaked in the best case, and in the worst case the whole program is halted. It'd be nice if there was a read timeout.

Before your reply I was planning on having a watchdog checking elapsed time during a read transaction and cycling the file if the timeout was triggered (closing and reopening the file). I am curious as to how you'd implement the Read timeout though.

Thank you for your quick reply and greatly detailed explanation <3!

distributed commented 2 years ago

No worries, life happens.

Regarding your code: Yes, you do want some kind of timeout. There are multiple ways to go about it.

Working a bit with your code and assuming there's SetReadDeadline() method (e.g. from a net.TCPConn, but bear with me) on your p.Comm, I can slightly rewrite it as show below.

```go // SendCommand writes the received command through UART and reads in // the received information (if any). // Calls to this function are blocking. func (p *Servo) SendCommand(command string, r []byte) (err error) { p.comm_mutex.Lock() defer p.comm_mutex.Unlock() c := make(chan error) go func() { //Read goRoutine var err error if r != nil { err = p.comm.SetReadDeadline(time.Now().Add(500*time.Millisecond)) if err != nil { c <- err } // Tries to read up to len(r) bytes. Not that p.comm.Read() may return with less than // len(r) bytes and not return an error, see io.Reader documentation. _, err = io.ReadFull(p.comm, r) } c <- err }() go func() { //Write goRoutine var w_error error = nil if command != "" { w := []byte(command) _, w_error = p.comm.Write(w[:]) //Writes port until all bytes are sent } c <- w_error }() err1, err2 := <-c, <-c if err1 != nil { return err1 } else if err2 != nil { return err2 } else { return nil } } ```

This program will not block indefinitely, assuming the write completes. Notice how I used io.ReadFull instead of a plain Read.

The latest release of sers doesn't have a SetReadDeadline on the SerialPort, unfortunately. However if you are using Linux or OS X, you might be in luck. You can try using the remove-cgo branch of the library, i.e. run go get -u github.com/distributed/sers@remove-cgo on your project. You can try a type conversion on the SerialPort to get access to the deadline methods, see: https://github.com/distributed/sers/blob/remove-cgo/verification/readdeadline/readdeadline.go. I plan to introduce the deadline methods to the Windows port as well, but I won't and can't promise you a timeline. The advantage of using the deadline methods is that they are what the net.Conn interface provides so you can easily swap out a serial port for e.g. a TCP connection in your programs, which can be handy.

Barring adding a read deadline as described above, you still have options. I use sers in various setups and typically I do the following. The described approach works even without deadlines.

I have a goroutine that reads messages (i.e. packets, delimited groups of bytes) from the serial port and sends them on a channel. This goroutine continues until it gets an error reading from the serial port. If I want to stop the reading goroutine, I close the serial port which will lead to the Read() returning to the read goroutine. This functionality is what is verified by verification/readclose.go. The goroutine looks something like this:

// readGoroutine reads whole packets and sends them on channel c. If there is
// no actor to receive from c, the channel done should be closed.
// This function ties can be started with an errgroup (golang.org/x/sync/errgroup)
// with eg.Go(func() error { readGoroutine(sp, packchan, ctx.Done()) }
func readGoroutine(sp sers.SerialPort, c chan <- *Packet, done <-chan struct) error {
    for {
         pack, err := readFrame(sp)
         if err != nil {
              return err
         }
         select {
         case c <- pack:
         case <-done:
         }
}

When you want to send-then-receive you could do the following:

func exchange(ctx context.Context, out []byte) (*Packet,error) {
    _,err := sp.Write(out)
    if err != nil {
         return nil, err
    }

    select {
    case pack := <- packchan:
        return pack, nil
    case <-ctx.Done():
        return nil, io.EOF
    case <-time.After(100*time.Millisecond):
         return nil, fmt.Errorf("timeout while waiting to receive packet")
     } 
}

To add two things that are not sers-specific, but might still be of help: It pays to have some kind of message definition, .e.g. "first a byte 0xfe, then a byte N that indicated the length of the payload, N payload bytes, then a CRC-16 of all the N payload bytes" and use that for communication.

Consider what happens if your serial line happens to be noisy and your computer receives an extra byte: This means that you can read e.g. 11 instead of 10 bytes. After you read your 10 byte message, there is still one extra byte left in the receive buffer. The next time you receive a 10 byte message, that 1 byte will carry over etc. Barring further errors, you won't be able to receive messages indefinitely.

A similar issue can happen if the actuator responds, but the response only arrives after timeout. This can also happen because of delays at the OS or application layer. In this case, the next read operation will read an old message, with the pattern repeating indefinitely if you always read one message for each one sent.

There are many more reasons to introduce some kind of message definition.

Because of issues such as these described above it also pays to number your messages so that you know that you just received a reply to request number 37. If you receive an old reply or one that doesn't pertain to any request you made, you can ignore it.

TL;DR:

If you're on Linux/OS X you can use the SetReadDeadline modification to your program.
If you need Windows support and/or you are willing to extend your program somewhat, you can use the read goroutine approach shown above.
As mentioned before, you can also try to use the workaround with .Takeover(). This is the least future safe option, though.

How do I implement read deadlines?

I don't, I let them be handled by the poll package in the Go runtime. I merely call SetReadDeadline and friend on the serial port file descriptor (as *os.File) and use the functionality already provided.

What happens under the hood is this: You specify a read deadline for e.g. in 2 seconds, then you do a .Read(). The Go runtime doesn't call read() on the serial port file descriptor, instead it adds it to the set of file descriptors watched for readability. Under Linux this mechanism uses epoll(). When the runtime actually blocks because there nothing else to do, using epoll_wait it specifies a timeout of 2 seconds*. If data arrives within these 2 seconds, epoll will instruct the Go runtime about the file descriptors on which I/O can be done. Otherwise it will return after 2 seconds. At this point, the Go runtime checks that the deadline has been exceeded, unblock the goroutine running Read() and return an error to it.

Your program sees a timeout on a synchronous read operation. The Go runtime creates this behavior by using asynchronous I/O primitives and returning control to your goroutine at appropriate times with the corresponding return count/error return values.

Or whatever comes before, if you have other, shorter deadlines active.

soypat commented 1 year ago

So today I had to work with yet another actuator and settled for your library yet again. I keep coming back to it with new insights.

So I like your readGoroutine approach. The problem it has is the one you mentioned regarding the read buffer being filled with noise in between reads- I think it can be mitigated with a Flush function. I really would like a type that provides this synchronous functionality wrapping the SerialPort type. If you don't mind I'd like to PR this functionality to the repo (maybe best suited as a subpackage). Whatdya think? Also: Is there no way to see how many bytes are in the buffer ready to be read?

soypat commented 1 year ago

Here's an example of what the API could look like:

```go package sy0x import ( "context" "errors" "math" "time" "github.com/distributed/sers" "github.com/smallnest/ringbuffer" ) type Txer struct { sp sers.SerialPort err error buffer ringbuffer.RingBuffer waitForResponse time.Duration } func NewTxer(ctx context.Context, sp sers.SerialPort, maxBufferSize int, waitForResponse time.Duration) (*Txer, error) { if waitForResponse < 0 { return nil, errors.New("negative wait for response") } tx := &Txer{ sp: sp, buffer: *ringbuffer.New(maxBufferSize), waitForResponse: waitForResponse, } go tx.readGoroutine(ctx) return tx, nil } func (t *Txer) Reset() { t.buffer.Reset() } func (t *Txer) Tx(w, r []byte) (int, error) { t.Reset() n, err := t.sp.Write(w) startWait := time.Now() if err != nil { return 0, err } if n != len(w) { panic("bad writer implementation on SerialPort: did not return error on being unable to write all bytes to wire") } for time.Since(startWait) < t.waitForResponse && t.buffer.IsEmpty() { time.Sleep(time.Millisecond) } return t.buffer.Write(r) } func (t *Txer) readGoroutine(ctx context.Context) { var buf [math.MaxUint16]byte for { n, err := t.sp.Read(buf[:]) if err != nil { t.err = err return } _, err = t.buffer.Write(buf[:n]) if err != nil { t.err = err } select { case <-ctx.Done(): t.err = ctx.Err() return default: } } } ```

distributed commented 1 year ago

I would not work with time based ways of distinguishing messages. It limits message throughput and whether your protocol works is dependent on how timely data is being delivered to your program. Time based approaches work well for real time systems where you have tight bounds on when data arrives. A scheduling hiccup in your PC can easily move timing around by dozens or hundreds of milliseconds.

As I mentioned above, I believe the best way to find distinct messages in a byte stream is to define what a message looks like. The following article has a nice overview: https://eli.thegreenplace.net/2009/08/12/framing-in-serial-communications

Note how the article discounts time based approaches as approach (1).

Usually I go for a message definition something like the following:

ff [payload len N: 0-255] payload*N [checksum of the previous bytes]

To read a message you write a function that looks for a 0xff byte, then reads another byte and interprets it as a length N, then reads N bytes, puts them into a buffer, then calculates a checksum of the previous bytes and compares it to a received checksum. If the calculated and received checksums match, it is very likely your received a correct message, if so, deliver it to your program, otherwise drop.

Even if your data transmission is disturbed, the above method has a good practical chance to lock onto a subsequent ff and start receiving correct data again*.

Concerning your code, I would make Tx wait in a select, not in a while xxx { sleep } loop. Note in my example:

    case pack := <- packchan:
        return pack, nil
    case <-ctx.Done():
        return nil, io.EOF
    case <-time.After(100*time.Millisecond):
         return nil, fmt.Errorf("timeout while waiting to receive packet")

This implements waiting for the next message on packchan, returning if nothing arrives withing 100 ms and is even cancellable through a context.

Thanks for the offer of the PR. Sers is a library focussed on providing access to serial ports, not to framing. If you would like to publish your code, I suggest you do it in a separate repository on your account.

Regarding your question. No, there is no way to know how much data is ready to read. You can kind of fake it if your programs reads into a buffer and you query that buffer's length. I am not sure what you want to achieve. Knowledge about the the amount of readable data is rarely required in Go programs.

Your payload can contain ff bytes and make the receiver search from frames in the middle of frames. Note whatever the framing scheme, it is not possible to guard against all possible errors.

soypat commented 1 year ago

The following article has a nice overview: https://eli.thegreenplace.net/2009/08/12/framing-in-serial-communications

Thank you a billion. This was a 10/10 read.

Sers is a library focussed on providing access to serial ports, not to framing

After reading the article above this sounds super reasonable heh.

Thanks for all the tips- super valuable info. Gonna go try write a few serial framing libraries and see what the fuss is all about

distributed / sers