linux/unix read timeouts vs multi-thread signal delivery

Summary of email conversations with Tomasz K. There are issues with linux/unix timeouts in diag_tty_read() and diag_tty_write().

SIGUSR1 is used with timer_settime() to interrupt the read() and write() calls, this is so far the best way to get precise timeouts
SIGALRM periodically triggers a handler to call the timeout functions diag_l2_timer(), diag_l3_timer(). It is set to run every ~300ms (ALARM_TIMEOUT defined in diag_os.h)

Tomasz was having issues with some read/write calls not being interrupted, he notes

issue A) Yep – the problem is with the signal, but it’s not that it isn’t delivered at all, but it is being caugth by incorrect thread. [...]

Some background info: http://pubs.opengroup.org/onlinepubs/7908799/xsh/sigaction.html

the periodic callback timer has sigev_notify=SIGEV_THREAD, which calls diag_os_periodic() thus : " The function will be executed in an environment as if it were the start_routine for a newly created thread with thread attributes specified by sigev_notify_attributes."
we're using sigev_notify=SIGEV_SIGNAL for the timeout timer, which means "The signal specified in sigev_signo will be generated for the process when the event of interest occurs."
POSIX:

"At the time of generation, a determination is made whether the signal has been generated for the process or for a specific thread within the process."

That sounds like a bad sign - the timeout SIGUSR1 is delivered "to the process" because of SIGEV_SIGNAL; this seems to agree 100% with your findings : the "wrong" thread receives SIGUSR1.

Maybe the main thread (and whole process) could be set to ignore SIGUSR1; but inside the diag_tty_read() and _write() functions, we could temporarily set the signal mask to unblock SIGUSR1 for the current thread; and reset the masks to SIG_IGN before returning.

POSIX:

Each thread has a signal mask that defines the set of signals currently blocked from delivery to it. The signal mask for a thread is initialised from that of its parent or creating thread. [...] The sigaction(), sigprocmask() and sigsuspend() functions control the manipulation of the signal mask.

As well as setting signal masks, a robust solution would probably need a different timer + signal for every open tty. POSIX already has provisions for this, defining a range of signal numbers from "SIGRTMIN" to "SIGRTMAX".

Is this ticket originally about freediag/scantool, or some other program that uses our diag_* as a library?

freediag is not multithreaded. Don't be fooled by the pthreads mutex calls in diag_os_unix, they are just an ineffectual offering to the concurrency god. We never create any threads, so we're not multithreaded. In particular, signal handlers are not an implicit thread; they are analogous to hardware interrupts. A signal is a forced branch to the signal handler from whatever you were doing before. As such, you can't call any non-reentrant functions from a signal handler. For example, you can't call malloc or free, which means that basically any of the keepalive functions will end up corrupting the heap. It works for a few minutes, an hour, whatever, and then a signal comes in at an unlucky time and it segfaults. Generally, best practice for signal handlers is to just set a variable in the signal handler and return, and then check the variable in your mainline code.

I think SIGALRM for keepalives is basically an untenable approach.

We have a few choices:

Fork a subprocess that handles all communication, and talks to the main process over pipes. No threads anywhere. This is the traditional UNIX approach. Unfortunately, it won't work if we ever want to run on Mac OS: making an innocent-seeming standard library call in Mac OS may start threads or open Mach ports, after which calling fork() and then doing anything other than an immediate exec() after the fork can cause a deadlock.
Have readline periodically time out while waiting for console input and check if it's time to send a keepalive. This is the most straightforward approach, but it's kludgy, and it won't work if we're doing some other long running operation that doesn't involve talking to the ECU or waiting for a command.
Implement ISO9141 and the other L2 protocols as line disciplines in the kernel, and have the line discipline take care of keepalives. Architecturally, this is the only correct solution. But, it's impractical. It only works on Linux, it means users would have to update to a kernel with the new line disciplines, and someone would have to write the kernel code and get it merged, code for protocols that are already obsolescent now that recent vehicles are using CANbus.
Use actual threads, including one dedicated to keepalives. We get all the headaches associated with threads. We won't be able to run on very archaic UNIX flavors that lack threads, but Linux will be OK. In case of future ports: Mac OS and iOS also support pthreads, and Android supports a pthreads subset. (iOS and Android would probably also be a nonissue because on those devices, we would be using a smart interface rather than a tty.)

So I think for keepalives, the readline timeouts are the easiest option, a keepalive thread is a tougher but maybe better option, and in an ideal world, line disciplines could also be added as a selectable L0 that would (as with ELM and other smart interfaces) move the keepalives out of the userspace process.

As far as read/write timeouts, I think we can get rid of SIGUSR1 with some combination of select/poll, nonblocking read/write, and the MIN/TIME termios. On the other hand, if we're not using threads (using readline timeouts, and not dealing with a program other than freediag/scantool that uses our diag_* and is multithreaded) then our current use of SIGUSR1 might be fine.

Stirring up old unpleasant tickets... : )

Is this ticket originally about freediag/scantool

mainly yes. I also maintain nisprog which uses libdiag in a similar way. Both have issues with keepalives sometimes causing a segfault or other proglems, IIRC. Interestingly, I was never able to reproduce Tomasz' issue of "some read/write calls not being interrupted"

Keep in mind I haven't given much thought to these issues recently, but at the time (2014-2016) I spent many, many hours researching and testing this.

freediag is not multithreaded

I'm well aware of that, but the POSIX description of signal delivery when SIGEV_THREAD is specified, is not excessively clear. Granted the mutex use is hardly appropriate, but it did alleviate some of the problems. The annoying problem is really the infamous keepalives. I think I was at the point where I was considering moving keepalive management at the CLI level ( similar to what you described with interrupting readline) , with a lock on "is any user command running". So far there are no user commands that take a significant amount of time, so it may have been a viable, if ugly, solution.

I think SIGALRM for keepalives is basically an untenable approach.

Agreed. This part of the code has been flawed since the start, and I only really dug into it starting in 2014.

pipes [...] Unfortunately, it won't work if we ever want to run on Mac OS

I have zero interest in that platform but good to know.

* Implement ISO9141 and the other L2 protocols as line disciplines in the kernel

Not familiar with 'line disciplines', but based on your description this has no chance of happening. Even if someone steps up to do the work, few (if any) will be willing to tweak their kernel just for this.

* Use actual threads, including one dedicated to keepalives. We get all the headaches associated with threads.

Agreed, but probably would be the way to go if (as I originally intended, but cannot undertake anymore) the entire API is changed to conform to J2534, and then the current L0..L2 become a kind of standalone J2534 emulation layer.

As far as read/write timeouts, I think we can get rid of SIGUSR1 with some combination of select/poll, nonblocking read/write, and the MIN/TIME termios.

Unlikely, unless things have changed since I worked on those timeouts. IIRC termios didn't have enough granularity; select/poll was not nearly accurate enough. There was even an... "interesting" technique that wrote directly to /dev/rtc or something, not sure if I eventually deleted that implementation. You probably noticed the other implementations are still in there, just need to tweak diag_tty_unix.h and recompile. Then the "debug l0test" submenu has read / write timeout selftests that simply require a plain K line adapter (not ELM). As I said earlier, I put in substantial effort to get accurate timeouts and break patterns (necessary for iso14230 with dumb L0 interfaces), they work "well enough", and it will take extremely compelling arguments (and exhaustive tests) to convince me to change those parts of the code. On the other hand, I agree the current keepalive implementation is a dead-end and can absolutely be improved.

Is this ticket originally about freediag/scantool

mainly yes. I also maintain nisprog which uses libdiag in a similar way. Both have issues with keepalives sometimes causing a segfault or other proglems, IIRC. Interestingly, I was never able to reproduce Tomasz' issue of "some read/write calls not being interrupted"

Was Tomasz using freediag too, or his own libdiag-based program?

freediag is not multithreaded

I'm well aware of that, but the POSIX description of signal delivery when SIGEV_THREAD is specified, is not excessively clear.

Oh! I was so obsessed with SIGALRM that I overlooked SIGEV_THREAD, even though it was clearly mentioned. So freediag is multithreaded when it's built with SEL_PERIODIC=S_POSIX: whenever the keepalive timer fires, diag_os_periodic runs in its own thread. This means that we can, at least, call malloc and free from the keepalive functions. And it also provides an explanation for SIGUSR1 not interrupting read(): We start a read, and then the keepalive timer fires, and then during diag_os_periodic, the SIGUSR1 timer fires. SIGUSR1 gets delivered to the diag_os_periodic thread and it doesn't interrupt the main thread's read syscall.

Was Tomasz using freediag too, or his own libdiag-based program?

freediag I believe.

read(): We start a read, and then the keepalive timer fires, and then during diag_os_periodic, the SIGUSR1 timer fires. SIGUSR1 gets delivered to the diag_os_periodic thread and it doesn't interrupt the main thread's read syscall.

That sounds about right, yes.

fenugrec / freediag

linux/unix read timeouts vs multi-thread signal delivery #5