fredericbonnet opened 6 years ago
I've just published TIP #509 here:
https://core.tcl.tk/tips/doc/trunk/tip/509.md
The implementation is available on branch tip-509 in the Fossil repo. Feedback welcome.
In the unfortunate case that the TIP is rejected by the TCT, I have an alternate solution that addresses the original issue by adding reentrancy to the async handler mutex alone. This would be a simple fix for the original ticket f4f44174, and as such it wouldn't require any TIP to reach the core (only proper review from the maintainer). But I think the TIP is a more generic solution to this class of problems.
Further ideas regarding my last comment on https://core.tcl.tk/tcl/tktview/f4f44174
Instead of directly calling sigprocmask(2), the tclXsignal.c module needs to call a new (to be defined) Tcl core function which interacts with the notifier thread in order to query or alter its signal mask.
The code path of Tcl_AsyncMark() currently needs to acquire 3 (three!) mutexen before it sets a thread specific flag and carries out a pthread_cond_broadcast() on that thread. No single call of this path seems to be async-signal-safe according to http://man7.org/linux/man-pages/man7/signal-safety.7.html
Safe alternatives are possibly write(2) and sem_post(3), but this most likely requires a full and incompatible redesign (of which I have no idea yet).
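For illustration, here is a minimal sketch of the kind of async-signal-safe wake-up that write(2) allows (the classic self-pipe trick); wakePipe and the handler are hypothetical names, not Tcl's actual code:

```c
#include <signal.h>
#include <unistd.h>

/* Hypothetical trigger pipe: [1] is written from signal context,
 * [0] is watched by the event loop (select/epoll/kqueue). */
static int wakePipe[2];
static volatile sig_atomic_t pending = 0;

static void
SignalHandler(int sigNumber)
{
    pending = 1;                        /* sig_atomic_t store: safe */
    (void) write(wakePipe[1], "S", 1);  /* write(2): async-signal-safe */
}
```

The event loop then drains the pipe and processes the pending flag outside signal context, where mutexes are safe again.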
A follow-up on my last post. Check-in https://www.androwish.org/index.html/info/41498ca0a57c2aa7 now implements most of the outlined ideas. POSIX signals (per Tclx and Tcl_AsyncMark) are routed through the notifier thread, which delivers them synchronously as part of its main loop. Tcl_AsyncMark() uses (in case it runs in the notifier thread) only the pthread_self() and write() system calls (which are async-signal-safe). A rendez-vous approach is implemented which allows the signal command of Tclx to invoke pthread_sigmask() in the notifier thread in order to block and unblock individual signals.
The test case given in the https://core.tcl.tk/tcl/tktview/f4f44174 ticket seems to work on x86_64 linuxen.
Forwarding a message from Christian Werner (AKA undroidwish on GitHub and chw on core.tcl.tk) on the Tcl-Core list:
Hello Fred,
IMO TIP #509 is consistent for one definite mutex semantic on all platforms. However, I believe it is not part of the signal bounty problem. Thus, I've tried another approach on AndroWish/undroidwish which follows core-8-6-branch:

1. use the notifier thread to handle signals exclusively, by having the first (main) thread block all signals and remember its signal mask, and having the notifier thread re-install this signal mask before blocking in select() and re-block all signals after select() returns. This gives a very clear view at runtime of when signals are to be handled.
2. add a function TclAsyncInNotifier() in tclUnixNotfy.c which sets an integer to a specified value and writes a single byte to the trigger pipe when called from the notifier thread. Both pthread_self() and write() may be used in a signal handler.
3. change tclAsync.c to call the function from point 2 before going the old-fashioned way of taking mutexes and alerting the target thread (which shouldn't happen anyway, due to all signals being blocked in all threads except the notifier).
4. add a function TclAsyncMarkFromNotifier() in tclUnixNotfy.c which deals with those async handlers which were flagged by the function from point 2. This is called as part of the notifier thread's main loop and Tcl_AsyncMark()s all async handlers caught by the notifier thread.
5. change TSD-specific locking in tclAsync.c to a single process-wide mutex, since signals are a process characteristic.
So far, the signal handler path should be free of mutex locks, provided every signal is handled in the notifier thread.
Now, the left-over Tclx signal features are the "signal block" and "signal unblock" commands. The solution implemented is to add a new platform stub called TclpSigProcMask, which has the signature of sigprocmask/pthread_sigmask and carries out an RPC into the notifier thread to modify its signal mask.
The currently implemented code is available for both the unix and macosx platforms with the last check-in on
https://www.androwish.org/index.html/info/28bd3791feec27a5
With these changes the code snippet from ticket https://core.tcl.tk/tcl/tktview/f4f44174 doesn't lock up after a few hundred or thousand iterations anymore.
Now for the problems, questions, ideas, etc.
- When the new epoll/kqueue approach is in effect, there's no more notifier thread. Would a new signal handler thread be needed?
- Linux supports a signalfd(2) file descriptor type which allows signals to be dealt with synchronously. Would that fit the epoll approach? Do BSDen/MacOS have something similar? (See the sketch after this list.)
- The AndroWish beast is different: the Tcl main thread is spit out of a threaded black hole whose signal disposition remains in the darkest darkness. How to deal with it?
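For reference, a minimal signalfd(2) sketch (Linux-specific, purely illustrative, not from any Tcl branch). The returned descriptor can be registered with epoll, and reads yield struct signalfd_siginfo records; on the BSDs and macOS, kqueue's EVFILT_SIGNAL filter is the nearest equivalent:

```c
#include <sys/signalfd.h>
#include <signal.h>
#include <pthread.h>

/* Create an fd that becomes readable when SIGHUP is pending. */
int
MakeSignalFd(void)
{
    sigset_t mask;

    sigemptyset(&mask);
    sigaddset(&mask, SIGHUP);
    /* Block normal delivery so the signal is read from the fd instead
     * (pthread_sigmask, since sigprocmask is unspecified in MT processes). */
    pthread_sigmask(SIG_BLOCK, &mask, NULL);
    return signalfd(-1, &mask, SFD_CLOEXEC);
}
```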
Comments welcome, Christian
I respectfully disagree on the point that TIP #509 doesn't address the original problem. Let me explain.
As you wrote yourself in your last section, some thread-enabled Tcl cores don't have a notifier thread. This means that in simple, single-threaded cases, a thread-enabled core behaves the same as a thread-disabled one, mutexes aside: the code paths are the same, but the Tcl_Mutex functions compile to no-ops in thread-disabled builds.
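Roughly, paraphrasing the shape of tcl.h rather than quoting it exactly:

```c
/* Thread-enabled builds route through real locking functions... */
#ifdef TCL_THREADS
void Tcl_MutexLock(Tcl_Mutex *mutexPtr);
void Tcl_MutexUnlock(Tcl_Mutex *mutexPtr);
#else
/* ...while thread-disabled builds compile the calls away entirely. */
#define Tcl_MutexLock(mutexPtr)
#define Tcl_MutexUnlock(mutexPtr)
#endif
```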
I managed to reproduce the problem described in ticket https://core.tcl.tk/tcl/tktview/f4f44174 by splitting the original 2-threaded scripts into one single-threaded signal target process and one signal emitter process. The signal target script deadlocks eventually because standard Tcl mutexes are not reentrant. Here is the signal receiver script; notice that it no longer uses the Thread package:
```tcl
package require Tclx

signal trap SIGHUP handler
proc handler args {
    puts stderr "SIGHUP [incr ::HUPS] $args"
}

proc gotinput {chan} {
    gets $chan d
    puts "hupity $d"
}

puts [pid]
vwait result
```
And here is the signal emitter, written in Tcl though it could be written in any other language actually:
```tcl
package require Tclx

set pid [lindex $argv 0]
while 1 {
    kill HUP $pid
    puts [incr ::HUP]
}
```
The test procedure is as follows:

1. Run the receiver script with the tclsh version you want to test.
2. Run the emitter script with the tclsh version you want, passing the PID as first argument.

And the results:

- With a thread-disabled tclsh, the first script runs just fine for millions of signals sent.
- With a thread-enabled tclsh, it deadlocks eventually when the signal is caught in a mutex-protected section of the async handler (just as the bounty states).
- With a thread-enabled tclsh with reentrant mutexes, it behaves exactly the same as the thread-disabled version.

On my standard Linux system, the Tcl notifier uses epoll, and all 3 instances above are single-threaded as they should be. On systems that use select you have one extra notifier thread, but I don't think this changes anything to the observed behavior (the deadlock occurs before the notifier thread is triggered). Your proposed approach is interesting but implies that we reinstall a notifier thread on epoll-based cores, which adds an extra layer of complexity and potential performance problems.
Your analysis of the signal-safety of the async handler's code path is very valuable, however:

- Tclx signalling has been there for years on thread-disabled Tcl, so we can safely assume that it works without major trouble, and it's outside the scope of the bounty anyway. And I don't think signal-safety is at play in the scope of the Tcl core (see below).
- But if we are really concerned with signal-safety then we'd better start working on improving Tclx (see below); for example it uses sigprocmask even though POSIX states that: "The use of the sigprocmask() function is unspecified in a multi-threaded process."
- The only difference between the 3 single-threaded test contexts above is Tcl_Mutex semantics. So multithreading IS NOT the root cause of the deadlock problem; mutex reentry (or lack thereof) IS the root cause. And removing mutexes implies a significant refactoring of the existing Tcl core.
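To make that root cause concrete, here is a standalone demonstration of the self-deadlock (strictly speaking, the second lock is undefined behavior for the default mutex type; on Linux it simply hangs):

```c
#include <pthread.h>

int
main(void)
{
    pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;  /* non-recursive by default */

    pthread_mutex_lock(&m);
    pthread_mutex_lock(&m);   /* second lock from the same thread: hangs here */
    return 0;
}
```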
I strongly believe that signal handling should be kept outside of the Tcl core, as it would open a whole new can of worms. Tcl doesn't deal with signals itself, it only provides the facilities to deal with asynchronous events (even though the man page cites signal handling as an example). In our case, Tclx is solely responsible for signal handling and Tcl should know nothing about it IMHO. If Tcl starts dealing with signals then it should do it completely and provide all the necessary features; this would amount to merging parts of Tclx into the core, but it's waaaay outside the scope of the bounty. More on that below.
You also suggest that only the notifier thread should receive signals and that they should be blocked everywhere else. I don't think that's going to work, because there is nothing that prevents client code from installing its own signal handlers, especially given that Tcl is totally signal-agnostic (hence Tclx). Nor does it prevent client code from starting its own threads without using the Tcl API (a common situation when Tcl is embedded as a library or when integrating with third-party libraries that spawn their own threads). At the very least, the notifier thread (when it exists), and all the 'utility' threads in general, should rather block all signals as a defensive measure using pthread_sigmask, to ensure that signals get routed to the right recipient and not to themselves.
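A minimal sketch of that defensive measure (hypothetical utility-thread entry point, not actual Tcl code):

```c
#include <pthread.h>
#include <signal.h>

/* Hypothetical utility-thread entry point: block every signal so the
 * kernel delivers them to some other (interested) thread instead. */
static void *
UtilityThreadProc(void *arg)
{
    sigset_t all;

    sigfillset(&all);
    pthread_sigmask(SIG_BLOCK, &all, NULL);

    /* ... thread body ... */
    return NULL;
}
```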
Moreover, centralizing signals on the notifier thread introduces a single point of failure and a potential bottleneck with performance issues, on all versions of the core even those with thread-local notification (epoll).
A last point: both Tcl notifier and async APIs are public, and specifically designed to be used in a variety of situations (see the introduction of the Notifier man page and the mention of external event loops). So you can make no assumption on the way these APIs will be called in client code. Implementors can create their own notifiers, and the Async man page explicitly states that:
Applications may also call Tcl_AsyncInvoke at interesting times for that application
That's how the NRE engine deals with async events during tailcall actually.
So you can make no assumption on how the async and notifier features will be used and what kind of impact your changes will have on existing code, because nothing prevents client code from building apps that are very different from Tclsh from an architectural point of view and yet use the same public API. Tclsh and Wish are just client apps among others.
About Tclx: TIP #509 only addresses the bounty issue on the Tcl core side. I'm currently working on the Tclx side to ensure that there are no other problems with signal handling in multi-threaded contexts (such as the sigprocmask issue described above) and plan to propose a patch soon, but this has nothing to do with the Tcl core and thus doesn't require a TIP or the involvement of the Tcl Core team.
Fred, I see your point regarding assumptions about the runtime environment. And I'd like your TIP to be adopted, soon, not later. However, can we please agree on these points:
Now the implications as interpreted by my humble self:
Points 2 and 3: the thread addressed by Tcl_AsyncMark is not necessarily the thread which runs the signal handler.
Point 1: pthread_mutex_lock/pthread_mutex_unlock are not part of the safe set, independent of being recursive, non-recursive, process-shared, or whatever. And this is my main argument (and I should have been more precise here): TIP#509 is not part of the solution for how a signal handler can safely set a mark on the async handler.
Best, Christian
One more thought while glancing over the list of async-signal-safe syscalls (http://man7.org/linux/man-pages/man7/signal-safety.7.html):
Instead of handling signals in a dedicated thread, it should be possible, in a special signal-safe version of Tcl_AsyncMark, to determine the current thread using pthread_self and compare it against the originThreadId of the AsyncHandler struct. In case the signal was caught by a different thread, it is then time to do a pthread_kill with the same signal number but with the owning originThreadId of the AsyncHandler struct as the target.
But the root problem remains: the mark on the AsyncHandler must be set using means which are guaranteed to be async-signal-safe.
Thus, at least two core changes seem unavoidable:
I don't have the time to post a longer answer to both your messages right now, but the solution I'm working on works along these lines: let Tclx trap async signals in a dedicated thread (using sigwait for example), then forward them asynchronously to another thread using async-signal-safe functions such as sem_post. That second thread can then call Tcl_AsyncMark safely. At present Tclx calls Tcl_AsyncMark directly from the signal handler.
More on that later.
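For what it's worth, a bare-bones sketch of that pipeline (illustrative names, error handling elided; it assumes the trapped signals are blocked in every thread so that sigwait() receives them, and that sem_init(&wakeup, 0, 0) was called during setup):

```c
#include <pthread.h>
#include <signal.h>
#include <semaphore.h>

static sem_t wakeup;                     /* sem_post() is async-signal-safe */
static volatile sig_atomic_t lastSignal;

/* Dedicated signal thread: receives signals synchronously. */
static void *
SignalWaiterProc(void *arg)
{
    sigset_t mask;
    int sigNumber;

    sigfillset(&mask);
    for (;;) {
        if (sigwait(&mask, &sigNumber) == 0) {
            lastSignal = sigNumber;
            sem_post(&wakeup);           /* forward asynchronously */
        }
    }
    return NULL;
}

/* Second thread: safe context, may call Tcl_AsyncMark() here. */
static void *
MarkerThreadProc(void *arg)
{
    for (;;) {
        sem_wait(&wakeup);
        /* Tcl_AsyncMark(handlerFor(lastSignal)); -- handler lookup elided */
    }
    return NULL;
}
```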
The longer I think over Tcl_AsyncMark while reading the first sentence of the manpage ("These procedures provide a safe mechanism for dealing with asynchronous events such as signals...") the more I am convinced that the current implementation in Tcl 8.6 is plain wrong when the core is built with threading enabled. Of course, there's no problem when the core is un-threaded, and for this the design still holds. However, now that un-threaded builds are phased out, there's a pressing need for reform, IMHO.
Your outline using a thread which does sigwait sounds interesting, and BTW isn't that different from my approach in the notifier thread. But then, don't both solutions introduce a bottleneck and a SPOF? And sigwait still seems only half the solution, since it does not channel the kernel's signal delivery, whereas pthread_sigmask does.
Meanwhile, I think we're in desperate need of a Tcl_AsyncMarkEx(Tcl_AsyncHandler async, int sigNumber), which on POSIX can do something like:
```c
#if TCL_THREADS
    if (Tcl_GetCurrentThread() != async->originThreadId) {
        /* Redirect the signal to the target thread owning the async handler. */
        pthread_kill((pthread_t) async->originThreadId, sigNumber);
        return;
    }
#endif
    /*
     * Do some async-signal-safe magic to wake up the thread, which
     * must now be the thread owning the async handler,
     * e.g. by writing to the eventFd or triggerPipe.
     */
```
Still, something more is needed to allow proper semantics of the "signal block" and "signal unblock" Tclx commands, but this could boil down to a simple per-async-handler mask and flag when a little more reform is carried out.
To extrapolate a flawed interface or to introduce binary incompatibility, that is the fine question.
Meanwhile my first approach evolved, see check-in https://www.androwish.org/index.html/info/40790af1e8e4ec9f
The basic idea is now to redirect signals to the notifier thread using a new core function Tcl_AsyncMarkFromSignal(), which compared to Tcl_AsyncMark() has the signal number added. For the normal notifier this is the implementation:
```c
void
TclAsyncNotifier(int sigNumber, Tcl_ThreadId threadId, int *flagPtr, int value)
{
    if (pthread_equal(pthread_self(), (pthread_t) notifierThread)) {
        if (notifierThreadRunning) {
            *flagPtr = value;
            asyncPending = 1;
            write(triggerPipe, "S", 1);
        }
        return;
    }
    pthread_kill((pthread_t) notifierThread, sigNumber);
}
```
The notifier thread is run with all signals blocked, but unblocks/re-blocks all signals when performing its select() system call.
For the new epoll/kqueue notifier, the function above needs to test for the target thread of the AsyncHandler; if they are unequal, it has to redirect the signal to the target thread. The signal handling in the target thread then finally writes a single byte to the pipe/event fd of the target thread's notifier file descriptor.
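In sketch form, that variant might look as follows (illustrative only; ThreadEventFd() is a hypothetical lookup for the target thread's pipe/event fd, not an existing Tcl function):

```c
void
TclAsyncNotifier(int sigNumber, Tcl_ThreadId threadId, int *flagPtr, int value)
{
    if (!pthread_equal(pthread_self(), (pthread_t) threadId)) {
        /* Wrong thread: redirect the signal to the handler's owner. */
        pthread_kill((pthread_t) threadId, sigNumber);
        return;
    }
    /* Owning thread: flag the handler and wake our own notifier fd
     * using only async-signal-safe calls. */
    *flagPtr = value;
    (void) write(ThreadEventFd(threadId), "S", 1);
}
```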
Overall result: the code path in signal contexts is clean regarding the use of async-signal-safe system calls.
Regarding the "signal" command in Tclx, I've replaced the sigprocmask() calls with a few global variables which track the blocked/unblocked/pending state for all signals. Still, the "signal" command needs a major redesign to properly support threading.
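One conceivable shape for that bookkeeping (a sketch under my own assumptions, not the actual check-in; NSIG is non-standard but widely available):

```c
#include <signal.h>

/* Per-signal state, safe to touch from signal context. */
static volatile sig_atomic_t sigIsBlocked[NSIG];
static volatile sig_atomic_t sigIsPending[NSIG];

static void
SignalHandler(int sigNumber)
{
    if (sigIsBlocked[sigNumber]) {
        sigIsPending[sigNumber] = 1;  /* replay later, on "signal unblock" */
    } else {
        /* ... mark the matching async handler ... */
    }
}
```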
Here is the patch that I sent to Peter Da Silva from FlightAware last week, I couldn't post it here directly because I had ISP issues and no access to the web.
This solution is much simpler and it involves no changes to the Tcl Core (as it should be, IMHO). On MT builds we use a dedicated thread for signal processing in order to avoid deadlocking and signal-unsafe calls in Tcl_AsyncMark:

- The signal-handling thread waits on signals with sigwait() and calls Tcl_AsyncMark() safely each time it wakes up.
- Signals caught by other threads are forwarded to the signal-handling thread with pthread_kill().
- sigprocmask() behavior is undefined in multithreaded processes according to POSIX, so these calls were replaced with pthread_sigmask().

I've validated this architecture with success in a separate proof-of-concept implementation that tested a variety of configurations (threads created/destroyed and signals blocked/unblocked dynamically at different times), and it behaved reliably even under heavy load and hundreds of millions of signals sent. It showed that no further changes were needed for either TclX or the Tcl core, especially anything involving signal handlers and masks. More specifically, there is no need to block signals explicitly in newly spawned threads, as the global signal handlers are thread-agnostic, and thus no need for e.g. thread hook procs for signal setup.
I haven't noticed any difference in performance or behavior (deadlocks notwithstanding, obviously) between the unpatched and patched versions.
On the Tcl side this solution is independent from the actual technique used by the Tcl notifier (select, epoll...).
The side effect is that TIP #509 no longer relates to this bounty issue, so I'm going to change the title accordingly.
PS: the patch also adds some bits of thread-safety in TclX global state management but it's unrelated to the root issue.
A couple of comments:
```
% package require Tclx
8.4
% vwait forever
SIGINT signal received
SIGINT signal received
SIGINT signal received
```

And the second test case:

```tcl
package require Tclx

signal trap ALRM {puts stderr "got an alarm signal in process [pid]"}
if {[fork] == 0} {
    puts "[pid] is the child"
    alarm 5
    vwait forever
}
puts "[pid] is the parent"
alarm 5
vwait forever
```

Output:

```
13675 is the parent
13677 is the child
got an alarm signal in process 13675
[... nothing more from the child ...]
```
Sorry, I can't reproduce the first test case. First shell:

```
vagrant@vagrant:~$ tclsh8.7
% package require Tclx
8.4
% pid
3308
% vwait forever
vagrant@vagrant:~$
```

Second shell:

```
vagrant@vagrant:~$ kill -SIGINT 3308
vagrant@vagrant:~$
```
However, if this causes problems then I have a fix ready. It involves restarting the signal-handling thread each time the mask is modified, and it works very well. Actually, the first version of my PoC used this method before I realized the current implementation was sufficient (well, maybe not, after all).
About the second test case, I have to admit I didn't see it coming :) However the fix is super-easy: it uses pthread_atfork() to reinstall the signal-handling thread in the newly forked child:
```
vagrant@vagrant:~$ tclsh8.7
% package require Tclx
8.4
% signal trap ALRM {puts stderr "got an alarm signal in process [pid]"}
% if {[fork] == 0} {
    puts "[pid] is the child"
    alarm 5
    vwait forever
}
3317 is the child
% puts "[pid] is the parent"
3315 is the parent
% alarm 5
0.0
% vwait forever
got an alarm signal in process 3317
got an alarm signal in process 3315
```
Here is the new patch: signalthread.v2.patch.zip
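In outline, the pthread_atfork() part could look like this (StartSignalThread() is a hypothetical stand-in for whatever spawns the sigwait thread, not the patch's actual code):

```c
#include <pthread.h>

extern void StartSignalThread(void);   /* hypothetical: spawns the sigwait thread */

static void
AtForkChild(void)
{
    /* Threads do not survive fork(): re-create the signal-handling
     * thread in the child so traps keep working. */
    StartSignalThread();
}

void
InstallForkHandler(void)
{
    pthread_atfork(NULL, NULL, AtForkChild);
}
```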
For the first case, you need to test with a signal that is delivered to the signal-handling thread initially. If it goes to another thread, then you get the correct behaviour; it's only if it happens to go to the signal-handling thread that you get the incorrect behaviour. On Linux, at least, you can do a kill by tid (e.g. probably 3309 in your example above) to force that.
(The real case that this affects is things like SIGSEGV...)
Thanks for the explanation, I could reproduce the bug using pid+1 as the thread ID.
The fix was easier than expected at first; it uses the same signal-forwarding technique but in reverse. There are 4 scenarios to consider:

1. Signal with a TclX handler (SignalTrap):
   a. Signal received on a regular thread: SignalTrap() forwards the signal to the signal-handling thread (see 1.b).
   b. Signal received on the signal-handling thread: SignalThread() calls SignalReceived() directly.
2. Signal without a TclX handler (any other handler or no handler at all):
   a. Signal received on a regular thread: regular processing (caught by handler or default behavior).
   b. Signal received on the signal-handling thread: SignalThread() forwards the signal to the main thread (see 2.a) and resumes sigwait().

Cases 1. and 2.a. were already in place; here is a new patch that implements 2.b.:
This technique is easier and more robust than the one mentioned above, as it requires no synchronization or bookkeeping. I don't see any circumstance in which this wouldn't work; however, if you have a failure scenario in mind please let me know.
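In sketch form, the four scenarios reduce to two small routines (simplified; signalThreadId, mainThreadId and HasTclxHandler() are illustrative stand-ins, while SignalTrap()/SignalReceived() are the TclX names mentioned above):

```c
#include <pthread.h>
#include <signal.h>

static pthread_t signalThreadId, mainThreadId;  /* illustrative: set at startup */
extern int  HasTclxHandler(int sigNumber);      /* hypothetical lookup */
extern void SignalReceived(int sigNumber);

/* Case 1.a: a trapped signal lands on a regular thread. */
static void
SignalTrap(int sigNumber)
{
    pthread_kill(signalThreadId, sigNumber);    /* forward: see case 1.b */
}

/* Cases 1.b and 2.b: the signal-handling thread wakes from sigwait(). */
static void
DispatchFromSignalThread(int sigNumber)
{
    if (HasTclxHandler(sigNumber)) {
        SignalReceived(sigNumber);              /* case 1.b */
    } else {
        pthread_kill(mainThreadId, sigNumber);  /* case 2.b: back to 2.a */
    }
}
```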
Impressive progress, congratulations.
OTOH, I'd still like to read some comment by FA on my last sentence in my last posting:
"To extrapolate a flawed interface or to introduce binary incompatibility, that is the fine question."
My only concern with keeping everything blocked in the signal-handling thread and forwarding unhandled signals to the main thread, is what happens if the signal-handling thread gets a SEGV or other similar usually-fatal signal (SIGBUS SIGILL etc). I don't actually know how this behaves, but we want a core dump here, not spinning infinitely.
Shouldn't the signal handling thread limit itself to handling signals that TclX has actually requested handling for?
@mutability this shouldn't happen with synchronous signals such as SEGV, because according to POSIX sigwait() only deals with asynchronous signals. I've made some tests just to be sure (dereferencing an invalid pointer in the signal-handling thread) and the implementation behaves as expected (it dumps a core). See also the following Linux man page:
http://man7.org/linux/man-pages/man2/sigtimedwait.2.html
"sigwaitinfo() or sigtimedwait(), can't be used to receive signals that are synchronously generated, such as the SIGSEGV signal that results from accessing an invalid memory address or the SIGFPE signal that results from an arithmetic error. Such signals can be caught only via signal handler."
@resuna that's a good point: the current implementation catches all signals, including those it has no business with. So it's suboptimal, even though it should behave as expected because it forwards unhandled signals back to the main thread.
To address both your concerns I've implemented the previously mentioned technique:

- the set of blocked signals is now managed with pthread_sigmask() (in BlockSignals());
- sigwait() still waits on all signals but should only catch those matching its mask.

Here is the new patch:
With the help of dkf, TIP#511 aka https://core.tcl.tk/tips/doc/trunk/tip/511.md is now available and hopefully clarifies my arguments somewhat. For me, it seems impossible to safely catch signals synchronously; however, channeling delivery by redirecting signals with pthread_kill() in the respective handlers seems a feasible approach.
Still there's something left: what is the exact expectation of TclX's signal command regarding multiple threads? Shall it work on the calling thread? (If yes, the corresponding Tcl_AsyncHandlers need to be TSD-aware and pthread_atfork() has to do something reasonable in the child process.) What about MacOSX? The current implementation there requires CoreFoundation to be a building block for the entire event system. But this conflicts with fork() semantics.
And what is the exact expectation of TclX's "signal block/unblock" semantics when applied to multiple threads and/or forking?
Another remark: now that TIP#511 is out, IMHO the discussion should continue on the core.tcl.tk mailing list.
@fredericbonnet - What version of tclx is signalthread.v4.patch against?
@resuna it's version 8.4.1.
I see there's a TclX repo here on GitHub. Is it up-to-date? If so I could submit my patch as a pull request instead of a file if you prefer.
Porting my changes to version 8.4.2 should be trivial, it seems that's only a matter of s/CONST84/const/g
You can assume that https://github.com/flightaware/tclx is current.
That’s the primary repo. Yeh, if you could do a fork and a PR that would be great.
Done!
Can you look at https://travis-ci.org/flightaware/tclx/jobs/503274603 ?
I don't see how your changes could have caused the signal-1.42 error since they don't touch the parsing for "signal trap" but the other is odd.
I've managed to reproduce the signal-1.9 failure consistently using the test suite, but not interactively or from the command line. It also passes when I execute the test in isolation. So I believe there is a side effect in the previous tests that breaks its behavior. It looks like signal-1.8.3 might be the culprit, as executing signal-1.8.3 before signal-1.9 breaks the latter.
I'm investigating the issue.
I think I have found the source of the problem. Ironically, it is linked to the signal thread restart technique introduced in patch 4 (see https://github.com/flightaware/Tcl-bounties/issues/32#issuecomment-397824269). Reverting this change makes the signal test suite pass.
So restarting the thread changes the timing of the delivery so they fire after the test has completed?
No, I've run some tests adding extra delays in the main thread to make sure that the signal thread had enough time to start, and it didn't change a thing.
I think that the patch 3 approach was right (all tests are green BTW) but patch 4 introduced unexpected side effects while trying to be too conservative. Signal mask inheritance doesn't seem to work the way I thought.
Hi, I've just realized that I forgot to push my local changes last time, so here they are. It triggered a new build/test on Travis; some tests fail but they have nothing to do with the signal handling (recursive_glob-2.7 / recursive_glob-2.10).
This issue is not a TclX-specific problem but a bug/shortcoming in the Tcl core, TclX just happens to be the simplest way to expose it at script level. The ticket below provides such a script, however it is entirely possible to reproduce this phenomenon using the C API (more specifically, a combination of Tcl async and low-level signal handlers on which TclX relies).
https://core.tcl.tk/tcl/tktview/f4f44174
I was able to reproduce the problem eventually after several hours, and the deadlock is indeed caused by the async thread self-deadlocking while attempting to lock its mutex twice.
The bug is difficult to reproduce because it's very timing-sensitive, however one can force the hand of destiny by artificially slowing down the async thread, that's what I did and it makes the process deadlock immediately.
- Tcl_AsyncInvoke has a mutex-protected loop over its registered handlers, so I've just added a sleep(0) within the loop, and the test script hangs immediately for the exact same reason (double locking).
- Tcl uses pthreads on Unix and Win32 critical sections on Windows. Win32 CSs are reentrant but pthread mutexes are not by default (you have to use the PTHREAD_MUTEX_RECURSIVE attribute, and this feature is not available on all systems). So when the async thread is interrupted by a signal while in the middle of a mutex-protected operation, it deadlocks itself on Unix but not on Windows.
- The Tcl docs say that mutexes are not reentrant.

So this is the expected behavior; however, nothing prevents the core from using reentrant mutexes on all platforms that Tcl supports. Implementing reentrancy using non-reentrant mutexes is a trivial task, so OS support is a non-issue. I'm going to write a TIP to make Tcl_Mutex reentrant on all platforms starting at version 8.7.

I've already implemented a quick hack to make mutexes reentrant on Unix: one version uses the PTHREAD_MUTEX_RECURSIVE attribute, the other implements reentrancy on top of regular mutexes. Both versions fix the deadlock problem with or without my extra sleep(0), and with no impact on Tcl tests.
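For illustration, the PTHREAD_MUTEX_RECURSIVE variant amounts to a one-attribute change at mutex creation (a sketch, not the actual patch); the manual variant just adds an owner field and a lock-depth counter on top of a plain mutex:

```c
#include <pthread.h>

int
InitRecursiveMutex(pthread_mutex_t *mutexPtr)
{
    pthread_mutexattr_t attr;
    int result;

    pthread_mutexattr_init(&attr);
    pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
    result = pthread_mutex_init(mutexPtr, &attr);
    pthread_mutexattr_destroy(&attr);
    return result;
}
```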