bitwiseworks / libc

LIBC Next (kLIBC fork)
9 stars 4 forks source link

Synchronize SIGCHLD and waitpid #10

Closed dmik closed 6 years ago

dmik commented 6 years ago

kLIBC needs to guarantee that by the time SIGCHLD is delivered, waitpid(WNOHANG) for that child will return its status instead of 0 (which means the child is still running and no status info is available). Currently, it's not the case — depending on how the OS/2 task scheduler switches threads, it may not work as such. I believe this is a violation of POSIX.

From https://github.com/bitwiseworks/qtbase-os2/issues/42#issuecomment-431535368:

Okay, I guess I know what happens in kLIBC and results in such a behavior. kLIBC implements waitpid with a worker thread that calls DosWaitChild in a loop waiting for some child to terminate. When it happens, it first delivers SIGCHLD (usually to thread 1) and only then adds an internal WAITINFO structure to the list of dead children be reaped sometime later by a waitpid call.

Delivering signals in kLIBC is done via DosKillThread and then handling XCPT_ASYNC_PROCESS_TERMINATE on that thread which checks if there is a signal handler registered for the given signal, and calls it if so. Sometimes the OS/2 task scheduler switches from the worker thread doing DosWaitChild to the thread being killed with DosKillThread before the worker thread inserts WAITINFO. In this case, if the thread that receives SIGCHLD also happens to call waitpid(WNOHANG) before the scheduler switches back to the worker thread, this waitpid will simply return 0 as it will not find a WAITINFO structure for the given child process.

This explains why removing WNOHANG or using DosSleep(1) solves the problem. The former will cause the thread to enter the infinite wait state until the WAITINFO structure gets finally inserted by the worker thread. The latter will simply force a context switch which in many cases will result in the worker thread gaining control back and inserting WAITINFO into the list.

dmik commented 6 years ago

Note that using DosSleep(1) in the waitpid(WNOHANG) would only be a workaround because there is no guarantee that the scheduler will switch right away to the worker thread to let it add WAITINFO. Although DosSleep(1) will give up the current time slice to threads with even a lower priority, there is no guarantee that all other threads will get their chance to execute and that the worker thread, even if it gets executed, will be able to insert WAITINFO before its timeslice ends. Some other approach needs to be found.

dmik commented 6 years ago

BTW, although I can't find a direct indication that by the time SIGCHLD is delivered, waitpid must succeed for that pid, there is an indirect proof: many SIGCHLD handler examples include a call to wait or waitpid. E.g. here https://stackoverflow.com/questions/2377811/tracking-the-death-of-a-child-process or here https://www.linuxquestions.org/questions/programming-9/how-a-father-process-know-which-child-process-send-the-signal-sigchld-266290/.

Anyway, currently this scenario is not guaranteed to work in kLIBC. If you compile a test case with the SIGCHLD handler doing waitpid, you will get the latter returning 0 quite often (at least, it happens here with Qt and the testcase from https://github.com/bitwiseworks/qtbase-os2/issues/42. (Note that normally we don't do waitpid in the SIGCHLD handler in Qt but I added it there for testing purposes).

dmik commented 6 years ago

With the above commit, both the simple Qt test case and the real tst_QDir test case work like a charm. And I haven't got any regressions so far.