Zombie mplayer processes left after crash

quigybo commented 10 years ago

After experiencing a few crashes, I ran pgrep mplayer and noticed several mplayer processes left running. These should be caught and killed before orochi exits, even after a crash.

dbrgn commented 10 years ago

Hm, this is strange. I'm not using subprocess directly. Instead I'm using asyncproc.py.

Asyncproc seems to deal with killing itself:

def __del__(self, __killer=os.kill, __sigkill=signal.SIGKILL):
    if self.__exitstatus is None:
        __killer(self.pid(), __sigkill)

I wonder why it doesn't always work... I've also experienced some issues with zombie processes.

Maybe we could try to wrap client.cmdloop() in a try-except block and then check for subprocesses to kill them...

quigybo commented 10 years ago

Sometimes it pays to actualy read the output, this exception is raised inside MPlayer.del:

Exception TypeError: TypeError('signal handler must be signal.SIG_IGN, signal.SIG_DFL, or a callable object',) in <bound method MPlayer.__del__ of <orochi.player.MPlayer object at 0xb7060a4c>> ignored

If there was no alarm set, signal.signal returns None. cbcf173 first checks if there is something that needs to be restored before attempting to do so.

This is probably something that should be reported upstream. Once this is merged into master I can send an email to the author (Thomas Bellman).

dbrgn commented 10 years ago

Ah, nice catch! Do you think you could write a small regression test (add test_asyncproc.py in https://github.com/dbrgn/orochi/tree/master/tests) before the merge? That'll probably help Thomas too.

quigybo commented 10 years ago

Ok I think I have finally figured this issue out.

According to the documentation, None will only be returned if the previous signal handler was not installed from Python, which means something is setting the alarm on the C side. The following code will induce the error (on the master branch):

import sys
from orochi.asyncproc import with_timeout

class DelFunc(object):
    def __init__(self, func, *args):
        self.func = func
        self.args = args
    def __del__(self):
        self.func(*self.args)

df = DelFunc(with_timeout, 1, lambda:1)
sys.exit()

It seems that Python sets SIGALRM when the destructors are being called during system exit, which makes it quite difficult to test for this issue as no exception is actually raised, only a warning is printed to sys.stderr. Possible ways around this could be using the xdist plugin to run the tests in a separate Python process, or setting the signal handler directly in C/ctypes. What do you think, is this something worth pursuing further?

Just to let you know, I amended the commit to give more detail on this issue and changed the test to explicitly check for None because the default value, SIG_DFL, is actually the int 0.

dbrgn commented 10 years ago

My first thought was that we should just ignore the test, but I think this issue is hard to catch and should therefore be covered by a regression test, so it won't happen again in the future :)

Possible ways around this could be using the xdist plugin to run the tests in a separate Python process, or setting the signal handler directly in C/ctypes. What do you think, is this something worth pursuing further?

If you think it'll work, using ctypes is probably better than to add a new dependency. What do you think? I've never used ctypes, so I can't judge it.

quigybo commented 10 years ago

I agree that we should try and figure out a way of inducing the error and testing for it, however for some reason the below code is not behaving as I would expect:

import ctypes
import signal

libc = ctypes.cdll.LoadLibrary('libc.so.6')
SIGFUNC = ctypes.CFUNCTYPE(None, ctypes.c_int)

def f(sig):
    print "SIGALRM caught"

libc.signal(signal.SIGALRM, SIGFUNC(f))

print(signal.getsignal(signal.SIGALRM))

I would expect that getsignal returns None as we have set the signal in C, but instead it returns SIG_DFL as if no signal was set. You can use libc.alarm(1); libc.sleep(2) to check that the signal is set properly.

And just in case something unexpected is going on in ctypes, here is a C extension that effectively does the same thing.

#include <Python.h>
#include <signal.h>

static void
alarm_handler(int sig) {
    printf("SIGALRM caught\n");
}

static PyObject *
sigtest_signal(PyObject *self) {
    signal(SIGALRM, alarm_handler);
    Py_RETURN_NONE;
}

static PyObject *
sigtest_alarm(PyObject *self, PyObject *args) {
    int secs;
    if (!PyArg_ParseTuple(args, "i", &secs))
        return NULL;
    alarm(secs);
    Py_RETURN_NONE;
}

static PyMethodDef Sigtest_Methods[] = {
    {"signal",  sigtest_signal, METH_NOARGS, "Sets signal SIGALRM."},
    {"alarm",  sigtest_alarm, METH_VARARGS, "Sets alarm."},
    {NULL, NULL, 0, NULL}
};

PyMODINIT_FUNC
initsigtest(void) {
    (void) Py_InitModule("sigtest", Sigtest_Methods);
}

After building and importing it call sigtest.signal() to setup the handler, and signal.getsignal(signal.SIGALRM) to see which handler is installed. As expected this module behaves the same as the ctypes example where signal.getsignal(signal.SIGALRM) doesn't return None.

According to this bug report, it has something to do with signal handlers being run in a subprocess. I will check it out when I have time.

dbrgn commented 10 years ago

OK, if we ran into a bug here we can't do much about it.

When I find time, I'll check out Twisted. Maybe an event based model would work for Orochi (see #21), which would also solve this issue and make the code overall easier to write/understand.

In the meantime, should I just merge this? Or do you want to do more research about the signal handling?

quigybo commented 10 years ago

It isn't actually a bug in python, it is expected and documented behaviour; the bug report was for the test suite.

I will briefly look into signal handlers in a subprocess first and get back to you, so don't merge just yet.

With regards to an event-based model, I'm not sure that it will solve this issue as it will still arise whenever signals are used while Python is shutting down, like in a destructor. Anyway the difficulty isn't in fixing this issue, it lies in inducing it in a test suite.

dbrgn commented 10 years ago

With regards to an event-based model, I'm not sure that it will solve this issue as it will still arise whenever signals are used while Python is shutting down, like in a destructor.

Yes, it's not "fixing" this issue, but I think it could make code easier, without a lot of signals and threads and asynchronous calls... But that's more a paradigm question.

quigybo commented 10 years ago

Ok maybe you should just merge this for now, I had another go at it without much success. If I come up with something in the future I will let you know.

dbrgn commented 10 years ago

Alright! Thanks for the work!

dbrgn commented 10 years ago

@quigybo, your branch is out of sync with current master. Could you please rebase your branch and create a pull request against current master? (If I'd cherry-pick your commit I'd lose the author information.)

dbrgn commented 10 years ago

@quigybo Hm, the test works on my system, but fails on Travis: https://travis-ci.org/dbrgn/orochi/builds/12075290 Could that be an issue with the coverage script? Or is it the exception we're testing for?

quigybo commented 10 years ago

This is the output when applying the test to asyncproc.py pre 9aa69c3 on my machine:

============================= test session starts ==============================
platform linux2 -- Python 2.7.5 -- pytest-2.3.4
plugins: pep8, cache, cov
collected 16 items 

orochi/__init__.py s
orochi/api.py s
orochi/client.py s
orochi/colors.py s
orochi/meta.py s
orochi/player.py s
tests/__init__.py s
tests/test_asyncproc.py .F
tests/test_client.py s...
tests/test_colors.py s.
tests/test_testing.py s

=================================== FAILURES ===================================
________________________________ test_asyncproc ________________________________
tests/test_asyncproc.py:30: in test_asyncproc
>       assert "Exception" not in output
E       assert u'Exception' not in 'Error processing lin...b72d3b8c>> ignored\n'
E         u'Exception' is contained here:
E           e ignored
E           Exception TypeError: 'signal handler must be signal.SIG_IGN, signal.SIG_DFL, or a callable object' in <bound method DelFunc.__del__ of <__main__.DelFunc object at 0xb72d3b8c>> ignored
E         ? +++++++++
--------------- coverage: platform linux2, python 2.7.5-final-0 ----------------
Name              Stmts   Miss  Cover
-------------------------------------
orochi/__init__       0      0   100%
orochi/api           63     46    27%
orochi/client       281    204    27%
orochi/colors         6      0   100%
orochi/meta           6      6     0%
orochi/player       107     82    23%
-------------------------------------
TOTAL               463    338    27%
================ 1 failed, 5 passed, 10 skipped in 1.34 seconds ================

And for reference, here is the relevant output from Travis:

tests/test_asyncproc.py:30: in test_asyncproc
>       assert "Exception" not in output
E       assert u'Exception' not in 'Coverage.py warning:...x24d2dd0>> ignored\n'
E         u'Exception' is contained here:
E           ollected.
E           Exception SystemError: 'error return without exception set' in <bound method DelFunc.__del__ of <__main__.DelFunc object at 0x24d2dd0>> ignored
E         ? +++++++++

To be honest I'm really not sure what that error is, the test works fine here both with and without virtualenv.

The output suggests something is going on with the coverage tests, but I'm not sure why the python subprocess is affected by that. In particular I pass the -E flag to the subprocess, which, according to the manpage, should "ignore environment variables like PYTHONPATH and PYTHONHOME that modify the behaviour of the interpreter". Unless it really is just PYTHONPATH and PYTHONHOME.

If I explicitly set env={} in check_output, or remove the coverage tests from pytest.ini, the warning goes away but not the exception:

tests/test_asyncproc.py:30: in test_asyncproc
>       assert "Exception" not in output
E       assert u'Exception' not in 'Exception SystemErro...a246e810>> ignored\n'
E         u'Exception' is contained here:
E           Exception SystemError: 'error return without exception set' in <bound method DelFunc.__del__ of <__main__.DelFunc object at 0x7f1ea246e810>> ignored
E         ? +++++++++

The remaining error seems to suggest that some part of the test suite is still in effect (in particular the SystemError: 'error return without exception set' part), but I don't know why. Any ideas?

dbrgn commented 10 years ago

By checking explicitly for TypeError, the failing test is fixed.

quigybo commented 10 years ago

Haha it's funny how sometimes you don't see the simplest and most obvious solution.

Not sure if it is worth the effort, but I renamed the test to more accurately reflect what we are testing (with_timeout) in d3595d2.

dbrgn commented 10 years ago

+1, I added it here: 865a788c07d409b210541d7af8eab1e7e6cc712c

dbrgn / orochi

Zombie mplayer processes left after crash #27