anfedorov / psutil

Automatically exported from code.google.com/p/psutil

Can't properly handle zombie processes on UNIX #428

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
What steps will reproduce the problem?
1. start Photoshop CS6 on a Mountain Lion OSX
2. import psutil; [x.as_dict() for x in psutil.process_iter()]  # (in a .py file or IPython)

What is the expected output?
A long list of processes and related information

What do you see instead?
$ python test.py
Traceback (most recent call last):
  File "test.py", line 3, in <module>
    [x.as_dict() for x in psutil.process_iter() if x.is_running()]
  File "/Users/rico/.virtualenvs/temp/lib/python2.7/site-packages/psutil/__init__.py", line 225, in as_dict
    ret = attr()
  File "/Users/rico/.virtualenvs/temp/lib/python2.7/site-packages/psutil/__init__.py", line 414, in get_nice
    return self._platform_impl.get_process_nice()
  File "/Users/rico/.virtualenvs/temp/lib/python2.7/site-packages/psutil/_psosx.py", line 151, in wrapper
    raise NoSuchProcess(self.pid, self._process_name)
psutil._error.NoSuchProcess: process no longer exists (pid=46244)

Or within an IPython notebook:
[x.as_dict() for x in psutil.process_iter() if x.is_running()]
---------------------------------------------------------------------------
NoSuchProcess                             Traceback (most recent call last)
<ipython-input-108-a71c6dffe397> in <module>()
----> 1 [x.as_dict() for x in psutil.process_iter() if x.is_running()]

/Users/rico/.virtualenvs/temp/lib/python2.7/site-packages/psutil/__init__.pyc in as_dict(self, attrs, ad_value)
    223                         ret = attr(interval=0)
    224                     else:
--> 225                         ret = attr()
    226                 else:
    227                     ret = attr

/Users/rico/.virtualenvs/temp/lib/python2.7/site-packages/psutil/__init__.pyc in get_nice(self)
    412     def get_nice(self):
    413         """Get process niceness (priority)."""
--> 414         return self._platform_impl.get_process_nice()
    415 
    416     @_assert_pid_not_reused

/Users/rico/.virtualenvs/temp/lib/python2.7/site-packages/psutil/_psosx.pyc in wrapper(self, *args, **kwargs)
    149             err = sys.exc_info()[1]
    150             if err.errno == errno.ESRCH:
--> 151                 raise NoSuchProcess(self.pid, self._process_name)
    152             if err.errno in (errno.EPERM, errno.EACCES):
    153                 raise AccessDenied(self.pid, self._process_name)

NoSuchProcess: process no longer exists (pid=46243, name='adobe_licutil')

What version of psutil are you using? What Python version?
python 2.7.4
psutil 1.0.1

On what operating system? Is it 32bit or 64bit version?
64bit

Please provide any additional information below.
When I close Photoshop, the error goes away; when I start it again, the error 
reappears.
Adding an is_running() check to the list comprehension makes no difference, and 
the reported PID stays the same across repeated runs.

Original issue reported on code.google.com by rico.moo...@gmail.com on 16 Sep 2013 at 7:48

GoogleCodeExporter commented 9 years ago
Are you sure this doesn't happen simply because the Photoshop process terminates 
(maybe quickly)?

Does this happen with get_process_nice() only?

Are you able to isolate a test case similar to this and post the result?

try:
    p.get_nice()
except psutil.NoSuchProcess:
    print(p.is_running())

My best guess is that the process is *actually* terminated (maybe it's a 
Photoshop worker subprocess which terminates very quickly) and using 
is_running() within the list comprehension doesn't help because it's subject to 
a race condition.

Original comment by g.rodola on 16 Sep 2013 at 8:37

GoogleCodeExporter commented 9 years ago
I just tried the following script:

import psutil

for process in psutil.process_iter():
    print "\n\n----------------------------"
    print "process: {}".format(process)
    print process.as_dict()

Which gave me the following output (after emitting a lot of other processes of 
course):

----------------------------
process: psutil.Process(pid=46776, name='adobe_licutil')
Traceback (most recent call last):
  File "test.py", line 6, in <module>
    print process.as_dict()
  File "/Users/rico/.virtualenvs/temp/lib/python2.7/site-packages/psutil/__init__.py", line 225, in as_dict
    ret = attr()
  File "/Users/rico/.virtualenvs/temp/lib/python2.7/site-packages/psutil/__init__.py", line 414, in get_nice
    return self._platform_impl.get_process_nice()
  File "/Users/rico/.virtualenvs/temp/lib/python2.7/site-packages/psutil/_psosx.py", line 151, in wrapper
    raise NoSuchProcess(self.pid, self._process_name)
psutil._error.NoSuchProcess: process no longer exists (pid=46776, name='adobe_licutil')

But the strange thing is... when I run ps:
$ ps aux | grep adobe
rico           46804   0.0  0.0  2432768    596 s007  R+   10:47PM   0:00.00 grep adobe
rico           46776   0.0  0.0        0      0   ??  Z    10:42PM   0:00.00 (adobe_licutil)

The process seems to be there.

With your code integrated, I guess the script would read like this:
$ cat test.py
import psutil

for process in psutil.process_iter():
    print "\n\n----------------------------"
    print "process: {}".format(process)
    try:
        process.get_nice()
    except psutil.NoSuchProcess:
        print "got NoSuchProcess Exception"
        print process.is_running()

The output is:
----------------------------
process: psutil.Process(pid=46776, name='adobe_licutil')
got NoSuchProcess Exception
True

Original comment by rico.moo...@gmail.com on 16 Sep 2013 at 8:52

GoogleCodeExporter commented 9 years ago
Mmm... this is weird.
Can you please try this C program and paste the output?

#include <stdio.h>
#include <errno.h>
#include <unistd.h>        // getpid()
#include <sys/time.h>
#include <sys/resource.h>

int main()
{
    int ret;

    // getpriority() can legitimately return -1, so clear errno before each call
    errno = 0;
    ret = getpriority(PRIO_PROCESS, getpid());
    printf("ret %i\n", ret);
    printf("errno %i\n", errno);

    errno = 0;
    ret = getpriority(PRIO_PROCESS, 46776);  // adobe_licutil PID
    printf("ret %i\n", ret);
    printf("errno %i\n", errno);
    return 0;
}

In case you don't know how to do that: save it to a file named "a.c", then run 
"gcc a.c && ./a.out" in your shell.

Original comment by g.rodola on 16 Sep 2013 at 9:05

GoogleCodeExporter commented 9 years ago
Of course!

First I put the program in place and compiled it as requested:

$ cat > a.c <<EOF
> #include <stdio.h>
> #include <errno.h>
> #include <sys/time.h>
> #include <sys/resource.h>
>
> int main()
> {
>    int ret;
>    ret = getpriority(PRIO_PROCESS, getpid());
>    printf("ret %i\n", ret);
>    printf("errno %i\n", errno);
>
>    ret = getpriority(PRIO_PROCESS, 46776);  // adobe_licutil PID
>    printf("ret %i\n", ret);
>    printf("errno %i\n", errno);
> }
> EOF
$ gcc a.c
$ ./a.out
ret 0
errno 0
ret -1
errno 3

Verification that process is still there:
$ ps aux | grep adobe
rico           46893   0.0  0.0  2432768    596 s006  S+   11:08PM   0:00.00 grep adobe
rico           46776   0.0  0.0        0      0   ??  Z    10:42PM   0:00.00 (adobe_licutil)
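
For reference, "errno 3" in the output above is ESRCH ("No such process"), the same errno that the wrapper in _psosx.py turns into NoSuchProcess in the tracebacks. A minimal Python sketch of the same query, using os.getpriority() from the standard library (Python 3.3+, UNIX only); the helper name nice_or_none is made up for illustration and is not part of psutil:

```python
import os


def nice_or_none(pid):
    """Return the nice value of pid, or None if the kernel reports ESRCH."""
    try:
        return os.getpriority(os.PRIO_PROCESS, pid)
    except ProcessLookupError:
        # ProcessLookupError is the OSError subclass Python raises for ESRCH
        return None


if __name__ == "__main__":
    print(nice_or_none(os.getpid()))
```

Whether a zombie PID yields a value or None here depends on the platform, which is exactly the difference discussed in this thread.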

Original comment by rico.moo...@gmail.com on 16 Sep 2013 at 9:10

GoogleCodeExporter commented 9 years ago
I have CS6 and Mountain Lion, so let me see if I can reproduce and debug this further as well.

Original comment by jlo...@gmail.com on 16 Sep 2013 at 9:15

GoogleCodeExporter commented 9 years ago
Interesting, with or without CS6 open, I can reproduce an error but in my case 
it's for Google Chrome first: 

----------------------------
process: psutil.Process(pid=24798, name='Google Chrome He')
Traceback (most recent call last):
  File "test.py", line 6, in <module>
    print process.as_dict()
  File "build/bdist.macosx-10.8-intel/egg/psutil/__init__.py", line 225, in as_dict
  File "build/bdist.macosx-10.8-intel/egg/psutil/__init__.py", line 414, in get_nice
  File "build/bdist.macosx-10.8-intel/egg/psutil/_psosx.py", line 151, in wrapper
psutil._error.NoSuchProcess: process no longer exists (pid=24798, name='Google Chrome He')

[user@host cs6]$ ps aux | grep 24798 | grep -v grep
jloden         24798   0.0  0.0        0      0   ??  Z     5Sep13   0:00.00 (Google Chrome He)

Note that in both my case and rico's example above, the "missing" process is 
marked "Z", i.e. it is a zombie process, so that's why we get the NoSuchProcess 
exception. I'm guessing adobe_licutil checks the license before the main UI 
launches, so it probably execs or otherwise forks into a new process and is 
left behind as a zombie.

I'm not sure how to handle this "correctly" in psutil. We could possibly remove 
it from process_iter if it's a zombie process, or maybe there's another more 
elegant way to treat these. Thoughts? 

Original comment by jlo...@gmail.com on 16 Sep 2013 at 9:24

GoogleCodeExporter commented 9 years ago
Ah I see, a zombie process (didn't notice that).
Hmmm... what other info can you extract from that process?
It would be interesting to know if there are other methods raising NSP in the 
same manner, then we can decide what to do, although I think letting NSP bubble 
up might be legitimate.
...And it seems clear we don't have a test case for this (I'm currently looking 
into how to create a zombie process on purpose =)).
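
One way to create a zombie on purpose, sketched here with only the Python standard library on a UNIX system: fork a child that exits immediately and do not wait() for it. The helper names make_zombie and process_state are hypothetical, reading /proc is Linux-specific, and the fallback assumes a ps that accepts "-o stat=".

```python
import os
import subprocess
import time


def process_state(pid):
    """Return the one-letter state of pid as the OS reports it ('Z' = zombie)."""
    try:
        # Linux: /proc/<pid>/stat reads "pid (comm) state ..."; comm may itself
        # contain spaces or parentheses, so split on the last ")".
        with open(f"/proc/{pid}/stat") as f:
            return f.read().rsplit(")", 1)[1].split()[0]
    except FileNotFoundError:
        # No /proc entry (e.g. OSX, BSD): assume a ps that supports -o stat=
        out = subprocess.run(
            ["ps", "-o", "stat=", "-p", str(pid)],
            capture_output=True, text=True,
        ).stdout
        return out.strip()[:1]


def make_zombie(timeout=2.0):
    """Fork a child that exits at once and is deliberately not reaped."""
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: exit immediately, skipping Python cleanup
    deadline = time.monotonic() + timeout
    while process_state(pid) != "Z":  # wait until the OS reports it as a zombie
        if time.monotonic() > deadline:
            raise TimeoutError("child did not become a zombie in time")
        time.sleep(0.01)
    return pid


if __name__ == "__main__":
    zombie_pid = make_zombie()
    print(zombie_pid, process_state(zombie_pid))
    os.waitpid(zombie_pid, 0)  # reap it so the zombie does not linger
```

A test could call make_zombie(), run the Process methods under test against the returned PID, and then reap the child with os.waitpid().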

Original comment by g.rodola on 16 Sep 2013 at 9:46

GoogleCodeExporter commented 9 years ago
Replying to my own question: get_open_files() and get_num_fds() on OSX behave 
the same way.
I tried the same test on Linux and the issue simply does not exist there (a 
zombie process is queryable just fine).
I think there are 2 possibilities on the table:

#1 return bogus values
#2 let NSP exception propagate

Although I'm not particularly thrilled with the idea, I think #1 is the best 
way to go because "practicality beats purity" and we would also be consistent 
with other platforms (AFAIK OSX is the only one behaving like this, but I'll 
make sure of this and report back).

I think removing the zombie processes from process_iter() is a bad idea because 
looking for them only (e.g. in order to kill them) is a perfectly reasonable 
use case.

Original comment by g.rodola on 16 Sep 2013 at 10:55

GoogleCodeExporter commented 9 years ago
Is there something else I could provide to help you solve this issue?

Original comment by rico.moo...@gmail.com on 17 Sep 2013 at 3:18

GoogleCodeExporter commented 9 years ago
No thanks. I already figured it out.
Will provide a fix and a test later today or tomorrow, then I think we're ready 
to release a new version.

Original comment by g.rodola on 17 Sep 2013 at 3:20

GoogleCodeExporter commented 9 years ago
I agree, the NSP exception isn't particularly elegant. Bogus values aren't 
appealing but it probably makes more sense to have null/empty values rather 
than unexpected exceptions.

Ideally we'd have some mechanism to easily identify the process as a zombie 
also, if you're iterating processes. Maybe a separate property (process state), 
or a bogus value that only appears for zombie procs? 

Original comment by jlo...@gmail.com on 17 Sep 2013 at 3:29

GoogleCodeExporter commented 9 years ago
We already have Process.status property.

Original comment by g.rodola on 17 Sep 2013 at 3:31

GoogleCodeExporter commented 9 years ago
Oops, good point, I forgot about that one. I just checked, and it is properly 
reporting STATUS_ZOMBIE for the process, at least in my example so we're all 
set there. 

Another thought - maybe we could add a filter option to process_iter() so that 
you could iterate only processes matching certain status(es). That way someone 
could easily iterate zombie processes to kill them as in your example, or 
ignore certain statuses like zombie processes if they're only interested in 
other stats. 

Original comment by jlo...@gmail.com on 17 Sep 2013 at 3:38

GoogleCodeExporter commented 9 years ago
It seems an unnecessary API complication to me (why not add other filter 
arguments then?).  Plus you'll only save one line (if p.status == 
STATUS_ZOMBIE) which is even better if left explicit.

Original comment by g.rodola on 17 Sep 2013 at 3:47

GoogleCodeExporter commented 9 years ago
Fair enough. You could use a list comprehension to build a list as well so 
there are other options. 

I was just thinking it might be a nice feature to have a filter option on the 
iter function (as you noted, there are other filter options besides status that 
could make sense). A lot of the use cases I see on the mailing list and in 
sample code snippets iterate through processes searching for items, so I was 
thinking it might be useful in the general case.

Original comment by jlo...@gmail.com on 17 Sep 2013 at 4:06

GoogleCodeExporter commented 9 years ago
Yeah. Well, generally speaking I think it's better if we remain as simple and 
minimalist as possible as long as "something" is already easily implementable 
in the user's code, as is the case here.

Original comment by g.rodola on 17 Sep 2013 at 11:03

GoogleCodeExporter commented 9 years ago
Update: it seems on FreeBSD we cannot instantiate a new Process instance for a 
zombie process (NSP gets raised in __init__ because we try to get process 
creation time).
This should also be fixed because of the use case I was mentioning before 
(looking for all zombie processes in order to kill them).

Original comment by g.rodola on 18 Sep 2013 at 6:34

GoogleCodeExporter commented 9 years ago
This appears to be more complicated and profound than I initially thought.

It seems FreeBSD deletes all process information after it's gone zombie as 
*all* Process methods (ppid, name, nice, cmdline, etc.) raise NSP, so it 
appears that faking return values is not a great idea after all, at least on 
FreeBSD.
Even Process.status will raise NSP instead of returning STATUS_ZOMBIE, which is 
strange since "ps aux" manages to show the process status somehow.

So far the only platform where a zombie process is indistinguishable from 
regular ones is Linux (Windows does not have them).
That implies life will be easier for whoever wants to filter them:

zombies = [p for p in psutil.process_iter() if p.status == psutil.STATUS_ZOMBIE]

On BSD, where this is not possible, one would have to do something (nasty) like 
this:

def get_zombies():
    for pid in psutil.get_pid_list():
        try:
            p = psutil.Process(pid)      
            if p.status == psutil.STATUS_ZOMBIE:  # for platforms != BSD
                yield pid 
        except psutil.NoSuchProcess:
            if psutil.pid_exists(pid):  # <-- race condition
                yield pid

In the meantime I investigated further and it appears a zombie process cannot 
be killed ('cause it's already dead !-)).
The only way to get rid of it would be making its parent call wait() against it.
Will think this through further tomorrow. 
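
The kill-versus-wait() behaviour described above can be sketched with the standard library alone (UNIX only). The helper names pid_listed and kill_then_reap are hypothetical; signal 0 is used purely as an existence check, which is an assumption about how one might probe the process table without psutil.

```python
import os
import signal


def pid_listed(pid):
    """Return True if the kernel still has a process-table entry for pid."""
    try:
        os.kill(pid, 0)  # signal 0 performs the lookup without sending anything
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the entry exists but belongs to another user
    return True


def kill_then_reap():
    """Return (listed after SIGKILL, listed after waitpid) for a dead child."""
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:
        os._exit(0)  # child: terminate immediately
    os.close(write_fd)
    # Returns at EOF, once the exiting child has closed its copy of the write end
    os.read(read_fd, 1)
    os.close(read_fd)

    os.kill(pid, signal.SIGKILL)      # accepted, but the child is already dead
    listed_after_kill = pid_listed(pid)
    os.waitpid(pid, 0)                # the parent collects the exit status
    return listed_after_kill, pid_listed(pid)


if __name__ == "__main__":
    print(kill_then_reap())
```

On Linux the first value is expected to be True and the second False: the signal changes nothing, and only the parent's waitpid() releases the entry.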

Original comment by g.rodola on 18 Sep 2013 at 9:03

GoogleCodeExporter commented 9 years ago
Further update: I took a look at ps source code for FreeBSD and it seems it 
manages to get process status (and also ppid) by using kvm_getprocs() whereas 
we use sysctl():
https://code.google.com/p/psutil/source/browse/psutil/_psutil_bsd.c?spec=svn51a50962614e02f1426da55012f01ca8e1fd53ed&r=83165d10041d7306798dcc400df5d64a57fb58f0#63

Assuming sysctl() is faster, we might use that one first and then fall back on 
kvm_getprocs(), at least for retrieving process status, ppid and creation_time 
(the latter is what lets us identify a process uniquely over time).

Original comment by g.rodola on 18 Sep 2013 at 9:23

GoogleCodeExporter commented 9 years ago
Bumping up priority.

Original comment by g.rodola on 9 Mar 2014 at 10:35