DataBiosphere / toil

A scalable, efficient, cross-platform (Linux/macOS) and easy-to-use workflow engine in pure Python.
http://toil.ucsc-cgl.org/.
Apache License 2.0

OSError: [Errno 1] Operation not permitted #1397

Closed evan-wehi closed 4 years ago

evan-wehi commented 7 years ago

Hello,

I received the following stack trace:

w/W/jobG9NqPx    Traceback (most recent call last):
w/W/jobG9NqPx      File "/home/thomas.e/home/dev/toil/src/toil/worker.py", line 335, in main
w/W/jobG9NqPx        with fileStore.open(job):
w/W/jobG9NqPx      File "/usr/lib64/python2.7/contextlib.py", line 17, in __enter__
w/W/jobG9NqPx        return self.gen.next()
w/W/jobG9NqPx      File "/home/thomas.e/home/dev/toil/src/toil/fileStore.py", line 1581, in open
w/W/jobG9NqPx        self.findAndHandleDeadJobs(self.workFlowDir)
w/W/jobG9NqPx      File "/home/thomas.e/home/dev/toil/src/toil/fileStore.py", line 1699, in findAndHandleDeadJobs
w/W/jobG9NqPx        if not cls._pidExists(jobState['jobPID']):
w/W/jobG9NqPx      File "/home/thomas.e/home/dev/toil/src/toil/fileStore.py", line 397, in _pidExists
w/W/jobG9NqPx        os.kill(pid, 0)
w/W/jobG9NqPx    OSError: [Errno 1] Operation not permitted

Looking at the code, it appears to be testing whether the PID exists. Perhaps the PID has been reused (I'm not sure how quickly that can happen). It seems this exception should be caught as well.

This is from the current master, not an official release.

Thanks, Evan.

┆Issue is synchronized with this Jira Story ┆Issue Number: TOIL-122

cket commented 7 years ago

@evan-wehi what OS did this happen on? I think you're right, that exception should probably be caught too
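
For illustration, here is a minimal sketch of a PID check that also handles `EPERM`. It is not Toil's actual `_pidExists`; the name `pid_exists` is hypothetical, and the sketch assumes a POSIX system where `os.kill(pid, 0)` only performs error checking.

```python
import errno
import os


def pid_exists(pid):
    """Return True if a process with this PID exists on the local machine.

    Hypothetical helper for illustration; assumes POSIX semantics.
    """
    try:
        # Signal 0 delivers nothing; the kernel only checks existence and permissions.
        os.kill(pid, 0)
    except OSError as err:
        if err.errno == errno.ESRCH:
            # No process with this PID.
            return False
        if err.errno == errno.EPERM:
            # A process exists, but we are not allowed to signal it.
            return True
        # Anything else is unexpected, so let it propagate.
        raise
    return True
```

Note that `EPERM` only says that some process owns the PID. It does not say that the process is the job recorded in the state file.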

evan-wehi commented 7 years ago

CentOS 7.5 (I think)

uname -a: Linux torquelord.hpc.wehi.edu.au 3.10.0-327.28.3.el7.x86_64 #1 SMP Thu Aug 18 19:05:49 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux

evan-wehi commented 7 years ago

This is caused by having the TMP directory shared across nodes. Nodes are picking up the .jobState files of other nodes, and hence PIDs that are essentially random and irrelevant on the local machine.
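
One way to picture the problem: a PID is only meaningful on the host that issued it. The sketch below assumes each job state also records the host it was written on and skips states from other hosts. The `host` field and the function name are hypothetical and not part of Toil; only `jobPID` appears in the traceback above.

```python
import socket


def states_on_this_host(job_states, local_host=None):
    """Keep only the job states written on this host.

    `job_states` is a list of dicts; the "host" key is a hypothetical
    field assumed here for illustration.
    """
    if local_host is None:
        local_host = socket.gethostname()
    # PIDs recorded by other hosts cannot be checked with a local os.kill.
    return [state for state in job_states if state.get("host") == local_host]
```

With a filter like this, a worker would run its PID check only on states it could have written itself.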

cket commented 7 years ago

Ahh, that makes sense - the tempdir is assumed to belong to only 1 worker. Does setting the tempdir to a non-shared location fix this?

arkal commented 7 years ago

@evan-wehi for now I would suggest using a non-shared directory as the temp directory. A fix for this behavior is described in #1462

DailyDreaming commented 4 years ago

Stale. Please reopen if still an issue.