Closed evan-wehi closed 4 years ago
@evan-wehi what OS did this happen on? I think you're right, that exception should probably be caught too
CentOS version 7.5 (I think)
uname -a: Linux torquelord.hpc.wehi.edu.au 3.10.0-327.28.3.el7.x86_64 #1 SMP Thu Aug 18 19:05:49 UTC 2016 x86_64 x86_64 x86_64 GNU/Linux
This is caused by having TMP shared across nodes. Nodes are picking up the .jobState files of other nodes and hence checking essentially random and irrelevant PIDs.
Ahh, that makes sense - the tempdir is assumed to belong to only 1 worker. Does setting the tempdir to a non-shared location fix this?
@evan-wehi for now I would suggest using a non-shared directory as the temp directory. A fix for this behavior is described in #1462
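To illustrate why a shared temp directory goes wrong, here is a minimal sketch of one possible guard: record the hostname next to the PID and only trust a state file written on the same node. This is not the fix described in #1462 and it is not Toil's actual .jobState format; the function names and the JSON layout are hypothetical.

```python
import json
import os
import socket


def write_job_state(path, host=None, pid=None):
    """Write a hypothetical state file recording which node and PID own it."""
    state = {
        "host": socket.gethostname() if host is None else host,
        "pid": os.getpid() if pid is None else pid,
    }
    with open(path, "w") as f:
        json.dump(state, f)


def state_belongs_to_this_node(path):
    """Return True only if the state file was written on this host.

    A PID recorded by another node says nothing about processes here,
    so callers should skip any PID check when this returns False.
    """
    with open(path) as f:
        state = json.load(f)
    return state.get("host") == socket.gethostname()
```

With a check like this, a worker that finds a state file from another node in the shared directory would ignore it rather than probing an unrelated PID.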
Stale. Please reopen if still an issue.
Hello,
I received the following stack trace:
Looking at the code, it seems it is trying to test if the PID exists. Perhaps the PID has been reused (not sure how quickly that might happen). It seems that perhaps this exception should be caught as well.
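For reference, a minimal sketch of a PID-existence test that handles both common failure modes, assuming the check is done with os.kill(pid, 0). That is an assumption on my part, not a quote of the Toil code, and pid_exists is a hypothetical name.

```python
import errno
import os


def pid_exists(pid):
    """Return True if a process with this PID appears to exist on this host."""
    if pid <= 0:
        # 0 and negative values address process groups, not a single process.
        return False
    try:
        os.kill(pid, 0)  # signal 0 performs error checking only; nothing is sent
    except OSError as e:
        if e.errno == errno.ESRCH:
            return False  # no such process
        if e.errno == errno.EPERM:
            return True   # process exists but belongs to another user
        raise
    return True
```

Note that a True result only says that some process currently holds that PID on this host. It cannot tell whether the PID has been reused by an unrelated process.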
This is from the current master, not an official release.
Thanks, Evan.
┆Issue is synchronized with this Jira Story ┆Issue Number: TOIL-122