Closed: glennhickey closed this issue 1 year ago
@glennhickey Are you 100% sure it is dying the first time due to OOM, if you can't actually give it enough memory that it succeeds?
If you can reproduce this every time, can you send a workflow commit we could use to reproduce it?
Well, the command dies, but it goes through if I rerun it and give the job more memory. I will try to package up a way to reproduce and post it here.
OK, my current theory here is that, when we chain from one job to another, we don't delete the job we chained to because we need its body to remain in the job store. We only delete it once the job that chained to it finishes successfully.
If the job that was chained to fails, that never happens. So the chained-to job remains in the job store, as well as the job that chained to it and replaced it.
So I think both jobs end up trying to run from the same body file?
I think I need to take another look at the whole concept of `jobsToDelete`, and how we do the commits of job changes to the job store when chaining.
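Here is a minimal sketch of the bookkeeping described above, assuming a simplified job store; the names (`JobRecord`, `chain`, `on_success`, `job_store.delete`) are hypothetical stand-ins, not Toil's actual internals.

```python
class JobRecord:
    def __init__(self, job_id, body_file):
        self.job_id = job_id
        self.body_file = body_file   # job store file holding this job's body
        self.jobs_to_delete = []     # chained-to jobs whose deletion is deferred

def chain(current, successor):
    """Run the successor's body in the current worker instead of scheduling it."""
    current.body_file = successor.body_file    # both records now point at one body file
    current.jobs_to_delete.append(successor)   # defer deletion until current succeeds

def on_success(current, job_store):
    """Only after `current` finishes successfully is the chained-to job removed."""
    for old in current.jobs_to_delete:
        job_store.delete(old.job_id)           # hypothetical call; also removes the body file
    current.jobs_to_delete.clear()
```

If `current` fails before `on_success()` runs, both records stay in the job store and share the same body file, which is the situation described above.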
OK, maybe what is happening is this: when we chain from one job to the next, we cut the successor relationship to the chained-to job. That makes it no longer reachable from the root of the workflow, so on restart it is deleted and its body file is removed. But we still need that body file to run the job that chained to it.
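A hypothetical model of that restart cleanup (not Toil's real code) makes the failure mode concrete: a reachability sweep from the root deletes every job it cannot reach, taking each job's body file with it. Because chaining cut the successor edge to the chained-to job, that job is unreachable here even though its body file is still needed. `job_store.successors`, `all_job_ids`, and `delete` are assumed helpers.

```python
def clean_unreachable_jobs(job_store, root_id):
    reachable = set()
    stack = [root_id]
    while stack:
        job_id = stack.pop()
        if job_id in reachable:
            continue
        reachable.add(job_id)
        # The chained-to job was cut out of this successor list when we chained.
        stack.extend(job_store.successors(job_id))

    for job_id in job_store.all_job_ids():
        if job_id not in reachable:
            job_store.delete(job_id)  # deletes the record and its body file
```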
Thanks @adamnovak!!
I have a job that's running out of memory on Slurm and getting killed. When I go to restart it with `singleMachine` (which normally works fine), I get an error like this. This happens every time: I run the workflow from the beginning; it dies; it won't restart. I rerun from the beginning with a bit more memory; it dies again; it won't restart; and so on.
This is new code on my end, where `vg_to_og` is a FollowOn to `clip_vg`, but there doesn't seem to be anything really different about it. I've only ever seen this issue come up now, using the latest Toil release.
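For context, a minimal sketch of that workflow shape, using Toil's public job-function API: a `clip_vg` job with a `vg_to_og` follow-on. The function bodies and memory values are placeholders, not the actual workflow; the point is only the FollowOn relationship and the restart path.

```python
from toil.common import Toil
from toil.job import Job

def clip_vg(job):
    job.log("clip_vg running")

def vg_to_og(job):
    job.log("vg_to_og running")

if __name__ == "__main__":
    options = Job.Runner.getDefaultOptions("./jobstore")
    root = Job.wrapJobFn(clip_vg, memory="1G")     # placeholder limit; too little memory triggers the OOM kill
    root.addFollowOnJobFn(vg_to_og, memory="1G")   # vg_to_og is a FollowOn of clip_vg
    with Toil(options) as toil:
        if options.restart:
            toil.restart()   # restarting after the OOM is where the reported failure occurs
        else:
            toil.start(root)
```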