Open boyangzhao opened 2 years ago
This happens when a job is still logged in the caching database as running and consuming disk space, and the same job tries to start again on the same machine.
Are you sure you didn't end up with two attempts to run the same job happening at the same time? The leader can be forcibly killed without having time to delete the jobs it submitted to the backing scheduler; if it is then started again, a job from the previous attempt can still be running while the same job tries to run again. Or you could have tried to restart the same workflow twice in parallel.
Before we try to add the current job to the database, we already call `_removeDeadJobs()`, so if the other instance of the job isn't actually running in a live local process, we should have removed it from the database already.
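As a rough illustration of that check (a sketch only, not Toil's actual schema or code; the table name, columns, and helper names here are assumptions), the idea is that each worker registers its job in a node-local SQLite database keyed by job ID, and stale rows from dead processes are purged before the insert, so a uniqueness failure should only happen when a live local process still owns the same job ID:

```python
import os
import sqlite3


def open_cache_db(path: str) -> sqlite3.Connection:
    """Open (or create) the node-local cache database for this workflow."""
    con = sqlite3.connect(path)
    con.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        "  job_id TEXT PRIMARY KEY,"   # the uniqueness the error complains about
        "  worker_pid INTEGER NOT NULL)"
    )
    return con


def pid_is_alive(pid: int) -> bool:
    """Return True if a process with this PID still exists on this node."""
    try:
        os.kill(pid, 0)   # signal 0 checks existence without sending anything
    except ProcessLookupError:
        return False
    except PermissionError:
        return True       # exists, but owned by another user
    return True


def remove_dead_jobs(con: sqlite3.Connection) -> None:
    """Drop cache rows owned by worker processes that are no longer running."""
    rows = con.execute("SELECT job_id, worker_pid FROM jobs").fetchall()
    dead = [(job_id,) for job_id, pid in rows if not pid_is_alive(pid)]
    con.executemany("DELETE FROM jobs WHERE job_id = ?", dead)
    con.commit()


def register_job(con: sqlite3.Connection, job_id: str) -> None:
    """Register this worker's job; collides only with a genuinely live duplicate."""
    remove_dead_jobs(con)
    # If a live local process still owns this job_id, the PRIMARY KEY
    # constraint fires here -- the "job id not unique in cache" situation.
    con.execute(
        "INSERT INTO jobs (job_id, worker_pid) VALUES (?, ?)",
        (job_id, os.getpid()),
    )
    con.commit()
```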
To fix this, we probably just want to separate the whole cache out by workflow attempt number. We shouldn't be talking to zombie jobs from previous attempts of the same workflow even if they are still running; we're no longer responsible for their disk space.
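A minimal sketch of what that separation could look like (the directory layout and file naming here are illustrative assumptions, not Toil's real on-disk format): give each workflow attempt its own cache database, so rows written by zombie workers from attempt N never intersect with attempt N+1:

```python
import os


def cache_db_path(cache_dir: str, workflow_id: str, attempt: int) -> str:
    """One cache database per (workflow, attempt) rather than per workflow."""
    return os.path.join(cache_dir, f"cache-{workflow_id}-attempt{attempt}.db")


# Attempt 2 opens its own database; anything a zombie worker from attempt 1
# left behind lives in a different file and is simply ignored.
# (open_cache_db is the helper from the sketch above.)
# con = open_cache_db(cache_db_path("/var/lib/toil/cache", "wf-1234", attempt=2))
```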
I used `toil kill` to kill the running job, waited a fairly long time, and then reran the job. What I noticed in the Mesos monitoring is that after Toil killed the job, the framework still existed and the worker node was still kept around. So when I started the new run on the leader node, it created a new framework in addition to the existing one. When I removed all the nodes by destroying the cluster with `toil destroy-cluster` (which killed off all the running EC2 instances) and then started a new cluster with `toil launch-cluster`, the error disappeared.
Does the restart feature run only with the previous CWL workflow and YAML inputs (somehow already cached in the jobstore)? I tested it a bit, and it seems that if a job failed and I then fix part of the workflow (or change some of the inputs defined in the YAML), the restart ignores the changes. In fact, `toil-cwl-runner` with the `--restart` arg seems to ignore the CWL and YAML args, even though it still requires a CWL arg (it just doesn't look at it). So if a big part of a large workflow has already completed and won't change, and I update the part of the workflow where it failed, ideally a restart would pick up where it left off but with the updated workflow; instead it is using the previously submitted workflow?
Also, everything in the jobstore is hashed. Is there a way to pull out the outputs generated so far during a run (the filenames and which part of the workflow each came from) from the jobstore, for example if a run failed halfway?
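For what it's worth, a crude way to at least see what a file-based jobstore contains on disk (this is just a directory walk, not a Toil API; the jobstore path is a placeholder, and the hashed names can't be mapped back to workflow steps this way):

```python
import os


def list_jobstore_files(jobstore_dir: str) -> None:
    """Print every file under the jobstore directory with its size in bytes."""
    for root, _dirs, files in os.walk(jobstore_dir):
        for name in files:
            path = os.path.join(root, name)
            print(f"{os.path.getsize(path):>12}  {path}")


# list_jobstore_files("./my-jobstore")
```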
When I try to restart a run, a worker job fails with the message below; it mentions something about the job ID not being unique in the cache.