Closed IanSudbery closed 5 years ago
Thanks, tricky to debug. As I understand there should not be any threads. I think the most likely culprit is here:
if status in (drmaa.JobState.DONE, drmaa.JobState.FAILED):
    running_job_ids.remove(job_id)
It might be that the job has ended with a different JobState and is thus never removed from the list of running jobs. I will add more log messages and look at the job states. If you are online, could you check the job state of:
qacct -j 2579099
qacct -j 2579098
For example, it might be wise to add JobState.UNDETERMINED to that check.
hmm, although that would not explain why the log message "waiting for {} jobs" never appears.
exactly.
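For reference, a minimal sketch of the wider state check suggested above, with the extra logging. This is not the actual pipeline code: the string constants are stand-ins for the drmaa.JobState values, and `prune_finished`, `get_status` and the job ids are hypothetical names used only for illustration.

```python
import logging

logger = logging.getLogger("job_poll_sketch")

# Stand-ins for drmaa.JobState constants (assumption: replace these with
# drmaa.JobState.DONE etc. when running against a real DRMAA session).
DONE = "done"
FAILED = "failed"
UNDETERMINED = "undetermined"
RUNNING = "running"

# States after which a job is no longer waited for.
FINISHED_STATES = {DONE, FAILED, UNDETERMINED}


def prune_finished(running_job_ids, get_status):
    """Remove finished jobs from running_job_ids in place.

    get_status is a hypothetical callable mapping a job id to its state.
    Returns the list of removed job ids.
    """
    removed = []
    # Iterate over a copy so removing from the original list is safe.
    for job_id in list(running_job_ids):
        status = get_status(job_id)
        # Log every observed state so unexpected ones show up in the log.
        logger.debug("job %s has state %s", job_id, status)
        if status in FINISHED_STATES:
            running_job_ids.remove(job_id)
            removed.append(job_id)
    logger.info("waiting for %d jobs", len(running_job_ids))
    return removed


# Usage with made-up job ids and states.
states = {"job-a": DONE, "job-b": UNDETERMINED, "job-c": RUNNING}
running = list(states)
removed = prune_finished(running, states.get)
```

One caveat: treating UNDETERMINED as finished may drop a job whose state simply could not be queried at that moment, so it might be safer to only log it at first and decide once the observed states are known.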
I think this might be connected to jobs that have @jobs_limit
Thanks, good idea.
This has been fixed in https://github.com/cgat-developers/ruffus/issues/99 and will go away once a new ruffus is released and installed.
Thanks Andreas.
I've had several occurrences recently where pipelines will not detect when a job has finished. That is, the cluster will have no jobs running, but the pipeline will still be running without showing an error or submitting more jobs. Interestingly, in this case the log never shows "Waiting for N jobs to finish".
The STDIN job here is the pipeline itself.
Interestingly, looking at the code, it looks like things should happen something like:
Thus my guess is that at step 2 some other thread is taking control but never handing it back. As I don't really understand the collaborative multithreading paradigm here, this might be one for @AndreasHeger