Allow explicit job number dependencies

gdevenyi commented 8 years ago

Suggested by Andrew Janke

We should allow specification of exact job numbers for dependencies in addition to name glob matching.

pipitone commented 8 years ago

SGE does it by either matching the job name with a pattern or the job ID explicitly. I like that approach.

From the SGE man pages:

  wc_job
     The wildcard job specification is a placeholder for job ids,
     job names including job name patterns. A job id always
     references one job, while the name and pattern might
     reference multiple jobs.

     wc_job := job-id | job-name | pattern
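To make the grammar concrete, here is a minimal Python sketch of how a dependency string could be sorted into the three `wc_job` cases and resolved against a set of known jobs. The function names and the `jobs` mapping (job ID to job name) are hypothetical, and treating any all-digit string as a job ID is an assumption for illustration, not a description of how SGE itself parses the argument.

```python
import fnmatch


def classify_dependency(spec):
    """Roughly classify a dependency string along the wc_job grammar.

    Assumption: an all-digit string is a job ID, a string containing
    glob characters is a pattern, and anything else is a plain job name.
    """
    if spec.isdigit():
        return "job-id"
    if any(ch in spec for ch in "*?["):
        return "pattern"
    return "job-name"


def resolve(spec, jobs):
    """Return the job IDs in `jobs` (job ID -> job name) that `spec` refers to."""
    if classify_dependency(spec) == "job-id":
        # A job ID references at most one job.
        return [spec] if spec in jobs else []
    # A name or pattern may reference several jobs.
    return [job_id for job_id, name in jobs.items()
            if fnmatch.fnmatchcase(name, spec)]


jobs = {"101": "align_s1", "102": "align_s2", "103": "stats"}
print(resolve("102", jobs))
print(resolve("align_*", jobs))
```

The two branches mirror the distinction in the man page text: an ID maps to a single job, while a name or pattern can fan out to many.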

gdevenyi commented 8 years ago

So, SGE already supports it; PBS needs work though :)

gdevenyi commented 8 years ago

Side thought: should we rename afterok -> depend?

andrewjanke commented 8 years ago

I found afterany to be the most robust for PBS. It means that a subsequent job can run even if the previous one failed and wasn't set to rerun. At least then users can determine what went wrong from their own logfiles rather than having to figure out how to query the scheduler to find out why a job isn't running.

At this point, they either remove the job or do something magical. In the former case just running the job achieves the same purpose. In the latter we have a smart user who knows stuff and possibly doesn't need to use qbatch!

https://github.com/andrewjanke/qbatch/blob/master/qbatch#L139

gdevenyi commented 8 years ago

@andrewjanke the reason for afterok instead of afterany is that if I have a pipeline built around qbatch which has a dependency chain, I can't run the next stage without the prior stage finishing successfully. If I allow the pipeline to continue I have to debug a failure of the commands downstream, rather than at the true failure point.

andrewjanke commented 8 years ago

@gdevenyi I mustn't have said it right. The situation you describe with dependencies is exactly why I used afterany...

I prefer to have subsequent steps fail but as part of doing so write things to logfiles. This then means that irrespective of the scheduler I'm using (PBS or gridengine) I can use the same heuristics/checker to tell me where things broke. In my case I have a number of things that parse logfiles.

So, if I use afterany I do this after an error:

Check logfiles, find error
Fix error in processing script.
rerun.

If I use afterok I do this to sort an error:

Check scheduler for jobs stuck in "hold" state (gridengine) or "wait" state (PBS)
Figure out if the job really is stuck as the queue on our cluster is nigh on 100% full
Figure out the job that runs before it
Check that job's logfile + the stuck job's script, find error
Fix error in processing script
Remove jobs stuck in hold by ID, as just re-running the preceding script won't work
Repeat for other subjects that also failed
rerun.

I prefer the former, you may have automated the latter.
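For reference, on the PBS side the two behaviours debated above differ only in the dependency type handed to `qsub -W depend=...`. The sketch below builds that option for a list of job IDs; `pbs_depend_option` is a hypothetical helper written for illustration and is not taken from qbatch.

```python
def pbs_depend_option(job_ids, mode="afterok"):
    """Build the qsub arguments for a PBS job dependency.

    afterok:  run only if every listed job exited successfully.
    afterany: run once every listed job has finished, whatever its exit status.
    """
    if mode not in ("afterok", "afterany"):
        raise ValueError("unsupported dependency type: {}".format(mode))
    if not job_ids:
        # No dependencies, so no extra qsub arguments.
        return []
    return ["-W", "depend={}:{}".format(mode, ":".join(job_ids))]


print(pbs_depend_option(["101", "102"]))
print(pbs_depend_option(["101", "102"], mode="afterany"))
```

With `afterok`, a downstream job whose dependency failed stays queued and never runs; with `afterany`, it starts and is left to fail on its own and report that in its logfile.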

gdevenyi commented 8 years ago

@andrewjanke See #112 for implementation of PBS/SGE job number dependencies. It simply extends the XML tree search in PBS to check job numbers as well, and adds them to the depends list if found. The existing SGE implementation already works since it allows for names or IDs using the same mechanism.
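As a rough illustration of the approach described (not the actual code from #112), the sketch below walks an XML job listing and collects jobs whose name matches a glob or whose job number matches exactly. The element names `Job`, `Job_Id` and `Job_Name`, the sample XML, and the function name are assumptions modelled on the kind of output `qstat -x` produces.

```python
import fnmatch
import xml.etree.ElementTree as ET

# Hypothetical sample of a scheduler's XML job listing.
SAMPLE = """<Data>
  <Job><Job_Id>101.head</Job_Id><Job_Name>align_s1</Job_Name></Job>
  <Job><Job_Id>102.head</Job_Id><Job_Name>align_s2</Job_Name></Job>
  <Job><Job_Id>103.head</Job_Id><Job_Name>stats</Job_Name></Job>
</Data>"""


def find_dependencies(xml_text, specs):
    """Return full job IDs matching any spec by name glob or by job number."""
    root = ET.fromstring(xml_text)
    depends = []
    for job in root.iter("Job"):
        job_id = job.findtext("Job_Id", default="")
        name = job.findtext("Job_Name", default="")
        # "101.head" -> "101", so users can give the bare job number.
        number = job_id.split(".", 1)[0]
        if any(spec in (job_id, number) or fnmatch.fnmatchcase(name, spec)
               for spec in specs):
            depends.append(job_id)
    return depends


print(find_dependencies(SAMPLE, ["align_*"]))
print(find_dependencies(SAMPLE, ["102", "stats"]))
```

Checking the ID inside the same loop that already matches names keeps a single pass over the job tree, which is in the spirit of "extends the XML tree search" above.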

gdevenyi commented 8 years ago

Merged

CoBrALab / qbatch

Allow explicit job number dependencies #95