CoBrALab / qbatch

The Unlicense

Allow explicit job number dependencies #95

Closed gdevenyi closed 8 years ago

gdevenyi commented 8 years ago

Suggested by Andrew Janke

We should allow specification of exact job numbers for dependencies in addition to name glob matching.

pipitone commented 8 years ago

SGE does it by either matching the job name with a pattern or the job ID explicitly. I like that approach.

From the SGE man pages:

  wc_job
     The wildcard job specification is a placeholder for job ids,
     job  names  including  job  name  patterns.  A job id always
     references one job, while the name and pattern might  refer-
     ence multiple jobs.

     wc_job := job-id | job-name | pattern

gdevenyi commented 8 years ago

So, SGE already supports it; PBS needs work though :)
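
To make the SGE side concrete, here is a minimal sketch of how a dependency list could be turned into `qsub` arguments. The helper name `sge_hold_args` is hypothetical and is not taken from qbatch; the only scheduler feature relied on is SGE's `-hold_jid` option, which takes a comma-separated list of `wc_job` entries (job IDs, job names, or name patterns).

```python
def sge_hold_args(dependencies):
    """Build qsub arguments that hold a job on the given SGE dependencies.

    Each entry may be a job id ("12345"), a job name, or a name pattern
    ("align_*"); SGE's wc_job syntax accepts all three, so no lookup of
    job numbers is needed here.
    """
    # Normalise to stripped strings and drop empty entries.
    deps = [str(d).strip() for d in dependencies if str(d).strip()]
    if not deps:
        return []
    return ["-hold_jid", ",".join(deps)]


print(sge_hold_args(["12345", "align_*"]))
```

Because the scheduler resolves IDs and patterns itself, mixing both in one list needs no extra work on the SGE side.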

gdevenyi commented 8 years ago

Side thought: should we rename afterok -> depend?

andrewjanke commented 8 years ago

I found afterany to be the most robust for PBS. It means that a subsequent job can run even if the previous one failed and wasn't set to rerun. At least then users can determine what went wrong from their own logfiles rather than having to figure out how to query the scheduler to find out why a job isn't running.

At this point, they either remove the job or do something magical. In the former case just running the job achieves the same purpose. In the latter we have a smart user who knows stuff and possibly doesn't need to use qbatch!

https://github.com/andrewjanke/qbatch/blob/master/qbatch#L139

gdevenyi commented 8 years ago

@andrewjanke the reason for afterok instead of afterany is that if I have a pipeline built around qbatch which has a dependency chain, I can't run the next stage without the prior stage finishing successfully. If I allow the pipeline to continue I have to debug a failure of the commands downstream, rather than at the true failure point.

andrewjanke commented 8 years ago

@gdevenyi I mustn't have said it right. The situation you describe with dependencies is exactly why I used afterany...

I prefer to have subsequent steps fail but as part of doing so write things to logfiles. This then means that irrespective of the scheduler I'm using (PBS or gridengine) I can use the same heuristics/checker to tell me where things broke. In my case I have a number of things that parse logfiles.

So, if I use afterany I do this after an error:

  1. Check logfiles, find error
  2. Fix error in processing script.
  3. rerun.

If I use afterok I do this to sort an error:

  1. Check scheduler for jobs stuck in "hold" state (gridengine) or "wait" state (PBS)
  2. Figure out if the job really is stuck as the queue on our cluster is nigh on 100% full
  3. Figure out the job that runs before it
  4. check that job's logfile + stuck job's script, find error
  5. fix error in processing script
  6. Remove jobs stuck in hold by ID as just re-running the preceding script won't work
  7. Repeat for other subjects that also failed
  8. rerun.

I prefer the former, you may have automated the latter.
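
The two behaviours discussed above differ only in the dependency type passed to PBS. As a rough sketch (the function name `pbs_depend_args` and the idea of exposing the type as a parameter are assumptions for illustration, not qbatch's actual interface), the choice could be made configurable like this, using the standard `-W depend=<type>:<jobid>[:<jobid>...]` form of `qsub`:

```python
# Dependency types accepted by this sketch; PBS defines more than these.
VALID_TYPES = {"after", "afterok", "afternotok", "afterany"}


def pbs_depend_args(job_ids, dep_type="afterok"):
    """Build qsub arguments for a PBS dependency on explicit job ids.

    afterok  : run only if every listed job exited successfully.
    afterany : run once every listed job has finished, whatever its status.
    """
    if dep_type not in VALID_TYPES:
        raise ValueError("unsupported dependency type: {}".format(dep_type))
    ids = [str(j).strip() for j in job_ids if str(j).strip()]
    if not ids:
        return []
    return ["-W", "depend={}:{}".format(dep_type, ":".join(ids))]


print(pbs_depend_args(["101", "102"], "afterany"))
```

With `afterok`, a failed upstream job leaves the dependent job waiting in the queue; with `afterany` it runs and fails in its own logfile, which is the trade-off described in the two lists above.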

gdevenyi commented 8 years ago

@andrewjanke See #112 for implementation of PBS/SGE job number dependencies. It simply extends the XML tree search in PBS to check job numbers as well, and adds them to the depends list if found. The existing SGE implementation already works since it allows for names or IDs using the same mechanism.
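
For readers who have not looked at #112, the idea can be sketched roughly as follows. This is not the merged code: the function name `resolve_dependencies` is hypothetical, and the XML layout (a `Job` element containing `Job_Id` and `Job_Name` children, as produced by `qstat -x` on Torque) is an assumption of this example.

```python
import fnmatch
import xml.etree.ElementTree as ET


def resolve_dependencies(qstat_xml, specs):
    """Return ids of queued jobs matching any spec.

    A spec matches a job if it equals the job number (the part of the id
    before the first "."), equals the full job id, or glob-matches the
    job name.
    """
    root = ET.fromstring(qstat_xml)
    matched = []
    for job in root.iter("Job"):
        job_id = job.findtext("Job_Id", default="")
        job_name = job.findtext("Job_Name", default="")
        job_number = job_id.split(".")[0]
        for spec in specs:
            if (spec == job_number or spec == job_id
                    or fnmatch.fnmatchcase(job_name, spec)):
                if job_id not in matched:
                    matched.append(job_id)
                break  # one matching spec is enough for this job
    return matched


sample = (
    "<Data>"
    "<Job><Job_Id>101.head</Job_Id><Job_Name>align_s1</Job_Name></Job>"
    "<Job><Job_Id>102.head</Job_Id><Job_Name>align_s2</Job_Name></Job>"
    "<Job><Job_Id>103.head</Job_Id><Job_Name>stats</Job_Name></Job>"
    "</Data>"
)
print(resolve_dependencies(sample, ["align_*", "103"]))
```

Array job ids and jobs whose names are purely numeric are not handled specially in this sketch.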

gdevenyi commented 8 years ago

Merged