galaxyproject / galaxy

Data intensive science for everyone.
https://galaxyproject.org

Paused job control and other improvements #3436

Open natefoo opened 7 years ago

natefoo commented 7 years ago

Currently jobs (and datasets) only enter the PAUSED state under two circumstances:

  1. One or more inputs terminate in the ERROR state
  2. The user exceeds their disk quota

There are two ways jobs can be unpaused:

  1. "Resume paused jobs" in the history menu
  2. In the case where a job's inputs terminated in ERROR, using the rerun button on the errored inputs in conjunction with the "Resume dependencies" option

Useful features to add would be:

  1. Pause jobs when one or more inputs are deleted or inputs are in other terminal non-OK states (e.g. FAILED_METADATA)
  2. Pause jobs when waiting on a paused job (this should be a separate WAITING_ON_PAUSED state w/ accompanying visual style)
  3. Allow the user to pause/unpause individual jobs - to let later jobs run sooner, for example (although perhaps this is better done with job priorities)?
  4. "Pause" running jobs (at this stage it'd only be practical to de-queue the job and set it back to a runnable state, not attempt to checkpoint and resume)
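
To make the proposed rules concrete, here is a rough sketch of how a job's state might be derived from its inputs if features 1 and 2 were added. This is not Galaxy's actual job model: the `State` enum, the `WAITING` and `WAITING_ON_PAUSED` members, and `next_job_state` are hypothetical names used only for illustration.

```python
from enum import Enum


class State(Enum):
    # Illustrative states only; not Galaxy's real model.
    OK = "ok"
    RUNNING = "running"
    QUEUED = "queued"
    WAITING = "waiting"                      # hypothetical: inputs not finished yet
    ERROR = "error"
    FAILED_METADATA = "failed_metadata"
    DELETED = "deleted"
    PAUSED = "paused"
    WAITING_ON_PAUSED = "waiting_on_paused"  # proposed in feature 2


# Terminal non-OK input states that would pause a dependent job (feature 1).
TERMINAL_NON_OK = {State.ERROR, State.FAILED_METADATA, State.DELETED}


def next_job_state(input_states, over_quota=False):
    """Decide the state of a not-yet-started job from its inputs' states."""
    states = list(input_states)
    # Existing behaviour (quota, errored inputs) plus feature 1.
    if over_quota or any(s in TERMINAL_NON_OK for s in states):
        return State.PAUSED
    # Feature 2: an upstream job is paused, so this one can only wait on it.
    if any(s in (State.PAUSED, State.WAITING_ON_PAUSED) for s in states):
        return State.WAITING_ON_PAUSED
    # All inputs are ready: the job can be queued.
    if all(s is State.OK for s in states):
        return State.QUEUED
    # Otherwise some input is still being produced.
    return State.WAITING
```

Keeping `WAITING_ON_PAUSED` distinct from `PAUSED` is what would let the UI style the two cases differently.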

gregvonkuster commented 7 years ago

@natefoo I hope this is the right place to add a related feature request. It concerns the case where a job's inputs terminated in ERROR and the jobs were created by running a workflow. If the job at a workflow step ends in the ERROR state, using the rerun button on the errored inputs together with the "Resume dependencies" option re-runs only the immediate dependencies of the errored job; it does not re-execute the entire downstream workflow chain. Re-running the whole chain would be extremely useful.
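
The difference between "immediate dependencies" and "the entire downstream chain" can be sketched as a graph traversal. This is only an illustration, not Galaxy code: `downstream_jobs`, the `consumers` mapping, and the step names are all made up for the example.

```python
from collections import deque


def downstream_jobs(start, consumers):
    """Return every job reachable from `start`, in breadth-first order.

    `consumers` maps a job to the jobs that directly consume its outputs
    (a hypothetical structure for this sketch).
    """
    seen = {start}  # guards against revisiting jobs reached by several paths
    order = []
    queue = deque(consumers.get(start, ()))
    while queue:
        job = queue.popleft()
        if job in seen:
            continue
        seen.add(job)
        order.append(job)
        queue.extend(consumers.get(job, ()))
    return order


# A small made-up workflow: align -> sort -> (call -> annotate, stats)
consumers = {
    "align": ["sort"],
    "sort": ["call", "stats"],
    "call": ["annotate"],
}

immediate = consumers["align"]                   # ['sort']
full_chain = downstream_jobs("align", consumers)  # ['sort', 'call', 'stats', 'annotate']
```

Resuming only `immediate` corresponds to the current behaviour described above; resuming `full_chain` is the behaviour being requested.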

natefoo commented 6 years ago

I believe #6036 should take care of feature 1.

rokyo401 commented 5 years ago

The ability to pause jobs and resume them later would also be extremely useful for administrators of local Galaxy instances, so users won't lose their running jobs when the Galaxy server is restarted for maintenance. Ideally, the local Galaxy server would automatically pause all running jobs when it is restarted or shut down and automatically resume them once it becomes available again. If that is technically possible, of course?

natefoo commented 5 years ago

@rokyo401 If you run your jobs through any of the cluster job runners, you can restart Galaxy at any time without affecting running jobs.

#6036 also implemented feature 2, so that just leaves the last two on the wishlist.

simonbray commented 1 year ago

+1 to feature 3. If I have 1000 paused jobs and want to resume just one, I don't believe I have a good option other than BioBlend's resume_job().
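
For reference, a minimal sketch of that workaround. It assumes a BioBlend version whose jobs client offers `get_jobs()` with `state` and `history_id` filters alongside `resume_job()`; the helper `resume_matching_jobs` and the `should_resume` predicate are hypothetical names, and the helper accepts any object exposing those two methods.

```python
def resume_matching_jobs(jobs_client, history_id, should_resume):
    """Resume only the paused jobs in a history that satisfy `should_resume`.

    `jobs_client` is assumed to behave like BioBlend's `gi.jobs`:
    `get_jobs()` returns a list of job dicts and `resume_job()` takes a job id.
    """
    paused = jobs_client.get_jobs(state="paused", history_id=history_id)
    resumed = []
    for job in paused:
        if should_resume(job):
            jobs_client.resume_job(job["id"])
            resumed.append(job["id"])
    return resumed


# Possible usage against a real server (URL, key, and ids are placeholders):
# from bioblend.galaxy import GalaxyInstance
# gi = GalaxyInstance(url="https://galaxy.example.org", key="YOUR_API_KEY")
# resume_matching_jobs(gi.jobs, "HISTORY_ID", lambda job: job["id"] == "JOB_ID")
```

This still means one API call per resumed job and some scripting on the user's side, which is why a per-job resume control in the UI would help.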