Relaunch all failed jobs at once, for a given step

vladvisan commented 2 months ago

Screenshot taken from @ahmedhamidawan's GCC presentation since my instance doesn’t have this feature yet.

Related to https://github.com/galaxyproject/galaxy/pull/17413
I searched and did not find a similar issue

Description

If multiple jobs fail for the same step (collection), it would be nice to be able to relaunch all the failed jobs for that step at the same time, with one button.
If not, if there are for example 50 failed jobs, it can be very tedious.
I’m not sure how common such large (>=50) collections are, but they can happen, potentially at even larger sizes (hundreds or thousands of datasets).

Adjacent ideas

Handle the cases where there is a mix of successful jobs, failed jobs, running jobs, waiting-to-be-run jobs, ..
One talk at GCC mentioned a « multi-select datasets » option when launching a tool, maybe the logic or page could be re-used/pre-populated ?
Maybe allow a multiple-choice checkbox of which jobs to re-execute, by default all selected, with a button to turn them off
Maybe even a regex to select the jobs to be re-executed, maybe re-using the collection filter operation
Should also (as usual) include the « Resume dependencies from this job ? » Additional Option
In the screenshot's workflow, one could just relaunch the previous step as it would in turn relaunch all the failed jobs of the last step. But this solution wouldn't work for a typical workflow (and it is inefficient even when it does work).
Let the user change the step's tool's version before executing - but keep the rest of the invocation ?
- Might be necessary to resolve the underlying error, though not always: sometimes an external resource was unavailable, and relaunching the exact same tool/dataset combo works the second time around
- Need to be careful about reproducibility. Might need to duplicate/fork a history, or create a new invocation and combine with https://github.com/galaxyproject/galaxy/pull/4690 ?
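
To make the selection ideas above concrete, here is a minimal sketch of the logic, not Galaxy's implementation. It assumes each job of the step is available as a plain dict with hypothetical `id`, `state` and `element` (collection element name) keys, and that `error` is the state name for failed jobs. All failed jobs are selected by default, and an optional regex narrows the selection.

```python
import re
from typing import Iterable, Optional

# Assumption: "error" is the state name used for failed jobs.
FAILED_STATES = {"error"}


def select_jobs_to_rerun(jobs: Iterable[dict], pattern: Optional[str] = None) -> list:
    """Return ids of failed jobs, optionally restricted by a regex on the element name."""
    regex = re.compile(pattern) if pattern else None
    selected = []
    for job in jobs:
        if job["state"] not in FAILED_STATES:
            continue  # leave successful, running and queued jobs untouched
        if regex and not regex.search(job["element"]):
            continue  # failed, but excluded by the user's filter
        selected.append(job["id"])
    return selected


# Hypothetical step with a mix of job states.
step_jobs = [
    {"id": "j1", "state": "ok", "element": "sample_A"},
    {"id": "j2", "state": "error", "element": "sample_B"},
    {"id": "j3", "state": "running", "element": "sample_C"},
    {"id": "j4", "state": "error", "element": "control_1"},
    {"id": "j5", "state": "queued", "element": "control_2"},
]

print(select_jobs_to_rerun(step_jobs))               # ['j2', 'j4']
print(select_jobs_to_rerun(step_jobs, r"^sample_"))  # ['j2']
```

A checkbox list in the UI would amount to the same thing: start from the default selection and let the user remove ids before submitting.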

Labels

feature-request, area/UI-UX, and maybe area/workflows and area/backend

mvdbeek commented 2 months ago

I would say the most common thing to do is to re-run a single job, this is the default behavior now, and I think that should remain that way.

If not, if there are for example 50 failed jobs, it can be very tedious.

You can select the input collection today and all jobs will re-run. There should probably be a way to switch between those two modes more easily, so you don't need to find the input collection. The information on whether or not the job was part of a mapped-over collection is available to the frontend.

Handle the cases where there is a mix of successful jobs, failed jobs, running jobs, waiting-to-be-run jobs, ..

You can rerun the whole collection and enable the job cache, that would be the equivalent action
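
To illustrate why rerunning the whole collection with a job cache is equivalent, here is a toy sketch. It is not Galaxy's job cache: the `cache` dict, the `run_tool` function, and the key built from tool id, version, parameters and input are all assumptions made for the example. Only jobs that never produced a successful result are executed again.

```python
import json

cache = {}     # hypothetical cache: job key -> output of a successful job
executed = []  # records which inputs actually triggered a new job


def job_key(tool_id, tool_version, params, dataset):
    # Identical tool, version, parameters and input give an identical key.
    return json.dumps([tool_id, tool_version, params, dataset], sort_keys=True)


def run_tool(dataset, fail_on=()):
    # Stand-in for a real tool run; raises for inputs listed in fail_on.
    executed.append(dataset)
    if dataset in fail_on:
        raise RuntimeError(f"job failed for {dataset}")
    return dataset.upper()


def map_over(collection, fail_on=()):
    outputs = {}
    for dataset in collection:
        key = job_key("toy_tool", "1.0", {"flag": True}, dataset)
        if key in cache:
            outputs[dataset] = cache[key]  # cache hit: no new job
            continue
        try:
            cache[key] = outputs[dataset] = run_tool(dataset, fail_on)
        except RuntimeError:
            outputs[dataset] = None  # failed jobs are never cached
    return outputs


collection = ["a", "b", "c"]
map_over(collection, fail_on={"b"})  # first run: the job for "b" fails
executed.clear()
result = map_over(collection)        # rerun of the whole collection
print(executed)  # ['b']
print(result)    # {'a': 'A', 'b': 'B', 'c': 'C'}
```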

One talk at GCC mentioned a « multi-select datasets » option when launching a tool, maybe the logic or page could be re-used/pre-populated ?

This is an entirely different thing that will result in a different output structure that is flattened by one level.
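
The structural difference can be sketched with plain Python lists. This is only an analogy, with a hypothetical `tool` function standing in for a Galaxy tool: mapping over a collection runs one job per element and keeps the collection structure, while a multi-select input passes all datasets to a single job, so the output loses one level of nesting.

```python
def tool(inputs):
    # Hypothetical tool: one job that consumes a list of datasets
    # and produces a single output dataset.
    return "+".join(inputs)


collection = ["d1", "d2", "d3"]

# Mapped over the collection: one job per element, output is a collection.
mapped = [tool([dataset]) for dataset in collection]

# Multi-select: one job receives every dataset, output is a single dataset.
multi = tool(collection)

print(mapped)  # ['d1', 'd2', 'd3']
print(multi)   # d1+d2+d3
```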

The rest sounds good and we should do it IMO, thanks for writing up the issue.

vladvisan commented 2 months ago

Thanks for the feedback.

I would say the most common thing to do is to re-run a single job, this is the default behavior now, and I think that should remain that way.

Good point.

you can select the input collection today and all jobs will re-run

I must have missed something. I tried to do this, but I was not able to see a rerun/"recycle" button for the collection, only for the individual datasets.
I also tried to manually modify the rerun URL of a dataset (https://usegalaxy.org/tool_runner/rerun?id=xxxx) and replace the id with the collection's id, but I got a "You are not allowed to access this dataset" page.

You can rerun the whole collection and enable the job cache, that would be the equivalent action

Good point, thanks. I haven't enabled it on my instance yet; I want to test it out soon.

this is an entirely different thing that will result in a different output structure that is flattened by one level

I understand.

vladvisan commented 2 months ago

Also a separate comment:

"Resume dependencies from this job" even for re-runs of successful jobs?

I had assumed this was the case, but I just tested, and this option only appears for failed jobs (whose associated step has downstream steps)
At least for some scientists where I work, the option to re-try parts of the workflow from a given step is useful, with slightly different parameters from that step forwards (but with the same datasets/results from before)
Although this could seemingly also be achieved (assuming job cache is activated) by re-running the whole workflow, and just changing the parameters of that step
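
A toy sketch of why that would work, under the assumption that a cache key covers the step, its parameters and its upstream result (this is not Galaxy's implementation, and the step names `trim`, `align` and `count` are made up): changing one step's parameters leaves the upstream steps as cache hits and forces that step and everything downstream of it to run again.

```python
import hashlib
import json

cache = {}     # hypothetical cache: key -> step output
executed = []  # names of the steps that actually ran


def run_step(name, params, upstream):
    # The key includes the upstream result, so a change propagates downstream.
    raw = json.dumps([name, params, upstream], sort_keys=True)
    key = hashlib.sha256(raw.encode()).hexdigest()
    if key not in cache:
        executed.append(name)
        cache[key] = f"{name}({upstream}, {sorted(params.items())})"
    return cache[key]


def run_workflow(steps, data):
    for name, params in steps:
        data = run_step(name, params, data)
    return data


workflow = [("trim", {"q": 20}), ("align", {"mode": "fast"}), ("count", {})]
run_workflow(workflow, "reads")  # first invocation: every step runs

executed.clear()
workflow[1] = ("align", {"mode": "sensitive"})  # change one step's parameters
run_workflow(workflow, "reads")
print(executed)  # ['align', 'count']
```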

mvdbeek commented 2 months ago

individual datasets

Yes, that's right. If you click on rerun there, you can replace the single input with the higher-level input (i.e. the collection input). I agree that this should probably be a more direct option in the user interface, but I wanted to point out that you can do this.

vladvisan commented 2 months ago

UI option

Ah, I see, I was able to select the collection as you indicated, in the re-run screen:

(t being the name of the collection)

Basic results

All the collection's datasets are regenerated (one job launched per dataset), which is nice
However, ideally only the failed ones would trigger new jobs
I tested this on a collection with a mix of failed/successful dataset jobs, and they were all regenerated/relaunched

Advanced results (resume dependencies)

When I select the “Resume dependencies from this job?” option, the execution refuses to launch, with the following error screen/message (I crossed out the irrelevant information):

I tested this (on Galaxy version 23.2.2.dev0):

First test: all the jobs associated with the collection's datasets were in the failed state
Second test: some of the jobs associated with the collection's datasets were in the failed state, and the others were in the success state

Both cases led to the above error message.

galaxyproject / galaxy

Relaunch all failed jobs at once, for a given step #18442