ari-apc-lab / croupier

Cloudify plugin for HPCs and batch applications
https://hub.docker.com/repository/docker/marangiop/cloudify-croupier-ari-apc-lab
Apache License 2.0

Introduce ability to resume execution of workflow from specific job that has failed #3

Open marangiop opened 3 years ago

marangiop commented 3 years ago

Cloudify Version 20.02.23~community (Community)

Croupier Version Commit 1eb2f325fb0f4385c772c45fe3264c5ebf9d2e07 of branch grapevine, after merging from permedcoe branch at commit 46239ecccc3fc32a5b1c7cf1b27ed76b45f6ab28

Is your feature request related to a problem? Please describe.

This may be a Cloudify problem rather than a Croupier one. When a blueprint contains more than one job and there are explicit dependencies between the jobs, the failure of a single job automatically stops the "run_jobs" workflow and kills all other ongoing jobs. After that, the only option is to run the "uninstall" workflow, then "install", and finally "run_jobs" again. In other words, a new execution can only begin from the start; the previous run_jobs execution cannot be resumed from the job that failed.

If the user does not want to repeat the entire workflow and instead wants to continue from the failed job (for example, by re-running that job after a code fix), the only workaround is to edit the blueprint locally in an IDE or code editor, comment out every job that completed successfully before the failed one, upload the modified blueprint to Cloudify, create a new deployment, and then execute the "install" and "run_jobs" workflows.


Describe the solution you'd like

We need a mechanism whereby, if the user notices that the run_jobs workflow has stopped at a specific job (for example, the 17th job), the user can easily restart the workflow from that job rather than from the 1st job. This mechanism could be a separate workflow from "run_jobs", for example "resume_run_jobs".
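
To make the request more concrete, here is a minimal sketch of the selection logic such a resume workflow might use. It is not Croupier or Cloudify code: the function `jobs_to_resume`, the `dependencies` mapping, and the job names are all hypothetical, and it assumes the set of successfully completed jobs from the previous execution is available somehow. It uses `graphlib` from the Python standard library, which requires Python 3.9 or later.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def jobs_to_resume(dependencies, completed):
    """Return the jobs that still need to run, in dependency order.

    dependencies: dict mapping each job name to the set of job names it
                  depends on (hypothetical representation of the blueprint).
    completed:    set of job names that finished successfully in the
                  previous, interrupted execution.
    """
    # A completed job whose dependency did not complete indicates
    # inconsistent input, so refuse to guess.
    for job in completed:
        missing = set(dependencies.get(job, ())) - completed
        if missing:
            raise ValueError(
                f"job {job!r} is marked completed but depends on "
                f"unfinished jobs: {sorted(missing)}"
            )

    # static_order() yields every job after all of its dependencies.
    order = TopologicalSorter(dependencies).static_order()

    # Skip what already succeeded; keep the failed job, any jobs that
    # were killed, and everything that never started.
    return [job for job in order if job not in completed]


# Hypothetical four-job chain where "simulate" failed.
dependencies = {
    "preprocess": set(),
    "simulate": {"preprocess"},
    "analyse": {"simulate"},
    "report": {"analyse"},
}
remaining = jobs_to_resume(dependencies, completed={"preprocess"})
print(remaining)
```

In this example only "preprocess" is skipped, so the resumed run would start at "simulate". Selecting by "not yet completed" rather than "downstream of the failed job" also picks up independent jobs that were killed when the workflow stopped.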