Open waltherg opened 2 years ago
@waltherg thank you for sharing this feedback. We have a feature item in our backlog to address this request. If and when it is approved, it will be implemented as part of the planning and release cycle.
Thank you for your response and I'm excited this is on the roadmap.
I probably just missed it but is there an open source repo I could try and contribute this to?
Do you have any updates on this? I've recently discovered scheduled pipeline runs via Schedule and ScheduleRecurrence, which seem to work just fine. Retry upon failure of these scheduled runs is pretty essential IMO.
I came across the same requirement: I scheduled a few hundred experiments, and some of the jobs failed. I have the run_id, but I can't find any documentation or function in the SDK that takes a job_id and resubmits the job.
I use the Python SDK to develop ML pipelines for Azure ML.
How do I get my PythonScriptStep tasks or the encompassing Pipeline object to simply rerun upon failure? I reckon it's pretty common for pipelines to break temporarily because of transient network, storage, etc. issues, so a simple rerun / retry seems like a pretty basic thing for a task orchestration framework to provide (see e.g. Apache Airflow).
I've spent a fair amount of time going over the documentation for Azure ML and I just can't figure out how to get "retry upon failure" behaviour.
The closest thing I could find is the continue_on_step_failure pipeline / task parameter, which doesn't really do what's needed.
Any advice please?
I've tried finding a solution on SO over here - a proposed solution uses external tools which just adds more overhead:
https://stackoverflow.com/questions/68647922/azure-machine-learning-pipeline-how-to-retry-upon-failure
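Until something built in exists, one client-side workaround is to wrap submission in a small retry loop. The following is only a minimal sketch under stated assumptions, not an Azure ML feature: `run_with_retries`, its parameters, and the success status names are hypothetical choices made for illustration, and the caller supplies a function that submits the pipeline, waits for it, and returns the final status string.

```python
import time
from typing import Callable, Sequence


def run_with_retries(
    submit_and_wait: Callable[[], str],
    max_attempts: int = 3,
    backoff_seconds: float = 30.0,
    # Assumption: these strings are what the caller's function returns on success.
    success_statuses: Sequence[str] = ("Finished", "Completed"),
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call submit_and_wait until it reports success; return the attempt count."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_outcome = "not run"
    for attempt in range(1, max_attempts + 1):
        try:
            last_outcome = submit_and_wait()
        except Exception as exc:
            # Any exception counts as a failed attempt and is retried.
            last_outcome = f"error: {exc}"
        if last_outcome in success_statuses:
            return attempt
        if attempt < max_attempts:
            # Linear backoff: wait longer after each failed attempt.
            sleep(backoff_seconds * attempt)
    raise RuntimeError(
        f"gave up after {max_attempts} attempts; last outcome: {last_outcome}"
    )


# Hypothetical, untested wiring for the Python SDK. It assumes `experiment`
# and `pipeline` already exist and that wait_for_completion returns the
# final status string:
#
# run_with_retries(
#     lambda: experiment.submit(pipeline).wait_for_completion(
#         show_output=False, raise_on_error=False
#     )
# )
```

Note that this resubmits the whole pipeline rather than a single failed step, and it only works while the submitting process stays alive, so it does not cover runs triggered by a Schedule.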