Azure / MachineLearningNotebooks

Python notebooks with ML and deep learning examples with Azure Machine Learning Python SDK | Microsoft
https://docs.microsoft.com/azure/machine-learning/service/
MIT License

Retry pipeline and/or task on failure #1572

Open waltherg opened 2 years ago

waltherg commented 2 years ago

I use the Python SDK to develop ML pipelines for Azure ML.

How do I get my PythonScriptStep tasks or the encompassing Pipeline object to simply rerun upon failure? I reckon it's pretty common for pipelines to break temporarily because of transient network, storage, or similar issues, so a simple rerun / retry seems like a pretty basic thing for a task orchestration framework to provide (see e.g. Apache Airflow).

I've spent a fair amount of time going over the documentation for Azure ML and I just can't figure out how to get "retry upon failure" behaviour.

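The closest thing I've found is the continue_on_step_failure pipeline / task parameter, which lets the other steps keep running after a step fails but never reruns the failed step, so it doesn't really do what's needed.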
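For illustration, a client-side retry loop around submission could look roughly like the sketch below. None of this is part of the Azure ML SDK: `submit_with_retry`, its parameters, and the default status strings are assumptions made up for this example. It takes two callables (one that submits, one that blocks and returns a final status string), so it does not depend on any particular SDK call.

```python
import time
from typing import Any, Callable, Iterable, Tuple


def submit_with_retry(
    submit: Callable[[], Any],
    wait_for_status: Callable[[Any], str],
    max_attempts: int = 3,
    backoff_seconds: float = 30.0,
    # Assumed status names; adjust to whatever your runs actually report.
    success_statuses: Iterable[str] = ("Finished", "Completed"),
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[Any, str, int]:
    """Submit a job, wait for it, and resubmit while it does not succeed.

    Returns (last_run, last_status, attempts_used).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    ok = set(success_statuses)
    run, status = None, "NotStarted"
    for attempt in range(1, max_attempts + 1):
        run = submit()                 # start a fresh run
        status = wait_for_status(run)  # block until it reaches a final state
        if status in ok:
            return run, status, attempt
        if attempt < max_attempts:
            # Linear backoff: wait a bit longer after each failed attempt.
            sleep(backoff_seconds * attempt)
    return run, status, max_attempts
```

With the v1 SDK the wiring might be something like `submit=lambda: experiment.submit(pipeline)` and `wait_for_status=lambda run: run.wait_for_completion(show_output=False, raise_on_error=False)`, assuming `wait_for_completion` on the pipeline run returns the final status string. Note that this resubmits the whole pipeline rather than only the failed step, and it only works while the submitting process stays alive.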

Any advice please?

I've tried finding a solution on SO over here - a proposed solution uses external tools which just adds more overhead:

https://stackoverflow.com/questions/68647922/azure-machine-learning-pipeline-how-to-retry-upon-failure

shbijlan commented 2 years ago

@waltherg thank you for sharing this feedback. We have a feature item in our backlog to address this request. If and when it is approved, it will be implemented as part of the planning and release cycle.

waltherg commented 2 years ago

Thank you for your response and I'm excited this is on the roadmap.

I probably just missed it but is there an open source repo I could try and contribute this to?

waltherg commented 1 year ago

Do you have any updates on this? I've recently discovered scheduled pipeline runs via Schedule and ScheduleRecurrence, which seem to work just fine. Retry upon failure of these scheduled runs is pretty essential IMO.

zhongshuai-cao commented 4 months ago

I came across the same requirement: I scheduled a few hundred experiments, and some of the jobs failed. I have the run_id for each, but I can't find any documentation or function in the SDK that takes the job_id and resubmits the job.
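As a stopgap for a batch like this, the bookkeeping can be done on the client side. The sketch below assumes you already have each run's final status keyed by run id, and that you can supply your own `resubmit` callable (for example, one that looks up the configuration you saved when the run was first submitted). Both `resubmit_failed` and `resubmit` are hypothetical names for this example; the SDK call that would rebuild a job from an old run id is exactly what is missing here, so it is left to the caller.

```python
from typing import Callable, Dict, Iterable


def resubmit_failed(
    statuses: Dict[str, str],
    resubmit: Callable[[str], str],
    # Assumed status names for runs that should be retried.
    failed_statuses: Iterable[str] = ("Failed", "Canceled"),
) -> Dict[str, str]:
    """Resubmit every run whose status is in failed_statuses.

    Returns a mapping of old run id -> new run id.
    """
    failed = set(failed_statuses)
    return {
        run_id: resubmit(run_id)  # caller-supplied; returns the new run id
        for run_id, status in statuses.items()
        if status in failed
    }
```

Keeping the returned mapping around makes it possible to trace each retried run back to the one it replaced.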