Azure / azure-sdk-for-python

This repository is for active development of the Azure SDK for Python. For consumers of the SDK we recommend visiting our public developer docs at https://docs.microsoft.com/python/azure/ or our versioned developer docs at https://azure.github.io/azure-sdk-for-python.
MIT License

Use latest version of the dataset when triggering scheduled AzureML pipelines #32576

Open grzjab opened 8 months ago

grzjab commented 8 months ago

Is your feature request related to a problem? Please describe. When scheduling an AzureML pipeline job, there is no way to use the latest version of a data asset at the moment the pipeline is triggered; only the version that is latest when the pipeline is created or modified can be specified.

The example at https://learn.microsoft.com/en-gb/azure/machine-learning/how-to-schedule-pipeline-job?view=azureml-api-2&tabs=python#change-runtime-settings-when-defining-schedule allows specifying an input argument of type azure.ai.ml.Input (https://learn.microsoft.com/en-us/python/api/azure-ai-ml/azure.ai.ml.input?view=azure-python), but offers no option to select the version that is latest when the pipeline is triggered rather than when the pipeline is created.

Code from the example:

pipeline_job = pipeline_with_components_from_yaml(
    training_input=Input(type="uri_folder", path=parent_dir + "/data/"),
    test_input=Input(type="uri_folder", path=parent_dir + "/data/"),
    training_max_epochs=20,
    training_learning_rate=1.8,
    learning_rate_schedule="time-based",
)

Describe the solution you'd like Add a special option that always selects the latest version of the data asset.

Additional context The same issue appears when creating the pipeline through the UI. Selecting version 6 (latest) pins that version, so version 6 is still used in the future even when newer versions are available.
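To make the pinning behaviour described above concrete, here is a minimal sketch of what "resolve latest at creation time" amounts to. It assumes the azure-ai-ml call `ml_client.data.get(name=..., label="latest")`; the helper `pin_latest_version` and the stand-in client are hypothetical and exist only so the snippet runs without a workspace. It is an illustration of the problem, not the service's actual implementation.

```python
from types import SimpleNamespace


def pin_latest_version(ml_client, asset_name: str) -> str:
    """Resolve the 'latest' label once and return a fixed-version reference.

    Hypothetical helper: `ml_client` is assumed to expose
    `data.get(name=..., label=...)` as azure-ai-ml's MLClient does.
    """
    asset = ml_client.data.get(name=asset_name, label="latest")
    return f"azureml:{asset.name}:{asset.version}"


class _FakeDataOperations:
    """Stand-in for MLClient.data, for illustration only."""

    def __init__(self) -> None:
        self.latest = "6"

    def get(self, name: str, label: str) -> SimpleNamespace:
        return SimpleNamespace(name=name, version=self.latest)


fake_client = SimpleNamespace(data=_FakeDataOperations())

# Resolved once, when the pipeline (or schedule) is created.
pinned = pin_latest_version(fake_client, "TrainingData")

# A newer version is registered later; the pinned reference does not change.
fake_client.data.latest = "7"
print(pinned)  # azureml:TrainingData:6
```

Every scheduled run that reuses `pinned` keeps reading version 6, which is the behaviour this request asks to make optional.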

github-actions[bot] commented 8 months ago

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @azureml-github @Azure/azure-ml-sdk.

catalinaperalta commented 8 months ago

Thanks for reaching out @grzjab! Looping in @azureml-github to investigate

eniac871 commented 4 months ago

@grzjab, thanks for your interest in this. Currently we don't support using the latest label when defining an AzureML pipeline schedule. The problem is that it's unclear whether the user wants the latest version or a fixed version of the source when defining the schedule.

VincBar commented 4 months ago

@eniac871 That sounds more like a documentation question than a question about the feature itself? You can easily distinguish those two cases.

I think this feature would be very beneficial. In most scenarios you don't want to run the exact same pipeline with the exact same data again and again.

The only way I am aware of to have scheduled jobs that actually produce new output is to abuse the fact that AML versioning only references folders, so you can change the underlying data. You then need a sourcing job that overwrites the current data, which renders the versioning completely useless.

grzjab commented 4 months ago

@eniac871 That's why I created a feature request to support the latest label. As @VincBar wrote, running the pipeline with the same dataset versions doesn't make sense. In V1 there is an option to use the latest dataset (label = Use always latest).

sdzunenko commented 3 months ago

I'm also keenly interested in this discussion. With the current implementation of schedules for data import jobs, it appears that the entire MLOps pipeline can now be seamlessly managed within the AML Studio. This eliminates the necessity for external tools such as ADF, GitHub Actions, or Azure DevOps pipelines. However, I've identified two critical features that seem to be missing:

  1. Dynamic Source Versioning in Job Schedules: The ability to use the most recent source version automatically in scheduled jobs.
  2. Model Deployment Scheduling: The option to deploy models either based on a schedule or as an integral part of the pipeline workflow.

I'd like to highlight a specific functionality that addresses the first point to some extent. Currently, it's possible to specify the @latest tag for data sources in job configurations, which ensures that the job always uses the most recent version of the specified data source. Here's an example of how this is configured in a YAML job definition:

raw_data:
  type: mltable
  path: azureml:TrainingData@latest

This feature provides flexibility for users who wish to schedule jobs with dynamic data source versioning. They can specify the @latest tag in their YAML configuration to automatically use the latest version of the data source. Conversely, if users need to schedule jobs with a specific version of the data source, they can directly specify that version in the job configuration.
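The two reference forms contrasted above can be sketched in Python as follows. The helper `data_asset_ref` is hypothetical and only builds the string; with azure-ai-ml the result would typically be passed as `Input(type="mltable", path=...)`. Whether a schedule re-resolves `@latest` on every trigger is exactly what this thread is asking about, so treat that part as unverified.

```python
from typing import Optional


def data_asset_ref(name: str, version: Optional[str] = None) -> str:
    """Build a data asset reference string (hypothetical helper).

    No version -> dynamic reference using the @latest label.
    A version  -> fixed reference pinned to that version.
    """
    if version is None:
        return f"azureml:{name}@latest"
    return f"azureml:{name}:{version}"


# Dynamic: meant to follow the most recent version of the asset.
print(data_asset_ref("TrainingData"))       # azureml:TrainingData@latest

# Pinned: always the named version, regardless of newer ones.
print(data_asset_ref("TrainingData", "6"))  # azureml:TrainingData:6
```

Keeping the two forms syntactically distinct is one way to remove the ambiguity raised earlier in the thread: the reference itself states whether the user wants a moving or a fixed version.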

@eniac871, given your expertise, I believe you can provide valuable insights into distinguishing between these two behaviors more effectively. Specifically, how the current implementation facilitates both dynamic source versioning and the potential for model deployment scheduling.

AndreasAF commented 1 month ago

I have the same issue. I need to rerun a pipeline on a schedule, but some of its input data assets have been updated since the schedule was created. The schedule cannot pick up those updates, because using the "latest" version of a data asset from one trigger event to the next is currently not available. This feature would be crucial in our setup, which uses Azure Machine Learning scheduled ML pipelines.