Azure / azure-cli

Azure Command-Line Interface
MIT License

ML extension: Sweep step not triggered in pipeline (status: not started) and indicates fail after while #25101

Open leonieroos opened 1 year ago

leonieroos commented 1 year ago

Dear team,

I have a pipeline with a sweep component that stopped working. The pipeline reports an overall failure because the sweep step never starts, which leaves me without an error message or logs.

The command with extension: az ml job create --file ./pipelines/pipeline_demandmodel_hp.yml

az version: { "azure-cli": "2.42.0", "azure-cli-core": "2.42.0", "azure-cli-telemetry": "1.0.8", "extensions": { "ml": "2.12.1" } }

Within the environment I have azure-ai-ml==1.1.0.

I'm expecting the pipeline to produce child runs and trials for the parameters, as it did a month ago. Instead, the sweep step is never initiated at all, and after a while the pipeline 'fails'. I tried with a registered data set as input as well as with the data passed on from the previous step (which completes with a green tick), and both have the same issue.

image

This is the pipeline with the sweep step:

$schema: https://azuremlschemas.azureedge.net/latest/pipelineJob.schema.json

display_name: Demand Hyperparameter tuning Pipeline
description: Pipeline prepares data and finds best set of parameters
experiment_name: demand_model_demo
type: pipeline

settings:
  default_compute: azureml:train
  default_datastore: azureml:spot_train

inputs:
  model_input:
    type: uri_folder
    path: azureml:test_input_hp@latest
    mode: ro_mount

jobs:
  sweep_step:
    type: sweep
    inputs:
      data: ${{parent.inputs.model_input}}
      start_new_run: True
      register_model: False
      gamma: 0
      sample_weights: True
      reg_alpha: 0
      reg_lambda: 1
    outputs:
      data_out:
        mode: rw_mount
    sampling_algorithm: bayesian
    trial: ../components/component_train_extraparam.yaml
    search_space:
      learning_rate:
        type: choice
        values: [0.05, 0.1, 0.15]
      max_depth:
        type: choice
        values: [5, 7, 10, 15, 20]
      n_estimators:
        type: choice
        values: [70, 100, 120, 150]
      max_delta_step:
        type: uniform
        min_value: 0.0
        max_value: 3.0
    objective:
      goal: minimize
      primary_metric: probability_difference
    limits:
      max_total_trials: 50
      max_concurrent_trials: 4
      timeout: 14400

#######

Component:

$schema: https://azuremlschemas.azureedge.net/latest/commandComponent.schema.json
name: train_demandmodel
display_name: Training Demand Model
type: command
inputs:
  data:
    type: uri_folder
  start_new_run:
    type: string
    default: True
  register_model:
    type: string
    default: False
  learning_rate:
    type: number
    default: 0.1
  n_estimators:
    type: integer
    default: 130
  max_depth:
    type: integer
    default: 10
  max_delta_step:
    type: number
    default: 0
  sample_weights:
    type: string
    default: True
  gamma:
    type: number
    default: 0
  reg_alpha:
    type: number
    default: 0
  reg_lambda:
    type: number
    default: 1
outputs:
  data_out:
    type: uri_folder
code: ../
environment: azureml:optimiser@latest
is_deterministic: false
command: >-
  python aml_train.py  
    --data ${{inputs.data}}  
    --data_out ${{outputs.data_out}}
    --start_new_run ${{inputs.start_new_run}}
    --register_model ${{inputs.register_model}}
    --learning_rate ${{inputs.learning_rate}}
    --n_estimators ${{inputs.n_estimators}}
    --max_depth ${{inputs.max_depth}}
    --max_delta_step ${{inputs.max_delta_step}}
    --sample_weights ${{inputs.sample_weights}}
    --gamma ${{inputs.gamma}}
    --reg_alpha ${{inputs.reg_alpha}}
    --reg_lambda ${{inputs.reg_lambda}}

yonzhan commented 1 year ago

route to CXP team

ghost commented 1 year ago

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @azureml-github.

Issue Details
Author: leonieroos
Assignees: -
Labels: `Service Attention`, `Machine Learning`, `customer-reported`, `Auto-Assign`
Milestone: -

navba-MSFT commented 1 year ago

Adding Service team to look into this.

@azureml-github Could you please look into this and provide an update?

leonieroos commented 1 year ago

Hi team, any news on this? Thank you

luigiw commented 1 year ago

@wangchao1230 Can you help to triage this issue? Thank you.

ghost commented 1 year ago

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @shbijlan.

Issue Details
Author: leonieroos
Assignees: -
Labels: `Service Attention`, `Machine Learning`, `customer-reported`, `ML-Pipelines`, `Auto-Assign`
Milestone: -

wangchao1230 commented 1 year ago

@leonieroos Could you share your pipeline job name (id)?

A screenshot of the Job Overview panel would be helpful. Click the "Job overview" button at the upper right of your current screenshot.

leonieroos commented 1 year ago

Hi @wangchao1230 job id: mango_turtle_zzbgpfx7r2

image

Thanks!

brynn-code commented 1 year ago

@leonieroos Hi, when a one-line command is written across multiple lines in YAML, the syntax should be:

command: >-
  python aml_train.py  
  --data ${{inputs.data}}  
  --data_out ${{outputs.data_out}}

But in your YAML the argument lines have extra leading whitespace:

command: >-
  python aml_train.py  
    --data ${{inputs.data}}  
    --data_out ${{outputs.data_out}}

Could you please try removing the leading whitespace before the arguments? In a folded scalar (`>-`), lines indented deeper than the first line are not folded, so their line breaks are kept as \n in the resulting command. See https://stackoverflow.com/questions/3790454/how-do-i-break-a-string-in-yaml-over-multiple-lines for more details; there is a table in the answer.

image
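
To illustrate the folding behaviour described in the comment above, here is a minimal sketch in plain Python. It is a simplified model of `>-` folding, not a YAML parser: it only handles non-blank lines, and the name `fold_lines` and the sample paths `in_dir` and `out_dir` are made up for this example.

```python
def fold_lines(block_lines):
    """Simplified model of YAML '>-' folding for non-blank lines only."""
    # The first content line sets the block's base indentation.
    first = block_lines[0]
    indent = len(first) - len(first.lstrip(" "))
    body = [line[indent:] for line in block_lines]

    folded = body[0]
    for prev, cur in zip(body, body[1:]):
        # A line break is folded into a single space only when both
        # neighbouring lines sit at the base indentation. If either line
        # is more indented, the break is kept as a newline.
        keep_break = prev.startswith(" ") or cur.startswith(" ")
        folded += ("\n" if keep_break else " ") + cur
    return folded


# Arguments aligned with the first line: folded into one command line.
aligned = ["  python aml_train.py", "  --data in_dir", "  --data_out out_dir"]
# Arguments indented deeper than the first line: line breaks survive.
indented = ["  python aml_train.py", "    --data in_dir", "    --data_out out_dir"]

print(repr(fold_lines(aligned)))
print(repr(fold_lines(indented)))
```

The first call yields a single-line command, while the second keeps a newline (plus the extra indentation) before each argument, which is the difference between the two `command:` snippets quoted above.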

wangchao1230 commented 1 year ago

@leonieroos As for the UI not showing a status or error message, I am wondering whether it's a UI issue or a run history index refresh issue. Could you confirm whether the error message/status shows up if you wait a few minutes and refresh the UI?

leonieroos commented 1 year ago

Hi @brynn-code, thank you for picking that up. I have changed it to exactly match the normal train step (type command instead of sweep, which works) and the same issue remains.

@wangchao1230, the refresh is not showing anything different. However, the run I started now is just stalled between the steps and shows status "not started" in the pipeline job overview. It has been stalled for 2 hours now:

image

It seems like I am missing a detail? The UI recognizes the sweep step as such in the canvas and shows the search space. When I leave out a variable from the component, it does raise an error about the component's command, so it seems to be treating it as a command.

image

brynn-code commented 1 year ago

@leonieroos Could you please elaborate on 'changed it to match the normal train step'? The multi-line command issue is not related to the component type: whether the step is a command step or a sweep step, the 'command' field must be in the right format for execution.
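
Since the format of the 'command' field matters for either step type, one way to check a component file before submitting is a small text scan like the sketch below. It is a rough heuristic using only the standard library, not a YAML parser: the function name `more_indented_lines` is hypothetical, and it only recognizes plain `>`, `>-` and `>+` headers without an explicit indentation indicator.

```python
import re


def more_indented_lines(yaml_text, key="command"):
    """Return (line_number, text) pairs for lines of a folded block scalar
    under `key` that are indented deeper than the block's first line."""
    lines = yaml_text.splitlines()
    header = re.compile(r"^(\s*)" + re.escape(key) + r":\s*>[-+]?\s*$")
    flagged = []
    for i, line in enumerate(lines):
        match = header.match(line)
        if not match:
            continue
        key_indent = len(match.group(1))
        base = None
        # Line numbers are 1-based, so lines[i + 1] is line i + 2.
        for number, body in enumerate(lines[i + 1:], start=i + 2):
            if not body.strip():
                continue  # blank lines do not end the block
            depth = len(body) - len(body.lstrip(" "))
            if depth <= key_indent:
                break  # dedent: the block scalar has ended
            if base is None:
                base = depth  # first content line sets the base indentation
            elif depth > base:
                flagged.append((number, body.strip()))
    return flagged


component = """\
command: >-
  python aml_train.py
    --data ${{inputs.data}}
    --data_out ${{outputs.data_out}}
"""

for number, text in more_indented_lines(component):
    print(f"line {number}: more indented than the first command line: {text}")
```

An empty result only means the command folds into a single line. It says nothing about whether the sweep step itself will start.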

leonieroos commented 1 year ago

So I changed the step type from sweep to command in the pipeline, pointing to the same component command, and that works once the search space parameters are put back under inputs.