Azure / bicep-types-az

Bicep type definitions for ARM resources
MIT License

I am trying to schedule my azure ml command job. It runs fine the first time but from second occurrence, it directly goes into completed state! #2063

Open MakarandBatchu opened 8 months ago

MakarandBatchu commented 8 months ago

I am trying to schedule a simple azure ml command job to print hello world. Below is the bicep code I am using.

@description('Specifies the name of the Azure Machine Learning workspace where the command job will be created.')
param workspaceName string = 'mlw-platform-azureml'

@description('Specifies the name of the Azure Machine Learning compute instance/cluster on which job will be run.')
param computeName string = 'platform-azureml-cluster001'

@description('Specifies the name of the Azure Machine Learning experiment under which job will be created.')
param experimentName string = 'cosmosdb-example'

@description('Specifies the environment to run command job.')
param environmentName string = 'platform-azureml-cosmosdb-env'

@description('Specifies the name of the Azure Machine Learning job to be created.')
param jobName string = 'new_schedule_job_200224_8'

resource environment 'Microsoft.MachineLearningServices/workspaces/environments@2023-10-01' existing = {
  name: '${workspaceName}/${environmentName}'
}

resource environmentVersion 'Microsoft.MachineLearningServices/workspaces/environments/versions@2023-10-01' existing = {
  parent: environment
  name: environment.properties.latestVersion
}

resource compute 'Microsoft.MachineLearningServices/workspaces/computes@2023-10-01' existing = {
  name: '${workspaceName}/${computeName}'
}

resource jobResource 'Microsoft.MachineLearningServices/workspaces/schedules@2023-10-01' = {
  name: '${workspaceName}/${jobName}'
  properties: {
    action: {
      actionType: 'CreateJob'
      jobDefinition: {
        jobType: 'Command'
        command: 'echo hello world'
        environmentId: environmentVersion.id
        experimentName: experimentName
        computeId: compute.id
        description: 'Schedule for running model training'
        displayName: 'Model Train Job'
      }
    }
    trigger: {
      triggerType: 'Cron'
      expression: '42,52 13 20 * *'
    }
  }
}

I expect the command job to run at 13:42 and 13:52 (42 and 52 minutes past 1:00 PM) on day 20 of each month, but that does not happen. Only the first occurrence triggers the job properly; from the second occurrence onward, the job goes directly into the 'Completed' state without running.
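To double-check the reading of the cron expression, here is a minimal Python sketch that expands the minute, hour, and day-of-month fields into concrete fire times. It is not how Azure ML evaluates the trigger: `expand_field` and `fire_times` are hypothetical helpers, they only handle `*` and comma-separated values, and they ignore the month and weekday fields and any time zone setting.

```python
import calendar
from datetime import datetime


def expand_field(field: str, lo: int, hi: int) -> list[int]:
    """Expand a cron field that is either '*' or comma-separated integers."""
    if field == "*":
        return list(range(lo, hi + 1))
    return sorted(int(part) for part in field.split(",") if lo <= int(part) <= hi)


def fire_times(expression: str, year: int, month: int) -> list[datetime]:
    """List the fire times of a 5-field cron expression within one month.

    Assumption: the month and weekday fields are '*' and are ignored here.
    """
    minute, hour, day, _month, _weekday = expression.split()
    last_day = calendar.monthrange(year, month)[1]
    return [
        datetime(year, month, d, h, m)
        for d in expand_field(day, 1, last_day)
        for h in expand_field(hour, 0, 23)
        for m in expand_field(minute, 0, 59)
    ]


# The expression from the Bicep template, expanded for February 2024
for t in fire_times("42,52 13 20 * *", 2024, 2):
    print(t.isoformat())
```

For February 2024 this prints `2024-02-20T13:42:00` and `2024-02-20T13:52:00`, so two runs ten minutes apart are expected each month.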


I considered scheduling an Azure ML pipeline job instead of a command job, but the documentation for Azure ML pipeline jobs is vague, so I was unable to implement it.

microsoft-github-policy-service[bot] commented 8 months ago

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @azureml-github. Please see https://aka.ms/biceptypesinfo for troubleshooting help.

Uncle-Yuanl commented 8 months ago

Hi, I should have replied to your post on learn.microsoft. One possibility is that the output settings hit the reuse strategy. You can check the properties (outputs and logs) of the pipeline component: if the log of this run is identical to the log of the first run, the run was probably reused. Here's the reference: https://github.com/Azure/MachineLearningNotebooks/issues/270

If it still fails in the schedule scenario, check whether the job definition was successfully attached to the schedule. If not, you can set the job definition via the Azure CLI or the SDK for your language.

I think it may be a bug: the pipeline's job definition is visible in the portal UI, but when listing the schedules with the SDK, the schedule has no job definition.

Here's the code:

from datetime import datetime

from azure.ai.ml import MLClient
from azure.ai.ml.constants import TimeZone
from azure.ai.ml.entities import JobSchedule, RecurrencePattern, RecurrenceTrigger
from azure.identity import DefaultAzureCredential

# subscription_id, resource_group, and workspace must be defined beforehand
ml_client = MLClient(DefaultAzureCredential(), subscription_id, resource_group, workspace)
# list existing jobs to find the job name (not the display name) to schedule
pipeline_jobs = ml_client.jobs.list() 
for pj in pipeline_jobs: 
    print(pj.display_name, pj.name) 

schedule_name = "sdk_schedule" 
schedule_start_time = datetime.utcnow() 
schedule_pattern = RecurrencePattern( 
    hours=12, 
    minutes=0, 
    week_days=["friday"] 
) 

recurrence_trigger = RecurrenceTrigger( 
    frequency="week", 
    interval=1, 
    schedule=schedule_pattern, 
    start_time=schedule_start_time, 
    time_zone=TimeZone.ROMANCE_STANDARD_TIME, 
) 

job_schedule = JobSchedule( 
    name=schedule_name, 
    trigger=recurrence_trigger, 
    # create_job="FR_Deployment_Up",  # this is display name not name 
    create_job="76c7ae5e-496f-4de9-b5f5-48904acc64a3" 
) 

job_schedule = ml_client.schedules.begin_create_or_update( 
    schedule=job_schedule 
).result() 
print(job_schedule) 

# check job definition again
schedules = ml_client.schedules.list()
print([s.name for s in schedules])

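
To spot schedules that have no job definition attached, a small helper along these lines could be run on the listed schedules. This is only a sketch: `schedules_without_job` is a hypothetical function, and it assumes each schedule object exposes the job definition as a `create_job` attribute, as the `JobSchedule` entity above does. The stand-in objects are made up so the snippet runs without a workspace.

```python
from types import SimpleNamespace


def schedules_without_job(schedules):
    """Return names of schedules whose create_job attribute is missing or None."""
    return [s.name for s in schedules if getattr(s, "create_job", None) is None]


# Made-up stand-ins for schedule objects; with a real workspace you would
# pass ml_client.schedules.list() instead.
fake_schedules = [
    SimpleNamespace(name="sdk_schedule", create_job="some-job-name"),
    SimpleNamespace(name="ui_schedule", create_job=None),
]

print(schedules_without_job(fake_schedules))
```

Any name it returns is a schedule worth re-creating with an explicit `create_job`.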
MakarandBatchu commented 8 months ago

Hi @Uncle-Yuanl

I think it is falling into the automatic reuse strategy in my scenario as well, but I can't alter this parameter because I am using Bicep to schedule the command job, and Bicep does not have an option to set it.