Azure / batch-shipyard

Simplify HPC and Batch workloads on Azure
MIT License

Job Scheduling (shipyard-jmtask) Error Message: "ModuleNotFoundError: No module named 'ruamel'" #306

Closed: speschl closed this issue 5 years ago

speschl commented 5 years ago

I'm trying to create a scheduled job and have had no problems with this configuration in the past. Recently, the 'shipyard-jmtask' task has started failing with nearly the same script (the only changes are to environment variables). The error associated with the task is:

2019-08-19 23:28:03,087Z DEBUG __main__:_create_credentials:109 creating batch client for account url: https://oippltanabatch410q.eastus2.batch.azure.com/
2019-08-19 23:28:03,098Z DEBUG __main__:main:245 loading pickled task map
Traceback (most recent call last):
  File "/opt/batch-shipyard/recurrent_job_manager.py", line 273, in <module>
    main()
  File "/opt/batch-shipyard/recurrent_job_manager.py", line 247, in main
    task_map = pickle.load(f, fix_imports=True)
ModuleNotFoundError: No module named 'ruamel'

Here are my configuration files for the scheduled job. Actual names have been changed for security reasons.

config.yaml:

batch_shipyard:
  storage_account_settings: myBlobStorage
  storage_entity_prefix: shipyard
  generated_sas_expiry_days: null
  autogenerated_task_id:
    prefix: qa-task-
    zfill_width: 5
  delay_docker_image_preload: false
global_resources:
  docker_images:
    - repository.io/ETL_image:v1
    - repository.io/Modeling_image:v1

credentials.yaml:

credentials:
  batch:
    account_key: ALl************************************************************
    account_service_url: https://**************.eastus2.batch.azure.com
  storage:
    MyBlobStorage:
      account: MyBlobStorage
      account_key: tM*************************************************************
      endpoint: core.windows.net     
  docker_registry:
    repo.azurecr.io:
      username: oip************
      password: rO*********************

job.yaml:

job_specifications:
- id: myJobId
  auto_complete: true
  environment_variables:
    version: version 2019.08.21.12:00-2.0
    logging_level: INFO
    jobName: Myjob
    store_name: oip***********
    vault_uri: https://**********.vault.azure.net/
    app_id: b65**********************
    app_secret: Kw+**********************
    tenant_ID: c1e**********************
    parquet_filename: Release1_
  max_task_retries: 1
  max_wall_time: 24:00:00
  retention_time: 24:00:00
  priority: 100
  user_identity:
    default_pool_admin: false
  auto_pool:
    keep_alive: false
    pool_lifetime: job
  recurrence:
    schedule:
      do_not_run_after: null
      do_not_run_until: null
      recurrence_interval: 24:00:00
      start_window: null
    job_manager:
      allow_low_priority_node: true
      monitor_task_completion: true
      run_exclusive: false
  allow_run_on_missing_image: true
  remove_container_after_exit: true
  tasks:
  - id: etl_task
    docker_image: repository.io/ETL_image:v1
    environment_variables:
      server_address: oip******************.database.windows.net
      database_name: DatabaseName
      sql_secret_name: DatabaseAuthenticationSecret
      api_client_id: 62************************
      api_client_secret: d************************
      api_tenant_id: c************************
      api_authority_url: https://login.microsoftonline.com/c************************
      api_url: https://************************
      output_path: wsi/Release1/Input/
      input_path: /IOT/Measurement/
      prp_log_path: /PRP20/etl_logs/
      number_of_days_back: 7
      number_of_days_back_temperature: 60
      end_date_override: ''
      csv_header: '[...]'      
      result_design: '[...]'
    max_task_retries: 1
    max_wall_time: 10:00:00
    retention_time: 24:00:00
    output_data:
      azure_storage:
      - storage_account_settings: MyBlobStorage
        remote_path: output/dir
        local_path: null
        is_file_share: false
        blobxfer_extra_options: null   
    remove_container_after_exit: true
    additional_docker_run_options: [-w=/app]     
    default_working_dir: container
  - id: modeling_task
    docker_image: repository.io/Modeling_image:v1
    environment_variables:
      input_path: wsi/Release1/Input/Release1_
      output_path: wsi/Release1/Output
      file_name: RiskModelResult_  
      threshold: 0.4
      pickle_path: wsi/*
      client_secret: Kw+****************************
      client_ID: b65*****************
    max_task_retries: 1
    max_wall_time: 02:00:00
    retention_time: 24:00:00
    depends_on:
    - etl_task

pool.yaml:

pool_specification:
  id: qa-pool
  vm_configuration:
    platform_image:
      publisher: microsoft-azure-batch
      offer: ubuntu-server-container
      sku: 16-04-lts
      version: latest
      native: false
      license_type: null
  vm_size: Standard_E2s_v3 
  vm_count:
    dedicated: 1
    low_priority: 0
  resize_timeout: 00:15:00
  inter_node_communication_enabled: false
  reboot_on_start_task_failed: false
  attempt_recovery_on_unusable: false
  upload_diagnostics_logs_on_unusable: true
  block_until_all_global_resources_loaded: true
  transfer_files_on_pool_creation: false

Any help with this is greatly appreciated!

alfpark commented 5 years ago

Would you happen to have upgraded your local installation of Batch Shipyard, but the pool you're using was created with an older version?

speschl commented 5 years ago

Yes, most of the scripts were created between the end of 2018 and the first months of 2019. I recently had to get a new machine, so I now have version 3.7.0 installed. However, we spin up a new pool instance each time the schedule runs.

alfpark commented 5 years ago

I am unable to repro this failure.

Can you perform an upgrade to 3.8.1? Instructions: https://github.com/Azure/batch-shipyard/blob/master/docs/01-batch-shipyard-installation.md#upgrading-to-new-releases

Please re-submit your job (sorry I missed your auto pool spec) and see if you can repro.

speschl commented 5 years ago

I updated my instance to 3.8.1 and re-submitted my job. I still get the error. Here is the shipyard-jmtask configuration.

{
  "id": "shipyard-jmtask",
  "jobId": "***_Manual_NewVersion:job-1",
  "odata.metadata": "https://***.eastus2.batch.azure.com/$metadata#tasks/@Element",
  "url": "https://***.eastus2.batch.azure.com/jobs/***_Manual_NewVersion:job-1/tasks/shipyard-jmtask",
  "eTag": "0x701CE1722770000",
  "creationTime": "2019-08-29T20:57:37.1582165Z",
  "lastModified": "1601-01-01T00:00:00Z",
  "state": "completed",
  "stateTransitionTime": "2019-08-29T21:03:16.412917Z",
  "previousState": "running",
  "previousStateTransitionTime": "2019-08-29T21:02:31.463327Z",
  "commandLine": "/opt/batch-shipyard/recurrent_job_manager.sh",
  "resourceFiles": [
    {
      "httpUrl": "https://***.blob.core.windows.net/shipyardprp20qarf-***-qa-pool/jobschedules/oip-prp20-qa-***_Manual_NewVersion/taskmap.pickle?se=2049-08-21T20%3A57%3A36Z&sp=r&sv=2018-11-09&sr=b&sig=redacted",
      "filePath": "taskmap.pickle",
      "fileMode": "0640"
    }
  ],
  "containerSettings": {
    "containerRunOptions": "--rm",
    "imageName": "mcr.microsoft.com/azure-batch/shipyard:3.8.1-cargo"
  },
  "environmentSettings": [
    {
      "name": "version",
      "value": "version 2019.08.21.12:00-2.0"
    },
    {
      "name": "logging_level",
      "value": "INFO"
    },
    {
      "name": "jobName",
      "value": "oip-prp20-qa-job-"
    },
    {
      "name": "store_name",
      "value": "redacted"
    },
    {
      "name": "vault_uri",
      "value": "redacted"
    },
    {
      "name": "app_id",
      "value": "redacted"
    },
    {
      "name": "app_secret",
      "value": "redacted"
    },
    {
      "name": "tenant_ID",
      "value": "redacted"
    },
    {
      "name": "parquet_filename",
      "value": "Release1_"
    }
  ],
  "userIdentity": {
    "autoUser": {
      "scope": "pool",
      "elevationLevel": "admin"
    }
  },
  "authenticationTokenSettings": {
    "access": [
      "job"
    ]
  },
  "constraints": {
    "maxWallClockTime": "P10675199DT2H48M5.4775807S",
    "retentionTime": "P7D",
    "maxTaskRetryCount": 1
  },
  "executionInfo": {
    "startTime": "2019-08-29T21:03:15.272126Z",
    "endTime": "2019-08-29T21:03:16.412917Z",
    "exitCode": 1,
    "containerInfo": {
      "containerId": "6b3113502f561f060325a7978474449547dbd424e50c1f43efa4d2487a6726f3",
      "state": "created"
    },
    "failureInfo": {
      "category": "UserError",
      "code": "FailureExitCode",
      "message": "The task exited with an exit code representing a failure",
      "details": [
        {
          "name": "Message",
          "value": "The task exited with an exit code representing a failure"
        }
      ]
    },
    "result": "failure",
    "retryCount": 1,
    "lastRetryTime": "2019-08-29T21:03:15.215316Z",
    "requeueCount": 0
  },
  "nodeInfo": {
    "affinityId": "TVM:tvmps_e2b58961451c0919b4a698e1b20ddebcb27e195d7814af4b6cfe7e90a262319b_d",
    "nodeUrl": "https://redacted.eastus2.batch.azure.com/pools/prp2-qa-pool_f463abc6-5cce-4182-be7c-23f73262ace4/nodes/tvmps_e2b58961451c0919b4a698e1b20ddebcb27e195d7814af4b6cfe7e90a262319b_d",
    "poolId": "prp2-qa-pool_f463abc6-5cce-4182-be7c-23f73262ace4",
    "nodeId": "tvmps_e2b58961451c0919b4a698e1b20ddebcb27e195d7814af4b6cfe7e90a262319b_d",
    "taskRootDirectory": "workitems/***_Manual_NewVersion/job-1/shipyard-jmtask",
    "taskRootDirectoryUrl": "https://***.eastus2.batch.azure.com/pools/prp2-qa-pool_f463abc6-5cce-4182-be7c-23f73262ace4/nodes/tvmps_e2b58961451c0919b4a698e1b20ddebcb27e195d7814af4b6cfe7e90a262319b_d/files/workitems/***_Manual_NewVersion/job-1/shipyard-jmtask"
  }
}

alfpark commented 5 years ago

@speschl As an aside, I redacted some additional sensitive information above - you may want to consider rotating your keyvault app secret.

alfpark commented 5 years ago

Ok, I was able to repro this.
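
For context, here is a minimal standard-library sketch of how this class of error can arise. The traceback shows `pickle.load` raising `ModuleNotFoundError`, which is what pickle does when the pickled data references a type from a module that cannot be imported where the data is read. The module `fake_yaml_types` and the class `TaggedFloat` are hypothetical stand-ins, and the assumption that the task map carried a non-string value typed by the YAML library is inferred from the error, not confirmed in this thread.

```python
import pickle
import sys
import types

# Hypothetical stand-in for a third-party package that is installed where the
# pickle is written but missing where it is read.
module_name = "fake_yaml_types"
fake_module = types.ModuleType(module_name)


class TaggedFloat(float):
    """A float subclass standing in for a library-specific scalar type."""


# Make the class look as if it were defined at the top level of the fake module.
TaggedFloat.__module__ = module_name
TaggedFloat.__qualname__ = "TaggedFloat"
fake_module.TaggedFloat = TaggedFloat
sys.modules[module_name] = fake_module

# "Writer" side: the module is importable, so pickling succeeds.
payload = pickle.dumps({"threshold": TaggedFloat(0.4)})
# A plain str value pickles without referencing the module at all.
safe_payload = pickle.dumps({"threshold": "0.4"})

# "Reader" side: simulate an environment where the package is not installed.
del sys.modules[module_name]

try:
    pickle.loads(payload)
    error_message = None
except ModuleNotFoundError as exc:
    error_message = str(exc)

print(error_message)
print(pickle.loads(safe_payload))
```

The first payload fails to load because unpickling has to import the module that defines `TaggedFloat`; the second loads fine because it contains only built-in types.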

To mitigate this before a hotfix is available, please ensure that all of your environment variable values are strings in the YAML. For example, the threshold env var:

    environment_variables:
      input_path: wsi/Release1/Input/Release1_
      output_path: wsi/Release1/Output
      file_name: RiskModelResult_ 
      threshold: '0.4'   # <-- wrap in quotes to explicitly make a string
      pickle_path: wsi/*
      client_secret: Kw+****************************
      client_ID: b65*****************
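
A rough way to spot such values before submitting is to walk the loaded jobs configuration and report every `environment_variables` entry that is not a string. This is only a sketch and not part of Batch Shipyard: the function name `find_non_string_env_values` is made up, and the sample dict is an abbreviated, hand-written stand-in for what a YAML loader might return for the job.yaml above.

```python
from typing import Any, Iterator, Tuple


def find_non_string_env_values(node: Any, path: str = "") -> Iterator[Tuple[str, Any]]:
    """Yield (path, value) for each environment_variables entry that is not a str."""
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{path}.{key}" if path else str(key)
            if key == "environment_variables" and isinstance(value, dict):
                for name, env_value in value.items():
                    if not isinstance(env_value, str):
                        yield f"{child}.{name}", env_value
            else:
                yield from find_non_string_env_values(value, child)
    elif isinstance(node, list):
        for index, item in enumerate(node):
            yield from find_non_string_env_values(item, f"{path}[{index}]")


# Abbreviated stand-in for the parsed job.yaml (assumed shape, hypothetical data).
job_config = {
    "job_specifications": [
        {
            "id": "myJobId",
            "environment_variables": {"logging_level": "INFO"},
            "tasks": [
                {
                    "id": "etl_task",
                    "environment_variables": {
                        "number_of_days_back": 7,  # unquoted in YAML, so an int
                        "end_date_override": "",
                    },
                },
                {
                    "id": "modeling_task",
                    "environment_variables": {
                        "threshold": 0.4,  # unquoted in YAML, so a float
                        "pickle_path": "wsi/*",
                    },
                },
            ],
        }
    ]
}

offenders = list(find_non_string_env_values(job_config))
for location, value in offenders:
    print(f"{location}: {value!r} ({type(value).__name__})")
```

Each reported entry is a candidate for quoting in the YAML, as with `threshold: '0.4'` above.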

speschl commented 5 years ago

Thanks @alfpark for the heads-up about the sensitive info; I have rotated the secret. I quoted the environment variable values so they are all strings, and that seems to work! The shipyard-jmtask completed and the other two tasks have begun. Thank you so much for your help!