Azure / batch-shipyard

Simplify HPC and Batch workloads on Azure
MIT License

Deployment of Standard_NC4as_T4_v3 fails if GPU drivers are specified #370

Open fayora opened 3 years ago

fayora commented 3 years ago

Problem Description

If I deploy a pool with Standard_NC4as_T4_v3 without the gpu:nvidia_driver:source specification in pool.yaml, the pool succeeds but the NVIDIA drivers are not installed.

If I specify gpu:nvidia_driver:source, I get an error: local variable 'gpu_driver' referenced before assignment

The same pool.yaml, including the gpu:nvidia_driver:source setting, works fine with Standard_NC6s_v3.

Batch Shipyard Version

3.9.1

Steps to Reproduce

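Try to deploy a pool with vm_size set to Standard_NC4as_T4_v3 and gpu:nvidia_driver:source set in pool.yaml (see the redacted configuration below).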

Expected Results

Pool is deployed

Actual Results

Error is returned when gpu:nvidia_driver:source specification is provided in pool.yaml:

2021-09-21 09:02:21.573 INFO - uploading file /tmp/_MEIRpaARG/scripts/shipyard_docker_exec_task_runner.sh as 'shipyard_docker_exec_task_runner.sh'
Traceback (most recent call last):
  File "shipyard.py", line 3136, in <module>
  File "site-packages/click/core.py", line 764, in __call__
  File "site-packages/click/core.py", line 717, in main
  File "site-packages/click/core.py", line 1137, in invoke
  File "site-packages/click/core.py", line 1137, in invoke
  File "site-packages/click/core.py", line 956, in invoke
  File "site-packages/click/core.py", line 555, in invoke
  File "site-packages/click/decorators.py", line 64, in new_func
  File "site-packages/click/core.py", line 555, in invoke
  File "shipyard.py", line 1546, in pool_add
  File "convoy/fleet.py", line 3451, in action_pool_add
  File "convoy/fleet.py", line 1849, in _add_pool
  File "convoy/fleet.py", line 1555, in _construct_pool_object
UnboundLocalError: local variable 'gpu_driver' referenced before assignment
[9269] Failed to execute script shipyard
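
For readers unfamiliar with this error class: the traceback above ends in an UnboundLocalError, which Python raises when a function reads a local variable that no executed branch has assigned. The sketch below is not the code from convoy/fleet.py. It is a minimal, hypothetical reconstruction of how this pattern usually arises, assuming the driver variable is only assigned when the VM size matches a known list. The names KNOWN_GPU_SIZES, pick_driver_buggy, and pick_driver_fixed are invented for illustration.

```python
from typing import Optional

# Hypothetical set of recognized GPU VM sizes (an assumption for this sketch,
# not the list Batch Shipyard actually uses).
KNOWN_GPU_SIZES = {"standard_nc6s_v3"}


def pick_driver_buggy(vm_size: str, custom_source: Optional[str]) -> Optional[str]:
    """Assigns gpu_driver only on some branches, mirroring the failure mode."""
    if custom_source is not None:
        if vm_size.lower() in KNOWN_GPU_SIZES:
            gpu_driver = custom_source
        # No else branch: an unrecognized size leaves gpu_driver unassigned.
    else:
        gpu_driver = None
    return gpu_driver  # UnboundLocalError if no branch above assigned it


def pick_driver_fixed(vm_size: str, custom_source: Optional[str]) -> Optional[str]:
    """Initializes gpu_driver up front and fails with an explicit message."""
    gpu_driver = None
    if custom_source is not None:
        if vm_size.lower() not in KNOWN_GPU_SIZES:
            raise ValueError(
                f"custom GPU driver requested for unrecognized VM size: {vm_size}"
            )
        gpu_driver = custom_source
    return gpu_driver


if __name__ == "__main__":
    source = "NVIDIA-Linux-x86_64-470.57.02.run"
    print(pick_driver_buggy("Standard_NC6s_v3", source))
    try:
        pick_driver_buggy("Standard_NC4as_T4_v3", source)
    except UnboundLocalError as exc:
        print("buggy variant failed:", type(exc).__name__)
```

Under these assumptions, the buggy variant reproduces both symptoms reported here: no error when the source is omitted, and an UnboundLocalError when a source is given for an unrecognized size. Whether the real cause is a missing VM-size entry is a guess that would need to be confirmed against convoy/fleet.py.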

Redacted Configuration

pool.yaml

pool_specification:
  id: test-cluster-gpus-t4
  vm_configuration:
    platform_image:
      publisher: canonical
      offer: ubuntuserver
      sku: 18.04-lts
      native: true
  vm_count:
    dedicated: 1
  vm_size: Standard_NC4as_T4_v3
  autoscale:
    evaluation_interval: 00:05:00
    formula: |-
      startingNumberOfVMs = 1;
      maxNumberofVMs = 4;
      pendingTaskSamplePercent = $PendingTasks.GetSamplePercent(300 * TimeInterval_Second);
      pendingTaskSamples  = 70 > pendingTaskSamplePercent ? startingNumberOfVMs : avg($PendingTasks.GetSample(300 * TimeInterval_Second));
      vmForPendingTask = pendingTaskSamples <= 1 ? 1 : pendingTaskSamples;
      $TargetDedicatedNodes=min(maxNumberofVMs, vmForPendingTask);
      $NodeDeallocationOption = taskcompletion;
  gpu:
    nvidia_driver:
      source: https://us.download.nvidia.com/tesla/470.57.02/NVIDIA-Linux-x86_64-470.57.02.run

config.yaml

batch_shipyard:
  storage_account_settings: storage_source
global_resources:
  docker_images:
  - <<REDACTED>>
  additional_registries:
    docker:
    - <<REDACTED>>.azurecr.io
  volumes:
    shared_data_volumes:
      shared_storage_vol:
        volume_driver: azurefile
        storage_account_settings: storage_mount
        azure_file_share_name: <<REDACTED>>
        container_path: /mnt/integrate/<<REDACTED>>
        mount_options:
        - file_mode=0777
        - dir_mode=0777
        - mfsymlinks
        bind_options: rw


Additional Comments

I also tried source: https://us.download.nvidia.com/tesla/460.73.01/NVIDIA-Linux-x86_64-460.73.01.run, which deploys without issues on other NC-series sizes (e.g., NC6s v3), and got the same error.

ziyuang commented 2 years ago

The master branch seems a bit old; will there be a release in the near future with this fix?

Davidnet commented 2 years ago

I am also encountering this. Do you think there will be any new releases of batch-shipyard?