Azure / batch-shipyard

Simplify HPC and Batch workloads on Azure
MIT License

v3.9.1 nodeprep fails with ERROR - Intel MPI not found #350

Open themorey opened 4 years ago

themorey commented 4 years ago

Problem Description

Pool node fails to execute start task shipyard_nodeprep.sh with the following error:

2020-05-19T19:16:52,483089149+0000 - ERROR - Intel MPI not found

Batch Shipyard Version

3.9.1

Steps to Reproduce

Submit a job. The pool attempts to resize, but the start task fails on the new node:

Start task failed
FailureExitCode: The task exited with an exit code representing a failure

Expected Results

Job runs

Actual Results

The shipyard_nodeprep.sh start task script appears to look for mpivars.sh in the wrong location. The relevant check is at lines 1597-1603 of the script:

        # check for intel mpi
        if [ -f /opt/intel/compilers_and_libraries/linux/mpi/bin64/mpivars.sh ]; then
            log INFO "Intel MPI found"
        else
            log ERROR "Intel MPI not found"
            exit 1
        fi

When I SSH into the node that was created, mpivars.sh is present, but under versioned directories rather than the path the script checks:

# find /opt/intel/ -name mpivars.sh
/opt/intel/compilers_and_libraries_2019.5.281/linux/mpi/intel64/bin/mpivars.sh
/opt/intel/compilers_and_libraries_2018.5.274/linux/mpi/intel64/bin/mpivars.sh
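A more tolerant check could look roughly like the sketch below. This is not the project's fix, only an illustration under assumptions: the function name find_mpivars is hypothetical, and the versioned directory pattern is taken solely from the find output above. It tries the path the script already checks, then falls back to the versioned layout and keeps the lexicographically last match, which is only a rough proxy for the newest version.

```shell
# Sketch only: find_mpivars is a hypothetical helper, not part of Batch Shipyard.
find_mpivars() {
    # Install root; defaults to /opt/intel as seen on the node.
    local root="${1:-/opt/intel}"
    local candidate best=""

    # 1. The unversioned path that shipyard_nodeprep.sh currently checks.
    candidate="$root/compilers_and_libraries/linux/mpi/bin64/mpivars.sh"
    if [ -f "$candidate" ]; then
        printf '%s\n' "$candidate"
        return 0
    fi

    # 2. Versioned layout from the find output (assumed pattern).
    #    The glob expands in sorted order, so the last existing match wins.
    for candidate in "$root"/compilers_and_libraries_*/linux/mpi/intel64/bin/mpivars.sh; do
        [ -f "$candidate" ] || continue
        best="$candidate"
    done

    if [ -n "$best" ]; then
        printf '%s\n' "$best"
        return 0
    fi

    # Nothing found: signal failure to the caller.
    return 1
}
```

A caller could then branch on the return status, for example `if mpivars="$(find_mpivars)"; then echo "Intel MPI found at $mpivars"; fi`, instead of testing a single hard-coded path.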

Redacted Configuration

pool.yaml

pool_specification:
  id: ampe-docker-native
  vm_configuration:
    platform_image:
      offer: CentOS-HPC
      publisher: OpenLogic
      sku: '7.7'
      native: true
  vm_count:
    dedicated: 0
    low_priority: 0
  vm_size: STANDARD_HC44rs
  autoscale:
    evaluation_interval: 00:05:00
    scenario:
      name: active_tasks
      maximum_vm_count:
        dedicated: 2
        low_priority: 2
      maximum_vm_increment_per_evaluation:
        dedicated: -1
        low_priority: -1
#  inter_node_communication_enabled: true
  ssh:
    username: shipyard

jobs.yaml

job_specifications:
- id: ampe-docker-shipyard-j5
  tasks:
  - docker_image: stvdwtt/ampe:azure_test
    command:  /home/builduser/AMPE/build/source/ampe2d /home/builduser/AMPE/examples/Dendrite2D/dendrite.input

Additional Comments

themorey commented 4 years ago

FYI: changing the platform_image sku from '7.7' to '7.6' lets nodeprep complete.
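For anyone applying the same workaround from a script, the sku change could be sketched as below. The helper name set_pool_sku is hypothetical, and the sketch assumes the file contains a single sku: key, as in the pool.yaml above, and that the new value contains no slashes.

```shell
# Sketch only: set_pool_sku is a hypothetical helper, not a Batch Shipyard command.
set_pool_sku() {
    local file="$1" sku="$2"
    local tmp
    tmp="$(mktemp)" || return 1
    # Rewrite the value of every "sku:" key, keeping its indentation
    # and quoting the new value the way pool.yaml does.
    sed -E "s/^([[:space:]]*sku:[[:space:]]*).*/\1'${sku}'/" "$file" > "$tmp" \
        && mv "$tmp" "$file"
}
```

Calling `set_pool_sku pool.yaml 7.6` would turn `sku: '7.7'` into `sku: '7.6'` and leave the other keys untouched.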

alfpark commented 4 years ago

Thanks. Most likely this is an Intel MPI install location change in the 7.7+ images.