Open themorey opened 4 years ago
Pool node fails to execute start task shipyard_nodeprep.sh with the following error:
2020-05-19T19:16:52,483089149+0000 - ERROR - Intel MPI not found
Batch Shipyard version: 3.9.1
Steps to reproduce: submit a job; the pool attempts to resize but fails with:
Start task failed FailureExitCode: The task exited with an exit code representing a failure
Expected results: the job runs.
Actual results: the shipyard_nodeprep.sh startup script appears to be looking for mpivars.sh in the wrong location. Per the script:
    # check for intel mpi
    if [ -f /opt/intel/compilers_and_libraries/linux/mpi/bin64/mpivars.sh ]; then
        log INFO "Intel MPI found"
    else
        log ERROR "Intel MPI not found"
        exit 1
    fi
I ssh into the node that was created and find mpivars.sh but it is in a different location:
    # find /opt/intel/ -name mpivars.sh
    /opt/intel/compilers_and_libraries_2019.5.281/linux/mpi/intel64/bin/mpivars.sh
    /opt/intel/compilers_and_libraries_2018.5.274/linux/mpi/intel64/bin/mpivars.sh
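For illustration only, a less path-sensitive check could search the install root instead of testing one fixed path. This is a minimal sketch, not the actual fix in shipyard_nodeprep.sh: the function name `find_mpivars` is hypothetical, and it assumes GNU `sort -V` is available on the image and that the highest-versioned directory is the one wanted.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of shipyard_nodeprep.sh): print the path of
# the highest-versioned mpivars.sh under an Intel install root.
# Returns 1 if no mpivars.sh is found.
find_mpivars() {
    local root="${1:-/opt/intel}"
    local found
    # sort -V (GNU coreutils, assumed available) orders version-like names,
    # so tail -n 1 keeps the path with the highest version number.
    found=$(find "$root" -name mpivars.sh -type f 2>/dev/null | sort -V | tail -n 1)
    if [ -n "$found" ]; then
        printf '%s\n' "$found"
    else
        return 1
    fi
}
```

The existing check could then branch on `find_mpivars /opt/intel` succeeding rather than on the hard-coded path, which would cover both directory layouts shown above.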
pool.yaml
    pool_specification:
      id: ampe-docker-native
      vm_configuration:
        platform_image:
          offer: CentOS-HPC
          publisher: OpenLogic
          sku: '7.7'
          native: true
      vm_count:
        dedicated: 0
        low_priority: 0
      vm_size: STANDARD_HC44rs
      autoscale:
        evaluation_interval: 00:05:00
        scenario:
          name: active_tasks
          maximum_vm_count:
            dedicated: 2
            low_priority: 2
          maximum_vm_increment_per_evaluation:
            dedicated: -1
            low_priority: -1
      # inter_node_communication_enabled: true
      ssh:
        username: shipyard
jobs.yaml
    job_specifications:
    - id: ampe-docker-shipyard-j5
      tasks:
      - docker_image: stvdwtt/ampe:azure_test
        command: /home/builduser/AMPE/build/source/ampe2d /home/builduser/AMPE/examples/Dendrite2D/dendrite.input
FYI: changing the platform_image sku to 7.6 allows nodeprep to complete.
Thanks, most likely this is an Intel MPI location change in 7.7+ images.