Better error message when executable does not exist

KrisWilliamson commented 10 months ago

The launcher script does not check whether the selected executable exists before trying to launch it. As a result, confusing errors are generated when the executable is missing. This is most common with the model MPI executables, but it could also happen if the model was not compiled at all.

It would be better to add a check to the launcher script that confirms the executable exists and, if it does not, exits gracefully with a clear error message.
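A minimal sketch of such a check, written in Python on the assumption that it would sit next to the code that resolves the executable path. The function names `check_executable` and `require_executable` are hypothetical and are not taken from the existing scripts. The check also assumes the path is visible from wherever the check runs, which may not hold if the file only exists on the worker pod.

```python
import os
import sys


def check_executable(path):
    """Return None if path is a runnable file, otherwise an error message.

    Hypothetical helper: not part of the existing launcher scripts.
    """
    if not os.path.isfile(path):
        return f"Model executable not found: {path}"
    if not os.access(path, os.X_OK):
        return f"Model executable is not executable: {path}"
    return None


def require_executable(path):
    """Exit with a clear message and a non-zero status if path is unusable."""
    error = check_executable(path)
    if error is not None:
        print(f"ERROR: {error}", file=sys.stderr)
        sys.exit(1)
```

Calling `require_executable(path)` before the job is submitted would stop the run early with a single readable line instead of the mpirun abort banner.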

Example log

mpijob.kubeflow.org/riskpaths-1705419285-9575012 created
riskpaths-1705419285-9575012-launcher   0/1     Init:0/1   0          1s
riskpaths-1705419285-9575012-launcher   0/1     Init:0/1   0          2s
riskpaths-1705419285-9575012-launcher   0/1     Init:0/1   0          3s
riskpaths-1705419285-9575012-launcher   0/1     Init:0/1   0          4s
riskpaths-1705419285-9575012-launcher   0/1     PodInitializing   0          6s
riskpaths-1705419285-9575012-launcher   1/1     Running   0          7s
riskpaths-1705419285-9575012-launcher   1/1     Running   0          8s
Defaulted container "riskpaths-1705419285-9575012-launcher" out of: riskpaths-1705419285-9575012-launcher, kubectl-delivery (init)
+ POD_NAME=riskpaths-1705419285-9575012-worker-0
+ shift
+ /opt/kube/kubectl exec riskpaths-1705419285-9575012-worker-0 -- /bin/sh -c ( test ! -r ./.profile || . ./.profile;  orted -mca ess "env" -mca ess_base_jobid "1094123520" -mca ess_base_vpid 1 -mca ess_base_num_procs "2" -mca orte_node_regex "riskpaths-[10:1705419285]-9575012-launcher,riskpaths-[10:1705419285]-9575012-worker-0@0(2)" -mca orte_hnp_uri "1094123520.0;tcp://10.129.64.165:56737" -mca plm "rsh" --tree-spawn -mca routed "radix" -mca orte_parent_uri "1094123520.0;tcp://10.129.64.165:56737" -mca plm_rsh_agent "/etc/mpi/kubexec.sh" -mca orte_default_hostfile "/etc/mpi/hostfile" -mca pmix "^s1,s2,cray,isolated" )
/bin/bash: line 1: /home/jovyan/buckets/aaw-unclassified/microsim/models/bin/RiskPaths_mpi: No such file or directory
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:
  Process name: [[16695,1],0]
  Exit code:    127
--------------------------------------------------------------------------

In this case, an MPI model run was launched for RiskPaths, but the MPI version of the executable (RiskPaths_mpi) did not exist.
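The exit code 127 in the log is the status a POSIX shell returns when it cannot find the command it was asked to run. mpirun only passes that status on, which is why the real cause is buried in one line above the banner. The sketch below reproduces the behaviour locally; the helper name `run_missing` and the example path are illustrative only.

```python
import shlex
import subprocess


def run_missing(path):
    """Ask /bin/sh to run path and return the shell's exit status."""
    result = subprocess.run(
        ["/bin/sh", "-c", shlex.quote(path)],
        capture_output=True,
        text=True,
    )
    return result.returncode


# A path that does not exist yields the same status seen in the log.
status = run_missing("/no/such/dir/RiskPaths_mpi")
```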

Souheil-Yazji commented 10 months ago

I was under the impression that the launcher actually reports that it cannot find the appropriate file, but maybe I'm remembering incorrectly. Let's keep track of the actual error messages that we encounter.

KrisWilliamson commented 10 months ago

dispatchMPIJob.sh calls parseCommand.py. The executable path is stored in the modelExecutable variable in the Python script.

These scripts are populated into the /opt/openM/1.15.5/bin directory when the OpenM UI is launched.

StatCan / openmpp

Better error message when executable does not exist #51