TACC / launcher

A simple utility for executing multiple sequential or multi-threaded applications in a single multi-node batch job
MIT License

Crash when environment contains variables with spaces #5

Closed. mortonne closed this issue 7 years ago.

mortonne commented 8 years ago

On TACC's Lonestar 5, with launcher 3.0.1, I see a minor issue where paramrun crashes during the environment inheritance process if there is any variable in the environment whose value has spaces. For example:

login2.ls5(1)$ module load launcher
login2.ls5(2)$ export testvar="okayvar crashvar"
login2.ls5(3)$ cd /work/03206/mortonne/lonestar/cyrus/
login2.ls5(4)$ cat launcher.slurm 
#!/bin/bash
#
# Simple SLURM script for submitting multiple serial
# jobs (e.g. parametric studies) using a script wrapper
# to launch the jobs.
#
# To use, build the launcher executable and your
# serial application(s) and place them in your WORKDIR
# directory.  Then, edit the CONTROL_FILE to specify 
# each executable per process.
#-------------------------------------------------------
#-------------------------------------------------------
# 
#         <------ Setup Parameters ------>
#
#SBATCH -J Parametric 
#SBATCH -n 16
#SBATCH -p development
#SBATCH -o Parametric.o%j
#SBATCH -t 00:05:00
#          <------ Account String ----->
# <--- (Use this ONLY if you have MULTIPLE accounts) --->
##SBATCH -A
#------------------------------------------------------

export LAUNCHER_PLUGIN_DIR=$LAUNCHER_DIR/plugins
export LAUNCHER_RMI=SLURM
export LAUNCHER_JOB_FILE=jobfile

$LAUNCHER_DIR/paramrun
login2.ls5(5)$ sbatch launcher.slurm
-----------------------------------------------------------------
           Welcome to the Lonestar 5 Supercomputer
-----------------------------------------------------------------
No reservation for this job
--> Verifying valid submit host (login2)...OK
--> Verifying valid jobname...OK
--> Enforcing max jobs per user...OK
--> Verifying availability of your home dir (/home1/03206/mortonne)...OK
--> Verifying availability of your work dir (/work/03206/mortonne/lonestar)...OK
--> Verifying availability of your scratch dir (/scratch/03206/mortonne)...OK
--> Verifying valid ssh keys...OK
--> Verifying access to desired queue (development)...OK
--> Verifying job request is within current queue limits...OK
--> Checking available allocation (ANTS)...OK
Submitted batch job 76622
login2.ls5(8)$ cat Parametric.o76622
WARNING: LAUNCHER_WORKDIR variable not set. Using current directory.
Launcher: Setup complete.
------------- SUMMARY ---------------
    Number of hosts:    1
    Working directory:  /work/03206/mortonne/lonestar/cyrus
    Processes per host: 16
    Total processes:    16
    Total jobs:         600
    Scheduling method:  dynamic
-------------------------------------
Launcher: Starting parallel tasks...
Warning: Permanently added '[nid00009]:6999,[10.128.0.10]:6999' (RSA) to the list of known hosts.
env: crashvar: No such file or directory
Launcher: Done. Job exited without errors

The jobfile just contains a series of echo "Hello, World!" statements. I didn't see this issue on Lonestar 4 using launcher 1.4. A simple workaround, of course, is to avoid having any environment variables whose values contain spaces. I wanted to log this anyway, partly so other users will know about it.
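To apply that workaround, it helps to know which variables are affected before submitting. The following is a minimal sketch, not part of launcher: `list_vars_with_spaces` is a hypothetical helper name, and it relies only on awk's standard `ENVIRON` array.

```shell
#!/bin/sh
# Hypothetical helper (not part of launcher): print the names of all
# environment variables whose values contain at least one space.
list_vars_with_spaces() {
    # ENVIRON holds the environment that awk inherited from the shell.
    # A BEGIN-only program reads no input, so nothing is read from stdin.
    awk 'BEGIN {
        for (name in ENVIRON)
            if (ENVIRON[name] ~ / /)
                print name
    }' | sort
}
```

Running `list_vars_with_spaces` in the login shell before `sbatch` lists candidates to `unset` or rewrite. With the session above, `testvar` would be among them.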

lwilson commented 8 years ago

This issue doesn't come up too often, but I will update the environment propagation script to quote variable values with spaces in them.

lwilson commented 8 years ago

I have been looking at the previous environment propagation script (from version 1.4), and it simply doesn't propagate variables with spaces in them.

Also, I do not see a way to escape the spaces so that env will not interpret them as separate arguments. Neither surrounding the value in quotation marks nor escaping the spaces with backslashes works.
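The failure mode can be reproduced locally without launcher. This sketch only shows how a value that is split on its space makes env treat the second word as a command to run. It is not a reproduction of how paramrun or pass_env actually builds its command line. The function name `reproduce_env_split` is made up for illustration.

```shell
#!/bin/sh
# Illustration only: mimics the symptom from the job log, not launcher's code.
reproduce_env_split() {
    testvar="okayvar crashvar"
    # $testvar is deliberately left unquoted. The shell splits it into two
    # words, so env receives the assignment "testvar=okayvar" followed by
    # "crashvar", which it then tries to execute as the command.
    env testvar=$testvar true
}

reproduce_env_split
echo "exit status: $?"   # non-zero, because env cannot find a command named crashvar
```

The message env writes to stderr names `crashvar` as the missing file, which matches the line in Parametric.o76622.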

Going forward, pass_env will simply ignore all variables which contain spaces.
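As a rough sketch of that filtering approach (the real pass_env script is not shown in this thread, so the function name and the output format here are assumptions), the environment can be emitted as `NAME=value` lines while skipping any value that contains whitespace:

```shell
#!/bin/sh
# Hypothetical sketch, NOT the actual pass_env script from launcher.
# Prints one NAME=value line per environment variable, omitting every
# variable whose value contains a space, tab, or newline.
env_args_without_spaces() {
    awk 'BEGIN {
        for (name in ENVIRON) {
            value = ENVIRON[name]
            # Such values would be split into several words if passed on
            # unquoted, so they are dropped rather than propagated.
            if (value ~ /[ \t\n]/)
                continue
            printf "%s=%s\n", name, value
        }
    }'
}
```

Dropping the variable trades completeness for robustness: a job that depends on a variable with spaces in its value would need to set it inside the job itself.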

lwilson commented 7 years ago

I have fixed the pass_env script so that variables with spaces are not propagated.