ReproNim / reproman

ReproMan (AKA NICEMAN, AKA ReproNim TRD3)
https://reproman.readthedocs.io

slurm: without batch parameters creates array of size 0 which throws slurm off-guard #550

Open yarikoptic opened 3 years ago

yarikoptic commented 3 years ago

Following the workaround for #549, I got to:

(reproman-dev) login2.ls5(10)$ git clean -dfx; reproman run --follow -r local --sub slurm --orc datalad-no-remote python -m nose -s -v datalad
Removing .reproman/jobs/local/
2020-10-09 15:09:00,775 [INFO   ] Submitting 20201009-150859-3cf0
2020-10-09 15:09:00,779 [INFO   ] No root directory supplied for local; using '/home1/03372/yoh/.reproman/run-root'
2020-10-09 15:09:00,819 [INFO   ] Submitting /home1/03372/yoh/testslurm/.reproman/jobs/local/20201009-150859-3cf0/submit
2020-10-09 15:09:00,975 [ERROR  ] CommandError: command '['sbatch', '-p', 'normal', '-n', '1', '-N', '1', '-t', '90', '/home1/03372/yoh/testslurm/.reproman/jobs/local/20201009-150859-3cf0/submit']' failed with exitcode 1
| Failed to run ['sbatch', '-p', 'normal', '-n', '1', '-N', '1', '-t', '90', '/home1/03372/yoh/testslurm/.reproman/jobs/local/20201009-150859-3cf0/submit'] under None. Exit code=1. out=
| -----------------------------------------------------------------
|           Welcome to the Lonestar 5 Supercomputer
| -----------------------------------------------------------------
|
| No reservation for this job
| --> Verifying valid submit host (login2)...OK
| --> Verifying valid jobname...OK
| --> Enforcing max jobs per user...OK
| --> Verifying availability of your home dir (/home1/03372/yoh)...OK
| --> Verifying availability of your work dir (/work/03372/yoh/lonestar)...OK
| --> Verifying availability of your scratch dir (/scratch/03372/yoh)...OK
| --> Verifying valid ssh keys...OK
| --> Verifying access to desired queue (normal)...OK
| --> Verifying job request is within current queue limits...OK
| --> Checking available allocation (Analysis_Lonestar)...OK
|  err=sbatch: error: Batch job submission failed: Invalid job array specification
|  [cmd.py:run:292] (CommandError)

And here is the submit file (note: no trailing newline):

(reproman-dev) login2.ls5(11)$ cat .reproman/jobs/local/20201009-150859-3cf0/submit
#!/bin/sh

#SBATCH --output=/home1/03372/yoh/testslurm/.reproman/jobs/local/20201009-150859-3cf0/stdout.%a
#SBATCH --error=/home1/03372/yoh/testslurm/.reproman/jobs/local/20201009-150859-3cf0/stderr.%a
#SBATCH --array=0

/home1/03372/yoh/testslurm/.reproman/jobs/local/20201009-150859-3cf0/runscript $SLURM_ARRAY_TASK_ID(reproman-dev) login2.ls5(11)$ 

So I thought I could work around it by specifying --bp, but that seemed to have no effect:

(reproman-dev) login2.ls5(16)$ git clean -dfx; reproman run --follow -r local --sub slurm --orc datalad-no-remote --bp m=datalad python -m nose -s -v '{p[m]}' 
Removing .reproman/jobs/local/
2020-10-09 15:17:48,665 [INFO   ] Submitting 20201009-151747-255c 
2020-10-09 15:17:48,668 [INFO   ] No root directory supplied for local; using '/home1/03372/yoh/.reproman/run-root' 
2020-10-09 15:17:48,703 [INFO   ] Submitting /home1/03372/yoh/testslurm/.reproman/jobs/local/20201009-151747-255c/submit 
2020-10-09 15:17:50,008 [ERROR  ] CommandError: command '['sbatch', '-p', 'normal', '-n', '1', '-N', '1', '-t', '90', '/home1/03372/yoh/testslurm/.reproman/jobs/local/20201009-151747-255c/submit']' failed with exitcode 1
| Failed to run ['sbatch', '-p', 'normal', '-n', '1', '-N', '1', '-t', '90', '/home1/03372/yoh/testslurm/.reproman/jobs/local/20201009-151747-255c/submit'] under None. Exit code=1. out=
| -----------------------------------------------------------------
|           Welcome to the Lonestar 5 Supercomputer          
| -----------------------------------------------------------------
| 
| No reservation for this job
| --> Verifying valid submit host (login2)...OK
| --> Verifying valid jobname...OK
| --> Enforcing max jobs per user...OK
| --> Verifying availability of your home dir (/home1/03372/yoh)...OK
| --> Verifying availability of your work dir (/work/03372/yoh/lonestar)...OK
| --> Verifying availability of your scratch dir (/scratch/03372/yoh)...OK
| --> Verifying valid ssh keys...OK
| --> Verifying access to desired queue (normal)...OK
| --> Verifying job request is within current queue limits...OK
| --> Checking available allocation (Analysis_Lonestar)...OK
|  err=sbatch: error: Batch job submission failed: Invalid job array specification
|  [cmd.py:run:292] (CommandError) 
(reproman-dev) login2.ls5(17)$ cat .reproman/jobs/local/20201009-*/submit
#!/bin/sh

#SBATCH --output=/home1/03372/yoh/testslurm/.reproman/jobs/local/20201009-151747-255c/stdout.%a
#SBATCH --error=/home1/03372/yoh/testslurm/.reproman/jobs/local/20201009-151747-255c/stderr.%a
#SBATCH --array=0

/home1/03372/yoh/testslurm/.reproman/jobs/local/20201009-151747-255c/runscript $SLURM_ARRAY_TASK_ID(reproman-dev) login2.ls5(18)$ 

yarikoptic commented 3 years ago

@effigies do you see anything obviously wrong with that simplistic submit recipe for slurm? Changing to --array=1 did not help either. Removing the --array line and hardcoding the job id for our run shim worked. So the problem is somewhere in the --array specification, but sbatch doesn't really say what is wrong.
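
For illustration, the workaround described here (drop the --array line and hardcode the job id when there is only one subjob) could look roughly like the sketch below. This is not ReproMan's actual template code: `render_submit`, `job_dir`, and `num_subjobs` are hypothetical names, and the `0-N` array range for multiple subjobs is an assumption. The sketch also ends the script with a newline, which the generated file above lacks.

```python
def render_submit(job_dir: str, num_subjobs: int) -> str:
    """Return the text of a Slurm submit script (hypothetical helper).

    With a single subjob, no --array directive is emitted and the task
    id is hardcoded to 0, mirroring the manual workaround in this issue.
    """
    if num_subjobs < 1:
        raise ValueError("need at least one subjob")

    single = num_subjobs == 1
    # %a is sbatch's filename pattern for the array task index; without
    # an array we substitute the literal index ourselves.
    suffix = "0" if single else "%a"
    lines = [
        "#!/bin/sh",
        "",
        f"#SBATCH --output={job_dir}/stdout.{suffix}",
        f"#SBATCH --error={job_dir}/stderr.{suffix}",
    ]
    if not single:
        # Assumed range for the multi-subjob case: indices 0..N-1.
        lines.append(f"#SBATCH --array=0-{num_subjobs - 1}")

    task_id = "0" if single else "$SLURM_ARRAY_TASK_ID"
    lines += ["", f"{job_dir}/runscript {task_id}"]
    # Terminate the file with a newline.
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_submit("/path/to/jobdir", 1), end="")
```

Whether skipping --array for single-subjob runs is the right fix depends on why this cluster rejects the array specification, which the sbatch error message does not reveal.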

effigies commented 3 years ago

I can have a look, but I'm no slurm expert.

effigies commented 3 years ago

https://github.com/ReproNim/reproman/issues/557