ReproNim / reproman

ReproMan (AKA NICEMAN, AKA ReproNim TRD3)
https://reproman.readthedocs.io
Other
24 stars 14 forks source link

ENH: slurm launcher #574

Closed jbwexler closed 3 years ago

jbwexler commented 3 years ago

Sorry for the delay, I've finally been able to work on this. This fix turns out to be pretty simple and seems to work well so far but I haven't done any kind of extensive testing yet. Just thought I would let you guys take a look. Let me know if you have questions. I think there's probably a more elegant way of creating the SlurmLauncherSubmitter class than how did it, which was simply to copy and paste the SlurmSubmitter class and make a few minor edits.

As I write this it occurs to me that we should also add some of the if statements in slurm_launcher.template to slurm.template in order to fix https://github.com/ReproNim/reproman/issues/549

Some example output:

(main) jbwexler@login2>reproman run --follow -r local --sub slurm_launcher --orc datalad-no-remote --batch-parameter sub=1,2,3,4 --input sub-0{p[sub]} --jp num_processes=4 --jp num_nodes=4 --jp walltime=00:02:00 --jp queue=normal sh ../test.sh 2021-02-25 13:57:49,928 [INFO ] Making inputs available 2021-02-25 13:57:50,153 [INFO ] Submitting 20210225-135748-648d 2021-02-25 13:57:50,158 [INFO ] No root directory supplied for local; using '/home1/03201/jbwexler/.reproman/run-root' 2021-02-25 13:57:50,314 [INFO ] Submitting /scratch/03201/jbwexler/reproman_test/ds000001/.reproman/jobs/local/20210225-135748-648d/submit 2021-02-25 13:57:50,561 [INFO ] Job 20210225-135748-648d submitted as slurm_launcher job 3286023 2021-02-25 13:57:50,571 [INFO ] Registered job 20210225-135748-648d 2021-02-25 13:57:50,596 [INFO ] Waiting on job 3286023: PENDING. Next heartbeat in 5 seconds 2021-02-25 13:57:55,856 [INFO ] Waiting on job 3286023: PENDING. Next heartbeat in 5 seconds 2021-02-25 13:58:01,730 [INFO ] Waiting on job 3286023: PENDING. Next heartbeat in 12 seconds 2021-02-25 13:58:14,224 [INFO ] Waiting on job 3286023: PENDING. Next heartbeat in 26 seconds 2021-02-25 13:58:45,751 [INFO ] Waiting on job 3286023: PENDING. Next heartbeat in 60 seconds 2021-02-25 13:59:49,777 [INFO ] Waiting on job 3286023: PENDING. Next heartbeat in 131 seconds 2021-02-25 14:02:06,773 [INFO ] Waiting on job 3286023: PENDING. Next heartbeat in 281 seconds 2021-02-25 14:06:53,677 [INFO ] Waiting on job 3286023: PENDING. Next heartbeat in 597 seconds 2021-02-25 14:16:51,819 [INFO ] Waiting on job 3286023: PENDING. Next heartbeat in 1255 seconds 2021-02-25 14:25:43,272 [INFO ] Final state of job 3286023: COMPLETING 2021-02-25 14:25:43,339 [INFO ] Unregistered job 20210225-135748-648d

codecov[bot] commented 3 years ago

Codecov Report

Merging #574 (36167f1) into master (6c49cfe) will decrease coverage by 0.33%. The diff coverage is 36.36%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #574      +/-   ##
==========================================
- Coverage   89.22%   88.88%   -0.34%     
==========================================
  Files         149      149              
  Lines       13054    13087      +33     
==========================================
- Hits        11647    11632      -15     
- Misses       1407     1455      +48     
Impacted Files Coverage Δ
reproman/interface/run.py 100.00% <ø> (ø)
reproman/support/jobs/submitters.py 65.29% <36.36%> (-4.07%) :arrow_down:
reproman/distributions/tests/test_venv.py 86.32% <0.00%> (-9.41%) :arrow_down:
reproman/distributions/tests/test_docker.py 94.73% <0.00%> (-5.27%) :arrow_down:
reproman/resource/aws_ec2.py 59.06% <0.00%> (-2.60%) :arrow_down:
reproman/tests/skip.py 95.55% <0.00%> (-2.23%) :arrow_down:
reproman/resource/docker_container.py 92.35% <0.00%> (-1.92%) :arrow_down:
reproman/resource/ssh.py 88.09% <0.00%> (-0.80%) :arrow_down:
reproman/utils.py 86.97% <0.00%> (-0.17%) :arrow_down:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 6c49cfe...6bfb104. Read the comment docs.

yarikoptic commented 3 years ago

Great, thank you @jbwexler !

I wonder if it is easyiesh to install/"provision" the submitter on top of an existing slurm instance?
If so -- it would be great to get some "unit" test to ensure that it is all working. We have https://github.com/ReproNim/reproman/blob/master/tools/ci/setup-slurm-container.sh and some testing of interactions with slurm in https://github.com/ReproNim/reproman/blob/master/reproman/support/jobs/tests/test_orchestrators.py

@kyleam could you please review the PR?

Re

As I write this it occurs to me that we should also add some of the if statements in slurm_launcher.template to slurm.template in order to fix #549

would be much appreciated -- might better be in a separate PR to not make it contingent on this one.

jbwexler commented 3 years ago

Thanks for the feedback! @kyleam I like your solution and am working on it now.

One question: I need to add --ntasks to slurm.template, which can take a job parameter called 'num_tasks'. My slight concern is that some documentation describes --ntasks as the number of processes, which is confusing considering reproman has a different job parameter called num_processes, which is assigned to --cpus-per-task. For what it's worth, the above documentation describes --cpus-per-task as the number of threads. My understanding of the differences between procs/threads/tasks is limited so I thought I would get your opinion on this.

kyleam commented 3 years ago

That's a good question, and I think in general how to handle these parameters across the different submitters is an unresolved issue:

https://github.com/ReproNim/reproman/blob/6c49cfeca9ebb6217a1f2de8d1152a9b61af5768/reproman/support/jobs/job_templates/submission/slurm.template#L5-L11

I think I made a bad choice in mapping num_processes to --cpus-per-task. The Slurm template came after PBS and Condor. For PBS it maps to ppn, so I guess --ntasks-per-node (without --ntasks) would be the closest match. For Condor it is request_cpus and maybe --ntasks would be the best fit. My suggestion for now would be to drop the use of --cpus-per-tasks and go with --ntasks or --ntasks-per-node.

kyleam commented 3 years ago

Fwiw I've added launcher to the container that we test the current slurm submitter against. Based on quick testing, it looks like that may be enough to work with your new template. I'll try to wire that up after you push the next iteration.

https://github.com/ReproNim/reproman-slurm/commit/1cd5e8859e31d72fec4c1b3542d7659e018c7b2e

jbwexler commented 3 years ago

I ended up creating a new branch here: https://github.com/ReproNim/reproman/pull/577