ReproNim / reproman

ReproMan (AKA NICEMAN, AKA ReproNim TRD3)
https://reproman.readthedocs.io

run: Add Slurm support #494

Closed kyleam closed 4 years ago

kyleam commented 4 years ago

This is an initial stab at Slurm support (gh-484). There are still things to flesh out (most of the known ones should have to-do comment placeholders), and I bet someone familiar with Slurm could suggest better ways to do things (e.g., is it possible to get the job status as a JSON record?). But I was able to submit simple commands (with a single job and with multiple subjobs), so things seem to be wired up correctly at least at a basic level.


Given that I don't have access to an environment with Slurm, setting that up was the more involved part. Here are the details:

Slurm setup

* Clone
* Apply the patch at the end of this post. It configures the image for ssh. A few of the changes in that diff might not be strictly necessary, but I didn't spend much time fiddling with it once I got something working.
* Build the image: `docker build -t slurm-docker-cluster:19.05.1 .` This takes a bit of time because it builds Slurm from source.
* Run `docker-compose up -d` and then `./register_cluster.sh`.
* I didn't specify a host port in docker-compose.yml, so check the assigned port (`docker port slurmctld 22`).
* Add an entry to .ssh/config for the slurmctld container along the lines of

  ```
  ...
  Host slurm
  HostName
  User root
  ControlMaster no
  Port
  ```
* `ssh-copy-id slurm` so that you don't have to worry about entering the password ("root").

Then we're to the standard reproman stuff:

* `reproman create sl -t ssh -b host=slurm`
* Create some datalad dataset and run something like `reproman run -r sl --follow --orc datalad-local-run --submitter=slurm --jp root_directory=/data --bp say=a,b sh -c "sleep 10; echo i say {p[say]} >{p[say]}"`

  Notice the `root_directory`. slurm-docker-cluster's README.md says to submit the jobs from that mount point, and I think that's necessary to make things work, but I haven't really looked into it.

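To make the `--bp say=a,b` part of the command above more concrete, here is a minimal sketch of how a batch parameter could fan out into one subjob command per value. It is not reproman's actual implementation: the helper name `expand_batch_parameter` is hypothetical, and it only handles the single `name=v1,v2` form used in the example. It relies on the fact that Python's `str.format` resolves `{p[say]}` as a dictionary lookup on `p`.

```python
def expand_batch_parameter(spec, template):
    """Expand a 'name=v1,v2,...' spec into one command string per value.

    Hypothetical helper for illustration; reproman's real logic may differ.
    """
    name, _, values = spec.partition("=")
    # "{p[say]}".format(p={"say": "a"}) looks up the key "say" in the dict p.
    return [template.format(p={name: value}) for value in values.split(",")]


commands = expand_batch_parameter(
    "say=a,b", 'sh -c "sleep 10; echo i say {p[say]} >{p[say]}"'
)
for command in commands:
    print(command)
```

With `say=a,b` this yields two commands, one writing `a` and one writing `b`, which matches the "multiple subjobs" case mentioned at the top of the post.
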
sshd patch

```diff
diff --git a/Dockerfile b/Dockerfile
index d143635..197e92d 100644
--- a/Dockerfile
+++ b/Dockerfile
@@ -34,6 +34,8 @@ RUN set -ex \
        psmisc \
        bash-completion \
        vim-enhanced \
+       openssh-clients \
+       openssh-server \
     && yum clean all \
     && rm -rf /var/cache/yum
 
@@ -83,10 +85,21 @@ RUN set -x \
     && chown -R slurm:slurm /var/*/slurm* \
     && /sbin/create-munge-key
 
+RUN echo 'root:root' |chpasswd
+
+RUN sed -ri 's/^#?PermitRootLogin\s+.*/PermitRootLogin yes/' /etc/ssh/sshd_config
+RUN sed -ri 's/UsePAM yes/#UsePAM yes/g' /etc/ssh/sshd_config
+
+RUN mkdir /root/.ssh
+
 COPY slurm.conf /etc/slurm/slurm.conf
 COPY slurmdbd.conf /etc/slurm/slurmdbd.conf
 
 COPY docker-entrypoint.sh /usr/local/bin/docker-entrypoint.sh
 ENTRYPOINT ["/usr/local/bin/docker-entrypoint.sh"]
 
+EXPOSE 22
+ENV NOTVISIBLE "in users profile"
+RUN echo "export VISIBLE=now" >> /etc/profile
+
 CMD ["slurmdbd"]
diff --git a/docker-compose.yml b/docker-compose.yml
index f0862be..74fcb0e 100644
--- a/docker-compose.yml
+++ b/docker-compose.yml
@@ -37,8 +37,11 @@ services:
       - etc_slurm:/etc/slurm
       - slurm_jobdir:/data
       - var_log_slurm:/var/log/slurm
+    ports:
+      - "22"
     expose:
       - "6817"
+      - "22"
     depends_on:
       - "slurmdbd"
 
diff --git a/docker-entrypoint.sh b/docker-entrypoint.sh
index 9a1203a..1a7a16f 100755
--- a/docker-entrypoint.sh
+++ b/docker-entrypoint.sh
@@ -23,6 +23,10 @@ fi
 
 if [ "$1" = "slurmctld" ]
 then
+    echo "---> Starting sshd ..."
+    ssh-keygen -A
+    /usr/sbin/sshd
+
     echo "---> Starting the MUNGE Authentication service (munged) ..."
     gosu munge /usr/sbin/munged
 
```

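Because the compose change publishes container port 22 without fixing a host port, the assigned port has to be looked up before writing the .ssh/config entry. The sketch below shows one way that step might be scripted. The function names are hypothetical, the sample `0.0.0.0:32768` string is an illustrative value rather than real output from this setup, and the `HostName` value is a parameter because the original post leaves it out.

```python
import subprocess


def parse_docker_port(output):
    """Return the host port from `docker port` output like '0.0.0.0:32768'."""
    first_line = output.strip().splitlines()[0]
    # Take whatever follows the last colon of the first mapping line.
    return int(first_line.rsplit(":", 1)[1])


def lookup_port(container="slurmctld", container_port="22"):
    """Ask Docker which host port is mapped to the container's ssh port."""
    result = subprocess.run(
        ["docker", "port", container, container_port],
        capture_output=True, text=True, check=True,
    )
    return parse_docker_port(result.stdout)


def ssh_config_entry(alias, hostname, port, user="root"):
    """Format an .ssh/config stanza with the fields used in the post."""
    return "\n".join([
        f"Host {alias}",
        f"  HostName {hostname}",
        f"  User {user}",
        "  ControlMaster no",
        f"  Port {port}",
    ])


# Illustrative only: a made-up port and a placeholder host name.
example_entry = ssh_config_entry(
    "slurm", "<hostname>", parse_docker_port("0.0.0.0:32768")
)
print(example_entry)
```

In a real run, `lookup_port()` would replace the hard-coded sample string, and `<hostname>` would be whatever address reaches the Docker host.
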
satra commented 4 years ago

@kyleam - you can take a look at this here:

https://github.com/nipy/nipype/blob/master/nipype/pipeline/plugins/slurm.py

(a much simpler worker in the new engine): https://github.com/nipype/pydra/blob/master/pydra/engine/workers.py#L165

satra commented 4 years ago

also in pydra we are testing slurm in a container: https://github.com/nipype/pydra/blob/master/ci/slurm.sh

kyleam commented 4 years ago

@satra Thanks for the pointers.

Just in terms of my question about getting structured output about jobs, what I could gather from a quick skim suggests that sadly we're going to have to stick with parsing the unstructured output with a regexp.
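As a rough illustration of what regexp-based parsing could look like, here is a minimal sketch. It assumes the standard `Submitted batch job <id>` message from `sbatch` and the `Key=Value` layout of `scontrol show job`. The helper names are hypothetical, the sample text is hand-written and abbreviated rather than captured from a cluster, and values containing spaces would not survive this simple pattern.

```python
import re

# Assumed formats; sites may configure or wrap these commands differently.
SUBMIT_RE = re.compile(r"Submitted batch job (\d+)")
FIELD_RE = re.compile(r"(\w+)=(\S*)")


def parse_job_id(sbatch_stdout):
    """Extract the job ID from sbatch's submission message."""
    match = SUBMIT_RE.search(sbatch_stdout)
    if match is None:
        raise ValueError(f"could not find a job ID in: {sbatch_stdout!r}")
    return match.group(1)


def parse_job_fields(scontrol_stdout):
    """Collect whitespace-separated Key=Value pairs into a dict."""
    return dict(FIELD_RE.findall(scontrol_stdout))


# Hand-written, abbreviated sample in the Key=Value style.
SAMPLE = """\
JobId=2 JobName=wrap
   UserId=root(0) GroupId=root(0)
   JobState=COMPLETED Reason=None
"""

job_id = parse_job_id("Submitted batch job 2\n")
fields = parse_job_fields(SAMPLE)
print(job_id, fields["JobState"])
```

A parser like this is fragile by nature, so any real submitter would probably want to fail loudly when an expected field such as `JobState` is missing.
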

satra commented 4 years ago

> Just in terms of my question about getting structured output about jobs, what I could gather from a quick skim suggests that sadly we're going to have to stick with parsing the unstructured output with a regexp.

indeed. the slurm output can be customized by a center.

one of the things we will try to do in pydra, unless someone has already done it, is to analyze the slurm configuration to find out about resources, qos, and partitions. we could also consider a test job to determine how to parse output. but these are all fancy things relative to the user simply saying this is where to go and run.

codecov[bot] commented 4 years ago

Codecov Report

Merging #494 into master will decrease coverage by 5.01%. The diff coverage is 23.30%.

@@            Coverage Diff             @@
##           master     #494      +/-   ##
==========================================
- Coverage   89.64%   84.63%   -5.02%     
==========================================
  Files         148      148              
  Lines       12209    12272      +63     
==========================================
- Hits        10945    10386     -559     
- Misses       1264     1886     +622     

| Impacted Files | Coverage Δ |
|---|---|
| reproman/support/jobs/tests/test_orchestrators.py | 32.37% <10.93%> (-61.07%) :arrow_down: |
| reproman/support/jobs/submitters.py | 51.95% <32.25%> (-24.40%) :arrow_down: |
| reproman/tests/skip.py | 93.25% <87.50%> (-4.28%) :arrow_down: |
| reproman/resource/tests/test_ssh.py | 27.53% <0.00%> (-72.47%) :arrow_down: |
| reproman/support/jobs/orchestrators.py | 46.56% <0.00%> (-45.28%) :arrow_down: |
| reproman/interface/tests/test_execute.py | 71.84% <0.00%> (-28.16%) :arrow_down: |
| reproman/resource/ssh.py | 75.00% <0.00%> (-13.34%) :arrow_down: |
| reproman/interface/execute.py | 86.62% <0.00%> (-8.29%) :arrow_down: |

... and 8 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data. Last update 2ffa175...8456499.