cooperative-computing-lab / cctools

The Cooperative Computing Tools (cctools) enable large scale distributed computations to harness hundreds to thousands of machines from clusters, clouds, and grids.
http://ccl.cse.nd.edu

Makeflow: Add LSF Support #2479

Closed dthain closed 3 years ago

dthain commented 3 years ago

Add support for the LSF batch system to Makeflow, and work with Larry Singh to make sure that it works for his particular LSF system and workflow.

At first glance, this looks like a straightforward update of batch_job_cluster.c: add LSF to the batch type enumeration and add cases for running bsub and appropriate commands to execute each makeflow job. Consult the LSF user documentation for the key ideas: https://www.ibm.com/support/knowledgecenter/SSWRJV_10.1.0/lsf_admin_foundations/working_lsf.html
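For a rough sense of the shape of the change, here is a sketch (the identifiers are illustrative, not the actual cctools symbols):

/* Sketch only: add an LSF case to the cluster batch driver.
   The real batch_job_cluster.c uses its own names and structure. */
#include <stdio.h>

typedef enum {
    BATCH_QUEUE_TYPE_TORQUE,
    BATCH_QUEUE_TYPE_PBS,
    BATCH_QUEUE_TYPE_LSF    /* new entry for LSF */
} batch_queue_type_t;

static const char *cluster_submit_cmd;
static const char *cluster_remove_cmd;

static void cluster_setup(batch_queue_type_t type)
{
    switch (type) {
    case BATCH_QUEUE_TYPE_LSF:
        cluster_submit_cmd = "bsub";   /* submits a job */
        cluster_remove_cmd = "bkill";  /* cancels a job */
        break;
    default:
        /* ... existing cases for torque, pbs, sge, etc. ... */
        break;
    }
}

int main(void)
{
    cluster_setup(BATCH_QUEUE_TYPE_LSF);
    printf("submit with: %s, remove with: %s\n",
           cluster_submit_cmd, cluster_remove_cmd);
    return 0;
}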

dthain commented 3 years ago

Adding @larryns to the ticket.

dthain commented 3 years ago

Documentation for the facility: https://hpcwiki.pmacs.upenn.edu/wiki/index.php/HPC:User_Guide

dthain commented 3 years ago

@larryns can you share the command lines that you use to get your jobs running, along with the output that comes back, showing the job id that was submitted?

larryns commented 3 years ago

Sure, there are two ways I usually do it:

  1. On the command line, giving the requirements as options: bsub -J <jobname> -M <memlimit> -n <cores> -o <stdout> -e <stderr>

$ bsub -J runmeJob -M 42000 -n 12 -o runme.o -e runme.e runme.sh

will request 42 GB of memory and 12 cores, and run the script runme.sh.

  2. Alternatively, you can put the requirements in the shell script, e.g.
#!/bin/bash
#BSUB -J runmeJob
#BSUB -e runme.e
#BSUB -o runme.o
#BSUB -M 42000
#BSUB -n 12

# rest of shell script follows...

Submit with:

$ bsub < runme.sh

For both submissions, you'll get a response like:

Job <67550203> is submitted to default queue <normal>.

I use both methods; I guess it depends on which is more convenient for you to code.

-Larry.

dthain commented 3 years ago

Is that the literal output, or did you add the angle brackets around the job number?

Job <67550203> is submitted to default queue <normal>.
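(Whether those brackets are literal matters because the submit driver has to parse the job id out of that line. A minimal sketch of such a parse, assuming the bracketed format above; this is not the actual cctools code:)

#include <stdio.h>

int main(void)
{
    const char *line = "Job <67550203> is submitted to default queue <normal>.";
    long jobid;

    /* Match the literal prefix, then pull out the numeric job id. */
    if (sscanf(line, "Job <%ld>", &jobid) == 1)
        printf("parsed job id: %ld\n", jobid);
    else
        printf("no job id found\n");

    return 0;
}
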
larryns commented 3 years ago

Literal, straight copy and paste. I didn't add the angle brackets.

dthain commented 3 years ago

PR #2481

dthain commented 3 years ago

@larryns we have a prototype here for you to try. There is always some unexpected oddity that comes up with a new system. Please try this out and let us know how it goes for you:

You will have to install from source:

git clone https://github.com/cooperative-computing-lab/cctools
cd cctools
./configure --prefix $HOME/cctools-test
make install
export PATH=$HOME/cctools-test/bin:$PATH

Then, when you run makeflow, please generate a debug output file, which should clarify if anything unusual happens. For example:

makeflow -d all -o debug.log -T lsf test.mf

larryns commented 3 years ago

@dthain: Thanks! I'll give it a shot over the next week and get back to you.

larryns commented 3 years ago

@dthain So I ran makeflow and had some issues. The problem might be on my part, but I can send you the debug.log file. How can I send it to you? It's 440 KB.

dthain commented 3 years ago

Perhaps you could post the log file in a gist and link it here? The makeflow command line and the workflow itself would also be helpful.

larryns commented 3 years ago

@dthain Here you go: https://gist.github.com/larryns/6e4b9140870468c6da1bf9fa29aa2bb0

No jobs were created, and I ended up just hitting Ctrl-C on the makeflow process to kill it. Let me know if you need anything else.

Thanks!

dthain commented 3 years ago

Hmm, here is what I can deduce from the log. Makeflow did submit an LSF job:

2020/12/04 11:34:28.43 makeflow[4598] batch: bsub   -o /dev/null -e /dev/null -env all -J makeflow0 -M 38000 -n 12 lsf.wrapper
2020/12/04 11:34:28.45 makeflow[4598] batch: job 67593160 submitted
2020/12/04 11:34:28.45 makeflow[4598] makeflow: node 0 was successfully submitted.
2020/12/04 11:34:28.45 makeflow[4598] makeflow: node 0 waiting -> running

Then, about 30s passed while makeflow was waiting for the job to come alive:

2020/12/04 11:34:28.45 makeflow[4598] batch: could not open status file "lsf.status.67593160"

Those repeated messages might be a little misleading. Makeflow is looking for that status file, but it won't get created until the job starts to run in LSF. So, a certain number of those are expected. It looks like you cancelled makeflow after about 30s.
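(For context, the wait side of the cluster driver is essentially a poll for that per-job status file. A simplified sketch of the idea, not the actual cctools code:)

#include <stdio.h>
#include <unistd.h>

/* Poll for a status file such as "lsf.status.67593160". The real driver
   also reads the job's exit status out of the file once it appears. */
static int wait_for_status_file(const char *path, int timeout_secs)
{
    int waited;

    for (waited = 0; waited < timeout_secs; waited++) {
        FILE *f = fopen(path, "r");
        if (f) {
            fclose(f);
            return 0;    /* the job has started running */
        }
        sleep(1);        /* still queued: the "expected" case above */
    }
    return -1;           /* not started within the timeout */
}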

Could you determine what happened to job 67593160 in LSF? I believe the bhist -l command does that. (And apologies, I'm just going on the documentation, I haven't done it myself.)

larryns commented 3 years ago

Sorry, I probably should've run it longer, but I wanted to keep the file short. This is one of many runs I did. In some of the runs I let it go for about 30 minutes, with the same result. I can run it longer and send you the longer debug.log. As far as I can tell, the job died quickly.

Here's the accounting from bjobs -l 67593160:

Job <67593160>, Job Name , User , Project , Status , Queue , Command , Share group charged

Fri Dec 4 11:34:28: Submitted from host , CWD <$HOME/Star>, Output File , 12 Task(s); MEMLIMIT 37.1 G
Fri Dec 4 11:34:29: Started 12 Task(s) on Host(s) <12*node188.hpc.local>, Allocated 12 Slot(s) on Host(s) <12*node188.hpc.local>, Execution Home , Execution CWD ;
Fri Dec 4 11:34:29: Exited with exit code 127. The CPU time used is 0.1 seconds.
Fri Dec 4 11:34:29: Completed .

SCHEDULING PARAMETERS: (all loadSched/loadStop values unset)

RESOURCE REQUIREMENT DETAILS:
Combined:  select[type == any ] order[r15s:pg] span[ptile='!'] same[model] affinity[thread(1)*1]
Effective: select[type == any ] order[r15s:pg] span[ptile='!'] same[model] affinity[thread(1)*1]

dthain commented 3 years ago

Ok, we are making progress, it looks like the wrapper script lsf.wrapper is exiting right away without running the job. That makes me suspect it is not receiving the desired environment variables. Let's try a few debugging steps. Could you please try running this command directly on the head node?

bsub -o output.out -e error.out -env all -J makeflow0 -M 38000 -n 12 /usr/bin/env

And let me know what the output is?

btovar commented 3 years ago

The exit code 127 usually means that an executable was not found (in the PATH or otherwise). One reason could be that lsf.wrapper starts with the incorrect working directory. See how for torque and pbs we have:

    if(q->type == BATCH_QUEUE_TYPE_TORQUE || q->type == BATCH_QUEUE_TYPE_PBS) {
        fprintf(file, "cd %s\n", path);
    }

dthain commented 3 years ago

I think that is addressed by the -cwd option to bsub, but of course it's good to verify everything...

larryns commented 3 years ago

Okay, here we go. There was no output in error.out. Here's output.out:

Sender: LSF System <lsfadmin@node193.hpc.local>
Subject: Job 67593370: in cluster Done

Job was submitted from host by user in cluster .
Job was executed on host(s) <12*node193.hpc.local>, in queue , as user in cluster .
</home/larryns> was used as the home directory.
</home/larryns/tmp> was used as the working directory.
Started at . Results reported on .
Your job looked like:


LSBATCH: User input

/usr/bin/env

Successfully completed.

Resource usage summary:

CPU time :                                   0.07 sec.
Max Memory :                                 -
Average Memory :                             -
Total Requested Memory :                     -
Delta Memory :                               -
Max Swap :                                   -
Max Processes :                              -
Max Threads :                                -
Run time :                                   7 sec.
Turnaround time :                            2 sec.

The output (if any) follows:

(The job printed its full environment, one variable per line; the LSF-related entries of interest are below, with the RM_CPUTASK*, conda, and ordinary shell variables trimmed:)

LSB_JOBID=67593370
LSB_BATCH_JID=67593370
LSB_JOBNAME=makeflow0
LSB_QUEUE=normal
LSB_MAX_NUM_PROCESSORS=12
LSB_ERRORFILE=error.out
LSB_OUTPUTFILE=output.out
LSB_SUB_HOST=consign.hpc.local
LSB_SUB_USER=larryns
LS_SUBCWD=/home/larryns/tmp
LS_EXECCWD=/home/larryns/tmp
LSB_OUTDIR=/home/larryns/tmp
HOME=/home/larryns
PWD=/home/larryns/tmp
PATH=/usr/share/lsf/10.1/linux3.10-glibc2.17-x86_64/bin:/home/larryns/cctools-test/bin:/home/larryns/miniconda3/bin:/home/larryns/miniconda3/condabin:/usr/share/lsf/10.1/linux2.6-glibc2.3-x86_64/etc:/usr/share/lsf/10.1/linux2.6-glibc2.3-x86_64/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/larryns/.local/bin:/home/larryns/bin

dthain commented 3 years ago

Ok, that's helpful, and it tells me that the LSB_JOBID is getting set appropriately. Now to check Ben's question above, could you try this:

mkdir xyz123
cd xyz123
bsub -o output.out -e error.out -env all -cwd /usr/bin/pwd

(And if you are getting sick of debugging by carrier pigeon, let me know and we can do something more interactive.)

larryns commented 3 years ago

You have to specify an argument for -cwd, so I ran:

bsub -o output.out -e error.out -env all -cwd ${PWD} /usr/bin/pwd

output.out:

Sender: LSF System <lsfadmin@node188.hpc.local>
Subject: Job 67593587: </usr/bin/pwd> in cluster Done

Job </usr/bin/pwd> was submitted from host by user in cluster .
Job was executed on host(s) , in queue , as user <larryns> in cluster .
</home/larryns> was used as the home directory.
</home/larryns/xyz123> was used as the working directory.
Started at . Results reported on .
Your job looked like:


LSBATCH: User input

/usr/bin/pwd

Successfully completed.

Resource usage summary:

CPU time :                                   0.06 sec.
Max Memory :                                 -
Average Memory :                             -
Total Requested Memory :                     -
Delta Memory :                               -
Max Swap :                                   -
Max Processes :                              -
Max Threads :                                -
Run time :                                   7 sec.
Turnaround time :                            3 sec.

The output (if any) follows:

/home/larryns/xyz123

dthain commented 3 years ago

Aha, I see the problem: @btovar was right and I missed the -cwd in the first place. Just fixed that in the source.
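(The fix presumably amounts to passing the submission directory to bsub via -cwd when composing the command line; a hypothetical sketch, not the actual patch:)

#include <stdio.h>
#include <unistd.h>

int main(void)
{
    char cwd[4096];
    char cmd[4096 + 256];

    /* Pin the job's working directory so that the wrapper and its
       inputs are found where makeflow created them. */
    if (!getcwd(cwd, sizeof(cwd)))
        return 1;

    snprintf(cmd, sizeof(cmd),
             "bsub -o /dev/null -e /dev/null -env all -cwd %s -J makeflow0 lsf.wrapper",
             cwd);
    printf("%s\n", cmd);
    return 0;
}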

Ok, you will have to pull and rebuild a new version:

cd [path-to-cctools-src]
git pull origin master
make clean
make all
make install

Then, go back to your workflow directory and delete lsf.wrapper.

Then go ahead and re-run your workflow from the beginning, and let me know how it goes.

(Appreciate your patience on this; it's always a bit tricky getting things going when the system isn't directly at hand.)

larryns commented 3 years ago

Okay, same problem. From the job info, it looks like the job is running in the right directory. Granted, I could be doing something wrong.

bjobs -l 67594031

Job <67594031>, Job Name , User , Project , Status , Queue , Command , Share group charged

Fri Dec 4 13:04:24: Submitted from host , CWD <$HOME/Star>, Output File , 12 Task(s); MEMLIMIT 37.1 G
Fri Dec 4 13:04:24: Started 12 Task(s) on Host(s) <12*node193.hpc.local>, Allocated 12 Slot(s) on Host(s) <12*node193.hpc.local>, Execution Home , Execution CWD ;
Fri Dec 4 13:04:25: Exited with exit code 127. The CPU time used is 0.1 seconds.
Fri Dec 4 13:04:25: Completed .

SCHEDULING PARAMETERS: (all loadSched/loadStop values unset)

RESOURCE REQUIREMENT DETAILS:
Combined:  select[type == any ] order[r15s:pg] span[ptile='!'] same[model] affinity[thread(1)*1]
Effective: select[type == any ] order[r15s:pg] span[ptile='!'] same[model] affinity[thread(1)*1]

dthain commented 3 years ago

Hmm, do you mind if we sync up and work on this interactively?

larryns commented 3 years ago

Sure, no problem. What do you have in mind for chat?

dthain commented 3 years ago

Coordinates coming by email...

dthain commented 3 years ago

After some legwork, we found several interacting problems:

1 - The wrapper file has a stray EOF at the very end, which seems to be a leftover from some prior bit of code; that needs to be removed.

2 - The new feature that produces alive messages in the log works, but it results in the wrapper script having a failure exit value b/c the heartbeat process is killed. The exit status of the wrapper script should be the exit status of the job itself (see the sketch below).
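(A sketch of the intended wrapper generation, with hypothetical names; the status-file name follows the lsf.status.<jobid> pattern seen in the logs:)

#include <stdio.h>

/* Generate a wrapper that records and propagates the job's own exit
   status, so that killing a background heartbeat process afterwards
   cannot turn a successful job into an apparent failure. */
static void write_wrapper(FILE *file, const char *jobcmd, const char *statusfile)
{
    fprintf(file, "#!/bin/sh\n");
    fprintf(file, "%s\n", jobcmd);              /* run the actual job */
    fprintf(file, "status=$?\n");               /* capture its status */
    fprintf(file, "echo $status > %s\n", statusfile);
    fprintf(file, "exit $status\n");            /* propagate it       */
}

int main(void)
{
    write_wrapper(stdout, "./runme.sh", "lsf.status.12345");
    return 0;
}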

Will take a look at this again on Monday...

dthain commented 3 years ago

PR #2485 addresses some of the issues we discovered on Friday.

dthain commented 3 years ago

@larryns please pull the latest version of makeflow, then make clean and make install and try again. If for some reason it doesn't work the first time, then add --disable-heartbeat to the makeflow command line and see if that helps.

larryns commented 3 years ago

@dthain Nope, sorry: it didn't work, with or without --disable-heartbeat. I removed the source entirely and did:

git clone https://github.com/cooperative-computing-lab/cctools
cd cctools
./configure --prefix $HOME/cctools-test
make install

as before. Then I ran makeflow with and without --disable-heartbeat, but neither worked. Do you need to see the debug.log?

Thanks.

dthain commented 3 years ago

Darn. Yes, please share the various files via gist as before.

larryns commented 3 years ago

@dthain Ok here's the new gist:

https://gist.github.com/larryns/6e4b9140870468c6da1bf9fa29aa2bb0

dthain commented 3 years ago

On a quick call this morning, we narrowed it down to a missing dot-slash on the wrapper file name. Apparently LSF does a PATH search on the bsub argument, and so it misses executables in the current working directory.
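(In code terms the fix is tiny: submit the wrapper by explicit relative path. An illustrative sketch, not the actual patch:)

#include <stdio.h>

int main(void)
{
    const char *wrapper = "lsf.wrapper";   /* name as seen in the logs above */
    char cmd[4096];

    /* "./lsf.wrapper" rather than "lsf.wrapper": with the bare name,
       LSF does a PATH lookup and never finds the file in the job's
       working directory. */
    snprintf(cmd, sizeof(cmd),
             "bsub -o /dev/null -e /dev/null -env all ./%s", wrapper);
    printf("%s\n", cmd);
    return 0;
}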

dthain commented 3 years ago

@larryns I'm closing this one out b/c it appears to be working. Of course, let us know if you have any further trouble.

larryns commented 3 years ago

@dthain So far, so good. I've been installing makeflow from conda. Any idea when a new conda package will be made?

dthain commented 3 years ago

@btovar can you make a Conda release sometime this week to roll up a few recent changes?

btovar commented 3 years ago

will do

larryns commented 3 years ago

Thanks!