Adding @larryns to the ticket.
Documentation for the facility: https://hpcwiki.pmacs.upenn.edu/wiki/index.php/HPC:User_Guide
@larryns can you share the command lines that you use to get your jobs running, along with the output that comes back, showing the job id that was submitted?
Sure, there are two ways I usually do it:
$ bsub -J runmeJob -M 42000 -n 12 -o runme.o -e runme.e runme.sh
will request 42 GB of memory and 12 cores, and run the script runme.sh. The second way is to embed the same directives in the script itself:
#!/bin/bash
#BSUB -J runmeJob
#BSUB -e runme.e
#BSUB -o runme.o
#BSUB -M 42000
#BSUB -n 12
# rest of shell script follows...
Submit with:
$ bsub < runme.sh
For both submissions, you'll get a response like:
Job <67550203> is submitted to default queue <normal>.
I use both methods; I guess it depends on which is more convenient for you to code.
-Larry.
Is that the literal output, or did you add the angle brackets around the job number?
Job <67550203> is submitted to default queue <normal>.
Literal, straight copy and paste. I didn't add the angle brackets.
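For reference, that is the line a submission wrapper has to parse. A hedged sketch (not the actual makeflow code) of pulling the job id out of the response:
# Capture bsub's response and extract the numeric job id, e.g. from
#   Job <67550203> is submitted to default queue <normal>.
line=$(bsub -J idtest -o /dev/null -e /dev/null /bin/true)
jobid=$(printf '%s\n' "$line" | sed -n 's/^Job <\([0-9][0-9]*\)>.*/\1/p')
echo "submitted job id: $jobid"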
PR #2481
@larryns we have a prototype here for you to try. There is always some unexpected oddity that comes up with a new system. Please try this out and let us know how it goes for you:
You will have to install from source:
git clone https://github.com/cooperative-computing-lab/cctools
cd cctools
./configure --prefix $HOME/cctools-test
make install
export PATH=$HOME/cctools-test/bin:$PATH
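As a quick sanity check that the fresh build is the one being picked up:
which makeflow    # should resolve to $HOME/cctools-test/bin/makeflow
makeflow -v       # should report the freshly built version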
Then, when you run makeflow, please generate a debug output file, which should clarify if anything unusual happens. For example:
makeflow -d all -o debug.log -T lsf test.mf
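If a small self-contained workflow is handy for testing, something like this would do (a hypothetical stand-in for the real workflow; makeflow rules use Make-style syntax with tab-indented commands):
cat > test.mf <<'EOF'
hello.out:
	echo hello world > hello.out
EOF
makeflow -d all -o debug.log -T lsf test.mf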
@dthain: Thanks! I'll give it a shot over the next week and get back to you.
@dthain So I ran makeflow and had some issues. The problem might be on my part, but I can send you the debug.log file. How can I send it to you? It's 440 KB.
Perhaps you could post the log file in a gist and link it here? Also, the makeflow command line and the workflow would be helpful too.
@dthain Here you go: https://gist.github.com/larryns/6e4b9140870468c6da1bf9fa29aa2bb0
No jobs were created, and I ended up just hitting Ctrl-C on the makeflow process to kill it. Let me know if you need anything else.
Thanks!
Hmm, here is what I can deduce from the log. Makeflow did submit an LSF job:
2020/12/04 11:34:28.43 makeflow[4598] batch: bsub -o /dev/null -e /dev/null -env all -J makeflow0 -M 38000 -n 12 lsf.wrapper
2020/12/04 11:34:28.45 makeflow[4598] batch: job 67593160 submitted
2020/12/04 11:34:28.45 makeflow[4598] makeflow: node 0 was successfully submitted.
2020/12/04 11:34:28.45 makeflow[4598] makeflow: node 0 waiting -> running
Then, about 30s passed while makeflow was waiting for the job to come alive:
2020/12/04 11:34:28.45 makeflow[4598] batch: could not open status file "lsf.status.67593160"
Those repeated messages might be a little misleading. Makeflow is looking for that status file, but it won't get created until the job starts to run in LSF. So, a certain number of those are expected. It looks like you cancelled makeflow after about 30s.
Could you determine what happened to job 67593160 in LSF? I believe the bhist -l command does that. (And apologies, I'm just going on the documentation; I haven't done it myself.)
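Concretely, using the job id from the debug log:
bhist -l 67593160    # full event history for the job
bjobs -l 67593160    # current status, if LSF still knows about the job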
Sorry, I probably should've run it longer, but I wanted to keep the file short. This is one of many runs I did. In some of the runs I let it go for about 30 minutes, and it was the same thing. I can run it longer and send you the longer debug.log. As far as I can tell, the job died quickly.
Here's the output from bjobs -l 67593160:
Job <67593160>, Job Name
Ok, we are making progress: it looks like the wrapper script lsf.wrapper is exiting right away without running the job.
That makes me suspect it is not receiving the desired environment variables.
Let's try a few debugging steps.
Could you please try running this command directly on the head node?
bsub -o output.out -e error.out -env all -J makeflow0 -M 38000 -n 12 /usr/bin/env
And let me know what the output is?
The exit code 127 usually means that an executable was not found (in the PATH or otherwise). One reason could be that lsf.wrapper starts in the incorrect working directory. See how for Torque and PBS we have:
if(q->type == BATCH_QUEUE_TYPE_TORQUE || q->type == BATCH_QUEUE_TYPE_PBS){
	fprintf(file, "cd %s\n", path);
}
I think that is addressed by the -cwd option to bsub, but of course it's good to verify everything...
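A hedged sketch of the two alternatives under discussion (the path is illustrative):
# Option 1: have the generated wrapper cd before running anything,
# mirroring the Torque/PBS fprintf above:
#   cd /home/larryns/tmp
# Option 2: set the working directory at submit time instead:
bsub -cwd /home/larryns/tmp -o out.log -e err.log ./lsf.wrapper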
Okay, here we go. No output in error.out. Here's output.out:
Sender: LSF System lsfadmin@node193.hpc.local
Subject: Job 67593370:
Job
Successfully completed.
Resource usage summary:
CPU time : 0.07 sec.
Max Memory : -
Average Memory : -
Total Requested Memory : -
Delta Memory : -
Max Swap : -
Max Processes : -
Max Threads : -
Run time : 7 sec.
Turnaround time : 2 sec.
The output (if any) follows:
RM_CPUTASK9=34 RM_CPUTASK8=72 MANPATH=/usr/share/lsf/10.1/man: LSB_EXEC_CLUSTER=pennhpc HOSTNAME=node193.hpc.local LSB_EFFECTIVE_RSRCREQ=select[type == any ] order[r15s:pg] span[ptile='!'] same[model] affinity[thread(1)1] LSF_LIM_API_NTRIES=1 LSF_LOGDIR=/usr/share/lsf/log LSB_BATCH_JID=67593370 SHELL=/bin/bash HISTSIZE=1000 RM_CPUTASK1=20 SSH_CLIENT=10.212.134.105 54873 22 LS_EXECCWD=/home/larryns/tmp LSB_TRAPSIGS=trap # 15 10 12 2 1 CONDA_SHLVL=1 LS_JOBPID=159915 LSB_ERRORFILE=error.out RM_CPUTASK3=28 CONDA_PROMPT_MODIFIER=(base) LSB_JOBRES_CALLBACK=48664@node193.hpc.local LSB_MAX_NUM_PROCESSORS=12 RM_CPUTASK2=66 LSB_JOB_EXECUSER=larryns LSB_JOBID=67593370 RM_CPUTASK5=30 LSF_SERVERDIR=/usr/share/lsf/10.1/linux3.10-glibc2.17-x86_64/etc LSB_JOBRES_PID=159915 RM_CPUTASK4=68 LSB_JOBNAME=makeflow0 RM_CPUTASK7=32 SSH_TTY=/dev/pts/43 BSUB_BLOCK_EXEC_HOST= RM_CPUTASK6=70 LSF_LIBDIR=/usr/share/lsf/10.1/linux3.10-glibc2.17-x86_64/lib USER=larryns LSB_PROJECT_NAME=default LS_COLORS=rs=0:di=38;5;27:ln=38;5;51:mh=44;38;5;15:pi=40;38;5;11:so=38;5;13:do=38;5;5:bd=48;5;232;38;5;11:cd=48;5;232;38;5;3:or=48;5;232;38;5;9:mi=05;48;5;232;38;5;15:su=48;5;196;38;5;15:sg=48;5;11;38;5;16:ca=48;5;196;38;5;226:tw=48;5;10;38;5;16:ow=48;5;10;38;5;21:st=48;5;21;38;5;15:ex=38;5;34:.tar=38;5;9:.tgz=38;5;9:.arj=38;5;9:.taz=38;5;9:.lzh=38;5;9:.lzma=38;5;9:.tlz=38;5;9:.txz=38;5;9:.zip=38;5;9:.z=38;5;9:.Z=38;5;9:.dz=38;5;9:.gz=38;5;9:.lz=38;5;9:.xz=38;5;9:.bz2=38;5;9:.tbz=38;5;9:.tbz2=38;5;9:.bz=38;5;9:.tz=38;5;9:.deb=38;5;9:.rpm=38;5;9:.jar=38;5;9:.rar=38;5;9:.ace=38;5;9:.zoo=38;5;9:.cpio=38;5;9:.7z=38;5;9:.rz=38;5;9:.jpg=38;5;13:.jpeg=38;5;13:.gif=38;5;13:.bmp=38;5;13:.pbm=38;5;13:.pgm=38;5;13:.ppm=38;5;13:.tga=38;5;13:.xbm=38;5;13:.xpm=38;5;13:.tif=38;5;13:.tiff=38;5;13:.png=38;5;13:.svg=38;5;13:.svgz=38;5;13:.mng=38;5;13:.pcx=38;5;13:.mov=38;5;13:.mpg=38;5;13:.mpeg=38;5;13:.m2v=38;5;13:.mkv=38;5;13:.ogm=38;5;13:.mp4=38;5;13:.m4v=38;5;13:.mp4v=38;5;13:.vob=38;5;13:.qt=38;5;13:.nuv=38;5;13:.wmv=38;5;13:.asf=38;5;13:.rm=38;5;13:.rmvb=38;5;13:.flc=38;5;13:.avi=38;5;13:.fli=38;5;13:.flv=38;5;13:.gl=38;5;13:.dl=38;5;13:.xcf=38;5;13:.xwd=38;5;13:.yuv=38;5;13:.cgm=38;5;13:.emf=38;5;13:.axv=38;5;13:.anx=38;5;13:.ogv=38;5;13:.ogx=38;5;13:.aac=38;5;45:.au=38;5;45:.flac=38;5;45:.mid=38;5;45:.midi=38;5;45:.mka=38;5;45:.mp3=38;5;45:.mpc=38;5;45:.ogg=38;5;45:.ra=38;5;45:.wav=38;5;45:.axa=38;5;45:.oga=38;5;45:.spx=38;5;45:*.xspf=38;5;45: LD_LIBRARY_PATH=/usr/share/lsf/10.1/linux3.10-glibc2.17-x86_64/lib:/usr/share/lsf/10.1/linux2.6-glibc2.3-x86_64/lib SBD_KRB5CCNAME_VAL= LSB_EEXEC_REAL_UID= CONDA_EXE=/home/larryns/miniconda3/bin/conda HOSTTYPE=X86_64 LSF_INVOKE_CMD=bsub LS_EXEC_T=START LS_SUBCWD=/home/larryns/tmp LSF_VERSION=34 LSB_HOSTS=node193.hpc.local node193.hpc.local node193.hpc.local node193.hpc.local node193.hpc.local node193.hpc.local node193.hpc.local node193.hpc.local node193.hpc.local node193.hpc.local node193.hpc.local node193.hpc.local LSB_UNIXGROUP_INT=larryns _CE_CONDA= LSB_JOBFILENAME=/home/larryns/.lsbatch/1607102206.67593370 LSB_JOBINDEX=0 PATH=/usr/share/lsf/10.1/linux3.10-glibc2.17-x86_64/bin:/home/larryns/cctools-test/bin:/home/larryns/miniconda3/bin:/home/larryns/miniconda3/condabin:/usr/share/lsf/10.1/linux2.6-glibc2.3-x86_64/etc:/usr/share/lsf/10.1/linux2.6-glibc2.3-x86_64/bin:/usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin:/home/larryns/.local/bin:/home/larryns/bin MAIL=/var/spool/mail/larryns LSB_EXIT_PRE_ABORT=99 LSB_JOBEXIT_STAT=0 CONDA_PREFIX=/home/larryns/miniconda3 PWD=/home/larryns/tmp 
LSB_RES_GET_FANOUT_INFO=Y LANG=en_US.UTF-8 LSB_CHKFILENAME=/home/larryns/.lsbatch/1607102206.67593370 LSB_DJOB_HOSTFILE=/home/larryns/.lsbatch/1607102206.67593370.hostfile LSF_JOB_TIMESTAMP_VALUE=1607102207 RM_CPUTASK10=74 LSB_AFFINITY_HOSTFILE=/home/larryns/.lsbatch/1607102206.67593370.hostAffinityFile RM_CPUTASK11=36 LSB_DJOB_NUMPROC=12 LSB_EXEC_HOSTTYPE=X86_64 RM_CPUTASK12=76 LSF_BINDIR=/usr/share/lsf/10.1/linux3.10-glibc2.17-x86_64/bin HISTCONTROL=ignoredups _CE_M= HOME=/home/larryns SHLVL=2 JOB_TERMINATE_INTERVAL=10 LSB_ACCT_FILE=/gpfs/fs02/LSF_JOB_DIRS/LSF_JOB_TMPDIR/node193.hpc.local/67593370.tmpdir/.1607102206.67593370.acct BINARY_TYPE_HPC= LSB_SUB_HOST=consign.hpc.local LSF_JOB_TMPDIR=/gpfs/fs02/LSF_JOB_DIRS/LSF_JOB_TMPDIR/node193.hpc.local/67593370.tmpdir LSB_SUB_USER=larryns LSFUSER=larryns LSB_OUTDIR=/home/larryns/tmp LSB_QUEUE=normal LSB_MCPU_HOSTS=node193.hpc.local 12 LSB_OUTPUTFILE=output.out LOGNAME=larryns CONDA_PYTHON_EXE=/home/larryns/miniconda3/bin/python CVS_RSH=ssh SSH_CONNECTION=10.212.134.105 54873 172.16.103.23 22 LSF_CGROUP_TOPDIR_KEY=pennhpc LESSOPEN=||/usr/bin/lesspipe.sh %s CONDA_DEFAULT_ENV=base LSB_XFER_OP= LSB_EEXEC_REAL_GID= DISPLAY=consign.hpc.local:13.0 LSB_BIND_CPU_LIST=20,28,30,32,34,36,66,68,70,72,74,76 LSF_ENVDIR=/usr/share/lsf/conf LSB_DJOB_RANKFILE=/home/larryns/.lsbatch/1607102206.67593370.hostfile G_BROKENFILENAMES=1 =/usr/bin/env
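(The variable of interest can be pulled out of that dump with a quick grep, for example:)
grep -o 'LSB_JOBID=[0-9]*' output.out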
Ok, that's helpful, and it tells me that the LSB_JOBID is getting set appropriately. Now to check Ben's question above, could you try this:
mkdir xyz123
cd xyz123
bsub -o output.out -e error.out -env all -cwd /usr/bin/pwd
(And if you are getting sick of debugging by carrier pigeon, let me know and we can do something more interactive.)
You have to specify an argument for -cwd, so I ran:
bsub -o output.out -e error.out -env all -cwd ${PWD} /usr/bin/pwd
output.out:
Sender: LSF System lsfadmin@node188.hpc.local
Subject: Job 67593587: </usr/bin/pwd> in cluster
Job </usr/bin/pwd> was submitted from host
Successfully completed.
Resource usage summary:
CPU time : 0.06 sec.
Max Memory : -
Average Memory : -
Total Requested Memory : -
Delta Memory : -
Max Swap : -
Max Processes : -
Max Threads : -
Run time : 7 sec.
Turnaround time : 3 sec.
The output (if any) follows:
/home/larryns/xyz123
Aha, I see the problem: @btovar was right and I missed the -cwd in the first place. Just fixed that in the source.
Ok, you will have to pull and rebuild a new version:
cd [path-to-cctools-src]
git pull origin master
make clean
make all
make install
Then go back to your workflow directory, delete lsf.wrapper, and re-run your workflow from the beginning. Let me know how it goes.
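Spelled out, with test.mf standing in for the real workflow file:
cd /path/to/your/workflow
rm -f lsf.wrapper
makeflow -c test.mf                           # clean up state from the earlier runs
makeflow -d all -o debug.log -T lsf test.mf   # re-run from the beginning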
(Appreciate your patience on this; it's always a bit tricky getting things going when the system isn't directly at hand.)
Okay, same problem. It looks from the job info that the job is running in the right path. Granted, I could be doing something wrong.
bjobs -l 67594031
Job <67594031>, Job Name
Hmm, do you mind if we sync up and work on this interactively?
Sure no problem. What do you have in mind for chat?
Coordinates coming by email...
After some legwork, we found several interacting problems:
1 - The wrapper file has a stray EOF produced at the very end, which seems to be a leftover from some prior bit of code -- that needs to be removed.
2 - The new feature that produces alive messages in the log works, but it results in the wrapper script having a failure exit value because the process is killed. The exit status of the wrapper script should be the exit status of the job itself.
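A hedged sketch of the fix for problem 2, with made-up names (the real wrapper is generated by makeflow): the heartbeat runs in the background, and the wrapper exits with the job's status rather than the status of the killed heartbeat.
# start the heartbeat in the background
( while true; do echo alive >> "lsf.status.$LSB_JOBID"; sleep 5; done ) &
heartbeat=$!
# run the actual job and remember its exit status
./the_actual_job
job_status=$?
# stop the heartbeat, then report the job's status, not the heartbeat's
kill "$heartbeat" 2>/dev/null
exit "$job_status"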
Will take a look at this again on Monday...
@larryns please pull the latest version of makeflow, then make clean and make install, and try again. If for some reason it doesn't work the first time, then add --disable-heartbeat to the makeflow command line and see if that helps.
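That is, the same command as before, with the new flag added if needed:
makeflow -d all -o debug.log -T lsf --disable-heartbeat test.mf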
@dthain Nope, sorry, with or without --disable-heartbeat it didn't work. I removed the source entirely and did:
git clone https://github.com/cooperative-computing-lab/cctools
cd cctools
./configure --prefix $HOME/cctools-test
make install
as before. Then I ran makeflow with and without --disable-heartbeat, but neither worked. Do you need to see the debug.log?
Thanks.
Darn. Yes, please share the various files via gist as before...
@dthain Ok here's the new gist:
https://gist.github.com/larryns/6e4b9140870468c6da1bf9fa29aa2bb0
On a quick call this morning, we narrowed it down to a missing dot-slash on the wrapper file. Apparently LSF does a PATH search on the bsub argument, and so it misses things in the current working directory.
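Illustrating the failure mode and the fix (other flags elided):
bsub -o out.log -e err.log lsf.wrapper     # fails: the PATH search does not include "."
bsub -o out.log -e err.log ./lsf.wrapper   # works: explicit path to the wrapper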
@larryns I'm closing this one out because it appears to be working. Of course, let us know if you have any further trouble.
@dthain So far, so good. I've been installing makeflow from conda. Any idea when a new conda package will be made?
@btovar can you make a Conda release sometime this week to roll up a few recent changes?
will do
Thanks!
Add support for the LSF batch system to Makeflow, and work with Larry Singh to make sure that it works for his particular LSF system and workflow.
At first glance, this looks like a straightforward update of batch_job_cluster.c: add LSF to the batch type enumeration, and add cases for running bsub and the appropriate commands to execute each makeflow job. Consult the LSF user documentation for the key ideas: https://www.ibm.com/support/knowledgecenter/SSWRJV_10.1.0/lsf_admin_foundations/working_lsf.html
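As a hedged sketch of the command shapes such an update has to generate and parse, based on the LSF documentation and the examples that appear earlier in this thread:
bsub -J makeflow0 -o /dev/null -e /dev/null -M 38000 -n 12 ./lsf.wrapper   # submit; prints "Job <id> is submitted to default queue <normal>."
bjobs -l <jobid>                                                           # query a job's status
bkill <jobid>                                                              # remove a job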