Closed billyziege closed 10 years ago
Thanks for the report and sorry about the issues. It looks like bcbio-nextgen is successfully creating the batch scripts for the controller (sge_controllerfc37f386-a519-4d32-9067-cd7016d164ac) and engine (sge_engineb81ce772-f85a-4c52-a4d7-7cc7e250e2e9), but they are not getting started by the cluster within the available timeout, which defaults to 15 minutes.
The best way to debug is to try manually submitting those two batch scripts at the command line (qsub sge_controllerfc37f386-a519-4d32-9067-cd7016d164ac) from the submission host and following qstat. If SGE allocates and starts the jobs but takes longer than 15 minutes to provide resources, you can increase the --timeout parameter. If there is a problem starting them up, are there any error messages or details from SGE that would help? You could also try adjusting the batch script to see if any of the SGE options in it are problematic.
In summary, it looks like all the machinery is right but there is some problem with the batch scripts getting picked up by SGE. Hope this helps with debugging it and thanks again.
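The submit-and-watch loop described above can be sketched in Python. This is illustrative only: wait_for_job and poll_state are hypothetical names, not bcbio-nextgen's actual API; the 15-minute default mirrors the --timeout behavior described in this thread.

```python
import time

def wait_for_job(poll_state, timeout_minutes=15, interval=1.0, sleep=time.sleep):
    """Poll a job-state function (e.g. one wrapping `qstat`) until the job runs.

    poll_state() should return 'qw' (queued), 'r' (running), or None (gone).
    Returns True if the job started within the timeout, False otherwise.
    Hypothetical helper, not bcbio-nextgen's real implementation.
    """
    deadline = time.monotonic() + timeout_minutes * 60
    while time.monotonic() < deadline:
        if poll_state() == "r":
            return True
        sleep(interval)
    return False
```

If the job sits in 'qw' longer than the timeout before SGE finds resources, raising the --timeout value is the fix; if it never starts at all, the batch script's SGE options are the suspect.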
I've tried submitting the sge_controller and sge_engine scripts from the submit host after they are created. They go into the SGE queue and run (I see this with qstat). However, when I submit a script that calls bcbio-nextgen to the submit host and wait for the jobs, nothing hits the queue.
In summary, the created files look fine and can be manually submitted, but bcbio-nextgen cannot submit anything that is seen by SGE.
Thanks for the help. I'll be debugging this and the other issue later today.
From: Brad Chapman [notifications@github.com] Sent: Thursday, December 12, 2013 1:49 AM To: chapmanb/bcbio-nextgen Cc: Zerbe, Brandon Subject: Re: [bcbio-nextgen] Ipcluster-cluster-helper jobs not being sent to sge. (#215)
That is strange, I'm not totally sure what to suggest next. All IPython does is call qsub on the batch script. You can actually see the command in the log/ipython/log/ipcluster.log file you posted above. It does this from the actual machine where bcbio_nextgen.py is running. Is that the same machine you ran your tests on? My knowledge of SGE setup is super limited, but I wonder if there is a difference between running this inside of an SGE process versus sshing into the machine. If you do some meta-testing by submitting a job that runs the qsub sge_controller command, does that work?
Sorry for all the problems. When you're able to identify the problem I'd love to include more documentation on this in the troubleshooting section (https://bcbio-nextgen.readthedocs.org/en/latest/contents/parallel.html#troubleshooting) for future users.
Working on this again.
Good point. I thought qsub was being launched on the same machine as bcbio-nextgen, but I wanted an expert to confirm. I think this is exactly the issue with bcbio-nextgen. I'm contacting my sys admin, since I think this is our configuration issue as well. He'll help too, but any other advice you might have is really appreciated.
So this is what I've tried: qsub sge_controller# results in:
>>> qstat
job-ID prior name user state submit/start at queue slots ja-task-ID
-----------------------------------------------------------------------------------------------------------------
51085 0.55500 ipcontroll sequencing r 12/12/2013 10:06:16 ihg@ihg-node-44.ihg-internal.n 1 1
However, with the following code, submit_job.sh:
#!/bin/sh
#
#
# (c) 2009 Sun Microsystems, Inc. All rights reserved. Use is subject to license terms.
# request Bourne shell as shell for job
#$ -S /bin/sh
#$ -cwd
#$ -j y
qsub $1
When I run qsub -l hostname=ihg-node-27 submit_job.sh sge_controller#, I get the following error in the output:
/opt/sge625/sge/default/spool/ihg-node-27/job_scripts/51084: line 17: qsub: command not found
This tells me that the SGE commands are not available, which is funny because ihg-node-27 is a submit host. So I add the following line before qsub $1:
source /opt/sge625/sge/default/common/settings.sh
This puts the SGE commands (qsub, qdel, qstat, qconf, etc.) on the PATH, which I thought was automatic, but apparently isn't in this case. This is where my sys admin will be helpful. So, with this change, the output says:
Your job-array 51089.1-1:1 ("ipcontroller") has been submitted.
ipcontroller then hits the queue, runs, and promptly dies without any output, maybe because the script calling it (submit_job.sh) has ended? What do you think? If not, then this is weird because I have to kill sge_controller# when I submit it with qsub.
I'll also look into my own Python code that manages submitting bcbio_nextgen.py and some other tasks via qsub.
Thanks again.
FYI: Yesterday, roryk and I debugged an issue with the ihg queue, the queue I use here at UCSF:
https://github.com/roryk/ipython-cluster-helper/issues/15
And we were able to get the example to work from the internal submit host. So I tried the qsub route you suggested above for example.py, and it works! So this is only an issue for me within bcbio-nextgen?! Strange. I'll keep plugging away.
New idea: I think the version of ipython-cluster-helper is old. I built bcbio_nextgen.py from source (I download it from GitHub, reset to v0.7.5, and then python setup.py build/install). In the ipython-cluster-helper egg, the cluster/cluster.py file is OLD and does not have the fix we made yesterday. I tried to copy a corrected cluster.py file over to the egg, but quite truthfully, I don't fully understand what an egg is (I came to programming from physics), so I'm pretty sure I did not connect everything properly. How do I implement the change from issue #15 that I pointed to in my previous post?
Good catch on this. I released a new ipython-cluster-helper version with the fixes and bumped the requirement in bcbio-nextgen. If you do:
bcbio_nextgen.py upgrade -u development
(or -u release if you're running the latest release) it will pull in the latest ipython-cluster-helper with the fixes.
That being said, from your earlier description it looks like the blocking issue on your machine is that qsub is not automatically on the PATH of submitted jobs, so the bcbio-nextgen request to qsub the batch script from within IPython will fail. If you can get that resolved on the cluster, I hope these two fixes will get things working for you. Thanks for all the patience debugging this.
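As a quick sanity check for this, here is a hedged Python sketch of how a Python process would look for qsub. find_qsub is a hypothetical helper, and any fallback directory you pass in (such as this cluster's /opt/sge625 tree) is site-specific.

```python
import os
import shutil

def find_qsub(extra_dirs=()):
    """Return a path to qsub, checking PATH first, then site-specific
    fallback directories (e.g. an SGE bin directory); None if not found.
    Hypothetical helper for debugging, not part of bcbio-nextgen."""
    found = shutil.which("qsub")
    if found:
        return found
    for d in extra_dirs:
        candidate = os.path.join(d, "qsub")
        if os.access(candidate, os.X_OK):
            return candidate
    return None
```

Running this from inside a submitted job (e.g. via a one-line probe script) shows whether the job environment can see qsub at all.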
Sorry for the long delay. I'm working on something else as well, and I'll return to the distribution issue when I have time.
So the blocking issue was really stupid. The script I attached was not the one I was submitting. I was lacking the specification of the shell, so of course it didn't work. Sometimes I'm a moron. qsub is in the path though, and your advice would have been spot on if I had provided the correct info. Sorry about that.
So I ran the following command:
bcbio_nextgen.py upgrade -u stable
It didn't fix the issue; in fact it introduced a new one (https://github.com/chapmanb/bcbio-nextgen/issues/231), which I was able to get around. I also tried python setup.py build; python setup.py install, and still the same issue. Looking at the cluster_helper/cluster.py file in site-packages under my pip's python-2.7/lib directory, for the version of bcbio I'm running distributed, cluster.py still does not contain the updated code. So I am obviously doing something incorrectly. I'm working on this right now, and I'll re-comment if I find any resolution.
No, I'm wrong again. The cluster.py file IS updated. It must be something else...
I'm comparing cluster_helper's example.py output to bcbio output. I don't know if this makes any difference, but the profile dir for the bcbio-nextgen script lacks the .py files. Specifically:
$ls cluster_helper_profile_dir
ipcluster_config.py ipcontroller_config.py ipengine_config.py iplogger_config.py ipython_config.py log pid security startup
$ls bcbio-nextgen_profile_dir
log pid security startup
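To compare the two profile directories programmatically, a small sketch like this could flag the missing generated config files. The EXPECTED set is based on the cluster_helper_profile_dir listing above (IPython 1.x era), and missing_profile_configs is a hypothetical helper, not part of either project.

```python
import os

# Files IPython generates in a healthy parallel profile directory,
# per the cluster_helper_profile_dir listing above.
EXPECTED = {
    "ipcluster_config.py",
    "ipcontroller_config.py",
    "ipengine_config.py",
}

def missing_profile_configs(profile_dir):
    """Return the expected *_config.py files absent from profile_dir."""
    try:
        present = set(os.listdir(profile_dir))
    except OSError:
        present = set()
    return sorted(EXPECTED - present)
```

On the bcbio-nextgen_profile_dir listing above, all three config files would be reported missing.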
Hi Billy,
I was wondering if you could recap what the current issue is? Is the cluster startup still giving the timeout message? If so, do you see the ipengines and ipcontroller jobs running on your queue?
Hi Rory,
Thanks for helping me with this. Here is the current situation --- it really hasn't changed.
Again, our system may be different from others. To prevent misuse, my sys admin has limited the number of submit hosts to the head-node and one internal node only. I run the example.py script on the internal submit host, and I submit the bcbio-nextgen script to the same internal submit host.
Thanks again.
Hi Billy,
No problem-- are you submitting the bcbio-nextgen job itself to the queue? So the bcbio-nextgen -t ipython -s sge command, is that getting submitted?
Hi Rory,
Below is the qsub script I submit to the submit host. In short, it loads the appropriate environment, links the appropriate programs, and then runs bcbio_nextgen.py on the submit host. It's not elegant, but it allows me to version pretty easily. Since I am sending it to the submit host, it has access to qsub. I've double-checked that ipython-cluster-helper is actually running on the submit host (with print statements, as mentioned in my last comment):
#!/bin/sh
#
#
# (c) 2009 Sun Microsystems, Inc. All rights reserved. Use is subject to license terms.
# This is a simple example of a SGE batch script
# request Bourne shell as shell for job
#$ -S /bin/sh
#$ -cwd
#$ -j y
#$ -o /mnt/speed/qc/sequencing/sge_records/bcbio.$JOB_ID.stdout
#
#$ -hard -l mem_free=50G
# Send an email to myself when the job fails.
#$ -M zerbeb@humgen.ucsf.edu
#$ -m as
#$ -l q=ihg
echo "Start Date = "`date`
echo "Hostname = "`hostname`
echo "OS = "`uname -a`
echo "job id = $JOB_ID"
echo "job name = $JOB_NAME"
echo "task id = $SGE_TASK_ID"
#Designate the important pathways.
SEQUENCING_SRC=/home/sequencing/src
SEQUENCING_BIN=/home/sequencing/sequencing_pipeline/bin_07312013
ALLENV_BIN=/home/sequencing/.virtual_envs/sequencing_pipeline_devel/bin
USR_BIN=/mnt/speed/usr/bin #Originally for Rscript
JAVA_BIN=/home/sequencing/src/jdk1.7.0_25/bin #For Java 1.7
PYTHON_LIB=/home/sequencing/local/opt/python_2.7.5_ucs4/lib
PYTHON_BIN=/home/sequencing/local/opt/python_2.7.5_ucs4/bin
#Path to the java we're using and all of the other executables.
PATH=$PATH:$USR_BIN
PATH=$JAVA_BIN:$PATH
PATH=$SEQUENCING_BIN:$PATH
PATH=$PYTHON_BIN:$PATH
PATH=$PATH:$SEQUENCING_SRC/qualimap_v0.7.1
export PATH=$PATH
export LD_LIBRARY_PATH=$PYTHON_LIB:$LD_LIBRARY_PATH
#Necessary java classes for some of the scripts.
export CLASSPATH=$SEQUENCING_SRC/FastQC:$SEQUENCING_SRC/picard-1.93/classes:$SEQUENCING_SRC/qualimap_v0.7.1/qualimap.jar
#Activate env.
source $ALLENV_BIN/activate
#The fastq to vcf pipeline
SCRIPT=bcbio_nextgen.py
FLAGS="-t ipython -n 4 -s sge -q ihg -r pename=parallel"
SYSTEM_YAML="/mnt/speed/qc/sequencing/pipeline_testing/multi_reduced_2_4/system.yaml"
SAMPLE_YAML="/mnt/speed/qc/sequencing/pipeline_testing/multi_reduced_2_4/sample.yaml"
OUTPUT_DIR="/mnt/speed/qc/sequencing/pipeline_testing/multi_reduced_2_4"
OUTFILE="/mnt/speed/qc/sequencing/pipeline_testing/multi_reduced_2_4/bcbio.stdout"
ERRFILE="/mnt/speed/qc/sequencing/pipeline_testing/multi_reduced_2_4/bcbio.stderr"
cd $OUTPUT_DIR
echo $CLASSPATH > $OUTFILE
$SCRIPT $FLAGS $SYSTEM_YAML $SAMPLE_YAML >> $OUTFILE 2> $ERRFILE
#Create the complete file that marks that this process is done
COMPLETE_FILE="/mnt/speed/qc/sequencing/pipeline_testing/multi_reduced_2_4/bcbio.complete"
echo "$SCRIPT $FLAGS $SYSTEM_YAML $SAMPLE_YAML > $OUTFILE 2> $ERRFILE" > $COMPLETE_FILE
echo "Stop Date = "`date`
exit 0
Hi Billy,
Are you wrapping the example.py from ipython-cluster-helper up in a similar submission script like that? I'm just trying to narrow down where the problem could be.
Yup. It's more streamlined because it's smaller:
#!/bin/sh
#
#
# (c) 2009 Sun Microsystems, Inc. All rights reserved. Use is subject to license terms.
# request Bourne shell as shell for job
#$ -S /bin/sh
#$ -cwd
#$ -j y
source /opt/sge625/sge/default/common/settings.sh
source /home/sequencing/.virtual_envs/sequencing_pipeline_12032013/bin/activate
export LD_LIBRARY_PATH=~/local/opt/python_2.7.5_ucs4/lib
python /home/sequencing/src/ipython-cluster-helper/example/example.py --scheduler sge --queue ihg --num_jobs 2
The bcbio-nextgen script runs fine if I don't ask it to distribute (i.e. I don't have the -t ipython and associated flags).
Thanks.
Thanks-- what if you pass --resources pename=parallel in the ipython-cluster-helper script?
Good catch, but alas ... It still works.
However, this might be the correct route. If I remove the -r pename=parallel from the bcbio-nextgen script, I get an OSError apparently on qconf. This might come back to the sge commands not being in the path --- which is exactly what Brad suggested. I'm trying some stuff now, and I'll let you know the results.
Hi Billy,
Great-- thanks for keeping us posted; we've definitely had the most trouble with SGE, so I am interested in whatever the resolution ends up being. Fingers crossed. It would be a good holiday gift to get you rolling.
Let me say, you and Brad are awesome! Brad told me the answer two weeks ago, but I'm too dense sometimes...
This solution has actually been staring me in the face for about 9 months now. qsub should be available on the submit host since the SGE settings are sourced in my .bashrc file, but for some reason that isn't happening. In all other scripts I call qsub from, I need to source the SGE commands with the following line: source /opt/sge625/sge/default/common/settings.sh
I just wasn't doing this with your script.
This may be a system-dependent path, but when this line is added, the script works. So for other people using SGE, you might want to note this and the fact that they need access to a submit host (if you don't already note this in the always-growing docs).
Thanks again for all the awesome help! Feel free to close this issue.
It's not being sourced because that .bashrc file is user-dependent, so SGE batch jobs don't see it!
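For anyone hitting the same thing, one way to confirm from Python that sourcing the settings file exposes the SGE commands is a sketch like this. env_after_sourcing is a hypothetical helper, the settings.sh path mentioned above is specific to this cluster, and this is not what bcbio-nextgen itself does; it is just a verification tool.

```python
import subprocess

def env_after_sourcing(settings_sh):
    """Capture the environment a shell has after sourcing an SGE settings
    file (e.g. /opt/sge625/sge/default/common/settings.sh on this cluster).
    Lets you check whether PATH then contains the SGE bin directory, since
    batch jobs never read the user's .bashrc."""
    out = subprocess.check_output(
        ["sh", "-c", ". '{}' >/dev/null 2>&1; env".format(settings_sh)],
        universal_newlines=True,
    )
    env = {}
    for line in out.splitlines():
        if "=" in line:
            key, value = line.split("=", 1)
            env[key] = value
    return env
```

Checking `"qsub" in env_after_sourcing(path).get("PATH", "")` (roughly) would have flagged this problem immediately.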
Gotcha. Sweet, glad it is working! Boom.
I've actually been working on and off on this issue for some time (since IPython was added to bcbio-nextgen), and it is one of two issues I can't seem to figure out. Our cluster is set up so that there are only two submit hosts, the headnode and one of the internal nodes (I requested this from our sys admin so I could run bcbio_nextgen.py from that node). I'm able to get the example.py script in the ipython-cluster-helper examples directory to work, but specifying the same configuration for bcbio_nextgen.py, v0.7.5,
I get the following output from capturing the stderr:
and the ipython/log file:
I interpret that my failure is during _view_from_parallel in distributed/ipython.py, which calls cluster_view from cluster_helper. Since the example.py script also calls this cluster_view function and it works, I was thinking the problem is not in the configuration of our system, but instead somewhere in the bcbio-nextgen protocol that I can't figure out. The only thing I can think of right now is that I am not properly controlling where the script submits the controller and engine to SGE (is this from the same node where bcbio-nextgen runs?), because remember the only submit host I have is the place where bcbio-nextgen is running. Otherwise, any other suggestions would be much appreciated.