bcbio / bcbio-nextgen

Validated, scalable, community developed variant calling, RNA-seq and small RNA analysis
https://bcbio-nextgen.readthedocs.io
MIT License

ipython-cluster-helper jobs not being sent to SGE. #215

Closed billyziege closed 10 years ago

billyziege commented 10 years ago

I've been working on and off on this issue for some time (since IPython support was added to bcbio-nextgen), and it is one of two issues I can't seem to figure out. Our cluster is set up so that there are only two submit hosts, the head node and one of the internal nodes (I requested this from our sys admin so I could run bcbio_nextgen.py from that node). I'm able to get the example.py script in the ipython-cluster-helper examples directory to work

python example/example.py --scheduler sge --queue ihg --num_jobs 2

but when I specify the same configuration for bcbio_nextgen.py, v0.7.5, with the flags

bcbio_nextgen.py -t ipython -n 2 -s sge -q ihg -r pename=parallel

I get the following output when capturing stderr:

[2013-12-11 16:21] ihg-node-27: Using input YAML configuration: /mnt/speed/qc/sequencing/pipeline_testing/multi_reduced_2_4/sample.yaml
[2013-12-11 16:21] ihg-node-27: Checking sample YAML configuration: /mnt/speed/qc/sequencing/pipeline_testing/multi_reduced_2_4/sample.yaml
[2013-12-11 16:21] ihg-node-27: Preparing 1_20131105_MVC2
[2013-12-11 16:21] ihg-node-27: Preparing 2_20131105_MVC2
[2013-12-11 16:21] ihg-node-27: Resource query function not implemented for scheduler "sge"; submitting job to queue
[2013-12-11 16:21] ihg-node-27: ipython: machine_info
2013-12-11 16:21:40.085 [IPClusterStart] Config changed:
2013-12-11 16:21:40.086 [IPClusterStart] {'BcbioSGEControllerLauncher': {'mem': '2.2', 'resources': 'hostname=ihg-node-27'}, 'SGELauncher': {'queue': 'ihg'}, 'BaseParallelApplication': {'log_to_file': True, 'cluster_id': u'd66b208f-d491-44f6-a4b2-f15f2efecff1'}, 'IPClusterEngines': {'early_shutdown': 240}, 'Application': {'log_level': 10}, 'ProfileDir': {'location': u'/mnt/speed/qc/sequencing/pipeline_testing/multi_reduced_2_4/log/ipython'}, 'BcbioSGEEngineSetLauncher': {'mem': '2.2', 'cores': 1, 'pename': u'parallel', 'resources': 'hostname=ihg-node-27'}, 'IPClusterStart': {'delay': 10, 'controller_launcher_class': u'cluster_helper.cluster.BcbioSGEControllerLauncher', 'daemonize': True, 'engine_launcher_class': u'cluster_helper.cluster.BcbioSGEEngineSetLauncher', 'n': 1}}
2013-12-11 16:21:40.087 [IPClusterStart] Using existing profile dir: u'/mnt/speed/qc/sequencing/pipeline_testing/multi_reduced_2_4/log/ipython'
2013-12-11 16:21:40.087 [IPClusterStart] Searching path [u'/mnt/speed/qc/sequencing/pipeline_testing/multi_reduced_2_4', u'/mnt/speed/qc/sequencing/pipeline_testing/multi_reduced_2_4/log/ipython'] for config files
2013-12-11 16:21:40.087 [IPClusterStart] Attempting to load config file: ipython_config.py
2013-12-11 16:21:40.087 [IPClusterStart] Config file ipython_config.py not found 
2013-12-11 16:21:40.087 [IPClusterStart] Attempting to load config file: ipcluster_d66b208f_d491_44f6_a4b2_f15f2efecff1_config.py 
2013-12-11 16:21:40.088 [IPClusterStart] Config file not found, skipping: ipcontroller_config.py
2013-12-11 16:21:40.088 [IPClusterStart] Attempting to load config file: ipcluster_d66b208f_d491_44f6_a4b2_f15f2efecff1_config.py 
2013-12-11 16:21:40.088 [IPClusterStart] Config file not found, skipping: ipengine_config.py
2013-12-11 16:21:40.088 [IPClusterStart] Attempting to load config file: ipcluster_d66b208f_d491_44f6_a4b2_f15f2efecff1_config.py
2013-12-11 16:21:40.088 [IPClusterStart] Config file not found, skipping: ipcluster_config.py
2013-12-11 16:37:14.876 [IPClusterStop] Using existing profile dir: u'/mnt/speed/qc/sequencing/pipeline_testing/multi_reduced_2_4/log/ipython'
2013-12-11 16:37:14.899 [IPClusterStop] Stopping cluster [pid=32299] with [signal=2]
Traceback (most recent call last):
  File "/home/sequencing/.virtual_envs/sequencing_pipeline_devel/bin/bcbio_nextgen.py", line 5, in <module>
    pkg_resources.run_script('bcbio-nextgen==0.7.5', 'bcbio_nextgen.py')
  File "/home/sequencing/.virtual_envs/sequencing_pipeline_devel/lib/python2.7/site-packages/setuptools-0.6c11-py2.7.egg/pkg_resources.py", line 489, in run_script
  File "/home/sequencing/.virtual_envs/sequencing_pipeline_devel/lib/python2.7/site-packages/setuptools-0.6c11-py2.7.egg/pkg_resources.py", line 1207, in run_script
  File "/home/sequencing/.virtual_envs/sequencing_pipeline_devel/lib/python2.7/site-packages/bcbio_nextgen-0.7.5-py2.7.egg/EGG-INFO/scripts/bcbio_nextgen.py", line 54, in <module>
    main(**kwargs)
  File "/home/sequencing/.virtual_envs/sequencing_pipeline_devel/lib/python2.7/site-packages/bcbio_nextgen-0.7.5-py2.7.egg/EGG-INFO/scripts/bcbio_nextgen.py", line 38, in main
    run_main(**kwargs)
  File "/home/sequencing/.virtual_envs/sequencing_pipeline_devel/lib/python2.7/site-packages/bcbio_nextgen-0.7.5-py2.7.egg/bcbio/pipeline/main.py", line 49, in run_main
    fc_dir, run_info_yaml)
  File "/home/sequencing/.virtual_envs/sequencing_pipeline_devel/lib/python2.7/site-packages/bcbio_nextgen-0.7.5-py2.7.egg/bcbio/pipeline/main.py", line 90, in _run_toplevel
    pipeline_items = _add_provenance(pipeline_items, dirs, run_parallel, parallel, config)
  File "/home/sequencing/.virtual_envs/sequencing_pipeline_devel/lib/python2.7/site-packages/bcbio_nextgen-0.7.5-py2.7.egg/bcbio/pipeline/main.py", line 99, in _add_provenance
    system.write_info(dirs, run_parallel, parallel, config)
  File "/home/sequencing/.virtual_envs/sequencing_pipeline_devel/lib/python2.7/site-packages/bcbio_nextgen-0.7.5-py2.7.egg/bcbio/provenance/system.py", line 33, in write_info
    minfos = _get_machine_info(parallel, run_parallel, sys_config)
  File "/home/sequencing/.virtual_envs/sequencing_pipeline_devel/lib/python2.7/site-packages/bcbio_nextgen-0.7.5-py2.7.egg/bcbio/provenance/system.py", line 58, in _get_machine_info
    return run_parallel("machine_info", [[sys_config]])
  File "/home/sequencing/.virtual_envs/sequencing_pipeline_devel/lib/python2.7/site-packages/bcbio_nextgen-0.7.5-py2.7.egg/bcbio/distributed/messaging.py", line 38, in run_parallel
    return ipython.runner(parallel, fn_name, items, dirs["work"], sysinfo, config)
  File "/home/sequencing/.virtual_envs/sequencing_pipeline_devel/lib/python2.7/site-packages/bcbio_nextgen-0.7.5-py2.7.egg/bcbio/distributed/ipython.py", line 321, in runner
    with _view_from_parallel(parallel, work_dir, config) as view:
  File "/home/sequencing/local/opt/python_2.7.5_ucs4/lib/python2.7/contextlib.py", line 17, in __enter__
    return self.gen.next()
  File "/home/sequencing/.virtual_envs/sequencing_pipeline_devel/lib/python2.7/site-packages/cluster_helper/cluster.py", line 671, in cluster_view
    raise IOError("Cluster startup timed out.")
IOError: Cluster startup timed out.

and the ipython/log file:

2013-12-11 16:21:40.097 [IPClusterStart] Starting ipcluster with [daemon=True]
2013-12-11 16:21:40.102 [IPClusterStart] Creating pid file: /mnt/speed/qc/sequencing/pipeline_testing/multi_reduced_2_4/log/ipython/pid/ipcluster-d66b208f-d491-44f6-a4b2-f15f2efecff1.pid
2013-12-11 16:21:40.103 [IPClusterStart] Starting Controller with cluster_helper.cluster.BcbioSGEControllerLauncher
2013-12-11 16:21:40.103 [IPClusterStart] Starting BcbioSGEControllerLauncher: ['qsub', u'./sge_controllerfc37f386-a519-4d32-9067-cd7016d164ac']
2013-12-11 16:21:40.103 [IPClusterStart] adding PBS queue settings to batch script
2013-12-11 16:21:40.103 [IPClusterStart] adding job array settings to batch script
2013-12-11 16:21:40.104 [IPClusterStart] Writing batch script: ./sge_controllerfc37f386-a519-4d32-9067-cd7016d164ac
2013-12-11 16:21:50.162 [IPClusterStart] Starting 1 Engines with cluster_helper.cluster.BcbioSGEEngineSetLauncher
2013-12-11 16:21:50.162 [IPClusterStart] Starting BcbioSGEEngineSetLauncher: ['qsub', u'./sge_engineb81ce772-f85a-4c52-a4d7-7cc7e250e2e9']
2013-12-11 16:21:50.163 [IPClusterStart] Writing batch script: ./sge_engineb81ce772-f85a-4c52-a4d7-7cc7e250e2e9
2013-12-11 16:37:14.900 [IPClusterStart] SIGINT received, stopping launchers...
2013-12-11 16:37:14.900 [IPClusterStart] ERROR | IPython cluster: stopping
2013-12-11 16:37:17.901 [IPClusterStart] Removing pid file: /mnt/speed/qc/sequencing/pipeline_testing/multi_reduced_2_4/log/ipython/pid/ipcluster-d66b208f-d491-44f6-a4b2-f15f2efecff1.pid

I interpret this as my failure occurring during _view_from_parallel in distributed/ipython.py, which calls cluster_view from cluster_helper. Since the example.py script also calls this cluster_view function and works, I was thinking the problem is not in the configuration of our system but somewhere in the bcbio-nextgen protocol that I can't figure out. The only thing I can think of right now is that I am not properly controlling where the script submits the controller and engine to SGE (does this happen from the same node bcbio-nextgen runs on?), because, remember, the only submit host I have is the place where bcbio-nextgen is running. Otherwise, any other suggestions would be much appreciated.

chapmanb commented 10 years ago

Thanks for the report and sorry about the issues. It looks like bcbio-nextgen is successfully creating the batch scripts for the controller (sge_controllerfc37f386-a519-4d32-9067-cd7016d164ac) and engine (sge_engineb81ce772-f85a-4c52-a4d7-7cc7e250e2e9), but they are not getting started by the cluster within the available timeout, which defaults to 15 minutes.

The best way to debug is to try manually submitting those two batch scripts at the command line (qsub sge_controllerfc37f386-a519-4d32-9067-cd7016d164ac) from the submission host and following qstat. If SGE allocates and starts up jobs but it takes longer than 15 minutes to get resources, you can increase the --timeout parameter. If there is a problem starting them up, are there any error messages or details from SGE that would help? You could also try adjusting the batch script to see if any of the SGE options in it are problematic.
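The startup wait that eventually raises that IOError can be pictured as a simple poll-until-deadline loop. This is only a sketch: the real cluster_helper code polls the IPython client for live engines rather than checking a file, and wait_for_cluster is a made-up name.

```shell
# Sketch of a startup wait like the one that raises "Cluster startup timed
# out.": poll for a readiness condition (here a marker file standing in for
# live engines) until a deadline passes.  wait_for_cluster is hypothetical.
wait_for_cluster() {
    ready_file=$1
    timeout_s=$2
    waited=0
    while [ ! -e "$ready_file" ]; do
        if [ "$waited" -ge "$timeout_s" ]; then
            echo "Cluster startup timed out." >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    echo "cluster ready"
}
```

Raising --timeout only moves the deadline; it helps when the engines do eventually start, not when they never reach the queue at all.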

In summary, it looks like all the machinery is right but there is some problem with the batch scripts getting picked up by SGE. Hope this helps with debugging it and thanks again.

billyziege commented 10 years ago

I've tried submitting the sge_controller and sge_engine scripts from the submit host after they are created. They go into the SGE queue and run (I can see this with qstat). However, when I submit a script that calls bcbio-nextgen to the submit host and wait for the jobs, nothing hits the queue.

In summary, the created files look fine and can be manually submitted, but bcbio-nextgen cannot submit anything that SGE sees.

Thanks for the help. I'll be debugging this and the other issue later today.



chapmanb commented 10 years ago

That is strange; I'm not totally sure what to suggest next. All IPython does is call qsub on the batch script. You can actually see the command in the log/ipython/log/ipcluster.log file you posted above. It does this from the machine where bcbio_nextgen.py is running. Is that the same machine you did your tests on? My knowledge of SGE setup is limited, but I wonder if there is a difference between running this inside an SGE process versus sshing into the machine. If you do some meta-testing by submitting a job that runs the qsub sge_controller command, does that work?

Sorry for all the problems. When you're able to identify the problem I'd love to include more documentation on this in the troubleshooting section (https://bcbio-nextgen.readthedocs.org/en/latest/contents/parallel.html#troubleshooting) for future users.

billyziege commented 10 years ago

Working on this again.

Good point. I thought qsub was being launched on the same machine as bcbio-nextgen, but I wanted an expert to confirm. I think this is exactly the issue with bcbio-nextgen. I'm contacting my sys admin, since I think this is our configuration issue as well. He'll help too, but any other advice you might have is really appreciated.

So this is what I've tried: qsub sge_controller# results in:

>>> qstat
job-ID  prior   name       user         state submit/start at     queue                          slots ja-task-ID 
-----------------------------------------------------------------------------------------------------------------
  51085 0.55500 ipcontroll sequencing   r     12/12/2013 10:06:16 ihg@ihg-node-44.ihg-internal.n     1 1

However, with the following code, submit_job.sh:

#!/bin/sh
#
#
# (c) 2009 Sun Microsystems, Inc. All rights reserved. Use is subject to license terms.  

# request Bourne shell as shell for job
#$ -S /bin/sh
#$ -cwd
#$ -j y

qsub $1

When I qsub -l hostname=ihg-node-27 submit_job.sh sge_controller#, I get the following error in the output:

/opt/sge625/sge/default/spool/ihg-node-27/job_scripts/51084: line 17: qsub: command not found

This tells me that the SGE commands are not available in the job, which is funny because it ran on ihg-node-27, a submit host. So I added the following line before qsub $1:

source /opt/sge625/sge/default/common/settings.sh

This makes the commands qsub, qdel, qstat, qconf, etc. available, which I thought happened automatically, but apparently it doesn't in this case. This is where my sys admin will be helpful. So, with this change, the output says:

Your job-array 51089.1-1:1 ("ipcontroller") has been submitted.

ipcontroller then hits the queue, runs, and promptly dies without any output, maybe because the script calling it (submit_job.sh) has ended? What do you think? If not, this is weird, because I have to kill sge_controller# manually when I submit it with qsub.

I'll also look into my own Python code that manages submitting bcbio_nextgen.py and some other tasks via qsub.

Thanks again.

billyziege commented 10 years ago

FYI: Yesterday, roryk and I debugged an issue with the ihg queue, the queue I use here at UCSF:

https://github.com/roryk/ipython-cluster-helper/issues/15

And we were able to get the example to work from the submit host internal node. So I tried the qsub route you suggested above for example.py, and it works! So this is only an issue for me within bcbio-nextgen?!! Strange. I'll keep plugging away.

billyziege commented 10 years ago

New idea: I think my version of ipython-cluster-helper is old. I build bcbio_nextgen.py from source (I download it from GitHub, reset to v0.7.5, and then run python setup.py build and python setup.py install). In the ipython-cluster-helper egg, the cluster/cluster.py file is OLD and does not have the fix we made yesterday. I tried to copy a corrected cluster.py file into the egg, but quite truthfully, I don't fully understand what an egg is (I came to programming from physics), so I'm pretty sure I did not connect everything properly. How do I implement the change from issue #15 that I pointed to in my previous post?

chapmanb commented 10 years ago

Good catch on this. I released a new ipython-cluster-helper version with the fixes and bumped the requirement in bcbio-nextgen. If you do:

bcbio_nextgen.py upgrade -u development

(or -u release if you're running the latest release) it will pull the latest ipython-cluster-helper with the fixes.

That being said, from your earlier description it looks like the blocking issue on your machine is that qsub is not automatically on the PATH in submitted jobs, so the bcbio-nextgen request to qsub the batch script from within IPython will fail. If you can get that resolved on the cluster, I hope these two fixes will get things working for you. Thanks for all the patience debugging this.
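One way to test the "qsub not on the PATH" theory from inside a batch job is a tiny probe like the one below; check_sge_env is a made-up name, and the settings.sh path mentioned is the site-specific one from this thread.

```shell
# check_sge_env: report whether the SGE client commands are visible in the
# current environment -- the check that implicitly failed inside the
# submitted wrapper job above.  The function name is hypothetical.
check_sge_env() {
    if command -v qsub >/dev/null 2>&1; then
        echo "qsub on PATH at $(command -v qsub)"
    else
        echo "qsub missing: source the SGE settings file first"
    fi
}

# Run as part of a submitted job this prints which case you are in;
# interactively on a submit host it should print the first line.
check_sge_env
```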

billyziege commented 10 years ago

Sorry for the long delay. I'm working on something else as well, and I return to the distribution issue when I have time.

So the blocking issue was really stupid. The script I attached was not the one I was actually submitting; that one was missing the shell specification, so of course it didn't work. Sometimes I'm a moron. qsub is on the path, though, and your advice would have been spot on if I had provided the correct info. Sorry about that.

So I ran the following command:

bcbio_nextgen.py upgrade -u stable

It didn't fix the issue; in fact it introduced a new one (https://github.com/chapmanb/bcbio-nextgen/issues/231), which I was able to get around. I also tried python setup.py build; python setup.py install, and still had the same issue. Looking at the cluster_helper/cluster.py file under site-packages in my pip's python-2.7/lib directory for the version of bcbio I'm running distributed, cluster.py still does not contain the updated code. So I am obviously doing something incorrectly. I'm working on this right now, and I'll comment again if I find a resolution.

billyziege commented 10 years ago

No, I'm wrong again. The cluster.py file IS updated. It must be something else...

billyziege commented 10 years ago

I'm comparing cluster_helper's example.py output to bcbio's output. I don't know if this makes any difference, but the profile dir for the bcbio-nextgen run lacks the .py config files. Specifically:

$ ls cluster_helper_profile_dir
ipcluster_config.py  ipcontroller_config.py  ipengine_config.py  iplogger_config.py  ipython_config.py  log  pid  security  startup

$ ls bcbio-nextgen_profile_dir
log  pid  security  startup

roryk commented 10 years ago

Hi Billy,

I was wondering if you could recap what the current issue is? Is the cluster startup still giving the timeout message? If so, do you see the ipengines and ipcontroller jobs running on your queue?

billyziege commented 10 years ago

Hi Rory,

Thanks for helping me with this. Here is the current situation --- it really hasn't changed.

  1. Still getting the timeout message.
  2. I do NOT see anything hitting SGE. When I run your example/example.py code, it does hit SGE and runs to completion.

Again, our system may be different from others. To prevent misuse, my sys admin has limited the number of submit hosts to the head-node and one internal node only. I run the example.py script on the internal submit host, and I submit the bcbio-nextgen script to the same internal submit host.

Thanks again.

roryk commented 10 years ago

Hi Billy,

No problem-- are you submitting the bcbio-nextgen job itself to the queue? That is, the bcbio_nextgen.py -t ipython -s sge command, is that itself getting submitted via qsub?

billyziege commented 10 years ago

Hi Rory,

Below is the qsub script I submit to the submit host. In short, it loads the appropriate environment, links the appropriate programs, and then runs bcbio_nextgen.py on the submit host. It's not elegant, but it allows me to version pretty easily. Since I am sending it to the submit host, it has access to qsub. I've double-checked that ipython-cluster-helper is actually running on the submit host (with print statements, as mentioned in my last comment):

#
#
# (c) 2009 Sun Microsystems, Inc. All rights reserved. Use is subject to license terms.  

# This is a simple example of a SGE batch script

# request Bourne shell as shell for job
#$ -S /bin/sh
#$ -cwd
#$ -j y
#$ -o /mnt/speed/qc/sequencing/sge_records/bcbio.$JOB_ID.stdout
#
#$ -hard -l mem_free=50G
# Send an email to myself when the job fails.
#$ -M zerbeb@humgen.ucsf.edu
#$ -m as
#$ -l q=ihg

echo "Start Date    = "`date`
echo "Hostname      = "`hostname`
echo "OS            = "`uname -a`
echo "job id        = $JOB_ID"
echo "job name      = $JOB_NAME"
echo "task id       = $SGE_TASK_ID"

#Designate the important pathways.
SEQUENCING_SRC=/home/sequencing/src
SEQUENCING_BIN=/home/sequencing/sequencing_pipeline/bin_07312013
ALLENV_BIN=/home/sequencing/.virtual_envs/sequencing_pipeline_devel/bin
USR_BIN=/mnt/speed/usr/bin #Originally for Rscript
JAVA_BIN=/home/sequencing/src/jdk1.7.0_25/bin #For Java 1.7
PYTHON_LIB=/home/sequencing/local/opt/python_2.7.5_ucs4/lib
PYTHON_BIN=/home/sequencing/local/opt/python_2.7.5_ucs4/bin

#Path to the java we're using and all of the other executables.
PATH=$PATH:$USR_BIN
PATH=$JAVA_BIN:$PATH
PATH=$SEQUENCING_BIN:$PATH
PATH=$PYTHON_BIN:$PATH
PATH=$PATH:$SEQUENCING_SRC/qualimap_v0.7.1
export PATH=$PATH
export LD_LIBRARY_PATH=$PYTHON_LIB:$LD_LIBRARY_PATH

#Necessary java classes for some of the scripts.
export CLASSPATH=$SEQUENCING_SRC/FastQC:$SEQUENCING_SRC/picard-1.93/classes:$SEQUENCING_SRC/qualimap_v0.7.1/qualimap.jar

#Activate env.
source $ALLENV_BIN/activate

#The fastq to vcf pipeline
SCRIPT=bcbio_nextgen.py
FLAGS="-t ipython -n 4 -s sge -q ihg -r pename=parallel"
SYSTEM_YAML="/mnt/speed/qc/sequencing/pipeline_testing/multi_reduced_2_4/system.yaml"
SAMPLE_YAML="/mnt/speed/qc/sequencing/pipeline_testing/multi_reduced_2_4/sample.yaml"
OUTPUT_DIR="/mnt/speed/qc/sequencing/pipeline_testing/multi_reduced_2_4"
OUTFILE="/mnt/speed/qc/sequencing/pipeline_testing/multi_reduced_2_4/bcbio.stdout"
ERRFILE="/mnt/speed/qc/sequencing/pipeline_testing/multi_reduced_2_4/bcbio.stderr"
cd $OUTPUT_DIR
echo $CLASSPATH > $OUTFILE
$SCRIPT $FLAGS $SYSTEM_YAML $SAMPLE_YAML >> $OUTFILE 2> $ERRFILE

#Create the complete file that marks that this process is done
COMPLETE_FILE="/mnt/speed/qc/sequencing/pipeline_testing/multi_reduced_2_4/bcbio.complete"
echo "$SCRIPT $FLAGS $SYSTEM_YAML $SAMPLE_YAML > $OUTFILE 2> $ERRFILE" > $COMPLETE_FILE

echo "Stop Date       = "`date`
exit 0

roryk commented 10 years ago

Hi Billy,

Are you wrapping the example.py from ipython-cluster-helper up in a similar submission script like that? I'm just trying to narrow down where the problem could be.

billyziege commented 10 years ago

Yup. It's more streamlined because it's smaller:

#!/bin/sh
#
#
# (c) 2009 Sun Microsystems, Inc. All rights reserved. Use is subject to license terms.  

# request Bourne shell as shell for job
#$ -S /bin/sh
#$ -cwd
#$ -j y

source /opt/sge625/sge/default/common/settings.sh

source /home/sequencing/.virtual_envs/sequencing_pipeline_12032013/bin/activate
export LD_LIBRARY_PATH=~/local/opt/python_2.7.5_ucs4/lib
python /home/sequencing/src/ipython-cluster-helper/example/example.py --scheduler sge --queue ihg --num_jobs 2

The bcbio-nextgen script runs fine if I don't ask it to distribute (i.e., if I don't pass -t ipython and the associated flags).

Thanks.

roryk commented 10 years ago

Thanks-- what if you pass --resources pename=parallel in the ipython-cluster-helper script?

billyziege commented 10 years ago

Good catch, but alas ... It still works.

However, this might be the correct route. If I remove -r pename=parallel from the bcbio-nextgen command, I get an OSError, apparently on qconf. This might come back to the SGE commands not being on the path --- which is exactly what Brad suggested. I'm trying some stuff now, and I'll let you know the results.

roryk commented 10 years ago

Hi Billy,

Great-- thanks for keeping us posted. We've definitely had the most trouble with SGE, so I am interested in whatever the resolution ends up being. Fingers crossed; it would be a good holiday gift to get you rolling.

billyziege commented 10 years ago

Let me say, you and Brad are awesome! Brad told me the answer two weeks ago, but I'm too dense sometimes...

This solution has actually been staring me in the face for about 9 months now. qsub should be available on the submit host since it is set up in the .bashrc file, but for some reason it is not. In all the other scripts I call qsub from, I need to source the SGE commands with the line source /opt/sge625/sge/default/common/settings.sh; I just wasn't doing this with your script.

This may be a system-dependent path, but when this line is added, the script works. So for other people using SGE, you might want to note this, and the fact that they need access to a submit host (if you don't already note this in the always-growing docs).
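The pattern that finally worked here can be wrapped up as a small helper. This is only a sketch: with_sge_env is a made-up name, and SGE_SETTINGS defaults to the site-specific path from this thread.

```shell
# with_sge_env CMD...: run CMD with the SGE client tools available, sourcing
# the site settings file first when qsub is not already on the PATH.  Needed
# because a batch job's shell never reads the user's .bashrc.
# with_sge_env is hypothetical; SGE_SETTINGS is the site-specific path.
SGE_SETTINGS=${SGE_SETTINGS:-/opt/sge625/sge/default/common/settings.sh}

with_sge_env() {
    if ! command -v qsub >/dev/null 2>&1 && [ -r "$SGE_SETTINGS" ]; then
        . "$SGE_SETTINGS"
    fi
    "$@"
}

# e.g. inside the submission wrapper above:
#   with_sge_env bcbio_nextgen.py -t ipython -n 4 -s sge -q ihg ...
```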

Thanks again for all the awesome help! Feel free to close this issue.

billyziege commented 10 years ago

It's not being sourced because that .bashrc file is user-dependent, so SGE doesn't see it!

roryk commented 10 years ago

Gotcha. Sweet, glad it is working! Boom.