jmiezitis commented 4 years ago

Note I know nothing about how bcbio is run and works. I am an HPC systems engineer trying determine why bcbio is not running on multiple nodes in our cluster. The test data and config files have been provided to me by our researchers.

Version info

bcbio version (bcbio_nextgen.py --version): 1.1.9a
OS name and version: CentOS Linux release 7.7.1908
HPC Cluster PBSPro 19.1.3

To Reproduce

bcbio_nextgen.py ../config/test-gatk.yaml -t ipython -n 140 -s pbspro -q workq
Your sample configuration file: attached test-gatk.yaml.txt

Expected behavior & Observed behavior Given the line in the debug logs:

Configuring 5 jobs to run, using 28 cores each with 112.1g of memory reserved for each job

I expected 5 qsub jobs requesting 28 cores or one job requesting 5 nodes with 28 cores each. What I get is one job for the controller using one cpu another job for an engine with one cpu and one more engine with 28 cpus. This doesn't change throughout the run.

Log files bcbio-nextgen.log bcbio-nextgen.log.gz bcbio-nextgen-debug.log.gz bcbio-nextgen-commands.log.gz

Comments Note that /scratch is a CephFS system shared between all nodes.

roryk commented 4 years ago

Hi @jmiezitis,

Sorry about that-- we don't actually have access to a PBSPro scheduler to test fixes on so we'll have to do some back and forth debugging. In the work directory in bcbio there should be some -engine files which are what bcbio will submit to PBSPro to get that job to run. If you look at those engine files, do you see the one that is asking for these resources? Does the paramerization in the submission script do the right thing for your setup or are we doing it wrong?

roryk commented 4 years ago

Are the extra jobs pending? bcbio will start processing immediately once a single job in a job array is up, so if one job is up and the others are pending, and there is not a lot of data to process, the job might be over by the time the other jobs get resources from the scheduler.

ohofmann commented 4 years ago

I've seen this before when jobs failed to start on PBS Pro - the IPython log files will have pointers, or the psbpro output/error logs (per node) should have some details.

roryk commented 4 years ago

Thanks Oliver!

naumenko-sa commented 4 years ago

+ my 5 cents:

We are extremely proud having a built-in scheduler in bcbio and happy that you are using its multinode functionality! We are interested to get it tested in PBSPro.

However, the majority of bcbio analyses could be run on a single 50G x 7CPU node. Modern clusters have 128G x 24CPU, 256G x 48 CPU nodes. That is more than enough to run a typical analysis even with as many samples as (<100 RNA-seq, <10 WES, < 5WGS).

Using more cores across many nodes would not mean the linear increase in performance, see https://en.wikipedia.org/wiki/Amdahl%27s_law. A typical pipeline is 50% alignment which parellelises well, and many other steps are IO bound rather than CPU bound.

In the profiling effort https://bcbio-nextgen.readthedocs.io/en/latest/contents/internals.html#profiling I saw some variant calling cases with 2X CPU increase improved running time just by 10%. If you could contribute any profiling results, that would be fantastic.

Of course, everything depends on your typical use case. Multinode setup makes sense if you are running big variant calling cohorts and you cannot separate samples into individual bcbio runs.

Otherwise you may try to convince your users to run single node bcbio jobs: bcbio_nextgen.py ../config/bcbio.yaml -n <CORES> and use PBSPro to schedule jobs sample- and user-wise.

This also prevents cluster take over wars described here: https://github.com/bcbio/bcbio-nextgen/issues/2138

On busy clusters running_time = T1_waiting_in_the_queue + T2_actual_running_time Increasing resources requested decreases T2 and increases T1. In some cases one node bcbio jobs with the modest CPU requirements are picked up immediately by the scheduler and finish faster than greedy multinode jobs.

ohofmann commented 4 years ago

@naumenko-sa Single node would be a non-starter for us. A single sample takes ~70h on a 48 core node, well beyond the max queue limit many clusters enforce. Your utilisation comment points at bcbio's static allocation of node (which using CWL/Cromwell with bcbio gets around), but that's a different topic I think.

naumenko-sa commented 4 years ago

Thanks @ohofmann !

Can you share your typical turnaround time / bcbio launch method / use case?

I assume it is WES https://github.com/bcbio/bcbio-nextgen/blob/master/config/templates/tumor-paired.yaml?

DrMcStrange commented 4 years ago

Hi, I coordinate the bioinformatics on the cluster @jmiezitis runs, so I can weigh in on that end of things.

A lot of what we run is GATK variant calling on large WGS datasets (typically around 30-100 samples), so single-node isn't really an option for us.

roryk commented 4 years ago

Thanks, it should definitely be running these in parallel. The submission scripts in the work directory should have some clues; if you can look at those and see if they are doing the right thing for your scheduler that would be super helpful. The line:

Configuring 5 jobs to run, using 28 cores each with 112.1g of memory reserved for each job

should be submitting what it says, 5 jobs with 28 cores each, for a total of 140 cores allocated.

The plumbing for how bcbio sets up the script and submits the jobs for PBSPro is here:

https://github.com/roryk/ipython-cluster-helper/blob/07c2fcc9f17677a559d9566ad2259b47b5586bb4/cluster_helper/cluster.py#L695-L745

Could you post the submission script for the engine job that is running so we can see if it's doing the right thing or not?

jmiezitis commented 4 years ago

Thanks everyone for your comments. Very helpful. I will attempt to respond to all questions below. This cluster is fairly new and not in full use yet so there is plenty of capacity and no wait time for jobs to run.

The qsub scripts syntactically are correct apart from as noted below. In the order they are generated I get the following:

At 16:39 a controller 1 node 1 cpu cluster id a - starts one python process Nothing of note in output and error logs
At 16:39 an engine 1 node 1 cpu cluster id a - starts one python process Error log indicates No heartbeat and finishes with CRITICAL | Maximum number of heartbeats
At 16:40 a controller 1 node 1 cpu cluster id b - starts one python process Nothing of note in output and error logs
At 16:41 an engine 1 node 28 cpus cluster id b - starts one python process Error log indicates No heartbeat and finishes with CRITICAL | Maximum number of heartbeats
At 18:24 a controller 1 node 1 cpu cluster id c - starts one python process Nothing of note in output and error logs
At 18:25 an engine 1 node 28 cpus. - starts 28 python processes in parallel. There is a potential issue here but I don't think it is the cause of my problem more about this later. Error log indicates No heartbeat and finishes with CRITICAL | Maximum number of heartbeats
At 18:50 a controller 1 node 1 cpu cluster id d - starts one python process Nothing of note in output and error logs
At 18:51 an engine 1 node 28 cpus cluster id d - starts one python process Error log indicates No heartbeat and finishes with CRITICAL | Maximum number of heartbeats

So all engines have a heartbeat issue which I guess is an ipython thing. Anyway to debug that further? The nodes can communicate with each other on any high port. Tested with netcat. What mechanism does the heart beat use?

From one of the ipcluster log files I have: 2020-03-24 18:24:55.715 [IPClusterStart] Starting ipcluster with [daemon=True] 2020-03-24 18:24:55.725 [IPClusterStart] Creating pid file: /scratch/johnm/bcb-test/test-gatk/work02/log/ipython/pid/ipcluster-182f220a-a933-463d-83a5-4d594ff9f911.pid 2020-03-24 18:24:55.725 [IPClusterStart] Starting Controller with cluster_helper.cluster.BcbioPBSPROControllerLauncher 2020-03-24 18:24:55.727 [IPClusterStart] Starting BcbioPBSPROControllerLauncher: ['qsub', './pbspro_controllerf6b52a3d-911b-42f0-8b8a-e26c93e50aee'] 2020-03-24 18:24:55.727 [IPClusterStart] adding queue settings to batch script 2020-03-24 18:24:55.727 [IPClusterStart] adding job array settings to batch script 2020-03-24 18:24:55.727 [IPClusterStart] Writing batch script: ./pbspro_controllerf6b52a3d-911b-42f0-8b8a-e26c93e50aee 2020-03-24 18:24:55.804 [IPClusterStart] Job submitted with job id: '636' 2020-03-24 18:24:55.804 [IPClusterStart] Process 'qsub' started: '636' 2020-03-24 18:25:05.815 [IPClusterStart] Starting 5 Engines with cluster_helper.cluster.BcbioPBSPROEngineSetLauncher 2020-03-24 18:25:05.816 [IPClusterStart] adding queue settings to batch script 2020-03-24 18:25:05.816 [IPClusterStart] adding job array settings to batch script 2020-03-24 18:25:05.816 [IPClusterStart] Writing batch script: ./pbspro_engines3322940e-0e2e-4946-a9ac-8017b0db5090 2020-03-24 18:25:05.926 [IPClusterStart] ERROR | Engine start failed Traceback (most recent call last): File "/share/apps/mgcluster/bcbio-nextgen/master/anaconda/lib/python3.6/site-packages/ipyparallel/apps/ipclusterapp.py", line 338, in start_engines self.engine_launcher.start(self.n) File "/share/apps/mgcluster/bcbio-nextgen/master/anaconda/lib/python3.6/site-packages/cluster_helper/cluster.py", line 740, in start output = output.decode('ascii', 'ignore') AttributeError: 'str' object has no attribute 'decode' 2020-03-24 18:50:52.958 [IPClusterStart] SIGINT received, stopping launchers... 2020-03-24 18:50:53.057 [IPClusterStart] ERROR | IPython cluster: stopping

I guess this is associated with the engine heartbeat failed.

The ipcontroller logs seem to show things start to work with tasks arriving and finishing but they all finish with: [VMFixIPControllerApp] CRITICAL | Received signal 15, shutting down [VMFixIPControllerApp] CRITICAL | terminating children...

Probably a different issue

The potential issue with the way the multi-process job is that if the last process finishes before any of the other jobs the scheduler will see the job finish and tidy up any remaining processes. I would put a 'wait' at the end of the script so it waits until all processes have finished before exiting the job.

Here is a simplified version of what is in the qsub which can be used to test the above: ( python -c 'import time; print( "start 1" ); time.sleep( 5 ); print( "end 1" )' & ) && ( python -c 'import time; print( "start 2" ); time.sleep( 5 ); print( "end 2" )' & ) && ( python -c 'import time; print( "start 3" ); time.sleep( 1 ); print( "end 3" )' & )

The last sleep finishes in 1 second so the others stay running but you will be returned to the prompt. PBS will see this as the job has finished and terminate any remaining processes. This may not have been an issue for anybody if the other processes finish before PBS gets around to the tidy up. Should I make this a new issue?

naumenko-sa commented 4 years ago

Thanks for clarifying the use case!

Before running in parallel, I'd suggest to debug your config file using one tumor/normal pair, and scale up once you are confidently execute a small run (yaml is ok, you got to the final folder, no memory issues).

That would help to separate ipython issues from variant calling workflow issues and debug them apart.

in your bcbio-nextgen.log you have an issue:

2020-03-24T08:11Z] rosalind-01: ipython: run_peddy
[2020-03-24T08:12Z] rn249: Uncaught exception occurred
Traceback (most recent call last):
  File "/share/apps/mgcluster/bcbio-nextgen/master/anaconda/lib/python3.6/site-packages/bcbio/provenance/do.py", line 26, in run
    _do_run(cmd, checks, log_stdout, env=env)
  File "/share/apps/mgcluster/bcbio-nextgen/master/anaconda/lib/python3.6/site-packages/bcbio/provenance/do.py", line 106, in _do_run
    raise subprocess.CalledProcessError(exitcode, error_msg)
subprocess.CalledProcessError: Command 'set -o pipefail; export LC_ALL=en_US.utf8 && export LANG=en_US.utf8 &&  /share/apps/rosalind/bcbio-nextgen/master/tooldir/bin/peddy -p 28  --plot --prefix /scratch/johnm/bcb-test/test-gatk/work02/bcbiotx/tmpeyvm2g35/VC10110_1 /scratch/johnm/bcb-test/test-gatk/work02/gatk-haplotype/VC10110_1-effects-annotated-ploidyfix-nomissingalt-filterSNP-filterINDEL.vcf.gz /scratch/johnm/bcb-test/test-gatk/work02/gatk-haplotype/VC10110_1-effects-annotated-ploidyfix-nomissingalt-filterSNP-filterINDEL.ped 2> /scratch/johnm/bcb-test/test-gatk/work02/bcbiotx/tmpeyvm2g35/run-stderr.log
' returned non-zero exit status 1.
[2020-03-24T08:12Z] rn249: Skipping peddy because no variants overlap with checks: VC10110_1

We have seen this error before: https://github.com/bcbio/bcbio-nextgen/issues/2671, so hopefully it does not break your case, but still let us explore the yaml.

In test-gatk.yaml there are 10 samples, 5 'cancer' and 5 'control', so I'm assuming you are running tumor/normal somatic variant calling.

The basic element of your yaml is:

- algorithm:
    aligner: bwa
    recalibrate: gatk
    variantcaller: gatk-haplotype
  analysis: variant2
  description: VC10110_1
  files:
  - /scratch/johnm/bcb-test/fastq/VC10110_1_R1.fastq.gz
  - /scratch/johnm/bcb-test/fastq/VC10110_1_R2.fastq.gz
  genome_build: GRCh37
  metadata:
    phenotype: cancer
    sex: female

I see several potential issues here:

phenotype should be tumor/normal, not cancer/control
tumor/normal pairs should be marked with the same batch, i.e. having batch attribute: https://bcbio-nextgen.readthedocs.io/en/latest/contents/pipelines.html#cancer-variant-calling
gatk-haplotype does not understand tumor/normal, it is for germline variant calling, for tumor normal pairs you use variantcaller: mutect2

SN

DrMcStrange commented 4 years ago

This isn't tumour/normal somatic variant calling, but a case/control cohort for a cancer cluster that we think has a strong genetic component (this is a subset of the data for testing). So gatk-haplotype is appropriate.

I have seen that peddy error before when testing single-node runs (before I figured out we should be using the ipython option!). It didn't break the runs, so I don't think it's the root of our issue here.

So just to clarify, I have successfully run single-node jobs using this exact data and yaml, so I'm pretty confident the problem is with ipython and not the variant calling workflow.

ohofmann commented 4 years ago

I can't comment on your specific error, but in general when trying to debug cluster problems I try to:

restart the failed job on an interactive node (with just as many cores as the node has available)
if that works sift through the logs in ./work/log/ipython/log for any errors or tracebacks
try to manually submit the jobs defined in ./work/pbspro_* - the easiest way to get to errors such as 'no such node in queue X' or asking for walltimes that exceed the limit etc. Not suggesting it's something as trivial here

In the past we've also had issues when the networking between worker nodes and controllers was either blocked by the firewall, or in scenarios with multiple networking interfaces ended up using the wrong one.

naumenko-sa commented 4 years ago

Thanks @DrMcStrange for ruling out the workflow issue, so we can focus on the ipython issue! Thanks @ohofmann!

Looking at the logs, I finally understand the issue: the project runs successfully, you got the results in the final, and you are wondering why one engine is working rather than 5 engines.

@jmiezitis

You have:

one main bcbio job (1core)
one controller job (1core), which distributes tasks between engines and
one engine (28 cores)

What you'd expect: 1:1:5, 7 jobs in total.

Have you tried:

to check system limits ulimit -a https://bcbio-nextgen.readthedocs.io/en/latest/contents/parallel.html#open-file-handles
to increase timeout --timeout 600 --retries 3

SN

jmiezitis commented 4 years ago

Thanks @naumenko-sa @ohofmann for continuing to look at this. The pbspro_* qsub scripts work fine if manually submitted. The problem is that there isn't a script requesting as many resources as I thought there should be.

@naumenko-sa you are partially right, what I was really expecting is:

main bcbio job running on the login node. Not a qsubbed job.
one controller job 1 cpu 1 node
one engine job 140 cpus 5 nodes. Although this could also be 5 engines - 5 jobs 28 cpus 1 node each.

As a test I have configured and run ipcluster manually to confirm ipython was ok and basic PBS and network systems would support ipython and was able to get 1 controller job with 1 node 1 cpu + 1 engine job with 2 nodes and 56 cpus (total of 56 ipengines running).

I am confused by what I am seeing in the ipython logs and have attached two examples: ipcontroller-b8ef233c-3ae7-4228-98c1-7db3f8b313e3-2193243.log ipcluster-b8ef233c-3ae7-4228-98c1-7db3f8b313e3-2351980.log

In summary the ipcluster shows and error and says the "Engine start failed" however the corresponding controller clearly shows a series of tasks before terminating.

I have now checked ulimits etc they seem reasonable. I don't think the issue lies with the way the system is setup. It appears that bcbio is never requesting the resources I think it should be perhaps because of a setting in the yaml files or because of something we are missing in starting bcbio. These are just guesses I have been wrong many times with guesses and don't know enough about ipython or bcbio to make good guesses.

If anybody has some test data and config we could try working with I would be happy to give that ago.

Cheers. And thanks again for everyone's support with this. Very much appreciated.

ohofmann commented 4 years ago

I am slightly stumped. The IPython logs indicate worker nodes terminating (or rather being terminated by the cluster). Yet the job actually ran to completion based on the original bcbio log file. Going through the debug logs I see:

[2020-03-24T08:11Z] rn249: Running peddy on /scratch/johnm/bcb-test/test-gatk/work02/gatk-haplotype/VC10110_1-effects-annotated-ploidyfix-nomissingalt-filterSNP-filterINDEL.vcf.gz against /scratch/johnm/bcb-test/test-gatk/work02/gatk-haplotype/VC10110_1-effects-annotated-ploidyfix-nomissingalt-filterSNP-filterINDEL.ped.
[2020-03-24T08:12Z] rn249: Uncaught exception occurred
Traceback (most recent call last):
  File "/share/apps/mgcluster/bcbio-nextgen/master/anaconda/lib/python3.6/site-packages/bcbio/provenance/do.py", line 26, in run
    _do_run(cmd, checks, log_stdout, env=env)
  File "/share/apps/mgcluster/bcbio-nextgen/master/anaconda/lib/python3.6/site-packages/bcbio/provenance/do.py", line 106, in _do_run
    raise subprocess.CalledProcessError(exitcode, error_msg)
subprocess.CalledProcessError: Command 'set -o pipefail; export LC_ALL=en_US.utf8 && export LANG=en_US.utf8 &&  /share/apps/rosalind/bcbio-nextgen/master/tooldir/bin/peddy -p 28  --plot --prefix /scratch/johnm/bcb-test/test-gatk/work02/bcbiotx/tmpeyvm2g35/VC10110_1 /scratch/johnm/bcb-test/test-gatk/work02/gatk-haplotype/VC10110_1-effects-annotated-ploidyfix-nomissingalt-filterSNP-filterINDEL.vcf.gz /scratch/johnm/bcb-test/test-gatk/work02/gatk-haplotype/VC10110_1-effects-annotated-ploidyfix-nomissingalt-filterSNP-filterINDEL.ped 2> /scratch/johnm/bcb-test/test-gatk/work02/bcbiotx/tmpeyvm2g35/run-stderr.log
' returned non-zero exit status 1.

That itself should terminate the whole run but apparently bcbio continues on - but none of this explains why the additional nodes never start up. Or report that the engine failed to start when there's work being done.

Apologies, remotely debugging HPC is hard. The only thing I can think of at this point is to grab some of the toy examples from the bcbio docs and run them in parallel. If those work and the worker nodes come up we can rule a networking / message queue issue and focus on what's happening with this particular workflow...

jmiezitis commented 4 years ago

Using the examples from https://bcbio-nextgen.readthedocs.io/en/latest/contents/intro.html#cancer-tumor-normal-grch37 All I have done is added the following to the end of the cancer-dream-syn3.yaml file: resources: default: memory: 4G cores: 28 jvm_opts: ["-Xms750m", "-Xmx2000m"]

And run using the command: bcbio_nextgen.py ../config/cancer-dream-syn3.yaml -n 140 -s pbspro -q workq

I am seeing very similar behaviour i.e. 1 controller and only 1 engine with 28 cpus. Occasionally I will see a 2nd controller/engine pair startup so that we have 2 controllers and 2 engines but we never get close to 140 cpus being used.

Cheers.

ohofmann commented 4 years ago

Hmm. Here's what my standard PBS Pro submission script looks like:

#!/bin/bash
#PBS -P gx8
#PBS -q normal
#PBS -l walltime=48:00:00
#PBS -l mem=2GB
#PBS -l ncpus=1
#PBS -l software=bcbio
#PBS -l wd
#PBS -l storage=gdata/gx8
export PATH=/g/data3/gx8/local/production/bcbio/anaconda/bin:/g/data/gx8/local/production/bin:/opt/bin:/bin:/usr/bin:/opt/pbs/default/bin
bcbio_nextgen.py ../config/bcbio_system_normalgadi.yaml ../config/WORKFLOW.yaml -n 96 -q normal -s pbspro -t ipython -r 'walltime=48:00:00;noselect;jobfs=100GB;storage=scratch/gx8+gdata/gx8' --retries 1 --timeout 900

Yes, -l wd is less than ideal but was necessary for us to have bcbio make use of environmental variables without having to define them one by one; should probably clean that up.

roryk commented 4 years ago

Hi everyone,

Could you post one of the submission scripts that bcbio generates for these jobs so I can look at it?

I'm guessing the logic failure is here somewhere:

https://github.com/roryk/ipython-cluster-helper/blob/07c2fcc9f17677a559d9566ad2259b47b5586bb4/cluster_helper/cluster.py#L716-L727

but I need to see the submission script to know.

jmiezitis commented 4 years ago

@roryk Thanks for the link to the code. It looks like if you are using select (as we are) then you are always going to be given only one node. "select=1:ncpus=X" means request one node with X cpus per node.

I guess that is ok so long as there are then 5 engine jobs, as in our case, started to meet the number CPUs requested.

Here is the qsub script being generated for one of the engines:

!/bin/sh

PBS -q workq

PBS -V

PBS -S /bin/sh

PBS -N bcbio-e

PBS -l select=1:ncpus=28:mem=114790mb

PBS -l walltime=239:00:00

export LD_LIBRARY_PATH=/share/apps/software/binutils/2.28-GCCcore-6.4.0/lib:/share/apps/openmpi/3.1.1-foss/lib:/share/apps/software/GCCcore/6.4.0/lib/gcc/x86_64-pc-linux-gnu/6.4.0:/share/apps/software/GCCcore/6.4.0/lib64:/share/apps/software/GCCcore/6.4.0/lib:/opt/ohpc/pub/compiler/gcc/8.3.0/lib64 cd $PBS_O_WORKDIR export IPYTHONDIR=/scratch/johnm/bcb-test/test-gatk/work02/log/ipython /share/apps/mgcluster/bcbio-nextgen/master/anaconda/bin/python -E -c 'import resource; cur_proc, max_proc = resource.getrlimit(resource.RLIMIT_NPROC); target_proc = min(max_proc, 10240) if max_proc > 0 else 10240; resource.setrlimit(resource.RLIMIT_NPROC, (max(cur_proc, target_proc), max_proc)); cur_hdls, max_hdls = resource.getrlimit(resource.RLIMIT_NOFILE); target_hdls = min(max_hdls, 10240) if max_hdls > 0 else 10240; resource.setrlimit(resource.RLIMIT_NOFILE, (max(cur_hdls, target_hdls), max_hdls)); from ipyparallel.apps.ipengineapp import launch_new_instance; launch_new_instance()' --timeout=960 --IPEngineApp.wait_for_url_file=960 --EngineFactory.max_heartbeat_misses=120 --profile-dir="/scratch/johnm/bcb-test/test-gatk/work02/log/ipython" --cluster-id="b8ef233c-3ae7-4228-98c1-7db3f8b313e3"

Would be good to have the following line here as well:

PBS -j oe

As per the torque code.

Cheers.

roryk commented 4 years ago

Thanks @jmiezitis, are you not seeing the 5 28 core engine jobs submitted?

jmiezitis commented 4 years ago

@roryk no at the most I have seen only 2 engine qsub jobs running at the same time.

roryk commented 4 years ago

Thanks-- if you submit the engine job you posted to the scheduler multiple times in a row with qsub, does the scheduler schedule all of the jobs or does it reject the later ones somehow? We should be submitting that engine file 5 times there.

jmiezitis commented 4 years ago

@roryk we have just shut the cluster down for electrical work to proceed. Wont be able to test until Friday. I am pretty sure the answer will yes but it is a good question.

In class BcbioPBSPROEngineSetLauncher the start method takes a parameter 'n' which I think is the number of engines to start as it is used in the range function on the for loop. I haven't been able to find where the method is called to see what value is being passed in.

jmiezitis commented 4 years ago

@roryk sorry this has taken so long to get back to. Following on from your suggestion to run the qsub multiple times using the same engine script I can now report back and say I was able to get 5 engines running from 5 attempts:

$ qstat
Job id            Name             User              Time Use S Queue
----------------  ---------------- ----------------  -------- - -----
753.rosalind-pbs- bcbio-e          johnm             00:00:00 R workq           
754.rosalind-pbs- bcbio-e          johnm             00:00:00 R workq           
755.rosalind-pbs- bcbio-e          johnm             00:00:00 R workq           
756.rosalind-pbs- bcbio-e          johnm             00:00:00 R workq           
757.rosalind-pbs- bcbio-e          johnm             00:00:00 R workq

naumenko-sa commented 4 years ago

Thanks @jmiezitis ! Thanks everyone for the discussion on how to debug bcbio parallel runs!

jmiezitis commented 4 years ago

Hello @naumenko-sa,

This issue hasn't been resolved.

I can understand how, from my last post, you may think we have solved this. However my last post was a follow-up to a request from @roryk on Apr 1 asking "if you submit the engine job you posted to the scheduler multiple times in a row with qsub, does the scheduler schedule all of the jobs or does it reject the later ones somehow?".

This was to test the scheduler functionality and the result is the scheduler will happily run five bcbio jobs. The problem remains that either cluster_helper is not starting multiple jobs or is not being told to start multiple jobs.

I am unsure how to proceed from here. Cheers

naumenko-sa commented 3 years ago

Sorry, @jmiezitis I did not get it right.

Were you able to resolve it?

@roryk any additional thoughts?

jmiezitis commented 3 years ago

Hi @naumenko-sa ,

Thank you for reopening this ticket. We haven't resolved the issue I believe the researchers are running tasks manually at the moment which is time consuming. We were also able to allocate a new 128 core node to this group so they didn't need to run jobs across multiple nodes.

I have checked with our researchers and they would like the ability to run jobs on multiple nodes so getting this working would be good for them..

Thank you for your work. Cheers.

roryk commented 3 years ago

Thank you, sorry to drop the ball on this everyone. Could you tell me which version of ipython-cluster-helper is installed?

bcbio_conda list | grep ipython-cluster-helper should show.

If it is 0.6.3 I think I know what the problem is.

DrMcStrange commented 3 years ago

Hi @roryk , it's 0.6.4, so I guess that doesn't help...

jmiezitis commented 3 years ago

While ipython-cluster-helper 0.6.4 was installed it wasn't being used by bcbio which was still using 0.6.3. After rebuilding bcbio we now have it using ipython-cluster-helper 0.6.4 and we can now submit parallel jobs. Thank you everyone for your help.

roryk commented 3 years ago

Thank you for following up! I'm glad the fix worked. Let us know if we can do anything else to help out.

bcbio / bcbio-nextgen

Running bcbio on PBSPro cluster #3147

Probably a different issue

!/bin/sh

PBS -q workq

PBS -V

PBS -S /bin/sh

PBS -N bcbio-e

PBS -l select=1:ncpus=28:mem=114790mb

PBS -l walltime=239:00:00

PBS -j oe