necrolyte2 opened this issue 9 years ago
The only part of the pipeline that has any support for job scheduling is the iterative_blast_phylo step. The pipeline can split up the files that are to be blasted and then qsub the chunked blast jobs.
Since we don't use this feature ourselves it is essentially untested on our end, as we don't have the infrastructure in place, but if the pipeline is run with --use-sge it will activate the splitting of the blast jobs.
The Ray assembly stage is run via mpirun, which should automatically utilize any MPI infrastructure available.
What issues are you encountering so far?
Tyghe,
We had some issues with the blast step not being parsed out. Is --use-sge in the previous version or only in the version you recoded?
The --use-sge flag was added in version 4.1. In the old "riidpipeline" you had to specify sge_iterative_blast_phylo as the step instead of iterative_blast_phylo.
Updating this issue with emails I received from @akilianski that were dropped into a subfolder in Outlook, which I did not see until today.
@averagehat and I are going to work on a better way to handle how the pipeline runs on clusters so it better utilizes all of the resources that are allocated to it via qsub.
Related to #219
We will be combining the mostly redundant steps of iter_blast_phylo and sge_iter_blast_phylo into a single stage.
Currently the iterative blast stages split the input file into NUMINST chunks.
This is not a good way to handle it, because not all qsub environments will immediately execute your jobs (e.g. the DoD HPC). Those jobs could end up sitting in the queue for hours, which wastes time.
Instead, the stage should check for PBS or SGE variables that indicate it is running inside a job and push the chunks out to all nodes available in that job.
For example, in PBS, if you run a job with more than 1 node, the following environment variables are available:
- $PBS_NUM_NODES -- number of nodes in the job
- $PBS_NP -- total number of processors for the job
- $PBS_NUM_PPN -- number of processors on each host
- $PBS_NODEFILE -- path to a file listing each host in the job
The idea is to split the input blast fasta file into $PBS_NUM_PPN chunks, distribute those chunks to all of the machines in $PBS_NODEFILE, and have blast run in parallel on all hosts.
Here is a very simplistic way to run blast in parallel over machines that are all in a PBS job:
#!/bin/bash
#PBS -o jobout
#PBS -j oe

cd ${PBS_O_WORKDIR}

export PATH=$PATH:/media/VD_Research/People/tyghe.vallard/Projects/pathdiscov/pathdiscov/parallel-blast/parallel-20150622/bin
module load blast

# Pull the top 10,000 reads and convert them to fasta
head -n 40000 /media/VD_Research/NGSData/ReadsBySample/A12X1647xB21/A12X1647xB21_S1_L001_R1_001_2015_03_25.fastq | grep -A 1 '^@M' | grep -v '\-\-' | tr '@' '>' > input.fa

# Run in parallel over the nodes.
# $PBS_NODEFILE has one entry for every ppn on each host, so if
# nodes=2:ppn=3 is used there will be 6 entries in the file.
# This should run blast in parallel on as many ppn as we have, utilizing 100% of the CPUs.
# (/usr/bin/time wraps parallel so the timing covers the whole blast run.)
cat input.fa | /usr/bin/time parallel -u --sshloginfile ${PBS_NODEFILE} --pipe --block 100k --recstart '>' -P ${PBS_NUM_PPN} \
    "$(which blastn) -max_target_seqs 10 -db /media/VD_Research/databases/ncbi/blast/nt/nt -evalue 0.01 -outfmt \"6 qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore\" -query -" \
    > results.blast
This script essentially pulls 10,000 reads from one of our samples to do the test
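For reference, a job script like this would be submitted with a node/ppn request matching the timings below; the script filename here is just a placeholder:
qsub -l nodes=2:ppn=3 parallel_blast_test.sh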
Results:
- -l nodes=1:ppn=3: 38 minutes
- -l nodes=2:ppn=3: 13 minutes
- -l nodes=3:ppn=3: 8 minutes
I'll try the full dataset now to see if the numbers hold true. Looks pretty promising though.
More results from the entire file, which is about 100k reads:
- -l nodes=3:ppn=3: 1 hr 24 minutes
- -l nodes=2:ppn=3: 2 hr 21 minutes
- -l nodes=1:ppn=3: still running since 12:00, so at least 4 hr 30 minutes
This is a great start Tyghe! Has RIID signed off on the pipeline yet? We'd love to play around with it on Pathosphere once everyone signs off on it.
Andy
This new change with the updated parallel blast won't be ready until at least next Friday on our end. We may need you to test it before they do, though, since they don't use SGE or PBS.
Alright, so the -l nodes=1:ppn=3 run took 6 hours and 48 minutes, so it seems like this will help us out quite a bit.
I'm quite interested in seeing how fast we can move a sample through on the DoD HPC with a bunch of nodes once this is finished.
Me too, that will be very interesting. It should move through really quickly!
Andy
@akilianski can you confirm a few things about the job scheduling software you use?
I'm pretty sure you guys are using SGE. When you run a parallel job, can you confirm that these are the correct variables:
- $NSLOTS -- I think this is the number of CPUs allocated to each node in the job
- $PE_HOSTFILE -- this variable contains the location of a file with the information for all nodes that are part of the job
We are getting very close to this new release, which will include a much improved way to run the iterative_blast_phylo step. This step will now utilize https://github.com/VDBWRAIR/bio_pieces/blob/dev/bio_pieces/parallel_blast.py, but we need the information above to ensure that we are looking for the correct variables for SGE jobs (right now I can only confirm it works in a PBS environment).
Tyghe,
You are correct, we are using Grid Engine (which is what the open version of SGE became after Oracle bought Sun). The $NSLOTS variable appears to be the number of slots allocated to a job. The $PE_HOSTFILE appears to only be set if a Parallel Environment is configured (such as MPI). Most of our jobs are run without a parallel environment, so we typically don't have anything set in the $PE_HOSTFILE variable. As a test, I created a job that simply echoes the contents of the two variables. The $NSLOTS variable came back as 1 (for a default job) and the $PE_HOSTFILE came back empty.
Are these variables crucial to getting your software to run? We could set up a parallel environment, if so.
Alvin Liem Bioinformatics, ECBC BioDefense Branch OptiMetrics, Inc. a DCS company (410)436-1214 (410)417-5801
The way we are reworking the software is that it will auto-detect whether it is running inside a Grid Engine or PBS/Torque job or just running normally on a multi-processor system.
For PBS/SGE type jobs, we are trying to get it to detect whether there are multiple hosts assigned to the job and, if so, to run in parallel on each assigned host.
If you guys are not actually running these jobs with multiple nodes then it really doesn't matter, as the software will just spawn N instances of blast. If you do allocate a few nodes to each job then we will have to figure out which variables Grid Engine sets so we can utilize them.
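Just to illustrate the idea, a rough sketch of that detection (this is not the actual pathdiscov code; only the scheduler variable names are real, the echo messages are placeholders):
# Sketch of the auto-detection: check for the variables each scheduler sets inside a job
if [ -n "${PBS_NODEFILE:-}" ]; then
    # PBS/Torque job: the node file has one line per allocated cpu on each host
    echo "PBS/Torque job across $(sort -u "${PBS_NODEFILE}" | wc -l) host(s)"
elif [ -n "${PE_HOSTFILE:-}" ]; then
    # Grid Engine job with a parallel environment configured
    echo "Grid Engine job with ${NSLOTS:-1} slot(s)"
else
    # Not inside a scheduler job; fall back to the local machine's CPUs
    echo "Local run on $(nproc) CPU(s)"
fi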
For PBS/Torque, $PBS_NODEFILE contains X entries for each host in the job, where X is how many CPUs are allocated on that host.
For $NSLOTS, does that tell how many CPUs there are in a job or how many hosts?
That’s correct, our cluster spans many nodes, but we usually try to avoid running single jobs that span multiple nodes. In our experience, stuff like MPI can be problematic.
For large jobs like BLAST, we usually split the query into pieces and farm out the pieces as individual jobs. That way, it does not matter what machine gets what job. Then, at the end we put the results back together.
$NSLOTS represents how many slots (CPU cores) were given to the job.
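A minimal sketch of that split-and-farm-out pattern (the filenames, chunk size, qsub options, and blast database here are only illustrative):
# Split the query, submit each piece as its own job, then rejoin the results
split -d -l 2000 query.fasta query.part.   # 1000 fasta records per piece (2 lines each)
for part in query.part.*; do
    # each piece is an independent job; the scheduler decides where it runs
    echo "blastn -db nt -outfmt 6 -query $part -out $part.blast" | qsub -cwd
done
# once every job has finished:
cat query.part.*.blast > results.blast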
With PBS/Torque there is a scheduler, so if we cut the input file into pieces and create a new job for each piece, each of those jobs still needs to be scheduled. On in-house clusters this usually isn't a horrible thing since jobs run almost immediately; however, on large clusters such as the DoD HPC each job may end up sitting in the queue waiting to be run, which means the overall job will likely take longer.
We were hoping to use the model of scheduling the entire pipeline up front, telling the scheduler that you want, say, 2 nodes with 16 CPUs on each host, and then having the pipeline utilize all of the hosts and CPUs it was given.
The code we have now utilizes GNU Parallel in a way that lets the parallel command split the input file for us and then run on all hosts that we know about. Essentially it is cutting the input file up in the same manner, but it needs a list of nodes to ssh to and run blast on. I figured that Grid Engine worked the same as PBS/Torque in that you don't have to do anything too special to request additional nodes for a single job.
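For illustration, a rough sketch of how that node list could be built from the PBS node file and handed to GNU Parallel (the nodes.txt filename and the trimmed-down blastn options are placeholders; parallel's "N/host" sshlogin form sets how many jobs run on each host):
# Turn $PBS_NODEFILE into a "jobs-per-host" login list, e.g. "3/node01"
sort "${PBS_NODEFILE}" | uniq -c | awk '{print $1"/"$2}' > nodes.txt
# Let GNU Parallel split the fasta stream and ssh the chunks out to every host
cat input.fa | parallel --pipe --recstart '>' --block 100k \
    --sshloginfile nodes.txt \
    "blastn -db nt -outfmt 6 -query -" > results.blast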
Related to VDBWRAIR/bio_pieces#57