#####################################
#####################################
-- DEPENDENCIES
STANOVAR version 0.1
BEDtools version 2.17.0
BreakDancer version 1.1.2
BreakSeq Lite version 1.0
BWA version 0.7.4
CNVnator version 0.2.7
GATK version 3.2.2
JDK version 1.7.0_03
Modules Release 3.2.8
Perl
Picard Tools version 1.32
Pindel version 0.2.4t
Python version 2.7
Simple Job Manager version 1.0
Tabix version 0.2.6
vcftools version 0.1.12
zlib version 1.2.7
root version 5.34.30
r version 3.2.0
-- INSTALLATION
HugeSeq is a modular, computational pipeline that runs in a Unix environment in a highly parallel fashion. It was tested on Red Hat Enterprise Linux (RHEL) server v5.6 but it should work in most Linux servers. The batch system it currently supports out-of-the-box is Sun Grid Engine.
Batch System
Many of the clusters are already installed with Sun Grid Engine (SGE). For installing SGE, please refer to the vendor's manual.
Running the analysis pipeline requires submitting many interdependent jobs to the batch scheduling system (e.g. Sun Grid Engine). Therefore, we developed a software program called SJM (Simple Job Manager) to simplify this process, including properly specifying the dependencies, tracking progress of the group of jobs, and responding properly if a job fails.
For batch systems other than SGE, it requires developing an adaptor in SJM. Please write to us for more details.
Modules Environment
To manage different versions of softwares and parameters in the modules, HugeSeq uses a Unix software package called Environment Modules, which provides for the dynamic modification of a user's environment via modulefiles.
To initiate Modules, modify your login profile such as .bash_profile to add the following:
. /path-to-Modules/default/init/sh
Supporting Tools
Install the required softwares, such as the aligners, variant callers and manipulation tools, defined in the software requirements section. For details, please refer to the individual software websites. The softwares are recommended to be installed separately under a single parent directory, such as ~/apps/BreakSeq and ~/apps/CNVnator.
Data Sets
HugeSeq depends on several public data sets for alignment, variant calling, and annotations. They are: The reference genome (e.g. HG19 in FASTA format: hg19.fa) The BWA index of the reference genome (e.g. hg19.fa.bwt, hg19.fa.ann, etc) A .dict dictionary of the contig names and sizes (e.g. hg19.fa.dict) A .fai fasta index file (e.g. hg19.fa.fai) For creating .dict and .fai, please see here. All the indexes and dictionary should reside in the reference genome directory which contains the whole genome FASTA (e.g. hg19.fa) The breakpoint junctions (i.e. BreakSeq library in FASTA format: bplib.fa) The SNP annotation UCSC Known Genes (knownGene) dbSNP SIFT (avsift) RepeatMasker (buildver_rmsk.gff) The STANOVAR application should be installed and corresponding module need to be defined.
Download HugeSeq to your server.
Extract the programs from the compressed archive to a directory, such as ~/app. A directory like ~/app/HugeSeq will then be created, which contains the core program and its configuration. As described above, HugeSeq uses the Environment Modules package for configuration. Its modulefile is in the directory /path-to-HugeSeq/modulefiles/hugeseq named with its version, such as 1.0. To enable Modules to look up the modulefile for correct setting, modify the login profile as above and add the following:
export MODULEPATH=/path-to-HugeSeq/modulefiles:$MODULEPATH
In addition, modify the module file, such as /path-to-HugeSeq/modulefiles/hugeseq/2.0, and change all the programs' paths to the locations where you installed the required programs and the data paths to where you stored the datasets.
Logout and login again to your shell to activate the login profile with the latest configuration. You should now be able to run HugeSeq by loading its module:
module load hugeseq/2.0
After loading the module, you can run HugeSeq simply by typing:
hugeseq
For the usage of HugeSeq, please refer to the Usage section.
-- USAGE
usage: hugeseq [-h] --reads1 FILE [FILE ...] [--reads2 FILE [FILE ...]] --output DIR [--account STR] [--tmp DIR] [--readgroup STR] [--samplename STR] [--bam] [--variants TYPE [TYPE ...]] [--targeted] [--capture FILE [FILE ...]] [--relax_realignment] [--reference_calls] [--snp_hapcaller] [--indel_hapcaller] [--nosnpvqsr] [--noindelvqsr] [--vqsrchrom] [--nobinning] [--nocleanup] [--novariant] [--alignmentonly] [--cleanuponly] [--variantonly] [--donealign] [--donebinning] [--donecleanup] [--donegenotyping] [--donesnpvqsr] [--memory SIZE] [--queue NAME] [--email NAME] [--threads COUNT] [--jobfile FILE] [--submit]
Generating the job file for the HugeSeq variant detection pipeline
optional arguments: -h, --help show this help message and exit --reads1 FILE [FILE ...] The FASTQ file(s) for reads 1 --reads2 FILE [FILE ...] The FASTQ file(s) for reads 2, if paired-end --output DIR The output directory --account STR Accounting string for the purpose of cluster accounting. --tmp DIR The TMP directory for storing intermediate files (default=output directory --readgroup STR The read group annotation (Default: @RG\tID:Default\tLB:Library\tPL:Illumina\tSM:SAMPLE) --samplename STR The SM tag in the read group annotation (Default: "SAMPLE" in @RG\tID:Default\tLB:Library\tPL:Illumina\tSM:SAMPLE) --bam Support for aligned BAMs as input. By default input (-r) is aligned again. Use --variantonly otherwise. --variants TYPE [TYPE ...] gatk breakdancer cnvnator pindel breakseq (default to all) --targeted Use GATK in targeted sequencing mode (default: whole- genome mode) --capture FILE [FILE ...] Capture BED file(s) used for targeted genotyping (default: void, separate multipe files with commas: capture1.bed,capture2.bed,...) --relax_realignment Relaxes GATKs realignment when dealing with badly scored reads (default: false) --reference_calls Store all reference calls from GATK (default: false) in gVCF format in addition to a standard VCF file containing only the variants (valid only for SNV calling) --snp_hapcaller Use GATK HaplotypeCaller to discover SNPs (default: UnifiredGenotyper) --indel_hapcaller Use GATK HaplotypeCaller to discover Indels (default: UnifiredGenotyper) --nosnpvqsr Do not perform VQSR SNPs (variant quality score recalibration) --noindelvqsr Do not perform VQSR on Indels (variant quality score recalibration) --vqsrchrom Perform VQSR on individual chromosomes (valid when binning performed; default: VQSR on whole genome VCF) --nobinning Do not bin the alignments by chromosomes --nocleanup Do not clean up the alignments --novariant Do not call variants --alignmentonly Only align input FASTQ or BAM files (-r) --cleanuponly Only clean up input BAM files (-r) --variantonly Only call variants in input BAM files (-r) --donealign Sequences already aligned using the pipeline --donebinning Alignments already binned by chromosomes using the pipeline --donecleanup Alignments already cleaned using the pipeline --donegenotyping Variants already called using the pipeline but VQSR is not --donesnpvqsr Processing is started after SNP VQSR (from Indel VQSR) --memory SIZE Memory size (GB) per job (default: 12) --queue NAME Queue for jobs (default: extended) --email NAME Email address to receive emails for ending or aborting last jobs in the queque --threads COUNT Number of threads for alignment, only works for SGE (default: 4) --jobfile FILE The jobfile name (default: stdout) --submit Submit the jobs