____ _________
_____/ __ \/ ____/ |
/ ___/ /_/ / / __/ /| |
/ / / ____/ /_/ / ___ |
/_/ /_/ \____/_/ |_|
****************************
* V 2.0.0 *
****************************
rPGA is a pipeline to discover hidden splicing variations by mapping personal transcriptomes to personal genomes. As of version 1.0.1, rPGA will also generate allele specific alignments using the "run alleles" option (see Usage). As of version 1.1.1, rPGA will N-mask known RNA-editing sites when making the personal genomes (see --rnaedit flag in Usage). As of version 1.3.5, rPGA can take unphased variant files as input and produce an N-masked allele specific analysis (see --nmask flag in Usage).
________________ ____________________
(Reference Genome) (Genotype Information)
|____________________________|
_________|_________
| Personal Genome |
|
|
____|____ _______________
| STAR | <------ (Sequenced Reads)
_______________|_______________
____|____ ____|____ ____|____
(Reference) ( Hap 1 ) ( Hap 2 )
(Alignment) (Alignment) (Alignment)
|_______________|_______________|
|
_________________|___________________
| - Discover Personal Splice junctions| ______________________
| - Calculate Usage Frequencies | <--- (Known Splice Junctions)
| - Compare to Known Splice Junctions |
|
_________________|_________________
____|____ ____|____ ____|____ ____|____
| Hap 1 | | Hap 2 | |Hap 1 & 2| |Reference|
|Specific | |Specific | |Specific | |Specific |
|Junctions| |Junctions| |Junctions| |Junctions|
rPGA is intended to be used in a Unix-based environment. It has been tested on Mac OS and Linux.
A working installation of Python is required.
rPGA makes use of the STAR software package for alignment. STAR must be installed. By default, rPGA expects them to be somewhere in your path, but you if they are not you can specify their locations when running the configure script (see below).
rPGA makes use of python packages pysam and pybedtools to process STAR output. pysam and pybedtools are their respective dependencies must be installed.
Installing rPGA
Usage - Discover hidden splice junctions
Usage - Allele specific alignment
Contacts and bug reports
All of the below installation instructions assume you have write access to the installation directory for rPGA; if you want to install into a central location for all users, you may need to prefix these with sudo.
To begin the installation, unpack the distribution and CD into the newly created directory.
tar -xf rPGA-2.0.0.tar.gz
cd rPGA-2.0.0
Or, clone from the github repository.
git clone https://github.com/Xinglab/rPGA.git
cd rPGA
Now type
./configure
This will configure the installation. rPGA is written partially in Python.
By default, it will use whichever python interpreter is invoked when you type
'Python' on the command line. If you wish to specify a different version of
Python, you can do this when running the configure
script. Additionally,
if any of the required external software packages listed above are not on your
path, you can specify their location when running configure
. For example,
to specify the path to python and all of the exeternal software packages,
you would type the following:
./configure --with-python=/some/path/to/python \
--with-STAR=/some/path/to/STAR
You needn't include them all, just the ones that cannot be found in your path. The configure script will tell you if any requirements cannot be found, in which case rPGA cannot be installed.
After configuration, rPGA is installed by typing:
make install
This will install Python modules for whichever version of Python you are using.
User executables and scripts will be placed into /path/to/rPGA/bin and /path/to/rPGA/scripts respectively. These cannot be moved, as the pipeline expects them to remain here. We suggest you modify your PATH environment variable to include these directories, but it is not required. All of the below instructions assume these directories are in your PATH variable; if not, replace the names of the scripts/executable with their full path.
This is the pipeline to discover splice junctions that are hidden when aligning an individual's transcript reads to the hg19 reference genome. There are three steps in the pipeline, 1. personalizing the genome, 2. aligning reads, and 3. discovering hidden splice junctions.
Inputs:
Download the reference sequences for the species from http://hgdownload.cse.ucsc.edu/downloads.html.
Concatenate all the files from the different chromosome into one single file. For example
$ cat \*.fa > ~/rPGAGenomes/hg19/hg19.fa
For any individual, rPGA needs to know where to find a vcf file with the genotype. vcf files with multiple samples, such as the 1000 Genomes vcf files, must be split it into separate files for each individual. This is easily done using bcftools, which can be downloaded from https://github.com/samtools/bcftools/wiki/HOWTOs.
To extract the individual genotype, if necessary:
$ bcftools view -s sample_name -v snps -p /path/to/ALL.chr.genotypes.vcf > \
sample.chr.genotype.vcf
Note rPGA requires a separate vcf file for each chromosome. They should be located in a directory and named 1.vcf, 2.vcf, 3.vcf,...
At this step rPGA is ready to make the personalized genome. To do this run this command:
$ rPGA personalize -r reference.fa -v genotype_directory -o output_directory
rPGA personalize options:
-o * output directory
-r * reference genome
-v * VCF directory
--gz flag denoting VCF files are gzipped
--rnaedit flag to N-mask rna editing sites
-e file containing RNA editing sites, can be downloaded from RADAR
(http://rnaedit.com/download)
* Required parameters
** Note if --rnaedit flag is used, RNA editing file must be provided using -e. rPGA will change each RNA editing site to an "N" in the personal genomes. The number and locations of RNA editing sites that overlap SNPs will be reported in report.personalize.txt.
Outputs:
The next step is to map the sequencing data to the personalized genome using STAR alignment tool.
Inputs:
Please note that we chose STAR for faster alignment, but that obviously comes at a cost, and that is the required memory. Make sure you have enough memory on your computer that is running the STAR (~ 32 GB, depending on the options of the mapper).
To align reads to personal genomes:
$ rPGA mapping -r reference.fa -s reads_1.fastq,[reads_2.fastq] -o outdir
rPGA mapping options:
-r * reference genome (Fasta)
-s * read sequences, either single or paired end
-o * output directory
-T number of threads STAR uses, default is 8
-M max number of multiple alignments, default is 20
-N max number of read mismatches, default is 3
--gz flag denoting sequence reads are gzipped
* Required parameters
Outputs:
The final step is to discover novel junctions using the discover function.
Inputs:
Usage:
$ rPGA discover -g annotation.gtf -v genotype_directory -o output
rPGA discover options:
-g * Annotation file (GTF)
-v * Genotype file or directory (VCF)
-o * Output directory
--rnaedit ** flag to consider rna editing sites
-e ** file containing RNA editing sites; can be downloaded from
RADAR (www.rnaedit.com/download/)
--gz flag denoting VCF genotype files are gzipped
-b1 *** Haplotype 1 alignment file to personal genome (BAM)
-b2 *** Haplotype 2 alignment file to personal genome (BAM)
-br *** Reference alignment file to reference genome (BAM)
* Required parameters
** Note: if --rnaedit flag is used, a file containing RNA editing events must be provided using -e. In this case, rPGA will disregard heterozygous SNPs that overlap RNA editing sites when assigning mapped reads to haplotypes.
*** Use these if you would like to supply your own personal genome mapping alignment files. If rPGA run mapping (previous step) is used there is no need to provide rPGA the alignment files, rPGA will use the alignments in HAP1/STARalign, HAP2/STARalign, REF/STARalign.
Outputs (per chromosome):
Haplotype specific bed files:
Columns of each bed file are:
The name of each splice junction is in the format J_R/NC/N3/N5/N35_SNPid.
SNPid is a comma deliminated list of the splice site SNP ids, which match the SNP ids in the given VCF file..
To discover hidden splice junctions, rPGA generates allele specific bam files. Follow this pipeline if you are only interested in generating the allele specific bam files. There are three steps: 1) Personalize reference genome, 2) Align RNA-seq reads to personal genomes, and 3) Assign haplotype specific reads. Note, if you wish to provide your own personal haplotype alignments, you may bypass steps (1) and (2) and use the -b1 and -b2 options in step 3 (See description of parameters below).
Inputs:
Download the reference sequences for the species from http://hgdownload.cse.ucsc.edu/downloads.html.
Concatenate all the files from the different chromosome into one single file. For example
$ cat \*.fa > ~/rPGAGenomes/hg19/hg19.fa
For any individual, rPGA needs to know where to find a vcf file with the genotype. vcf files with multiple samples, such as the 1000 Genomes vcf files, must be split it into separate files for each individual. This is easily done using bcftools, which can be downloaded from https://github.com/samtools/bcftools/wiki/HOWTOs.
To extract the individual genotype, if necessary:
$ bcftools view -s sample_name -v snps -p /path/to/ALL.chr.genotypes.vcf > \
sample.chr.genotype.vcf
At this step rPGA is ready to make the personalized genome. To do this run this command:
$ rPGA personalize -r reference.fa -v genotype_directory -o output_directory
rPGA personalize options:
-o * output directory
-r * reference genome
-v * VCF file or directory
--gz flag denoting VCF files are gzipped
--rnaedit ** flag to N-mask rna editing sites
-e ** file containing RNA editing sites, can be downloaded from RADAR
(http://rnaedit.com/download)
--nmask Flag to n-mask variant sites (use if using unphased data)
* Required parameters
** Note if --rnaedit flag is used, RNA editing file must be provided using -e. rPGA will change each RNA editing site to an "N" in the personal genomes. The number and locations of RNA editing sites that overlap SNPs will be reported in report.personalize.txt.
*** Use --nmask flag to nmask SNP positions in the reference genome. This should be used if you have unphased genotype data. This produces one N-masked personal genome instead of two personal genomes. Note, --nmask option must be used for all three rPGA run steps (personalize, mapping, and alleles)
Outputs:
Output (nmask flag):
The next step is to map the sequencing data to the personalized genome using STAR alignment tool.
Inputs:
Please note that we chose STAR for faster alignment, but that obviously comes at a cost, and that is the required memory. Make sure you have enough memory on your computer that is running the STAR (~ 32 GB, depending on the options of the mapper).
To align reads to personal genomes:
$ rPGA mapping alleles -r reference.fa -s reads_1.fastq,[reads_2.fastq] -o outdir
rPGA mapping options:
-r * reference genome (Fasta)
-s * read sequences, either single or paired end
-o * output directory
-T number of threads STAR uses, default is 8
-M max number of multiple alignments, default is 20
-N max number of read pair mismatches, default is 6
--gz flag denoting sequence reads are gzipped
--nmask ** flag to do N-mask RNA-seq alignment
* Required parameters
** Aligns RNA-seq reads to nmask.fa, producing just one alignment output. Note, --nmask option must be used for all three rPGA run steps (personalize, mapping, and alleles)
Outputs:
Outputs (nmask flag):
Finally, assign mapped reads to hap1 or hap2 using alleles function.
Inputs:
Outputs:
Outputs (if nmask flag is used):
Usage:
$ rPGA run alleles -c CHROM -v genotype_directory -o output_directory
rPGA alleles options:
-v * Genotype directory containing VCF files
-o * Output directory
--conflict flag to write bam file containing conflicting reads
--rnaedit ** flag to consider rna editing sites
-e ** file containing RNA editing sites; can be downloaded from
RADAR (www.rnaedit.com/download/)
--gz flag denoting VCF genotype files are gzipped
-b1 *** Haplotype 1 alignment file to personal genome (BAM)
-b2 *** Haplotype 2 alignment file to personal genome (BAM)
--nmask **** flag to do N-masking allele specific assignment
* Required parameter
** Note: if --rnaedit flag is used, a file containing RNA editing events must be provided using -e. In this case, rPGA will disregard heterozygous SNPs that overlap RNA editing sites when assigning mapped reads to haplotypes.
*** Use these if you would like to supply your own personal genome mapping alignment files. If rPGA run mapping (previous step) is used there is no need to provide rPGA the alignment files, rPGA will use the alignments in HAP1/STARalign and HAP2/STARalign.
**** If nmask flag is used, reads covering heterozygous SNPs are assigned to the reference or alternate allele. Note, --nmask option must be used for all three rPGA run steps (personalize, mapping, and alleles)
________________ ____________________
(Reference Genome) (Genotype Information)
|____________________________|
_________|_________
| Personal Genome |
|
|
____|____ _______________
| STAR | <------ (Sequenced Reads)
_______________|_______________
____|____ ____|____ ____|____
(Reference) ( Hap 1 ) ( Hap 2 )
(Alignment) (Alignment) (Alignment)
|_______________|
|
|
_____________|_____________
| - Collect reads covering |
| heterozygous SNPs |
| - Perform allele specific |
| read assignment |
|
_____________|____________
____|____ ____|____
( Hap 1 ) ( Hap 2 )
( Specific) ( Specific)
(Alignment) (Alignment)
To generate allele specific bam files:
Personalize reference genome according to a sample's genotype, producing two personalized genomes, hap1.fa and hap2.fa.
Use STAR to align the sample's sequenced reads to the personalized genomes.
Collect reads that cover each heterozygous SNP.
For each read:
Check the position in the read corresponding to the heterozygous SNP.
A read is hap1 specific if:
Likewise, a read is hap2 specific if
If a read covers multiple heterozygous SNPs, a majority vote is used. For example, if a read covers 3 heterozygous SNPs and 2 match the hap1 allele and 1 matches the hap2 allele AND the edit distance to hap1 < edit distance to hap2, the read is assigned to hap1.
If a read cannot be assigned to either hap1 or hap2 according to the above rules, it is considered "conflicting" and is not assigned to either haplotype. To output such conflicting reads, use the --conflict option when running "discover" or "alleles".
Write all hap1 and hap2 specific reads to hap1.bam and hap2.bam, respectively.
Note, rPGA adds up to 4 SAM tags:
If --nmask flag is used, the SAM flags are:
The consensus BAM file is a single BAM file that contains the best matching reads based on the haplotype assignment + non heterozygous reads alignments. It consists of reads obtained by the following procedure:
These comparisons are made for every read that is aligned. For example, if your initial fastq file contains 100 million reads, there would be 100 million reads in the consensus BAM file, minus the reads discarded in step 5.
Enjoy!
Yi Xing yxing@ucla.edu
Shayna Stein sstein93@ucla.edu
Emad Bahrami-Samani ebs@ucla.edu
If you found a bug or mistake in this project, we would like to know about it. Before you send us the bug report though, please check the following:
Stein S, Lu Z, Bahrami-Samani E, Park JW, Xing Y. Discover hidden splicing variations by mapping personal transcriptomes to personal genomes. Nucleic Acids Res. 2015 Dec 15;43(22):10612-22.
Copyright (C) 2015 University of California, Los Angeles (UCLA) Shayna R. Stein, Emad Bahrami-Samani, Yi Xing
Authors: Shayna R. Stein, Emad Bahrami-Samani, Yi Xing
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see http://www.gnu.org/licenses/.