eturro / mmseq

Haplotype, isoform and gene level expression analysis using multi-mapping RNA-seq reads
GNU General Public License v2.0
67 stars 20 forks source link

MMSEQ: Transcript and gene level expression analysis using multi-mapping RNA-seq reads

Build Status

mmseq collage

What is MMSEQ?

The MMSEQ package contains a collection of statistical tools for analysing RNA-seq expression data. Expression levels are inferred for each transcript using the mmseq program by modelling mappings of reads or read pairs (fragments) to sets of transcripts. These transcripts can be based on reference, custom or haplotype-specific sequences. The latter allows haplotype-specific analysis, which is useful in studies of allelic imbalance. The posterior distributions of the expression parameters for groups of transcripts belonging to the same gene are aggregated to provide gene-level expression estimates. Other aggregations (e.g. of transcripts sharing the same UTRs) are also possible. Isoform usage (i.e., the proportion of a gene's expression due to each isoform) is also estimated. Uncertainty in expression levels is summarised as the standard deviation of the posterior distribution of each expression parameter. When the uncertainty is large in all samples, a collapsing algorithm can be used for grouping transcripts into inferential units with reduced levels of uncertainty.

The package also includes a model-selection algorithm for differential analysis (implemented in mmdiff) that takes into account the posterior uncertainty in the expression parameters and can be used to select amongst an arbitrary number of models. The algorithm is regression-based and thus it can accomodate complex experimental designs. The model selection algorithm can be applied at the level of transcripts or transcript aggregates such as genes and it can also be applied to detect differential isoform usage by modelling summaries of the posterior distributions of isoform usage proportions as the outcomes of the linear regression models.

Citing MMSEQ

If you use the MMSEQ package, please cite:

If you use the mmdiff or mmcollapse programs, please also cite:

Key features

Installation

Download the latest release of MMSEQ, unzip and add the bin directory to your PATH. E.g.:

wget -O mmseq-latest.zip https://github.com/eturro/mmseq/archive/latest.zip
unzip mmseq-latest.zip && cd mmseq-latest
export PATH=`pwd`/bin:$PATH

You might want to strip the suffix from the binaries. E.g., under Linux:

cd bin
for f in `ls *-linux`; do
  mv $f `basename $f -linux`
done

The current release is 1.0.10 (changelog). Visit the release archive to download older releases.

Estimating expression levels

Input files:

The example commands below assume that the FASTQ files are asample_1.fq and asample_2.fq (paired-end) and the FASTA file is Homo_sapiens.GRCh37.70.ref_transcripts.fa.

Step 1a: Index the reference transcript sequences with Bowtie 1 (not Bowtie 2); to use kallisto instead, see Step 1b

bowtie-build --offrate 3 Homo_sapiens.GRCh37.70.ref_transcripts.fa Homo_sapiens.GRCh37.70.ref_transcripts 

(It is advisable to use a lower-than-default value for --offrate (such as 2 or 3) as long as the resulting index fits in memory.)

Step 1b: Index the reference transcript sequences with kallisto (this is an alternative to Step 1a)

kallisto index -i Homo_sapiens.GRCh37.70.ref_transcripts.kind Homo_sapiens.GRCh37.70.ref_transcripts.fa

Step 1c (optional): Trim out adapter sequences if necessary

If the insert size distribution overlaps the read length, trim back the reads to exclude adapter sequences. Trim Galore! works well. E.g. for libraries prepared using standard Illumina adapters (AGATCGGAAGAGC), run:

trim_galore -q 15 --stringency 3 -e 0.05 --length 36 --trim1 --paired asample_1.fq.gz asample_2.fq.gz

Step 2a: Align reads with Bowtie 1 (not Bowtie 2); to use kallisto instead, see Step 2b

bowtie -a --best --strata -S -m 100 -X 500 --chunkmbs 256 -p 8 Homo_sapiens.GRCh37.70.ref_transcripts \
  -1 <(gzip -dc asample_1.fq.gz) -2 <(gzip -dc asample_2.fq.gz) | samtools view -F 0xC -bS - | \
  samtools sort -n - asample.namesorted

Step 2b: Align reads with kallisto (this is an alternative to Step 2a)

kallisto pseudo -i Homo_sapiens.GRCh37.70.ref_transcripts.kind --pseudobam -o kout asample_1.fq.gz asample_2.fq.gz | \
  | samtools view -F 0xC -bS - | samtools sort -n - asample.namesorted

Step 3: Map reads to transcript sets

bam2hits Homo_sapiens.GRCh37.70.ref_transcripts.fa asample.namesorted.bam > asample.hits

Step 4: Obtain expression estimates

mmseq asample.hits asample

Description of output:

These steps operate on a sample-by-sample basis and the expression estimates are roughly proportional to the RNA concentrations in each sample. Some scaling of the estimates may be required to make them comparable across biological replicates and conditions. This can be achieved by reading in the output for multiple samples using the readmmseq() function defined in the mmseq.R R script included in the src/R directory (more information on this function can be found here). The posterior standard deviations capture the uncertainty due to both Poisson counting noise and the additional ambiguity in the mappings between reads and transcripts. The biological variance across samples can only be discerned with the use of biological replicates (see section on differential expression below).

Differential expression analysis

Flexible model comparison using MMDIFF

The mmdiff binary performs model comparison using the posterior summaries (log_mu and sd or mean_probit_proportion and sd_probit_proportion) saved in the MMSEQ tables. The models, indexed by m, are given by

mmdiff model

where y is the response (log expression or probit proportion parameter (see below)), M is a model-independent nuisance covariate, P is a model-dependent covariate, "nu" is a random effect capturing the posterior uncertainty in the response and "epsilon" is the residual.

Two models can be specified in a matrices file using the -m option (see below). Alternatively, for standard differential expression analysis (comparison of a model specifying two or more conditions with a model specifying a single condition), the -de convenience option may be used instead of -m. E.g. for a simple 2 vs. 2 gene-level comparison, run:

mmdiff -de 2 2 cond1rep1.gene.mmseq cond1rep2.gene.mmseq cond2rep1.gene.mmseq cond2rep2.gene.mmseq > out.mmdiff

The prior probability that the second model is true may be specified with -p FLOAT (default: 0.1).

In order to use the probit-transformed isoform usage proportions for inference (in order to check for differential isoform usage), set the option -useprops.

The models are regression-based and more complex experimental designs can be specified in a file containing four matrices:

If a model has no classes (i.e. just an intercept term), define the P matrix to be simply 1. E.g. for transcript-level differential expression between two groups of two samples without extraneous variables, the following command and matrices file would be appropriate (note, this is equivalent to specifying -de 2 2):

mmdiff -m matrices_file cond1rep1.mmseq cond1rep2.mmseq cond2rep1.mmseq cond2rep1.mmseq > out.mmdiff

where the matrices file contains:

# M; no. of rows = no. of observations
0
0
0
0
# C; no. of rows = no. of observations and no. of columns = 2 (one for each model)
0 0
0 0
0 1
0 1
# P0(collapsed); no. of rows = no. of classes for model 0
1
# P1(collapsed); no. of rows = no. of classes for model 1
.5
-.5

The resulting design matrices and associated coefficients would be:

# model 0 ([1|P0]):
alpha0
1
1
1
1

# model 1 ([1|P1])
alpha1 eta1_0
1 .5
1 .5
1 -.5
1 -.5

Model-independent nuisance covariates can be specified in the M matrix. E.g. in a 3 vs 3 comparison with a single binary nuisance covariate, use:

# M; no. of rows = no. of observations
1
0
1
0
1
0
# C; no. of rows = no. of observations and no. of columns = 2 (one for each model)
0 0
0 0
0 0
0 1
0 1
0 1
# P0(collapsed); no. of rows = no. of classes for model 0
1
# P1(collapsed); no. of rows = no. of classes for model 1
.5
-.5

The resulting design matrices and associated coefficients would be:

# model 0 ([1|M|P0]):
alpha0 beta0
1 1
1 0
1 1
1 0
1 1
1 0

# model 1 ([1|M|P1])
alpha1 beta1 eta1_0
1 1 .5
1 0 .5
1 1 .5
1 0 -.5
1 1 -.5
1 0 -.5

For a three-way differential expression analysis with three, three and two observations per group respectively, this matrices file would be appropriate (equivalent to using -de 3 3 2).

In order to assess whether the log fold change between group A and group B is different (as opposed to equal) to the log fold change between group C and group D, assuming there are two observations per group, this matrices file would be appropriate.

By default, mmdiff includes a global intercept alpha. If you prefer to fix alpha=0 and instead use the M covariate matrix to define multiple independent intercepts (beta), then set the option -fixalpha. This can be used to obtain class-specific expression summaries without performing model comparison: specify a prior probability of either model to zero (use -p 0), remove the intercept term using -fixalpha and use the M matrix to group samples into groups using a matrices file such as the following, which groups samples into sets of two:

# M; no. of rows = no. of observations
1 0 0
1 0 0
0 1 0
0 1 0
0 0 1
0 0 1
# C; no. of rows = no. of observations and no. of columns = 2 (one for each model)
0 0
0 0
0 0
0 0
0 0
0 0
# P0(collapsed); no. of rows = no. of classes for model 0
1
# P1(collapsed); no. of rows = no. of classes for model 1
1

The resulting design matrix and associated coefficients would then be:

# model 0 ([M]):
beta0_0 beta0_1 beta0_2
1 0 0
1 0 0
0 1 0
0 1 0
0 0 1
0 0 1

Thus, the posterior summaries encoded in beta0_0, beta0_1, beta0_2 would represent the class-specific expression estimates.

For further advice on setting up the matrices file for a particular study design, do not hesitate to contact the corresponding author.

Description of the output:

  1. feature_id: the name of the feature (e.g. Ensembl transcript ID)
  2. bayes_factor: the Bayes factor in favour of the second model
  3. posterior_probability: the posterior probability in favour of the second model (the prior probability is recorded in a # comment at the top of the file)
  4. alpha0 and alpha1: estimated posterior mean of the global intercept for each model
  5. beta0_0, beta0_1..., beta1_0, beta1_1...: estimated posterior means of the regression coefficients of the model-independent covariate matrix M under each model
  6. eta0_0, eta0_1..., eta1_0, eta1_1...: estimated posterior means of the regression coefficients of the model-dependent matrix P under each model
  7. mu_sample1, mu_sample2,... sd_sample1, sd_sample2,...: the data, i.e. the posterior means and standard deviations used as the outcomes

To recompute the posterior probability based on a different prior probability without having to re-run mmdiff, you can use the following function in R, where bf is the Bayes factor and prior is the new desired prior probability:

recompute_pp <- function(bf, prior) {  1/(1 + exp( -( log(bf) + log(prior) - log(1-prior) ))) }

To perform polytomous model selection (i.e., compare multiple models), run mmdiff multiple times, comparing each alternative model to some baseline model. Then, in R, source the mmseq.R file in the src/R directory and feed the mmdiff output files to the polyclass() function. A vector of prior probabilities for each model can be specified in the prior argument (assumed flat by default). The function returns a data frame containing the posterior probability of each model for each feature (postprob_model0, postprob_model1, etc.).

For an alternative approach to differential expression analysis using edgeR or DESeq, view these instructions.

Collapsing transcripts

It is possible to collapse sets of transcripts based on posterior correlation estimates with the mmcollapse binary. The overall precision of the expression estimates for a collapsed set is usually much greater than for the individual component transcripts, which typically share important structural features (many shared exons, etc). The mmcollapse binary uses the output of mmseq to do the collapsing:

mmcollapse basename1 basename2...

Here, basename1 basename2... corresponds to the name of the mmseq output files without the suffixes (such as .mmseq and .gene.mmseq). Pass all the basenames from the same experiment to mmcollapse in one go to ensure the collapsings are consistent across samples. The command creates a set of new transcript-level MMSEQ files with the suffix .collapsed.mmseq, which can be used to perform more powerful transcript-level differential analysis.

Reference files

Ready to download:

Making your own Ensembl vertebrate reference FASTAs:

Download the cDNA and ncRNA FASTA files for the Ensembl version and species of interest from the Ensembl FTP server and combine them into a single file. For humans, remove alternative haplotype/supercontig entries using the fastagrep.sh script in the bin directory, which is an adapted version of unix grep that greps multi-line FASTA entries. E.g.:

wget ftp://ftp.ensembl.org/pub/release-72/fasta/homo_sapiens/cdna/Homo_sapiens.GRCh37.72.cdna.all.fa.gz
wget ftp://ftp.ensembl.org/pub/release-72/fasta/homo_sapiens/ncrna/Homo_sapiens.GRCh37.72.ncrna.fa.gz
gunzip Homo_sapiens.GRCh37.72.*.fa.gz
cat Homo_sapiens.GRCh37.72.ncrna.fa >> Homo_sapiens.GRCh37.72.cdna.all.fa
fastagrep.sh -v 'supercontig|GRCh37:[^1-9XMY]' Homo_sapiens.GRCh37.72.cdna.all.fa > Homo_sapiens.GRCh37.72.ref_transcripts.fa # for humans only
rm Homo_sapiens.GRCh37.72.cdna.all.fa Homo_sapiens.GRCh37.72.ncrna.fa # clean up

For mice, you may need to remove patches and unscaffolded contigs. E.g.:

wget ftp://ftp.ensembl.org/pub/release-75/fasta/mus_musculus/cdna/Mus_musculus.GRCm38.75.cdna.all.fa.gz
wget ftp://ftp.ensembl.org/pub/release-75/fasta/mus_musculus/ncrna/Mus_musculus.GRCm38.75.ncrna.fa.gz
gunzip Mus_musculus.GRCm38.75.*.fa.gz
cat Mus_musculus.GRCm38.75.ncrna.fa >> Mus_musculus.GRCm38.75.cdna.all.fa
fastagrep.sh -v 'scaffold|PATCH' Mus_musculus.GRCm38.75.cdna.all.fa > Mus_musculus.GRCm38.75.ref_transcripts.fa # for mice only
rm Mus_musculus.GRCm38.75.cdna.all.fa Mus_musculus.GRCm38.75.ncrna.fa # clean up

FASTAs with other header conventions:

MMSEQ works out of the box with Ensembl FASTA files, which follow the convention >TID ... gene:GID ..., where TID is the Ensembl transcript ID, GID is the Ensembl gene ID, and the ... may contain other information, such as chromosomal location. In order to use FASTA files that follow a different convention, tell bam2hits how to extract the transcript and gene IDs from the FASTA header strings using the -m tg_regexp t_ind g_ind option. tg_regexp specifies a regular expression that matches FASTA entries (excluding the leading >), where pairs of brackets are used to identify the location of the transcript and gene ID strings. t_ind and g_ind are 1-based indexes to indicate which pair of brackets capture the transcript and gene IDs respectively. For example, if your entries looked like this:

>transcript1234 gene567|chr1|1234

then run bowtie with --fullref and run bam2hits with -m "(\S+) ([^\|]+)\|.*" 1 2.

If you are unsure what regular expression to use, try out by trial and error using the testregexp.rb script in the bin directory. After running bam2hits with the -m option, run hitstools inspect hits_file and compare the header and the main lines to make sure the extraction of IDs worked as expected.

Making sample-specific transcript FASTAs through genotyping and phasing:

If you have phased genotypes and wish to create a sample-specific transcript FASTA file, you can use the haploref.rb script. Suppose there was a two-exon transcript called ENST123 on chromosome 1 with reference sequence ATCGAATGA spread across two exons at positions 1000-1003 and 1010-1014. Suppose a particular individual carried G and T on one haplotype and C and G on the other haplotype, at positions 1003 and 1011 of chromosome 1, respectively. This information should be encoded in four files, as follows:

The MMSEQ package comes with statically-linked binaries for 64-bit Mac OS X and GNU/Linux, which should work out of the box on most systems. However, due to a lack of OpenMP support on Apple's clang compiler (as of El Capitan), the Mac binaries are single-threaded. In order to build from source, install the following dependencies (more info):

Then clone the GitHub repository and run make from the src directory, which will place the binaries in the bin directory:

 git clone https://github.com/eturro/mmseq.git
 cd mmseq/src
 make

Miscellany

The remainder of the .hits file links each read with a set of transcripts. E.g.:

>read1
transcriptA
transcriptB
>read2
transcript C
...

The full file can be printed out with the command hitstools t hits_file (see below).

Software usage

See authors.

bam2hits

Usage: bam2hits [-v] [-t] [-m tg_regexp t_ind g_ind] [-i expected_isize sd_of_isizes] [-k] [-b] [-c alt_lengths] [-u alt_lengths] transcript_fasta sorted_bam > hits_file
Generate hits file for use with `mmseq`.

Mandatory arguments:
  transcript_fasta:    FASTA file used as reference to produce SAM file.
  sorted_bam:          BAM file sorted by read name containing alignments to reference transcripts.

Option -v:             print version and quit.

Option -t:             output hits file in plain text instead of compressed binary format.

Arguments to option -m for use with non-Ensembl transcript FASTA files:
  tg_regexp:           regular expression matching FASTA entry names, where pairs of brackets
                       are used to capture transcript and gene IDs. E.g.: "(\S+).*gene:(\S+).*"
  t_ind:               index of bracket pair that captures the transcript ID. E.g.: 1.
  g_ind:               index of bracket pair that captures the gene ID. E.g.: 2.

Arguments to option -i for calculating effective transcript lengths and filtering very unlikely alignments:
  expected_isize:      expected insert size (e.g. 180). Default: for paired-end reads, the mode obtained using the
                       first 2M single-mapping proper pairs, smoothing the histogram in windows of 3;
                       for single-end reads, 180.
  sd_of_isizes:        standard deviation of insert sizes (e.g. 25). Default: for paired-end reads, 1/1.96 times
                       the distance between the mode and the top 2.50% insert sizes in the first 2M single-mapping proper pairs;
                       for single-end reads, 25.

Option -k:             for paired-end reads, disable insert size deviation filter, which removes alignments with
                       insert sizes beyond 1.96 sd of the mean if alternative alignments exist
                       with insert sizes within 1.96 sd of the mean.

Option -b:             merge hits to _A and _B arbitrary-phase haplo-isoforms. This is usually done if the only
                       reason for creating haplo-isoform sequences is to reduce allelic mapping biases.

Argument to option -c for adjusting transcript lengths for non-uniformity with 'mseq' R package (required):
  alt_lengths:         name of file in which to save the adjusted lengths (can be reused in future with -u)

Argument to option -u for correcting for non-uniformity:
  alt_lengths:         name of file containing adjusted transcript lengths

hitstools

Utilities for handling hitsfile (.hits) and bitsfile (.bits, `binary hitsfile`) formats.

View a summary of a hits file:
  hitstools inspect in.hits
Output header:
 hitstools header in.hits
Convert from text format to binary format:
  hitstools b in_text.hits > out_binary.hits
Convert from binary format to text format:
  hitstools t in_binary.hits > out_text.hits

mmseq

Usage: mmseq [OPTIONS...] hits_file output_base

Mandatory arguments:
  hits_file          hits file generated with `bam2hits`
  output_base        base name for output files

Optional arguments:
  -alpha FLOAT       value of alpha in Gamma prior for mu (default: 0.1)
  -beta FLOAT        value of beta in Gamma prior for mu (default: 0.1)
  -max_em_iter INT   maximum number of EM iterations (default: 1000)
  -epsilon FLOAT     minimum loglik ratio between successive EM iterations (default: 0.1)
  -gibbs_iter INT    number of Gibbs iterations (default: 16384)
  -gibbs_ss INT      subsampling interval for Gibbs output (default: gibbs_iter/1024)
  -seed INT          seed for the PRNG in thread 0 (default: 1234)
  -debug             output additional diagnostic files
  -help              print this help message
  -version           print the version

mmdiff

Usage: mmdiff [OPTIONS...] [-de n1 n2 ... nC | -m matrices_file] mmseq_file1 mmseq_file2... > out.mmdiff
       matrices_file contains M P0 P1 each separated by an empty line

Mandatory arguments:
  ONE OF:
  -de INT INT...    simple differential expression between several groups of samples, where
                    each INT corresponds to a grouping of MMSEQ files into one condition
  -m STRING         path to matrices file specifying the two models to compare
Optional arguments:
  -tracedir STRING  directory in which to save MCMC traces (default:  (do not write traces))
  -useprops         run on isoform/gene proportions instead of expression
  -permute          run on permuted dataset; combine with non-permuted results to obtain q-values
  -p FLOAT          prior probability of the second model (default: 0.1)
  -d FLOAT          d hyperparameter (default: 1.4)
  -s FLOAT          s hyperparameter (default: 2.0)
  -l INT            length of MCMC trace used to produce the input estimates (default: 1024)
  -fixalpha         fix alpha=0 and do not update (default: estimate alpha)
  -nonorm           do not normalise the input data
  -pdash FLOAT      initial value of pdash for improved mixing of gamma (stays fixed if -notune is set) (default: 0.5)
  -notune           do not tune pdash to improve mixing for gamma
  -uhfrac FLOAT     if normalising, use features which have at least one unique hit
                    in at least uhfrac of the samples (default: max(0.2, (N - floor(N^2/160))/N))
  -burnin INT       burnin iterations (default: 8192)
  -iter INT         MCMC iterations (default: 16384)
  -seed INT         seed for PRNG (default: 1234)
  -range INT INT    select features indexed within range (default: all)

mmcollapse

Usage: mmcollapse basename1 basename2...

extract_transcripts

Usage: extract_transcripts fasta_file ensembl_gtf_file

offsetGTF

Usage: offsetGTF ensembl_gtf_file indel_vcf_file

t2g_hits

Usage: t2g_hits <(hitstools t hits_file) > gene_hits_file
Convert a transcript-level hits file in text format into a gene-level hits file in text format

Mandatory arguments:
  hits_file          hits file generated with `bam2hits`

fastagrep.sh

Usage: fastagrep.sh: [-v] pattern fasta_file

testregexp.rb

Usage: testregexp.rb -m tg_regexp t_ind g_ind cdna_file
Test regular expression patterns for use with `bam2hits` and `haploref.rb`.

Arguments to flag -m:
  tg_regexp: regular expression matching FASTA entry names, where pairs of brackets
             are used to capture transcript and gene IDs. Default: "(\S+).*gene:(\S+).*"
  t_ind:     index of bracket pair that captures the transcript ID. Default: 1.
  g_ind:     index of bracket pair that captures the gene ID. Default: 2.

  cdna_file: reference FASTA file.

filterGTF.rb

Usage: filterGTF.rb cdna_file gtf_file > new_gtf_file
Filter GTF file according to transcripts in cDNA file.

Mandatory arguments:
  cdna_file: Ensembl transcript FASTA file (e.g. Homo_sapiens.GRCh37.56.cdna.ref.fa).
  gtf_file: Ensembl GTF file (e.g. Homo_sapiens.GRCh37.56.gtf).

haploref.rb

Usage: haploref.rb [-m tg_regexp t_ind g_ind] cdna_file gff_file pos_file hap_file > hapiso_fasta
Generate haplotype and isoform specific FASTA reference.

Mandatory arguments:
  cdna_file: default transcript FASTA file.
  gff_file: GFF file containing structure annotation for transcripts in cdna_file.
  pos_file: file containing, for each transcript, chromosome and positions of SNPs.
  hap_file: file containing, for each transcript, two versions, e.g. suffixed _A and _B, on separate lines with the alleles for the two haplotypes at each position listed in the pos_file (A and B respectively).

Arguments to option -m for use with non-Ensembl cDNA FASTA files:
  tg_regexp: regular expression matching FASTA entry names, where pairs of brackets
             are used to capture transcript and gene IDs. Default: "(\S+).*gene:(\S+).*"
  t_ind:     index of bracket pair that captures the transcript ID. Default: 1.
  g_ind:     index of bracket pair that captures the gene ID. Default: 2.

ensembl_gtf_to_gff.pl

convert gtf file from Ensembl to gff3 file

Usage: ensembl_gtf_to_gff.pl gtf_file > gff_file