dcjones / isolator

Rapid and robust analysis of RNA-Seq experiments.
MIT License
32 stars 7 forks source link

Isolator

Isolator analyzes RNA-Seq experiments.

There are many methods to analyze RNA-Seq data. Isolator differs in few important ways.

Note: I've run this tool on a lot of data, evaluated it with multiple benchmarks, and am writing a paper about it, but it's still alpha software. Contact me if you'd like help using it to analyze your data.

Installing

Currently, Isolator must be installed by cloning the git repository and building the source code. It has been tested on OS X and Linux.

Installing Dependencies

To build Isolator from source, you will need to first install Boost, HDF5, and cmake.

Linux

On Debian, Ubuntu, or similar.

apt-get install libboost-dev libboost-thread-dev libboost-system-dev libboost-timer-dev libhdf5-dev cmake

OSX

The easiest way to install dependencies is with Homebrew.

brew install boost hdf5 cmake

Building Isolator

From git

git clone https://github.com/dcjones/isolator.git
cd isolator/build
cmake ..
make
make install

This will install the isolator program to the default destination (usually /usr/local/bin).

Analyzing an RNA-Seq Experiment

Isolator works in two steps: analysis and summarization. The analysis step will run the sampler for a number of iterations collecting estimates of the model parameters, which it will output to an HDF5 file. This output can then mu summarized and turned into regular tab-delimined tables in a variety of ways.

What you'll need

Input to isolator is a GTF giving transcript/gene annotations along with one or more BAM/SAM files giving the aligned reads for each sample or replicate in the experiment.

Optionally, the genome sequence can also be provided in FASTA format, allowing isolator to correct for various forms of sequence bias, typically resulting in more accurate quantification.

Analyzing the Experiment

Experiments consist of one or more BAM (or SAM) files grouped into conditions. Isolator puts no restriction on the number of conditions or the number of samples within conditions, though if the goal is to compare two or more biological conditions, using multiple replicates is strongy recommended.

The analyze command is invoked like:

isolator analyze gene_annotations.gtf \
    a1.bam,a2.bam,a3.bam b1.bam,b2.bam,b3.bam

BAM files within the same condition are grouped with commas, and conditions are separated with spaced. Analyze is a general purpose command. It can also be invoked on a single bam file.

isolator analyze gene_annotations.gtf a.bam

The analyze command will run for a while and output a file named (by default) isolator-output.h5. It is an HDF5 file, which is a standardized data format than can be accessed using wide variety of tools. However, data stored in this file is a raw form, and typically far more information than is needed. To quickly generate to-the-point results, there is the isolator summarize command.

Summarizing the Analysis

The isolator analyze estimates parameters for a full probabilistic model of the experiment. It is not a simple statistical test that attempts to answer a single question (e.g. whether a gene is differentialy expressed). As a result, the results generated from isolator analyze can be used to examine the data in many ways.

This is where the isolator summarize comes in. It provides a number of sub-commands to generate simple tables of results from the sampler output. What follows are several examples. To see a list of the summarize sub-commands, run isolator summarize --list.

Transcript and Gene Expression

The simplest sort of summarization one might perform is to generate estimates of transcript expression.

isolator summarize transcript-expression isolator-output.h5

This will generate a tab-delimited file (named transcript-expression.tsv) that looks like:

gene_name       gene_id transcript_id           sample1_tpm    sample2_tpm    sample3_tpm    sample4_tpm    sample5_tpm    sample6_tpm
mt-Tf   ENSMUSG00000064336      ENSMUST00000082387      3.064260e+01    3.542358e+01    3.068786e+01    3.283976e+01    3.785287e+01    3.647695e+01
mt-Rnr1 ENSMUSG00000064337      ENSMUST00000082388      1.812662e+03    2.566914e+03    2.470344e+03    2.728776e+03    2.445997e+03    2.241069e+03
mt-Tv   ENSMUSG00000064338      ENSMUST00000082389      3.002667e+01    3.085850e+01    3.081490e+01    3.373214e+01    3.507645e+01    3.934946e+01
mt-Rnr2 ENSMUSG00000064339      ENSMUST00000082390      1.337085e+03    1.855240e+03    2.015555e+03    1.996198e+03    1.914501e+03    1.927667e+03

Each column gives a point estimate of the transcripts per million (TPM).

Credible intervals (i.e. Bayesian confidence interval) for each point estimate can also be generated by passing the --credible argument. The following command will generate a table that include lower and upper bounds of 95% credible interval for each point estimate.

isolator summarize transcript-expression --credible=0.95 isolator-output.h5

Expression can also be summarized on the gene level, according to the gene_id field in the input GTF file.

isolator summarize gene-expression isolator-output.h5

Differential Expression, Transcription, and Splicing

Since Isolator follows a Bayesian methodology, it does not assign p-values to differential expression, but rather estimates a probability that the condition-wise tendendency has changed by more than some minimum fold change. The researcher may choose the minimum fold change (it is set to a log2 fold change of 1.5 by default).

The command to generate a table of differential expression results:

isolator summarize differential-transcript-expression isolator-output.h5
# OR
isolator summarize differential-gene-expression isolator-output.h5

The minmum effect size can be set with the --effect-size argument. E.g. passing --effect-size 2 will cause the summarize command to compute the probability of a 2-fold or greater change in expression between conditions.

Expression is a general term for the relative number of copies of an mRNA. A change in expression can be driven by a change in transcription or in post-transcriptional processing (e.g. splicing). Transcriptional and post-transcriptional effects are explicitly modeled in Isolator and splicing can be separately tested.

to summarize condition pairwise changes in splicing:

isolator summarize differential-splicing isolator-output.h5

There are also methods built in to test for splicing changes at the feature (i.e. cassette exons and retained introns, currently) level, which is often more robust due to varying annotation quality.

isolator summarize differential-feature-splicing isolator-output.h5

These commands similarly accept the -a, -b, and --effect-size arguments to set the conditions being tested and the minimum effect size, respectively.

Note: for simplicitly the term "splicing" is in used here, when in fact it will capture any change in expression dynamics occouring after transcription initiation, including changes in splicing and transcription termination.

Using an experiment specification file

In the previous examples we invoked isolator analyze by passing some number of BAM files as arguments. This can be a pain with complex experiments involving many conditions. Furthermore, to keept track of all these conditions, we may wish to provide names. The analyze command can also be invoked with a YAML file describing the experiment and the location of the BAM files.

This file is structured like the following.

condition1-name:
  condition1-replicate1-name: cond1-repl1-reads.bam
  condition1-replicate2-name: cond1-repl2-reads.bam

condition2-name:
  condition2-replicate1-name: cond2-repl1-reads.bam
  condition2-replicate2-name: cond2-repl2-reads.bam

If we same this to experiment.yml, we can then invoke the analyze command more conveniently:

isolator analyze genes.gtf experiment.yml

For a more concrete example, here's an example of a specification of an experiment involving six conditions, each with either three or two replicates.

h7-1yr:
    1-h7-1yr: 1-h7-1yr.bam
    2-h7-1yr: 2-h7-1yr.bam
    3-h7-1yr: 3-h7-1yr.bam

h7-day20:
    4-h7-day20: 4-h7-day20.bam
    5-h7-day20: 5-h7-day20.bam
    6-h7-day20: 6-h7-day20.bam

empty-vector:
    8-empty-vector: 8-empty-vector.bam
    9-empty-vector: 9-empty-vector.bam

let-7g-oe:
    10-let-7g-oe: 10-let-7g-oe.bam
    11-let-7g-oe: 11-let-7g-oe.bam
    12-let-7g-oe: 12-let-7g-oe.bam

fetal-ventricle:
    13-fetal-ventricle: 13-fetal-ventricle.bam
    14-fetal-ventricle: 14-fetal-ventricle.bam

fetal-atrium:
    15-fetal-atrium: 15-fetal-atrium.bam
    16-fetal-atrium: 16-fetal-atrium.bam