rMATS turbo v4.3.0

About

rMATS turbo is the C/Cython version of rMATS (refer to http://rnaseq-mats.sourceforge.net). The major differences between rMATS turbo and rMATS are speed and space usage: rMATS turbo is roughly 100 times faster, and its output files are roughly 1000 times smaller. These advantages make analysis and storage of large-scale datasets easy and convenient.

|  | Counting part | Statistical part |
| --- | --- | --- |
| Speed (C/Cython version vs Python version) | 20~100 times faster (one thread) | 300 times faster (6 threads) |
| Storage usage (C/Cython version vs Python version) | 1000 times smaller | - |

Dependencies

Tested on Ubuntu (20.04 LTS)

Build

If the required dependencies are already installed, then rMATS can be built with:

./build_rmats

And then run with:

python rmats.py {arguments}

The build_rmats script usage is:

./build_rmats [--conda] [--no-paired-model] [--no-darts-model]

--conda: create a conda environment for Python and R dependencies
--no-paired-model: do not install dependencies for the paired model
--no-darts-model: do not install dependencies for the darts model

With --conda, build_rmats installs a conda environment that satisfies the required Python dependencies and also the R dependencies needed to use the paired model (PAIRADISE) and the DARTS model. The Python dependencies are listed in python_conda_requirements.txt. The R dependencies are handled using paired_model_conda_requirements.txt, darts_model_conda_requirements.txt, and install_r_deps.R after cloning the PAIRADISE and DARTS git repos.

run_rmats is a wrapper to call rmats.py with the conda environment used by build_rmats. It also sources setup_environment.sh, which can be modified to handle other setup that might be needed before running rMATS (such as Environment Modules).

If rMATS was built with ./build_rmats --conda then it should be run with:

./run_rmats {arguments}

It takes about 30 minutes to install dependencies and build rMATS (as tested on an Ubuntu VM with 2 CPUs and 4 GB of memory).

Test

test_rmats creates a conda environment and uses run_rmats to run the automated tests in tests/.

Usage

Examples

Starting with FASTQ files

Suppose there are 2 sample groups with 2 sets of paired-read (R1, R2) FASTQ files per group (fastq.gz files can also be used).

Create txt files that will be used to pass this grouping of inputs to rMATS. The expected format is ":" to separate paired reads and "," to separate replicates.
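As an illustration of that format, the following minimal Python sketch builds the single line that each txt file would hold. The FASTQ paths and the helper name format_fastq_group are hypothetical and not part of rMATS.

```python
def format_fastq_group(replicates):
    """Join the R1/R2 files of each pair with ':' and the replicates with ','."""
    return ",".join(":".join(pair) for pair in replicates)


# Hypothetical paths: 2 replicates per group, each with an R1 and an R2 file.
group_1 = [
    ("/path/to/1_1.R1.fastq", "/path/to/1_1.R2.fastq"),
    ("/path/to/1_2.R1.fastq", "/path/to/1_2.R2.fastq"),
]
group_2 = [
    ("/path/to/2_1.R1.fastq", "/path/to/2_1.R2.fastq"),
    ("/path/to/2_2.R1.fastq", "/path/to/2_2.R2.fastq"),
]

s1_line = format_fastq_group(group_1)  # content for s1.txt
s2_line = format_fastq_group(group_2)  # content for s2.txt
print(s1_line)
print(s2_line)
```

Each printed line can then be saved as the content of s1.txt or s2.txt respectively.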

Details about the remaining arguments are discussed in All Arguments.

Run rMATS on this input with:

python rmats.py --s1 /path/to/s1.txt --s2 /path/to/s2.txt --gtf /path/to/the.gtf --bi /path/to/STAR_binary_index -t paired --readLength 50 --nthread 4 --od /path/to/output --tmp /path/to/tmp_output

rMATS will first process the FASTQ input into BAM files stored in the --tmp directory. Then the splicing analysis will be performed.

Starting with BAM files

Reads can be mapped independently of rMATS with any aligner and then the resulting BAM files can be used as input to rMATS. rMATS requires aligned reads to match --readLength unless --variable-read-length is given. rMATS also ignores alignments with soft or hard clipping unless --allow-clipping is given.
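To check in advance how many alignments these two rules would affect, the CIGAR strings of a BAM file can be inspected. The sketch below is a rough model of the rules as described above, not the rMATS implementation. It assumes the read length is the sum of the CIGAR operations that consume read bases (M, I, S, =, X in the SAM specification), and the function names are hypothetical.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
# CIGAR operations that consume bases of the read sequence (SAM specification).
QUERY_CONSUMING = set("MIS=X")


def has_clipping(cigar):
    """True if the CIGAR string contains soft (S) or hard (H) clipping."""
    return any(op in "SH" for _, op in CIGAR_OP.findall(cigar))


def read_length(cigar):
    """Length of the read sequence implied by the CIGAR string."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in QUERY_CONSUMING)


def would_be_used(cigar, expected_length,
                  variable_read_length=False, allow_clipping=False):
    """Approximate the length and clipping rules described in the text."""
    if has_clipping(cigar) and not allow_clipping:
        return False
    if read_length(cigar) != expected_length and not variable_read_length:
        return False
    return True


print(would_be_used("20M100N30M", 50))                  # spliced read of length 50
print(would_be_used("5S45M", 50))                       # soft clipped
print(would_be_used("48M", 50, variable_read_length=True))
```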

Suppose there are 2 sample groups with 2 BAM files per group.

Create txt files that will be used to pass this grouping of inputs to rMATS. The expected format is "," to separate replicates.
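A minimal sketch of producing those lines is shown here. The BAM paths and the helper format_bam_group are hypothetical. The check for commas is an added safeguard, since a comma inside a path would be indistinguishable from a replicate separator.

```python
def format_bam_group(bam_paths):
    """Join the BAM paths of one sample group with ','."""
    for path in bam_paths:
        if "," in path:
            raise ValueError(f"path contains a comma: {path}")
    return ",".join(bam_paths)


# Hypothetical paths: 2 BAM files per group.
b1_line = format_bam_group(["/path/to/1_1.bam", "/path/to/1_2.bam"])  # for b1.txt
b2_line = format_bam_group(["/path/to/2_1.bam", "/path/to/2_2.bam"])  # for b2.txt
print(b1_line)
print(b2_line)
```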

Details about the remaining arguments are discussed in All Arguments.

Run rMATS on this input with:

python rmats.py --b1 /path/to/b1.txt --b2 /path/to/b2.txt --gtf /path/to/the.gtf -t paired --readLength 50 --nthread 4 --od /path/to/output --tmp /path/to/tmp_output

Running prep and post separately

rMATS analysis has two steps, prep and post. In the prep step, the input files are processed and a summary is saved to .rmats files in the --tmp directory. The .rmats files track info from each BAM separately according to the full path of the BAM specified in the input .txt file. In the post step, .rmats files are read and the final output files are created.

The --task argument allows the prep step of rMATS to be run independently for different subsets of input BAM files. Then the post step can be run on the independently generated .rmats files. This allows the computation to be run at different times and/or on different machines.

Suppose we have 8 BAMs and two machines that each have 4 CPU threads. Each machine can run the prep step on 4 BAMs concurrently. Then the post step can be run on one of the machines.

Split the BAMs into two groups. The assignment of BAMs to prep steps does not restrict the choice of --b1 and --b2 for a later post step.
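One way to produce the two prep groups is to deal the BAMs out evenly across the machines. This is a sketch with hypothetical sample paths. Any split works, since the prep assignment does not constrain the later post step.

```python
def split_for_prep(bam_paths, n_machines):
    """Deal the BAM paths round-robin into n_machines groups."""
    return [bam_paths[i::n_machines] for i in range(n_machines)]


# Hypothetical paths for the 8 BAMs in the example.
all_bams = [f"/path/to/sample_{i}.bam" for i in range(1, 9)]
prep_groups = split_for_prep(all_bams, 2)

for number, group in enumerate(prep_groups, start=1):
    # Each line is the content for prep1.txt, prep2.txt, ...
    print(f"prep{number}.txt: " + ",".join(group))
```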

On machine 1 run the prep step with prep1.txt:

python rmats.py --b1 /path/to/prep1.txt --gtf /path/to/the.gtf -t paired --readLength 50 --nthread 4 --od /path/to/output --tmp /path/to/tmp_output_prep_1 --task prep

On machine 2 run the prep step with prep2.txt:

python rmats.py --b1 /path/to/prep2.txt --gtf /path/to/the.gtf -t paired --readLength 50 --nthread 4 --od /path/to/output --tmp /path/to/tmp_output_prep_2 --task prep

For the post step, split the BAMs into two groups (post1.txt and post2.txt). This split is for statistically comparing the two groups and does not need to reflect the split used in the prep steps.

Copy the .rmats files from the separate prep steps to a directory so that the post step can access all the prep data. The filenames have the format {datetime}_{id}.rmats and the filenames may conflict for prep steps run concurrently. The script cp_with_prefix.py is provided to disambiguate the .rmats filenames when copying to a shared directory:

python cp_with_prefix.py prep_1_ /path/to/tmp_output_post/ /path/to/tmp_output_prep_1/*.rmats
python cp_with_prefix.py prep_2_ /path/to/tmp_output_post/ /path/to/tmp_output_prep_2/*.rmats
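The idea behind copying with a prefix can be sketched as follows. This is not the source of cp_with_prefix.py, only an illustration of how a per-prep prefix avoids filename clashes. The function name, the scratch directory layout, and the .rmats file name are all made up for the demonstration.

```python
import shutil
import tempfile
from pathlib import Path


def copy_with_prefix(prefix, dest_dir, source_files):
    """Copy each file into dest_dir with prefix prepended to its file name."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    copied = []
    for source in map(Path, source_files):
        target = dest / (prefix + source.name)
        if target.exists():
            # Refuse to overwrite: a clash here means the prefix is not unique.
            raise FileExistsError(str(target))
        shutil.copy(source, target)
        copied.append(target)
    return copied


# Demonstration in a scratch directory: both prep steps produced the same name.
work = Path(tempfile.mkdtemp())
for prep in ("prep_1", "prep_2"):
    (work / prep).mkdir()
    (work / prep / "20240101_0.rmats").write_text("placeholder\n")

post_dir = work / "post"
for prep in ("prep_1", "prep_2"):
    copy_with_prefix(prep + "_", post_dir, sorted((work / prep).glob("*.rmats")))

post_names = sorted(p.name for p in post_dir.iterdir())
print(post_names)
```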

On machine 1 run the post step:

python rmats.py --b1 /path/to/post1.txt --b2 /path/to/post2.txt --gtf /path/to/the.gtf -t paired --readLength 50 --nthread 4 --od /path/to/output --tmp /path/to/tmp_output_post --task post

Using the paired stats model

The default statistical model considers the samples to be unpaired. The --paired-stats flag can be used if each entry in --b1 is matched with its pair in --b2. As an example, if there are three replicates where each replicate has paired "a" and "b" data, then b1.txt should list the three "a" files and b2.txt the three "b" files, in the same replicate order.
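One way to keep the two files aligned is to build both lines from a single list of pairs, so that position i in b1.txt always matches position i in b2.txt. The BAM paths in this sketch are hypothetical.

```python
# Hypothetical paths: three replicates, each with paired "a" and "b" data.
pairs = [
    ("/path/to/pair_1_a.bam", "/path/to/pair_1_b.bam"),
    ("/path/to/pair_2_a.bam", "/path/to/pair_2_b.bam"),
    ("/path/to/pair_3_a.bam", "/path/to/pair_3_b.bam"),
]

# Entry i of b1.txt and entry i of b2.txt come from the same pair.
b1_line = ",".join(a for a, _ in pairs)  # content for b1.txt
b2_line = ",".join(b for _, b in pairs)  # content for b2.txt
print(b1_line)
print(b2_line)
```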

The --paired-stats flag can then be given so that the paired statistical model is used instead of the default unpaired model. As the paired model runs, it updates a progress file under the --od directory. For example, /path/to/od/tmp/JC_SE/pairadise_status.txt is written while the paired model is producing the results for SE.MATS.JC.txt.

Running the statistical model separately

The rMATS statistical model requires an event definition file (fromGTF.[AS].txt) and a count file ({JC,JCEC}.raw.input.[AS].txt) as input. Usually those files are created by the post step, which also runs the statistical model to create the final output file ([AS].MATS.{JC,JCEC}.txt). There may be situations where the event definitions and counts are already available, and the statistical model can then be run on those existing files with:

python rmats.py --od /path/to/dir_with_existing_files --tmp /path/to/tmp_dir --task stat

One use case for --task stat is when there are more than two groups to compare. For example, if there are 3 sample groups, then it is possible to compare each sample group to the other two (1 to 2, 1 to 3, 2 to 3). This can be done by first processing all the samples together using the usual rMATS pipeline

After all of the BAMs have been processed in this way, the output directory will contain the necessary fromGTF.[AS].txt and {JC,JCEC}.raw.input.[AS].txt files. The fromGTF.[AS].txt files can be used "as is" for all comparisons involving the samples, but the information that is relevant to a specific comparison needs to be extracted from the {JC,JCEC}.raw.input.[AS].txt files. This can be done using rMATS_P/prepare_stat_inputs.py. If there are 3 replicates in each of the 3 groups and they were provided in the --b1 argument of the post step in ascending order (group_1_rep_1, group_1_rep_2, ..., group_3_rep_3) then the comparisons can be performed by
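The replicate positions needed for each pairwise comparison follow directly from that ordering. The sketch below works them out under the assumption that positions are counted from 0 in the order given to --b1. The indexing convention and the exact arguments of prepare_stat_inputs.py are not stated here and should be checked against the script's own help.

```python
from itertools import combinations


def group_indices(n_groups, reps_per_group):
    """Map each group number (from 1) to the positions of its replicates."""
    return {
        g + 1: list(range(g * reps_per_group, (g + 1) * reps_per_group))
        for g in range(n_groups)
    }


# 3 groups with 3 replicates each, listed in ascending order in --b1.
indices = group_indices(3, 3)
comparisons = [
    (a, b, indices[a], indices[b]) for a, b in combinations(sorted(indices), 2)
]

for a, b, first, second in comparisons:
    print(f"group {a} vs group {b}: {first} vs {second}")
```

Each comparison then needs its own output directory holding the extracted counts, on which --task stat is run as shown earlier.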

Tips

All Arguments

python rmats.py -h

usage: rmats.py [options]

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --gtf GTF             An annotation of genes and transcripts in GTF format
  --b1 B1               A text file containing a comma separated list of the
                        BAM files for sample_1. (Only if using BAM)
  --b2 B2               A text file containing a comma separated list of the
                        BAM files for sample_2. (Only if using BAM)
  --s1 S1               A text file containing a comma separated list of the
                        FASTQ files for sample_1. If using paired reads the
                        format is ":" to separate pairs and "," to separate
                        replicates. (Only if using fastq)
  --s2 S2               A text file containing a comma separated list of the
                        FASTQ files for sample_2. If using paired reads the
                        format is ":" to separate pairs and "," to separate
                        replicates. (Only if using fastq)
  --od OD               The directory for final output from the post step
  --tmp TMP             The directory for intermediate output such as ".rmats"
                        files from the prep step
  -t {paired,single}    Type of read used in the analysis: either "paired" for
                        paired-end data or "single" for single-end data.
                        Default: paired
  --libType {fr-unstranded,fr-firststrand,fr-secondstrand}
                        Library type. Use fr-firststrand or fr-secondstrand
                        for strand-specific data. Only relevant to the prep
                        step, not the post step. Default: fr-unstranded
  --readLength READLENGTH
                        The length of each read. Required parameter, with the
                        value set according to the RNA-seq read length
  --variable-read-length
                        Allow reads with lengths that differ from --readLength
                        to be processed. --readLength will still be used to
                        determine IncFormLen and SkipFormLen
  --anchorLength ANCHORLENGTH
                        The "anchor length" or "overhang length" used when
                        counting the number of reads spanning splice
                        junctions. A minimum number of "anchor length"
                        nucleotides must be mapped to each end of a given
                        splice junction. The minimum value is 1 and the
                        default value is set to 1 to make use of all possible
                        splice junction reads.
  --tophatAnchor TOPHATANCHOR
                        The "anchor length" or "overhang length" used in the
                        aligner. At least "anchor length" nucleotides must be
                        mapped to each end of a given splice junction. The
                        default is 1. (Only if using fastq)
  --bi BINDEX           The directory name of the STAR binary indices (name of
                        the directory that contains the suffix array file).
                        (Only if using fastq)
  --nthread NTHREAD     The number of threads. The optimal number of threads
                        should be equal to the number of CPU cores. Default: 1
  --tstat TSTAT         The number of threads for the statistical model. If
                        not set then the value of --nthread is used
  --cstat CSTAT         The cutoff splicing difference. The cutoff used in the
                        null hypothesis test for differential alternative
                        splicing. The default is 0.0001 for 0.01% difference.
                        Valid: 0 <= cutoff < 1. Does not apply to the paired
                        stats model
  --task {prep,post,both,inte,stat}
                        Specify which step(s) of rMATS-turbo to run. Default:
                        both. prep: preprocess BAM files and generate .rmats
                        files. post: load .rmats files into memory, detect and
                        count alternative splicing events, and calculate P
                        value (if not --statoff). both: prep + post. inte
                        (integrity): check that the BAM filenames recorded by
                        the prep task(s) match the BAM filenames for the
                        current command line. stat: run statistical test on
                        existing output files
  --statoff             Skip the statistical analysis
  --paired-stats        Use the paired stats model
  --darts-model         Use the DARTS statistical model
  --darts-cutoff DARTS_CUTOFF
                        The cutoff of delta-PSI in the DARTS model. The output
                        posterior probability is P(abs(delta_psi) > cutoff).
                        The default is 0.05
  --novelSS             Enable detection of novel splice sites (unannotated
                        splice sites). Default is no detection of novel splice
                        sites
  --mil MIL             Minimum Intron Length. Only impacts --novelSS
                        behavior. Default: 50
  --mel MEL             Maximum Exon Length. Only impacts --novelSS behavior.
                        Default: 500
  --allow-clipping      Allow alignments with soft or hard clipping to be used
  --fixed-event-set FIXED_EVENT_SET
                        A directory containing fromGTF.[AS].txt files to be
                        used instead of detecting a new set of events
  --individual-counts   Output individualCounts.[AS_Event].txt files and add
                        the individual count columns to [AS_Event].MATS.JC.txt

Output

In rMATS-turbo, each alternative splicing pattern has a corresponding set of output files. In the filename templates below, [AS_Event] is replaced by one of the five basic alternative splicing patterns: skipped exon (SE), alternative 5' splice sites (A5SS), alternative 3' splice sites (A3SS), mutually exclusive exons (MXE), or retained intron (RI). As shown in the diagram, the number of supporting reads can be counted by the junction reads only (JC) or by both the junction and exon reads (JCEC). The output files from different counting methods are also indicated in the file name.

[Diagram: rmats-turbo]
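As a quick reference, the final result file names implied by the [AS_Event].MATS.{JC,JCEC}.txt template can be enumerated as below. This sketch covers only that one template and is not a complete listing of the files rMATS writes.

```python
# The five basic alternative splicing patterns and the two counting methods.
EVENT_TYPES = {
    "SE": "skipped exon",
    "A5SS": "alternative 5' splice sites",
    "A3SS": "alternative 3' splice sites",
    "MXE": "mutually exclusive exons",
    "RI": "retained intron",
}
COUNT_METHODS = ("JC", "JCEC")


def final_output_names():
    """File names matching the [AS_Event].MATS.{JC,JCEC}.txt template."""
    return [
        f"{event}.MATS.{method}.txt"
        for event in EVENT_TYPES
        for method in COUNT_METHODS
    ]


for name in final_output_names():
    print(name)
```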

--od contains the final output files from the post step:

--tmp contains the intermediate files generated by the prep step: