McSplicer

McSplicer is a probabilistic model for estimating splice site usages, rather than modeling an individual outcome of a splicing process such as exon skipping. We assume that potential 5' and 3' splice sites are given. This information can be obtained from annotation databases or estimated from RNA-seq reads by running existing assemblers. The potential splice sites partition a gene into a sequence of segments. We introduce a sequence of hidden variables, each of which indicates whether a corresponding segment is part of a transcript. We model the splicing process by assuming that this sequence of hidden variables follows an inhomogeneous Markov chain, hence the name Markov chain Splicer. The parameters in the model are interpreted as splice site usages. We use EM algorithm to maximize the likelihood of these parameters. Using splice site estimates, one can describe the splicing processes, and estimate the probabilities of various local splicing events. For a full description of the method please check McSplicer paper: https://doi.org/10.1093/bioinformatics/btab050

McSplicer

Using this software

Installation

git clone https://github.com/canzarlab/McSplicer.git

You can execute the script:

cd McSplicer/
python3 ./python_scripts/McSplicer.py --help

Dependencies

Python 3.6 implementation of McSplicer requires the following standard package:

numpy>=1.15.4

Usage

Initially, you need as inputs:

The annotation in GTF format, which can be obtained from available reference annotation, e.g., Homo sapiens genome assembly GRCh37 (hg19), or estimated from RNA-seq reads using existing assemblers such as StringTie. For example the GTF files in ./examples/gtf/ were generated by running StringTie with a genome-guided mode.
Aligned, sorted and indexed RNA-seq reads in SAMTools BAM file. For example the BAM files in ./examples/bam/ were aligned using STAR, then sorted and indexed using SAMTools.

You can run McSplicer easily in 3 steps:

Run exonRefine to refine the set of exons into non-overlapping, contiguous subexonic regions, i.e., signatures. For example, in the figure shown above we have 4 distinct signatures numbered in the grey rectangles {1, 2, 3, 4}.

./bin/exonRefine  <annotation.gtf> --prefix OUTPUT_PREFIX

Run SigCount on the refined annotation generated in the previous step together with the input BAM file to generate signature counts, i.e., the number of reads mapped to each of the signatures or their junctions. For example, in the figure above we have have 16 reads mapped to the first signuture and 4 reads mapped to the junction 3-4.

 ./bin/sigcount <alignments.bam> <annotation_refined.gtf> <outfile-prefix>

Run McSplucer to get splice site usage estimates.

python3 ./python_scripts/McSplicer.py \
    --gtf REFINED_GTF \
    --count_file SIGNATURE_COUNT_FILE \
    --out_dir OUTPUT_DIRECTORY\
    --bootstraps NUM_OF_BOOSTRAPS\
    --read_len READ_LENGTH\
    --prefix OUT_FILE_PREFIX

Execute McSplicer script with --help option for a complete list of options.

Sample data and usage examples can be found in ./examples subfolder.

Output:

The output csv file contains the bootstrap step, splice site index, gene strand, chromosome, splice site genome position, and McSplicer splice site usage estimate.

If you choose to run McSplicer with --bootstraps n, step 0 in the output file corresponds to the estimates based on input count data, and the following n steps correspond to the estimates of bootstrap count data.

Splice site index column represents the index of 3' start or 5' end splice sites as they appear in a gene according to their chronological order, e.g., s₀, s₁,..., e₀, e₁,.. see the figure above for illustration.

Data availability:

The subfolder ./simulation_study contains the data and script needed to generate the synthetic RNA-seq datasets and reproduce the results reported in McSplicer paper. Spike-In RNA Variants are available at the NCBI database in dataset ID: SRR3497201. The RNA-seq reads of the 36 individuals with autism spectrum disorder are publicly available here.

McSplicer developers:

Israa Alqassem (alqassem.isra@gmail.com)
Yash Kumar Sonthalia (yks01247@gmail.com)

canzarlab / McSplicer

readme

McSplicer

Using this software

Installation

Dependencies

Usage

Output:

Data availability:

McSplicer developers: