canzarlab / McSplicer

GNU General Public License v3.0
12 stars 6 forks source link

McSplicer

McSplicer is a probabilistic model for estimating splice site usages, rather than modeling an individual outcome of a splicing process such as exon skipping. We assume that potential 5' and 3' splice sites are given. This information can be obtained from annotation databases or estimated from RNA-seq reads by running existing assemblers. The potential splice sites partition a gene into a sequence of segments. We introduce a sequence of hidden variables, each of which indicates whether a corresponding segment is part of a transcript. We model the splicing process by assuming that this sequence of hidden variables follows an inhomogeneous Markov chain, hence the name Markov chain Splicer. The parameters in the model are interpreted as splice site usages. We use EM algorithm to maximize the likelihood of these parameters. Using splice site estimates, one can describe the splicing processes, and estimate the probabilities of various local splicing events. For a full description of the method please check McSplicer paper: https://doi.org/10.1093/bioinformatics/btab050

       

McSplicer

Using this software

Installation

git clone https://github.com/canzarlab/McSplicer.git

You can execute the script:

cd McSplicer/
python3 ./python_scripts/McSplicer.py --help

Dependencies

Python 3.6 implementation of McSplicer requires the following standard package:

Usage

Initially, you need as inputs:

You can run McSplicer easily in 3 steps:

  1. Run exonRefine to refine the set of exons into non-overlapping, contiguous subexonic regions, i.e., signatures. For example, in the figure shown above we have 4 distinct signatures numbered in the grey rectangles {1, 2, 3, 4}.
./bin/exonRefine  <annotation.gtf> --prefix OUTPUT_PREFIX
  1. Run SigCount on the refined annotation generated in the previous step together with the input BAM file to generate signature counts, i.e., the number of reads mapped to each of the signatures or their junctions. For example, in the figure above we have have 16 reads mapped to the first signuture and 4 reads mapped to the junction 3-4.
 ./bin/sigcount <alignments.bam> <annotation_refined.gtf> <outfile-prefix>
  1. Run McSplucer to get splice site usage estimates.

    python3 ./python_scripts/McSplicer.py \
        --gtf REFINED_GTF \
        --count_file SIGNATURE_COUNT_FILE \
        --out_dir OUTPUT_DIRECTORY\
        --bootstraps NUM_OF_BOOSTRAPS\
        --read_len READ_LENGTH\
        --prefix OUT_FILE_PREFIX
    

    Execute McSplicer script with --help option for a complete list of options.

Sample data and usage examples can be found in ./examples subfolder.

Output:

The output csv file contains the bootstrap step, splice site index, gene strand, chromosome, splice site genome position, and McSplicer splice site usage estimate.

If you choose to run McSplicer with --bootstraps n, step 0 in the output file corresponds to the estimates based on input count data, and the following n steps correspond to the estimates of bootstrap count data.

Splice site index column represents the index of 3' start or 5' end splice sites as they appear in a gene according to their chronological order, e.g., s0, s1,..., e0, e1,.. see the figure above for illustration.

Data availability:

The subfolder ./simulation_study contains the data and script needed to generate the synthetic RNA-seq datasets and reproduce the results reported in McSplicer paper. Spike-In RNA Variants are available at the NCBI database in dataset ID: SRR3497201. The RNA-seq reads of the 36 individuals with autism spectrum disorder are publicly available here.

McSplicer developers:

© 2020 McSplicer