McSplicer is a probabilistic model for estimating splice site usages, rather than modeling an individual outcome of a splicing process such as exon skipping. We assume that potential 5' and 3' splice sites are given. This information can be obtained from annotation databases or estimated from RNA-seq reads by running existing assemblers. The potential splice sites partition a gene into a sequence of segments. We introduce a sequence of hidden variables, each of which indicates whether the corresponding segment is part of a transcript. We model the splicing process by assuming that this sequence of hidden variables follows an inhomogeneous Markov chain, hence the name Markov chain Splicer. The parameters of the model are interpreted as splice site usages. We use the EM algorithm to estimate these parameters by maximizing the likelihood. Using the splice site usage estimates, one can describe the splicing process and estimate the probabilities of various local splicing events. For a full description of the method, please see the McSplicer paper: https://doi.org/10.1093/bioinformatics/btab050
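To make the segment and Markov chain idea above more concrete, here is a minimal toy sketch. It is not the McSplicer implementation: the function names, the "s"/"e" boundary labels, and the simplified transition rule (a start site can only switch a transcript on, an end site can only switch it off) are assumptions made for illustration. See the paper for the actual model.

```python
from typing import List, Sequence, Tuple


def segments_from_sites(gene_start: int, gene_end: int,
                        sites: Sequence[int]) -> List[Tuple[int, int]]:
    """Split [gene_start, gene_end) at each potential splice site position."""
    cuts = sorted({s for s in sites if gene_start < s < gene_end})
    edges = [gene_start, *cuts, gene_end]
    return list(zip(edges[:-1], edges[1:]))


def path_probability(boundaries: Sequence[Tuple[str, float]],
                     path: Sequence[int]) -> float:
    """Probability of path[1:] given path[0] under the toy chain.

    boundaries[i] = (kind, usage) is the splice site between segment i and
    segment i + 1; kind is "s" (start site) or "e" (end site).
    path[i] is 1 if segment i is part of the transcript, else 0.
    The chain is inhomogeneous because every boundary has its own usage.
    """
    if len(path) != len(boundaries) + 1:
        raise ValueError("need exactly one more segment than boundaries")
    prob = 1.0
    for (kind, usage), prev, cur in zip(boundaries, path[:-1], path[1:]):
        if kind == "s":
            # Toy rule: a start site switches off -> on with prob. `usage`;
            # a segment that is already on stays on.
            if prev == 0:
                step = usage if cur == 1 else 1.0 - usage
            else:
                step = 1.0 if cur == 1 else 0.0
        elif kind == "e":
            # Toy rule: an end site switches on -> off with prob. `usage`;
            # a segment that is already off stays off.
            if prev == 1:
                step = usage if cur == 0 else 1.0 - usage
            else:
                step = 1.0 if cur == 0 else 0.0
        else:
            raise ValueError(f"unknown boundary kind: {kind!r}")
        prob *= step
    return prob


if __name__ == "__main__":
    # Made-up coordinates and usages, for illustration only.
    print(segments_from_sites(100, 500, [200, 300, 400]))
    toy_boundaries = [("s", 0.8), ("e", 0.5), ("s", 0.4)]
    print(path_probability(toy_boundaries, (0, 1, 0, 1)))
```

In this toy setting the usages play the role of the model parameters: McSplicer estimates them from read data with EM, whereas the sketch simply takes them as given.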
git clone https://github.com/canzarlab/McSplicer.git
You can execute the script:
cd McSplicer/
python3 ./python_scripts/McSplicer.py --help
Python 3.6 implementation of McSplicer requires the following standard package:
Initially, you need as inputs:
The annotation in GTF format, which can be obtained from an available reference annotation, e.g., Homo sapiens genome assembly GRCh37 (hg19), or estimated from RNA-seq reads using existing assemblers such as StringTie. For example, the GTF files in ./examples/gtf/ were generated by running StringTie in genome-guided mode.
Aligned, sorted, and indexed RNA-seq reads in a SAMTools BAM file. For example, the BAM files in ./examples/bam/ were aligned using STAR, then sorted and indexed using SAMTools.
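Before starting a run, it can help to confirm that both inputs and the BAM index are actually on disk. The helper below is a hypothetical convenience and not part of McSplicer. It assumes the index sits next to the BAM file under one of the names commonly used by SAMTools (`<file>.bam.bai`, `<file>.bai`, or `<file>.bam.csi`), and it does not check that the BAM file is sorted.

```python
from pathlib import Path
from typing import List


def missing_inputs(gtf: str, bam: str) -> List[str]:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    gtf_path, bam_path = Path(gtf), Path(bam)
    if not gtf_path.is_file():
        problems.append(f"annotation not found: {gtf_path}")
    if not bam_path.is_file():
        problems.append(f"alignments not found: {bam_path}")
    # Assumed index locations; adjust if your index lives elsewhere.
    index_candidates = [
        Path(str(bam_path) + ".bai"),
        bam_path.with_suffix(".bai"),
        Path(str(bam_path) + ".csi"),
    ]
    if not any(p.is_file() for p in index_candidates):
        problems.append(f"no BAM index found next to: {bam_path}")
    return problems


if __name__ == "__main__":
    # Hypothetical file names, used only to show the call.
    for problem in missing_inputs("annotation.gtf", "alignments.bam"):
        print(problem)
```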
You can run McSplicer easily in 3 steps:
./bin/exonRefine <annotation.gtf> --prefix OUTPUT_PREFIX
./bin/sigcount <alignments.bam> <annotation_refined.gtf> <outfile-prefix>
Run McSplicer to get splice site usage estimates.
python3 ./python_scripts/McSplicer.py \
    --gtf REFINED_GTF \
    --count_file SIGNATURE_COUNT_FILE \
    --out_dir OUTPUT_DIRECTORY \
    --bootstraps NUM_OF_BOOTSTRAPS \
    --read_len READ_LENGTH \
    --prefix OUT_FILE_PREFIX
Run the McSplicer script with the --help option for a complete list of options.
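If you prefer to drive the three steps from a script, the commands can be assembled as argument lists. The sketch below only builds the commands shown above and runs nothing. The function and parameter names are hypothetical. Because the names of the files written by exonRefine and sigcount are not stated here, the caller has to pass the refined GTF and the signature count file explicitly.

```python
import shlex
from typing import List


def build_commands(annotation_gtf: str, bam: str, refine_prefix: str,
                   refined_gtf: str, sig_prefix: str, count_file: str,
                   out_dir: str, out_prefix: str, read_len: int,
                   bootstraps: int) -> List[List[str]]:
    """Return the three McSplicer steps as argv lists, in run order."""
    return [
        # Step 1: refine the annotation.
        ["./bin/exonRefine", annotation_gtf, "--prefix", refine_prefix],
        # Step 2: count signatures against the refined annotation.
        ["./bin/sigcount", bam, refined_gtf, sig_prefix],
        # Step 3: estimate splice site usages.
        ["python3", "./python_scripts/McSplicer.py",
         "--gtf", refined_gtf,
         "--count_file", count_file,
         "--out_dir", out_dir,
         "--bootstraps", str(bootstraps),
         "--read_len", str(read_len),
         "--prefix", out_prefix],
    ]


if __name__ == "__main__":
    # All file names below are placeholders, not real McSplicer outputs.
    commands = build_commands(
        annotation_gtf="annotation.gtf", bam="alignments.bam",
        refine_prefix="refined", refined_gtf="REFINED_GTF",
        sig_prefix="signatures", count_file="SIGNATURE_COUNT_FILE",
        out_dir="out", out_prefix="run1", read_len=100, bootstraps=10)
    for cmd in commands:
        print(" ".join(shlex.quote(arg) for arg in cmd))
```

To execute the steps, each list can be passed to `subprocess.run(cmd, check=True)` from the McSplicer directory, so that a failing step stops the pipeline.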
Sample data and usage examples can be found in the ./examples subfolder.
The output CSV file contains the bootstrap step, splice site index, gene strand, chromosome, splice site genome position, and McSplicer splice site usage estimate.
If you run McSplicer with --bootstraps n, step 0 in the output file corresponds to the estimates based on the input count data, and the following n steps correspond to the estimates based on the bootstrapped count data.
The splice site index column gives the index of each 3' start or 5' end splice site in the order in which the sites appear in the gene, e.g., s0, s1, ..., e0, e1, ...; see the figure above for an illustration.
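One common way to use the bootstrap steps is to report, for each splice site, the step 0 estimate together with a percentile interval over the bootstrap estimates. The sketch below is not part of McSplicer and rests on assumptions: the file has no header row, the columns appear in the order listed above, and positions are integers. The column names in `COLUMNS` are hypothetical labels, so check them against your own output file first.

```python
import csv
from collections import defaultdict
from typing import Dict, Iterable, List, Optional, Tuple

# Hypothetical labels for the six columns, in the order described above.
COLUMNS = ["step", "site_index", "strand", "chrom", "position", "usage"]

Key = Tuple[str, str, int, str]  # (chrom, strand, position, site_index)
Summary = Tuple[float, Optional[float], Optional[float]]


def _quantile(sorted_vals: List[float], q: float) -> float:
    """Nearest-rank quantile of an already sorted, non-empty list."""
    return sorted_vals[round(q * (len(sorted_vals) - 1))]


def summarize(lines: Iterable[str], has_header: bool = False,
              lower: float = 0.025, upper: float = 0.975) -> Dict[Key, Summary]:
    """Map each splice site to (step 0 estimate, lower bound, upper bound)."""
    reader = csv.reader(lines)
    if has_header:
        next(reader, None)
    point: Dict[Key, float] = {}
    boot: Dict[Key, List[float]] = defaultdict(list)
    for row in reader:
        if not row:
            continue
        rec = dict(zip(COLUMNS, row))
        key = (rec["chrom"], rec["strand"], int(rec["position"]),
               rec["site_index"])
        usage = float(rec["usage"])
        if int(rec["step"]) == 0:
            point[key] = usage       # estimate from the input count data
        else:
            boot[key].append(usage)  # estimate from a bootstrap sample
    summary: Dict[Key, Summary] = {}
    for key, estimate in point.items():
        samples = sorted(boot.get(key, []))
        if samples:
            summary[key] = (estimate, _quantile(samples, lower),
                            _quantile(samples, upper))
        else:
            summary[key] = (estimate, None, None)  # run without bootstraps
    return summary


if __name__ == "__main__":
    # Made-up rows for illustration; real values come from a McSplicer run.
    toy = ["0,s0,+,chr1,1000,0.90", "1,s0,+,chr1,1000,0.80",
           "2,s0,+,chr1,1000,0.95", "3,s0,+,chr1,1000,0.85"]
    for site, stats in summarize(toy).items():
        print(site, stats)
```

For a real file, pass an open file handle instead of the toy list, e.g. `summarize(open(path, newline=""))`, and set `has_header=True` if the first row holds column names.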
The subfolder ./simulation_study contains the data and script needed to generate the synthetic RNA-seq datasets and reproduce the results reported in the McSplicer paper.
The Spike-In RNA Variants data are available from the NCBI database under dataset ID SRR3497201. The RNA-seq reads of the 36 individuals with autism spectrum disorder are publicly available here.
© 2020 McSplicer