
AStrap: Identification of alternative splicing from transcript sequences without a reference genome
GNU General Public License v2.0

AStrap R package

Identification of alternative splicing from transcript sequences without a reference genome

About

AStrap implements a de novo approach to detect alternative splicing (AS) from transcript sequences without a reference genome. It identifies AS events by extensive pairwise alignment of transcript sequences from SMRT sequencing data and predicts AS types with a machine-learning model that integrates more than 500 assembled features. Four types of AS events are considered: intron retention (IR), exon skipping (ES), alternative donor sites (AltD), and alternative acceptor sites (AltA). AStrap consists of four main stages: data preprocessing, feature construction, classification model building, and identification of AS events together with prediction of AS types. AStrap could be a valuable addition to the community for the study of AS in non-model organisms with limited genetic resources.

Installing AStrap

Mandatory

Required R Packages

Suggested R Packages

Installation

Using AStrap

To help users get started, we use the provided example dataset to illustrate the standard AStrap analysis workflow. Please refer to the User Guide for full details.

Section 1 Data loading

To identify AS events and predict AS types, the user should first load the data into AStrap.

Section 2 Feature construction

In AStrap, we have compiled a compendium of 511 unique features that cover major factors known to shape introns and/or exons. Feature construction is embedded in the function AStrap (see below), so users don't need to carry out this step themselves.
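To give a rough idea of what a sequence-derived feature can look like, here is a minimal base R sketch. It is an illustration under stated assumptions, not AStrap's feature code: the helper `seq_features` and the particular features it computes (length, GC content, the canonical GT and AG intron boundary dinucleotides, a dinucleotide count) are hypothetical and are not claimed to be among the 511 features used by the package.

```r
# Hypothetical helper (not part of AStrap): compute a few simple
# features from a single nucleotide sequence given as a string.
seq_features <- function(s) {
  s <- toupper(s)
  bases <- strsplit(s, "")[[1]]
  n <- length(bases)
  # All overlapping dinucleotides in the sequence.
  dinucs <- if (n >= 2) paste0(bases[-n], bases[-1]) else character(0)
  c(
    length    = n,
    gc        = mean(bases %in% c("G", "C")),   # GC fraction
    starts_GT = as.numeric(startsWith(s, "GT")), # canonical donor dinucleotide
    ends_AG   = as.numeric(endsWith(s, "AG")),   # canonical acceptor dinucleotide
    CG_count  = sum(dinucs == "CG")
  )
}

# Made-up example sequence.
feats <- seq_features("GTAAGCGTTTCAG")
print(feats)
```

Each event would contribute one such numeric vector, and stacking the vectors row by row gives the kind of feature matrix a classifier can be trained on.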

Section 3 Model building and performance evaluation

Two classification models, trained on AS data collected from rice and human, are integrated in AStrap and can be applied directly to distinguish AS types in other species. For classification of AS types, we applied and compared three widely used machine-learning techniques: support vector machine (SVM), random forests (RF), and adaptive boosting (AdaBoost). According to our analysis (see our paper), the RF-based model performed best, followed by the AdaBoost-based model, and the SVM-based model performed worst. We therefore recommend that users adopt the RF-based model for prediction of AS types.

* Use the human classification models (SVM, RF, and AdaBoost).

human_model<- load(system.file("data","human_model.Rdata",package = "AStrap"))
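One detail of base R worth keeping in mind here: `load()` restores the saved objects under their original names and returns only a character vector of those names, so the variable assigned above holds names rather than the models themselves. The sketch below shows this with stand-in objects written to a temporary file. The object names `demo_RFmodel` and `demo_SVMmodel` are hypothetical and are not the names stored in the files shipped with AStrap; print the returned vector to see the real ones.

```r
# Stand-in objects saved to a temporary .Rdata file (hypothetical names,
# not the objects shipped with AStrap).
demo_RFmodel  <- list(method = "rf")
demo_SVMmodel <- list(method = "svm")
tmp <- tempfile(fileext = ".Rdata")
save(demo_RFmodel, demo_SVMmodel, file = tmp)
rm(demo_RFmodel, demo_SVMmodel)

# load() returns the NAMES of the restored objects as a character vector.
env <- new.env()
loaded_names <- load(tmp, envir = env)
print(loaded_names)

# Retrieve one restored object by its name.
rf <- get("demo_RFmodel", envir = env)
print(rf$method)
```

Loading into a dedicated environment, as done here, is optional; it simply avoids overwriting objects of the same name in the workspace.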

Users can also train a classification model on their own data sets.
* Use function "extract_IsoSeq_ge" to extract the sequences around splice sites from the genome.

Loading example alternative splicing data

path <- system.file("extdata", "sample_riceAS.txt", package = "AStrap")
rice_ASdata <- read.table(path, sep = "\t", header = TRUE, stringsAsFactors = FALSE)

Loading genome using the package BSgenome

library("BSgenome.Osativa.MSU.MSU7")

Extracting sequence around splice sites based on the genome

rice_ASdata<- extract_IsoSeq_ge(rice_ASdata,Osativa)

* Use function "buildTrainModel" to build a model. The classification method is chosen with the parameter classifier: SVM, RF (default), or AdaBoost. This function returns a list containing the training set, test set, fitted model, predicted classification results, evaluation matrix of the fitted model, and an ROC curve.

library(randomForest)
library(ROCR)
library(ggplot2)
model <- buildTrainModel(rice_ASdata, chooseNum = 100, proTrain = 2/3, proTest = 1/3, ASlength = 0, classifier = "rf", use.all = FALSE)
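The `proTrain` and `proTest` arguments in the call above suggest a 2/3 to 1/3 split of the labelled events into training and test sets. The base R sketch below shows one common way to make such a split while keeping the AS types balanced. It is an assumption-laden illustration, not the internal code of buildTrainModel: the toy data, the column name `type`, and the helper `split_by_type` are all hypothetical.

```r
set.seed(1)

# Toy labelled events: 30 per AS type (made-up data for illustration).
events <- data.frame(
  id   = 1:120,
  type = rep(c("IR", "ES", "AltD", "AltA"), each = 30),
  stringsAsFactors = FALSE
)

# Hypothetical helper: within each AS type, sample a proportion of the
# rows for training and leave the remaining rows for testing.
split_by_type <- function(df, pro_train = 2/3) {
  rows_by_type <- split(seq_len(nrow(df)), df$type)
  train_idx <- unlist(lapply(rows_by_type, function(idx) {
    idx[sample.int(length(idx), size = round(length(idx) * pro_train))]
  }), use.names = FALSE)
  list(train = df[train_idx, ], test = df[-train_idx, ])
}

parts <- split_by_type(events)
print(table(parts$train$type))
print(table(parts$test$type))
```

Sampling within each type keeps the class proportions the same in both sets, which makes the evaluation on the test set easier to interpret.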


Section 4 Identification of AS events and prediction of AS types
---------
This section describes the identification of AS events based on pairwise alignment of isoforms of the same cluster and prediction of AS types based on the fitted model.
* Use function "AStrap" to identify AS events and predict AS types.

Loading rice model

rice_model<- load(system.file("data","rice_model.Rdata",package = "AStrap"))

Identification and prediction based on RF-based model of rice

result <- AStrap(alignment,trSequence,rice_RFmodel)

* Use function "plotAS" to visualize the results.

library(Gviz)
plotAS(result$ASevent, id = 1)
plotAS(result$ASevent, id = 7)
plotAS(result$ASevent, id = 13)
plotAS(result$ASevent, id = 21)



Citation
---------
If you are using AStrap, please cite: [Ji G, Ye W, Su Y, Chen M, Huang G and Wu X* (2019) AStrap: identification of alternative splicing from transcript sequences without a reference genome, Bioinformatics, 35, 2654-2656.](https://pubmed.ncbi.nlm.nih.gov/30535139/)