PEA is a docker-based integrated R toolkit that aims to facilitate the plant epitranscriptome analysis. This toolkit contains a comprehensive collection of functions required for read mapping, CMR calling, motif scanning and discovery, and gene functional enrichment analysis. PEA also takes advantage of machine learning technologies for transcriptome-scale CMR prediction, with high prediction accuracy, using the Positive Samples Only Learning algorithm, which addresses the two-class classification problem by using only positive samples (CMRs), in the absence of negative samples (non-CMRs). Hence PEA is a versatile epitranscriptome analysis pipeline covering CMR calling, prediction, and annotation.
# Loading sample data for reads mapping
fq <- system.file("extdata/test.fq", package = "PEA")
referenceGenome <- system.file("extdata/chromosome1.fa", package = "PEA")
# reads mapping using tophat with default parameter, the alignment results will be saved to the working directory (/host directory of dataset)
test.bam <- readMapping(alignment = "tophat", fq = fq, refGenome = referenceGenome, paired = F)
# reads mapping using tophat with 2 threads
test.bam <- readMapping(alignment = "tophat", fq = fq,refGenome = referenceGenome, paired = F, ... = "-p 2")
Note: other alignment toolkits such as Bowtie, Bowtie 2, TopHat 2, Hisat and Hisat can be easily invoked by specifying “alignment” parameter in “readMapping” function.
PEA integrated five peak calling methods including “SlidingWindow”, “exomePeak”, “MetPeak”, “MACS2” and “BayesPeak”. each peak calling method can be easily invoked by specifying parameter “method” in “CMRCalling” function. How to choose:
“BayesPeak” is specially designed to detect genomic enriched regions in
ChIP-seq dataset using Bayesian hidden Markov model and Markov chain Monte
Carlo algorithms.
Summary: Both “SlidingWindow” and “exomePeak” are sliding window-based method to detect significant enriched regions and perform well on CMR prediction, while “MACS2” and “BayesPeak” are peak calling methods designed for ChIP-seq dataset, but “MACS2” can strictly control false positives, in contrast, BayePeak is weaker in controlling false positives. In addition, method “exomePeak”, “MeTPeak” and “MACS2” can accept biological replicates using statistical methods.
# Loading sample data for peak calling
input.bam <- system.file("extdata/chr1_input_test.bam", package = "PEA")
RIP.bam <- system.file("extdata/chr1_RIP_test.bam", package = "PEA")
refGenome <- system.file("extdata/chromosome1.fa", package = "PEA")
GTF <- system.file("extdata/chromosome1.gtf", package = "PEA")
cmrMat <- CMRCalling(CMR = "m6A", IPBAM = RIP.bam, inputBAM = input.bam, method = "SlidingWindow", mappedInput = 17472, mappedRIP = 20072, refGenome = refGenome)
system(paste0("samtools faidx ", refGenome)) system(paste0("cut -f1,2 ", refGenome, ".fai > genome.size")) cmrMat <- CMRCalling(CMR = "m6A", IPBAM = RIP.bam, inputBAM = input.bam, method = "SlidingWindow", mappedInput = 17472, mappedRIP = 20072, refGenome = "genome.size")
write.table(cmrMat, file = "SlidingWindow_peaks.txt", sep = "\t", quote = F, row.names = F, col.names = F)
Note: parameters "mappedInput" and "mappedRIP" represent the number of reads aligned to reference genome in input and RIP samples, respectively.
cmrMat <- CMRCalling(CMR = "m6A", method = "exomePeak", IPBAM = RIP.bam, inputBAM = input.bam, GTF = GTF)
write.table(cmrMat, file = "exomePeak_peaks.txt", sep = "\t", quote = F, row.names = F, col.names = F)
cmrMat <- CMRCalling(CMR = "m6A", method = "MetPeak", IPBAM = RIP.bam, inputBAM = input.bam, GTF = GTF)
write.table(cmrMat, file = "MetPeak_peaks.txt", sep = "\t",quote = F, row.names = F, col.names = F)
cmrMat <- CMRCalling(CMR = "m6A", method = "MACS2", IPBAM = RIP.bam, inputBAM = input.bam, GTF = GTF, ...="--nomodel")
Note: futher parameters recognized by MACS2 can be specified in "..."
write.table(cmrMat, file = "MACS2_peaks.txt", sep = "\t", quote = F, row.names = F, col.names = F)
cmrMat <- CMRCalling(CMR = "m6A", method = "BayesPeak", IPBAM = RIP.bam, inputBAM = input.bam, GTF = GTF)
write.table(cmrMat, file = "BayesPeak_peaks.txt", sep = "\t", quote = F, row.names = F, col.names = F)
In order to be recognized by the ML-based classification models, PEA can transform each sample (L-nt flanking sequence centered on m6A or non-m6A modifications) into a (L4+20+22)-dimensional vector which including L4 binary-based features, 20 k-mer based features (k = 1 and k = 2) and 22 PseDNC-based features.
# Convert genomic position to cDNA position
GTF <- system.file("extdata/chromosome1.gtf", package = "PEA")
peaks <- G2T(bedPos = cmrMat, GTF = GTF)
# Searching RRACH motif from cDNA sequence
cDNA <- system.file("extdata/chr1_cdna.fa", package = "PEA")
motifPos <- searchMotifPos(sequence = cDNA)
# Find confident positive samples and unlabel samples
posSamples <- findConfidentPosSamples(peaks = peaks, motifPos = motifPos)
unlabelSamples <- findUnlabelSamples(cDNAID = posSamples$cDNAID, motifPos = motifPos, posSamples = posSamples$positives)
In order to be recognized by the ML-based classification models, PEA can transform each sample (L-nt flanking sequence centered on m6A or non-m6 modifications) into a (L*4+20+22)-dimensional vector which including L*4 binary-based features, 20 k-mer based features (k = 1 and k = 2) and 22 PseDNC-based features.
# Extracting flanking sequence of 101-nt centered on m6A or non-m6A
positives <- posSamples$positives
posSeq <- extractSeqs(RNAseq = cDNA, samples = positives, seqLen = 101)
unlabelSeq <- extractSeqs(RNAseq = cDNA, samples = unlabelSamples, seqLen = 101)
# Feature encoding using *binary*, *k-mer* and *PseDNC* encoding schemes
posFeatureMat <- featureEncoding(RNAseq = posSeq)
unlabelFeatureMat <- featureEncoding(RNAseq = unlabelSeq)
featureMat <- rbind(posFeatureMat, unlabelFeatureMat)
The RF is a powerful ML algorithm that has been widely used for the binary classification problems in bioinformatics and computational biology (Cui, et al., 2015; Ma, et al., 2014; Touw, et al., 2013). To build a RF-based classifier (predictor), positive samples and negative samples are required to be specified by the user. Existing m6A prediction approach typically use the known m6A modifications as the positive sample set and the unlabeled samples as the negative sample set to build predictors to identify new m6A modifications from unlabeled samples (Chen, et al., 2016; Zhou, et al., 2016). Such kind of m6A predictors is actually built from a noisy negative sample set. As a result, the predictors do not perform well as they could in identifying new m6A modifications. To address this challenge, PSOL algorithm was introduced into the PEA package, which can be used to build an ML-based binary classification system with high prediction accuracy with only positive samples and without pre-specified negative samples. This algorithm has been previously applied to predict abiotic stress-response genes and genomic loci encoding functional noncoding RNAs (Ma, et al., 2014; Wang, et al., 2006).
To implement the PSOL algorithm, an initial negative sample set with the same size as the positive sample set was firstly constructed, which contained samples that were selected from the unlabeled sample set based on the maximal Euclidean distances to positive samples. The negative sample set was then expanded iteratively using the RF-based classifier until the designated iteration number was reached. In each iteration, a ten-fold cross-validation method and the receiver operating characteristic (ROC) curve analysis were performed to evaluate the predictive performance of current classifier in distinguishing between positive samples and negative samples. In order to get the optimal RF classifier, the RF classifier with max AUC value was selected from the ten-fold cross-validation and was subsequently used for scoring the unlabeled samples with a user-adjustable threshold (TPR = 0.995) to ensure a large fraction of positive samples can be correctly identified.
In each iteration, an m6A predictor was built using the RF algorithm with positive samples and selected negative samples. PSOL algorithm can be easily implemented using the function “PSOL” in the PEA package.
# Creating a directory to save psol results
dir.create("./psol")
psolResDic <- "./psol/"
# Starting running PSOL
psolRes <- PSOL(featureMatrix = featureMat, positives = positives,
unlabels = unlabelSamples, PSOLResDic = psolResDic, cpus = 1)
# Extracting negative samples generated by PSOL
negatives <- psolRes$finalNegatives
# Performing 5-fold cross-validation experiment to evaluate the performance of predictor
cvRes <- cross_validation(featureMat = featureMat, positives = positives, negatives = negatives, cross = 5)
# Plotting ROC curves for 5-fold cross-validation
pdf("cross_validation.pdf", height = 5, width = 5)
plotROC(cvRes = cvRes)
dev.off()
After performing PSOL, an m6A predictor would be generated in the “psolRes” object, users can easily predict novel candidate CMRs by following command.
# Predicting novel candidate m6A modifications
predSeq <- system.file("extdata/test_pred.fa", package = "PEA")
predMat <- predCMR(predSeq, model = psolRes$model)
PEA also provides CMR annotation to provide insights into spatial and functional associations of CMRs through function “CMRAnnotation”. Using this function, the manner of distribution of CMRs in the transcriptome is statistically analyzed, including the spatial distribution of CMRs, and the regions of enrichment of CMRs within transcripts. In addition, motif scanning and de novo motif discovery are also provided to investigate the potential regulatory mechanisms leading by CMRs. Moreover, gene functional (Gene Ontology) enrichment analysis is also performed to characterize the enriched functions of CMR-corresponding transcripts using R package “topGO”
# Loading sample data
GTF <- system.file("extdata/chromosome1.gtf", package = "PEA")
input.bam <- system.file("extdata/chr1_input_test.bam", package = "PEA")
RIP.bam <- system.file("extdata/chr1_RIP_test.bam", package = "PEA")
refGenome <- system.file("extdata/chromosome1.fa", package = "PEA")
cDNA <- system.file("extdata/chr1_cdna.fa", package = "PEA")
# Extract the UTR position from GTF file
UTRMat <- getUTR(GTF = GTF)
cmrMat <- CMRCalling(CMR = "m6A", IPBAM = RIP.bam, inputBAM = input.bam, method = "SlidingWindow", mappedInput = 17472, mappedRIP = 20072, refGenome = refGenome)
# Perform CMR location distribution analysis
pdf("CMR_location.pdf", height = 10, width = 10)
results <- CMRAnnotation(cmrMat = cmrMat, SNR = F, UTRMat = UTRMat, genomic = T, annotation = "location", GTF = GTF, RNAseq = cDNA)
dev.off()
# Search motif
testSeq <- system.file("extdata/test.fa", package = "PEA")
pdf("motifScan.pdf", height = 5, width = 5)
results.scan <- CMRAnnotation(cmrSeq = testSeq, annotation = "motifScan")
dev.off()
# De-novo motif detection
pdf("motifDetect.pdf", height = 5, width = 5)
results.detect <- CMRAnnotation(cmrSeq = testSeq, annotation = "motifDetect")
dev.off()
# GO enrichment analysis
library(topGO)
peaks <- G2T(bedPos = cmrMat, GTF = GTF)
enrichment <- CMRAnnotation(cmrMat = peaks, GTF = GTF, annotation = "GO", topNodes = 20, dataset = "athaliana_eg_gene")
The source codes of PEA are freely available at PEA
Zhai, J., Song, J., Cheng, Q., Tang, Y., & Ma, C. PEA: an integrated R toolkit for plant epitranscriptome analysis. Bioinformatics, 2018.