Open balwierz opened 7 years ago
Thanks for clarification @GuangchuangYu. Literally what I need is that my "peaks" have strand information. And I want to annotate them with genes only on the same strand, while ignoring genes on the other strand. An easy implementation would be to separate my peaks to two groups on both strands, do the same to the TxDb, and use annotatePeak twice for (+,+) and (-,-) pairs and finally merge the results. However, I don't know how to subdivide TxDb object into two based on strand....
Do you know the way? Or can you recommend (or ideally implement) a solution?
Cheers, Piotr
Hi @balwierz, I'm dealing with this exact issue and I'm curious if you found a satisfactory solution. I've been playing with your idea of splitting your peak file by strandedness, but doing the similar split for TxDb doesn't seem to be tenable. I posed this question on bioconductor and the best solution I received was to slim the TxDb into a GRanges of transcripts, which then can be subset by strand, but this seemed to impair some of Chipseekers function in annotatePeak (addFlankGeneinfo= TRUE stopped returning all nearby genes).
Hi @balwierz ! I guess you have already found a way in the meanwhile :D !
I had the same problem and I created two separate TxDb for "+ genes" and "- genes", then I did 2 separate annotation and in each output I selected only the lines that display the same strand of the TxDb used, then I merged the tables. I hope is correct
library("org.Hs.eg.db")
library(TxDb.Hsapiens.UCSC.hg38.knownGene)
TxDb = TxDb.Hsapiens.UCSC.hg38.knownGene
#covert TxDb to list object
aa=as.list(TxDb)
sapply(aa,nrow) ;
lapply(aa,head)
# create strand specific lists
a.fw = aa
a.rv = aa
table(aa [[ "transcripts" ]]$"tx_strand")
table(aa [[ "splicings" ]]$"exon_strand")
# identify + and - genes, transcripts and exons.
w_aa.trans.fw = which( aa[["transcripts"]]$"tx_strand"=="+")
w_aa.trans.rv = which( aa[["transcripts"]]$"tx_strand"=="-")
w_aa.splic.fw = which( aa[["splicings"]]$"exon_strand"=="+")
w_aa.splic.rv = which( aa[["splicings"]]$"exon_strand"=="-")
tx_id.fw = aa[["transcripts"]]$"tx_id"[w_aa.trans.fw]
tx_id.rv = aa[["transcripts"]]$"tx_id"[w_aa.trans.rv]
table(aa [[ "transcripts" ]]$"tx_strand")
length(tx_id.fw)
length(tx_id.rv)
table(aa [[ "splicings" ]]$"exon_strand")
length(tx_id.fw)
length(tx_id.rv)
# keep "+" lines only in "FW" TxDB-list and "-" lines only in the "RV" TxDB-list
a.fw[[ "transcripts" ]] = a.fw[[ "transcripts" ]] [ w_aa.trans.fw , ]
a.rv[[ "transcripts" ]] = a.rv[[ "transcripts" ]] [ w_aa.trans.rv , ]
a.fw[[ "splicings" ]] = a.fw[[ "splicings" ]] [ w_aa.splic.fw , ]
a.rv[[ "splicings" ]] = a.rv[[ "splicings" ]] [ w_aa.splic.rv , ]
a.fw[[ "genes" ]] = a.fw[[ "genes" ]] [ a.fw[[ "genes" ]]$"tx_id" %in% tx_id.fw , ]
a.rv[[ "genes" ]] = a.rv[[ "genes" ]] [ a.rv[[ "genes" ]]$"tx_id" %in% tx_id.rv , ]
sapply(aa ,nrow) ; sapply(a.fw,nrow) ; sapply(a.rv,nrow) ;
# create the 2 strand specific databases
TxDb.fw =do.call(makeTxDb, a.fw)
TxDb.rv =do.call(makeTxDb, a.rv)
# run chipseeker
peakAnno.fw = annotatePeak("My_file", tssRegion=c(-3000, 3000), TxDb=TxDb.fw, annoDb="org.Hs.eg.db" )
peakAnno.rv = annotatePeak("My_file", tssRegion=c(-3000, 3000), TxDb=TxDb.rv, annoDb="org.Hs.eg.db" )
peakAnno.fw = as.data.table(peakAnno.fw)
peakAnno.rv = as.data.table(peakAnno.rv)
nrow(peakAnno.fw)
nrow(peakAnno.rv)
peakAnno.fw_filt = peakAnno.fw [ peakAnno.fw$"strand.1"=="+"]
peakAnno.rv_filt = peakAnno.rv [ peakAnno.rv$"strand.1"=="-"]
# merge and sort
peakAnno.str.spe = rbind(peakAnno.fw_filt,peakAnno.rv_filt)
peakAnno.str.spe = peakAnno.str.spe[ order(peakAnno.str.spe$"seqnames",peakAnno.str.spe$"start", peakAnno.str.spe$"geneStart") , ]
mirkocelii, thanks for the code. I encountered the same problem. The argument and its description are indeed very misleading...
steps to reproduce:
TxDb object construction:
library(ChIPseeker)
library(AnnotationDbi)
library(GenomicFeatures)
TxDb.Drerio.ucsc.geneid = makeTxDbFromUCSC(genome = "danRer7", tablename = "refGene")
annotation of one genomic range on the "+" strand:
as.data.frame(annotatePeak(GRanges("chr25", IRanges(37328949, 37329249), strand="+"), TxDb=TxDb.Drerio.ucsc.geneid, genomicAnnotationPriority = c("Promoter", "5UTR", "3UTR", "Exon","Downstream", "Intergenic"), sameStrand=TRUE))
Yields:
Which is wrong, because this transcript is on negative strand.
transcripts(TxDb.Drerio.ucsc.geneid)[transcripts(TxDb.Drerio.ucsc.geneid)$tx_name == "NM_001089523"]
Moreover, the region overlaps a transcript on "+" strand: NM_001270554.1, which actually should be assigned instead of the one above.
transcripts(TxDb.Drerio.ucsc.geneid)[transcripts(TxDb.Drerio.ucsc.geneid)$tx_name == "NM_001270554"]
From ?annotatePeak:
Details: