crazyhottommy closed 8 years ago
5' ---> 3'
peakStart peakEnd
+ 34802689-----34802698
geneEnd geneStart
- 34601188-----------------34715659
PARD3 and the peak have no overlap. PARD3 is annotated as the nearest gene, while the annotation column describes the location of the peak. The nearest gene is calculated by distance to TSS, and the genomic annotation is determined by overlap.
see also https://github.com/GuangchuangYu/ChIPseeker/issues/12
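To make the distinction concrete, here is a minimal base-R sketch using the coordinates from the diagram above. It only illustrates the two separate questions (overlap versus TSS position); it is not ChIPseeker's implementation, and the list layout is an assumption made for the example.

```r
# Coordinates taken from the diagram above.
peak <- list(start = 34802689, end = 34802698, strand = "+")
gene <- list(start = 34601188, end = 34715659, strand = "-")

# Genomic annotation question: do the two intervals share any base?
overlaps <- peak$start <= gene$end && peak$end >= gene$start

# Nearest-gene question: where is the TSS? For a minus-strand gene it is
# the larger coordinate, because transcription runs right to left.
tss <- if (gene$strand == "-") gene$end else gene$start

overlaps  # FALSE: the peak lies outside this gene model
tss       # 34715659
```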
I think it has to do with the gene annotation package. Let's try RefSeq:
library(GenomicFeatures)
library(ChIPseeker)  # provides annotatePeak()
hg19.refseq.db <- makeTxDbFromUCSC(genome = "hg19", table = "refGene")
gr <- GRanges(seqnames = "chr10", ranges = IRanges(start = 34802689, end = 34802698), strand = "+")
ap <- annotatePeak(gr, tssRegion = c(-3000, 3000),
                   TxDb = hg19.refseq.db, level = "transcript", annoDb = "org.Hs.eg.db",
                   sameStrand = FALSE, ignoreOverlap = FALSE,
                   ignoreDownstream = TRUE)
>as.data.frame(ap)
seqnames start end width strand annotation geneChr geneStart geneEnd geneLength geneStrand
1 chr10 34802689 34802698 10 + Distal Intergenic chr10 34048641 34061608 12968 -
geneId transcriptId distanceToTSS ENSEMBL SYMBOL GENENAME
1 100505583 NR_038932 -741081 ENSG00000261683 LINC00838 long intergenic non-protein coding RNA 838
If you check the RefSeq ID:
>transcripts(hg19.refseq.db)[transcripts(hg19.refseq.db)$tx_name=="NM_001184791"]
GRanges object with 1 range and 2 metadata columns:
seqnames ranges strand | tx_id tx_name
<Rle> <IRanges> <Rle> | <integer> <character>
[1] chr10 [34398488, 35104253] - | 26255 NM_001184791
NM_001184791 is one of the transcripts of the PARD3 gene.
The peak 34802689-----34802698 actually resides in an intron of NM_001184791. ChIPseeker is not annotating PARD3 as the overlapping gene, but LINC00838, which is further upstream of the PARD3 gene. You can verify this in the IGV browser.
Thanks for looking into this!
Ming
good catch!
> GRegion[GRegion$gene_id == "NM_001184791/56288"]
GRanges object with 20 ranges and 2 metadata columns:
seqnames ranges strand | gene_id intron_rank
<Rle> <IRanges> <Rle> | <character> <integer>
26255 chr10 [34400491, 34408540] - | NM_001184791/56288 1
26255 chr10 [34408669, 34420390] - | NM_001184791/56288 2
26255 chr10 [34420512, 34558584] - | NM_001184791/56288 3
26255 chr10 [34558828, 34606034] - | NM_001184791/56288 4
26255 chr10 [34606267, 34620044] - | NM_001184791/56288 5
... ... ... ... ... ... ...
26255 chr10 [34688342, 34690753] - | NM_001184791/56288 16
26255 chr10 [34690846, 34759012] - | NM_001184791/56288 17
26255 chr10 [34759192, 34805906] - | NM_001184791/56288 18
26255 chr10 [34806088, 34985245] - | NM_001184791/56288 19
26255 chr10 [34985348, 35103803] - | NM_001184791/56288 20
-------
seqinfo: 93 sequences (1 circular) from hg19 genome
> peaks
GRanges object with 1 range and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr10 [34802689, 34802698] +
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
> findOverlaps(peaks, GRegion)
Hits object with 0 hits and 0 metadata columns:
queryHits subjectHits
<integer> <integer>
-------
queryLength: 1
subjectLength: 457682
> findOverlaps(peaks, unstrand(GRegion))
Hits object with 11 hits and 0 metadata columns:
queryHits subjectHits
<integer> <integer>
[1] 1 226414
[2] 1 226437
[3] 1 226460
[4] 1 226483
[5] 1 226505
[6] 1 226527
[7] 1 226547
[8] 1 226571
[9] 1 226591
[10] 1 226611
[11] 1 226630
-------
queryLength: 1
subjectLength: 457682
This is due to the newly introduced support for stranded peaks: findOverlaps only reports overlaps on the same strand.
It has been fixed in version >= 1.7.6.
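The strand effect can be reproduced without a TxDb. The sketch below, assuming only that GenomicRanges is installed, builds the peak and intron 18 from the output above by hand and compares a strand-aware call with one that ignores strand (equivalent in effect to the unstrand() call shown earlier).

```r
library(GenomicRanges)

# Peak on the + strand, intron 18 of NM_001184791 on the - strand.
peak   <- GRanges("chr10", IRanges(34802689, 34802698), strand = "+")
intron <- GRanges("chr10", IRanges(34759192, 34805906), strand = "-")

# Default behaviour: ranges on opposite strands never overlap.
strand_aware <- findOverlaps(peak, intron)

# Ignoring strand recovers the positional overlap.
strand_blind <- findOverlaps(peak, intron, ignore.strand = TRUE)

length(strand_aware)  # 0
length(strand_blind)  # 1
```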
> ff=transcriptsBy(hg19.refseq.db)
> ff=unlist(ff)
> ff
GRanges object with 55053 ranges and 2 metadata columns:
seqnames ranges strand | tx_id tx_name
<Rle> <IRanges> <Rle> | <integer> <character>
1 chr19 [58858172, 58864865] - | 46301 NM_130786
10 chr8 [18248755, 18258723] + | 21083 NM_000015
100 chr20 [43248163, 43280376] - | 47512 NM_000022
1000 chr18 [25530927, 25616549] - | 42760 NM_001308176
1000 chr18 [25530927, 25757410] - | 42761 NM_001792
... ... ... ... ... ... ...
9994 chr6 [90539619, 90584155] + | 16709 NM_012115
9997 chr22 [50961997, 50964033] - | 49604 NM_001169111
9997 chr22 [50961997, 50964034] - | 49605 NM_005138
9997 chr22 [50961997, 50964574] - | 49606 NM_001169110
9997 chr22 [50961997, 50964868] - | 49607 NM_001169109
-------
seqinfo: 93 sequences (1 circular) from hg19 genome
> ff[which(ff$tx_name == "NM_001184785")]
GRanges object with 1 range and 2 metadata columns:
seqnames ranges strand | tx_id tx_name
<Rle> <IRanges> <Rle> | <integer> <character>
56288 chr10 [34398488, 35104253] - | 26249 NM_001184785
-------
seqinfo: 93 sequences (1 circular) from hg19 genome
If we look at the location of PARD3 (NM_001184785), we find that the TSS is at 35104253, which lies 3' (downstream) of the peak. When ignoreDownstream = TRUE, it will just be ignored.
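A small sketch of why the PARD3 TSS counts as downstream here, assuming only GenomicRanges. resize() is strand-aware, so for a minus-strand transcript the width-1 range lands on the end coordinate. How ChIPseeker implements ignoreDownstream internally is not shown in this thread; the precede()/follow() calls below only illustrate the geometry.

```r
library(GenomicRanges)

peak  <- GRanges("chr10", IRanges(34802689, 34802698), strand = "+")
pard3 <- GRanges("chr10", IRanges(34398488, 35104253), strand = "-")

# Strand-aware resize: the 5' end of a minus-strand range is its end coordinate.
tss <- resize(pard3, width = 1)
tss_position <- start(tss)

# Drop the strand so the comparison is purely positional relative to the peak.
tss_ns <- unstrand(tss)

downstream_idx <- precede(peak, tss_ns)  # index of the TSS the peak precedes
upstream_idx   <- follow(peak, tss_ns)   # index of the TSS the peak follows

tss_position    # 35104253
downstream_idx  # 1: the TSS is 3' of the peak
upstream_idx    # NA: nothing upstream in this one-gene example
```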
> ap<- annotatePeak(gr, tssRegion=c(-3000, 3000),TxDb=hg19.refseq.db, level = "transcript")
>> preparing features information... 2016-01-12 10:35:59
>> identifying nearest features... 2016-01-12 10:35:59
>> calculating distance from peak to TSS... 2016-01-12 10:36:00
>> assigning genomic annotation... 2016-01-12 10:36:00
>> assigning chromosome lengths 2016-01-12 10:36:09
>> done...
> as.data.frame(ap)
seqnames start end width strand
1 chr10 34802689 34802698 10 +
annotation geneChr geneStart geneEnd
1 Intron (NM_001184785/56288, intron 22 of 24) chr10 34398488 35104253
geneLength geneStrand geneId transcriptId distanceToTSS
1 705766 - 56288 NM_001184785 301555
ap<- annotatePeak(gr, tssRegion=c(-3000, 3000),TxDb=hg19.refseq.db, level = "transcript", ignoreDownstream = TRUE)
> as.GRanges(ap)
GRanges object with 1 range and 9 metadata columns:
seqnames ranges strand | annotation geneChr geneStart
<Rle> <IRanges> <Rle> | <character> <factor> <integer>
[1] chr10 [34802689, 34802698] + | Intron (NM_001184785/56288, intron 22 of 24) chr10 34048641
geneEnd geneLength geneStrand geneId transcriptId distanceToTSS
<integer> <integer> <factor> <character> <character> <numeric>
[1] 34061608 12968 - 100505583 NR_038932 -741081
-------
seqinfo: 1 sequence from hg19 genome
Thx Guangchuang! Now the annotation is correct. However, I want the annotated gene to be PARD3 even when ignoreDownstream = TRUE.
So, one has to call findOverlaps on the unstranded gene before resizing the gene to its TSS. If there are no overlapping genes, then use follow(breakpoint, unstrand(resize(gene, width=1))) to find the upstream ones.
In other words, I want to annotate the breakpoint with overlapping genes first (regardless of the strand of the gene and the breakpoint). If there is no overlapping gene for that breakpoint, then find the closest gene upstream of the breakpoint.
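The procedure described above can be sketched roughly as follows. annotate_breakpoint() and the symbol column are hypothetical names introduced for illustration; the two gene ranges are the LINC00838 and PARD3 coordinates from the outputs in this thread, and the second and third breakpoints are made-up positions to exercise the fallback and the NA case.

```r
library(GenomicRanges)

# Hypothetical helper: overlap first, closest upstream TSS as fallback.
annotate_breakpoint <- function(breakpoint, genes) {
  # Step 1: positional overlap with the gene body, ignoring strand.
  hit <- findOverlaps(breakpoint, unstrand(genes), select = "first")

  # Step 2: resize to the TSS while strand is still known, then unstrand.
  tss <- unstrand(resize(genes, width = 1))
  upstream <- follow(breakpoint, tss)

  # Prefer the overlapping gene; otherwise the upstream one (may be NA).
  idx <- ifelse(is.na(hit), upstream, hit)
  genes$symbol[idx]
}

genes <- GRanges("chr10",
                 IRanges(c(34048641, 34398488), c(34061608, 35104253)),
                 strand = "-",
                 symbol = c("LINC00838", "PARD3"))

breakpoints <- GRanges("chr10",
                       IRanges(c(34802689, 34200000, 1000),
                               c(34802698, 34200010, 1010)),
                       strand = "+")

result <- annotate_breakpoint(breakpoints, genes)
result  # "PARD3" "LINC00838" NA
```

The third breakpoint has neither an overlapping gene nor an upstream TSS, so the result is NA rather than an error; callers need to handle that case explicitly.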
Thanks again, Ming
I see. I have introduced another parameter, overlap.
##' @param overlap one of 'TSS' or 'all'; if overlap="all", then a gene overlapping the peak will be reported as the nearest gene, no matter whether the overlap is at the TSS region or not.
> ap<- annotatePeak(gr, tssRegion=c(-3000, 3000),TxDb=hg19.refseq.db, level = "transcript", ignoreDownstream = TRUE)
>> preparing features information... 2016-01-12 11:57:56
>> identifying nearest features... 2016-01-12 11:57:56
>> calculating distance from peak to TSS... 2016-01-12 11:57:56
>> assigning genomic annotation... 2016-01-12 11:57:56
>> assigning chromosome lengths 2016-01-12 11:58:04
>> done... 2016-01-12 11:58:04
> as.data.frame(ap)
seqnames start end width strand
1 chr10 34802689 34802698 10 +
annotation geneChr geneStart geneEnd
1 Intron (NM_001184785/56288, intron 22 of 24) chr10 34048641 34061608
geneLength geneStrand geneId transcriptId distanceToTSS
1 12968 - 100505583 NR_038932 -741081
> ap2<- annotatePeak(gr, tssRegion=c(-3000, 3000),TxDb=hg19.refseq.db, level = "transcript", ignoreDownstream = TRUE, overlap='all')
>> preparing features information... 2016-01-12 11:58:32
>> identifying nearest features... 2016-01-12 11:58:32
>> calculating distance from peak to TSS... 2016-01-12 11:58:33
>> assigning genomic annotation... 2016-01-12 11:58:33
>> assigning chromosome lengths 2016-01-12 11:58:34
>> done... 2016-01-12 11:58:34
> as.data.frame(ap2)
seqnames start end width strand annotation geneChr geneStart
1 chr10 34802689 34802698 10 + Promoter (<=1kb) chr10 34398488
geneEnd geneLength geneStrand geneId transcriptId distanceToTSS
1 35104253 705766 - 56288 NM_001184785 0
>
By default, only overlap with the TSS counts, and we assign distanceToTSS = 0 if it is an overlap hit.
ap2 annotates the peak as Promoter (<=1kb), which is indeed not true. I have tried to fix it: if overlap='all', ChIPseeker will determine whether the overlap is indeed at the TSS; if yes, distanceToTSS = 0, otherwise the distance is calculated.
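The rule described here (zero only when the peak actually covers the TSS, otherwise a real distance) can be sketched in base R. tss_gap() is a hypothetical helper, not ChIPseeker's code, and it returns an unsigned gap only; the sign convention of distanceToTSS is deliberately not reproduced. The second call uses a made-up peak that spans the TSS.

```r
# Hypothetical helper: unsigned gap between a peak and a TSS position.
tss_gap <- function(peak_start, peak_end, tss) {
  covers_tss <- peak_start <= tss & tss <= peak_end
  # Distance from the TSS to the nearer edge of the peak.
  gap <- pmin(abs(peak_start - tss), abs(peak_end - tss))
  ifelse(covers_tss, 0, gap)
}

# The peak from this thread against the PARD3 TSS at 35104253.
intron_peak_gap <- tss_gap(34802689, 34802698, 35104253)

# A made-up peak that spans the TSS.
tss_peak_gap <- tss_gap(35104000, 35104500, 35104253)

intron_peak_gap  # 301555
tss_peak_gap     # 0
```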
ap2<- annotatePeak(gr, tssRegion=c(-3000, 3000),TxDb=hg19.refseq.db, level = "transcript", ignoreDownstream = TRUE, overlap='all')
> as.data.frame(ap2)
seqnames start end width strand
1 chr10 34802689 34802698 10 +
annotation geneChr geneStart geneEnd
1 Intron (NM_001184785/56288, intron 22 of 24) chr10 34398488 35104253
geneLength geneStrand geneId transcriptId distanceToTSS
1 705766 - 56288 NM_001184785 -301555
Too many parameters have been introduced recently. Please be careful and help test them. Thanks.
Thx for taking the effort! I do not see the overlap option in ChIPseeker_1.7.6.
I committed it yesterday. You need to re-install ChIPseeker.
@GuangchuangYu I just installed the latest version, but got errors:
breakA_anno<- annotatePeak(breakpointA, tssRegion=c(-3000, 3000),
TxDb=hg19.refseq.db, level = "transcript", annoDb="org.Hs.eg.db",
sameStrand = FALSE, ignoreOverlap = FALSE,
ignoreDownstream = TRUE, overlap = "all")
>> preparing features information... 2016-01-12 21:56:55
>> identifying nearest features... 2016-01-12 21:56:55
Error: subscript contains NAs or out-of-bounds indices
In addition: Warning messages:
1: In start(peaks) - start(features[featureIdx]) :
longer object length is not a multiple of shorter object length
2: In end(peaks) - start(features[featureIdx]) :
longer object length is not a multiple of shorter object length
3: In dd[peakIdx] <- distance_minimal :
number of items to replace is not a multiple of replacement length
Some breakpoints may not have a closest upstream gene, so NA will be returned.
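A base-R sketch of how an NA index can be guarded before it is used for subsetting and assignment. The vectors are toy values and the variable names are assumptions chosen to echo the warning messages above; this is not the fix that went into ChIPseeker.

```r
# Toy data: three peaks, three candidate features.
peak_start    <- c(150, 20, 950)
feature_start <- c(100, 500, 900)

# Index of the upstream feature per peak; NA means none was found.
feature_idx <- c(1L, NA, 3L)

# Compute distances only where a feature exists, leaving NA elsewhere.
has_feature <- !is.na(feature_idx)
dist_to_feature <- rep(NA_real_, length(peak_start))
dist_to_feature[has_feature] <-
  peak_start[has_feature] - feature_start[feature_idx[has_feature]]

dist_to_feature  # 50 NA 50
```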
Give me the breakpointA object and I will look into it.
Thx, just sent to your gmail.
> breakA_anno<- annotatePeak(breakpointA, tssRegion=c(-3000, 3000),
+ TxDb=hg19.refseq.db, level = "transcript", annoDb="org.Hs.eg.db",
+ sameStrand = FALSE, ignoreOverlap = FALSE,
+ ignoreDownstream = TRUE, overlap = "all")
>> preparing features information... 2016-01-13 14:12:28
>> identifying nearest features... 2016-01-13 14:12:29
>> calculating distance from peak to TSS... 2016-01-13 14:12:29
>> assigning genomic annotation... 2016-01-13 14:12:29
>> adding gene annotation... 2016-01-13 14:12:39
Loading required package: org.Hs.eg.db
Loading required package: DBI
'select()' returned many:many mapping between keys and columns
>> assigning chromosome lengths 2016-01-13 14:12:42
>> done... 2016-01-13 14:12:42
The NA issue has been fixed in ChIPseeker >= 1.7.7.
Thx! It is working properly now. I will report back if I have any other problems. Thanks again for making this useful package and great user support.
Ming
Hi Guangchuang,
If you can, please test this: gr is in an intron of PARD3, and I am not sure why the annotation is Distal Intergenic. Thanks. Ming