YuLab-SMU / ChIPseeker

:dart: ChIP peak Annotation, Comparison and Visualization
https://onlinelibrary.wiley.com/share/author/GYJGUBYCTRMYJFN2JFZZ?target=10.1002/cpz1.585
226 stars 75 forks source link

annotation is wrong? #23

Closed crazyhottommy closed 8 years ago

crazyhottommy commented 8 years ago

Hi Guangchuan,

if you can test this:

library(ChIPseeker)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
gr<- GRanges( seqnames ="chr10", ranges=IRanges(start= 34802689, end=  34802698), strand="+")

ap<- annotatePeak(gr, tssRegion=c(-3000, 3000), 
              TxDb=txdb, annoDb="org.Hs.eg.db",
              sameStrand = FALSE, ignoreOverlap = FALSE,
              ignoreDownstream = TRUE)

as.data.frame(ap)
  seqnames    start      end width strand        annotation geneChr geneStart  geneEnd geneLength geneStrand
1    chr10 34802689 34802698    10      + Distal Intergenic   chr10  34601188 34715659     114472          -
  geneId transcriptId distanceToTSS         ENSEMBL SYMBOL                             GENENAME
1  56288   uc001ixo.2        -87030 ENSG00000148498  PARD3 par-3 family cell polarity regulator

> sessionInfo()
R version 3.2.2 (2015-08-14)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.10.1 (Yosemite)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] org.Hs.eg.db_3.2.3                      RSQLite_1.0.0                          
 [3] DBI_0.3.1                               BSgenome.Hsapiens.UCSC.hg19_1.4.0      
 [5] BSgenome_1.38.0                         rtracklayer_1.30.1                     
 [7] Biostrings_2.38.3                       XVector_0.10.0                         
 [9] TxDb.Hsapiens.UCSC.hg19.knownGene_3.2.2 GenomicFeatures_1.22.7                 
[11] AnnotationDbi_1.32.3                    Biobase_2.30.0                         
[13] ggplot2_2.0.0                           tidyr_0.3.1                            
[15] dplyr_0.4.3                             GenomicInteractions_1.4.1              
[17] GenomicRanges_1.23.10                   GenomeInfoDb_1.7.3                     
[19] IRanges_2.5.19                          S4Vectors_0.9.18                       
[21] BiocGenerics_0.17.2                     ChIPseeker_1.7.5                       
[23] BiocInstaller_1.20.1                   

loaded via a namespace (and not attached):
 [1] httr_1.0.0                  splines_3.2.2               gtools_3.5.0               
 [4] Formula_1.2-1               assertthat_0.1              latticeExtra_0.6-26        
 [7] Rsamtools_1.22.0            lattice_0.20-33             biovizBase_1.18.0          
[10] chron_2.3-47                digest_0.6.8                RColorBrewer_1.1-2         
[13] colorspace_1.2-6            plyr_1.8.3                  XML_3.98-1.3               
[16] devtools_1.9.1              biomaRt_2.26.1              zlibbioc_1.16.0            
[19] scales_0.3.0                gdata_2.17.0                BiocParallel_1.4.3         
[22] UpSetR_1.0.2                SummarizedExperiment_1.1.13 nnet_7.3-11                
[25] Gviz_1.14.2                 survival_2.38-3             magrittr_1.5               
[28] memoise_0.2.1               gplots_2.17.0               foreign_0.8-66             
[31] data.table_1.9.6            tools_3.2.2                 matrixStats_0.50.1         
[34] gridBase_0.4-7              stringr_1.0.0               munsell_0.4.2              
[37] cluster_2.0.3               plotrix_3.6-1               lambda.r_1.1.7             
[40] caTools_1.17.1              futile.logger_1.4.1         grid_3.2.2                 
[43] RCurl_1.95-4.7              dichromat_2.0-0             VariantAnnotation_1.16.4   
[46] igraph_1.0.1                labeling_0.3                bitops_1.0-6               
[49] boot_1.3-17                 gtable_0.1.2                curl_0.9.4                 
[52] R6_2.1.1                    GenomicAlignments_1.7.11    gridExtra_2.0.0            
[55] knitr_1.12                  Hmisc_3.17-1                futile.options_1.0.0       
[58] KernSmooth_2.23-15          stringi_1.0-1               Rcpp_0.12.2                
[61] rpart_4.1-10                acepack_1.3-3.3   

gr is in the intron of PARD3, not sure why the annotation is distal intergenic.

Thanks. Ming

GuangchuangYu commented 8 years ago
5' ---> 3'
                                                  peakStart    peakEnd
+                                                 34802689-----34802698
  geneEnd                  geneStart
- 34601188-----------------34715659

PARD3 and the peak have no overlap. PARD3 is annotated as nearest gene, while the annotation column annotate the location of the peak. Nearest gene is calculated by distance to TSS and genomic annotation is determined by overlap.

see also https://github.com/GuangchuangYu/ChIPseeker/issues/12

crazyhottommy commented 8 years ago

I think it has to do with the gene annotation package, let's try refseq:


library(GenomicFeatures)
hg19.refseq.db <- makeTxDbFromUCSC(genome="hg19", table="refGene")

gr<- GRanges( seqnames ="chr10", ranges=IRanges(start= 34802689, end=  34802698), strand="+")

ap<- annotatePeak(gr, tssRegion=c(-3000, 3000), 
                         TxDb=hg19.refseq.db, level = "transcript", annoDb="org.Hs.eg.db",
                         sameStrand = FALSE, ignoreOverlap = FALSE,
                         ignoreDownstream = TRUE)

>as.data.frame(ap)
seqnames    start      end width strand        annotation geneChr geneStart  geneEnd geneLength geneStrand
1    chr10 34802689 34802698    10      + Distal Intergenic   chr10  34048641 34061608      12968          -
     geneId transcriptId distanceToTSS         ENSEMBL    SYMBOL                                   GENENAME
1 100505583    NR_038932       -741081 ENSG00000261683 LINC00838 long intergenic non-protein coding RNA 838

If you check the refseq ID

>transcripts(hg19.refseq.db)[transcripts(hg19.refseq.db)$tx_name=="NM_001184791"]
GRanges object with 1 range and 2 metadata columns:
      seqnames               ranges strand |     tx_id      tx_name
         <Rle>            <IRanges>  <Rle> | <integer>  <character>
  [1]    chr10 [34398488, 35104253]      - |     26255 NM_001184791

NM_001184791 is one of the transcripts of PARD3 gene.

the peak 34802689-----34802698 actually resides in the intron of NM_001184791. It is not annotating PARD3 as the overlapping gene, but LINC00838 which is more upstream of the PARD3 gene. You can verify this on the IGV browser.

Thanks for looking into this!

Ming

GuangchuangYu commented 8 years ago

good catch!

> GRegion[GRegion$gene_id == "NM_001184791/56288"]
GRanges object with 20 ranges and 2 metadata columns:
        seqnames               ranges strand   |            gene_id intron_rank
           <Rle>            <IRanges>  <Rle>   |        <character>   <integer>
  26255    chr10 [34400491, 34408540]      -   | NM_001184791/56288           1
  26255    chr10 [34408669, 34420390]      -   | NM_001184791/56288           2
  26255    chr10 [34420512, 34558584]      -   | NM_001184791/56288           3
  26255    chr10 [34558828, 34606034]      -   | NM_001184791/56288           4
  26255    chr10 [34606267, 34620044]      -   | NM_001184791/56288           5
    ...      ...                  ...    ... ...                ...         ...
  26255    chr10 [34688342, 34690753]      -   | NM_001184791/56288          16
  26255    chr10 [34690846, 34759012]      -   | NM_001184791/56288          17
  26255    chr10 [34759192, 34805906]      -   | NM_001184791/56288          18
  26255    chr10 [34806088, 34985245]      -   | NM_001184791/56288          19
  26255    chr10 [34985348, 35103803]      -   | NM_001184791/56288          20
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome
> peaks
GRanges object with 1 range and 0 metadata columns:
      seqnames               ranges strand
         <Rle>            <IRanges>  <Rle>
  [1]    chr10 [34802689, 34802698]      +
  -------
  seqinfo: 1 sequence from an unspecified genome; no seqlengths
> findOverlaps(peaks, GRegion)
Hits object with 0 hits and 0 metadata columns:
   queryHits subjectHits
   <integer>   <integer>
  -------
  queryLength: 1
  subjectLength: 457682
> findOverlaps(peaks, unstrand(GRegion))
Hits object with 11 hits and 0 metadata columns:
       queryHits subjectHits
       <integer>   <integer>
   [1]         1      226414
   [2]         1      226437
   [3]         1      226460
   [4]         1      226483
   [5]         1      226505
   [6]         1      226527
   [7]         1      226547
   [8]         1      226571
   [9]         1      226591
  [10]         1      226611
  [11]         1      226630
  -------
  queryLength: 1
  subjectLength: 457682

This is due to the newly introduced support of peaks with strandness. findOverlaps only report overlap in the same strand.

It has been fixed in version >= 1.7.6.

GuangchuangYu commented 8 years ago

nearest gene issue (PARD3 or LINC00838)

> ff=transcriptsBy(hg19.refseq.db)
> ff=unlist(ff)
> ff
GRanges object with 55053 ranges and 2 metadata columns:
       seqnames               ranges strand   |     tx_id      tx_name
          <Rle>            <IRanges>  <Rle>   | <integer>  <character>
     1    chr19 [58858172, 58864865]      -   |     46301    NM_130786
    10     chr8 [18248755, 18258723]      +   |     21083    NM_000015
   100    chr20 [43248163, 43280376]      -   |     47512    NM_000022
  1000    chr18 [25530927, 25616549]      -   |     42760 NM_001308176
  1000    chr18 [25530927, 25757410]      -   |     42761    NM_001792
   ...      ...                  ...    ... ...       ...          ...
  9994     chr6 [90539619, 90584155]      +   |     16709    NM_012115
  9997    chr22 [50961997, 50964033]      -   |     49604 NM_001169111
  9997    chr22 [50961997, 50964034]      -   |     49605    NM_005138
  9997    chr22 [50961997, 50964574]      -   |     49606 NM_001169110
  9997    chr22 [50961997, 50964868]      -   |     49607 NM_001169109
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome
> ff[which(ff$tx_name == "NM_001184785")]
GRanges object with 1 range and 2 metadata columns:
        seqnames               ranges strand |     tx_id      tx_name
           <Rle>            <IRanges>  <Rle> | <integer>  <character>
  56288    chr10 [34398488, 35104253]      - |     26249 NM_001184785
  -------
  seqinfo: 93 sequences (1 circular) from hg19 genome

If we look at the location of PARD3 (NM_001184785), we can find that the TSS is 35104253, which is at the 3' region of the peak. When ignoreDownstream = TRUE, it will just be ignored.

> ap<- annotatePeak(gr, tssRegion=c(-3000, 3000),TxDb=hg19.refseq.db, level = "transcript")
>> preparing features information...         2016-01-12 10:35:59
>> identifying nearest features...       2016-01-12 10:35:59
>> calculating distance from peak to TSS...  2016-01-12 10:36:00
>> assigning genomic annotation...       2016-01-12 10:36:00
>> assigning chromosome lengths          2016-01-12 10:36:09
>> done...
> as.data.frame(ap)
  seqnames    start      end width strand
1    chr10 34802689 34802698    10      +
                                    annotation geneChr geneStart  geneEnd
1 Intron (NM_001184785/56288, intron 22 of 24)   chr10  34398488 35104253
  geneLength geneStrand geneId transcriptId distanceToTSS
1     705766          -  56288 NM_001184785        301555
crazyhottommy commented 8 years ago
ap<- annotatePeak(gr, tssRegion=c(-3000, 3000),TxDb=hg19.refseq.db, level = "transcript", ignoreDownstream = TRUE)

> as.GRanges(ap)
GRanges object with 1 range and 9 metadata columns:
      seqnames               ranges strand |                                   annotation  geneChr geneStart
         <Rle>            <IRanges>  <Rle> |                                  <character> <factor> <integer>
  [1]    chr10 [34802689, 34802698]      + | Intron (NM_001184785/56288, intron 22 of 24)    chr10  34048641
        geneEnd geneLength geneStrand      geneId transcriptId distanceToTSS
      <integer>  <integer>   <factor> <character>  <character>     <numeric>
  [1]  34061608      12968          -   100505583    NR_038932       -741081
  -------
  seqinfo: 1 sequence from hg19 genome

thx Guangchuang! Now, the annotation is correct. However, I want the annotated gene to be PARD3 even when ignoreDownstream = TRUE.

So, one has to call findOverlaps before resizing the gene to TSS and unstrand the gene. If there is no overlapping genes, then use follow(breakpoint, unstrand(resize(gene, width=1)) to find the upstream ones.

In other words. I want to annotate the breakpoint with overlapping genes first (no matter the strandness of the gene and the breakpoint). If there is no overlapping gene for that breakpoint, then find the closest gene upstream of the breakpoint.

Thanks again, Ming

GuangchuangYu commented 8 years ago

I see and introduced another parameter, overlap.

##' @param overlap one of 'TSS' or 'all', if overlap="all", then gene overlap with peak will be reported 
                     as nearest gene, no matter the overlap is at TSS region or not.
> ap<- annotatePeak(gr, tssRegion=c(-3000, 3000),TxDb=hg19.refseq.db, level = "transcript", ignoreDownstream = TRUE)
>> preparing features information...         2016-01-12 11:57:56
>> identifying nearest features...       2016-01-12 11:57:56
>> calculating distance from peak to TSS...  2016-01-12 11:57:56
>> assigning genomic annotation...       2016-01-12 11:57:56
>> assigning chromosome lengths          2016-01-12 11:58:04
>> done...                   2016-01-12 11:58:04
> as.data.frame(ap)
  seqnames    start      end width strand
1    chr10 34802689 34802698    10      +
                                    annotation geneChr geneStart  geneEnd
1 Intron (NM_001184785/56288, intron 22 of 24)   chr10  34048641 34061608
  geneLength geneStrand    geneId transcriptId distanceToTSS
1      12968          - 100505583    NR_038932       -741081
> ap2<- annotatePeak(gr, tssRegion=c(-3000, 3000),TxDb=hg19.refseq.db, level = "transcript", ignoreDownstream = TRUE, overlap='all')
>> preparing features information...         2016-01-12 11:58:32
>> identifying nearest features...       2016-01-12 11:58:32
>> calculating distance from peak to TSS...  2016-01-12 11:58:33
>> assigning genomic annotation...       2016-01-12 11:58:33
>> assigning chromosome lengths          2016-01-12 11:58:34
>> done...                   2016-01-12 11:58:34
> as.data.frame(ap2)
  seqnames    start      end width strand       annotation geneChr geneStart
1    chr10 34802689 34802698    10      + Promoter (<=1kb)   chr10  34398488
   geneEnd geneLength geneStrand geneId transcriptId distanceToTSS
1 35104253     705766          -  56288 NM_001184785             0
>

By default only overlap with TSS count and we assign distanceToTSS = 0 if it's a overlap hit.

I try to fix it, if overlap='all', ChIPseeker will determine whether the overlap is indeed in TSS, if yes, distanceToTSS=0, otherwise, calculate the distance.

ap2<- annotatePeak(gr, tssRegion=c(-3000, 3000),TxDb=hg19.refseq.db, level = "transcript", ignoreDownstream = TRUE, overlap='all')
> as.data.frame(ap2)
  seqnames    start      end width strand
1    chr10 34802689 34802698    10      +
                                    annotation geneChr geneStart  geneEnd
1 Intron (NM_001184785/56288, intron 22 of 24)   chr10  34398488 35104253
  geneLength geneStrand geneId transcriptId distanceToTSS
1     705766          -  56288 NM_001184785       -301555

To much parameters introduced recently. Be careful and help testing it. Thanks.

crazyhottommy commented 8 years ago

Thx for taking the effort! I do not see the overlap option in ChIPseeker_1.7.6

GuangchuangYu commented 8 years ago

I did commit yesterday. You need to re-install ChIPseeker.

crazyhottommy commented 8 years ago

@GuangchuangYu I just installed the latest version, but got errors:

breakA_anno<- annotatePeak(breakpointA, tssRegion=c(-3000, 3000), 
                          TxDb=hg19.refseq.db, level = "transcript", annoDb="org.Hs.eg.db",
                        sameStrand = FALSE, ignoreOverlap = FALSE,
                        ignoreDownstream = TRUE, overlap = "all")

>> preparing features information...         2016-01-12 21:56:55 
>> identifying nearest features...       2016-01-12 21:56:55 
Error: subscript contains NAs or out-of-bounds indices
In addition: Warning messages:
1: In start(peaks) - start(features[featureIdx]) :
  longer object length is not a multiple of shorter object length
2: In end(peaks) - start(features[featureIdx]) :
  longer object length is not a multiple of shorter object length
3: In dd[peakIdx] <- distance_minimal :
  number of items to replace is not a multiple of replacement length

Some breakpoints may not have the closest upstream genes. NA will be returned.

GuangchuangYu commented 8 years ago

give me the breakpointA, I will look into it.

crazyhottommy commented 8 years ago

Thx, just sent to your gmail.

GuangchuangYu commented 8 years ago
> breakA_anno<- annotatePeak(breakpointA, tssRegion=c(-3000, 3000),
+                           TxDb=hg19.refseq.db, level = "transcript", annoDb="org.Hs.eg.db",
+                         sameStrand = FALSE, ignoreOverlap = FALSE,
+                         ignoreDownstream = TRUE, overlap = "all")
>> preparing features information...         2016-01-13 14:12:28
>> identifying nearest features...       2016-01-13 14:12:29
>> calculating distance from peak to TSS...  2016-01-13 14:12:29
>> assigning genomic annotation...       2016-01-13 14:12:29
>> adding gene annotation...             2016-01-13 14:12:39
Loading required package: org.Hs.eg.db
Loading required package: DBI

'select()' returned many:many mapping between keys and columns
>> assigning chromosome lengths          2016-01-13 14:12:42
>> done...                   2016-01-13 14:12:42

The NA issue had been removed in ChIPseeker >=1.7.7

crazyhottommy commented 8 years ago

Thx! It is working properly now. I will report back if I have any other problems. Thanks again for making this useful package and great user support.

Ming