YuLab-SMU / ChIPseeker

:dart: ChIP peak Annotation, Comparison and Visualization
https://onlinelibrary.wiley.com/share/author/GYJGUBYCTRMYJFN2JFZZ?target=10.1002/cpz1.585
219 stars 74 forks source link

Critical errors in annotating introns and exons - wrong gene #214

Open tjparnell opened 1 year ago

tjparnell commented 1 year ago

The reported annotation for peaks that fall in introns (and exons) are frequently reported as being in the adjacent neighboring gene, suggesting an off-by-1 indexing error when reporting the identifiers in the geneId and transcriptId columns. The identifiers are correct in the annotation column.

Reproducible code with version 1.34.1:

files <- getSampleFiles()
peak <- readPeakFile(files[[1]])
anno <- annotatePeak(peak, TxDb = txdb, annoDb = "org.Hs.eg.db")
annotable <- as.data.frame(anno)
head(annotable)

Result

  seqnames     start       end width strand                                    annotation geneChr geneStart
1     chrX  61728297  61728780   484      *                             Distal Intergenic      23  62567107
2    chr10  39105185  39105362   178      *                             Distal Intergenic      10  38989727
3     chrY  13137266  13137499   234      *                             Distal Intergenic      24  14517915
4    chr11 114049918 114050234   317      *       Intron (uc001poo.1/7704, intron 3 of 3)      11 114128553
5     chrY  13107715  13107867   153      *                             Distal Intergenic      24  14517915
6     chr1 142655900 142656170   271      * Intron (uc001eiw.1/uc001eiw.1, intron 1 of 7)       1 142697421
    geneEnd geneLength geneStrand    geneId transcriptId distanceToTSS         ENSEMBL      SYMBOL
1  62571218       4112          2    139886   uc004dvf.3        842438 ENSG00000186767       SPIN4
2  38991371       1645          1    399746   uc021ppf.1        115458            <NA>    ACTR3BP5
3  14533389      15475          2    352887   uc022cji.1       1395890 ENSG00000206159      GYG2P1
4 114183238      54686          1      4837   uc001por.1        -78319 ENSG00000166741        NNMT
5  14533389      15475          2    352887   uc022cji.1       1425522 ENSG00000206159      GYG2P1
6 142713605      16185          2 100874392   uc031pnv.1         57435            <NA> ANKRD20A12P
                                                GENENAME
1                               spindlin family member 4
2                                    ACTR3B pseudogene 5
3                              glycogenin 2 pseudogene 1
4                       nicotinamide N-methyltransferase
5                              glycogenin 2 pseudogene 1
6 ankyrin repeat domain 20 family member A12, pseudogene

If you refer to peak in line 4, it is listed as being in gene NNMT. However, if you load the peak file in a genome browser, you will note it is actually in an intron of ZBTB16. Gene NNMT is actually immediately to the right ("downstream") of ZBTB16. The transcript/gene identifiers in the annotation column do not match those in the transcriptId and geneId columns. The IDs in the annotation column are actually correct. Finally, the peak end coordinate of 114050234 is less than the geneStart coordinate of 114128553, a logical failure.

The same behavior is also observed for the peak in line 6: it is listed as gene ANKRD20A12P, but in actuality is left of the gene.

This behavior is also seen in version 1.28.3 with Bioconductor 3.13 under R4.1. I have also seen the issue when using my own annotation imported from a GTF file.

I suspect this may also be related to #158, #166, #210, and #212.

ChristophH commented 1 month ago

@tjparnell Did you ever get to the core of this problem? Do you think it has been fixed? I'm having another issue with ChIPseeker and am now wondering how much I can trust the package overall given open issues like this.

tjparnell commented 1 week ago

@ChristophH I finally tested release 1.38.0 on our server, and the test code above now appears to give correct annotation. I have yet to test it fully. I have not seen any substantial commits here beyond documentation and version number changes, which makes me wonder if the problem was due to upstream code. Nevertheless, given the general absence of responses and concern to most queries here, my enthusiasm is still lacking.