YuLab-SMU / ChIPseeker

:dart: ChIP peak Annotation, Comparison and Visualization
https://onlinelibrary.wiley.com/share/author/GYJGUBYCTRMYJFN2JFZZ?target=10.1002/cpz1.585
220 stars 74 forks source link

Problems with ChIPseeker annotation #153

Closed iramai closed 3 years ago

iramai commented 3 years ago

Prerequisites

Describe you issue

Ask in right place

iramai commented 3 years ago

Hi! I have been using ChIPseeker for some sequencing experiments, for the annotation. But I have identified some kind of errors with the annotataion of the coordinates. I have followed the tutorial from bioconductor: (http://www.bioconductor.org/packages/release/bioc/vignettes/ChIPseeker/inst/doc/ChIPseeker.html), And used the TxDb object for that steps for annotation. The thing is that sometimes the tool identifies some gene features far away from the gene position.

This are the followed commands:

library(GenomicRanges) library(rtracklayer) library(dplyr) library(ChIPpeakAnno) library(ChIPseeker) library(org.Mm.eg.db) library(TxDb.Mmusculus.UCSC.mm10.knownGene) txdb <- TxDb.Mmusculus.UCSC.mm10.knownGene files<-getSampleFiles() files peakAnnoBatch<-annotatePeak(files[[1]], tssRegion=c(-3000, 3000), TxDb=txdb, annoDb="org.Mm.eg.db")

loading peak file... 2021-04-09 0:02:31 preparing features information... 2021-04-09 0:02:34 identifying nearest features... 2021-04-09 0:02:35 calculating distance from peak to TSS... 2021-04-09 0:02:41 assigning genomic annotation... 2021-04-09 0:02:41 adding gene annotation... 2021-04-09 0:03:05 'select()' returned 1:many mapping between keys and columns assigning chromosome lengths 2021-04-09 0:03:06 done... 2021-04-09 0:03:06

But when analyzing the output file I have found some incongruities. Some genes are annotated out of the correct regions. For example: chr10 | 13203078 | 13203079 | Ltv1 | Distal intergenic chr10 | 13210280 | 132102811 | Ltv1 | 3'UTR chr10 | 13213953 | 13213954 | Ltv1 | 3'UTR chr10 | 13224394 | 13224395 | Ltv1 | Intron (ENSMUST00000105545.11/215789, intron 12 of 12) chr10 | 13229471 | 13229472 | Ltv1 | Intron (ENSMUST00000105545.11/215789, intron 11 of 12) chr10 | 13236038 | 13236039 | Ltv1 | Intron (ENSMUST00000105545.11/215789, intron 9 of 12)

When de Ltv1 gene coordinates are chr10: 13178140-13193168, that is out of the regions detected on the ChIPseeker tool. In fact those coordinates belong to the gene Phactr2 (chr10: 13213395-13324289), and the previously annoted ensembl codes belong to this second gene, not to the Ltv1.

Can you help me solving this issue? I don't understand why the tool is nos detecting properly the gene intersects or if I am doing something wrong.

Thanks in advance,

Iraia

MingLi-929 commented 3 years ago

Browsing the question roughly, Here are my opinions.

As the code you typed in,

files<-getSampleFiles()
files
peakAnnoBatch<-annotatePeak(files[[1]], tssRegion=c(-3000, 3000), TxDb=txdb, annoDb="org.Mm.eg.db")

you are using the sample files in chipseeker as your input files. (https://github.com/YuLab-SMU/ChIPseeker/tree/master/inst/extdata/GEO_sample_data)

Getting closer into the sample files, you can see the description of these samples files. Since you use files[[1]], the description laid below.

GSM1174480_ARmo_0M_peaks.bed.gz

And getting into GEO to see the details of this GSM1174480, you can see the description of this file. (https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSM1174480)

Data processing | Illumina Genome Analyzer II base calling software
The raw reads alignment was carried out with Bowtie and the reads were mapped on the human genome version 19 (hg19).

In all, the sample file in chipseeker is the chip-seq result of Homo sapiens. You should use the reference the same as description to get the correct annotation.

but in the code,

library(TxDb.Mmusculus.UCSC.mm10.knownGene)
txdb <- TxDb.Mmusculus.UCSC.mm10.knownGene

This TxDb file TxDb.Mmusculus.UCSC.mm10.knownGene is for Mus musculus rather than Homo sapiens.

So you can try this way to get the correct annotation.

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
iramai commented 3 years ago

Hi! Sorry for my late answerd. I have been on holidays. Maybe I did not include all the information of the sequences that I used for the analysis. I did use that commands previosly mentioned y my comment, after adding my files to the Chipseeker folder (GEO_sample_data), with the aim of using the same commands you use in the protocol:

files<-getSampleFiles() files $Annotation_pval_f.txt [1] "C:/Users/Documents/R/win-library/4.0/ChIPseeker/extdata/GEO_sample_data/Annotation_pval_f.txt" $ARmo_0M [1] "C:/Users/Documents/R/win-library/4.0/ChIPseeker/extdata/GEO_sample_data/GSM1174480_ARmo_0M_peaks.bed.gz" $ARmo_1nM [1] "C:/Users/Documents/R/win-library/4.0/ChIPseeker/extdata/GEO_sample_data/GSM1174481_ARmo_1nM_peaks.bed.gz" $ARmo_100nM [1] "C:/Users/Documents/R/win-library/4.0/ChIPseeker/extdata/GEO_sample_data/GSM1174482_ARmo_100nM_peaks.bed.gz" $CBX6_BF [1] "C:/Users/Documents/R/win-library/4.0/ChIPseeker/extdata/GEO_sample_data/GSM1295076_CBX6_BF_ChipSeq_mergedReps_peaks.bed.gz" $CBX7_BF [1] "C:/Users/Documents/R/win-library/4.0/ChIPseeker/extdata/GEO_sample_data/GSM1295077_CBX7_BF_ChipSeq_mergedReps_peaks.bed.gz"

peakAnnoBatch<-annotatePeak(files[[1]], tssRegion=c(-3000, 3000), TxDb=txdb, annoDb="org.Mm.eg.db")

The Annotation_pval_f.txt file is the file that I want to be annotated and is the result of experimentation with mESCs, and that is why in the previous pipeline I use the mm10 annotation file.

So knowing all this information, can you help me detecting the problem with the annotation that I explain in the previous comment?

Thanks in advance,

Iraia