BIMSBbioinfo / genomation

R package for genomic feature analysis and visualization
http://bioinformatics.mdc-berlin.de/genomation/
73 stars 22 forks source link

Site upstream of transcription start site annotated as intronic #203

Open pkothiyal opened 2 years ago

pkothiyal commented 2 years ago

Hi, I tried annotating a few sites with annotateWithGeneParts and I'm getting an intron annotation even though the site is upstream of transcription start site. Here's an example:

BED12 entry for the associated transcript:

chr12      3906815 3908547 ENSMUST00000172913      0       +       3906815 3908265 0       3       68,138,295,     0,790,1437,

Site for annotation (chr12:3905814):

df <- data.frame (chr  = c("chr12"),
                  start = c(3905814),
                  end = c(3905814)
                  )

gr <- makeGRangesFromDataFrame(df, keep.extra.columns=FALSE,
                                      start.field="start",
                                      end.field="end")

as.data.frame(gr)

seqnames    start   end width   strand
chr12   3905814 3905814 1   *

Annotate:

gene.obj=readTranscriptFeatures(mm10.bed", up.flank = 1000,down.flank = 1000, remove.unusual=TRUE, unique.prom = TRUE)
annRes=annotateWithGeneParts(gr,gene.obj)
cbind(getAssociationWithTSS(annRes), as.data.frame(getMembers(annRes)))

target.row  dist.to.feature feature.name    feature.strand prom exon  intron
       1    -1002   ENSMUST00000172913  +               0    0    1

As seen above, the site is annotated in the intron of the transcript even though it is 1002 bases upstream of the transcription start site. The promoter flank is set at 1,000 so I understand it not being mapped to promoter but should it be intergenic instead of intronic in this case?

Thanks, Prachi

AstrobioMike commented 2 years ago

Heya @pkothiyal and devs,

I happened to peek at this too, and just because of the older version of R where I tested, version 1.16.0 of genomation was installed, and this didn't happen in that version (in case it helps track down whatever is going on):

library(genomation)
packageVersion("genomation") # 1.16.0
library(GenomicRanges)
packageVersion("GenomicRanges") # 1.36.1

bed <- c("chr12", 3906815, 3908547, "ENSMUST00000172913", 0, "+", 3906815, 3908265, 0, 3, "68,138,295,", "0,790,1437,")
bed_df <- t(data.frame(bed))
write.table(bed_df, "mm10.bed", sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE)

df <- data.frame (chr  = c("chr12"),
                  start = c(3905814),
                  end = c(3905814)
                  )

gr <- makeGRangesFromDataFrame(df, keep.extra.columns = FALSE,
                                      start.field = "start",
                                      end.field = "end")

gene.obj <- readTranscriptFeatures("mm10.bed", up.flank = 1000, down.flank = 1000, remove.unusual = TRUE, unique.prom = TRUE)

annRes <- annotateWithGeneParts(gr, gene.obj)

cbind(getAssociationWithTSS(annRes), as.data.frame(getMembers(annRes)))

#   target.row dist.to.feature       feature.name feature.strand prom exon intron
#            1           -1002 ENSMUST00000172913              +    0    0      0

sessionInfo()
# R version 3.6.3 (2020-02-29)
# Platform: x86_64-apple-darwin15.6.0 (64-bit)
# Running under: macOS  10.16
# 
# Matrix products: default
# LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
# 
# locale:
# [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
# 
# attached base packages:
#  [1] parallel  stats4    grid      stats     graphics  grDevices utils     datasets  methods   base     
# 
# other attached packages:
# [1] GenomicRanges_1.36.1 GenomeInfoDb_1.20.0  IRanges_2.18.3       S4Vectors_0.22.1     BiocGenerics_0.30.0  genomation_1.16.0   
# 
# loaded via a namespace (and not attached):
#  [1] Rcpp_1.0.8                  lattice_0.20-45             Rsamtools_2.0.3             Biostrings_2.52.0          
#  [5] assertthat_0.2.1            utf8_1.2.2                  gridBase_0.4-7              R6_2.5.1                   
#  [9] plyr_1.8.6                  ggplot2_3.3.5               pillar_1.6.5                zlibbioc_1.30.0            
# [13] rlang_1.0.0                 rstudioapi_0.13             data.table_1.14.2           Matrix_1.2-18              
# [17] BiocParallel_1.18.1         readr_2.1.1                 stringr_1.4.0               RCurl_1.98-1.5             
# [21] bit_4.0.4                   munsell_0.5.0               DelayedArray_0.10.0         compiler_3.6.3             
# [25] rtracklayer_1.44.4          pkgconfig_2.0.3             tidyselect_1.1.1            SummarizedExperiment_1.14.1
# [29] tibble_3.1.6                GenomeInfoDbData_1.2.1      matrixStats_0.61.0          XML_3.99-0.3               
# [33] fansi_1.0.2                 withr_2.4.3                 crayon_1.4.2                dplyr_1.0.7                
# [37] tzdb_0.2.0                  GenomicAlignments_1.20.1    bitops_1.0-7                gtable_0.3.0               
# [41] lifecycle_1.0.1             DBI_1.1.2                   magrittr_2.0.2              scales_1.1.1               
# [45] KernSmooth_2.23-20          cli_3.1.1                   stringi_1.7.6               impute_1.58.0              
# [49] vroom_1.5.7                 XVector_0.24.0              reshape2_1.4.4              ellipsis_0.3.2             
# [53] generics_0.1.1              vctrs_0.3.8                 tools_3.6.3                 bit64_4.0.5                
# [57] BSgenome_1.52.0             Biobase_2.44.0              glue_1.6.1                  seqPattern_1.16.0          
# [61] purrr_0.3.4                 hms_1.1.1                   plotrix_3.8-2               colorspace_2.0-2           
# [65] BiocManager_1.30.16    
al2na commented 2 years ago

Thank you, something must have been changed in GenomicRanges that causes this.

On Mon, Jan 31, 2022 at 5:28 PM Mike Lee @.***> wrote:

Heya @pkothiyal https://github.com/pkothiyal and devs,

I happened to peek at this too, and just because of the older version of R where I tested, version 1.16.0 of genomation was installed, and this didn't happen in that version (in case it helps track down whatever is going on):

library(genomation) packageVersion("genomation") # 1.16.0 library(GenomicRanges) packageVersion("GenomicRanges") # 1.36.1 bed <- c("chr12", 3906815, 3908547, "ENSMUST00000172913", 0, "+", 3906815, 3908265, 0, 3, "68,138,295,", "0,790,1437,")bed_df <- t(data.frame(bed)) write.table(bed_df, "mm10.bed", sep = "\t", quote = FALSE, row.names = FALSE, col.names = FALSE) df <- data.frame (chr = c("chr12"), start = c(3905814), end = c(3905814) ) gr <- makeGRangesFromDataFrame(df, keep.extra.columns = FALSE, start.field = "start", end.field = "end") gene.obj <- readTranscriptFeatures("mm10.bed", up.flank = 1000, down.flank = 1000, remove.unusual = TRUE, unique.prom = TRUE) annRes <- annotateWithGeneParts(gr, gene.obj)

cbind(getAssociationWithTSS(annRes), as.data.frame(getMembers(annRes)))

target.row dist.to.feature feature.name feature.strand prom exon intron# 1 -1002 ENSMUST00000172913 + 0 0 0

sessionInfo()# R version 3.6.3 (2020-02-29)# Platform: x86_64-apple-darwin15.6.0 (64-bit)# Running under: macOS 10.16# # Matrix products: default# LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib# # locale:# [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8# # attached base packages:# [1] parallel stats4 grid stats graphics grDevices utils datasets methods base # # other attached packages:# [1] GenomicRanges_1.36.1 GenomeInfoDb_1.20.0 IRanges_2.18.3 S4Vectors_0.22.1 BiocGenerics_0.30.0 genomation_1.16.0 # # loaded via a namespace (and not attached):# [1] Rcpp_1.0.8 lattice_0.20-45 Rsamtools_2.0.3 Biostrings_2.52.0 # [5] assertthat_0.2.1 utf8_1.2.2 gridBase_0.4-7 R6_2.5.1 # [9] plyr_1.8.6 ggplot2_3.3.5 pillar_1.6.5 zlibbioc_1.30.0 # [13] rlang_1.0.0 rstudioapi_0.13 data.table_1.14.2 Matrix_1.2-18 # [17] BiocParallel_1.18.1 readr_2.1.1 stringr_1.4.0 RCurl_1.98-1.5 # [21] bit_4.0.4 munsell_0.5.0 DelayedArray_0.10.0 compiler_3.6.3 # [25] rtracklayer_1.44.4 pkgconfig_2.0.3 tidyselect_1.1.1 SummarizedExperiment_1.14.1# [29] tibble_3.1.6 GenomeInfoDbData_1.2.1 matrixStats_0.61.0 XML_3.99-0.3 # [33] fansi_1.0.2 withr_2.4.3 crayon_1.4.2 dplyr_1.0.7 # [37] tzdb_0.2.0 GenomicAlignments_1.20.1 bitops_1.0-7 gtable_0.3.0 # [41] lifecycle_1.0.1 DBI_1.1.2 magrittr_2.0.2 scales_1.1.1 # [45] KernSmooth_2.23-20 cli_3.1.1 stringi_1.7.6 impute_1.58.0 # [49] vroom_1.5.7 XVector_0.24.0 reshape2_1.4.4 ellipsis_0.3.2 # [53] generics_0.1.1 vctrs_0.3.8 tools_3.6.3 bit64_4.0.5 # [57] BSgenome_1.52.0 Biobase_2.44.0 glue_1.6.1 seqPattern_1.16.0 # [61] purrr_0.3.4 hms_1.1.1 plotrix_3.8-2 colorspace_2.0-2 # [65] BiocManager_1.30.16

— Reply to this email directly, view it on GitHub https://github.com/BIMSBbioinfo/genomation/issues/203#issuecomment-1025965780, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAE32EPZRTVRET5SFUDJ4YLUY22BFANCNFSM5M3QWVXQ . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you are subscribed to this thread.Message ID: @.***>

-- Sent from mobile, excuse the brevity