Closed gabee-chan closed 3 years ago
Hi @gabee-chan, thanks for your question! We'll look into it and get back to you soon.
Hi @gabee-chan, you can try installing the package from Github and try creating your promoter annotation again.
I've tried creating promoter annotations for Zebrafish (Danio rerio) and it works. One caveat is that the seqnames in your TxDb or GTF file should follow standard naming (e.g. "chr1" or "1").
> pa <- preparePromoterAnnotation(file = "Danio_rerio.GRCz11.103.chr.gtf.gz", species = "Danio_rerio")
Parsing input file...
Import genomic features from the file as a GRanges object ... OK
Prepare the 'metadata' data frame ... OK
Make the TxDb object ... OK
Extract exons by transcripts...
Identify overlapping first exons for each gene...
Prepare mapping between transcripts, tss, promoters and genes...
Prepare annotated intron ranges...
Annotating reduced exon ranges...
Prepare promoter coordinates and first exon ranges...
Warning message:
In .get_cds_IDX(mcols0$type, mcols0$phase) :
The "phase" metadata column contains non-NA values for features of type stop_codon. This
information was ignored.
>
> head(pa@promoterCoordinates)
GRanges object with 6 ranges and 4 metadata columns:
seqnames ranges strand | promoterId internalPromoter firstExonEnd intronId
<Rle> <IRanges> <Rle> | <integer> <logical> <integer> <IntegerList>
[1] chr9 34121839 - | 1 FALSE 34121792 106045
[2] chr9 34089156 + | 2 FALSE 34090811 100784
[3] chr4 15103696 - | 3 FALSE 15103602 43293
[4] chr4 15011341 + | 4 FALSE 15011512 37110
[5] chr12 33484458 + | 5 FALSE 33485048 129560
[6] chr24 22074272 - | 6 FALSE 22074235 242963
-------
seqinfo: 26 sequences from an unspecified genome; no seqlengths
Session information:
R version 4.0.3 (2020-10-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19041)
Matrix products: default
locale:
[1] LC_COLLATE=English_Singapore.1252 LC_CTYPE=English_Singapore.1252
[3] LC_MONETARY=English_Singapore.1252 LC_NUMERIC=C
[5] LC_TIME=English_Singapore.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] proActiv_1.1.21 testthat_3.0.1
loaded via a namespace (and not attached):
[1] colorspace_2.0-0 ellipsis_0.3.1 rprojroot_2.0.2
[4] biovizBase_1.38.0 htmlTable_2.1.0 XVector_0.30.0
[7] GenomicRanges_1.42.0 base64enc_0.1-3 fs_1.5.0
[10] dichromat_2.0-0 rstudioapi_0.13 remotes_2.2.0
[13] bit64_4.0.5 AnnotationDbi_1.52.0 fansi_0.4.1
[16] xml2_1.3.2 splines_4.0.3 knitr_1.30
[19] geneplotter_1.68.0 pkgload_1.1.0 Formula_1.2-4
[22] Rsamtools_2.6.0 annotate_1.68.0 cluster_2.1.0
[25] dbplyr_2.0.0 png_0.1-7 compiler_4.0.3
[28] httr_1.4.2 backports_1.2.1 lazyeval_0.2.2
[31] assertthat_0.2.1 Matrix_1.2-18 cli_2.2.0
[34] htmltools_0.5.0 prettyunits_1.1.1 tools_4.0.3
[37] gtable_0.3.0 glue_1.4.2 GenomeInfoDbData_1.2.4
[40] dplyr_1.0.2 rappdirs_0.3.1 Rcpp_1.0.5
[43] Biobase_2.50.0 vctrs_0.3.6 Biostrings_2.58.0
[46] rtracklayer_1.49.5 xfun_0.19 stringr_1.4.0
[49] ps_1.5.0 lifecycle_0.2.0 ensembldb_2.14.0
[52] devtools_2.3.2 XML_3.99-0.5 zlibbioc_1.36.0
[55] scales_1.1.1 BSgenome_1.58.0 VariantAnnotation_1.36.0
[58] ProtGenerics_1.22.0 hms_0.5.3 MatrixGenerics_1.2.0
[61] parallel_4.0.3 SummarizedExperiment_1.20.0 AnnotationFilter_1.14.0
[64] RColorBrewer_1.1-2 curl_4.3 memoise_1.1.0
[67] gridExtra_2.3 ggplot2_3.3.3 biomaRt_2.46.0
[70] rpart_4.1-15 latticeExtra_0.6-29 stringi_1.5.3
[73] RSQLite_2.2.1 genefilter_1.72.0 S4Vectors_0.28.1
[76] desc_1.2.0 checkmate_2.0.0 GenomicFeatures_1.42.1
[79] BiocGenerics_0.36.0 pkgbuild_1.2.0 BiocParallel_1.24.1
[82] GenomeInfoDb_1.26.2 rlang_0.4.9 pkgconfig_2.0.3
[85] matrixStats_0.57.0 bitops_1.0-6 lattice_0.20-41
[88] purrr_0.3.4 htmlwidgets_1.5.3 GenomicAlignments_1.26.0
[91] bit_4.0.4 processx_3.4.5 tidyselect_1.1.0
[94] magrittr_2.0.1 DESeq2_1.30.0 R6_2.5.0
[97] IRanges_2.24.1 generics_0.1.0 Hmisc_4.4-2
[100] DelayedArray_0.16.0 DBI_1.1.0 pillar_1.4.7
[103] foreign_0.8-80 withr_2.3.0 survival_3.2-7
[106] RCurl_1.98-1.2 nnet_7.3-14 tibble_3.0.4
[109] crayon_1.3.4 BiocFileCache_1.14.0 jpeg_0.1-8.1
[112] progress_1.2.2 usethis_2.0.0 locfit_1.5-9.4
[115] grid_4.0.3 data.table_1.13.6 blob_1.2.1
[118] callr_3.5.1 digest_0.6.27 xtable_1.8-4
[121] openssl_1.4.3 stats4_4.0.3 munsell_0.5.0
[124] Gviz_1.34.0 sessioninfo_1.1.1 askpass_1.1
Hi @jleechung ! Thanks for your help!
For some organisms I'm able to get the promoterAnnotation object, but for others no :(
For example for Gorilla (Gorilla gorilla ) I'm getting the below output:
Parsing input file... Import genomic features from the file as a GRanges object ... OK Prepare the 'metadata' data frame ... OK Make the TxDb object ... OK Extract exons by transcripts... Identify overlapping first exons for each gene... Prepare mapping between transcripts, tss, promoters and genes... Prepare annotated intron ranges... Annotating reduced exon ranges... Error in
[[<-(
tmp, name, value = new("SimpleIntegerList", elementType = "integer", : 33718 elements in value to replace 36607 elements Calls: preparePromoterAnnotation ... annotateReducedExonRanges -> $<- -> $<- -> [[<- -> [[<- In addition: Warning message: In .get_cds_IDX(mcols0$type, mcols0$phase) : The "phase" metadata column contains non-NA values for features of type stop_codon. This information was ignored. Execution halted
I'm wondering why is this happening if the Gorilla GTF has the standard naming ("1")
Thanks for your help! Gaby
Hi @gabee-chan, I looked into the Gorilla GTF file and I think chromosome names like "2A" and "2B" are causing the problem. Internally, proActiv uses the keepStandardChromosomes
function from GenomeInfoDb, which takes in a species argument to check which chromosomes are standard for the species in question. When no species
is provided, this may trim names like "2A" (see below), resulting in the error.
We'll try to make the code more robust to support other species (maybe allow the user to define seqnames for "non-standard" species), but this will take awhile.
> gr <- GRanges(c("chr1", "chr2A", "chr2B", "chr3"), IRanges(1:4, width=5))
> keepStandardChromosomes(gr, pruning.mode = "tidy")
GRanges object with 2 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr1 1-5 *
[2] chr3 4-8 *
-------
seqinfo: 2 sequences from an unspecified genome; no seqlengths
Hi @gabee-chan I also came across the problem caused by the chromosome names like "2A" and "2B", did you fixed it?
Hi! I'm trying to run proActiv with different vertebrate species. My problem is when I try to run the preparePromoterAnnotation function the species I want to analyze is not in the genomeStyles. Therefore I was wondering if there is a way that I can still get my promoterAnnotation object when the species is not defined in the genomeStyles or what would you recommend me.
Thanks for your help!