Closed cmatKhan closed 2 years ago
Hi,
A genomic range can be out-of-bound at the beginning or at the end of the chromosome, or at both ends. If the GRanges object "knows" the length of the chromosome, and if the chromosome is not circular, a warning gets issued when the object is transformed into an object that contains out-of-bound ranges:
library(GenomicRanges)
gr <- GRanges("chr1:55-80", seqlengths=c(chr1=100))
seqinfo(gr)
# Seqinfo object with 1 sequence from an unspecified genome:
# seqnames seqlengths isCircular genome
# chr1 100 NA <NA>
With shift()
:
gr2 <- shift(gr, 30)
# Warning messages:
# 1: In valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE) :
# GRanges object contains 1 out-of-bound range located on sequence chr1.
# Note that ranges located on a sequence whose length is unknown (NA) or
# on a circular sequence are not considered out-of-bound (use
# seqlengths() and isCircular() to get the lengths and circularity flags
# of the underlying sequences). You can use trim() to trim these ranges.
# See ?`trim,GenomicRanges-method` for more information.
# 2: In valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE) :
# GRanges object contains 1 out-of-bound range located on sequence chr1.
# Note that ranges located on a sequence whose length is unknown (NA) or
# on a circular sequence are not considered out-of-bound (use
# seqlengths() and isCircular() to get the lengths and circularity flags
# of the underlying sequences). You can use trim() to trim these ranges.
# See ?`trim,GenomicRanges-method` for more information.
gr2
# GRanges object with 1 range and 0 metadata columns:
# seqnames ranges strand
# <Rle> <IRanges> <Rle>
# [1] chr1 85-110 *
# -------
# seqinfo: 1 sequence from an unspecified genome
With promoters()
:
prom1 <- promoters(gr, upstream=60, downstream=0)
# Warning message:
# In valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE) :
# GRanges object contains 1 out-of-bound range located on sequence chr1.
# Note that ranges located on a sequence whose length is unknown (NA) or
# on a circular sequence are not considered out-of-bound (use
# seqlengths() and isCircular() to get the lengths and circularity flags
# of the underlying sequences). You can use trim() to trim these ranges.
# See ?`trim,GenomicRanges-method` for more information.
prom2 <- promoters(gr, upstream=0, downstream=50)
# Warning message:
# In valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE) :
# GRanges object contains 1 out-of-bound range located on sequence chr1.
# Note that ranges located on a sequence whose length is unknown (NA) or
# on a circular sequence are not considered out-of-bound (use
# seqlengths() and isCircular() to get the lengths and circularity flags
# of the underlying sequences). You can use trim() to trim these ranges.
# See ?`trim,GenomicRanges-method` for more information.
Note that these warning messages actually come from the validity method for GRanges objects:
validObject(gr2)
# [1] TRUE
# Warning message:
# In valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE) :
# GRanges object contains 1 out-of-bound range located on sequence chr1.
# Note that ranges located on a sequence whose length is unknown (NA) or
# on a circular sequence are not considered out-of-bound (use
# seqlengths() and isCircular() to get the lengths and circularity flags
# of the underlying sequences). You can use trim() to trim these ranges.
# See ?`trim,GenomicRanges-method` for more information.
validObject(prom1)
# [1] TRUE
# Warning message:
# In valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE) :
# GRanges object contains 1 out-of-bound range located on sequence chr1.
# Note that ranges located on a sequence whose length is unknown (NA) or
# on a circular sequence are not considered out-of-bound (use
# seqlengths() and isCircular() to get the lengths and circularity flags
# of the underlying sequences). You can use trim() to trim these ranges.
# See ?`trim,GenomicRanges-method` for more information.
validObject(prom2)
# [1] TRUE
# Warning message:
# In valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE) :
# GRanges object contains 1 out-of-bound range located on sequence chr1.
# Note that ranges located on a sequence whose length is unknown (NA) or
# on a circular sequence are not considered out-of-bound (use
# seqlengths() and isCircular() to get the lengths and circularity flags
# of the underlying sequences). You can use trim() to trim these ranges.
# See ?`trim,GenomicRanges-method` for more information.
Also note that these objects can easily be trimmed with trim()
(as explained in the above warning messages):
trim(gr2)
# GRanges object with 1 range and 0 metadata columns:
# seqnames ranges strand
# <Rle> <IRanges> <Rle>
# [1] chr1 85-100 *
# -------
# seqinfo: 1 sequence from an unspecified genome
trim(prom1)
GRanges object with 1 range and 0 metadata columns:
# seqnames ranges strand
# <Rle> <IRanges> <Rle>
# [1] chr1 1-54 *
# -------
# seqinfo: 1 sequence from an unspecified genome
trim(prom2)
# GRanges object with 1 range and 0 metadata columns:
# seqnames ranges strand
# <Rle> <IRanges> <Rle>
# [1] chr1 55-100 *
# -------
# seqinfo: 1 sequence from an unspecified genome
This is quickly mentioned in ?GenomicRanges::trim
.
The reason for the current design is that negative positions are not necessarily a problem. They are actually meaningful on a circular chromosome like chrM:
library(BSgenome.Hsapiens.UCSC.hg38)
getSeq(BSgenome.Hsapiens.UCSC.hg38, GRanges("chr1", IRanges(-4, 10)))
# Error in loadFUN(x, seqname, ranges) :
# trying to load regions beyond the boundaries of non-circular sequence
# "chr1"
chrM <- BSgenome.Hsapiens.UCSC.hg38$chrM
chrM
# 16569-letter DNAString object
# seq: GATCACAGGTCTATCACCCTATTAACCACTCACGGG...AGCCCACACGTTCCCCTTAAATAAGACATCACGATG
getSeq(BSgenome.Hsapiens.UCSC.hg38, GRanges("chrM", IRanges(1, 10)))
# DNAStringSet object of length 1:
# width seq
# [1] 10 GATCACAGGT
getSeq(BSgenome.Hsapiens.UCSC.hg38, GRanges("chrM", IRanges(-4, 10)))
# DNAStringSet object of length 1:
# width seq
# [1] 15 CGATGGATCACAGGT
So it's not clear that they should be automatically and systematically "fixed". Furthermore, doing so would make some operations hard to revert. For example shift(gr, -30)
would not longer be guaranteed to revert shift(gr, 30)
.
This is why handling out-of-bound ranges is the responsibility of the user.
I understand that this is maybe harder to do in the context of getPromoterSeq
but you're not providing a reproducible example so it's hard for me to further comment on that.
H.
oops, I didn't mean to close this yet
I think closing it is probably best -- I wasn't thinking about circular seqs, which of course means that there isn't an 'out-of-bounds'. I also didn't populate the seqlengths in the TxDb.
This is what I ultimately did to extract the promoters, but it would be much cleaner to populate the seqlengths and use trim
. I am including this for the sake of completeness -- no need to comment.
The genome and similar (not quite the same, yet) annotations can be found here: https://fungidb.org/fungidb/app/record/dataset/DS_f670f03393
library(GenomicFeatures)
library(Rsamtools)
faFile = FaFile(
"~/projects/brentlab/kn99_database/genome_files/kn99/KN99_fungiDB/FungiDB-53_CneoformansKN99_Genome.fasta",
index = "/home/chase/projects/brentlab/kn99_database/genome_files/kn99/KN99_fungiDB/FungiDB-53_CneoformansKN99_Genome.fasta.fai"
)
annot = makeTxDbFromGFF("~/projects/brentlab/kn99_database/genome_files/kn99/CURRENT_KN99_ANNOTATIONS/liftoff_h99_to_kn99_fixed_phases_with_type_removed_overlap_ckf44.gff")
promoters = promoters(annot, upstream = 400, downstream = 0)
# fix problematic indicies
promoters[promoters$tx_name == "CKF44_09000-t26_1"]@ranges =
IRanges(start = 1, end = 219)
promoters[promoters$tx_name == "CKF44_06802-t26_1"]@ranges =
IRanges(start = 1, end = 219)
promoters[promoters$tx_name == "CKF44_07918-t26_1"]@ranges =
IRanges(start = 773708, end = 774060)
# extract promoters
kn99_promoter_seqs = getSeq(
faFile,
promoters
)
I suppose it might be worth adding a line to the makeTxDbFromGFF()
output message that, if the seqlengths aren't populated, some functionality is compromised, though that likely should be expected by a reasonable user.
> sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.6 LTS
Matrix products: default
BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /opt/intel/compilers_and_libraries_2020.0.166/linux/mkl/lib/intel64_lin/libmkl_rt.so
locale:
[1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
[3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
[5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
[7] LC_PAPER=en_US.UTF-8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats4 stats graphics grDevices datasets utils methods
[8] base
other attached packages:
[1] rtracklayer_1.54.0 Rsamtools_2.10.0 Biostrings_2.62.0
[4] XVector_0.34.0 GenomicFeatures_1.46.3 AnnotationDbi_1.56.2
[7] Biobase_2.54.0 GenomicRanges_1.46.1 GenomeInfoDb_1.30.0
[10] IRanges_2.28.0 S4Vectors_0.32.3 BiocGenerics_0.40.0
loaded via a namespace (and not attached):
[1] Rcpp_1.0.7 lattice_0.20-45
[3] prettyunits_1.1.1 png_0.1-7
[5] assertthat_0.2.1 digest_0.6.29
[7] utf8_1.2.2 BiocFileCache_2.2.0
[9] R6_2.5.1 RSQLite_2.2.9
[11] httr_1.4.2 pillar_1.6.4
[13] zlibbioc_1.40.0 rlang_0.4.12
[15] progress_1.2.2 curl_4.3.2
[17] rstudioapi_0.13 blob_1.2.2
[19] Matrix_1.4-0 BiocParallel_1.28.3
[21] stringr_1.4.0 RCurl_1.98-1.5
[23] bit_4.0.4 biomaRt_2.50.1
[25] DelayedArray_0.20.0 compiler_4.1.2
[27] pkgconfig_2.0.3 tidyselect_1.1.1
[29] KEGGREST_1.34.0 SummarizedExperiment_1.24.0
[31] tibble_3.1.6 GenomeInfoDbData_1.2.7
[33] matrixStats_0.61.0 XML_3.99-0.8
[35] fansi_1.0.0 crayon_1.4.2
[37] dplyr_1.0.7 dbplyr_2.1.1
[39] rappdirs_0.3.3 GenomicAlignments_1.30.0
[41] bitops_1.0-7 grid_4.1.2
[43] lifecycle_1.0.1 DBI_1.1.2
[45] magrittr_2.0.1 stringi_1.7.6
[47] cachem_1.0.6 renv_0.15.0
[49] xml2_1.3.3 filelock_1.0.2
[51] ellipsis_0.3.2 vctrs_0.3.8
[53] generics_0.1.1 rjson_0.2.21
[55] restfulr_0.0.13 tools_4.1.2
[57] bit64_4.0.5 glue_1.6.0
[59] purrr_0.3.4 hms_1.1.1
[61] MatrixGenerics_1.6.0 parallel_4.1.2
[63] fastmap_1.1.0 yaml_2.2.1
[65] memoise_2.0.1 BiocIO_1.4.0
Should
promoters()
be allowed to return negative indicies?When trying to use
getPromoterSeq()
, I received the following error:Upon inspecting the promoter regions with the same setting ,
upstream = 400
, as I had used ingetPromoterSeq
, I discovered that there are two genes which, with this promoter definition, return negative values due to the fact that they are annotated at the very beginning of the chromosome such that 400bp upstream causes the promoter range to be returned as a negative value:I confirmed that this was, in fact, the cause of the error in
getPromoterSeq()
by settingupstream
to 20 or less, so that no negative values are returned, andgetPromoterSeq()
returned without error.Possibly in the
promoters
function, a warning should be issued? Alternatively, if the definition of the promoter places the upstream/downstream indicies out of range of the chromosome, maybe setting to the chromosome limit (0 or length(chromosome)) would be appropriate? Maybe this should be an argument, something like,handle_out_of_bound_indicies = TRUE
by default?