Bioconductor / GenomicFeatures

Query the gene models of a given organism/assembly
https://bioconductor.org/packages/GenomicFeatures
26 stars 12 forks source link

promoters() returns out of range indicies #40

Closed cmatKhan closed 2 years ago

cmatKhan commented 2 years ago

Should promoters() be allowed to return negative indicies?

When trying to use getPromoterSeq(), I received the following error:

Error in value[[3L]](cond) : 
   record 6097 (CP022322.1:-380-19) was truncated
  file: /home/chase/projects/brentlab/kn99_database/genome_files/kn99/KN99_fungiDB/FungiDB-53_CneoformansKN99_Genome.fasta

Upon inspecting the promoter regions with the same setting , upstream = 400, as I had used in getPromoterSeq, I discovered that there are two genes which, with this promoter definition, return negative values due to the fact that they are annotated at the very beginning of the chromosome such that 400bp upstream causes the promoter range to be returned as a negative value:

> promoters[promoters@ranges@start < 0]
GRanges object with 2 ranges and 2 metadata columns:
                      seqnames    ranges strand |     tx_id           tx_name
                         <Rle> <IRanges>  <Rle> | <integer>       <character>
  CKF44_06802-t26_1 CP022322.1  -380-219      + |      1106 CKF44_06802-t26_1
  CKF44_09000-t26_1 CP022335.1  -380-219      + |      9174 CKF44_09000-t26_1 

I confirmed that this was, in fact, the cause of the error in getPromoterSeq() by setting upstream to 20 or less, so that no negative values are returned, and getPromoterSeq() returned without error.

Possibly in the promoters function, a warning should be issued? Alternatively, if the definition of the promoter places the upstream/downstream indicies out of range of the chromosome, maybe setting to the chromosome limit (0 or length(chromosome)) would be appropriate? Maybe this should be an argument, something like, handle_out_of_bound_indicies = TRUE by default?

hpages commented 2 years ago

Hi,

A genomic range can be out-of-bound at the beginning or at the end of the chromosome, or at both ends. If the GRanges object "knows" the length of the chromosome, and if the chromosome is not circular, a warning gets issued when the object is transformed into an object that contains out-of-bound ranges:

library(GenomicRanges)
gr <- GRanges("chr1:55-80", seqlengths=c(chr1=100))
seqinfo(gr)
# Seqinfo object with 1 sequence from an unspecified genome:
#   seqnames seqlengths isCircular genome
#   chr1            100         NA   <NA>

With shift():

gr2 <- shift(gr, 30)
# Warning messages:
# 1: In valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE) :
#   GRanges object contains 1 out-of-bound range located on sequence chr1.
#   Note that ranges located on a sequence whose length is unknown (NA) or
#   on a circular sequence are not considered out-of-bound (use
#   seqlengths() and isCircular() to get the lengths and circularity flags
#   of the underlying sequences). You can use trim() to trim these ranges.
#   See ?`trim,GenomicRanges-method` for more information.
# 2: In valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE) :
#   GRanges object contains 1 out-of-bound range located on sequence chr1.
#   Note that ranges located on a sequence whose length is unknown (NA) or
#   on a circular sequence are not considered out-of-bound (use
#   seqlengths() and isCircular() to get the lengths and circularity flags
#   of the underlying sequences). You can use trim() to trim these ranges.
#   See ?`trim,GenomicRanges-method` for more information.

gr2
# GRanges object with 1 range and 0 metadata columns:
#       seqnames    ranges strand
#          <Rle> <IRanges>  <Rle>
#   [1]     chr1    85-110      *
#   -------
#   seqinfo: 1 sequence from an unspecified genome

With promoters():

prom1 <- promoters(gr, upstream=60, downstream=0)
# Warning message:
# In valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE) :
#   GRanges object contains 1 out-of-bound range located on sequence chr1.
#   Note that ranges located on a sequence whose length is unknown (NA) or
#   on a circular sequence are not considered out-of-bound (use
#   seqlengths() and isCircular() to get the lengths and circularity flags
#   of the underlying sequences). You can use trim() to trim these ranges.
#   See ?`trim,GenomicRanges-method` for more information.

prom2 <- promoters(gr, upstream=0, downstream=50)
# Warning message:
# In valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE) :
#   GRanges object contains 1 out-of-bound range located on sequence chr1.
#   Note that ranges located on a sequence whose length is unknown (NA) or
#   on a circular sequence are not considered out-of-bound (use
#   seqlengths() and isCircular() to get the lengths and circularity flags
#   of the underlying sequences). You can use trim() to trim these ranges.
#   See ?`trim,GenomicRanges-method` for more information.

Note that these warning messages actually come from the validity method for GRanges objects:

validObject(gr2)
# [1] TRUE
# Warning message:
# In valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE) :
#   GRanges object contains 1 out-of-bound range located on sequence chr1.
#   Note that ranges located on a sequence whose length is unknown (NA) or
#   on a circular sequence are not considered out-of-bound (use
#   seqlengths() and isCircular() to get the lengths and circularity flags
#   of the underlying sequences). You can use trim() to trim these ranges.
#   See ?`trim,GenomicRanges-method` for more information.

validObject(prom1)
# [1] TRUE
# Warning message:
# In valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE) :
#   GRanges object contains 1 out-of-bound range located on sequence chr1.
#   Note that ranges located on a sequence whose length is unknown (NA) or
#   on a circular sequence are not considered out-of-bound (use
#   seqlengths() and isCircular() to get the lengths and circularity flags
#   of the underlying sequences). You can use trim() to trim these ranges.
#   See ?`trim,GenomicRanges-method` for more information.

validObject(prom2)
# [1] TRUE
# Warning message:
# In valid.GenomicRanges.seqinfo(x, suggest.trim = TRUE) :
#   GRanges object contains 1 out-of-bound range located on sequence chr1.
#   Note that ranges located on a sequence whose length is unknown (NA) or
#   on a circular sequence are not considered out-of-bound (use
#   seqlengths() and isCircular() to get the lengths and circularity flags
#   of the underlying sequences). You can use trim() to trim these ranges.
#   See ?`trim,GenomicRanges-method` for more information.

Also note that these objects can easily be trimmed with trim() (as explained in the above warning messages):

trim(gr2)
# GRanges object with 1 range and 0 metadata columns:
#       seqnames    ranges strand
#          <Rle> <IRanges>  <Rle>
#   [1]     chr1    85-100      *
#   -------
#   seqinfo: 1 sequence from an unspecified genome

trim(prom1)
GRanges object with 1 range and 0 metadata columns:
#       seqnames    ranges strand
#          <Rle> <IRanges>  <Rle>
#   [1]     chr1      1-54      *
#   -------
#   seqinfo: 1 sequence from an unspecified genome

trim(prom2)
# GRanges object with 1 range and 0 metadata columns:
#       seqnames    ranges strand
#          <Rle> <IRanges>  <Rle>
#   [1]     chr1    55-100      *
#   -------
#   seqinfo: 1 sequence from an unspecified genome

This is quickly mentioned in ?GenomicRanges::trim.

The reason for the current design is that negative positions are not necessarily a problem. They are actually meaningful on a circular chromosome like chrM:

library(BSgenome.Hsapiens.UCSC.hg38)
getSeq(BSgenome.Hsapiens.UCSC.hg38, GRanges("chr1", IRanges(-4, 10)))
# Error in loadFUN(x, seqname, ranges) : 
#   trying to load regions beyond the boundaries of non-circular sequence
#   "chr1"

chrM <- BSgenome.Hsapiens.UCSC.hg38$chrM
chrM
# 16569-letter DNAString object
# seq: GATCACAGGTCTATCACCCTATTAACCACTCACGGG...AGCCCACACGTTCCCCTTAAATAAGACATCACGATG

getSeq(BSgenome.Hsapiens.UCSC.hg38, GRanges("chrM", IRanges(1, 10)))
# DNAStringSet object of length 1:
#     width seq
# [1]    10 GATCACAGGT

getSeq(BSgenome.Hsapiens.UCSC.hg38, GRanges("chrM", IRanges(-4, 10)))
# DNAStringSet object of length 1:
#     width seq
# [1]    15 CGATGGATCACAGGT

So it's not clear that they should be automatically and systematically "fixed". Furthermore, doing so would make some operations hard to revert. For example shift(gr, -30) would not longer be guaranteed to revert shift(gr, 30).

This is why handling out-of-bound ranges is the responsibility of the user.

I understand that this is maybe harder to do in the context of getPromoterSeq but you're not providing a reproducible example so it's hard for me to further comment on that.

H.

hpages commented 2 years ago

oops, I didn't mean to close this yet

cmatKhan commented 2 years ago

I think closing it is probably best -- I wasn't thinking about circular seqs, which of course means that there isn't an 'out-of-bounds'. I also didn't populate the seqlengths in the TxDb.

This is what I ultimately did to extract the promoters, but it would be much cleaner to populate the seqlengths and use trim. I am including this for the sake of completeness -- no need to comment.

The genome and similar (not quite the same, yet) annotations can be found here: https://fungidb.org/fungidb/app/record/dataset/DS_f670f03393

library(GenomicFeatures)
library(Rsamtools)

faFile = FaFile(
  "~/projects/brentlab/kn99_database/genome_files/kn99/KN99_fungiDB/FungiDB-53_CneoformansKN99_Genome.fasta",
  index = "/home/chase/projects/brentlab/kn99_database/genome_files/kn99/KN99_fungiDB/FungiDB-53_CneoformansKN99_Genome.fasta.fai"
)

annot = makeTxDbFromGFF("~/projects/brentlab/kn99_database/genome_files/kn99/CURRENT_KN99_ANNOTATIONS/liftoff_h99_to_kn99_fixed_phases_with_type_removed_overlap_ckf44.gff")

promoters = promoters(annot, upstream = 400, downstream = 0)

# fix problematic indicies
promoters[promoters$tx_name == "CKF44_09000-t26_1"]@ranges =
  IRanges(start = 1, end = 219)

promoters[promoters$tx_name == "CKF44_06802-t26_1"]@ranges =
  IRanges(start = 1, end = 219)

promoters[promoters$tx_name == "CKF44_07918-t26_1"]@ranges =
  IRanges(start = 773708, end = 774060)

# extract promoters
kn99_promoter_seqs = getSeq(
 faFile,
 promoters
)

I suppose it might be worth adding a line to the makeTxDbFromGFF() output message that, if the seqlengths aren't populated, some functionality is compromised, though that likely should be expected by a reasonable user.

> sessionInfo()
R version 4.1.2 (2021-11-01)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 18.04.6 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.7.1
LAPACK: /opt/intel/compilers_and_libraries_2020.0.166/linux/mkl/lib/intel64_lin/libmkl_rt.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8    
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats4    stats     graphics  grDevices datasets  utils     methods  
[8] base     

other attached packages:
 [1] rtracklayer_1.54.0     Rsamtools_2.10.0       Biostrings_2.62.0     
 [4] XVector_0.34.0         GenomicFeatures_1.46.3 AnnotationDbi_1.56.2  
 [7] Biobase_2.54.0         GenomicRanges_1.46.1   GenomeInfoDb_1.30.0   
[10] IRanges_2.28.0         S4Vectors_0.32.3       BiocGenerics_0.40.0   

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.7                  lattice_0.20-45            
 [3] prettyunits_1.1.1           png_0.1-7                  
 [5] assertthat_0.2.1            digest_0.6.29              
 [7] utf8_1.2.2                  BiocFileCache_2.2.0        
 [9] R6_2.5.1                    RSQLite_2.2.9              
[11] httr_1.4.2                  pillar_1.6.4               
[13] zlibbioc_1.40.0             rlang_0.4.12               
[15] progress_1.2.2              curl_4.3.2                 
[17] rstudioapi_0.13             blob_1.2.2                 
[19] Matrix_1.4-0                BiocParallel_1.28.3        
[21] stringr_1.4.0               RCurl_1.98-1.5             
[23] bit_4.0.4                   biomaRt_2.50.1             
[25] DelayedArray_0.20.0         compiler_4.1.2             
[27] pkgconfig_2.0.3             tidyselect_1.1.1           
[29] KEGGREST_1.34.0             SummarizedExperiment_1.24.0
[31] tibble_3.1.6                GenomeInfoDbData_1.2.7     
[33] matrixStats_0.61.0          XML_3.99-0.8               
[35] fansi_1.0.0                 crayon_1.4.2               
[37] dplyr_1.0.7                 dbplyr_2.1.1               
[39] rappdirs_0.3.3              GenomicAlignments_1.30.0   
[41] bitops_1.0-7                grid_4.1.2                 
[43] lifecycle_1.0.1             DBI_1.1.2                  
[45] magrittr_2.0.1              stringi_1.7.6              
[47] cachem_1.0.6                renv_0.15.0                
[49] xml2_1.3.3                  filelock_1.0.2             
[51] ellipsis_0.3.2              vctrs_0.3.8                
[53] generics_0.1.1              rjson_0.2.21               
[55] restfulr_0.0.13             tools_4.1.2                
[57] bit64_4.0.5                 glue_1.6.0                 
[59] purrr_0.3.4                 hms_1.1.1                  
[61] MatrixGenerics_1.6.0        parallel_4.1.2             
[63] fastmap_1.1.0               yaml_2.2.1                 
[65] memoise_2.0.1               BiocIO_1.4.0