Bioconductor / BSgenome

Software infrastructure for efficient representation of full genomes and their SNPs
https://bioconductor.org/packages/BSgenome
9 stars 8 forks source link

getSeq versus Views runtime #37

Open jayoung opened 2 years ago

jayoung commented 2 years ago

hey,

An observation, rather than a bug: I noticed that getSeq runs much slower than Views when extracting sequences from a BSgenome using GRanges to specify the desired regions. I'll use Views for my project, but I'm wondering if getSeqs would be better if updated to use whatever code Views is using.

An example is below. With hg38 I needed a large number of windows to see the problem. With the genome I'm actually working on, the problem is also evident with a much smaller number of windows (100 windows). That one is a very draft-y mammalian genome, with many many contigs - I forged the BSgenome package from NCBI genome GCF_014825515.1 .

thanks!

Janet

library(BSgenome.Hsapiens.UCSC.hg38)

windows10kb <-  tileGenome(seqlengths(BSgenome.Hsapiens.UCSC.hg38), 
                           tilewidth=10000, 
                           cut.last.tile.in.chrom=TRUE)

system.time( x <- getSeq(Hsapiens, windows10kb[1:100000]) )
#  user  system elapsed 
# 6.625   1.202   8.157 

system.time( y <- Views(Hsapiens, windows10kb[1:100000]) )
#  user  system elapsed 
# 0.023   0.002   0.025 

sessionInfo()
# 
# R version 4.2.1 (2022-06-23)
# Platform: x86_64-apple-darwin17.0 (64-bit)
# Running under: macOS Monterey 12.6
# 
# Matrix products: default
# LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
# 
# locale:
# [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
# 
# attached base packages:
# [1] stats4    stats     graphics  grDevices utils     datasets  methods   base     
# 
# other attached packages:
# [1] BSgenome.Hsapiens.UCSC.hg38_1.4.4 BSgenome_1.64.0                  
# [3] rtracklayer_1.56.1                Biostrings_2.64.1                
# [5] XVector_0.36.0                    GenomicRanges_1.48.0             
# [7] GenomeInfoDb_1.32.3               IRanges_2.30.1                   
# [9] S4Vectors_0.34.0                  BiocGenerics_0.42.0              
# 
# loaded via a namespace (and not attached):
#     [1] rstudioapi_0.14             zlibbioc_1.42.0             GenomicAlignments_1.32.1   
# [4] BiocParallel_1.30.3         lattice_0.20-45             rjson_0.2.21               
# [7] tools_4.2.1                 grid_4.2.1                  SummarizedExperiment_1.26.1
# [10] parallel_4.2.1              Biobase_2.56.0              matrixStats_0.62.0         
# [13] yaml_2.3.5                  crayon_1.5.2                BiocIO_1.6.0               
# [16] Matrix_1.5-1                GenomeInfoDbData_1.2.8      restfulr_0.0.15            
# [19] bitops_1.0-7                codetools_0.2-18            RCurl_1.98-1.9             
# [22] DelayedArray_0.22.0         compiler_4.2.1              MatrixGenerics_1.8.1       
# [25] Rsamtools_2.12.0            XML_3.99-0.11              
jayoung commented 2 years ago

hmm - digging in a bit more, is Views quicker simply because it's not actually getting the entire sequence of each region?

when I actually get the seqs from the Views object, that's time-consuming too. Oh well!

system.time( y <- Views(Hsapiens, windows10kb[1:100000]) )
system.time( y2 <- as(y, "DNAStringSet" )
#   user  system elapsed 
#  5.672   1.405   7.805 

I don't really need a DNAStringSet - character vector would be fine.

Feel free to close this, although if you have suggestions about the most efficient way to get large numbers of sequences from a BSgenome, I'd love to hear them.

thanks again!

Janet

hpages commented 2 years ago

Hi Janet,

hmm - digging in a bit more, is Views quicker simply because it's not actually getting the entire sequence of each region?

Yep, that's why. The Views() constructor doesn't actually load any sequence at all in memory... until you display the returned BSgenomeViews object. Then the show() method for these objects will load very small pieces of the sequences, only what's needed to print a compact summary of the object.

Note that doing y <- Views(Hsapiens, windows10kb[1:100000]); as(y, "DNAStringSet") is basically equivalent to doing getSeq(Hsapiens, windows10kb[1:100000]) because that's in fact what the coercion method from BSgenomeViews to DNAStringSet does behind the scene i.e. it does getSeq(subject(y), granges(y)).

if you have suggestions about the most efficient way to get large numbers of sequences from a BSgenome, I'd love to hear them

getSeq() is meant to be the fast way to do that. I implemented the function a long time ago so forgot the details, so I just took a quick look at the implementation to refresh my memory. I actually see some room for improvement in the following case:

In that case, it seems that getSeq() is not taking full advantage of the fast random access capabilities of the twobit format. Some rough testing seems to indicate that getSeq() could be made about 2x faster:

system.time(x <- getSeq(Hsapiens, windows10kb[1:2000]))
#    user  system elapsed 
#   0.259   0.000   0.259

system.time(z <- getSeq(Hsapiens@single_sequences@twobitfile, windows10kb[1:2000]))
#    user  system elapsed 
#   0.127   0.008   0.135

The 2nd call to getSeq() calls it directly on the TwoBitFile object contained in the BSgenome object. This kind of shortcut could be taken under some conditions (but not always).

Not a huge deal, and a refactoring of getSeq() is long due anyways. Has been on my unofficial TODO list for years. The code is old and needs to be modernized. I also have on my unofficial TODO list to rename the function extractSeqs, for consistency with extractTranscriptSeqs and extractUpstreamSeqs from the GenomicFeatures package. But that will be for another day.

H.

jayoung commented 2 years ago

thanks for taking a look - appreciate it.

check this out: with the genome I've been working with the time difference is MUCH more pronounced than with hg38 (and this test only uses 100 regions). I suspect it's because it's a lot more fragmented than hg38 (56740 contigs versus 640 for hg38).

so the speedup provided by using the 2bit file is VERY helpful in my case.

thank you!

# test with Ajamaicensis (= accession GCF_014825515.1 )
library(BSgenome.Ajamaicensis.NCBI.Ajam2)
windows10kb_Ajam2 <-  tileGenome(seqlengths(BSgenome.Ajamaicensis.NCBI.Ajam2), 
                           tilewidth=10000, 
                           cut.last.tile.in.chrom=TRUE)

system.time(x_Ajam2 <- getSeq(Ajamaicensis, windows10kb_Ajam2[1:100]))
#   user  system elapsed 
# 10.683   2.743  13.652 

system.time(z_Ajam2 <- getSeq(Ajamaicensis@single_sequences@twobitfile, windows10kb_Ajam2[1:100]))
#   user  system elapsed 
# 0.090   0.028   0.120 

# check timings for hg38 on my system
library(BSgenome.Hsapiens.UCSC.hg38)

windows10kb_human <-  tileGenome(seqlengths(BSgenome.Hsapiens.UCSC.hg38), 
                           tilewidth=10000, 
                           cut.last.tile.in.chrom=TRUE)

system.time(x_human <- getSeq(Hsapiens, windows10kb_human[1:100]))
# user  system elapsed 
# 0.075   0.004   0.117 
system.time(z_human <- getSeq(Hsapiens@single_sequences@twobitfile, windows10kb_human[1:100]))
# user  system elapsed 
# 0.024   0.004   0.042 
hpages commented 2 years ago

Wait.. what? That's crazy :dizzy:

But yeah, that's because your genome has a lot of sequences. The getSeq() method for BSgenome objects uses a for loop in R to load the contigs in memory one at a time, and to extract the queried sequences from each contig. This implementation predates support for 2bit files in Bioconductor (added by Michael Lawrence in 2011 in rtracklayer), and is from the time when the genome sequences in a BSgenome object were stored in serialized DNAString objects. Not a big deal when the genome has only a few hundred sequences, but very inefficient when there are tens of thousands of them! :cry:

Note that some BSgenome objects still use this old storage:

> library(BSgenome.Hsapiens.UCSC.hg18)

> BSgenome.Hsapiens.UCSC.hg18@single_sequences
RdaNamedSequences instance of length 49:
          chr1          chr2          chr3 ...  chr22_random   chrX_random
     247249719     242951149     199501827 ...        257318       1719168

> BSgenome.Hsapiens.UCSC.hg18@single_sequences@dirpath
[1] "/home/hpages/R/R-4.2.r82318/library/BSgenome.Hsapiens.UCSC.hg18/extdata"

> dir(BSgenome.Hsapiens.UCSC.hg18@single_sequences@dirpath)
 [1] "chr1_random.rda"   "chr1.rda"          "chr10_random.rda" 
 [4] "chr10.rda"         "chr11_random.rda"  "chr11.rda"        
 [7] "chr12.rda"         "chr13_random.rda"  "chr13.rda"        
[10] "chr14.rda"         "chr15_random.rda"  "chr15.rda"        
[13] "chr16_random.rda"  "chr16.rda"         "chr17_random.rda" 
[16] "chr17.rda"         "chr18_random.rda"  "chr18.rda"        
[19] "chr19_random.rda"  "chr19.rda"         "chr2_random.rda"  
[22] "chr2.rda"          "chr20.rda"         "chr21_random.rda" 
[25] "chr21.rda"         "chr22_h2_hap1.rda" "chr22_random.rda" 
[28] "chr22.rda"         "chr3_random.rda"   "chr3.rda"         
[31] "chr4_random.rda"   "chr4.rda"          "chr5_h2_hap1.rda" 
[34] "chr5_random.rda"   "chr5.rda"          "chr6_cox_hap1.rda"
[37] "chr6_qbl_hap2.rda" "chr6_random.rda"   "chr6.rda"         
[40] "chr7_random.rda"   "chr7.rda"          "chr8_random.rda"  
[43] "chr8.rda"          "chr9_random.rda"   "chr9.rda"         
[46] "chrM.rda"          "chrX_random.rda"   "chrX.rda"         
[49] "chrY.rda"          "seqlengths.rda"   

That's because they contain IUPAC ambiguity codes that are not supported by the 2bit format:

> alphabetFrequency(BSgenome.Hsapiens.UCSC.hg18$chr3)[1:6]
       A        C        G        T        M        R 
58688741 38632560 38648191 58735330        1        2 

OTOH the getSeq() method for TwoBitFile objects is implemented (in C) in rtracklayer and takes full advantage of the fast random access capabilities of the twobit format.

So yes, time to modernize this. I'm moving this up in my TODO list. Let's keep this issue open until I get to this.

H.

jayoung commented 2 years ago

I know! Crazy.

I'm toying with the idea of just getting rid of some of the shorter contigs in this genome (I suspect the same issue is giving me trouble outside of R as well, with GATK). I'd prefer to do that kind of filtering further downstream in my analysis if I can.

Anyway, it does seem worth an update. Not urgent from my perspective now that you've told me that nice trick.

thanks again!