Bioconductor / BSgenome

Software infrastructure for efficient representation of full genomes and their SNPs
https://bioconductor.org/packages/BSgenome

Speeding up `BSgenome::snpsById` #31

Closed: bschilder closed this issue 1 year ago

bschilder commented 2 years ago

Hello,

I was just wondering if you had any tricks for speeding up BSgenome::snpsById. This is currently one of the slower steps in our MungeSumstats pipeline (@Al-Murphy). Specifically, here.

The slowness is especially pronounced once we scale up to many millions of SNPs, so I was wondering whether there are ways this function could be accelerated: for example, using BiocParallel to iterate across SNPs, or some sort of chunking procedure (either internally within BSgenome or as a wrapper), along the lines of the sketch below.
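To make the idea concrete, here is a rough sketch of the kind of chunked/parallel wrapper I have in mind (the function name and chunk size are made up, purely for illustration):

```r
library(BiocParallel)

## Hypothetical wrapper (illustration only): split the ids into blocks
## and run BSgenome::snpsById() on each block in parallel.
snpsById_chunked <- function(x, ids, genome = NULL, chunk_size = 1e5,
                             BPPARAM = BiocParallel::bpparam()) {
    chunks <- split(ids, ceiling(seq_along(ids) / chunk_size))
    hits <- BiocParallel::bplapply(chunks, function(ids_chunk) {
        BSgenome::snpsById(x, ids_chunk, genome = genome,
                           ifnotfound = "drop")
    }, BPPARAM = BPPARAM)
    do.call(c, unname(hits))  # recombine the per-chunk GPos objects
}
```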

Many thanks in advance, Brian

Reprex

```r
sumstats_dt <- MungeSumstats::formatted_example()
snps <- sumstats_dt$SNP

SNP_LOC_DATA <- SNPlocs.Hsapiens.dbSNP144.GRCh37::SNPlocs.Hsapiens.dbSNP144.GRCh37
genome <- BSgenome.Hsapiens.1000genomes.hs37d5::BSgenome.Hsapiens.1000genomes.hs37d5

system.time({
    gr_rsids <- BSgenome::snpsById(
        x = SNP_LOC_DATA,
        ids = snps,
        genome = genome,
        ifnotfound = "drop"
    )
})
#   user  system elapsed 
#  9.761   0.752  10.692 
```

Session info

```
R version 4.2.0 (2022-04-22)
Platform: x86_64-apple-darwin17.0 (64-bit)
Running under: macOS Monterey 12.3.1

Matrix products: default
LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
 [1] dplyr_1.0.9                                 MungeSumstats_1.4.1
 [3] BSgenome.Hsapiens.1000genomes.hs37d5_0.99.1 BSgenome_1.64.0
 [5] rtracklayer_1.56.0                          Biostrings_2.64.0
 [7] XVector_0.36.0                              GenomicRanges_1.48.0
 [9] GenomeInfoDb_1.32.1                         IRanges_2.30.0
[11] S4Vectors_0.34.0                            BiocGenerics_0.42.0

loaded via a namespace (and not attached):
  [1] colorspace_2.0-3              rjson_0.2.21
  [3] class_7.3-20                  ellipsis_0.3.2
  [5] rprojroot_2.0.3               fs_1.5.2
  [7] gld_2.6.4                     proxy_0.4-26
  [9] rstudioapi_0.13               roxygen2_7.1.2
 [11] remotes_2.4.2                 waldo_0.4.0
 [13] bit64_4.0.5                   mvtnorm_1.1-3
 [15] AnnotationDbi_1.58.0          fansi_1.0.3
 [17] diffobj_0.3.5                 xml2_1.3.3
 [19] R.methodsS3_1.8.1             rootSolve_1.8.2.3
 [21] cachem_1.0.6                  knitr_1.39
 [23] pkgload_1.2.4                 jsonlite_1.8.0
 [25] Rsamtools_2.12.0              dbplyr_2.1.1
 [27] png_0.1-7                     SNPlocs.Hsapiens.dbSNP144.GRCh37_0.99.20
 [29] R.oo_1.24.0                   compiler_4.2.0
 [31] httr_1.4.3                    assertthat_0.2.1
 [33] Matrix_1.4-1                  fastmap_1.1.0
 [35] gargle_1.2.0                  cli_3.3.0
 [37] prettyunits_1.1.1             tools_4.2.0
 [39] gtable_0.3.0                  lmom_2.8
 [41] glue_1.6.2                    GenomeInfoDbData_1.2.8
 [43] rappdirs_0.3.3                Rcpp_1.0.8.3
 [45] Biobase_2.56.0                vctrs_0.4.1
 [47] xfun_0.30                     stringr_1.4.0
 [49] ps_1.7.0                      brio_1.1.3
 [51] testthat_3.1.4                lifecycle_1.0.1
 [53] restfulr_0.0.13               devtools_2.4.3
 [55] XML_3.99-0.9                  googleAuthR_2.0.0
 [57] scales_1.2.0                  MASS_7.3-57
 [59] zlibbioc_1.42.0               VariantAnnotation_1.42.0
 [61] hms_1.1.1                     MatrixGenerics_1.8.0
 [63] parallel_4.2.0                SummarizedExperiment_1.26.1
 [65] expm_0.999-6                  rematch2_2.1.2
 [67] Exact_3.1                     yaml_2.3.5
 [69] curl_4.3.2                    memoise_2.0.1
 [71] ggplot2_3.3.6                 biomaRt_2.52.0
 [73] stringi_1.7.6                 RSQLite_2.2.14
 [75] BiocIO_1.6.0                  desc_1.4.1
 [77] e1071_1.7-9                   GenomicFeatures_1.48.0
 [79] filelock_1.0.2                boot_1.3-28
 [81] pkgbuild_1.3.1                BiocParallel_1.30.0
 [83] rlang_1.0.2                   pkgconfig_2.0.3
 [85] commonmark_1.8.0              GenomicFiles_1.32.0
 [87] matrixStats_0.62.0            bitops_1.0-7
 [89] lattice_0.20-45               scKirby_0.1.0
 [91] purrr_0.3.4                   GenomicAlignments_1.32.0
 [93] bit_4.0.4                     tidyselect_1.1.2
 [95] processx_3.5.3                magrittr_2.0.3
 [97] R6_2.5.1                      DescTools_0.99.44
 [99] generics_0.1.2                DelayedArray_0.22.0
[101] DBI_1.1.2                     pillar_1.7.0
[103] withr_2.5.0                   KEGGREST_1.36.0
[105] RCurl_1.98-1.6                tibble_3.1.7
[107] seqminer_8.4                  crayon_1.5.1
[109] utf8_1.2.2                    BiocFileCache_2.4.0
[111] progress_1.2.2                usethis_2.1.5
[113] grid_4.2.0                    data.table_1.14.2
[115] blob_1.2.3                    callr_3.7.0
[117] digest_0.6.29                 R.utils_2.11.0
[119] munsell_0.5.0                 sessioninfo_1.2.2
```
hpages commented 2 years ago

Yes, the speed is not particularly impressive in your example, where you're looking up only 93 rs ids, but the good news is that BSgenome::snpsById() scales really well: it takes only a little more time when the number of rs ids grows to 1 million.

For example, with SNPlocs.Hsapiens.dbSNP144.GRCh38:

```r
library(SNPlocs.Hsapiens.dbSNP144.GRCh38)
snps <- SNPlocs.Hsapiens.dbSNP144.GRCh38

## Using a hack to extract all rs ids (without the "rs" prefix) from
## SNPlocs.Hsapiens.dbSNP144.GRCh38. Not something that the end-user
## should ever do:
all_rsids <- rowids(snps@snp_table)

## Look up 100 rs ids (90 valid, 10 random):
my_rsids <- paste0("rs", sample(c(sample(all_rsids, 90), sample(999999999, 10))))
system.time(gpos <- BSgenome::snpsById(snps, my_rsids, ifnotfound = "drop"))
#    user  system elapsed 
#  11.034   0.471  11.554

## Look up 1 million rs ids (0.9 million valid, 0.1 million random):
my_rsids <- paste0("rs", sample(c(sample(all_rsids, 9e5), sample(999999999, 1e5))))
system.time(gpos <- BSgenome::snpsById(snps, my_rsids, ifnotfound = "drop"))
#    user  system elapsed 
#  16.521   0.752  17.570
```

So it takes only about 17 seconds on my laptop to look up 1 million rs ids!

Note that supplying a BSgenome object via the genome argument slows things down a little, because of the additional work that takes place internally to extract the ref and alt alleles:

```r
library(BSgenome.Hsapiens.UCSC.hg38)
genome <- BSgenome.Hsapiens.UCSC.hg38
seqlevelsStyle(genome) <- "NCBI"
system.time(gpos <- BSgenome::snpsById(snps, my_rsids, genome = genome, ifnotfound = "drop"))
#   user  system elapsed 
# 22.720   5.524  37.520
```

but it still takes less than a minute to look up the 1 million rs ids and extract their ref and alt alleles from the chromosome sequences in BSgenome.Hsapiens.UCSC.hg38.

Now if we do this with 10 times as many (i.e. 10 million) rs ids:

```r
my_rsids <- paste0("rs", sample(c(sample(all_rsids, 9e6), sample(999999999, 1e6))))
system.time(gpos <- BSgenome::snpsById(snps, my_rsids, genome = genome, ifnotfound = "drop"))
#    user  system elapsed 
#  88.672  17.916 106.737 
```

it takes less than 2 minutes, confirming that the more rs ids you supply to BSgenome::snpsById(), the better it performs.

I'm actually surprised by your claim that "the slowness is especially pronounced once we scale up to many millions of SNPs". If you have millions of SNPs of interest, it's best to call BSgenome::snpsById() on all of them at once; making many calls to BSgenome::snpsById() on small subsets of your SNPs of interest is going to be quite inefficient. In other words, this is a case where a "divide and conquer" (a.k.a. chunking) strategy would actually hurt. See the schematic contrast below.
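Schematically (assuming `id_chunks` is a list of rs id vectors; this is not MungeSumstats' actual code):

```r
## Inefficient: many small lookups, each paying the fixed per-call cost
## of scanning the SNPlocs data.
hits_list <- lapply(id_chunks, function(ids)
    BSgenome::snpsById(snps, ids, ifnotfound = "drop"))

## Efficient: a single lookup over all the ids at once.
all_ids <- unlist(id_chunks, use.names = FALSE)
hits <- BSgenome::snpsById(snps, all_ids, ifnotfound = "drop")
```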

However, you need to make sure that you have enough memory to handle the huge GPos object that BSgenome::snpsById() returns when called on millions of SNPs. When I tried the above with 50 million rs ids, my laptop ran out of memory (it only has 16 GB of RAM) and started consuming all its resources on very inefficient memory swapping. I had to kill the process after 10 min.! :disappointed:
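If you're unsure whether a call will fit in memory, one crude check (just an illustration, not an official API) is to measure a smaller result and extrapolate linearly:

```r
## Rough, illustrative memory estimate: look up a 1-million-id sample,
## measure the resulting GPos object, and scale up linearly.
test_gpos <- BSgenome::snpsById(snps, my_rsids[1:1e6], ifnotfound = "drop")
print(object.size(test_gpos), units = "Gb")
## Multiply by ~50 to estimate the footprint of a 50-million-id lookup.
```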

Hope this helps,

H.

bschilder commented 2 years ago

@hpages Thank you so much for taking the time to explain this so thoroughly!

The scaling is actually quite impressive when you put it that way. Sorry, I think I misspoke when I said the slowness was "especially pronounced" when scaling up to millions of SNPs. What I meant was that it takes longer with lots of SNPs (albeit not nearly as long as you'd expect given the time it takes to query <100 SNPs). Also, we're calling the function a handful of times throughout our pipeline.

I've just added a timer around the BSgenome::snpsById() step in our pipeline, and I'm in the process of munging many different GWAS, so I can share those numbers with you if you think they might be helpful. I'm using a 64-core / 128 GB machine.
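The timer is nothing fancy; just something along these lines (illustrative, not our exact code):

```r
## Illustrative timing wrapper around the snpsById() step.
t0 <- Sys.time()
gr_rsids <- BSgenome::snpsById(
    x = SNP_LOC_DATA,
    ids = snps,
    genome = genome,
    ifnotfound = "drop"
)
message("snpsById() took ",
        round(as.numeric(difftime(Sys.time(), t0, units = "secs")), 2),
        " secs")
```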

At this stage I'm just racking my brain for any way to speed up MungeSumstats further, but it seems you've already optimised BSgenome::snpsById() pretty close to the limit. So the quest continues!

Thanks again, Brian

hpages commented 1 year ago

Hi @bschilder,

Is it ok to close this issue?

Thanks, H.

bschilder commented 1 year ago

Yes, of course! I hadn't realized this was still open. Thanks for checking @hpages :)