snpsById call does not complete for dbSNP155

MattBrauer commented 1 year ago

Following up on a previous issue (#31, closed), I'm finding that snpsById does not complete within any reasonable time.

REPREX

library(SNPlocs.Hsapiens.dbSNP155.GRCh38)
snps <- SNPlocs.Hsapiens.dbSNP155.GRCh38

my_rsids <- c("rs2639606", "rs75264089")
system.time(gpos <- BSgenome::snpsById(snps, my_rsids, ifnotfound = "drop"))

# not completed within 1 hour
# ETA: 3 hours

Running on M1 Pro Mac, Monterey (12.4), 16GB RAM.

> sessionInfo()
R version 4.2.2 (2022-10-31)
Platform: aarch64-apple-darwin20 (64-bit)
Running under: macOS Monterey 12.4

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRblas.0.dylib
LAPACK: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/lib/libRlapack.dylib

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats4    stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
 [1] SNPlocs.Hsapiens.dbSNP155.GRCh38_0.99.22
 [2] BSgenome_1.66.1
 [3] rtracklayer_1.58.0
 [4] Biostrings_2.66.0
 [5] XVector_0.38.0
 [6] GenomicRanges_1.50.2
 [7] GenomeInfoDb_1.34.4
 [8] IRanges_2.32.0
 [9] S4Vectors_0.36.1
[10] BiocGenerics_0.44.0
[11] BiocManager_1.30.19

loaded via a namespace (and not attached):
 [1] zlibbioc_1.44.0             GenomicAlignments_1.34.0
 [3] BiocParallel_1.32.4         lattice_0.20-45
 [5] rjson_0.2.21                tools_4.2.2
 [7] grid_4.2.2                  SummarizedExperiment_1.28.0
 [9] parallel_4.2.2              Biobase_2.58.0
[11] matrixStats_0.63.0          yaml_2.3.6
[13] crayon_1.5.2                BiocIO_1.8.0
[15] Matrix_1.5-3                GenomeInfoDbData_1.2.9
[17] restfulr_0.0.15             bitops_1.0-7
[19] codetools_0.2-18            RCurl_1.98-1.9
[21] DelayedArray_0.24.0         compiler_4.2.2
[23] MatrixGenerics_1.10.0       Rsamtools_2.14.0
[25] XML_3.99-0.13

MattBrauer commented 1 year ago

Note that the response time is reasonable for v.144:

> snps <- SNPlocs.Hsapiens.dbSNP144.GRCh38
> my_rsids <- c("rs2639606", "rs75264089")
> system.time(gpos <- BSgenome::snpsById(snps, my_rsids, ifnotfound = "drop"))

   user  system elapsed
 11.584   1.493  14.051

vjcitn commented 1 year ago

Right, these are very different variant catalogues; dbSNP155 has close to 1 billion SNPs. I am running your code on a large machine, and the RAM consumption definitely exceeded 15GB ... it finished in under 2 min with

> gpos
UnstitchedGPos object with 2 positions and 2 metadata columns:
      seqnames       pos strand |   RefSNP_id alleles_as_ambig
         <Rle> <integer>  <Rle> | <character>      <character>
  [1]        9  68413211      * |   rs2639606                V
  [2]        6  25056560      * |  rs75264089                B
  -------
  seqinfo: 25 sequences (1 circular) from GRCh38.p13 genome

I am not sure a lookup like this can be done on a laptop; I would look for an NCBI API to grab the relevant info.

vjcitn commented 1 year ago

For the record:

> SNPlocs.Hsapiens.dbSNP144.GRCh38
# SNPlocs object for Homo sapiens (dbSNP Human BUILD 144)
# reference genome: GRCh38.p2
# nb of SNPs: 133030779
> SNPlocs.Hsapiens.dbSNP155.GRCh38
# SNPlocs object for Homo sapiens (dbSNP Human Build 155)
# reference genome: GRCh38.p13
# nb of SNPs: 949021448

MattBrauer commented 1 year ago

Thanks, Vince. That dramatic scale-up probably accounts for the performance difference. I'll presume it's ok for me to close the issue.

hpages commented 1 year ago

Hi @MattBrauer , @vjcitn ,

Some basic experimenting with match() seems to indicate the memory footprint of the lookup could be reduced by about 50% by using a divide-and-conquer approach. I'll try to commit something in the next few days.

hpages commented 1 year ago

Hi @MattBrauer , @vjcitn ,

I suspect that Matt's M1 Pro Mac entered into crazy swapping mode.

The good news is that some basic experimentation with match() seems to indicate that the memory footprint of this lookup could be reduced by about 50% by using a divide-and-conquer approach. This should make snpsById() usable on SNPlocs.Hsapiens.dbSNP155.GRCh38 on a laptop with 16GB.

I'll try to commit something in the next few days.

H.

MattBrauer commented 1 year ago

Thanks, @hpages!

hpages commented 1 year ago

The divide-and-conquer approach is implemented in:

BSgenome 1.66.2
SNPlocs.Hsapiens.dbSNP155.GRCh38 0.99.23 (requires BSgenome >= 1.66.2)
SNPlocs.Hsapiens.dbSNP155.GRCh37 0.99.23 (requires BSgenome >= 1.66.2)

if you're using BioC 3.16 (current release).

And in:

BSgenome 1.67.2
SNPlocs.Hsapiens.dbSNP155.GRCh38 0.99.24 (requires BSgenome >= 1.67.2)
SNPlocs.Hsapiens.dbSNP155.GRCh37 0.99.24 (requires BSgenome >= 1.67.2)

if you're using BioC 3.17 (current devel).

It will take between 2 and 3 days before all these new versions become available via BiocManager::install().

With this improvement the memory footprint of the lookup is reduced to 6.7G (from 15.7G), so you shouldn't have any problem running this on your Mac Pro M1 @MattBrauer.

Also, on a machine with enough memory to run the lookup before this improvement (like @vjcitn's large machine), the divide-and-conquer approach should boost the speed by 40%-50%.

Cheers, H.

Bioconductor / BSgenome

snpsById call does not complete for dbSNP155 #57