Closed MattBrauer closed 1 year ago
Note that the response time is reasonable for v.144:
> snps <- SNPlocs.Hsapiens.dbSNP144.GRCh38
> my_rsids <- c("rs2639606", "rs75264089")
> system.time(gpos <- BSgenome::snpsById(snps, my_rsids, ifnotfound = "drop"))
user system elapsed
11.584 1.493 14.051
Right, these are very different variant catalogues; dbSNP155 has close to 1 billion SNPs. I am running your code on a large machine, and the RAM consumption definitely exceeded 15GB ... it finished in under 2 min with
> gpos
UnstitchedGPos object with 2 positions and 2 metadata columns:
seqnames pos strand | RefSNP_id alleles_as_ambig
<Rle> <integer> <Rle> | <character> <character>
[1] 9 68413211 * | rs2639606 V
[2] 6 25056560 * | rs75264089 B
-------
seqinfo: 25 sequences (1 circular) from GRCh38.p13 genome
I am not sure a lookup like this can be done on a laptop; I would look for an NCBI API to grab the relevant info.
For the record:
> SNPlocs.Hsapiens.dbSNP144.GRCh38
# SNPlocs object for Homo sapiens (dbSNP Human BUILD 144)
# reference genome: GRCh38.p2
# nb of SNPs: 133030779
> SNPlocs.Hsapiens.dbSNP155.GRCh38
# SNPlocs object for Homo sapiens (dbSNP Human Build 155)
# reference genome: GRCh38.p13
# nb of SNPs: 949021448
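For anyone who wants to try the NCBI API route suggested above, here is a minimal sketch. It assumes the NCBI Variation Services refsnp endpoint (https://api.ncbi.nlm.nih.gov/variation/v0/refsnp/<number>) and the jsonlite package; refsnp_url() and fetch_refsnp() are hypothetical helper names, not part of BSgenome or any SNPlocs package. The structure of the returned JSON is not covered here and should be checked against the NCBI documentation.

```r
# Build the request URL for one rs ID.
# Assumption: the endpoint takes only the numeric part of the ID.
refsnp_url <- function(rsid) {
    stopifnot(is.character(rsid), length(rsid) == 1L,
              grepl("^rs[0-9]+$", rsid))
    paste0("https://api.ncbi.nlm.nih.gov/variation/v0/refsnp/",
           sub("^rs", "", rsid))
}

# Fetch the record as a nested R list (needs network access and jsonlite).
fetch_refsnp <- function(rsid) {
    if (!requireNamespace("jsonlite", quietly = TRUE))
        stop("the jsonlite package is required")
    jsonlite::fromJSON(refsnp_url(rsid), simplifyVector = FALSE)
}

# Usage (not run here because it needs network access):
# rec <- fetch_refsnp("rs2639606")
# str(rec, max.level = 1)
```

This avoids loading the 949 million SNP catalogue locally, at the cost of one HTTP request per rs ID.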
Thanks, Vince. That dramatic scale-up probably accounts for the performance difference. I'll presume it's ok for me to close the issue.
Hi @MattBrauer , @vjcitn ,
I suspect that Matt's M1 Pro Mac entered into crazy swapping mode.
The good news is that some basic experimentation with match() seems to indicate that the memory footprint of this lookup could be reduced by about 50% by using a divide-and-conquer approach. This should make snpsById() usable on SNPlocs.Hsapiens.dbSNP155.GRCh38 on a laptop with 16GB.
I'll try to commit something in the next few days.
H.
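To illustrate the general idea of a divide-and-conquer lookup with match(), here is a minimal sketch. It is not the code that went into BSgenome; chunked_match() and its chunk_size argument are hypothetical names used only for illustration. The table is scanned one chunk at a time, so each match() call works against only chunk_size table elements rather than the whole table.

```r
# Look up `x` in a large `table` one chunk at a time.
# Returns the same result as match(x, table) for inputs without NAs.
chunked_match <- function(x, table, chunk_size = 1000000L) {
    ans <- rep.int(NA_integer_, length(x))
    n <- length(table)
    offset <- 0L
    # Stop early once every element of `x` has been found.
    while (offset < n && anyNA(ans)) {
        idx <- seq.int(offset + 1L, min(offset + chunk_size, n))
        todo <- which(is.na(ans))            # elements still unmatched
        hits <- match(x[todo], table[idx])   # positions within this chunk
        found <- !is.na(hits)
        ans[todo[found]] <- offset + hits[found]  # shift to global positions
        offset <- offset + length(idx)
    }
    ans
}

# Small usage example with made-up IDs.
ids <- paste0("rs", 1:10)
chunked_match(c("rs7", "rs2", "rs99"), ids, chunk_size = 4L)
```

Because only still-unmatched elements are retried and chunks are visited in order, the first matching position is reported, as with a plain match() call.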
Thanks, @hpages!
The divide-and-conquer approach is implemented in:
if you're using BioC 3.16 (current release).
And in:
if you're using BioC 3.17 (current devel).
It will take between 2 and 3 days before all these new versions become available via BiocManager::install().
With this improvement the memory footprint of the lookup is reduced to 6.7 GB (from 15.7 GB), so you shouldn't have any problem running this on your M1 Pro Mac @MattBrauer.
Also, on a machine with enough memory to run the lookup before this improvement (like @vjcitn's large machine), the divide-and-conquer approach should boost the speed by 40%-50%.
Cheers, H.
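To check the memory footprint of the lookup on your own machine, something like the following rough sketch can help. peak_mem_mb() is a hypothetical helper, not part of any package; it reports the "max used" figure from gc() after resetting it, so it covers only memory managed by R and includes whatever was already allocated in the session.

```r
# Report R's peak memory use (in Mb) reached while evaluating `expr`.
peak_mem_mb <- function(expr) {
    invisible(gc(reset = TRUE))  # reset the "max used" statistics
    force(expr)                  # evaluate the expression being measured
    g <- gc()
    sum(g[, ncol(g)])            # last column is "max used" in Mb
}

# Toy usage:
peak_mem_mb(numeric(1e6))

# With the objects from the example earlier in this thread (not run here):
# peak_mem_mb(BSgenome::snpsById(snps, my_rsids, ifnotfound = "drop"))
```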
Following up on a previous issue (#31, closed), I'm finding that snpsById does not complete within any reasonable time.
REPREX
Running on M1 Pro Mac, Monterey (12.4), 16GB RAM.