Closed bschilder closed 1 year ago
Yes, the speed is not particularly impressive in your example, where you're looking up only 93 rs ids, but the good news is that BSgenome::snpsById() scales really well: it takes only a little more time when the number of rs ids is 1 million.
For example, with SNPlocs.Hsapiens.dbSNP144.GRCh38:
```r
library(SNPlocs.Hsapiens.dbSNP144.GRCh38)
snps <- SNPlocs.Hsapiens.dbSNP144.GRCh38

## Using a hack to extract all rs ids (without the "rs" prefix) from
## SNPlocs.Hsapiens.dbSNP144.GRCh38.
## Not something that the end user should ever do:
all_rsids <- rowids(snps@snp_table)

## Look up 100 rs ids (90 valid, 10 random):
my_rsids <- paste0("rs", sample(c(sample(all_rsids, 90), sample(999999999, 10))))
system.time(gpos <- BSgenome::snpsById(snps, my_rsids, ifnotfound="drop"))
#    user  system elapsed
#  11.034   0.471  11.554

## Look up 1 million rs ids (0.9 million valid, 0.1 million random):
my_rsids <- paste0("rs", sample(c(sample(all_rsids, 9e5), sample(999999999, 1e5))))
system.time(gpos <- BSgenome::snpsById(snps, my_rsids, ifnotfound="drop"))
#    user  system elapsed
#  16.521   0.752  17.570
```
So it takes only about 17 sec. on my laptop to look up 1 million rs ids!
Note that supplying a BSgenome object via the genome argument slows things down a little, because of the additional work that takes place internally to extract the ref and alt alleles:
```r
library(BSgenome.Hsapiens.UCSC.hg38)
genome <- BSgenome.Hsapiens.UCSC.hg38
seqlevelsStyle(genome) <- "NCBI"
system.time(gpos <- BSgenome::snpsById(snps, my_rsids, genome=genome, ifnotfound="drop"))
#    user  system elapsed
#  22.720   5.524  37.520
```
but it still takes less than a minute to look up the 1 million rs ids and extract their ref and alt alleles from the chromosome sequences in BSgenome.Hsapiens.UCSC.hg38.
Now if we do this with 10 times more rs ids (i.e. 10 million):
```r
my_rsids <- paste0("rs", sample(c(sample(all_rsids, 9e6), sample(999999999, 1e6))))
system.time(gpos <- BSgenome::snpsById(snps, my_rsids, genome=genome, ifnotfound="drop"))
#    user  system elapsed
#  88.672  17.916 106.737
```
it takes less than 2 minutes, confirming that the more rs ids you supply to BSgenome::snpsById(), the better it performs.
I'm actually surprised by your claim that "the slowness is especially pronounced once we scale up to many millions of SNPs". If you have millions of SNPs of interest, it's best to call BSgenome::snpsById() on all of them at once. Making many calls to BSgenome::snpsById() on small subsets of your SNPs of interest is indeed going to be quite inefficient. In other words, this is a case where a "divide and conquer" strategy (a.k.a. chunking) would actually hurt.
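To make the point concrete, here is a minimal sketch of the two strategies, reusing the `snps` and `my_rsids` objects defined in the examples above (the chunk size of 1e5 is just an arbitrary illustrative value):

```r
## Sketch only: assumes `snps` and `my_rsids` from the examples above.

## One batched call: the fixed per-call overhead is paid once.
gpos_batched <- BSgenome::snpsById(snps, my_rsids, ifnotfound="drop")

## Chunked calls: the same overhead is paid for every chunk,
## so this is expected to be slower overall, not faster.
chunks <- split(my_rsids, ceiling(seq_along(my_rsids) / 1e5))
gpos_chunks <- lapply(chunks, function(ids)
    BSgenome::snpsById(snps, ids, ifnotfound="drop"))
gpos_chunked <- do.call(c, unname(gpos_chunks))
```

Note that concatenating the per-chunk GPos objects with `c()` at the end also temporarily doubles the memory footprint, which is another reason the batched call is preferable.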
However, you need to make sure that you have enough memory to handle the huge GPos object returned by BSgenome::snpsById() when it's called on millions of SNPs. When I tried the above with 50 million rs ids, my laptop (which only has 16 Gb of RAM) ran out of memory and started to consume all its resources in very inefficient memory swapping. I had to kill the process after 10 min.! :disappointed:
Hope this helps,
H.
@hpages Thank you so much for taking the time to explain this so thoroughly!
The scaling is actually quite impressive when you put it that way. Sorry, I think I misspoke when I said the slowness was "especially pronounced" when scaling up to millions of SNPs. What I meant was that it takes longer with lots of SNPs (albeit not nearly as long as you'd expect given the time it takes to query <100 SNPs). Also, we're calling the function a handful of times throughout our pipeline.
I've just added a timer to BSgenome::snpsById() in our pipeline, and I'm in the process of munging many different GWAS, so I can share those numbers with you if you think they might be helpful. I'm using a 64-core / 128 Gb machine.
At this stage I'm just trying to rack my brain for any way I can speed up MungeSumstats further, but it seems you've already optimised BSgenome::snpsById() pretty close to its limit. So the quest continues!
Thanks again, Brian
Hi @bschilder,
Is it ok to close this issue?
Thanks, H.
Yes, of course! I hadn't realized this was still open. Thanks for checking @hpages :)
Hello,
I was just wondering if you had any tricks for speeding up BSgenome::snpsById(). This is currently one of the slower steps in our MungeSumstats pipeline (@Al-Murphy). Specifically, here. The slowness is especially pronounced once we scale up to many millions of SNPs. So I was wondering if there were ways this function could be accelerated, for example by making use of BiocParallel to iterate across SNPs, or some sort of chunking procedure (either internally within BSgenome or as a wrapper).
Many thanks in advance, Brian
Reprex
Session info