readVcf is Slow if ScanVcfParam which Regions is Lengthy

DarioS commented 1 year ago

Recently, U.S. Food and Drug Administration, National Institute of Standards and Technology and Illumina researchers defined highly reproducible regions (H.R.R.s) of the human genome and made available BED files to define a set of regions in which variant callers consistently call variants on technical replicate samples, effectively defining a while-list for whole genome sequencing data. readVcf takes a long time if which of ScanVcfParam is specified. Importing the V.C.F. takes about five minutes if which not specified but I terminated it after one hour when which was specified.

library(rtracklayer)
goldStandards <- list.files("HRR/", "bed", full.names = TRUE)
goldStandards <- lapply(goldStandards, import.bed)
goldStandards <- unlist(goldStandards)
goldStandards <- reduce(goldStandards)
> goldStandards
GRanges object with 2988875 ranges and 0 metadata columns:
    ...        ...
variants <- readVcf("DRAGENgermline.vcf.gz", param = ScanVcfParam(which = goldStandards)) # Stopped after one hour.

Importing the whole V.C.F. file into the session and then using subsetByOverlaps seems a reasonable and fast workaround. To best help other users, would a documentation change or code change be better to make this user experience nicer?

vjcitn commented 1 year ago

Interesting observation @DarioS . This topic (using the HRR resource) could merit a workflow package, which could then be referenced by VariantAnnotation documentation. Would you be up for this? A public VCF could be used to demonstrate the approach. Or a PR to the VariantAnnotation doc could be a rapid resolution of your concern. Did you consider breaking up goldStandards and performing the read on sorted, modest-sized, single-chromosome chunks?

mtmorgan commented 1 year ago

Does the filterVcf vignette address this use case?

DarioS commented 1 year ago

Indeed, Section 5.2 of filterVcf cignette recommends readVcf followed by findOverlaps. But Section 7 Appendix seems to advocate for using ScanVcfParam(which = theRegions). It's unclear which approach is better and for which scenarios.

Bioconductor / VariantAnnotation

readVcf is Slow if ScanVcfParam which Regions is Lengthy #70