Open DarioS opened 1 year ago
Interesting observation @DarioS . This topic (using the HRR resource) could merit a workflow package, which could then be referenced by VariantAnnotation documentation. Would you be up for this? A public VCF could be used to demonstrate the approach. Or a PR to the VariantAnnotation doc could be a rapid resolution of your concern. Did you consider breaking up goldStandards and performing the read on sorted, modest-sized, single-chromosome chunks?
Indeed, Section 5.2 of filterVcf cignette recommends readVcf
followed by findOverlaps
. But Section 7 Appendix seems to advocate for using ScanVcfParam(which = theRegions)
. It's unclear which approach is better and for which scenarios.
Recently, U.S. Food and Drug Administration, National Institute of Standards and Technology and Illumina researchers defined highly reproducible regions (H.R.R.s) of the human genome and made available BED files to define a set of regions in which variant callers consistently call variants on technical replicate samples, effectively defining a while-list for whole genome sequencing data.
readVcf
takes a long time ifwhich
ofScanVcfParam
is specified. Importing the V.C.F. takes about five minutes ifwhich
not specified but I terminated it after one hour whenwhich
was specified.Importing the whole V.C.F. file into the session and then using
subsetByOverlaps
seems a reasonable and fast workaround. To best help other users, would a documentation change or code change be better to make this user experience nicer?