Bioconductor / VariantAnnotation

Annotation of Genetic Variants
https://bioconductor.org/packages/VariantAnnotation

Data Representation Efficiency of VCF #51

Closed DarioS closed 2 years ago

DarioS commented 2 years ago

I imported a 14 GB uncompressed VCF file and found that the resulting object took 228 GB of RAM once it was in memory (the server has 512 GB of RAM, so it did not need swap space). Could the package provide a more memory-efficient representation of large numbers of variants in R?

> system.time(variants <- readVcf("test.vcf"))
    user   system  elapsed 
3314.009  140.141 3455.743

mtmorgan commented 2 years ago

The implemented way to deal with large VCF files is to iterate through them with something like

vcf_file <- open(VcfFile("...", yieldSize = 100000))
while (length(vcf <- readVcf(vcf_file))) {
    ## ... work on chunk
}
close(vcf_file)

Remember to use ScanVcfParam() (for example its fixed, info, and geno arguments) or perhaps readGT() or similar to selectively input just the fields of interest.
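
A minimal sketch of combining the two suggestions above: chunked iteration plus a ScanVcfParam() that restricts what is read. It assumes the small chr22.vcf.gz example file that ships with VariantAnnotation. The yieldSize of 2000, the "hg19" genome label, and the choice of keeping only the GT field are illustrative, not recommendations from this thread.

```r
library(VariantAnnotation)

# Assumption: use the example file bundled with the package;
# substitute the path to your own bgzipped, tabix-indexed VCF.
fl <- system.file("extdata", "chr22.vcf.gz", package = "VariantAnnotation")

# Skip all INFO fields and keep only the GT genotype field.
param <- ScanVcfParam(info = NA, geno = "GT")

# Illustrative chunk size; tune it to the memory available.
vcf_file <- open(VcfFile(fl, yieldSize = 2000))

n_variants <- 0L
while (nrow(vcf <- readVcf(vcf_file, genome = "hg19", param = param))) {
    # ... work on the chunk; here we only count its variants
    n_variants <- n_variants + nrow(vcf)
}
close(vcf_file)

n_variants
```

Only one chunk is held in memory at a time, so peak usage depends on yieldSize and the selected fields rather than on the size of the whole file.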