broadinstitute / gatk-sv

A structural variation pipeline for short-read sequencing
BSD 3-Clause "New" or "Revised" License

Handling large VCFs for BAF from SNP VCF #147

Closed: epiercehoffman closed this issue 2 years ago

epiercehoffman commented 3 years ago

Feature request

Module(s) or script(s) involved

Module00c, when run with snp_vcfs instead of gvcfs, which leads to BAFFromShardedVCF

Description

Background: When the SNP VCF used for BAF is very large and has not been pre-sharded or filtered to PASS variants, GenerateBAF can take a long time and a lot of disk space. For example, a 419 GB unsharded SNP VCF took 27 hours and 452 GB of disk to run GenerateBAF; because of the long runtime, a non-preemptible VM was required. The cost was relatively low (~$3.70), but this may become untenable if we start to encounter even larger SNP VCFs.

Proposed solutions: We should consider adding a separate task/workflow to shard the SNP VCF or, alternatively, to filter it (for example, to PASS variants). (Note that if we start to generate BAF from BAM instead, this will become unnecessary.)
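
To make the two options concrete, here is a rough Python sketch of what filtering to PASS and sharding by record count could look like on a stream of VCF lines. This is only an illustration, not the gatk-sv implementation: the function names filter_pass and shard_vcf and the records_per_shard parameter are hypothetical, and it assumes an uncompressed, tab-delimited VCF whose header lines all come before the records.

```python
from typing import Iterable, Iterator, List


def filter_pass(lines: Iterable[str]) -> Iterator[str]:
    """Yield header lines unchanged and only records whose FILTER is PASS.

    Hypothetical helper for illustration; not part of gatk-sv.
    """
    for line in lines:
        if line.startswith("#"):
            yield line
            continue
        fields = line.rstrip("\n").split("\t")
        # FILTER is the 7th column (index 6) of a VCF record.
        if len(fields) > 6 and fields[6] == "PASS":
            yield line


def shard_vcf(lines: Iterable[str], records_per_shard: int) -> Iterator[List[str]]:
    """Split a VCF line stream into shards, repeating the header in each.

    Hypothetical helper for illustration; assumes the header precedes all records.
    """
    if records_per_shard < 1:
        raise ValueError("records_per_shard must be at least 1")
    header: List[str] = []
    records: List[str] = []
    for line in lines:
        if line.startswith("#"):
            header.append(line)
            continue
        records.append(line)
        if len(records) == records_per_shard:
            yield header + records
            records = []
    if records:
        # Emit the final, possibly smaller, shard.
        yield header + records


# Tiny made-up VCF used only to exercise the sketch.
example = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\n",
    "chr1\t200\t.\tC\tT\t10\tLowQual\t.\n",
    "chr1\t300\t.\tG\tA\t60\tPASS\t.\n",
    "chr2\t150\t.\tT\tC\t70\tPASS\t.\n",
]
shards = list(shard_vcf(filter_pass(example), records_per_shard=2))
```

Both helpers are generators, so only one shard's worth of records is held in memory at a time, which is the property that matters for a VCF in the hundreds of GB. Each shard could then be handed to its own (preemptible) task.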

mwalker174 commented 2 years ago

This should be addressed by #18